Refactored tokenizer and parser output to help reduce the amount of memory needed. #7602
Conversation
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

Thanks, will do.

I'm going to do some perf analysis of the non-pinned version to see if there's any perf impact.

Perf analysis of the non-pinned version shows it has zero impact on a couple of different projects. I'd imagine the only real impact would be in having to GC more often.
Thanks for the analysis. One thing that's still bothering me... Why are we seeing heap exhaustion in the first place? We have code in place to monitor heap usage and deallocate cached information (including all tokenizer output, parser output, and type cache data) when we exceed a high-water mark for memory usage.

The only reason we should ever run out of heap space is if a single operation requires that much memory and does not "come up for air" to check whether the heap is approaching exhaustion, and I have a hard time believing that's happening here. What operation are you performing when you see this occur? The only other explanation is that we have some bug where we're missing a heap exhaustion check along some code path, or we're failing to deallocate all of the cached information under certain circumstances. It would be good to understand this root cause.
I believe the reason is that we have 3 threads running and the heap information doesn't accurately reflect the usage of the other threads. Each thread thinks it only has 1GB or so allocated. This seems to be a limitation of pointer compression (it only happens in VS Code). Without using VS Code's node exe, each thread gets its own heap and we don't run out of memory. The heap stats function in the VS Code case doesn't report the usage of all threads. When I did the initial investigation on the microsoft/pylance-release#5440 bug, I had implemented something that would force a cache cleanup in the middle of binding and would share heap stats information using a SharedArrayBuffer. That did help, but it was slow, and dumping parse results while binding other files would just make some parse trees not work.
@heejaechang and I both did independent analysis of the shared memory problem, and he had the same results. In VS Code, the different worker threads don't each get their own 4GB; they share it. That means our heap high-water mark is always wrong when there are multiple threads involved. So maybe we should also modify the heap analysis code to share the total memory allocated between threads? I had written something to do that, and it seemed to work. That would help with this issue too.
That sounds like a big problem. Without solving this, we'll continue to see heap exhaustion issues under certain conditions. I think we need to invest in a solution for this underlying problem rather than playing whack-a-mole. Reducing memory usage is good in general, but it's not addressing the core problem here.
Okay I'll submit a change that uses a SharedArrayBuffer to share memory usage between threads to see what you think. |
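The SharedArrayBuffer idea discussed above could be sketched like this. This is an assumption about the shape of such a change, not pyright's actual implementation; all names and the fixed thread count are hypothetical.

```typescript
// Sketch: each worker thread writes its own heap usage into a slot of a
// shared Int32Array, so any thread can compute the true process-wide
// usage instead of trusting its local (per-thread) heap stats.
const MAX_THREADS = 8;

// One Int32 slot per thread; usage is stored in KiB so it fits in 32 bits.
function createSharedUsage(): Int32Array {
    return new Int32Array(
        new SharedArrayBuffer(MAX_THREADS * Int32Array.BYTES_PER_ELEMENT)
    );
}

// Each thread periodically reports its own used heap size into its slot.
// Atomics avoids torn reads/writes across threads.
function reportUsage(shared: Int32Array, threadIndex: number, usedBytes: number): void {
    Atomics.store(shared, threadIndex, Math.floor(usedBytes / 1024));
}

// Sum all slots to approximate total usage of the shared heap.
function totalUsageBytes(shared: Int32Array): number {
    let totalKib = 0;
    for (let i = 0; i < shared.length; i++) {
        totalKib += Atomics.load(shared, i);
    }
    return totalKib * 1024;
}
```

The high-water-mark check would then compare `totalUsageBytes(...)` against the limit, rather than the current thread's own heap statistics.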
We introduced a new option - https://github.com/microsoft/pylance-release/blob/main/TROUBLESHOOTING.md#pylance-is-crashing - that lets users use their own node executable. We didn't make it the default since 1. using …
I don't think it makes sense to say "if Pylance is crashing, try this option". Pylance should not crash by default. The heap usage monitoring logic should prevent heap exhaustion if the information it receives is accurate. The SharedArrayBuffer approach sounds promising.
I also did some investigation on microsoft/pylance-release#5440 using Chrome/Edge's heap snapshot. Among all heap usage, about … So @rich is trying to reduce the tokens' memory footprint, and I am also experimenting with reducing the text's footprint. … So the issue is that the existing cache dump only works between top-level operations; it can't dump stuff from memory in the middle of import lookups.
With …, probably the best option is convincing VS Code to not do …
By "text", I assume you mean the contents of source files that have been tokenized & parsed? We could use the same technique that I implemented in this branch but apply it to the file contents in addition to the tokenizer output. This would require that we reload the text on demand (from disk) if and when it's needed.
Ah yes, I remember this one now. We had talked about eliminating the need to follow imports transitively at the expense of not having the correct icon in the completion suggestion menu. That seems like a good tradeoff to me. You mentioned in this issue that you fixed the problem, but you never responded to my question about which fix you implemented.
By the way, the experiment I am doing is introducing … According to the experiment I did, I looked through the code where … So there is no reason to hold onto the text in the managed heap, consuming 25% of the total heap. My plan is putting that text in an array buffer and converting it back to a string when needed for closed files (for things like …
Ah, I linked to the code in … and …
This is just one code path; everything that reaches that code path can potentially hit the same issue. So if we want to fix the root cause, we probably need to figure out a way to dump memory in the middle of this import lookup chain.
I think this is a bit dangerous since the text and the parse tree's position info can now get out of sync. I think putting it in the array buffer, in its own heap, is safer since it can guarantee data integrity. Otherwise, we either need to live with a possible race, or add some other way to check data integrity, such as a content hash somewhere, and we'd also need to deal with the invalid case. So putting it in an array buffer seems simpler. Also, another reason we want …
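The array-buffer approach described above could be sketched like this. This is an assumption about the shape of the experiment, not its actual code; the class name is made up. The relevant property is that typed-array backing stores are allocated outside the pointer-compressed managed heap, and that encoding/decoding round-trips the exact text, so parse tree positions stay valid.

```typescript
// Sketch: keep a closed file's text as UTF-8 bytes rather than as a
// JS string on the managed heap, decoding back to a string on demand.
class OffHeapText {
    private _bytes: Uint8Array;

    constructor(text: string) {
        this._bytes = new TextEncoder().encode(text);
    }

    // Decode on demand, e.g. when a closed file is reopened or searched.
    getText(): string {
        return new TextDecoder("utf-8").decode(this._bytes);
    }

    get byteLength(): number {
        return this._bytes.byteLength;
    }
}
```

Unlike re-reading from disk, this guarantees the recovered text is byte-for-byte identical to what was tokenized, which is the data-integrity argument made above.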







@rchiodo, this is an alternate approach to optimizing the tokenizer data structures. With this change, the tokens for a file are not cached in memory unless the file is currently open in the editor. They are recomputed on demand if/when needed. This should theoretically reduce the memory usage more than your proposed change. I think this change is less complex than yours — and therefore easier to maintain.
Could you please run your memory benchmarks on this branch to confirm the memory savings and quantify the savings relative to your proposed change? We can then use this data to inform which change is preferable.
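The on-demand approach described in this comment can be sketched roughly as follows. All names are hypothetical; this only illustrates the caching policy, not the branch's actual data structures.

```typescript
// Sketch: tokens are cached only while the file is open in the editor
// and are recomputed on demand otherwise.
type Tokens = string[];

class OnDemandTokens {
    private _cached: Tokens | undefined;

    constructor(
        private _tokenize: () => Tokens,
        private _isFileOpen: () => boolean
    ) {}

    getTokens(): Tokens {
        if (this._cached !== undefined) {
            return this._cached;
        }
        const tokens = this._tokenize();
        // Trade CPU for memory: only open files keep their tokens cached.
        if (this._isFileOpen()) {
            this._cached = tokens;
        }
        return tokens;
    }
}
```

For closed files every request re-runs the tokenizer, which is the CPU-for-memory trade the benchmark request above is meant to quantify.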