
Refactored tokenizer and parser output to help reduce the amount of memory needed. #7602

Merged: erictraut merged 1 commit into main from reduceTokenOverhead on Apr 2, 2024
Conversation

erictraut (Collaborator) commented Apr 2, 2024

@rchiodo, this is an alternate approach to optimizing the tokenizer data structures. With this change, the tokens for a file are not cached in memory unless the file is currently open in the editor. They are recomputed on demand if/when needed. This should theoretically reduce the memory usage more than your proposed change. I think this change is less complex than yours — and therefore easier to maintain.
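A minimal sketch of the caching policy described here (names are illustrative assumptions, not pyright's actual API): token output is retained only while a file is open in the editor; for closed files it is recomputed from the source text on demand.

```typescript
// Hypothetical sketch of "don't pin tokens for closed files".
// The tokenizer below is a stand-in (whitespace splitting), not the real one.
type Token = { kind: string; start: number; length: number };

function tokenize(text: string): Token[] {
    const tokens: Token[] = [];
    const re = /\S+/g;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
        tokens.push({ kind: 'word', start: m.index, length: m[0].length });
    }
    return tokens;
}

class TokenAccess {
    private cached: Token[] | undefined;

    constructor(private text: string, private isOpenInEditor: () => boolean) {}

    getTokens(): Token[] {
        // Serve from cache if the tokens are pinned.
        if (this.cached) return this.cached;
        const tokens = tokenize(this.text);
        // Pin the tokens only while the file is open in the editor;
        // otherwise let them be garbage-collected and recompute next time.
        if (this.isOpenInEditor()) {
            this.cached = tokens;
        }
        return tokens;
    }
}
```

The trade is memory for repeat tokenization cost on closed files, which is the perf question examined later in this thread.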

Could you please run your memory benchmarks on this branch to confirm the memory savings and quantify the savings relative to your proposed change? We can then use this data to inform which change is preferable.

github-actions bot (Contributor) commented Apr 2, 2024

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

rchiodo (Collaborator) commented Apr 2, 2024

Thanks, will do.

rchiodo (Collaborator) commented Apr 2, 2024

So I renamed the tokens' TextRangeCollection to a TokenCollection class so I could compare them (this makes it show up in the heap snapshot output).

In the worst-case scenario, where I open a file that has a ton of tokens, the compressed version is better:

Non-pinned tokens (just one reference, for the single open file):
[screenshot]

Compressed tokens (11 references, for all the files parsed):
[screenshot]

However, if I close that file and open a different one, the amount of memory goes way down (non-pinned version):
[screenshot]

Whereas the compressed version still holds onto all of the compressed tokens.

I'm going to try with Pylance to see what the difference there is.

rchiodo (Collaborator) commented Apr 2, 2024

With Pylance, I tried loading the example project that used to crash.

The non-pinned version would get close to the memory limit, but it would GC and move back away from crashing.

The compressed version never got close to the memory limit; however, it did hold onto a lot more memory.

Token collection retained size in the non-pinned version:
[screenshot]

Token collection retained size in the compressed version:
[screenshot]

The non-pinned version seems better in an actual project.

You can see the total memory allocated is smaller:

Non-pinned tokens:

[screenshot]

Token compression:

[screenshot]

rchiodo (Collaborator) commented Apr 2, 2024

I'm going to do some perf analysis of the non-pinned version to see if there's any perf impact.

rchiodo (Collaborator) commented Apr 2, 2024

Perf analysis of the non-pinned version shows it has zero impact on a couple of different projects. I'd imagine the only real impact would be in having to GC more often.

erictraut (Collaborator, Author):

Thanks for the analysis.

One thing that's still bothering me... Why are we seeing heap exhaustion in the first place? We have code in place to monitor heap usage and deallocate cached information (including all tokenizer output, parser output, and type cache data) when we exceed a high-water mark for memory usage. The only reason we should ever run out of heap space is if a single operation requires that much memory and does not "come up for air" to check whether the heap is approaching exhaustion, and I have a hard time believing that's happening here. What operation are you performing when you see this occur? The only other explanation is that we have some bug where we're missing a heap exhaustion check along some code path or we're failing to deallocate all of the cached information under certain circumstances. It would be good to understand this root cause.

rchiodo (Collaborator) commented Apr 2, 2024

> Thanks for the analysis.
>
> One thing that's still bothering me... Why are we seeing heap exhaustion in the first place? We have code in place to monitor heap usage and deallocate cached information (including all tokenizer output, parser output, and type cache data) when we exceed a high-water mark for memory usage. The only reason we should ever run out of heap space is if a single operation requires that much memory and does not "come up for air" to check whether the heap is approaching exhaustion, and I have a hard time believing that's happening here. What operation are you performing when you see this occur? The only other explanation is that we have some bug where we're missing a heap exhaustion check along some code path or we're failing to deallocate all of the cached information under certain circumstances. It would be good to understand this root cause.

I believe the reason is that we have 3 threads running, and the heap information doesn't accurately reflect the usage of the other threads. Each thread thinks it has only 1 GB or so allocated. This seems to be a limitation that comes with pointer compression (it only happens in VS Code). Without using VS Code's node executable, each thread gets its own heap and we don't run out of memory. In the VS Code case, the heap stats function doesn't report the usage of all threads.
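For reference, a sketch of the per-isolate view described here, using Node's built-in v8 module (this is illustrative, not the actual monitoring code): each thread sees statistics for its own isolate only, so when worker threads share one 4 GB pointer-compression cage, per-thread numbers understate the true total.

```typescript
import * as v8 from 'v8';

// Reports heap usage for the *calling* isolate only. When several worker
// threads share a single pointer-compression cage, each thread's
// heap_size_limit looks like the full limit and used_heap_size excludes
// the other threads' allocations, so a per-thread high-water-mark check
// can pass even when the shared cage is nearly exhausted.
function heapUsageFraction(): number {
    const stats = v8.getHeapStatistics();
    return stats.used_heap_size / stats.heap_size_limit;
}
```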

When I did the initial investigation on the microsoft/pylance-release#5440 bug, I implemented something that would force a cache cleanup in the middle of binding and would share heap stats between threads using a SharedArrayBuffer. That did help, but it was slow, and dumping parse results while binding other files would break some parse trees.

rchiodo (Collaborator) commented Apr 2, 2024

@heejaechang and I both did independent analyses of the shared-memory problem. He got the same results. In VS Code, the different worker threads don't each get their own 4 GB; they share it. Meaning our heap high-water mark is always wrong when there are multiple threads involved.

So maybe we should also modify the heap analysis code to share the total memory allocated between threads?

I had written something to do that and it seemed to work. That would help with this issue too.

erictraut (Collaborator, Author):

> Meaning our heap high water mark is always wrong when there are multiple threads involved.

That sounds like a big problem. Without solving this, we'll continue to see heap exhaustion issues under certain conditions. I think we need to invest in a solution for this underlying problem rather than playing whack-a-mole. Reducing memory usage is good in general, but it's not addressing the core problem here.

rchiodo (Collaborator) commented Apr 2, 2024

> Meaning our heap high water mark is always wrong when there are multiple threads involved.

> That sounds like a big problem. Without solving this, we'll continue to see heap exhaustion issues under certain conditions. I think we need to invest in a solution for this underlying problem rather than playing whack-a-mole. Reducing memory usage is good in general, but it's not addressing the core problem here.

Okay, I'll submit a change that uses a SharedArrayBuffer to share memory usage between threads to see what you think.
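One possible shape for that change (slot layout and names are assumptions, not the actual implementation): each thread periodically publishes its own isolate's usage into a SharedArrayBuffer, and the high-water-mark check sums all slots instead of consulting the local isolate alone.

```typescript
// Hypothetical shared ledger of per-thread heap usage, in whole megabytes.
// Int32Array is used so the slots can be accessed with Atomics.
const MAX_THREADS = 8;
const shared = new SharedArrayBuffer(MAX_THREADS * Int32Array.BYTES_PER_ELEMENT);
const usageMb = new Int32Array(shared);

// Each thread calls this periodically with its own isolate's usage.
function publishUsage(threadIndex: number, usedBytes: number): void {
    Atomics.store(usageMb, threadIndex, Math.round(usedBytes / 1048576));
}

// Any thread can approximate total process usage by summing all slots,
// which is what the high-water-mark check would compare against the limit.
function totalUsageMb(): number {
    let total = 0;
    for (let i = 0; i < MAX_THREADS; i++) {
        total += Atomics.load(usageMb, i);
    }
    return total;
}
```

In a real setup, the SharedArrayBuffer would be handed to each worker at creation time (e.g. via workerData) so that all threads write into the same buffer.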

heejaechang (Collaborator):

> That sounds like a big problem. Without solving this, we'll continue to see heap exhaustion issues under certain conditions. I think we need to invest in a solution for this underlying problem rather than playing whack-a-mole. Reducing memory usage is good in general, but it's not addressing the core problem here.

We introduced a new option - https://github.com/microsoft/pylance-release/blob/main/TROUBLESHOOTING.md#pylance-is-crashing - that lets users run the language server with their own Node rather than with VS Code's Node. That should generally solve the issue for users who have big workspaces needing more than 4 GB.

We didn't make it the default since 1. using VS Code's Node is generally more convenient, 2. it uses less memory, and 3. it's a bit faster (10%-20%) through pointer compression. And we believe that for most users, 4 GB is enough.

erictraut (Collaborator, Author):

I don't think it makes sense to say "if pylance is crashing, try this option". Pylance should not crash by default.

The heap usage monitoring logic should prevent heap exhaustion if the information it receives is accurate. The SharedArrayBuffer approach sounds promising.

heejaechang (Collaborator) commented Apr 2, 2024

I also did some investigation on microsoft/pylance-release#5440 using Chrome/Edge's heap snapshot.

Of all heap usage, about 50% is text and tokens: about 25% is used by text, and another 25% is used by tokens.

So, @rchiodo is trying to reduce the tokens' memory footprint, and I am also experimenting with reducing the text's footprint.

...

As to why our memory pressure handling didn't work for this issue: that particular library has a lot of transitive imports, so importing the top module caused hundreds of files to be parsed and bound, and several of them have texts over 20 MB.

So the issue is that the existing cache dump only works between top-level operations; it can't dump things from memory in the middle of import lookups.
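A sketch of what a mid-operation check could look like (all names hypothetical): the transitive import walk periodically consults the memory monitor and dumps caches in the middle of the operation, rather than only between top-level operations.

```typescript
// Hypothetical "come up for air" check inside a long import-resolution
// chain. The dependency graph, pressure check, and cache dump are all
// injected so the walk itself stays testable.
function resolveImportsTransitively(
    roots: string[],
    getDeps: (file: string) => string[],
    underMemoryPressure: () => boolean,
    dumpCaches: () => void
): string[] {
    const seen = new Set<string>();
    const queue = [...roots];
    let processed = 0;
    while (queue.length > 0) {
        const file = queue.shift()!;
        if (seen.has(file)) continue;
        seen.add(file);
        queue.push(...getDeps(file));
        // Check memory pressure every few files, not just when the whole
        // top-level operation completes.
        if (++processed % 4 === 0 && underMemoryPressure()) {
            dumpCaches();
        }
    }
    return [...seen];
}
```

The hard part, as noted in the thread, is that dumping parse results mid-operation must not invalidate trees that other in-flight work still depends on.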

heejaechang (Collaborator):

> I don't think it makes sense to say "if pylance is crashing, try this option". Pylance should not crash by default.

With pointer compression imposing a hard 4 GB limit, crashes will still happen for some people. And even if they don't crash, the UX will be bad, since we will continuously dump and reparse/rebind files.

Probably the best option is convincing VS Code not to do pointer compression, but that comes with its own drawbacks, such as using more memory for the majority of people for whom 4 GB is enough.

erictraut (Collaborator, Author):

> about 50% is text

By "text", I assume you mean the contents of source files that have been tokenized & parsed? We could use the same technique that I implemented in this branch but apply it to the file contents in addition to the tokenizer output. This would require that we reload the text on demand (from disk) if and when it's needed.

> that particular library has a lot of imports transitively

Ah yes, I remember this one now. We had talked about eliminating the need to follow imports transitively at the expense of not having the correct icon in the completion suggestion menu. That seems like a good tradeoff to me. You mentioned in this issue that you fixed the problem, but you never responded to my question about which fix you implemented.

heejaechang (Collaborator):

> I am also experimenting on reducing text's footprint

By the way, the experiment I am doing is introducing an ISourceText abstraction and making ParseResults and AnalyzeNodeFileInfo hold onto the ISourceText rather than a string directly.

According to the experiment I did, Node with pointer compression has a 4 GB limit for the managed heap, but the array buffer heap gets its own 4 GB. So I am planning to put text in the array buffer heap instead of the managed heap, where regular JS objects live.

I looked through the code for where text is used once it is parsed/bound, and this is the only place - https://github.com/microsoft/pyright/blob/main/packages/pyright-internal/src/analyzer/typeEvaluator.ts#L26689 - except on the LS side.

So there is no reason to hold onto the text in the managed heap, consuming 25% of the total heap.

My plan is to put the text in the array buffer, convert it to a string when needed for closed files (for things like find-all-references, etc.), and cache the string for open files.
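A minimal sketch of that idea (ISourceText's actual shape in pylance is not shown here; this interface and class are assumptions based on the description): keep file contents as UTF-8 bytes in an ArrayBuffer-backed array, which under pointer compression lives in the array buffer heap rather than the 4 GB managed heap, and materialize a JS string only on demand.

```typescript
// Hypothetical buffer-backed source text: the string is encoded once and
// then released; consumers pay a decode cost only when they need a string.
interface ISourceText {
    getText(): string;
    readonly length: number;
}

class BufferBackedSourceText implements ISourceText {
    private readonly bytes: Uint8Array;

    constructor(text: string) {
        // UTF-8 bytes live in the array buffer heap, not the managed heap.
        this.bytes = new TextEncoder().encode(text);
    }

    get length(): number {
        return this.bytes.byteLength;
    }

    getText(): string {
        // Decode on demand for closed files (find-all-references, etc.).
        // An open-file implementation would cache the decoded string.
        return new TextDecoder().decode(this.bytes);
    }
}
```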

heejaechang (Collaborator) commented Apr 2, 2024

> You mentioned in microsoft/pylance-release#4241 (comment) that you fixed the problem, but you never responded to my question about which fix you implemented.

Ah, I linked to the code in pyrx, and I guess that isn't visible to you. This is where I put it - https://github.com/microsoft/pyright/blob/main/packages/pyright-internal/src/languageService/completionProvider.ts#L637

and

> We had talked about eliminating the need to follow imports transitively at the expense of not having the correct icon in the completion suggestion menu

That is just one code path that reaches import lookup, through the resolve-alias call. There are many code paths to import lookup, such as importing symbols through * in the binder, or resolving imported symbols in type evaluation.

All of those code paths can potentially hit the same issue. So if we want to fix the root cause, we probably need to figure out a way to dump memory in the middle of this import lookup chain.

heejaechang (Collaborator):

> This would require that we reload the text on demand (from disk) if and when it's needed.

I think this is a bit dangerous, since the text and the parse tree's position info can then get out of sync. I think putting it in the array buffer, in its own heap, is safer, since that guarantees data integrity. Otherwise, we either have to live with a possible race or add some other way to check data integrity, such as a content hash somewhere, and we would also need to deal with the invalid case. So putting it in the array buffer seems simpler.

Also, another reason we want the ISourceText abstraction is so that pylance can abstract it out for other features (such as synthesized text).
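For illustration, the content-hash safeguard mentioned above could look like this (a hedged sketch, not actual pylance code): record a hash of the text the parse tree was built from, and verify it when the text is reloaded from disk so stale position info can be detected.

```typescript
import { createHash } from 'crypto';

// Hash recorded at parse time, keyed alongside the parse results.
function contentHash(text: string): string {
    return createHash('sha256').update(text, 'utf8').digest('hex');
}

// A mismatch means the file changed on disk after parsing, so the parse
// tree's position info can no longer be trusted and a reparse is needed.
function isTextStillValid(reloaded: string, expectedHash: string): boolean {
    return contentHash(reloaded) === expectedHash;
}
```

This catches the out-of-sync case but still leaves the "invalid case" to handle, which is the complexity the array-buffer approach avoids.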

erictraut merged commit 5fd8830 into main on Apr 2, 2024
erictraut deleted the reduceTokenOverhead branch on April 3, 2024 at 14:20