scip-syntax: implements parallel processing for workspace and tar commands#63536
kritzcreek merged 8 commits into main
Conversation
force-pushed from 6e954dd to 9f3f53d
Yeah, that's fine. We should default to max parallelism.
No.
Thinking about it, I think that might not be possible for the archive case, because doing producer/consumer on a single thread means I'd read the entire archive into memory before starting to index.
Will this short-circuit if there is a failure when indexing a single document? We should try to return partial results as much as possible, and log the errors that we got. If all results were errors, then it makes sense to mark the indexing process as an overall failure. I would expect behavior like:
- 0 valid filepaths -> exit success
- N > 0 filepaths, 0 Documents -> exit failure
- N > 0 filepaths, M Documents, 0 < M < N -> exit success and log up to max K errors (say K = 10?). For debugging, maybe we should allow K to be configurable from the outside.
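The policy above could be sketched roughly as follows. This is an illustrative sketch only, not scip-syntax code; the function names and the default of K = 10 are taken from the suggestion, not from the implementation.

```rust
/// Hypothetical: decide overall exit status from path and document counts.
fn indexing_succeeded(total_paths: usize, documents: usize) -> bool {
    // 0 valid filepaths -> success; N > 0 paths with 0 documents -> failure;
    // any partial result (0 < M <= N documents) -> success.
    total_paths == 0 || documents > 0
}

/// Hypothetical: log at most `max_logged` errors (K = 10 by default in the
/// suggestion), then note how many were suppressed.
fn log_errors(errors: &[String], max_logged: usize) {
    for err in errors.iter().take(max_logged) {
        eprintln!("indexing error: {err}");
    }
    if errors.len() > max_logged {
        eprintln!("... and {} more errors", errors.len() - max_logged);
    }
}
```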
> Will this short-circuit if there is a failure when indexing a single document?

Yes it will (but it also did before this PR)

> We should try to return partial results as much as possible, and log the errors that we got.

I agree, maybe we could produce a machine-readable "summary" log message?

> If all results were errors, then it makes sense to mark the indexing process as an overall failure

I don't know if there's a great default K/N to pick. Definitely worth discussing, but should probably be its own issue and PR?
> I agree, maybe we could produce a machine-readable "summary" log message?

Yeah, my plan was to output a JSON result somewhere controlled by the caller, at the very least so we can send some internal CLI metrics to Prometheus as well when invoked from the worker.
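A machine-readable summary might look something like the sketch below. This is purely illustrative: the field names are made up, and a real implementation would presumably use serde_json rather than hand-rolled formatting.

```rust
/// Hypothetical machine-readable summary line; field names are illustrative.
fn summary_json(indexed: usize, failed: usize, errors: &[String]) -> String {
    // Hand-rolled JSON to keep the sketch dependency-free; a real
    // implementation would use a proper JSON serializer.
    let errs: Vec<String> = errors
        .iter()
        .map(|e| format!("\"{}\"", e.replace('"', "\\\"")))
        .collect();
    format!(
        "{{\"indexed\":{},\"failed\":{},\"errors\":[{}]}}",
        indexed,
        failed,
        errs.join(",")
    )
}
```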
OK, I've filed a couple of follow-up issues:
https://linear.app/sourcegraph/issue/GRAPH-716/scip-syntax-support-observability
https://linear.app/sourcegraph/issue/GRAPH-717/scip-syntax-gracefully-handle-partial-failures
force-pushed from d9da9ae to 1ddbd23
Can we measure the memory overhead of collecting all file paths? Ideally we won't have to, as those paths have a lot of duplicated prefixes as strings, and I imagine on really large repos it can be significant. That said, we're mainly interested in tar mode; workspace is there mainly for testing and experimentation.
Note to self: use a bounded queue for tar processing ✅
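A bounded queue keeps the tar producer from reading the whole archive into memory ahead of the workers. A minimal std-only sketch of the idea using `sync_channel` (the PR itself uses rayon-based workers, so this is illustrative, not the actual implementation):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Illustrative: feeds 16 fake tar entries through a bounded channel and
/// returns how many the consumer processed.
fn process_bounded() -> usize {
    // Capacity 4: the producer blocks once four unprocessed entries are
    // queued, so we never buffer the whole archive in memory.
    let (tx, rx) = sync_channel::<Vec<u8>>(4);

    let producer = thread::spawn(move || {
        // Stand-in for reading tar entries one at a time.
        for i in 0u8..16 {
            tx.send(vec![i; 8]).expect("consumer hung up");
        }
        // tx is dropped here, which closes the channel and ends the
        // consumer's loop below.
    });

    let mut processed = 0;
    for _entry in rx {
        processed += 1; // stand-in for indexing one document
    }
    producer.join().unwrap();
    processed
}
```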
Does rayon support threadpool injection?
Will this global state cause problems in integration tests where we are likely to call this twice?
> Does rayon support threadpool injection?

You can create scoped threadpools, but I didn't dig too much into this.

> Will this global state cause problems in integration tests where we are likely to call this twice?

Only if the integration test calls it with Some(worker_count) twice. If you want to set the job count you'd need to set it on the first call, and then pass None going forward.
Actually I was able to use
force-pushed from 4d9f40e to 2075eca
Okay I went for one final round of changes:
Currently, errors in the producers cause files to be skipped, while errors in the workers are collected and returned. My current preference would be to just log the errors in the producer (when we introduce tracing), as they're essentially all caused by non-UTF-8-ness. Sorry for the churn... Should be ready to review and merge now.
force-pushed from 2075eca to de23211
…63564) Closes https://linear.app/sourcegraph/issue/GRAPH-718/dont-parse-every-file-twice-in-scip-syntax

While working on parallel scip-syntax I noticed we're re-parsing every file for both globals and locals generation. This PR makes it so we use the same parse tree for both. I've also made it so we don't re-initialize the language parsers for every file. The thread-local-storage aspect of this matters after merging #63536

## Test plan

```
spring-framework on main [?]
❯ hyperfine --warmup 1 'git archive HEAD | scip-syntax-main index tar - -l java -o main.scip' 'git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip'

Benchmark 1: git archive HEAD | scip-syntax-main index tar - -l java -o main.scip
  Time (mean ± σ):      6.894 s ±  0.033 s    [User: 6.877 s, System: 0.093 s]
  Range (min … max):    6.814 s …  6.921 s    10 runs

Benchmark 2: git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip
  Time (mean ± σ):      4.675 s ±  0.008 s    [User: 4.693 s, System: 0.084 s]
  Range (min … max):    4.663 s …  4.688 s    10 runs

Summary
  git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip ran
    1.47 ± 0.01 times faster than git archive HEAD | scip-syntax-main index tar - -l java -o main.scip

spring-framework on main [?]
❯ scip-syntax scip-evaluate --ground-truth main.scip --candidate single-parse.scip
{"precision_percent":"100.0","recall_percent":"100.0","true_positives":"569982.0","false_positives":"0.0","false_negatives":"0.0"}
```
force-pushed from de23211 to a6ff38f
This way we can do the processing on the main thread, and the code looks more "linear"
This way we don't need to set a global thread pool
force-pushed from a6ff38f to e6cf871
keynmol left a comment
Nice. I really like the unified model for all three modes
Closes https://linear.app/sourcegraph/issue/GRAPH-694/process-documents-in-parallel-in-scip-syntax
Parallelizes syntactic indexing by splitting the different indexing modes into job producers and then consuming them from a pool of workers using the `rayon` library. `scip-syntax` will use the maximum available parallelism by default. The `-j`/`--jobs` argument can be used to control the number of worker threads.

Benchmarking on spring-project/spring-framework:

Comparison against `main`:

## Test plan

Compare the output of `scip-syntax` from `main` against the output of this branch with `scip-syntax scip-evaluate`