scip-syntax: implements parallel processing for workspace and tar commands#63536

kritzcreek · 2024-06-28T06:30:41Z

Closes https://linear.app/sourcegraph/issue/GRAPH-694/process-documents-in-parallel-in-scip-syntax

Parallelizes syntactic indexing by splitting the different indexing modes into job producers and then consuming them from a pool of workers using the rayon library.

scip-syntax will use the maximum available parallelism by default. The -j/--jobs argument can be used to control the number of worker threads.

Benchmarking on spring-project/spring-framework:

spring-framework on main
❯ hyperfine -L num_threads 1,4,8,10  'git archive HEAD | scip-syntax index tar - -j {num_threads} -l java' 'scip-syntax index workspace ${PWD} -l java -j {num_threads}'
Benchmark 1: git archive HEAD | scip-syntax index tar - -j 1 -l java
  Time (mean ± σ):      7.171 s ±  0.195 s    [User: 7.165 s, System: 0.149 s]
  Range (min … max):    7.047 s …  7.714 s    10 runs

Benchmark 2: scip-syntax index workspace ${PWD} -l java -j 1
  Time (mean ± σ):      7.100 s ±  0.061 s    [User: 6.965 s, System: 0.381 s]
  Range (min … max):    6.983 s …  7.184 s    10 runs

Benchmark 3: git archive HEAD | scip-syntax index tar - -j 4 -l java
  Time (mean ± σ):      1.927 s ±  0.044 s    [User: 7.366 s, System: 0.144 s]
  Range (min … max):    1.884 s …  2.034 s    10 runs

Benchmark 4: scip-syntax index workspace ${PWD} -l java -j 4
  Time (mean ± σ):      1.884 s ±  0.029 s    [User: 7.110 s, System: 0.342 s]
  Range (min … max):    1.855 s …  1.960 s    10 runs

Benchmark 5: git archive HEAD | scip-syntax index tar - -j 8 -l java
  Time (mean ± σ):      1.040 s ±  0.015 s    [User: 7.594 s, System: 0.162 s]
  Range (min … max):    1.025 s …  1.078 s    10 runs

Benchmark 6: scip-syntax index workspace ${PWD} -l java -j 8
  Time (mean ± σ):      1.010 s ±  0.007 s    [User: 7.275 s, System: 0.490 s]
  Range (min … max):    1.001 s …  1.023 s    10 runs

Benchmark 7: git archive HEAD | scip-syntax index tar - -j 10 -l java
  Time (mean ± σ):     962.7 ms ±   2.8 ms    [User: 8510.0 ms, System: 167.0 ms]
  Range (min … max):   959.0 ms … 966.7 ms    10 runs

Benchmark 8: scip-syntax index workspace ${PWD} -l java -j 10
  Time (mean ± σ):     935.7 ms ±   3.6 ms    [User: 8171.8 ms, System: 490.2 ms]
  Range (min … max):   930.9 ms … 943.5 ms    10 runs

Comparison against main:

spring-framework on  main [?]
❯ hyperfine 'git archive HEAD | scip-syntax index tar - -l java -o main.scip' 'git archive HEAD | scip-syntax-par index tar - -l java -o par.scip'
Benchmark 1: git archive HEAD | scip-syntax index tar - -l java -o main.scip
  Time (mean ± σ):      6.969 s ±  0.214 s    [User: 6.896 s, System: 0.096 s]
  Range (min … max):    6.806 s …  7.444 s    10 runs

Benchmark 2: git archive HEAD | scip-syntax-par index tar - -l java -o par.scip
  Time (mean ± σ):     974.7 ms ± 182.5 ms    [User: 9022.8 ms, System: 167.5 ms]
  Range (min … max):   907.1 ms … 1493.8 ms    10 runs

Summary
  git archive HEAD | scip-syntax-par index tar - -l java -o par.scip ran
    7.15 ± 1.36 times faster than git archive HEAD | scip-syntax index tar - -l java -o main.scip

spring-framework on  main [?]
❯ scip-syntax scip-evaluate --ground-truth main.scip --candidate par.scip
{"precision_percent":"100.0","recall_percent":"100.0","true_positives":"569982.0","false_positives":"0.0","false_negatives":"0.0"}

Test plan

Compare the output of scip-syntax from main against the output of this branch with scip-syntax scip-evaluate

varungandhi-src · 2024-06-28T07:08:11Z

How would we like to configure this. Maybe a -j flag to control the amount of workers?

Yeah, that's fine. We should default to max parallelism.

Should I keep a non-rayon codepath for -j 1?

No.

kritzcreek · 2024-06-28T07:16:26Z

Should I keep a non-rayon codepath for -j 1?

No.

Thinking about it, I think that might not be possible for the archive case, because doing 'producer/consumer' one a single thread means I'd read the entire archive into memory before starting to index.

varungandhi-src · 2024-06-28T07:16:30Z

docker-images/syntax-highlighter/crates/scip-syntax/src/index.rs

Will this short-circuit if there is a failure when indexing a single document? We should try to return partial results as much as possible, and log the errors that we got. If all results were errors, then it makes sense to mark the indexing process as an overall failure. I would expect behavior like:

0 valid filepaths -> exit success

N > 0 filepaths, 0 Documents -> exit failure

N > 0 filepaths, M Documents, 0 < M < N -> exit success and log up to max K errors (say K = 10?). For debugging, maybe we should allow K to be configurable from the outside.

Will this short-circuit if there is a failure when indexing a single document?

Yes it will (but it also did before this PR)

We should try to return partial results as much as possible, and log the errors that we got.

I agree, maybe we could produce a machine readably "summary" log message?

If all results were errors, then it makes sense to mark the indexing process as an overall failure

I don't know if there's a great default K/N to pick. Definitely worth discussing, but should probably be its own issue and PR?

I agree, maybe we could produce a machine readably "summary" log message?

yeah, my plan was to output a json result somewhere controlled by the caller - at the very least so we can send some internal CLI metrics to prometheus as well, when invoked from the worker

OK, I've filed a couple of follow-up issues:

https://linear.app/sourcegraph/issue/GRAPH-716/scip-syntax-support-observability
https://linear.app/sourcegraph/issue/GRAPH-717/scip-syntax-gracefully-handle-partial-failures

keynmol · 2024-06-28T08:31:01Z

For the workspace version I checked, and collecting all filepaths in spring-framework is <5% of the entire running time, so I don't it's necessary to parallelize the walk. (I could use a queue like I do for the tar processing, so that we can start indexing while walking the file system, but I'm not sure that's worth it)

Can we measure the memory overhead of collecting all file paths? Ideally we won't have to, as those paths have a lot of duplicated prefixes as strings, and I imagine on really large repos it can be significant.

That said, we're mainly interested in tar mode, workspace is there mainly for testing and experimentation

kritzcreek · 2024-06-28T08:43:28Z

Note to self. Use a bounded queue for tar processing ✅

keynmol · 2024-06-28T11:20:32Z

docker-images/syntax-highlighter/crates/scip-syntax/src/main.rs

Does rayon support threadpool injection?

Will this global state cause problems in integration tests where we are likely to call this twice?

Does rayon support threadpool injection?

You can create scoped threadpools, but I didn't dig too much into this.

Will this global state cause problems in integration tests where we are likely to call this twice?

Only if the integration test calls with Some(worker_count) twice. If you want to set the job count you'd need to set it on the first call, and then pass None going forward.

kritzcreek · 2024-06-28T12:41:44Z

Can we measure the memory overhead of collecting all file paths? Ideally we won't have to, as those paths have a lot of duplicated prefixes as strings, and I imagine on really large repos it can be significant.

That said, we're mainly interested in tar mode, workspace is there mainly for testing and experimentation

Actually I was able to use par_bridge here as well. So now we should only be holding on to however many filepaths we're currently processing.

kritzcreek · 2024-07-01T02:20:16Z

Okay I went for one final round of changes:

I've unified all indexing modes to be split into producer/consumer. There's now only a single function for consuming IndexJobs constructed by the different indexing modes. This means the files subcommand is now also parallelized.
This allowed me to also push the ThreadPool creation further down, so we no longer create a global thread pool (@keynmol)

Currently errors in the producers cause files to be skipped, while errors in the worker are collected and returned. My current preference would be to just log the errors in the producer (when we introduce tracing) as they're essentially all caused by non-Utf8-ness.

Sorry for the churn... Should be ready to review and merge now.

…63564) Closes https://linear.app/sourcegraph/issue/GRAPH-718/dont-parse-every-file-twice-in-scip-syntax While working on parallel scip-syntax I noticed we're re-parsing every file for both globals and locals generation. This PR makes it so we use the same parse tree for both. I've also made it so we don't re-initialize the language parsers for every file. The thread-local-storage aspect of this matters after merging #63536 ## Test plan ``` spring-framework on main [?] ❯ hyperfine --warmup 1 'git archive HEAD | scip-syntax-main index tar - -l java -o main.scip' 'git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip' Benchmark 1: git archive HEAD | scip-syntax-main index tar - -l java -o main.scip Time (mean ± σ): 6.894 s ± 0.033 s [User: 6.877 s, System: 0.093 s] Range (min … max): 6.814 s … 6.921 s 10 runs Benchmark 2: git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip Time (mean ± σ): 4.675 s ± 0.008 s [User: 4.693 s, System: 0.084 s] Range (min … max): 4.663 s … 4.688 s 10 runs Summary git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip ran 1.47 ± 0.01 times faster than git archive HEAD | scip-syntax-main index tar - -l java -o main.scip spring-framework on  main [?] ❯ scip-syntax scip-evaluate --ground-truth main.scip --candidate single-parse.scip {"precision_percent":"100.0","recall_percent":"100.0","true_positives":"569982.0","false_positives":"0.0","false_negatives":"0.0"} ```

This way we can do the processing on the main thread, and the code looks more "linear"

This way we don't need to set a global thread pool

keynmol

Nice. I really like the unified model for all three modes

kritzcreek requested review from keynmol and varungandhi-src June 28, 2024 06:30

cla-bot bot added the cla-signed label Jun 28, 2024

github-actions bot added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform labels Jun 28, 2024

kritzcreek force-pushed the christoph/parallel-scip-syntax branch 2 times, most recently from 6e954dd to 9f3f53d Compare June 28, 2024 06:46

varungandhi-src reviewed Jun 28, 2024

View reviewed changes

kritzcreek force-pushed the christoph/parallel-scip-syntax branch from d9da9ae to 1ddbd23 Compare June 28, 2024 07:26

keynmol reviewed Jun 28, 2024

View reviewed changes

kritzcreek force-pushed the christoph/parallel-scip-syntax branch 2 times, most recently from 4d9f40e to 2075eca Compare July 1, 2024 02:14

kritzcreek force-pushed the christoph/parallel-scip-syntax branch from 2075eca to de23211 Compare July 1, 2024 02:32

kritzcreek mentioned this pull request Jul 1, 2024

perf(scip-syntax): uses the same parse tree for globals and locals #63564

Merged

kritzcreek force-pushed the christoph/parallel-scip-syntax branch from de23211 to a6ff38f Compare July 1, 2024 04:57

Christoph Hegemann added 8 commits July 1, 2024 09:09

implements parallel processing for workspace and tar commands

8b09c40

moves reading the archive to a different thread

0c4e4e8

This way we can do the processing on the main thread, and the code looks more "linear"

use a bounded queue to cap the amount of in-memory file contents

b5078dc

make worker count configurable

2f5229a

don't read file contents for files with a wrong extension

d35e5c2

use par_bridge to simplify parallel workspace processing

7a0f5ed

unifies producer/consumer setup

7070c00

This way we don't need to set a global thread pool

adds an IndexJob struct to make things a little clearer

e6cf871

kritzcreek force-pushed the christoph/parallel-scip-syntax branch from a6ff38f to e6cf871 Compare July 1, 2024 07:10

kritzcreek requested a review from keynmol July 1, 2024 07:41

keynmol approved these changes Jul 1, 2024

View reviewed changes

kritzcreek merged commit b1d87c6 into main Jul 1, 2024

kritzcreek deleted the christoph/parallel-scip-syntax branch July 1, 2024 09:48

Conversation

kritzcreek commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

varungandhi-src commented Jun 28, 2024

Uh oh!

kritzcreek commented Jun 28, 2024

Uh oh!

varungandhi-src Jun 28, 2024

Choose a reason for hiding this comment

Uh oh!

kritzcreek Jun 28, 2024

Choose a reason for hiding this comment

Uh oh!

keynmol Jun 28, 2024

Choose a reason for hiding this comment

Uh oh!

varungandhi-src Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

keynmol commented Jun 28, 2024

Uh oh!

kritzcreek commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keynmol Jun 28, 2024

Choose a reason for hiding this comment

Uh oh!

kritzcreek Jun 28, 2024

Choose a reason for hiding this comment

Uh oh!

kritzcreek commented Jun 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kritzcreek commented Jul 1, 2024

Uh oh!

keynmol left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kritzcreek commented Jun 28, 2024 •

edited

Loading

varungandhi-src Jun 28, 2024 •

edited

Loading

kritzcreek commented Jun 28, 2024 •

edited

Loading

kritzcreek commented Jun 28, 2024 •

edited

Loading