Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

scip-syntax: implements parallel processing for workspace and tar commands#63536

Merged
kritzcreek merged 8 commits intomainfrom
christoph/parallel-scip-syntax
Jul 1, 2024
Merged

scip-syntax: implements parallel processing for workspace and tar commands#63536
kritzcreek merged 8 commits intomainfrom
christoph/parallel-scip-syntax

Conversation

@kritzcreek
Copy link
Contributor

@kritzcreek kritzcreek commented Jun 28, 2024

Closes https://linear.app/sourcegraph/issue/GRAPH-694/process-documents-in-parallel-in-scip-syntax

Parallelizes syntactic indexing by splitting the different indexing modes into job producers and then consuming them from a pool of workers using the rayon library.

scip-syntax will use the maximum available parallelism by default. The -j/--jobs argument can be used to control the number of worker threads.

Benchmarking on spring-project/spring-framework:

spring-framework on main
❯ hyperfine -L num_threads 1,4,8,10  'git archive HEAD | scip-syntax index tar - -j {num_threads} -l java' 'scip-syntax index workspace ${PWD} -l java -j {num_threads}'
Benchmark 1: git archive HEAD | scip-syntax index tar - -j 1 -l java
  Time (mean ± σ):      7.171 s ±  0.195 s    [User: 7.165 s, System: 0.149 s]
  Range (min … max):    7.047 s …  7.714 s    10 runs

Benchmark 2: scip-syntax index workspace ${PWD} -l java -j 1
  Time (mean ± σ):      7.100 s ±  0.061 s    [User: 6.965 s, System: 0.381 s]
  Range (min … max):    6.983 s …  7.184 s    10 runs

Benchmark 3: git archive HEAD | scip-syntax index tar - -j 4 -l java
  Time (mean ± σ):      1.927 s ±  0.044 s    [User: 7.366 s, System: 0.144 s]
  Range (min … max):    1.884 s …  2.034 s    10 runs

Benchmark 4: scip-syntax index workspace ${PWD} -l java -j 4
  Time (mean ± σ):      1.884 s ±  0.029 s    [User: 7.110 s, System: 0.342 s]
  Range (min … max):    1.855 s …  1.960 s    10 runs

Benchmark 5: git archive HEAD | scip-syntax index tar - -j 8 -l java
  Time (mean ± σ):      1.040 s ±  0.015 s    [User: 7.594 s, System: 0.162 s]
  Range (min … max):    1.025 s …  1.078 s    10 runs

Benchmark 6: scip-syntax index workspace ${PWD} -l java -j 8
  Time (mean ± σ):      1.010 s ±  0.007 s    [User: 7.275 s, System: 0.490 s]
  Range (min … max):    1.001 s …  1.023 s    10 runs

Benchmark 7: git archive HEAD | scip-syntax index tar - -j 10 -l java
  Time (mean ± σ):     962.7 ms ±   2.8 ms    [User: 8510.0 ms, System: 167.0 ms]
  Range (min … max):   959.0 ms … 966.7 ms    10 runs

Benchmark 8: scip-syntax index workspace ${PWD} -l java -j 10
  Time (mean ± σ):     935.7 ms ±   3.6 ms    [User: 8171.8 ms, System: 490.2 ms]
  Range (min … max):   930.9 ms … 943.5 ms    10 runs

Comparison against main:

spring-framework on  main [?]
❯ hyperfine 'git archive HEAD | scip-syntax index tar - -l java -o main.scip' 'git archive HEAD | scip-syntax-par index tar - -l java -o par.scip'
Benchmark 1: git archive HEAD | scip-syntax index tar - -l java -o main.scip
  Time (mean ± σ):      6.969 s ±  0.214 s    [User: 6.896 s, System: 0.096 s]
  Range (min … max):    6.806 s …  7.444 s    10 runs

Benchmark 2: git archive HEAD | scip-syntax-par index tar - -l java -o par.scip
  Time (mean ± σ):     974.7 ms ± 182.5 ms    [User: 9022.8 ms, System: 167.5 ms]
  Range (min … max):   907.1 ms … 1493.8 ms    10 runs

Summary
  git archive HEAD | scip-syntax-par index tar - -l java -o par.scip ran
    7.15 ± 1.36 times faster than git archive HEAD | scip-syntax index tar - -l java -o main.scip

spring-framework on  main [?]
❯ scip-syntax scip-evaluate --ground-truth main.scip --candidate par.scip
{"precision_percent":"100.0","recall_percent":"100.0","true_positives":"569982.0","false_positives":"0.0","false_negatives":"0.0"}

Test plan

Compare the output of scip-syntax from main against the output of this branch with scip-syntax scip-evaluate

@cla-bot cla-bot bot added the cla-signed label Jun 28, 2024
@github-actions github-actions bot added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform labels Jun 28, 2024
@kritzcreek kritzcreek force-pushed the christoph/parallel-scip-syntax branch 2 times, most recently from 6e954dd to 9f3f53d Compare June 28, 2024 06:46
@varungandhi-src
Copy link
Contributor

How would we like to configure this. Maybe a -j flag to control the amount of workers?

Yeah, that's fine. We should default to max parallelism.

Should I keep a non-rayon codepath for -j 1?

No.

@kritzcreek
Copy link
Contributor Author

Should I keep a non-rayon codepath for -j 1?

No.

Thinking about it, I think that might not be possible for the archive case, because doing 'producer/consumer' one a single thread means I'd read the entire archive into memory before starting to index.

Comment on lines 169 to 170
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this short-circuit if there is a failure when indexing a single document? We should try to return partial results as much as possible, and log the errors that we got. If all results were errors, then it makes sense to mark the indexing process as an overall failure. I would expect behavior like:

  1. 0 valid filepaths -> exit success
  2. N > 0 filepaths, 0 Documents -> exit failure
  3. N > 0 filepaths, M Documents, 0 < M < N -> exit success and log up to max K errors (say K = 10?). For debugging, maybe we should allow K to be configurable from the outside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this short-circuit if there is a failure when indexing a single document?

Yes it will (but it also did before this PR)

We should try to return partial results as much as possible, and log the errors that we got.

I agree, maybe we could produce a machine readably "summary" log message?

If all results were errors, then it makes sense to mark the indexing process as an overall failure

I don't know if there's a great default K/N to pick. Definitely worth discussing, but should probably be its own issue and PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, maybe we could produce a machine readably "summary" log message?

yeah, my plan was to output a json result somewhere controlled by the caller - at the very least so we can send some internal CLI metrics to prometheus as well, when invoked from the worker

Copy link
Contributor

@varungandhi-src varungandhi-src Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kritzcreek kritzcreek force-pushed the christoph/parallel-scip-syntax branch from d9da9ae to 1ddbd23 Compare June 28, 2024 07:26
@keynmol
Copy link
Contributor

keynmol commented Jun 28, 2024

For the workspace version I checked, and collecting all filepaths in spring-framework is <5% of the entire running time, so I don't it's necessary to parallelize the walk. (I could use a queue like I do for the tar processing, so that we can start indexing while walking the file system, but I'm not sure that's worth it)

Can we measure the memory overhead of collecting all file paths? Ideally we won't have to, as those paths have a lot of duplicated prefixes as strings, and I imagine on really large repos it can be significant.

That said, we're mainly interested in tar mode, workspace is there mainly for testing and experimentation

@kritzcreek
Copy link
Contributor Author

kritzcreek commented Jun 28, 2024

Note to self. Use a bounded queue for tar processing ✅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does rayon support threadpool injection?

Will this global state cause problems in integration tests where we are likely to call this twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does rayon support threadpool injection?

You can create scoped threadpools, but I didn't dig too much into this.

Will this global state cause problems in integration tests where we are likely to call this twice?

Only if the integration test calls with Some(worker_count) twice. If you want to set the job count you'd need to set it on the first call, and then pass None going forward.

@kritzcreek
Copy link
Contributor Author

kritzcreek commented Jun 28, 2024

Can we measure the memory overhead of collecting all file paths? Ideally we won't have to, as those paths have a lot of duplicated prefixes as strings, and I imagine on really large repos it can be significant.

That said, we're mainly interested in tar mode, workspace is there mainly for testing and experimentation

Actually I was able to use par_bridge here as well. So now we should only be holding on to however many filepaths we're currently processing.

@kritzcreek kritzcreek force-pushed the christoph/parallel-scip-syntax branch 2 times, most recently from 4d9f40e to 2075eca Compare July 1, 2024 02:14
@kritzcreek
Copy link
Contributor Author

Okay I went for one final round of changes:

  1. I've unified all indexing modes to be split into producer/consumer. There's now only a single function for consuming IndexJobs constructed by the different indexing modes. This means the files subcommand is now also parallelized.
  2. This allowed me to also push the ThreadPool creation further down, so we no longer create a global thread pool (@keynmol)

Currently errors in the producers cause files to be skipped, while errors in the worker are collected and returned. My current preference would be to just log the errors in the producer (when we introduce tracing) as they're essentially all caused by non-Utf8-ness.

Sorry for the churn... Should be ready to review and merge now.

@kritzcreek kritzcreek force-pushed the christoph/parallel-scip-syntax branch from 2075eca to de23211 Compare July 1, 2024 02:32
kritzcreek pushed a commit that referenced this pull request Jul 1, 2024
…63564)

Closes
https://linear.app/sourcegraph/issue/GRAPH-718/dont-parse-every-file-twice-in-scip-syntax

While working on parallel scip-syntax I noticed we're re-parsing every
file for both globals and locals generation. This PR makes it so we use
the same parse tree for both. I've also made it so we don't
re-initialize the language parsers for every file. The
thread-local-storage aspect of this matters after merging #63536

## Test plan

```
spring-framework on main [?]
❯ hyperfine --warmup 1 'git archive HEAD | scip-syntax-main index tar - -l java -o main.scip' 'git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip'
Benchmark 1: git archive HEAD | scip-syntax-main index tar - -l java -o main.scip
  Time (mean ± σ):      6.894 s ±  0.033 s    [User: 6.877 s, System: 0.093 s]
  Range (min … max):    6.814 s …  6.921 s    10 runs

Benchmark 2: git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip
  Time (mean ± σ):      4.675 s ±  0.008 s    [User: 4.693 s, System: 0.084 s]
  Range (min … max):    4.663 s …  4.688 s    10 runs

Summary
  git archive HEAD | scip-syntax index tar - -l java -o single-parse.scip ran
    1.47 ± 0.01 times faster than git archive HEAD | scip-syntax-main index tar - -l java -o main.scip

spring-framework on  main [?]
❯ scip-syntax scip-evaluate --ground-truth main.scip --candidate single-parse.scip
{"precision_percent":"100.0","recall_percent":"100.0","true_positives":"569982.0","false_positives":"0.0","false_negatives":"0.0"}
```
@kritzcreek kritzcreek force-pushed the christoph/parallel-scip-syntax branch from de23211 to a6ff38f Compare July 1, 2024 04:57
@kritzcreek kritzcreek force-pushed the christoph/parallel-scip-syntax branch from a6ff38f to e6cf871 Compare July 1, 2024 07:10
@kritzcreek kritzcreek requested a review from keynmol July 1, 2024 07:41
Copy link
Contributor

@keynmol keynmol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I really like the unified model for all three modes

@kritzcreek kritzcreek merged commit b1d87c6 into main Jul 1, 2024
@kritzcreek kritzcreek deleted the christoph/parallel-scip-syntax branch July 1, 2024 09:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla-signed team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants