perf(lexer): use portable-SIMD to speed up whitespace scanning#26

Boshen · 2023-02-20T09:16:24Z

No description provided.

github-actions · 2023-02-20T09:26:47Z

Parser Benchmark Results

group                    main                                   pr
-----                    ----                                   --
parser/babylon.max.js    1.02    164.1±0.57ms    62.9 MB/sec    1.00    160.8±0.52ms    64.2 MB/sec
parser/d3.js             1.04     19.5±0.06ms    27.9 MB/sec    1.00     18.8±0.37ms    29.1 MB/sec
parser/lodash.js         1.04      6.7±0.13ms    76.3 MB/sec    1.00      6.5±0.15ms    79.4 MB/sec
parser/pdf.js            1.12     11.7±0.42ms    34.5 MB/sec    1.00     10.5±0.11ms    38.5 MB/sec
parser/typescript.js     1.03    161.3±0.48ms    59.6 MB/sec    1.00    155.9±0.67ms    61.7 MB/sec

closes #13

strager · 2023-02-23T22:53:18Z

crates/oxc_parser/src/lexer/simd.rs

+    fn check_chunk(&mut self, chunk: &[u8]) {
+        let s = SimdVec::from_slice(chunk);
+
+        let any_newline = s.simd_eq(self.lf) | s.simd_eq(self.cr);


This doesn't seem to handle non-ASCII newlines (U+2028 and U+2029). Is that handled elsewhere?

These two are so rare that I left them in the scalar version:

https://github.com/Boshen/oxc/blob/fe677d4909c49f6540a359218f00b773ed8bb23c/crates/oxc_parser/src/lexer/mod.rs#L452

Otherwise they would drag down this SIMD version.
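
For readers following the thread, here is a minimal sketch of the split being described: ASCII LF and CR on the hot path, with U+2028 and U+2029 checked only when their UTF-8 lead byte shows up. This is not the code at the link above; `count_line_terminators` is a hypothetical name, and the loop is plain scalar Rust rather than `std::simd`, so it runs on a stable toolchain.

```rust
/// Hypothetical helper (not from the PR): counts line terminators in UTF-8 text.
/// CR and LF are counted separately, so "\r\n" contributes 2.
fn count_line_terminators(bytes: &[u8]) -> usize {
    let mut count = 0;
    for (i, &b) in bytes.iter().enumerate() {
        match b {
            // Hot path: the two ASCII newlines the SIMD comparison looks for.
            b'\n' | b'\r' => count += 1,
            // Cold path: U+2028 encodes as E2 80 A8 and U+2029 as E2 80 A9,
            // so only a 0xE2 lead byte needs the extra two-byte check.
            0xE2 => {
                let rest = &bytes[i + 1..];
                if rest.starts_with(&[0x80, 0xA8]) || rest.starts_with(&[0x80, 0xA9]) {
                    count += 1;
                }
            }
            _ => {}
        }
    }
    count
}
```

The point of the design is that the rare three-byte sequences never enter the per-byte comparison that the vectorized path performs; they only cost something when a 0xE2 byte actually appears.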

strager · 2023-02-23T22:56:39Z

crates/oxc_parser/src/lexer/simd.rs

+        }
+
+        if !remainder.is_empty() {
+            // Align the last chunk for avoiding the use of a scalar version


If you ensure that the input data is padded (e.g. suffixed with 64 0x00 bytes), then you don't need this special case.

remainder isn't aligned; it's a slice of the bytes input.

If the input is 20 bytes, then the first 16 bytes are handled in the for chunk in chunks loop, and the remaining 4 bytes are handled by this code (which adds 12 zero bytes at the end). I assume this is what you mean when you say "remainder isn't aligned".

You can guarantee that remainder is always empty or unimportant (thus you don't need to check for it) if you add enough zeros at the end of bytes. I do this in quick-lint-js when I load the file from the filesystem, so everything can safely assume that the extra bytes are present. (I have a dedicated type for this called padded_string_view which doesn't allow breaking this invariant.)
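
A rough sketch of the padding idea, under stated assumptions: `PaddedBytes`, `chunk_at`, `count_newlines`, and the 16-byte `LANES` width are hypothetical names and values chosen for illustration, not the quick-lint-js `padded_string_view` type and not the code in this PR. The per-chunk comparison is a scalar stand-in for the SIMD compare so the example builds on stable Rust.

```rust
/// Assumed vector width in bytes (illustrative only).
const LANES: usize = 16;

/// Hypothetical owner type: the source bytes followed by LANES zero bytes,
/// so a full chunk can be read starting at any offset up to `len`.
struct PaddedBytes {
    buf: Vec<u8>,
    len: usize, // length of the real input, excluding the padding
}

impl PaddedBytes {
    fn new(src: &[u8]) -> Self {
        let mut buf = Vec::with_capacity(src.len() + LANES);
        buf.extend_from_slice(src);
        buf.resize(src.len() + LANES, 0); // append the zero padding
        Self { buf, len: src.len() }
    }

    /// Always returns exactly LANES bytes; the tail may be padding.
    fn chunk_at(&self, offset: usize) -> &[u8] {
        assert!(offset <= self.len);
        &self.buf[offset..offset + LANES]
    }
}

/// Counts LF and CR bytes with a single loop and no remainder branch.
fn count_newlines(input: &PaddedBytes) -> usize {
    let mut count = 0;
    let mut offset = 0;
    while offset < input.len {
        let chunk = input.chunk_at(offset);
        // Scalar stand-in for the SIMD comparison; zero padding never matches.
        count += chunk.iter().filter(|&&b| b == b'\n' || b == b'\r').count();
        offset += LANES;
    }
    count
}
```

With a 20-byte input, the loop reads bytes 0..16 and then 16..32; the second read covers the last 4 real bytes plus 12 padding zeros, which is the same shape as the special case in the diff but without a separate code path.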

This is clever! I'll try it out.

Boshen force-pushed the simd-space branch 3 times, most recently from 178c0ed to d8df5bb, February 20, 2023 10:36

perf(lexer): use portable-SIMD to speed up whitespace scanning

c46ca07

closes #13

Boshen force-pushed the simd-space branch from d8df5bb to c46ca07, February 20, 2023 10:46

Boshen merged commit ab68cea into main Feb 20, 2023

Boshen deleted the simd-space branch February 20, 2023 11:03

Boshen added this to the AST / Lexer / Parser milestone Feb 21, 2023

strager reviewed Feb 23, 2023

Boshen mentioned this pull request Feb 24, 2023

Refactor: try aligned string for simd in lexer #43

Closed
