Commit 97f5b89

regex.pcre: optimise for speed the PCRE implementation (#26609)
1 parent eac0e94 commit 97f5b89

3 files changed

Lines changed: 700 additions & 879 deletions

vlib/regex/pcre/README.md

Lines changed: 70 additions & 220 deletions
@@ -1,57 +1,54 @@
 # regex.pcre Module Documentation
 
-The `regex.pcre` module provides a **Virtual Machine (VM)** based regular expression engine with
-UTF-8 support.
-Unlike recursive engines, this implementation uses an explicit heap stack,
-making it safe for complex patterns and long strings without risking stack overflows.
+The `regex.pcre` module is a high-performance **Virtual Machine (VM)**
+based regular expression engine for V.
 
-It supports compilation of patterns, searching, full matching, global replacement, named groups,
-and iterative searching.
+### Key Features
+- **Non-recursive VM**: Safe execution that avoids stack overflows on complex patterns.
+- **Zero-Allocation Search**: Uses a pre-allocated `Machine` workspace for search operations.
+- **Fast ASCII Path**: Optimized path for characters < 128 to bypass heavy UTF-8 decoding.
+- **Bitmap Lookups**: ASCII character classes use a 128-bit bitset for $O(1)$ matching.
+- **Instruction Merging**: Consecutive character matches are merged
+into string blocks for faster execution.
 
 ## Supported Syntax
 
 | Feature | Syntax | Description |
 | :--- | :--- | :--- |
-| **Literals** | `abc` | Matches exact characters. |
+| **Literals** | `abc` | Matches exact characters (UTF-8 supported). |
 | **Wildcard** | `.` | Matches any character (excluding `\n` unless `(?s)` flag is used). |
 | **Alternation** | `|` | Matches the left OR right expression (e.g., `cat|dog`). |
-| **Quantifiers** | `*` | Matches 0 or more times. |
-| **Non-greedy quantifiers** | `*?`, `+?`, `??` | Avoid to consume as much as possible. |
-| | `+` | Matches 1 or more times. |
-| | `?` | Matches 0 or 1 time. |
-| | `{m}` | Matches exactly `m` times. |
-| | `{m,n}` | Matches between `m` and `n` times. |
+| **Quantifiers** | `*`, `+`, `?` | Matches 0+, 1+, or 0-1 times. |
+| **Lazy** | `*?`, `+?`, `??` | Non-greedy versions of the above. |
+| **Repetition** | `{m,n}` | Matches between `m` and `n` times. `{m,}` for m or more. |
 | **Groups** | `(...)` | Capturing group. |
 | | `(?:...)` | Non-capturing group. |
 | | `(?P<name>...)` | Named capturing group. |
-| **Anchors** | `^` | Matches start of string (or line start with `(?m)`). |
-| | `$` | Matches end of string (or line end with `(?m)`). |
-| | `\b` | Matches a word boundary (start/end of word). |
-| | `\B` | Matches a non-word boundary. |
-| **Classes** | `[abc]` | Matches any character in the set. |
-| | `[^abc]` | Matches any character NOT in the set. |
-| | `[a-z]` | Matches a range of characters. |
-| | `\w`, `\W` | Word / Non-word character (`[a-zA-Z0-9_]`). |
+| **Anchors** | `^`, `$` | Start/End of string (or line with `(?m)`). |
+| | `\b`, `\B` | Word boundary and Non-word boundary. |
+| **Classes** | `[abc]`, `[^abc]` | Character set and Negated character set. |
+| | `[a-z]` | Range of characters. |
+| | `\w`, `\W` | Word / Non-word (`[a-zA-Z0-9_]`). |
 | | `\d`, `\D` | Digit / Non-digit. |
-| | `\s`, `\S` | Whitespace / Non-whitespace. |
-| | `\a` | Lowercase character (`[a-z]`). |
-| | `\A` | Uppercase character (`[A-Z]`). |
-| **Escapes** | `\xHH` | Matches 1-byte hex value. |
-| | `\XHHHH` | Matches 2-byte hex value. |
+| | `\s`, `\S` | Whitespace / Non-whitespace (` \t\n\r\v\f`). |
+| | `\a`, `\A` | Lowercase / Uppercase ASCII character class. |
 | **Flags** | `(?i)` | Case-insensitive matching. |
 | | `(?m)` | Multiline mode (`^` and `$` match start/end of lines). |
-| | `(?s)` | Dot-all mode (`.` matches `\n`). |
+| | `(?s)` | Dot-all mode (`.` matches newlines). |
+
+---
 
 ## Structs
 
 ### Regex
-The compiled regular expression object containing the VM bytecode.
+The compiled regular expression object.
 ```v ignore
 pub struct Regex {
 pub:
-	pattern string
-	total_groups int
-	// Internal VM bytecode...
+	pattern string // The original pattern
+	prog []Inst // Compiled VM bytecode
+	total_groups int // Number of capture groups
+	group_map map[string]int // Map for named groups
 }
 ```
 
@@ -61,9 +58,9 @@ Represents the result of a successful search.
 pub struct Match {
 pub:
 	text string // The full substring that matched
-	start int // Start index in the source text
-	end int // End index in the source text
-	groups []string // List of captured groups
+	start int // Byte index where match starts
+	end int // Byte index where match ends
+	groups []string // Text captured by each group
 }
 ```
 
@@ -72,229 +69,82 @@ pub:
 ## Core Functions
 
 ### `compile`
-
-Compiles a regular expression pattern string into a `Regex` object. Returns an error if the syntax
-is invalid (e.g., unclosed groups).
-
+Compiles a pattern into a `Regex` object.
 ```v ignore
 fn compile(pattern string) !Regex
 ```
 
-**Example:**
-```v ignore
-import regex.pcre
-
-fn main() {
-	// Compile a pattern to match a word followed by digits
-	// The '?' after pcre.compile handles the result option
-	r := pcre.compile(r'\w+\d+') or { panic(err) }
-}
-```
-
----
-
 ### `find`
-
-Scans the text for the **first** occurrence of the pattern. Returns a `Match` object if found,
-or `none` if not.
-
+Finds the first match in the text. Returns `none` if no match is found.
 ```v ignore
 fn (r Regex) find(text string) ?Match
 ```
 
-**Example:**
-```v ignore
-r := pcre.compile(r'(\d+)')!
-text := "item 123, item 456"
-
-if m := r.find(text) {
-	println('Found: ${m.text}') // Output: 123
-	println('Index: ${m.start}') // Output: 5
-	println('Group 1: ${m.groups[0]}') // Output: 123
-}
-```
-
-> **Note:** This function stops immediately after finding the leftmost match.
-
----
-
 ### `find_all`
-
-Returns a list of **all non-overlapping** matches in the string. This is useful for extracting
-multiple tokens.
-
+Returns all non-overlapping matches in a string.
 ```v ignore
 fn (r Regex) find_all(text string) []Match
 ```
 
-**Example:**
-```v ignore
-r := pcre.compile(r'\d+')!
-text := "10, 20, 30"
-
-matches := r.find_all(text)
-for m in matches {
-	println(m.text)
-}
-// Output:
-// 10
-// 20
-// 30
-```
-
-> **Note:** If a pattern matches an empty string (e.g., `a*` on `"b"`), the engine automatically
-advances the cursor by 1 to prevent infinite loops.
-
----
-
-### `find_from`
-
-Behaves like `find`, but starts scanning from a specific byte index. Useful for building lexers or
-parsing text iteratively.
-
-```v ignore
-fn (r Regex) find_from(text string, start_index int) ?Match
-```
-
-**Example:**
-```v
-import regex.pcre
-
-r := pcre.compile(r'test')!
-text := 'test test test'
-
-// Skip the first 5 characters
-if m := r.find_from(text, 5) {
-	println('Found at: ${m.start}') // Output: Found at: 5
-}
-```
-
-> **Note:** If `start_index` is out of bounds (< 0 or > len), it returns `none`.
-
----
-
-### `fullmatch`
-
-Checks if the **entire** string matches the pattern from start to end.
-
-```v ignore
-fn (r Regex) fullmatch(text string) ?Match
-```
-
-**Example:**
-```v ignore
-r := pcre.compile(r'\d{3}')!
-
-println(r.fullmatch('123')) // Match
-println(r.fullmatch('1234')) // none (too long)
-println(r.fullmatch('a123')) // none (starts with char)
-```
-
----
-
 ### `replace`
-
-Finds the **first** occurrence of the pattern and replaces it with the replacement string.
-
-Supported backreferences:
-* `$1`, `$2`, etc. refer to captured groups.
-* `$0` is currently not supported.
-
+Replaces the first match in `text` with `repl`.
+Supports backreferences like `$1`, `$2`.
 ```v ignore
 fn (r Regex) replace(text string, repl string) string
 ```
 
-**Example:**
-```v
-import regex.pcre
-
-r := pcre.compile(r'(\w+), (\w+)')!
-text := 'Doe, John'
-
-// Swap groups
-result := r.replace(text, '$2 $1')
-println(result) // Output: "John Doe"
-```
-
-> **Note:** This function currently replaces only the *first* match found.
-To replace all occurrences,
-you would need to loop using `replace` or reconstruct the string using `find_all` ranges.
-
----
-
-### `group_by_name`
-
-Retrieves the captured text for a specific named group defined with `(?P<name>...)`.
-
-```v ignore
-fn (r Regex) group_by_name(m Match, name string) string
-```
-
-**Example:**
+### `change_stack_depth`
+Updates the maximum backtracking depth for the VM.
+Default is 1024.
+Use this if your pattern is extremely complex and returns `none` prematurely.
 ```v ignore
-import regex.pcre
-
-r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
-m := r.find('Date: 2025-01') or {pcre.Match{}}
-
-year := r.group_by_name(m, 'year')
-println(year) // Output: 2025
+fn (mut r Regex) change_stack_depth(depth int)
 ```
 
 ---
 
-## Advanced Usage
-
-### Non-greedy Matching
-By default, quantifiers like `*` and `+` are **greedy**, meaning they match
-as much text as possible. Adding a `?` makes them **non-greedy** (or lazy),
-matching the shortest possible string.
+## Named Groups Example
 
-**Example:**
 ```v
 import regex.pcre
 
-text := '<div>content</div>'
-
-// Greedy: Matches everything from the first '<' to the last '>'
-r_greedy := pcre.compile(r'<.*>')!
-println(r_greedy.find(text)?.text) // Output: <div>content</div>
+fn main() {
+	r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
+	m := r.find('Date: 2026-02') or { return }
 
-// Non-greedy: Matches only until the first '>'
-r_lazy := pcre.compile(r'<.*?>')!
-println(r_lazy.find(text)?.text) // Output: <div>
+	year := r.group_by_name(m, 'year')
+	month := r.group_by_name(m, 'month')
+	println('Year: ${year}, Month: ${month}') // Year: 2026, Month: 02
+}
 ```
 
-### VM Stability (No Stack Overflow)
-Because this engine uses a VM with a heap-allocated stack, it can handle patterns that typically
-crash recursive engines due to stack overflow.
-
-**Example:**
-```v
-import regex.pcre
-// A pattern that causes catastrophic backtracking in some recursive engines
-// or deep recursion depth.
+---
 
-r := pcre.compile(r'(a+)+b')!
-text := 'a'.repeat(5000) // Very long string of 'a's
+## PCRE Compatibility Layer
 
-// This will safely return 'none' without crashing the program
-r.find(text)
-```
+To facilitate easier migration from other engines, a compatibility layer is provided:
 
-### Using Flags
-Flags can be embedded to change matching behavior locally.
+| Function | Equivalent To |
+| :--- | :--- |
+| `new_regex(pattern, flags)` | `compile(pattern)` |
+| `r.match_str(text, start, flags)` | `r.find_from(text, start)` |
+| `m.get(idx)` | Retrieves match text (`0`) or capture group (`1+`). |
+| `m.get_all()` | Returns `[full_match, group1, group2, ...]` |
 
 **Example:**
 ```v
 import regex.pcre
-// (?i) Case insensitive
 
-r := pcre.compile(r'(?i)apple')!
-println(r.find('APPLE')) // Matches
+r := pcre.new_regex(r'(\w+) (\w+)', 0)!
+if m := r.match_str('hello world', 0, 0) {
+	println(m.get(0)?) // "hello world"
+	println(m.get(1)?) // "hello"
+	println(m.get(2)?) // "world"
+}
+```
 
-// (?m) Multiline: ^ matches start of line, $ matches end of line
-r_multi := pcre.compile(r'(?m)^Log:')!
-text := 'Error: 1\nLog: Something happened'
-println(r_multi.find(text)) // Matches 'Log:' on the second line
-```
+## Performance Note
+The engine automatically detects literal prefixes (e.g., in `abc.*`) and uses
+a fast-skip optimization to bypass the VM until the prefix is found in the
+input string.
+This makes it extremely fast for searching specific patterns in large files.
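
The new `change_stack_depth` entry in this README is documented with a signature only, so a minimal usage sketch may help readers of the diff. It relies solely on signatures shown in the README (`compile`, `change_stack_depth`, `find`). The pattern, the input text, and the depth value 4096 are illustrative assumptions, not values taken from the commit, and the sketch has not been run against this revision.

```v
import regex.pcre

fn main() {
	// `mut` is required because change_stack_depth takes a mutable receiver.
	mut r := pcre.compile(r'(\w+)@(\w+)') or { panic(err) }

	// Raise the backtracking limit above the documented default of 1024.
	// 4096 is an arbitrary illustrative value, not a recommended setting.
	r.change_stack_depth(4096)

	// find returns an option, so unwrap it with an if-guard.
	if m := r.find('contact: user@example') {
		println('matched: ${m.text}')
	} else {
		println('no match')
	}
}
```

Whether a larger depth is actually needed depends on the pattern; the README only suggests raising it when a complex pattern returns `none` prematurely.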
