11# regex.pcre Module Documentation
22
3- The `regex.pcre` module provides a **Virtual Machine (VM)** based regular expression engine with
4- UTF-8 support.
5- Unlike recursive engines, this implementation uses an explicit heap stack,
6- making it safe for complex patterns and long strings without risking stack overflows.
3+ The `regex.pcre` module is a high-performance **Virtual Machine (VM)**
4+ based regular expression engine for V.
75
8- It supports compilation of patterns, searching, full matching, global replacement, named groups,
9- and iterative searching.
6+ ### Key Features
7+ - **Non-recursive VM**: Safe execution that avoids stack overflows on complex patterns.
8+ - **Zero-Allocation Search**: Uses a pre-allocated `Machine` workspace for search operations.
9+ - **Fast ASCII Path**: Optimized path for characters < 128 to bypass heavy UTF-8 decoding.
10+ - **Bitmap Lookups**: ASCII character classes use a 128-bit bitset for $O(1)$ matching.
11+ - **Instruction Merging**: Consecutive character matches are merged
12+ into string blocks for faster execution.
1013
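The bitmap lookup listed above can be pictured with a small, self-contained sketch. This is not the module's internal code: `AsciiSet`, `add`, and `has` are hypothetical names chosen for illustration, and the layout (two `u64` words) is only one plausible way to hold 128 bits.

```v
// Hypothetical illustration only; `AsciiSet` is not a type from regex.pcre.
struct AsciiSet {
mut:
	bits [2]u64 // 2 x 64 bits cover code points 0..127
}

// add marks an ASCII byte as a member; non-ASCII bytes are ignored.
fn (mut s AsciiSet) add(c u8) {
	if c < 128 {
		s.bits[int(c) >> 6] |= u64(1) << (c & 63)
	}
}

// has tests membership with one shift and one mask, independent of set size.
fn (s AsciiSet) has(c u8) bool {
	if c >= 128 {
		return false
	}
	return ((s.bits[int(c) >> 6] >> (c & 63)) & 1) == 1
}

// Build the set for the class `[a-z]`.
mut lower := AsciiSet{}
for c in 'abcdefghijklmnopqrstuvwxyz' {
	lower.add(c)
}

println(lower.has(u8(`m`))) // true
println(lower.has(u8(`M`))) // false
```

Because membership is a single word load plus a shift, the cost does not grow with the number of characters in the class, which is the $O(1)$ property the feature list refers to.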
1114## Supported Syntax
1215
1316| Feature | Syntax | Description |
1417| :--- | :--- | :--- |
15- | **Literals** | `abc` | Matches exact characters. |
18+ | **Literals** | `abc` | Matches exact characters (UTF-8 supported). |
1619| **Wildcard** | `.` | Matches any character (excluding `\n` unless `(?s)` flag is used). |
1720| **Alternation** | `|` | Matches the left OR right expression (e.g., `cat|dog`). |
18- | **Quantifiers** | `*` | Matches 0 or more times. |
19- | **Non-greedy quantifiers** | `*?`, `+?`, `??` | Avoid to consume as much as possible. |
20- | | `+` | Matches 1 or more times. |
21- | | `?` | Matches 0 or 1 time. |
22- | | `{m}` | Matches exactly `m` times. |
23- | | `{m,n}` | Matches between `m` and `n` times. |
21+ | **Quantifiers** | `*`, `+`, `?` | Matches 0+, 1+, or 0-1 times. |
22+ | **Lazy** | `*?`, `+?`, `??` | Non-greedy versions of the above. |
23+ | **Repetition** | `{m,n}` | Matches between `m` and `n` times. `{m,}` for m or more. |
2424| **Groups** | `(...)` | Capturing group. |
2525| | `(?:...)` | Non-capturing group. |
2626| | `(?P<name>...)` | Named capturing group. |
27- | **Anchors** | `^` | Matches start of string (or line start with `(?m)`). |
28- | | `$` | Matches end of string (or line end with `(?m)`). |
29- | | `\b` | Matches a word boundary (start/end of word). |
30- | | `\B` | Matches a non-word boundary. |
31- | **Classes** | `[abc]` | Matches any character in the set. |
32- | | `[^abc]` | Matches any character NOT in the set. |
33- | | `[a-z]` | Matches a range of characters. |
34- | | `\w`, `\W` | Word / Non-word character (`[a-zA-Z0-9_]`). |
27+ | **Anchors** | `^`, `$` | Start/End of string (or line with `(?m)`). |
28+ | | `\b`, `\B` | Word boundary and Non-word boundary. |
29+ | **Classes** | `[abc]`, `[^abc]` | Character set and Negated character set. |
30+ | | `[a-z]` | Range of characters. |
31+ | | `\w`, `\W` | Word / Non-word (`[a-zA-Z0-9_]`). |
3532| | `\d`, `\D` | Digit / Non-digit. |
36- | | `\s`, `\S` | Whitespace / Non-whitespace. |
37- | | `\a` | Lowercase character (`[a-z]`). |
38- | | `\A` | Uppercase character (`[A-Z]`). |
39- | **Escapes** | `\xHH` | Matches 1-byte hex value. |
40- | | `\XHHHH` | Matches 2-byte hex value. |
33+ | | `\s`, `\S` | Whitespace / Non-whitespace (` \t\n\r\v\f`). |
34+ | | `\a`, `\A` | Lowercase / Uppercase ASCII character class. |
4135| **Flags** | `(?i)` | Case-insensitive matching. |
4236| | `(?m)` | Multiline mode (`^` and `$` match start/end of lines). |
43- | | `(?s)` | Dot-all mode (`.` matches `\n`). |
37+ | | `(?s)` | Dot-all mode (`.` matches newlines). |
38+
39+ ---
4440
4541## Structs
4642
4743### Regex
48- The compiled regular expression object containing the VM bytecode.
44+ The compiled regular expression object.
4945``` v ignore
5046pub struct Regex {
5147pub:
52- pattern string
53- total_groups int
54- // Internal VM bytecode...
48+ pattern string // The original pattern
49+ prog []Inst // Compiled VM bytecode
50+ total_groups int // Number of capture groups
51+ group_map map[string]int // Map for named groups
5552}
5653```
5754
@@ -61,9 +58,9 @@ Represents the result of a successful search.
6158pub struct Match {
6259pub:
6360 text string // The full substring that matched
64- start int // Start index in the source text
65- end int // End index in the source text
66- groups []string // List of captured groups
61+ start int // Byte index where match starts
62+ end int // Byte index where match ends
63+ groups []string // Text captured by each group
6764}
6865```
6966
@@ -72,229 +69,82 @@ pub:
7269## Core Functions
7370
7471### `compile`
75-
76- Compiles a regular expression pattern string into a `Regex` object. Returns an error if the syntax
77- is invalid (e.g., unclosed groups).
78-
72+ Compiles a pattern into a `Regex` object.
7973``` v ignore
8074fn compile(pattern string) !Regex
8175```
8276
83- ** Example:**
84- ``` v ignore
85- import regex.pcre
86-
87- fn main() {
88- // Compile a pattern to match a word followed by digits
89- // The '?' after pcre.compile handles the result option
90- r := pcre.compile(r'\w+\d+') or { panic(err) }
91- }
92- ```
93-
94- ---
95-
9677### `find`
97-
98- Scans the text for the **first** occurrence of the pattern. Returns a `Match` object if found,
99- or `none` if not.
100-
78+ Finds the first match in the text. Returns `none` if no match is found.
10179``` v ignore
10280fn (r Regex) find(text string) ?Match
10381```
10482
105- ** Example:**
106- ``` v ignore
107- r := pcre.compile(r'(\d+)')!
108- text := "item 123, item 456"
109-
110- if m := r.find(text) {
111- println('Found: ${m.text}') // Output: 123
112- println('Index: ${m.start}') // Output: 5
113- println('Group 1: ${m.groups[0]}') // Output: 123
114- }
115- ```
116-
117- > **Note:** This function stops immediately after finding the leftmost match.
118-
119- ---
120-
12183### `find_all`
122-
123- Returns a list of **all non-overlapping** matches in the string. This is useful for extracting
124- multiple tokens.
125-
84+ Returns all non-overlapping matches in a string.
12685``` v ignore
12786fn (r Regex) find_all(text string) []Match
12887```
12988
130- ** Example:**
131- ``` v ignore
132- r := pcre.compile(r'\d+')!
133- text := "10, 20, 30"
134-
135- matches := r.find_all(text)
136- for m in matches {
137- println(m.text)
138- }
139- // Output:
140- // 10
141- // 20
142- // 30
143- ```
144-
145- > **Note:** If a pattern matches an empty string (e.g., `a*` on `"b"`), the engine automatically
146- advances the cursor by 1 to prevent infinite loops.
147-
148- ---
149-
150- ### `find_from`
151-
152- Behaves like `find`, but starts scanning from a specific byte index. Useful for building lexers or
153- parsing text iteratively.
154-
155- ``` v ignore
156- fn (r Regex) find_from(text string, start_index int) ?Match
157- ```
158-
159- ** Example:**
160- ``` v
161- import regex.pcre
162-
163- r := pcre.compile(r'test')!
164- text := 'test test test'
165-
166- // Skip the first 5 characters
167- if m := r.find_from(text, 5) {
168- println('Found at: ${m.start}') // Output: Found at: 5
169- }
170- ```
171-
172- > **Note:** If `start_index` is out of bounds (< 0 or > len), it returns `none`.
173-
174- ---
175-
176- ### `fullmatch`
177-
178- Checks if the **entire** string matches the pattern from start to end.
179-
180- ``` v ignore
181- fn (r Regex) fullmatch(text string) ?Match
182- ```
183-
184- ** Example:**
185- ``` v ignore
186- r := pcre.compile(r'\d{3}')!
187-
188- println(r.fullmatch('123')) // Match
189- println(r.fullmatch('1234')) // none (too long)
190- println(r.fullmatch('a123')) // none (starts with char)
191- ```
192-
193- ---
194-
19589### `replace`
196-
197- Finds the **first** occurrence of the pattern and replaces it with the replacement string.
198-
199- Supported backreferences:
200- * `$1`, `$2`, etc. refer to captured groups.
201- * `$0` is currently not supported.
202-
90+ Replaces the first match in `text` with `repl`.
91+ Supports backreferences like `$1`, `$2`.
20392``` v ignore
20493fn (r Regex) replace(text string, repl string) string
20594```
20695
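Since `replace` touches only the first match, replacing every occurrence can be sketched by stitching the text back together from `find_all` results. The helper below is hypothetical (it is not part of `regex.pcre`), it inserts the replacement verbatim without expanding `$1`-style backreferences, and it assumes `Match.start` and `Match.end` are byte offsets with an exclusive `end`.

```v
import regex.pcre
import strings

// replace_all_literal is a hypothetical helper, not part of regex.pcre.
// Assumption: Match.start/Match.end are byte offsets and end is exclusive.
fn replace_all_literal(r pcre.Regex, text string, repl string) string {
	mut out := strings.new_builder(text.len)
	mut last := 0
	for m in r.find_all(text) {
		out.write_string(text[last..m.start]) // copy the gap before this match
		out.write_string(repl) // inserted verbatim, no `$1` expansion
		last = m.end
	}
	out.write_string(text[last..]) // copy the tail after the last match
	return out.str()
}

r := pcre.compile(r'\d+')!
println(replace_all_literal(r, '10, 20, 30', '#'))
```

Building the result in a single pass avoids rescanning the already-replaced prefix, which a loop over `replace` would do.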
207- ** Example:**
208- ``` v
209- import regex.pcre
210-
211- r := pcre.compile(r'(\w+), (\w+)')!
212- text := 'Doe, John'
213-
214- // Swap groups
215- result := r.replace(text, '$2 $1')
216- println(result) // Output: "John Doe"
217- ```
218-
219- > **Note:** This function currently replaces only the *first* match found.
220- To replace all occurrences,
221- you would need to loop using `replace` or reconstruct the string using `find_all` ranges.
222-
223- ---
224-
225- ### `group_by_name`
226-
227- Retrieves the captured text for a specific named group defined with `(?P<name>...)`.
228-
229- ``` v ignore
230- fn (r Regex) group_by_name(m Match, name string) string
231- ```
232-
233- ** Example:**
96+ ### `change_stack_depth`
97+ Updates the maximum backtracking depth for the VM.
98+ The default is 1024.
99+ Increase it if a very complex pattern returns `none` prematurely.
234100``` v ignore
235- import regex.pcre
236-
237- r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
238- m := r.find('Date: 2025-01') or {pcre.Match{}}
239-
240- year := r.group_by_name(m, 'year')
241- println(year) // Output: 2025
101+ fn (mut r Regex) change_stack_depth(depth int)
242102```
243103
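A minimal usage sketch for `change_stack_depth`, based only on the signature shown above. The depth `8192` and the input string are arbitrary values picked for illustration; a suitable limit depends on the pattern and the input.

```v
import regex.pcre

// `mut` is required because change_stack_depth has a mutable receiver.
mut r := pcre.compile(r'(a+)+b')!
r.change_stack_depth(8192) // 8192 is an arbitrary illustrative value

if m := r.find('a'.repeat(64) + 'b') {
	println('matched ${m.text.len} bytes')
} else {
	println('no match')
}
```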
244104---
245105
246- ## Advanced Usage
247-
248- ### Non-greedy Matching
249- By default, quantifiers like `*` and `+` are **greedy**, meaning they match
250- as much text as possible. Adding a `?` makes them **non-greedy** (or lazy),
251- matching the shortest possible string.
106+ ## Named Groups Example
252107
253- ** Example:**
254108``` v
255109import regex.pcre
256110
257- text := '<div>content</div>'
258-
259- // Greedy: Matches everything from the first '<' to the last '>'
260- r_greedy := pcre.compile(r'<.*>')!
261- println(r_greedy.find(text)?.text) // Output: <div>content</div>
111+ fn main() {
112+ r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
113+ m := r.find('Date: 2026-02') or { return }
262114
263- // Non-greedy: Matches only until the first '>'
264- r_lazy := pcre.compile(r'<.*?>')!
265- println(r_lazy.find(text)?.text) // Output: <div>
115+ year := r.group_by_name(m, 'year')
116+ month := r.group_by_name(m, 'month')
117+ println('Year: ${year}, Month: ${month}') // Year: 2026, Month: 02
118+ }
266119```
267120
268- ### VM Stability (No Stack Overflow)
269- Because this engine uses a VM with a heap-allocated stack, it can handle patterns that typically
270- crash recursive engines due to stack overflow.
271-
272- ** Example:**
273- ``` v
274- import regex.pcre
275- // A pattern that causes catastrophic backtracking in some recursive engines
276- // or deep recursion depth.
121+ ---
277122
278- r := pcre.compile(r'(a+)+b')!
279- text := 'a'.repeat(5000) // Very long string of 'a's
123+ ## PCRE Compatibility Layer
280124
281- // This will safely return 'none' without crashing the program
282- r.find(text)
283- ```
125+ To facilitate easier migration from other engines, a compatibility layer is provided:
284126
285- ### Using Flags
286- Flags can be embedded to change matching behavior locally.
127+ | Function | Equivalent To |
128+ | :--- | :--- |
129+ | `new_regex(pattern, flags)` | `compile(pattern)` |
130+ | `r.match_str(text, start, flags)` | `r.find_from(text, start)` |
131+ | `m.get(idx)` | Retrieves match text (`0`) or capture group (`1+`). |
132+ | `m.get_all()` | Returns `[full_match, group1, group2, ...]` |
287133
288134** Example:**
289135``` v
290136import regex.pcre
291- // (?i) Case insensitive
292137
293- r := pcre.compile(r'(?i)apple')!
294- println(r.find('APPLE')) // Matches
138+ r := pcre.new_regex(r'(\w+) (\w+)', 0)!
139+ if m := r.match_str('hello world', 0, 0) {
140+ println(m.get(0)?) // "hello world"
141+ println(m.get(1)?) // "hello"
142+ println(m.get(2)?) // "world"
143+ }
144+ ```
295145
296- // (?m) Multiline: ^ matches start of line, $ matches end of line
297- r_multi := pcre.compile(r'(?m)^Log:')!
298- text := 'Error: 1\nLog: Something happened'
299- println(r_multi.find(text)) // Matches 'Log:' on the second line
300- ```
146+ ## Performance Note
147+ The engine automatically detects literal prefixes (e.g., in `abc.*`) and uses
148+ a fast-skip optimization to bypass the VM until the prefix is found in the
149+ input string.
150+ This makes it extremely fast for searching specific patterns in large files.
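
The idea behind the prefix skip can be sketched without the engine itself. The helper below is hypothetical and is not the module's internal code: it uses a plain substring search to jump to the next position where the literal prefix occurs, which is where a matcher would then start the VM.

```v
// next_candidate is a hypothetical helper, not the engine's internal code.
// It returns the next byte offset at or after `from` where `prefix` occurs.
fn next_candidate(text string, prefix string, from int) ?int {
	if from < 0 || from > text.len {
		return none
	}
	offset := text[from..].index(prefix)?
	return from + offset
}

text := 'xxabdxxabcxx'
mut pos := 0
for {
	start := next_candidate(text, 'abc', pos) or { break }
	// A real engine would start the VM at `start` here.
	println('candidate at byte ${start}') // candidate at byte 7
	pos = start + 1
}
```

Positions that cannot begin a match (such as the `abd` at byte 2) are never handed to the VM, which is where the saving comes from.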