HFTrader/regex-performance


Regex Performance

Introduction

Regular expressions are commonly used in pattern search algorithms. This tool is based on the work of John Maddock (see his own regex comparison) and the sljit project (see their regex comparison).

Requirements

Modern Clang 19.1.6 Toolchain (Recommended)

When using the modern toolchain build, all dependencies are automatically handled by the build script. Just ensure you have access to the toolchain environment:

  • Access to /ssd/hblib-installer/ubuntu-20.04/tools/sourceme.sh
  • CMake 3.24.2 (included in toolchain)
  • Clang 19.1.6 (included in toolchain)
  • All regex engine dependencies built automatically

Legacy System Requirements

Dependency    Version
CMake         >=3.0
Ragel         6.9
Python        >=3.0
Boost (*2)    >=1.57
Pcap          >=0.8
Autoconf      2.69 (*)
Automake      1.15 (*)
Autopoint     0.19.7 (*)
Gettext       0.19.7 (*)
Libtool       2.4.6 (*)
Git           2.11.0 (*)

(*) Tested with the named version only. Older versions may work too.
(*2) Requires boost-regex to be installed as a component.

Supported engines

The following regex engines are supported and covered by the tool:

The engines are built from their sources. To use an already installed engine instead, set the corresponding CMake variable INCLUDE_<name> to system (for example, by passing -DINCLUDE_<name>=system to cmake). The configuration script then tries to locate the installed library and links the benchmark against it. Set the same variable to disabled to exclude an engine.

The configuration script distinguishes between nightly and other Rust toolchains in order to enable the SIMD feature, which is currently available in nightly builds only. The SIMD feature improves the throughput of the regex crate for certain expressions.

Building the tool

The different engines have different requirements, which are not described here. Please see the documentation of the respective projects.

On Ubuntu 20.04, the following packages had to be installed to complete the build on a stock AWS box:

$ apt install build-essential cmake rustc cargo automake autoconf autopoint autogen \
   libtool libprotobuf-dev libprotobuf-c-dev protobuf-compiler ninja-build \
   ragel libpcap-dev pcaputils pkg-config libboost-dev flex bison

Modern Clang 19.1.6 Toolchain Build (Recommended)

For optimal performance with the latest toolchain, use the automated build script with the modern Clang 19.1.6 toolchain:

# Source the modern toolchain environment
source /ssd/hblib-installer/ubuntu-20.04/tools/sourceme.sh

# Clean build from scratch
./build_deps_simple.sh

# Build the main project
mkdir -p build && cd build
CC=clang CXX=clang++ cmake ..
make -j$(nproc)

This approach:

  • Uses the Clang 19.1.6 compiler with LLVM tools
  • Builds all dependencies with the modern toolchain
  • Includes the latest RE2 with its Abseil dependencies
  • Supports all 11 regex engines with optimal performance

Legacy Build Method

If all dependencies are satisfied, just configure and build the CMake-based project:

mkdir build && cd build
cmake ..
make

The make command will build all engines and the test tool regex_perf.

To build only the test tool or a single library, call make with the corresponding target, e.g.:

make regex_perf

Usage

The test tool runs each engine with a defined set of regular expressions on a given file. The repository contains a text file of about 16 MB (3200.txt) which can be used for measuring.

# When using modern toolchain, source the environment first
source /ssd/hblib-installer/ubuntu-20.04/tools/sourceme.sh
build/src/regex_perf -f 3200.txt

For legacy builds:

./src/regex_perf -f ./3200.txt

By default, the tool repeats each test 5 times and prints the best time of each test. The overall time to process each regular expression is measured and recorded. The scoring algorithm awards the fastest engine 5 points, the second fastest 4 points, and so on. The score points limit the impact of a single slow engine test compared with using the absolute time values.
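The best-of-N timing and rank-based scoring described above can be sketched in Python as follows. This is only an illustration, not the tool's actual implementation: the function names, the engine names, and the sample timings are hypothetical, and the rule that engines ranked below fifth place receive 0 points is an assumption.

```python
import time
from typing import Callable, Dict


def best_of(run: Callable[[], object], repeats: int = 5) -> float:
    """Call `run` `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def score_expression(times: Dict[str, float], top_points: int = 5) -> Dict[str, int]:
    """Award rank-based points for one expression.

    The fastest engine gets `top_points`, the next one point fewer, and so on.
    Clamping at zero for lower ranks is an assumption of this sketch.
    """
    ranked = sorted(times, key=times.get)  # engine names, fastest first
    return {name: max(top_points - rank, 0) for rank, name in enumerate(ranked)}


# Hypothetical best times in milliseconds for a single expression.
example = {"engine_a": 12.0, "engine_b": 3.5, "engine_c": 480.0}
print(score_expression(example))
# {'engine_b': 5, 'engine_a': 4, 'engine_c': 3}
```

In this made-up example, engine_c is 40 times slower than engine_a but ends up only one point behind it, which is the damping effect that rank-based scoring is meant to provide.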

You can specify a file to write the test results per expression and engine:

./src/regex_perf -f ./3200.txt -o ./results.csv

The test tool writes the results in a csv-compatible format.
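A results file of this kind can be post-processed with Python's standard csv module. The sketch below assumes a layout that this README does not confirm: a header row whose first cell labels the expression column and whose remaining cells are engine names, followed by one row of times per expression. The delimiter, the sample data, and the function name are likewise hypothetical, so check them against an actual results.csv before relying on this.

```python
import csv
import io
from typing import Dict


def load_results(text: str, delimiter: str = ",") -> Dict[str, Dict[str, float]]:
    """Parse CSV text into {expression: {engine: time}}.

    Assumed layout (hypothetical): first row is a header, first column holds
    the expression, remaining columns hold one time per engine.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    engines = header[1:]
    return {
        row[0]: {engine: float(value) for engine, value in zip(engines, row[1:])}
        for row in body
    }


# Made-up sample data in the assumed layout.
sample = "regex,engine_a,engine_b\nTwain,12.0,3.5\n[a-z]shing,480.0,20.25\n"
results = load_results(sample)
for expression, times in results.items():
    fastest = min(times, key=times.get)
    print(f"{expression}: fastest is {fastest} ({times[fastest]} ms)")
```

If the real file uses a different separator, pass it through the `delimiter` argument, for example `load_results(text, delimiter=";")`.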

Spreadsheet generator

We included a spreadsheet generator for easy visualization of the results. Once you have run the benchmark and obtained the results.csv file, you can create a spreadsheet with the following command (assuming you are still in the build directory):

python3 ../genspreadsheet.py results.csv

It will save an Excel spreadsheet with the name regex-results-YYYYMMDD-HHMMSS.xlsx in the current directory.

Compiling with clang + libc++

Unfortunately, the standard C++ regex from GCC/libstdc++ and the one from clang+libc++ cannot both be benchmarked in the same build, because CMake selects a single compiler per build.

To build with clang+libc++, use the following recipe:

mkdir build && cd build
cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_EXE_LINKER_FLAGS="-lc++abi -lc++"  \
    -DCMAKE_CXX_COMPILER=/usr/local/bin/clang++ \
    -DCMAKE_C_COMPILER=/usr/local/bin/clang \
    -DCMAKE_CXX_FLAGS_INIT="-std=c++20 -stdlib=libc++ -march=native -mtune=native" \
    -G Ninja ..

Current Build Status

Latest Update (2025-09-28): Successfully rebuilt from scratch with modern Clang 19.1.6 toolchain

  • ✅ All 11 regex engines working: CTRE, Boost, C++ std, PCRE (3 variants), RE2, Oniguruma, TRE, Rust regex (2 variants)
  • ✅ Latest RE2 with Abseil dependencies properly linked
  • ✅ Performance tests running successfully
  • ⚠️ CTRE has known issues with case-insensitive patterns and word boundaries
  • 🚀 Best performers: Rust regex, PCRE-JIT, RE2

Results

These results were obtained on an AMD Threadripper 3960X (Zen2) at 3.8 GHz running Ubuntu 20.04.5 LTS.

Updated Performance Results

IceLake Xeon Platinum 8375C @ 2.90GHz (AWS C6i instance) - no mitigations

[Figure: IceLake Server]
