Page cover
For the complete documentation index, see llms.txt. This page is also available as Markdown.

Reproducible Execution Environment (REE)

Run AI model inference in a machine-agnostic environment where the same model and inputs produce the same outputs across supported hardware.

Overview

REE (Reproducible Execution Environment) is Gensyn's toolchain for executing AI model inference in a machine-agnostic, bitwise-reproducible fashion.

It packages everything needed to run a model: [1] export, [2] compilation, [3] inference, and [4] output decoding, into a containerized pipeline that produces bitwise-identical results regardless of which hardware it runs on.

REE is comprised of three main components:

  1. Gensyn SDK: The engine that orchestrates the end-to-end pipeline: export, compilation, inference, and output decoding.

The Gensyn SDK also exposes higher-level primitives for reusable inference sessions and tool-augmented inference workflows.

  1. Gensyn Compiler: An MLIR-based, multi-stage compiler that converts ONNX models into PyTorch modules, optionally replacing standard kernels with reproducible ones.

  2. RepOp Kernels: Purpose-built CPU kernels and GPU operators that guarantee bitwise-identical outputs across different hardware, parallelism configurations, and run orders.

The Gensyn SDK also exposes higher-level primitives for reusable inference sessions, tool definitions in chat-template-based inference (v0.3.0), and thinking-mode control via chat templates (v0.4.0).

You interact with all of these through the REE TUI, a terminal interface that lets you configure and run generations without touching the underlying CLI directly, unless you're interested in advanced usage.

Why Reproducibility?

Standard GPU execution is inherently non-deterministic, meaning the same model with the same inputs can produce different outputs each time you run it.

This happens because of how GPUs handle mathematical operations: they split work across many parallel processors to run faster, but this parallelization can happen in slightly different orders between runs. Even tiny differences in the order of operations can accumulate through the many layers of a neural network, eventually leading to noticeably different results.

Existing solutions like PyTorch's deterministic mode only solve part of the problem. They can make your results consistent on the same GPU across multiple runs, but they break down when you switch to different hardware. For example, an A100 and an H100 will still produce different outputs. These tools also have limited coverage and can't account for the fact that different GPU architectures implement mathematical functions differently at the hardware level.

REE solves this because reproducibility is essential for verifiable AI inference.

When third parties need to independently verify that a computation was performed correctly, such as in decentralized compute networks or prediction markets, they must be able to run the same model on their own hardware and get exactly the same result.

REE achieves this through RepOps, custom operators that use careful mathematical techniques (fixed reduction ordering, correctly rounded functions, and extended precision) to guarantee identical outputs across any hardware, without sacrificing too much performance.

Operation Modes

REE supports three operation modes, which you can set via the Extra Args field in the TUI:

Mode
Behavior
Cross-run determinism
Cross-hardware determinism

default

Uses standard PyTorch kernels. No determinism guarantees.

deterministic

Uses PyTorch deterministic algorithms. Reproducible on the same hardware across runs.

reproducible

Uses Gensyn RepOp kernels. Bitwise-identical results across any supported hardware.

There are different use cases for the three operation modes:

  • Use reproducible when results must be independently verifiable by a third party on different hardware.

  • Use deterministic when you need repeatable results on your own machine.

  • Use default for development and testing where speed matters more than reproducibility.

Tool-Augmented Inference

The Gensyn SDK supports tool-call workflows (first introduced in v0.3.0).

Tool definitions can be passed to InferenceSession.complete() when using chat-style messages. REE forwards them to the model tokenizer's chat template when supported.

REE does not ship built-in tools or execute tool calls. Instead, applications define tools, parse model output, and run the tool loop themselves.

External tool results are only reproducible if the application records and replays the exact outputs passed back to the model.

Thinking Mode

REE v0.4.0 adds enable_thinking on InferenceSession.complete() for models whose chat templates support a thinking/reasoning toggle (for example, Qwen3).

  • SDK only: This is not exposed on the gensyn-sdk CLI or TUI.

  • Requires messages: use chat-style messages, not a plain prompt string.

  • Default: None. When omitted, REE does not pass the kwarg and behavior matches prior releases.

  • Explicit values: enable_thinking=True or enable_thinking=False are forwarded to apply_chat_template only when the tokenizer accepts the parameter.

For CLI/TUI one-shot runs, use --short-circuit-length / --short-circuit-token to bound thinking tokens, or migrate to the SDK for direct on/off control.

Last updated