Compare similarity of Python code.

    uvx --from "git+https://github.com/sanand0/codesimilarity.git" codesimilarity a.py b.py c/

This computes pairwise Jaccard overlap of k-token phrases between a.py, b.py and all .py files under c/ (treated as a single concatenated document).

- Tokenizes Python via tokenize, ignoring whitespace, comments, and indentation.
- Builds shingles: all k-length sequences of tokens per input.
- Computes Jaccard similarity: |A ∩ B| / |A ∪ B| on shingle sets.
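
The shingle and Jaccard steps can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function names `shingles` and `jaccard` are hypothetical, the token lists are hand-written, and returning 0.0 when both sets are empty is an assumption.

```python
def shingles(tokens, k=5):
    """Return the set of all k-length token sequences, each as a tuple."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets.

    Returning 0.0 for two empty sets is an assumption of this sketch.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hand-written token lists standing in for tokenizer output.
left = "def f ( a , b ) : return a + b".split()
right = "def g ( a , b ) : return a * b".split()

# The two inputs differ in two tokens, so only shingles that avoid
# both of those positions are shared.
print(round(jaccard(shingles(left, k=3), shingles(right, k=3)), 3))
```

Using sets rather than counts means a phrase repeated many times in one file counts once, which matches the set-based formula above.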

    uvx --from "git+https://github.com/sanand0/codesimilarity.git" \
      codesimilarity PATH [PATH ...] \
      --k 5 \
      --lexical \
      --nearest 0 \
      --threshold 0.0 \
      --csv out.csv

PATH handling:

- Python files (.py) are processed.
- Directories are processed as a single document made by concatenating all .py files under them (recursively).
- Non-Python files are ignored with a warning: Warning: ignoring non-Python file: <path>.
- Missing paths warn: Warning: path not found: <path>.
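
The PATH rules above could be implemented roughly as follows. This is a sketch under assumptions: `collect_documents` is a hypothetical helper, and the sorted file order, the newline used to join files, and writing warnings to stderr are guesses rather than documented behavior.

```python
import sys
from pathlib import Path


def collect_documents(paths):
    """Map each usable input path to the source text it contributes."""
    docs = {}
    for raw in paths:
        path = Path(raw)
        if not path.exists():
            print(f"Warning: path not found: {raw}", file=sys.stderr)
        elif path.is_dir():
            # A directory is one document: every .py file under it, concatenated.
            # Sorting keeps the result deterministic (an assumption).
            files = sorted(path.rglob("*.py"))
            docs[raw] = "\n".join(f.read_text(encoding="utf-8") for f in files)
        elif path.suffix == ".py":
            docs[raw] = path.read_text(encoding="utf-8")
        else:
            print(f"Warning: ignoring non-Python file: {raw}", file=sys.stderr)
    return docs
```

Calling `collect_documents(["a.py", "b.py", "c/"])` would then yield three documents, one of them built from every .py file under c/.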
Options:
- --k sets the phrase length. Larger k reduces incidental matches but misses re-orderings. Values of 5 to 8 work well.
- --lexical compares exact literals, so string and number changes count as differences.
- --threshold filters out pairs below the given overlap (applies only in pairwise mode).
- --nearest N outputs only the top-N nearest matches per input (disables pairwise mode).
- --csv out.csv writes CSV output to the given file; otherwise prints TSV to stdout.
Output modes:
- Pairwise (default): rows for all ordered pairs, with columns left, right, overlap. --csv out.csv writes CSV with 6-decimal floats; otherwise prints TSV to stdout with 3 decimals.
- Nearest summary (--nearest N): one row per input showing the nearest matches. Columns are:
  - max_overlap
  - mean_overlap
  - nearest_1, ... nearest_N
  - overlap_1, ... overlap_N
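
As an illustration of how pairwise rows might be folded into the nearest summary, here is a hedged sketch. `nearest_summary` is a hypothetical function, and two details are assumptions rather than documented behavior: mean_overlap is averaged over all other inputs, and ties are broken by name.

```python
def nearest_summary(overlaps, n):
    """Build one summary row per input.

    overlaps maps (left, right) ordered pairs to an overlap score.
    """
    rows = {}
    for name in sorted({left for left, _ in overlaps}):
        # All matches for this input, highest overlap first, ties by name (assumption).
        matches = sorted(
            ((score, right) for (left, right), score in overlaps.items() if left == name),
            key=lambda pair: (-pair[0], pair[1]),
        )
        scores = [score for score, _ in matches]
        row = {
            "max_overlap": max(scores),
            # Averaged over every other input, not just the top N (assumption).
            "mean_overlap": sum(scores) / len(scores),
        }
        for i, (score, right) in enumerate(matches[:n], start=1):
            row[f"nearest_{i}"] = right
            row[f"overlap_{i}"] = score
        rows[name] = row
    return rows


# Made-up scores for three hypothetical inputs.
pairs = {
    ("x.py", "y.py"): 0.8, ("x.py", "z.py"): 0.2,
    ("y.py", "x.py"): 0.8, ("y.py", "z.py"): 0.4,
    ("z.py", "x.py"): 0.2, ("z.py", "y.py"): 0.4,
}
print(nearest_summary(pairs, n=1)["x.py"])
```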
By default, all strings are replaced with STRING and all numbers with NUMBER. Identifiers (variable/function/class names) are not normalized.
So, even if someone changes a string or number value, it won't affect the similarity score.
Example:

    # a.py
    def f(a, b):
        s = "hello"
        return a + b

    # b.py
    def f(a, b):
        s = "world"
        return a + b

- With default normalization, the overlap is 1.0.
- With --lexical, the overlap is 0.375 because the string literals differ.
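
The two numbers in this example can be reproduced with a short sketch built on the standard-library tokenize module. The helper names `tokens` and `overlap` and the exact set of skipped token types are assumptions for illustration, not the tool's code. With k = 5, each file yields 15 tokens and 11 shingles; under --lexical, the 5 shingles that contain the string literal differ, leaving 6 shared out of 16 distinct.

```python
import io
import tokenize

# Token types dropped before shingling: comments and layout tokens (an assumption).
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def tokens(source, lexical=False):
    """Tokenize Python source; replace literals unless lexical is True."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if not lexical and tok.type == tokenize.STRING:
            out.append("STRING")
        elif not lexical and tok.type == tokenize.NUMBER:
            out.append("NUMBER")
        else:
            out.append(tok.string)
    return out


def overlap(src_a, src_b, k=5, lexical=False):
    """Jaccard overlap of the k-token shingle sets of two sources."""
    def shingle_set(src):
        toks = tokens(src, lexical)
        return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

    a, b = shingle_set(src_a), shingle_set(src_b)
    return len(a & b) / len(a | b) if a | b else 0.0


A = 'def f(a, b):\n    s = "hello"\n    return a + b\n'
B = 'def f(a, b):\n    s = "world"\n    return a + b\n'
print(overlap(A, B))                # 1.0
print(overlap(A, B, lexical=True))  # 0.375
```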
Larger k reduces incidental matches. For short snippets, very large k can lead to zero overlap.
Example:

    # mul.py
    def mul(a, b=2):
        return a * b

With --k 2, we find some incidental overlap between add1.py and mul.py because they share short phrases like def <NAME> and ( <NAME>.

    uvx --from "git+https://github.com/sanand0/codesimilarity.git" codesimilarity *.py --k 2

| left | right | jaccard |
|---|---|---|
| add1.py | add2.py | 0.714 |
| add1.py | mul.py | 0.333 |
| add2.py | add1.py | 0.714 |
| add2.py | mul.py | 0.368 |
| mul.py | add2.py | 0.368 |
| mul.py | add1.py | 0.333 |
With --k 5, the incidental overlap disappears, and add1.py and mul.py have no 5-token phrases in common.

    uvx --from "git+https://github.com/sanand0/codesimilarity.git" codesimilarity *.py --k 5

| left | right | jaccard |
|---|---|---|
| add1.py | add2.py | 0.286 |
| add1.py | mul.py | 0.000 |
| add2.py | add1.py | 0.286 |
| add2.py | mul.py | 0.053 |
| mul.py | add2.py | 0.053 |
| mul.py | add1.py | 0.000 |
- Normalize f-strings to "FSTRING". Treat their {expr} parts as "ID".
- Drop module/class/function docstrings via an AST pass.
- For large cohorts, replace full shingle sets with a winnowed fingerprint (hash k-grams, slide a window, keep minima). This keeps recall for longer matches while drastically shrinking memory/time. Then compare fingerprints. For simplicity, hash each k-gram with blake2b(digest_size=8) and winnow with window size w ~ k.
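
The winnowing idea in the last item could look roughly like this. It is a sketch of a proposed feature, not existing code: the function names are hypothetical, joining tokens with a unit-separator byte before hashing is an assumption, and the window minimum is computed naively rather than with a sliding deque.

```python
import hashlib


def kgram_hashes(tokens, k=5):
    """Hash every k-token sequence to a 64-bit integer with blake2b(digest_size=8)."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        # Join tokens with a unit separator so token boundaries stay distinct (assumption).
        gram = "\x1f".join(tokens[i:i + k]).encode("utf-8")
        digest = hashlib.blake2b(gram, digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


def winnow(hashes, w=5):
    """Keep the minimum hash from every window of w consecutive hashes."""
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def fingerprint_overlap(tokens_a, tokens_b, k=5, w=5):
    """Jaccard overlap of the winnowed fingerprints of two token lists."""
    a = winnow(kgram_hashes(tokens_a, k), w)
    b = winnow(kgram_hashes(tokens_b, k), w)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Because adjacent windows usually share the same minimum, the fingerprint is typically much smaller than the full shingle set, while any shared run of at least w + k - 1 tokens still contributes at least one common hash.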
This repository is not published on PyPI. Develop and test locally:
- Clone: git clone https://github.com/sanand0/codesimilarity && cd codesimilarity
- Create env: uv venv && source .venv/bin/activate
- Install deps: uv pip install -e .[dev]
- Run tests: uv run pytest
Notes:
- The CLI is exposed via the codesimilarity entry point (Typer). Use uvx codesimilarity ... as shown above.