Simplify IBAN regex pattern and fix trailing character handling #1818

Merged
SharonHart merged 8 commits into main from copilot/fix-iban-regex-pattern
Dec 16, 2025

Conversation

Copilot AI commented Dec 15, 2025

Contributor

Change Description

The IBAN recognizer used a complex regex with 8 capture groups and variable-length segments (3-5 characters each). This PR replaces it with a simpler pattern that uses consistent 4-character groups and 3 capture groups.

Pattern Changes

Before:

r"\b([A-Z]{2}[ \-]?[0-9]{2})(?=(?:[ \-]?[A-Z0-9]){9,30})((?:[ \-]?[A-Z0-9]{3,5}){2})"
r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?"
r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{1,3})?\b"

After:

r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"

Key Improvements

  • Boundary detection: Word boundaries (\b) are replaced with a negative lookbehind (?<![A-Z0-9]) and a negative lookahead (?![A-Z0-9]) to prevent mid-IBAN matching
  • Consistent grouping: Fixed 4-character groups instead of variable 3-5 character groups
  • Validation fallback: The 3 capture groups let the recognizer try progressively shorter candidates when validation fails (e.g., dropping a trailing " X" that follows a valid IBAN)
  • Documentation: Added inline comments explaining pattern structure and fallback mechanism
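
The validation fallback described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the recognizer's actual code: `mod97_ok` and `find_iban` are hypothetical names, and the validator applies only the standard IBAN mod-97 check, whereas the real `validate_result` may run additional checks.

```python
import re

# Pattern copied from the "After" snippet in this PR description.
IBAN_REGEX = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
    r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"
)


def mod97_ok(candidate: str) -> bool:
    """Hypothetical validator: mod-97 check only, no per-country rules."""
    compact = re.sub(r"[ -]", "", candidate)
    # Move country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_iban(text: str):
    """Return (start, end) of the longest candidate that validates, else None."""
    for m in IBAN_REGEX.finditer(text):
        g1, g2, g3 = m.groups()
        # Try the full match first, then drop trailing groups one at a time.
        for candidate in (g1 + g2 + g3, g1 + g2, g1):
            if mod97_ok(candidate):
                return m.start(), m.start() + len(candidate)
    return None
```

With this sketch, `find_iban("this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence")` returns `(16, 45)`: the full match including " X" fails the checksum, and the shorter candidate without it passes.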

Behavior

# Correctly excludes trailing non-IBAN characters
"VG96 VPVG 0000 0123 4567 8901 X"  # Matches IBAN only, not " X"
"DE89370400440532013000 2"         # Matches IBAN only, not " 2"

# Still matches valid short segments
"BH67 BMAG 0000 1299 1234 56"      # Matches including " 56"

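Note that the regex on its own still captures the trailing " X": it lands in the third capture group and is dropped only when validation of the longer candidate fails. A small sketch that makes the group contents visible, assuming the "After" pattern quoted earlier and a hypothetical helper `groups_for`:

```python
import re

# Pattern copied from the "After" snippet in this PR description.
IBAN_REGEX = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
    r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"
)


def groups_for(text: str):
    """Return the three capture groups of the first match, or None."""
    m = IBAN_REGEX.search(text)
    return m.groups() if m else None


# The trailing " X" is captured, but isolated in group 3.
print(groups_for("VG96 VPVG 0000 0123 4567 8901 X"))
# Here group 3 holds the legitimate short final segment " 56".
print(groups_for("BH67 BMAG 0000 1299 1234 56"))
```

In both cases the second group is empty; it is only filled when a seventh 4-character group is present.
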
Issue reference

Issue tracking handled separately

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required
Original prompt

validate_result can return false on some edge cases with extra checks

The user has attached the following files from their workspace:

  • presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py

TITLE: IBAN Regex Pattern Fix for Trailing Character Matching

USER INTENT: Fix a bug where the IBAN regex pattern incorrectly matches trailing characters (like 'X') after a space, causing checksum validation to fail.

TASK DESCRIPTION: The user identified that in the string 'VG96VPVG0000012345678901 X', the X character was being incorrectly matched as part of the IBAN, which then fails checksum validation. The regex pattern needed to be modified to properly handle IBAN boundaries when followed by spaces and single characters.

EXISTING:

  • /Users/shhart/dev/presidio/presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py - IBAN recognizer implementation with regex pattern
  • /Users/shhart/dev/presidio/presidio-analyzer/tests/test_iban_recognizer.py - Test file with many commented-out test cases (lines 330-372)

PENDING:

  • The validate_result method can return false on some edge cases with extra checks (user's final note)
  • Many test cases in the test file remain commented out and need to be enabled/verified

CODE STATE:
After several iterations, the IBAN regex pattern was updated to:

PATTERNS = [
    Pattern(
        "IBAN Generic",
        r"(?<![A-Z0-9])([A-Z]{2}\d{2}(?:[ -]?[A-Z0-9]{4}){2,7}[A-Z0-9]{0,3})(?![A-Z0-9])",
        0.5,
    ),
]

RELEVANT CODE/DOCUMENTATION SNIPPETS:
Key pattern components:

  • (?<![A-Z0-9]) - negative lookbehind ensures we don't start mid-IBAN
  • [A-Z]{2}\d{2} - country code (2 letters) + check digits (2 numbers)
  • (?:[ -]?[A-Z0-9]{4}){2,7} - 2-7 groups of 4 alphanumerics with optional space/dash prefix
  • [A-Z0-9]{0,3} - trailing 0-3 alphanumerics with NO leading separator (the key fix)
  • (?![A-Z0-9]) - negative lookahead ensures we don't end mid-IBAN
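
To see how these components behave together, the sketch below runs the interim pattern from CODE STATE against a few inputs. `first_match` is a hypothetical helper. One observation from running the regex, not something stated in the prompt: because the trailing `[A-Z0-9]{0,3}` takes no separator, a spaced IBAN whose last segment is shorter than four characters is cut before that segment.

```python
import re

# Interim pattern quoted in the CODE STATE section (not the merged pattern).
INTERIM = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}\d{2}(?:[ -]?[A-Z0-9]{4}){2,7}[A-Z0-9]{0,3})(?![A-Z0-9])"
)


def first_match(text: str):
    """Return the text of the first match, or None if nothing matches."""
    m = INTERIM.search(text)
    return m.group(1) if m else None


# Trailing " X" is excluded: the tail cannot consume the space before it.
print(first_match("VG96 VPVG 0000 0123 4567 8901 X"))
# An unseparated IBAN keeps its short tail ("56").
print(first_match("BH67BMAG00001299123456"))
# A spaced short tail (" 56") is left out by this interim pattern.
print(first_match("BH67 BMAG 0000 1299 1234 56"))
```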

Test case that prompted the fix (lines 337-341):

(
    "this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence",
    1,
    ((16, 45),),
),
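
The tuple appears to read as (text, expected number of results, expected spans); that reading is an assumption based on its shape. Assuming further that spans are end-exclusive character offsets, the expected `(16, 45)` can be checked with plain string arithmetic:

```python
text = "this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence"
iban = "VG96 VPVG 0000 0123 4567 8901"

# Locate the IBAN and derive an end-exclusive span from its length.
start = text.index(iban)
end = start + len(iban)
span = (start, end)

# The slice covers the IBAN only; the " X" sits just outside it.
covered = text[start:end]
after = text[end:end + 2]
```

Here `span` is `(16, 45)`, `covered` equals the IBAN, and `after` is `" X"`.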

OTHER NOTES:

  • The fix prevents matching X because the trailing [A-Z0-9]{0,3} does NOT allow a leading separator
  • The pattern should handle: IBANs without separators, with spaces, with dashes, in sentences, multiple IBANs, and IBANs followed by unrelated single characters
  • There are warnings about unused imports (RecognizerResult, NlpArtifacts) and commented code in the file

Copilot AI and others added 5 commits December 15, 2025 10:52
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix IBAN regex pattern for trailing character matching" to "Simplify IBAN regex pattern and fix trailing character handling" Dec 15, 2025
Copilot AI requested a review from SharonHart December 15, 2025 11:05
@github-actions

Coverage report (presidio-anonymizer)

This PR does not seem to contain any modification to coverable code.

@github-actions

Coverage report (presidio-structured)

This PR does not seem to contain any modification to coverable code.

@github-actions

Coverage report (presidio-cli)

This PR does not seem to contain any modification to coverable code.

@github-actions

Coverage report (presidio-image-redactor)

This PR does not seem to contain any modification to coverable code.

@github-actions

Coverage report (presidio-analyzer)

Coverage table (columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing; cell values were badge images):

  presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py
  Project Total

This report was generated by python-coverage-comment-action

SharonHart linked an issue Dec 16, 2025 that may be closed by this pull request
SharonHart marked this pull request as ready for review December 16, 2025 08:30
SharonHart merged commit 9c8c690 into main Dec 16, 2025
34 checks passed
SharonHart deleted the copilot/fix-iban-regex-pattern branch December 16, 2025 10:09
prokopidis pushed a commit to prokopidis/presidio that referenced this pull request Jun 23, 2026
…osoft#1818)

* Initial plan

* Update IBAN regex pattern to simpler, more maintainable version

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix line length linting issue in IBAN recognizer

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix separator handling inconsistency in IBAN pattern

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Add pattern documentation and fix digit representation consistency

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Improve documentation clarity for validation fallback mechanism

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix IBAN test cases and add new scenarios

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>
Development

Successfully merging this pull request may close these issues.

IBAN_CODE Entity, identification does not meet expected results

3 participants