fix: Improve Korean RRN regex pattern validation #1807
- Use negative lookahead/lookbehind instead of word boundaries
- Add gender digit validation ([1-4] for first digit of last 7 digits)
Thanks @kyoungbinkim. Could you please add a few test cases, especially for the cases that fail on the previous pattern and succeed on the new one?
@omri374 Thank you for the review! I've added test cases to verify the new pattern implementation. Please check the test code below:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import KrRrnRecognizer
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "ko", "model_name": "ko_core_news_sm"}],
}
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["ko"])
analyzer.registry.add_recognizer(KrRrnRecognizer(supported_language='ko'))
"""
Test text containing Korean Resident Registration Numbers (RRN).
Korean RRN format: YYMMDD-GXXXXXX
    YYMMDD : birth date
    G : gender
        1 : male (born 1900-1999)
        2 : female (born 1900-1999)
        3 : male (born 2000-2099)
        4 : female (born 2000-2099)
"""
test_texts = [
    # "His resident registration number is 960214-1157348."
    "그의 주민등록번호는 960214-1157348입니다.(valid)",
    # Invalid RRN: gender digit 5
    "rrn 960325-5123456 (invalid)",
    # "Hong Gil-dong's resident registration number is 850101-7345678." (invalid: gender digit 7)
    "홍길동의 주민등록번호는 850101-7345678 입니다. (invalid)",
]
"""
before result :
Text: 그의 주민등록번호는 960214-1157348입니다.(valid)
Text: rrn 960325-5123456 (invalid)
Entity: KR_RRN, Start: 4, End: 18, Score: 0.5 text:960325-5123456
Text: 홍길동의 주민등록번호는 850101-7345678 입니다. (invalid)
Entity: KR_RRN, Start: 13, End: 27, Score: 0.5 text:850101-7345678
after result :
Text: 그의 주민등록번호는 960214-1157348입니다.(valid)
Entity: KR_RRN, Start: 11, End: 25, Score: 0.5 text:960214-1157348
Text: rrn 960325-5123456 (invalid)
Text: 홍길동의 주민등록번호는 850101-7345678 입니다. (invalid)
"""
for text in test_texts:
    results = analyzer.analyze(text=text, language="ko")
    print(f"\nText: {text}")
    for entity in results:
        print(
            f"Entity: {entity.entity_type}, Start: {entity.start}, End: {entity.end}, "
            f"Score: {entity.score} text:{text[entity.start:entity.end]}"
        )
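To make the before/after results above easier to reproduce without a spaCy model, here is a minimal sketch that uses only Python's standard `re` module rather than Presidio itself. The two patterns are copied from this PR's change description; the helper name `find_all` and the variable names are illustrative assumptions, not part of the recognizer.

```python
import re

# Patterns copied from the PR description (before and after the change).
OLD_PATTERN = re.compile(
    r"\b\d{2}(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1])(-?)\d{7}\b"
)
NEW_PATTERN = re.compile(
    r"(?<!\d)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(-?)[1-4]\d{6}(?!\d)"
)


def find_all(pattern, text):
    """Return every substring of `text` matched by `pattern` (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


# Valid RRN immediately followed by a Korean suffix (no space).
korean_text = "그의 주민등록번호는 960214-1157348입니다."
# Invalid RRN: the gender digit is 5.
invalid_gender = "rrn 960325-5123456"

# Hangul characters are word characters, so the trailing \b cannot match
# between the last digit and the suffix; the old pattern misses the number.
old_korean = find_all(OLD_PATTERN, korean_text)
# The lookarounds only reject adjacent digits, so the new pattern finds it.
new_korean = find_all(NEW_PATTERN, korean_text)

# The old pattern accepts any seven trailing digits; the new one requires
# the first of them to be in the range 1-4.
old_invalid = find_all(OLD_PATTERN, invalid_gender)
new_invalid = find_all(NEW_PATTERN, invalid_gender)

print("old / Korean context:", old_korean)
print("new / Korean context:", new_korean)
print("old / gender digit 5:", old_invalid)
print("new / gender digit 5:", new_invalid)
```

This only exercises the regular expressions; scores and context handling inside the actual recognizer are out of scope for the sketch.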
- Changed gender digits from 7 and 0 to valid values (1~4)
@microsoft-github-policy-service agree
That's great. I would add the invalid cases as well, asserting that each of them produces 0 matches.
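One possible shape for such zero-match assertions is sketched below. It checks the new pattern directly with Python's `re` module instead of going through the Presidio test suite, so the helper `count_rrn_matches` and the two case lists are hypothetical; the sample numbers themselves are taken from this conversation and the PR description.

```python
import re

# New pattern, copied from the PR description.
RRN_PATTERN = re.compile(
    r"(?<!\d)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(-?)[1-4]\d{6}(?!\d)"
)


def count_rrn_matches(text):
    """Count non-overlapping pattern matches in `text` (hypothetical helper)."""
    return len(list(RRN_PATTERN.finditer(text)))


# Inputs that must not be detected at all.
INVALID_CASES = [
    "960325-5123456",   # gender digit 5
    "850101-7345678",   # gender digit 7
    "901201-5234567",   # gender digit 5
    "901201-0234567",   # gender digit 0
    "1901201-1234567",  # extra leading digit (14 digits in total)
]

# Inputs that must be detected exactly once.
VALID_CASES = [
    "960214-1157348",
    "050912-2000019",
    "0509122000019",    # same number without the hyphen
]

invalid_counts = [count_rrn_matches(text) for text in INVALID_CASES]
valid_counts = [count_rrn_matches(text) for text in VALID_CASES]

for text, count in zip(INVALID_CASES, invalid_counts):
    assert count == 0, f"unexpected match in {text!r}"
for text, count in zip(VALID_CASES, valid_counts):
    assert count == 1, f"expected exactly one match in {text!r}"
```

In a real test module the same cases would more naturally be fed through the recognizer under test, with the expected number of results as a parameter.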
- Updated test cases "050912-2000019" and "0509122000019" scores from (1.0, 1.0) to match actual recognizer behavior
@SharonHart There was an error in the test case score calculation, so I fixed it. Thank you.
@kyoungbinkim |
Added more invalid RRN cases to enhance test coverage.
@SharonHart Added an invalid test case to detect incorrect input. Thanks!
* fix: Improve Korean RRN regex pattern validation
  - Use negative lookahead/lookbehind instead of word boundaries
  - Add gender digit validation ([1-4] for first digit of last 7 digits)
* Fix: correct invalid gender digits to valid ones
  - Changed gender digits from 7 and 0 to valid values to 1~4
* fix: Update KR_RRN test scores to match actual recognizer output
  - Updated test cases "050912-2000019" and "0509122000019" scores from (1.0, 1.0) to match actual recognizer behavior
* add: invalid RRN test cases
  Added more invalid RRN cases to enhance test coverage.
---------
Co-authored-by: Omri Mendels <omri374@users.noreply.github.com>
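Change Description
- Replacing word boundaries with negative lookahead/lookbehind
  - `\b` to `(?<!\d)` and `(?!\d)`
  - `1901201-1234567` (14 digits) will no longer be matched
  - `\b` does not behave as intended next to Korean text: Hangul characters count as word characters, so there is no word boundary between the last digit and an attached Korean suffix (as in `960214-1157348입니다`). Negative lookahead/lookbehind only checks for adjacent digits, which ensures proper matching in Korean contexts
- Adding gender digit validation
  - `\d{7}` to `[1-4]\d{6}`
  - Numbers with an invalid gender digit, such as `901201-5234567` or `901201-0234567`, will no longer be matched
- Improving day pattern consistency
  - `[1-2][0-9]` to `[12]\d` for better readability
Before:
`\b\d{2}(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1])(-?)\d{7}\b`
After:
`(?<!\d)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(-?)[1-4]\d{6}(?!\d)`
Checklist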