Inspiration
Investigative journalists face shrinking resources and rising document loads. County Recorder data arrives as massive collections of unlabeled, heterogeneous image files. Reporters cannot feasibly review thousands of documents to uncover hidden relationships, which creates blind spots in public oversight.
What it does
Form Link ingests raw document image files from local government offices, runs OCR to extract their text, identifies entities in that text, links related people, organizations, and addresses in a Neo4j graph, and supports natural-language queries across multi-document relationships.
How we built it
We ingest unstructured image files and run them through PaddleOCR for flexible text extraction. We apply AI-based entity extraction to identify individuals, organizations, and addresses. These entities and their relationships are loaded into a Neo4j graph database, which is exposed through a natural-language interface for investigative queries.
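As a rough illustration of the graph-loading step, the sketch below turns the entities found in one document into parameterized Cypher `MERGE` statements. The node labels, the `MENTIONED_IN` relationship, and the `Entity` and `build_merge_statements` names are assumptions made for this example; the project's actual schema is not described here.

```python
from dataclasses import dataclass

# Hypothetical labels; the real graph schema may differ.
ALLOWED_LABELS = {"Person", "Organization", "Address"}


@dataclass(frozen=True)
class Entity:
    label: str  # one of ALLOWED_LABELS
    name: str   # entity text as it came out of extraction


def build_merge_statements(doc_id, entities):
    """Return (cypher, params) pairs that upsert a document and its entities."""
    statements = [("MERGE (d:Document {id: $doc_id})", {"doc_id": doc_id})]
    for entity in entities:
        if entity.label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {entity.label}")
        # Cypher cannot take a label as a parameter, so the label is
        # interpolated only after the allow-list check above.
        cypher = (
            "MATCH (d:Document {id: $doc_id}) "
            f"MERGE (e:{entity.label} {{name: $name}}) "
            "MERGE (e)-[:MENTIONED_IN]->(d)"
        )
        statements.append((cypher, {"doc_id": doc_id, "name": entity.name}))
    return statements
```

Each pair could then be passed to a Neo4j driver session, for example `session.run(cypher, params)`. Using `MERGE` rather than `CREATE` means an entity that appears in many documents ends up as a single node.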
Challenges we ran into
Highly variable document structure required robust OCR and entity extraction. Large volumes of unlabeled images created complexity in indexing and linking. Establishing accurate relationships across disparate documents demanded careful graph design.
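One way to approach the linking problem is to reduce each extracted name to a comparison key before matching it across documents. The sketch below is a minimal version of that idea; the normalization rules and the `normalize_name` and `group_mentions` helpers are assumptions for illustration, not the project's actual logic.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(raw):
    """Reduce an extracted name to a comparison key (hypothetical rule set)."""
    text = unicodedata.normalize("NFKC", raw).upper()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation such as "," and "."
    return " ".join(text.split())         # collapse runs of whitespace


def group_mentions(mentions):
    """Map each normalized key to the sorted ids of documents that mention it."""
    docs_by_key = defaultdict(set)
    for doc_id, raw_name in mentions:
        docs_by_key[normalize_name(raw_name)].add(doc_id)
    return {key: sorted(docs) for key, docs in docs_by_key.items()}


mentions = [
    ("deed-1", "Acme Holdings, L.L.C."),
    ("lien-7", "ACME HOLDINGS LLC"),
    ("deed-1", "Jane Doe"),
]
linked = group_mentions(mentions)
```

Here both spellings of the company collapse to the same key, so `deed-1` and `lien-7` become linked through it. Exact-key matching like this will miss OCR misreads, which would need fuzzy matching on top.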
Accomplishments that we’re proud of
Built an end-to-end pipeline that converts heterogeneous government documents into a navigable knowledge graph. Enabled complex relationship discovery that would otherwise require extensive manual review. Delivered a tool that meaningfully accelerates investigative journalism.
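To show what multi-document relationship discovery means in practice, here is a small plain-Python sketch that finds the shortest chain connecting two entities through the documents they share. The sample edges and the `find_connection` helper are made up for illustration; in the real tool this kind of traversal would presumably run inside Neo4j.

```python
from collections import deque


def find_connection(edges, start, goal):
    """Return the shortest chain of nodes linking start to goal, or None."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the back-pointers to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical entity-to-document edges.
edges = [
    ("Jane Doe", "deed-1"),
    ("Acme Holdings LLC", "deed-1"),
    ("Acme Holdings LLC", "lien-7"),
    ("12 Main St", "lien-7"),
]
chain = find_connection(edges, "Jane Doe", "12 Main St")
```

With these edges, `chain` runs from the person through `deed-1`, the company, and `lien-7` to the address: a link that no single document states on its own.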
What we learned
Unstructured public records contain valuable but deeply buried insights. OCR performance and consistent entity extraction are critical to making these documents usable. Graph-based representations significantly enhance exploratory research.
What’s next for Form Link
Add support for parsing margin notes and signatures as image-based graph entities. Introduce a full “Research Agent” to automate exploratory workflows. Scale the pipeline to petabyte-sized document repositories.