CIACR

Cross-referencing the web with 13 million pages of declassified CIA reports

Background

On January 17th, 2017, the CIA declassified and released 13 million pages of documents to the public. However, the release garnered relatively little public interest, as no individual can process such a large amount of information. News sources, likewise, were unable to fully summarize the documents and instead focused on a few interesting tidbits. Our project aims to help the general public understand the relevance of the documents by categorizing them by keywords and linking them to relevant Wikipedia articles.

Product

We created a Chrome extension that links the PDFs of declassified CIA documents to the relevant articles a user is reading.

Development

1) Web Crawler

Parsing such a large amount of data requires automated tools. The first step in automating was obtaining the URLs of the PDF documents to download and parse, so a web crawler was written in Java. The crawler ran for 4 hours and collected links to 16,000 documents.
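The original crawler was written in Java, but the core step of harvesting PDF links from a listing page can be sketched in a few lines of Python with the standard-library HTML parser. The sample HTML and the `/readingroom/...` paths are illustrative, not the actual CIA reading-room markup.

```python
from html.parser import HTMLParser

class PdfLinkCrawler(HTMLParser):
    """Collects every <a href="..."> that points at a PDF file."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

# Hypothetical listing page; a real crawler would fetch and paginate.
page = """
<html><body>
  <a href="/readingroom/docs/doc1.pdf">Document 1</a>
  <a href="/readingroom/about">About</a>
  <a href="/readingroom/docs/doc2.PDF">Document 2</a>
</body></html>
"""

crawler = PdfLinkCrawler()
crawler.feed(page)
print(crawler.pdf_links)  # → ['/readingroom/docs/doc1.pdf', '/readingroom/docs/doc2.PDF']
```

A full crawler would repeat this over every results page and write the collected URLs to disk for the download step.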

2) Word Extraction

The documents released are scans of paper originals, so the PDFs consist of images rather than text. A script was therefore created to convert the PDFs into images suitable for OCR (Optical Character Recognition). Tesseract was chosen as the OCR engine because it is well documented and open source. A batch file was written to automate the conversion from images to a .txt transcript.

OCR was a particularly difficult challenge, as a wide variety of factors affected the accuracy and usability of the transcripts. First, the documents span a wide range of formats, from chicken-scrawl handwriting to modern printed pages; for some documents, OCR simply could not be expected to produce even a vaguely accurate transcript. For the majority of documents, transcription time was a large factor, varying from 15 seconds to 5 minutes per document. Time spent per transcription had to be balanced against quality, and we had a difficult time finding settings that worked across the whole range of documents.
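The PDF-to-image-to-text pipeline was automated with a batch file; a Python sketch of the same automation is shown below, building the two shell commands per document. It assumes `pdftoppm` (poppler-utils) for rendering and the `tesseract` CLI; the flags and the single-page output name (`<stem>-1.png`) are assumptions about those tools, not taken from the original batch file.

```python
from pathlib import Path

def ocr_commands(pdf_path, dpi=300):
    """Build the two shell commands for one PDF: render pages to PNG,
    then OCR the image into a .txt transcript (sketch; assumes one page)."""
    stem = Path(pdf_path).stem
    return [
        f"pdftoppm -r {dpi} -png {pdf_path} {stem}",   # PDF -> <stem>-1.png, ...
        f"tesseract {stem}-1.png {stem} --psm 3",      # image -> <stem>.txt
    ]

cmds = ocr_commands("doc_0001.pdf")
```

A real batch script would loop this over every downloaded PDF and every rendered page, which is where the 15-second-to-5-minute per-document cost accumulates.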

3) Data Cleaning

The OCR was often unable to produce a perfect transcript, so the data needed further cleaning before keywords could be determined. An autocorrect program was written in Java that corrected simple misspellings and filtered out stray punctuation marks.
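The autocorrect was a Java program; a minimal Python stand-in for the same idea is sketched below, stripping OCR noise characters and snapping near-miss words onto a dictionary with `difflib`. The tiny dictionary is illustrative only; the real cleaner used its own word list and rules.

```python
import re
from difflib import get_close_matches

# Hypothetical sample vocabulary; the real cleaner used a full word list.
DICTIONARY = {"secret", "report", "soviet", "operation", "document"}

def clean_transcript(text):
    """Drop non-letter OCR noise from each token, then replace
    near-misses with their closest dictionary entry."""
    cleaned = []
    for token in text.split():
        word = re.sub(r"[^A-Za-z]", "", token).lower()  # strip digits/punctuation
        if not word:
            continue
        if word in DICTIONARY:
            cleaned.append(word)
        else:
            match = get_close_matches(word, DICTIONARY, n=1, cutoff=0.8)
            cleaned.append(match[0] if match else word)
    return " ".join(cleaned)

print(clean_transcript("S3cret?? rep0rt; on sov1et opera+ion"))
# → secret report on soviet operation
```

Words with no close dictionary match pass through unchanged, so rare proper nouns are not destroyed by the correction step.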

4) Feature Generation

We used a Python script implementing the RAKE (Rapid Automatic Keyword Extraction) algorithm to identify key phrases for each article. The algorithm assumes that key phrases typically consist of meaningful words with very few stopwords (the, and, of, etc.). It then assigns a "relevance" score based on the frequency of the words and the length of the phrases in which they appear. The script ranks the phrases and outputs a list for each transcript. Another script then identified common key phrases across multiple documents, and the most popular of these were selected as identifying features for the document set.
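The scoring described above can be sketched in a minimal pure-Python RAKE: split the text into candidate phrases at stopwords and punctuation, then score each phrase as the sum of its words' degree/frequency ratios. The stopword list here is a small sample, not the one the project used.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "of", "a", "in", "to", "was", "on", "for", "is"}

def rake_scores(text):
    """Minimal RAKE sketch: candidate phrases are maximal runs of
    non-stopwords; a word's score is degree/frequency, a phrase's
    score is the sum of its word scores."""
    words = re.split(r"[^A-Za-z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # co-occurrence degree (includes self)

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: -kv[1])

top = rake_scores("the soviet union tested a nuclear device in the pacific")
print(top[0])  # → ('soviet union tested', 9.0)
```

Longer phrases of co-occurring content words score higher, which is exactly why RAKE favors multi-word key phrases over isolated frequent words.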

5) Data Analysis

Using the identified features, a Python script generated a .dat file representing a matrix of 1's and 0's, where each row represents a CIA document and each column a specific feature (key phrase). An element of this matrix is 1 only if the corresponding document contains the corresponding key phrase. We then used MATLAB to cluster this data and identify trends across the documents.
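The matrix construction is straightforward to sketch in Python; the feature names and documents below are hypothetical. The actual clustering was done separately in MATLAB.

```python
def feature_matrix(doc_phrases, features):
    """Build the 0/1 matrix: one row per document, one column per
    selected key phrase; entry is 1 if the document contains it."""
    return [[1 if f in phrases else 0 for f in features]
            for phrases in doc_phrases]

features = ["nuclear test", "soviet union", "air force"]  # hypothetical features
docs = [
    {"soviet union", "nuclear test"},  # phrases found in document 1
    {"air force"},                     # phrases found in document 2
]
matrix = feature_matrix(docs, features)
print(matrix)  # → [[1, 1, 0], [0, 0, 1]]
```

Rows of this matrix are the feature vectors handed to the clustering step, so documents sharing many key phrases end up close together.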

6) REST API (Backend)

A REST API with a POST endpoint was built so the Chrome extension can post the URL of a Wikipedia article and receive a list of relevant CIA document PDF links in return. It is written using a combination of Node.js and Express.js. The endpoint calls a Python script that runs the same RAKE key-phrase extraction on the input page and returns its keywords. These keywords are then cross-referenced with the CIA document data, mapping key phrases to their associated documents.
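The backend itself is Node.js/Express, but the cross-referencing step it performs can be sketched in Python as an inverted index from key phrases to document links. The document names and phrases are illustrative.

```python
from collections import defaultdict

def build_index(doc_features):
    """Invert the document -> key-phrase mapping so a phrase looks up
    every document that mentions it."""
    index = defaultdict(set)
    for doc_url, phrases in doc_features.items():
        for p in phrases:
            index[p].add(doc_url)
    return index

def lookup(index, page_phrases):
    """Given key phrases extracted from a Wikipedia page, return the
    matching CIA document links."""
    hits = set()
    for p in page_phrases:
        hits |= index.get(p, set())
    return sorted(hits)

# Hypothetical document data.
index = build_index({
    "doc1.pdf": {"berlin wall", "cold war"},
    "doc2.pdf": {"cold war", "u2 incident"},
})
print(lookup(index, ["cold war"]))  # → ['doc1.pdf', 'doc2.pdf']
```

In the real service, `page_phrases` comes from running RAKE on the posted Wikipedia URL, and the returned links are what the extension shows the user.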

7) Chrome Extension

Our final product, a Chrome extension, offers readers important context they may be missing in their everyday news articles and lets them read the relevant original documents for themselves.

Ties to Social and Civic Hacking

In recent years, the public has grown increasingly distrustful of institutional sources of information, preferring original "leaked" or "declassified" documents. However, such documents are often published as terabytes of data, beyond the comprehension of any individual, or even of groups of dedicated civic activists. Our product seeks to help these groups and ordinary citizens better understand such large volumes of data by performing objective analysis on the original sources. The key phrases we extract and use to reference documents allow anyone to access and read the documents themselves. A Chrome extension is an easily accessible way to reach a large number of users: by surfacing highly relevant original documents in a single click, it keeps users informed and better aware of government actions. We have reduced 13 million documents to one.
