Inspiration
Data is likely to become less and less structured over time, so it is important to understand which pieces of data correspond to each other. In particular, when dealing with company names extracted from text, it helps to map them to an internal database of companies by name as a first step in structuring the data, in the form: ({input_company_names}, {internal_company_names}) -> {input_name: internal_name}.
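For illustration, the desired input/output shape looks like this (the company names here are invented for the example):

```python
# Hypothetical example of the desired mapping; names are invented.
input_company_names = ["Aplhabet Inc", "MSFT Corp."]          # noisy names from text
internal_company_names = ["Alphabet Inc.", "Microsoft Corporation"]  # clean database

# Desired output: {input_name: internal_name}
expected = {
    "Aplhabet Inc": "Alphabet Inc.",
    "MSFT Corp.": "Microsoft Corporation",
}
```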
What it does
Map.it maps company names between processed data and the original unstructured text: a quick tool to help with data-pipeline needs. Essentially, it automates the process of choosing among mapping algorithms with respect to various metrics.
How we built it
-> Algorithms we used (a sketch of all three follows this list):
   1) Direct Mapping (hash-map lookup)
   2) Levenshtein Distance (dynamic programming)
   3) Natural Language Processing (machine learning using tf-idf and cosine similarity)
-> Created a table to compare the algorithms across metrics: accuracy, coverage, execution time, space consumption, recall, F1 score, etc.
-> Also made graphs to compare the metrics of the different algorithms.
-> The internal database is the "SEC-EDGAR-Companies" dataset from Kaggle, containing roughly 600K company names.
-> Generated our own labelled input dataset, cleaned thoroughly and then randomly perturbed, to measure accuracy against the internal database.
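As an illustration, here is a minimal Python sketch of the three matching strategies. The library choice (scikit-learn for tf-idf and cosine similarity) and all function names are our assumptions for this write-up, not necessarily the project's exact implementation.

```python
# Minimal sketch of the three matchers; function names and the use of
# scikit-learn are assumptions, not the project's actual code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def direct_map(input_names, internal_names):
    """1) Direct Mapping: exact lookup via a hash table (O(1) per name)."""
    internal = set(internal_names)
    return {name: (name if name in internal else None) for name in input_names}


def levenshtein(a, b):
    """2) Edit distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def nearest_by_edit_distance(name, internal_names):
    """Map a name to the internal name with the smallest edit distance."""
    return min(internal_names, key=lambda cand: levenshtein(name, cand))


def tfidf_map(input_names, internal_names):
    """3) NLP: tf-idf over character n-grams plus cosine similarity."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    internal_matrix = vec.fit_transform(internal_names)
    input_matrix = vec.transform(input_names)
    sims = cosine_similarity(input_matrix, internal_matrix)
    # For each input row, pick the internal name with the highest similarity.
    return {name: internal_names[row.argmax()]
            for name, row in zip(input_names, sims)}
```

The trade-off the metrics table captures: direct mapping is fast but brittle to typos, edit distance is robust but quadratic per pair, and tf-idf vectorizes the whole database once and then scores candidates in bulk.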
Challenges we ran into
1) Calculating the F1 score from true positives, false positives, true negatives, and false negatives (see the sketch after this list).
2) Some algorithms, such as Levenshtein minimum edit distance, took a long time to process the dataset. We eventually set a threshold timer to limit processing and reported metrics only for what completed within that threshold.
3) Some algorithms have high computation time and can be optimized further.
4) Calculating accuracy on unseen examples.
5) Isolating the running environments for the algorithms.
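For reference, this is the computation we had to get right; a minimal sketch (note that true negatives do not appear in the F1 formula, only in the overall accuracy):

```python
def f1_score(tp, fp, fn):
    """F1 from confusion-matrix counts; true negatives are not used."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct predictions / all predictions made
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # correct predictions / all true matches
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```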
Accomplishments that we're proud of
1) It works! The metrics come up and are easily comparable.
2) The data is visualized extensively with the help of graphs.
3) Created a flexible framework, including a RESTful container for the algorithms (see the sketch below).
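As a rough illustration of the RESTful container idea, an endpoint wrapping a matcher might look like the following. This is our own minimal sketch assuming Flask; the route, payload shape, and algorithm registry are hypothetical, not the framework's actual API.

```python
# Hypothetical REST wrapper around the matchers; assumes Flask.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative registry of matcher callables; the real framework's
# registration mechanism may differ.
ALGORITHMS = {
    "direct": lambda inputs, internal:
        {n: (n if n in set(internal) else None) for n in inputs},
}

@app.route("/map", methods=["POST"])
def map_names():
    # Assumed payload shape:
    # {"algorithm": "direct", "input_names": [...], "internal_names": [...]}
    body = request.get_json()
    algo = ALGORITHMS[body["algorithm"]]
    return jsonify(algo(body["input_names"], body["internal_names"]))

if __name__ == "__main__":
    app.run()
```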
What we learned
A LOT! We learned about F1 scores and many algorithms for name mapping, and gathered a lot of valuable insights from our mentor.
What's next for Map.it
1) We are working on using a GPU to speed up the computation.
2) Add more and more algorithms.
3) Generate a diverse, clean dataset to test accuracy.
4) Build a recommendation engine that suggests the best algorithm for a given dataset.