Inspiration

Privacy policies can be a hassle to read through, but they often contain questionable clauses that hover in the gray area between ethical and unethical data distribution. For instance, LinkedIn's privacy policy declares that by signing up for the service, you grant them "a nonexclusive, irrevocable, worldwide, perpetual, unlimited, assignable, sublicenseable, fully paid up and royalty-free right to us to copy, prepare derivative works of, improve, distribute, publish, remove, retain, add, process, analyze, use and commercialize, in any way now known or in the future discovered, any information you provide, directly or indirectly to LinkedIn..."

If you take the time to read through these long, convoluted legal agreements, you may realize that some of them are not as innocuous as they seem. That is why it matters to be able to comprehend a privacy policy at the click of a button, and that's why we built DocVerifier.

What it does

DocVerifier is a Chrome extension that automatically scans a website's privacy policy and flags clauses that are ethically questionable. With the click of a button, you can see the nature of the service you're signing up for, and whether it respects your privacy or sells your personal data without your knowledge. It also has a web-app component that lets you upload a privacy policy document; DocVerifier scans the text and surfaces the ethically questionable sentences.
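At its core, the flow is simple: grab the policy text, split it into sentences, and run each sentence through a classifier. Here is a minimal Python sketch of that server-side logic; `looks_questionable` is a hypothetical keyword-based stand-in for the real ML model, and the red-flag phrases are illustrative only:

```python
import re

# Hypothetical stand-in for the trained classifier: flag sentences that
# contain phrases commonly found in overreaching clauses.
RED_FLAGS = ("irrevocable", "perpetual", "sublicense", "third parties", "sell")

def looks_questionable(sentence: str) -> bool:
    lowered = sentence.lower()
    return any(flag in lowered for flag in RED_FLAGS)

def scan_policy(text: str) -> list[str]:
    # Naive sentence splitter for the sketch; the real pipeline
    # pre-processed the text with NLTK.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if looks_questionable(s)]

policy = ("We value your privacy. You grant us a perpetual, irrevocable "
          "license to your content. You may opt out at any time.")
print(scan_policy(policy))
```

The extension and the web app both hand their text to logic like this and render whatever comes back as flagged.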

How we built it

We built it using the following technologies:

  • JavaScript for the Chrome extension
  • Flask for the backend, the web app, and the machine-learning API
  • NLTK for pre-processing the scraped privacy policy text
  • scikit-learn for the NLP classification algorithms (we trained both Naive Bayes and Support Vector Machine (SVM) classifiers, and used GridSearchCV to tune hyperparameters for the best accuracy)
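The classification setup above can be sketched as a scikit-learn pipeline: a TF-IDF vectorizer feeding a classifier, with GridSearchCV picking hyperparameters. The inline dataset and parameter grid here are toy examples, not our actual training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labeled sentences (the real set had ~100 per class).
sentences = [
    "We never sell your personal data.",
    "Your data stays on your device.",
    "You can delete your account at any time.",
    "We only collect what is strictly necessary.",
    "We grant ourselves a perpetual license to your content.",
    "We may share your data with unnamed third parties.",
    "Your information may be sold to advertisers.",
    "This license is irrevocable and worldwide.",
]
labels = ["good"] * 4 + ["bad"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Small illustrative grid; swapping in an SVM (sklearn.svm.SVC)
# only changes the "clf" step and its parameter names.
grid = GridSearchCV(pipeline, param_grid={"clf__alpha": [0.1, 1.0]}, cv=2)
grid.fit(sentences, labels)
print(grid.predict(["We sell your data to third parties."]))
```

GridSearchCV cross-validates every parameter combination and refits the best one on the full training set, so `grid` can be used directly for prediction afterwards.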

Challenges we ran into

The lack of training data was the biggest challenge we faced. Since we were using supervised learning, we first looked for an existing labeled dataset of good and bad privacy policy sentences. We couldn't find one, so we decided to build it ourselves, collecting ~100 good and ~100 bad sentences as our training set. With such a small dataset, however, we were not able to train a very robust classifier: the best accuracy we reached was about 72%.
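With only ~200 sentences, a single train/test split gives a noisy accuracy estimate, so cross-validation is the safer way to arrive at a number like 72%. A sketch of that evaluation, again with toy stand-in data rather than our real set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the hand-labeled dataset.
texts = [
    "We do not sell your data.", "Data is stored locally.",
    "You control your information.", "We minimize data collection.",
    "We sell data to advertisers.", "We keep a perpetual license.",
    "Third parties may buy your data.", "This grant is irrevocable.",
] * 3  # repeated so each of the 3 folds has samples from both classes
labels = (["good"] * 4 + ["bad"] * 4) * 3

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Mean accuracy over 3 folds; on a small real dataset this is a more
# stable estimate than one train/test split.
scores = cross_val_score(model, texts, labels, cv=3)
print(round(scores.mean(), 2))
```

On a dataset this small, the fold-to-fold spread of `scores` is worth reporting alongside the mean, since any single split can be misleading.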

Accomplishments that we're proud of

  • Explored several new domains: NLP, Chrome extensions, Dfinity, Flask, and web scraping
  • Finished the project on schedule
  • Followed good coding and GitHub practices (issues, pull requests, and code review)
  • Collaborated well as a team

What we learned

  • Learned about ethical data handling
  • Explored NLP and text-classification algorithms
  • Developed a Chrome extension
  • Built a Flask API
  • Worked with Dfinity (Definitely ><)

What's next for DocVerifier

  • Making the ML model more accurate by collecting more training data
  • Predicting flaws in website cookie policies and offer letters
  • Integrating file upload into the Chrome extension