Inspiration
The UCLA Digital Library has archived many old manuscripts. Entries are organized by their name (e.g. Voltaire's Dictionnaire philosophique). However, many of these manuscripts have annotations written on them. Some pages were underlined, others have notes written on the margins, and some even have doodles. While many scholars have worked hard deciphering these annotations, I think it would be great if we could filter out all the pages that DO NOT have annotations. This saves people from going through hundreds of "clean" pages and have them only focus on pages with annotations written on them.
What it does
Rarely does anyone build a convolutional neural network from scratch; I am no different. This uses the resnet model and retrains its last layer to classify manuscripts binarily --manuscripts with annotations and manuscripts without.
How I built it
- Get as much images of manuscripts as possible.
- Painfully label all the data and organize them into training and validation data .
within the script...
- Scale the data small enough to make computations easier without losing too much information of the image (pretty trial and error)
- Train the model a number of times
- Take model with best validation error
- Test it~
Challenges I ran into
Getting labeled data is REALLY hard. Many hours were spent downloading images and organizing them all into training data and testing data. I was new to PyTorch, so it took some time getting all the arguments correct. It was cold at Santa Barbara :(
Accomplishments that I'm proud of
It actually kinda works! Learned to persist despite going at this project alone.
What I learned
MAD RESPECT TO THE PEOPLE WHO LABEL DATA. I COULD RAISE A FAMILY WITH THE AMOUNT OF TIME TAKEN TO LABEL THE AMOUNT OF DATA NECESSARY TO TRAIN A MODEL. This was my first time exposing myself into the deep learning sector of machine learning.
What's next for Manuel 0.1
This is really part of larger pipeline of the project. The next step is to take all the annotated manuscripts and write some software that circles the region where annotations are in the manuscript (another machine learning problem).
Log in or sign up for Devpost to join the conversation.