Project Fishtype

GitHub: https://github.com/AKil47/FishType

Problem: Cybercrime is more rampant than ever

We live in a world where technology is advancing rapidly, simplifying our lives while making us utterly dependent on computers and digital networks. At the same time, data breaches and complex hacks have become daily news, threatening our wealth, privacy, and security. Every measure that reinforces our security is more vital than ever.

Solution: Keystroke Dynamics - Applying Gait Analysis to Typing

Gait Analysis demo in Mission Impossible

Gait analysis is the study of a person's walking habits, which can be used to determine their identity. Although real-time systems like the one in the gif above don't fully exist, research suggests that each person's gait is distinctive enough to act much like a fingerprint.

Our idea is to take gait analysis from the feet and see if the same holds true for the fingers. Spoiler alert: it kind of does!

Collecting the Data

To date, there is only a single dataset on Kaggle that has anything related to keystroke data and identity. It has a whopping five rows, where each user entered a string five characters long. Our goal was to create an open dataset with enough volume and reliability to help data scientists develop accurate models to solve the problems listed above.

We designed a custom keystroke surveying tool called fishtype (bc we're freshmen) to collect the data. The code is on our GitHub, so check it out!

Screenshot of our capture application

The website had users type the phrase "The quick brown fox jumped over the lazy dog" five different times into a text box. If the users entered the phrase incorrectly, they were asked to type it again so that the dataset could remain consistent.

We went across the Hall of Champions and asked every single person that we saw to fill out our survey. We also sent the link to our friends and family members.

By asking the user to enter the data multiple times and repeat the process until they did not make any mistakes, we ensured that our data would be clean and consistent from the collection stage itself.

The data was then uploaded to Google's Firebase Cloud Storage platform as JSON files that we could download for further analysis on our local machine.

Analyzing the Data

After surveying and cleaning, we had data from 63 users with 5 trials each, for a total of 315 typing samples. Each trial logged 90 key events.

Feature Generation

Here is a sample of our raw data:

Screenshot of raw data from firebase

Our raw data consisted of timestamps for every keypress and key release a user made in a given trial. On its own, this data is not very useful. To give this data some meaning, we extracted two main features from the dataset: seek time and duration.

In gait analysis with feet, analysts focus on the time it takes to switch legs, and the time a person spends on a single leg. We took the same concepts and applied them to typing to get our two main features.

Diagram explaining variables

  • Seek Time - The time a user spends switching from one key to the following key
  • Duration - The time a user holds down a single key

We were able to compute both of the variables by simply subtracting the relevant timestamps from the raw data.
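The subtraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the event layout (a list of dicts with `key`, `type`, and `time` fields) and the function name `extract_features` are assumptions, and the real Firebase JSON may be structured differently. Seek time is taken here as the next key's press time minus the current key's release time, which is one reasonable reading of the definition above.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical event format; the real Firebase JSON layout may differ:
# {"key": "T", "type": "down" or "up", "time": timestamp in milliseconds}
Event = Dict[str, Any]


def extract_features(events: List[Event]) -> Tuple[List[float], List[float]]:
    """Return (durations, seek_times) in milliseconds for one trial."""
    presses: List[List[Any]] = []   # [down_time, up_time] pairs, in press order
    pending: Dict[str, int] = {}    # key -> index of its not-yet-released press

    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["type"] == "down":
            if ev["key"] in pending:
                continue  # key is already held: ignore auto-repeat events
            pending[ev["key"]] = len(presses)
            presses.append([ev["time"], None])
        elif ev["type"] == "up" and ev["key"] in pending:
            presses[pending.pop(ev["key"])][1] = ev["time"]

    # Keep only presses that were also released within the trial.
    complete = [(down, up) for down, up in presses if up is not None]

    # Duration: release time minus press time of the same key.
    durations = [up - down for down, up in complete]
    # Seek: press time of the next key minus release time of the current key.
    seeks = [complete[i + 1][0] - complete[i][1] for i in range(len(complete) - 1)]
    return durations, seeks


sample_trial = [
    {"key": "h", "type": "down", "time": 0},
    {"key": "h", "type": "up", "time": 80},
    {"key": "i", "type": "down", "time": 150},
    {"key": "i", "type": "up", "time": 210},
]
durations, seeks = extract_features(sample_trial)
```

With this definition, a seek time can be negative when the next key is pressed before the previous one is released, which happens naturally with the shift key.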

Cleaning the Data

We designed the experiment to promote data cleanliness from the get-go but overlooked one crucial factor: capital letters. On a keyboard, capital letters can be entered using either the Shift key or the Caps Lock key. To ensure that every record had exactly the same sequence of keys, we pruned all records where participants used the Caps Lock key instead of the Shift key. This was an unfortunate setback that affected a small percentage of our dataset.
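A pruning step like this could look roughly like the sketch below. It assumes each trial is a list of event dicts with a `key` field and that Caps Lock presses were logged under the browser's `"CapsLock"` key name; both the layout and the helper names are hypothetical, not taken from the project code.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]  # hypothetical layout: {"key": ..., "type": ..., "time": ...}


def used_caps_lock(trial: List[Event]) -> bool:
    """True if any event in the trial involves the Caps Lock key."""
    # Assumption: Caps Lock was recorded under the key name "CapsLock".
    return any(ev["key"] == "CapsLock" for ev in trial)


def prune_caps_lock_trials(trials: List[List[Event]]) -> List[List[Event]]:
    """Keep only trials typed without Caps Lock."""
    return [trial for trial in trials if not used_caps_lock(trial)]


trials = [
    [{"key": "Shift", "type": "down", "time": 0}, {"key": "T", "type": "down", "time": 30}],
    [{"key": "CapsLock", "type": "down", "time": 0}, {"key": "T", "type": "down", "time": 90}],
]
clean_trials = prune_caps_lock_trials(trials)
```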

Validating Our Theory (With Pictures!)

We collected data based on a gut instinct. The most important question is, were we right? Well, we think so, and here's why:

Note: Each point on the x-axis is one part of the sentence "The quick brown fox jumped over the lazy dog". We omitted the text labels on the axis for readability's sake.

Conclusion 1: Humans do, in fact, have a habitual manner of typing

The first thing we confirmed was that typing data is not just a random blob. Here is a look at a single user's duration and seek times. Each color represents an individual trial.

Sample duration and seek for one user in one trial

As you can see, the lines are relatively close together and show only slight variance aside from some outliers.

To further confirm this, we calculated the coefficient of variation (CV, a fancy DS term for the standard deviation divided by the mean) for each data point across multiple trials for each user. A CV below 1 is generally considered low variance, meaning the spread is small relative to the mean. The following two plots show the CVs for each user for each letter in the sentence, where each color is a user.
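As a rough illustration of that calculation, the sketch below computes one CV per position in the sentence across a user's trials. The function names and the toy numbers are made up for the example, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the writeup does not say which was used.

```python
from statistics import mean, stdev
from typing import List


def coefficient_of_variation(values: List[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def cv_per_position(trials: List[List[float]]) -> List[float]:
    """One CV per position in the sentence, computed across a user's trials.

    `trials` holds one list of timings (e.g. durations in ms) per trial,
    all of the same length.
    """
    return [coefficient_of_variation(list(column)) for column in zip(*trials)]


# Toy numbers for one hypothetical user: 3 trials, 2 key positions.
example_trials = [
    [100.0, 50.0],
    [110.0, 60.0],
    [90.0, 40.0],
]
cvs = cv_per_position(example_trials)
```

Dividing by the mean is safe for durations, which are positive, but a seek time averaging close to zero would inflate the CV, so that feature may need more care.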

Plot of CV Values for Features

That's a lot of dots! The plot shows that most CV values are well under 0.5, so each user's timings are highly consistent from trial to trial! Wow!

The plots indicate that seek data varies slightly more than duration, so it might not be as important a feature to use in models.

Conclusion 2: Human typing is distinguishable

The easiest way to prove that human typing is distinguishable is to build a model that has a high accuracy. So that's exactly what we did.

After a long process of hyperparameter tuning, we were able to get a RandomForestClassifier to predict a user with an accuracy of 88.2%! To make sure this number reflects performance on unseen data rather than overfitting, we split the data into 67% training and 33% testing data.
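A setup along these lines can be sketched with scikit-learn as below. This is only an illustration on synthetic stand-in data, not the FishType dataset, and it will not reproduce the 88.2% figure. The number of users and features, the noise level, the `n_estimators` value, the fixed random seeds, and the stratified split are all assumptions, since the tuned hyperparameters are not given here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the FishType dataset): 10 hypothetical users,
# 5 trials each, 20 timing features per trial.
n_users, n_trials, n_features = 10, 5, 20
user_profiles = rng.uniform(50, 200, size=(n_users, n_features))  # typical timings in ms
noise = rng.normal(0, 10, size=(n_users * n_trials, n_features))  # trial-to-trial variation
X = np.repeat(user_profiles, n_trials, axis=0) + noise
y = np.repeat(np.arange(n_users), n_trials)  # label = user id

# 67% training / 33% testing; stratifying keeps every user in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Placeholder hyperparameters, not the tuned values from the project.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Accuracy is measured only on trials the model never saw during training.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

With only five trials per user, a single split leaves one or two test trials per person, so repeating the split with different seeds or using cross-validation would give a steadier estimate.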

More importantly, we created a confusion matrix to determine the false positive and false negative rates. That is, how often we accidentally let a bad guy log in versus how often we accidentally stop a good guy from logging in. The results speak for themselves...

Confusion matrix

The false positive rate is infinitesimally small, while the false negative rate is about 12%. This is very good because it means that the model almost always keeps the bad guy out and wrongly stops the good guy only about 12% of the time when logging in.
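For readers who want to derive such rates themselves, here is a minimal sketch of reading per-user false positive and false negative rates off a multi-class confusion matrix. The one-vs-rest treatment of each user and the function name are assumptions, as the writeup does not state how the rates were computed, and the 3x3 matrix is a toy example rather than the project's results.

```python
from typing import List, Tuple


def per_user_error_rates(cm: List[List[int]]) -> List[Tuple[float, float]]:
    """Return (false_positive_rate, false_negative_rate) for each user.

    cm[i][j] counts trials typed by user i that the model predicted as user j.
    Each user is treated one-vs-rest: "positive" means "predicted to be this user".
    """
    total = sum(sum(row) for row in cm)
    rates = []
    for i in range(len(cm)):
        tp = cm[i][i]                          # user i correctly recognized
        fn = sum(cm[i]) - tp                   # user i rejected (predicted as someone else)
        fp = sum(row[i] for row in cm) - tp    # someone else accepted as user i
        tn = total - tp - fn - fp              # someone else correctly not matched to user i
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        fnr = fn / (fn + tp) if (fn + tp) else 0.0
        rates.append((fpr, fnr))
    return rates


# Toy 3-user confusion matrix (rows: true user, columns: predicted user).
toy_cm = [
    [4, 1, 0],
    [0, 5, 0],
    [1, 0, 4],
]
rates = per_user_error_rates(toy_cm)
```

Under this one-vs-rest view, every misclassified trial counts as a false positive against a large pool of other users' trials, so the per-user false positive rate tends to shrink as the number of users grows.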

Potential Use Cases

This dataset has the potential to impact multiple industries to promote security and integrity.

Industry: Cybersecurity
Use Case: Detect unauthorized login attempts
Description: The standard use case is detecting whenever an unauthorized user tries to access an account. Say that an attacker obtains your password. When they try to access your web app, the site will ask them extra verification questions because their typing habits were different than what was expected.

Industry: Data Providers
Use Case: Detect bot activity vs. human activity
Description: The dataset indicates that there are many trends that all humans follow. Perhaps these can be used to build a classifier that can distinguish between human and bot activity and mitigate web scraping without intrusive CAPTCHAs.

Industry: Academia
Use Case: Detect students impersonating others on exams
Description: When students are taking an online exam, the proctoring software could cross-check typing samples with other students to ensure that another student is not taking the test on someone else's behalf.

Reflections

  • Google Firebase is excellent for quickly gathering data into a bucket and downloading it. Data scientists who are building quick surveys should definitely consider using it.
  • A significant and very important step of data science is properly cleaning the dataset. We spent hours debugging our analysis code when the problems were caused by a failure to correctly sanitize all of our input. Survey data from humans is tough to analyze.
  • Coding is good to know for data science, but math is way more important.
  • This dataset only includes the people at the datathon. I believe that it will become more valuable if we keep the website open and continue to gather data.