Inspiration

But why?

Data is plentiful nowadays and is used for all sorts of decision making and governance. Datasets are growing rapidly, but so are models [0]. Moreover, the biggest models act as huge black boxes, which limits how far they can be applied to decision making: they can tell how things are associated, but not why. Enter Socratic.

What it does

Causal inference is a field of statistics and mathematics that tries to address this [1]. It interrogates data in a specific way to tease out how variables depend on one another, rather than just how strongly they are correlated, and can represent those relationships as a graph of dependencies. Using this framework, we built an automatic causal model builder that takes in CSV data and suggests various candidate causal models.
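To make the idea of "dependence rather than correlation" concrete, here is a minimal sketch on generated toy data. It is not Socratic's actual implementation: the helper names (`residuals`, `partial_corr`, `skeleton`), the linear-Gaussian toy chain X -> Y -> Z, and the fixed threshold of 0.1 (standing in for a proper significance test) are all assumptions made for illustration. It shows that X and Z are strongly correlated, yet become nearly uncorrelated once Y is accounted for, so no direct X-Z edge is kept.

```python
import itertools
import numpy as np

def residuals(target, given):
    """Residuals of a least-squares fit of target on given (plus an intercept)."""
    design = np.column_stack([np.ones(len(target)), given])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def partial_corr(a, b, given):
    """Correlation between a and b after linearly removing the effect of given."""
    return np.corrcoef(residuals(a, given), residuals(b, given))[0, 1]

def skeleton(data, threshold=0.1):
    """Keep an undirected edge u-v only if u and v stay dependent
    marginally and given every other single variable.
    The threshold is an illustrative stand-in for a significance test."""
    names = list(data)
    edges = set()
    for u, v in itertools.combinations(names, 2):
        if abs(np.corrcoef(data[u], data[v])[0, 1]) < threshold:
            continue  # marginally independent: no edge
        others = [w for w in names if w not in (u, v)]
        if any(abs(partial_corr(data[u], data[v], data[w])) < threshold
               for w in others):
            continue  # independent given some third variable: no edge
        edges.add((u, v))
    return edges

# Hypothetical toy data: a causal chain X -> Y -> Z with Gaussian noise.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
z = -y + rng.normal(size=n)

edges = skeleton({"X": x, "Y": y, "Z": z})
print(edges)  # the X-Z edge is dropped; X-Y and Y-Z remain
```

This recovers only the undirected skeleton of the graph and conditions on one variable at a time; orienting the edges and conditioning on larger sets need further steps.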

We showcase the model on some generated toy data as well as on part of a real dataset from TfL.

How we built it

We used the Python libraries numpy, scikit-learn, statsmodels and networkx to build the data analysis model, and Flask and HTML for the front end. The web app is deployed on AWS.

Challenges we ran into

With the model, one problem was getting the statistical dependencies right. Without proper statistical analysis, there is no way to infer any causal links between the variables in the model. Another major issue is how the problem scales with the number of variables. There are layers upon layers of combinatorially scaling components when checking all causal models, which makes an exhaustive search intractable (structure learning of this kind is known to be NP-hard). With AWS, there were many errors along the way, and we had to dig through the logs to resolve them.
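To give a sense of the combinatorial growth mentioned above, the sketch below counts how many directed acyclic graphs exist on n labelled variables, using Robinson's recurrence. This is an illustration of the size of the search space, not part of Socratic's code; the function name `count_dags` is our own.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes, via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k), a(0) = 1."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 8):
    print(n, count_dags(n))
```

Three variables already admit 25 candidate graphs and five admit 29281, so checking every candidate model quickly becomes infeasible as columns are added.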

Accomplishments that we're proud of

We're proud of coming up with this solution and of combining so many moving parts into a project that is greater than the sum of its parts.

What we learned

We learned a lot about building the model, and about overcoming deployment issues such as finding better ways to handle file uploads on AWS.

What's next for Socratic

Next, we will enhance the model and scale the underlying infrastructure to handle larger datasets and mixtures of continuous and categorical data.
