This repository contains three data analysis projects. Each folder also contains a report on that project's findings. The projects are described in more detail below:
-
Stack Overflow Question Recommendation System - A recommendation system for Stack Overflow data that recommends previously answered, similar questions for the current post. The system was evaluated by comparing its recommendations with the questions Stack Overflow had marked as duplicates. Using similarity measures such as Jaccard similarity and TF-IDF combined with cosine similarity, an accuracy of 64% was achieved. A topic modelling approach, LDA, was then used as a higher-level model, with topics extracted using MLE; this improved the accuracy to 71%. A Naive Bayes classifier was used to predict whether a post is a duplicate of another post. Python was the primary development platform, and the text analysis modules nltk and scikit-learn were used extensively.
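The two similarity measures named above can be illustrated with a minimal, standard-library-only sketch. It is not the project's code: the whitespace tokenizer, the smoothed IDF formula and the helper names (`tokenize`, `jaccard`, `tfidf_vectors`, `cosine`, `recommend`) are assumptions made for illustration, and the project itself reports using nltk and scikit-learn.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity: size of the token-set intersection over the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption) so terms found in every document keep weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(query, answered, top_k=2):
    """Return the top_k answered questions most similar to the query."""
    vectors = tfidf_vectors(answered + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), q) for vec, q in zip(vectors, answered)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical questions, not taken from the Stack Overflow data set.
    answered = [
        "how to reverse a list in python",
        "what is a null pointer exception in java",
        "python list reverse order",
    ]
    print(recommend("reverse a python list", answered))
```

In an evaluation like the one described, the recommended questions would then be compared against the questions marked as duplicates.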
-
Future Business Scope Analysis based on Yelp Reviews - Text analysis was performed on the Yelp Reviews data set to identify business trends and future business scope. Sentiment analysis was used to extract subjective information from reviews and what it indicates about a business’s standing. Regression techniques such as linear regression and logistic regression were used to predict the rating a particular user would give a business and to determine whether a business is superior or inferior. Multiclass classification was used to estimate the probability of a business falling into each star rating group. Various data visualisation techniques, such as word clouds, confusion matrices and heat maps, were used to present the results of these experiments. Python was the primary language.
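As a rough illustration of how a confusion matrix summarises star-rating predictions, here is a small standard-library sketch. The ratings are made up for the example, and the helper names `confusion_matrix` and `accuracy` are hypothetical; the project's actual models and data are not reproduced here.

```python
def confusion_matrix(actual, predicted, labels):
    """Count predictions: rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of predictions on the diagonal, i.e. predicted == actual."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    stars = [1, 2, 3, 4, 5]
    # Made-up star ratings, for illustration only.
    actual = [5, 4, 4, 3, 1, 5]
    predicted = [5, 4, 3, 3, 2, 5]
    matrix = confusion_matrix(actual, predicted, stars)
    for label, row in zip(stars, matrix):
        print(label, row)
    print("accuracy:", round(accuracy(matrix), 3))
```

Off-diagonal cells show which star groups get confused with each other, which is the kind of pattern a heat map of the matrix makes visible.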
-
NYPD Motor Vehicle Collisions Data Analysis - Exploratory data analysis techniques were used to analyse and visualise data provided by the NYPD through NYC Open Data. 2D density plots, heat maps and bubble charts were used to plot accident frequency, its distribution across the city and the factors contributing to these accidents. The analysis was narrowed down from the city to boroughs and then to individual streets to find the most accident-prone streets, the causes and the improvements that could be made. Hypothesis testing was used to determine the time of day at which the majority of accidents occur. The results were visualised through a timeline analysis of the data. The R programming language was predominantly used for the project.
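The README does not say which hypothesis test was applied. One plausible option is a chi-square goodness-of-fit test of whether collisions are spread evenly across times of day, sketched below. The sketch is in Python (for consistency with the other examples, although this project used R), the four time buckets and the sample hours are assumptions, and the helper names are hypothetical.

```python
from collections import Counter

# Assumed buckets: (name, first hour inclusive, last hour exclusive).
BUCKETS = [("night", 0, 6), ("morning", 6, 12), ("afternoon", 12, 18), ("evening", 18, 24)]

# Chi-square critical value for 3 degrees of freedom at the 5% level.
CRITICAL_VALUE_DF3 = 7.815


def bucket_of(hour):
    """Map an hour of the day (0-23) to its time-of-day bucket name."""
    for name, start, end in BUCKETS:
        if start <= hour < end:
            return name
    raise ValueError(f"hour out of range: {hour}")


def bucket_counts(hours):
    """Count collisions per bucket, keeping buckets with zero collisions."""
    counts = Counter(bucket_of(h) for h in hours)
    return {name: counts.get(name, 0) for name, _, _ in BUCKETS}


def chi_square_uniform(counts):
    """Chi-square statistic against the null of equal counts in every bucket."""
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


if __name__ == "__main__":
    # Made-up collision hours, not NYPD data.
    hours = [8] * 10 + [14] * 40 + [20] * 30 + [2] * 20
    counts = bucket_counts(hours)
    stat = chi_square_uniform(counts)
    print(counts, "chi-square:", stat)
    if stat > CRITICAL_VALUE_DF3:
        print("Reject uniformity; busiest bucket:", max(counts, key=counts.get))
```

Rejecting the null only shows that collisions are not evenly spread across the buckets; identifying the busiest period still relies on the counts themselves.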