DataScience Projects

This repository contains three data analysis projects. Each folder also contains a report summarising the findings of that project. The projects are described in more detail below:

Stack Overflow Question Recommendation System - A recommendation system for Stack Overflow data that suggests previously answered, similar questions for the current post. The system was evaluated by comparing its recommendations with the questions that Stack Overflow marks as duplicates. Using similarity measures such as Jaccard similarity and TF-IDF combined with cosine similarity, an accuracy of 64% was achieved. A topic modelling approach, LDA, was then used as the higher-level model, with topics extracted using MLE; this improved the accuracy to 71%. A Naive Bayes classifier was used to predict whether a post is a duplicate of another post. Python was the primary development language, and the text analysis modules nltk and scikit-learn were used extensively.
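
The two similarity measures named for the Stack Overflow project can be sketched in plain Python as follows. This is a minimal illustration rather than the project's code: the function names, the whitespace tokenizer, the smoothed IDF formula and the example questions are all assumptions made for the example, and the project itself used nltk and scikit-learn.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also strip
    # punctuation and stop words (for example with nltk).
    return text.lower().split()


def jaccard_similarity(a, b):
    # |A intersect B| / |A union B| over the token sets of the two questions.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def tfidf_vectors(docs):
    # One sparse TF-IDF vector (dict: term -> weight) per document, using raw
    # term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1 (assumed).
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine_similarity(u, v):
    # Cosine of the angle between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query, candidates, top_k=3):
    # Rank previously answered questions by TF-IDF cosine similarity to the query.
    vectors = tfidf_vectors(candidates + [query])
    query_vec = vectors[-1]
    scored = [(cosine_similarity(query_vec, vec), text)
              for vec, text in zip(vectors, candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up example questions, not taken from the Stack Overflow data set.
answered = [
    "how to reverse a list in python",
    "how to merge two dictionaries in python",
    "what is a null pointer exception in java",
]
new_post = "reverse a python list"

for score, question in recommend(new_post, answered, top_k=2):
    print(f"{score:.3f}  {question}")
```

Jaccard similarity only looks at which words two questions share, while TF-IDF with cosine similarity also down-weights words that appear in many questions, which is one plausible reason to combine the two.
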
Future Business Scope Analysis based on Yelp Reviews - Text analysis was performed on the Yelp Reviews data set to identify business trends and future business scope. Sentiment analysis was used to extract subjective information from reviews and what it indicates about a business's standing. Regression techniques such as linear regression and logistic regression were used to predict the rating a particular user would give a business and to determine whether a business is superior or inferior. Multiclass classification was used to estimate the probability of a business falling into each star rating group. Data visualisation techniques such as word clouds, confusion matrices and heat maps were used to present the results of these experiments. Python was the primary language.
NYPD Motor Vehicle Collisions Data Analysis - Exploratory data analysis techniques were used to analyse and visualise the data provided by the NYPD on NYC Open Data. 2D density plots, heat maps and bubble charts were used to plot accident frequency, its distribution across the city and the factors contributing to these accidents. The analysis was narrowed down from the city to boroughs and then to individual streets to identify the most accident-prone streets, the causes and possible improvements. Hypothesis testing was used to determine the time of day at which most accidents occur. The results were visualised with a timeline analysis of the data. R was the predominant programming language for the project.
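
The README does not say which hypothesis test was applied to the time-of-day question in the NYPD project, and that project was written in R. As one possible reading, the sketch below uses a chi-square goodness-of-fit test in Python to check whether collisions are spread evenly across four time-of-day bins. The bins, the counts and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical collision counts per time-of-day bin; illustrative only,
# not figures from the NYPD data set.
observed = {
    "night (00-06)": 120,
    "morning (06-12)": 260,
    "afternoon (12-18)": 340,
    "evening (18-24)": 280,
}


def chi_square_uniform(counts):
    # Null hypothesis: collisions are spread evenly across the bins,
    # so every bin has the same expected count.
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


# Critical value of the chi-square distribution for 3 degrees of freedom
# (4 bins - 1) at the 5% significance level.
CRITICAL_VALUE_DF3_ALPHA_05 = 7.815

statistic = chi_square_uniform(observed)
reject_uniform = statistic > CRITICAL_VALUE_DF3_ALPHA_05
busiest_bin = max(observed, key=observed.get)

print(f"chi-square statistic: {statistic:.2f}")
print(f"reject 'evenly spread' hypothesis: {reject_uniform}")
print(f"bin with most collisions: {busiest_bin}")
```

Rejecting the evenly-spread hypothesis only shows that the time of day matters; identifying the peak period still relies on comparing the bin counts, for example in a timeline plot.
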

