Team 58 | Devpost

Inspiration

Our team's Datathon participation is fueled by our shared passion for the transformative potential of data analysis. We wanted to take on the challenge to refine our analytical skills and explore the practical implications of data science in the real world!

What it does

In this project, our task is to predict the sales figures of various companies from the data provided. Through analysis and predictive modelling, we aim to uncover relationships between variables and trends within the dataset, and use them to generate accurate sales predictions for each company.

How we built it

In building our project, we followed a systematic approach to data analysis and modelling. We started by importing the required libraries and the dataset itself. For clarity and efficiency, we dropped unnecessary columns and removed missing or irrelevant data, such as "Account ID" and "Square Footage". Next, to obtain a complete dataset, we used a KNN imputer to fill in null values where relevant. Imputing values, rather than discarding incomplete records, also helps limit the bias that missing data could introduce into our results.

Subsequently, we conducted exploratory data analysis (EDA) on the critical variables. We started with our target variables (Domestic and Global Ultimate Sales), using data representations such as histograms and boxplots. To analyze our quantitative features in detail, we then created a heatmap of the correlation matrix to examine the strength and direction of the linear relationships between variables in our dataset. We went on to explore the relationships between sales and critical variables such as geographical location and number of employees. We also ran an ANOVA on the categorical variables to assess how significant each one is in predicting sales.

As part of our final data processing, we decided on our final predictors, dropped the unnecessary ones, and performed one-hot encoding on the categorical variables. Finally, in the modelling phase, we fitted several models to our dataset. We selected the 4 best models, those with the highest R-squared values, lowest AIC and lowest mean squared error (MSE): Random Forest Regressor, Gradient Boosting Regressor, XGB Regressor and LightGBM Regressor.
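The preprocessing and model comparison steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our actual notebook: the data is synthetic, the column names (`employees`, `year_founded`, `region`, `sales`) are hypothetical, and only the two scikit-learn models are shown so the example needs no extra packages. XGBoost and LightGBM regressors could be added to the `models` dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the competition data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "employees": rng.integers(5, 500, size=n).astype(float),
    "year_founded": rng.integers(1950, 2020, size=n).astype(float),
    "region": rng.choice(["north", "south", "east"], size=n),
})
df["sales"] = 1000.0 * df["employees"] + rng.normal(0, 5000, size=n)

# Blank out some numeric values to mimic missing data.
df.loc[rng.choice(n, size=20, replace=False), "employees"] = np.nan

numeric_cols = ["employees", "year_founded"]
categorical_cols = ["region"]
X = df[numeric_cols + categorical_cols]
y = df["sales"]

# KNN imputation for numeric columns, one-hot encoding for categorical ones.
preprocess = ColumnTransformer([
    ("impute", KNNImputer(n_neighbors=5), numeric_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Compare models by cross-validated R-squared and MSE.
results = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_validate(
        pipe, X, y, cv=5, scoring=("r2", "neg_mean_squared_error")
    )
    results[name] = {
        "r2": scores["test_r2"].mean(),
        "mse": -scores["test_neg_mean_squared_error"].mean(),
    }

for name, metrics in results.items():
    print(f"{name}: R^2={metrics['r2']:.3f}, MSE={metrics['mse']:.1f}")
```

Putting the imputer and encoder inside the pipeline means they are refitted on each training fold, so no information from the held-out fold leaks into preprocessing.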

Challenges we ran into

Our team found data cleaning and feature selection to be the most challenging aspects of the project. Both processes are intricate, require in-depth analysis and involve multiple steps, such as one-hot encoding and removing null values. Feature selection required us to thoroughly analyze the relevance of each of our variables and weigh their potential impact on the model's performance. Finally, we tried and tested many models before deciding which ones to implement.
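As an illustration of how the relevance of a categorical variable can be checked, here is a small sketch of a one-way ANOVA using `scipy.stats.f_oneway`. The data is synthetic, and the column names (`region`, `noise_flag`, `sales`) and the helper `anova_p_value` are hypothetical, made up for this example only.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic data: "region" shifts sales, "noise_flag" is unrelated to it.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east"], size=n),
    "noise_flag": rng.choice(["a", "b"], size=n),
})
region_effect = df["region"].map({"north": 0.0, "south": 50.0, "east": 100.0})
df["sales"] = region_effect + rng.normal(0, 10, size=n)


def anova_p_value(frame, feature, target="sales"):
    """One-way ANOVA p-value for the target grouped by a categorical feature."""
    groups = [g[target].to_numpy() for _, g in frame.groupby(feature)]
    return f_oneway(*groups).pvalue


p_values = {col: anova_p_value(df, col) for col in ["region", "noise_flag"]}
for col, p in p_values.items():
    print(f"{col}: p-value = {p:.4g}")
```

A small p-value suggests that mean sales differ across the categories of that feature, which makes it a candidate predictor. A large p-value is weaker evidence, so such a feature may still deserve a second look before it is dropped.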

What we learned

We learned the importance of the various data processing stages, as raw data usually cannot be fed directly into a model. Challenges like missing data and diverse formats call for methods such as imputation and one-hot encoding before the data can be processed effectively.

Having worked with real-world data from Champions Group firsthand during the datathon, we've come to appreciate how starkly it contrasts with the structured datasets often encountered in school assignments. The raw data was laden with missing values, irregularities and diverse formats, which added a layer of complexity. This unpredictability, while challenging, made our learning experience more engaging and gave us valuable insights into the complexities of real-world data analytics.