Amazon Bestselling Books Analysis Project using Machine Learning
The Amazon Book Analysis project aims to analyze a dataset of best-selling books on Amazon, utilizing libraries like NumPy, Matplotlib, Seaborn, and Pandas to create visualizations. It explores genre distribution, author popularity, and book characteristics.
The analysis includes genre distribution for unique books from 2009 to 2019, genre trends in two time periods, top authors in fiction and non-fiction, and the number of unique books and total reviews for these authors. By providing valuable insights into genre trends, popular authors, and successful book characteristics, this project sheds light on the best-selling book market on Amazon.
About the Dataset
The dataset contains 550 entries, the top 50 bestsellers for each year from 2009 to 2019, and each book is labeled fiction or non-fiction based on Goodreads. It has seven columns: the name of the book, the author, the Amazon user rating, the number of written reviews on Amazon, the price, the year(s) the book ranked on the bestseller list, and the genre (fiction or non-fiction). Our analysis focuses on Genre as a categorical variable and on User Rating, Reviews, and Price as numerical variables.
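As a quick illustration of these variables, the sketch below builds a small stand-in DataFrame with the same seven columns and summarizes the categorical and numerical variables. The rows are made-up placeholder values, not records from the real dataset; the column names follow the ones used in the code later in this article.

```python
import pandas as pd

# Placeholder rows for illustration only; these are NOT taken from the real dataset.
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book A"],
    "Author": ["Author X", "Author Y", "Author Z", "Author X"],
    "User Rating": [4.7, 4.3, 4.8, 4.7],
    "Reviews": [12000, 3500, 800, 12000],
    "Price": [12, 8, 20, 12],
    "Year": [2015, 2015, 2016, 2016],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Fiction"],
})

# Genre is the categorical variable: count how often each label occurs.
genre_counts = sample["Genre"].value_counts()

# User Rating, Reviews and Price are the numerical variables: summary statistics.
numeric_summary = sample[["User Rating", "Reviews", "Price"]].describe()

print(genre_counts)
print(numeric_summary.loc[["mean", "min", "max"]])
```

Note that the same book can appear in more than one year, which is why several later steps drop duplicate names before counting.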
Tools and libraries used
- Pandas
- NumPy
- Seaborn
- Matplotlib
- Scikit-learn
Download Machine Learning Amazon Bestselling Books Analysis Project
Please download the source code of Machine Learning Amazon Bestselling Books Analysis Project: Machine Learning Amazon Bestselling Books Analysis Project
Steps to Analyse Amazon Bestselling Books in Machine Learning
1. Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
2. Reading the dataset
df = pd.read_csv(r'C:\Users\vaish\Downloads\Amazon Best Selling Books Analysis\ bestsellers with categories.csv')
3. Renaming a column in the DataFrame
df.rename(columns={"User Rating": "User_Rating"}, inplace=True)
4. Data cleaning and manipulation
df.loc[df.Author == 'J. K. Rowling', 'Author'] = 'J.K. Rowling'
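One way to spot spelling variants like the one fixed above is to compare author names after stripping spaces and periods. The sketch below is an assumption about how such a check could look, not a step from the original project; the helper `normalize_author` and every sample name other than the two Rowling spellings are hypothetical.

```python
import pandas as pd

# Hypothetical sample; only the two Rowling spellings come from the article.
authors = pd.Series(
    ["J. K. Rowling", "J.K. Rowling", "Author X", "Author  X", "Author Y"]
)

def normalize_author(name: str) -> str:
    """Lower-case the name and drop spaces and periods so variants collide."""
    return name.lower().replace(".", "").replace(" ", "")

# Group the distinct spellings by their normalized key.
unique_names = pd.Series(authors.unique())
groups = unique_names.groupby(unique_names.map(normalize_author)).agg(list)

# Keep only the keys that map to more than one spelling.
variants = groups[groups.map(len) > 1]
print(variants)
```

Each group that survives the filter is a candidate for a `df.loc[...]` correction like the one in this step; the final choice of canonical spelling still needs a manual look.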
5. Adding new columns to the DataFrame
df['name_len'] = df['Name'].apply(lambda x: len(x) - x.count(" "))
punctuations = string.punctuation
def count_punc(text):
    count = sum(1 for char in text if char in punctuations)
    return round(count / (len(text) - text.count(" ")) * 100, 3)
df['punc%'] = df['Name'].apply(lambda x: count_punc(x))
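To make the two new features concrete, here is a small worked example on a single string. The title "Hello, World!" is a hypothetical value chosen for easy arithmetic, and `name_len` is written as a named function here only for readability; the logic mirrors the lambda and `count_punc` above.

```python
import string

def name_len(text: str) -> int:
    # Number of characters in the title, ignoring spaces.
    return len(text) - text.count(" ")

def count_punc(text: str) -> float:
    # Punctuation characters as a percentage of the non-space characters.
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / (len(text) - text.count(" ")) * 100, 3)

title = "Hello, World!"          # hypothetical title, 13 characters with 1 space
length = name_len(title)         # 13 - 1 = 12
punc_pct = count_punc(title)     # 2 punctuation marks out of 12 characters

print(length, punc_pct)
```

One caveat with this formula: a title that is empty or consists only of spaces would cause a division by zero, so it assumes every `Name` value has at least one non-space character.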
6. Visualization – Pie chart showing the distribution of genres for unique books
no_dup = df.drop_duplicates('Name')
g_count = no_dup['Genre'].value_counts()
fig, ax = plt.subplots(figsize=(8, 8))
genre_col = ['navy', 'crimson']
explode = [0.1, 0] # Explode the first slice (optional)
wedges, texts, autotexts = ax.pie(g_count.values, explode=explode, labels=g_count.index, autopct='%.2f%%',
                                  startangle=90, textprops={'size': 14}, colors=genre_col, pctdistance=0.85)
# Customize wedges and texts
for wedge in wedges:
    wedge.set_edgecolor('white')
    wedge.set_linewidth(1)
for text, autotext in zip(texts, autotexts):
    text.set_fontsize(14)
    text.set_fontweight('bold')
    autotext.set_fontsize(12)
    autotext.set_color('white')
# Add a central circle
center_circle = plt.Circle((0, 0), 0.7, color='white')
ax.add_artist(center_circle)
# Title and aspect ratio
ax.set_title('Distribution of Genre for all unique books from 2009 to 2019', fontsize=20)
ax.axis('equal')
plt.show()
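The genre split drawn in this pie chart can also be checked numerically, and doing so shows why duplicates are dropped first. The sketch below uses a made-up five-row table (not the real data) in which one fiction title appears in three different years, and compares the genre share with and without `drop_duplicates`.

```python
import pandas as pd

# Made-up rows: "Book A" is on the list in three different years.
rows = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book A", "Book B", "Book C"],
    "Year": [2009, 2010, 2011, 2009, 2010],
    "Genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Non Fiction"],
})

# Share of each genre over all list entries (repeat appearances included).
share_all = rows["Genre"].value_counts(normalize=True) * 100

# Share of each genre over unique titles only, as in the step above.
share_unique = rows.drop_duplicates("Name")["Genre"].value_counts(normalize=True) * 100

print(share_all.round(2))
print(share_unique.round(2))
```

In this toy table fiction accounts for 60% of list entries but only a third of unique titles, so the two views can tell quite different stories.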
7. Visualization – Pie charts showing genre distribution for each year
y1 = np.arange(2009, 2014)
y2 = np.arange(2014, 2020)
g_count = df['Genre'].value_counts()
fig, ax = plt.subplots(2, 6, figsize=(12, 6))
# Adjust the spacing between subplots
fig.subplots_adjust(hspace=0.4, wspace=0.3)
# Create a custom color palette for the pie charts
colors = ['#ff9999', '#66b3ff']
# Set the font size for the titles
title_fontsize = 14
# Iterate over the subplots and create the pie charts
for i, year in enumerate(y1):
    # Reindex so each genre keeps the same color in every chart
    counts = df[df['Year'] == year]['Genre'].value_counts().reindex(g_count.index, fill_value=0)
    ax[0, i+1].pie(x=counts.values, labels=None, autopct='%.1f%%',
                   startangle=90, textprops={'size': 12, 'color': 'white'},
                   pctdistance=0.7, colors=colors, radius=1.1)
    ax[0, i+1].set_title(year, color='darkred', fontsize=title_fontsize)
for i, year in enumerate(y2):
    counts = df[df['Year'] == year]['Genre'].value_counts().reindex(g_count.index, fill_value=0)
    ax[1, i].pie(x=counts.values, labels=None, autopct='%.1f%%',
                 startangle=90, textprops={'size': 12, 'color': 'white'},
                 pctdistance=0.7, colors=colors, radius=1.1)
    ax[1, i].set_title(year, color='darkred', fontsize=title_fontsize)
# Draw the overall chart in the first cell and set its title and font size
ax[0, 0].pie(x=g_count.values, labels=None, autopct='%.1f%%',
             startangle=90, textprops={'size': 12, 'color': 'white'},
             pctdistance=0.7, colors=colors, radius=1.1)
overall_title = ax[0, 0].set_title('2009 - 2019\n(Overall)', color='darkgreen', fontsize=title_fontsize)
# Create a legend for the genres
fig.legend(g_count.index, loc='center right', fontsize=12)
# Remove unnecessary spines and labels
for row in ax:
    for col in row:
        col.axis('equal')
        col.spines['top'].set_visible(False)
        col.spines['right'].set_visible(False)
        col.spines['bottom'].set_visible(False)
        col.spines['left'].set_visible(False)
        col.set_xticks([])
        col.set_yticks([])
# Adjust the position of the subplots and title
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
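The yearly percentages shown in these pie charts can also be tabulated directly. The following is a minimal sketch using `pd.crosstab` on a made-up six-row table (not the real data); with `normalize="index"` each row of the result sums to 100%, which corresponds to one pie.

```python
import pandas as pd

# Made-up list entries for two years.
rows = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2010, 2010, 2010],
    "Genre": ["Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Fiction", "Non Fiction"],
})

# One row per year, one column per genre, values as row percentages.
yearly_share = pd.crosstab(rows["Year"], rows["Genre"], normalize="index") * 100

print(yearly_share.round(1))
```

Because the columns of the crosstab are always in the same order, a table like this also avoids the risk of genres swapping colors between years.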
8. Visualization – Bar charts for top authors in fiction and non-fiction categories
# Count bestseller-list appearances per author and genre
author_genre_counts = df.groupby(['Author', 'Genre']).agg({'Name': 'count'}).unstack()
best_nf_authors = author_genre_counts['Name', 'Non Fiction'].sort_values(ascending=False)[:11]
best_f_authors = author_genre_counts['Name', 'Fiction'].sort_values(ascending=False)[:11]
with plt.style.context('Solarize_Light2'):
    fig, ax = plt.subplots(1, 2, figsize=(10, 8))
    # Create a custom color palette
    colors = ['#ff9999', '#66b3ff']
    # Plot the horizontal bar chart for Non Fiction Authors
    ax[0].barh(y=best_nf_authors.index, width=best_nf_authors.values, color=colors[0])
    ax[0].invert_xaxis()
    ax[0].yaxis.tick_left()
    ax[0].set_xticks(np.arange(max(best_nf_authors.values) + 1))
    ax[0].set_yticklabels(best_nf_authors.index, fontsize=12, fontweight='semibold')
    ax[0].set_xlabel('Number of Appearances', fontsize=12)
    ax[0].set_title('Top Non Fiction Authors', fontsize=14)
    # Plot the horizontal bar chart for Fiction Authors
    ax[1].barh(y=best_f_authors.index, width=best_f_authors.values, color=colors[1])
    ax[1].yaxis.tick_right()
    ax[1].set_xticks(np.arange(max(best_f_authors.values) + 1))
    ax[1].set_yticklabels(best_f_authors.index, fontsize=12, fontweight='semibold')
    ax[1].set_title('Top Fiction Authors', fontsize=14)
    ax[1].set_xlabel('Number of Appearances', fontsize=12)
    # Set the legend
    fig.legend(['Non Fiction', 'Fiction'], fontsize=12)
    # Remove spines
    ax[0].spines['right'].set_visible(False)
    ax[0].spines['top'].set_visible(False)
    ax[1].spines['left'].set_visible(False)
    ax[1].spines['top'].set_visible(False)
    # Adjust space between subplots
    plt.subplots_adjust(wspace=0.4)
    for a in ax:
        # Draw a thin gray line at the bottom and hide the y tick marks
        a.spines['bottom'].set_linewidth(0.5)
        a.spines['bottom'].set_color('gray')
        a.tick_params(axis='y', which='both', length=0)
        # Invert the y-axis to show the highest count at the top
        a.invert_yaxis()
    plt.show()
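The per-genre ranking behind these bar charts can be sketched more compactly with `groupby(...).size()` and `nlargest`. This is an alternative formulation under stated assumptions, not the project's own code: the rows are made up, and `top_fiction` and `top_non_fiction` are hypothetical names.

```python
import pandas as pd

# Made-up list entries; each row is one appearance on the bestseller list.
rows = pd.DataFrame({
    "Author": ["Author X", "Author X", "Author Y",
               "Author Z", "Author Z", "Author Z"],
    "Genre": ["Fiction", "Fiction", "Fiction",
              "Non Fiction", "Non Fiction", "Non Fiction"],
    "Name": ["Book A", "Book A", "Book B", "Book C", "Book C", "Book D"],
})

# Appearances per (genre, author) pair.
counts = rows.groupby(["Genre", "Author"]).size()

# Top authors within each genre, highest count first.
top_fiction = counts.loc["Fiction"].nlargest(2)
top_non_fiction = counts.loc["Non Fiction"].nlargest(2)

print(top_fiction)
print(top_non_fiction)
```

Keeping the two rankings in separately named variables makes it harder to plot one genre's data under the other genre's title.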
9. Visualization – Horizontal bar charts for top authors based on appearances, unique books, and total reviews
n_best = 20
top_authors = df.Author.value_counts().nlargest(n_best)
no_dup = df.drop_duplicates('Name')
fig, ax = plt.subplots(1, 3, figsize=(11, 10), sharey=True)
color = sns.color_palette("hls", n_best)
ax[0].hlines(y=top_authors.index, xmin=0, xmax=top_authors.values, color=color, linestyles='dashed')
ax[0].plot(top_authors.values, top_authors.index, 'go', markersize=9)
ax[0].set_xlabel('Number of appearances')
ax[0].set_xticks(np.arange(top_authors.values.max() + 1))
book_count = []
total_reviews = []
# For each top author, count unique books and sum their reviews (in thousands)
for name in top_authors.index:
    book_count.append(len(no_dup[no_dup.Author == name]['Name']))
    total_reviews.append(no_dup[no_dup.Author == name]['Reviews'].sum() / 1000)
ax[1].hlines(y=top_authors.index, xmin=0, xmax=book_count, color=color, linestyles='dashed')
ax[1].plot(book_count, top_authors.index, 'go', markersize=9)
ax[1].set_xlabel('Number of unique books')
ax[1].set_xticks(np.arange(max(book_count) + 1))
ax[1].set_title('Unique books')
ax[2].barh(y=top_authors.index, width=total_reviews, color=color, edgecolor='black', height=0.7)
for name, val in zip(top_authors.index, total_reviews):
    ax[2].text(val + 2, name, val)
ax[2].set_xlabel("Total Reviews (in 1000's)")
plt.show()
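The three quantities plotted in this step (appearances, unique books, and total reviews) can also be collected into one table. The sketch below is one possible way to do that with named aggregation, assuming the same column names as the article; the rows are made up and `per_author` is a hypothetical name.

```python
import pandas as pd

# Made-up list entries: "Book A" appears twice for Author X.
rows = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C"],
    "Author": ["Author X", "Author X", "Author X", "Author Y"],
    "Reviews": [1000, 1000, 500, 2000],
})

# Appearances count every row, including repeat years.
appearances = rows["Author"].value_counts()

# Unique books and total reviews are computed after dropping duplicate titles,
# so a book's reviews are only counted once.
unique_rows = rows.drop_duplicates("Name")
per_author = unique_rows.groupby("Author").agg(
    unique_books=("Name", "count"),
    total_reviews_k=("Reviews", lambda s: s.sum() / 1000),
)
per_author["appearances"] = appearances  # aligned on the author index

print(per_author)
```

As in the step above, `drop_duplicates` keeps the first row for each title, so the review count used is the one recorded for the book's first appearance in the table.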
Summary
This project on PythonGeeks analyzed the Amazon Top 50 Bestselling Books from 2009 to 2019, using a dataset of 550 books categorized as fiction and non-fiction. The objective was to understand book market trends over the past decade. The analysis involved data preprocessing, including column renaming and author name cleaning, along with the addition of features like book name length and punctuation percentage.
Visualizations, such as pie charts of the genre distribution for unique books overall and for the bestseller list in each year, provided insights into genre popularity. The project also explored top authors in fiction and non-fiction categories using horizontal bar charts based on appearances, unique books, and total reviews.
Overall, the analysis enhanced our understanding of book sales and popularity factors, benefiting publishers, authors, and marketers in the competitive book industry.




