Amazon Bestselling Books Analysis Project using Machine Learning


The Amazon Book Analysis project aims to analyze a dataset of best-selling books on Amazon, utilizing libraries like NumPy, Matplotlib, Seaborn, and Pandas to create visualizations. It explores genre distribution, author popularity, and book characteristics.

The analysis includes genre distribution for unique books from 2009 to 2019, genre trends in two time periods, top authors in fiction and non-fiction, and the number of unique books and total reviews for these authors. By providing valuable insights into genre trends, popular authors, and successful book characteristics, this project sheds light on the best-selling book market on Amazon.

About Dataset Amazon Book Analysis

The dataset contains 550 entries (the top 50 bestsellers for each year from 2009 to 2019), and each book is categorized as fiction or non-fiction based on Goodreads. It has seven columns: the name of the book (Name), its author (Author), the Amazon user rating (User Rating), the number of written reviews on Amazon (Reviews), the price of the book (Price), the year it ranked on the bestseller list (Year), and whether it is fiction or non-fiction (Genre). Because a book can make the list in more than one year, some titles appear several times. For our data analysis, we will focus on Genre as a categorical variable and on User Rating, Reviews, and Price as numerical variables.

Tools and libraries used

Pandas
NumPy
Seaborn
Matplotlib
Scikit-learn

Download Machine Learning Amazon Bestselling Books Analysis Project

Please download the source code of Machine Learning Amazon Bestselling Books Analysis Project: Machine Learning Amazon Bestselling Books Analysis Project

Steps to Analyse Amazon Bestselling Books in Machine Learning

1. Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

2. Reading the dataset

# Adjust this path to wherever the CSV file is saved on your machine
df = pd.read_csv(r'C:\Users\vaish\Downloads\Amazon Best Selling Books Analysis\bestsellers with categories.csv')
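
Before going further, it can help to confirm that the file loaded with the expected columns and without missing values. The sketch below is only an illustration: it uses a small hand-made frame (`sample`, with made-up rows) in place of the real CSV, and the column names are taken from the dataset description above.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from pd.read_csv(...)
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A"],
    "Author": ["Author X", "Author Y", "Author X"],
    "User Rating": [4.7, 4.5, 4.7],
    "Reviews": [1200, 850, 1200],
    "Price": [8, 15, 9],
    "Year": [2009, 2009, 2010],
    "Genre": ["Fiction", "Non Fiction", "Fiction"],
})

expected_cols = ["Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre"]

# Columns named in the description but absent from the frame
missing_cols = [c for c in expected_cols if c not in sample.columns]

# Missing values per column
null_counts = sample.isnull().sum()

# Titles that show up again in a later row (for example, in another year)
repeat_titles = sample["Name"].duplicated().sum()

print(sample.shape)
print(missing_cols)
print(null_counts.to_dict())
print(repeat_titles)
```

Running the same checks on `df` should show whether any column is missing and how many rows are repeat appearances of a title.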

3. Renaming a column in the DataFrame

df.rename(columns={"User Rating": "User_Rating"}, inplace=True)
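
A column name without a space is more convenient to work with, because it can be used as an attribute and written directly in a `query` string. The following is a minimal sketch on a made-up two-row frame (`sample` is a hypothetical stand-in for `df`).

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
sample = pd.DataFrame({"Name": ["Book A", "Book B"], "User Rating": [4.8, 4.3]})
sample = sample.rename(columns={"User Rating": "User_Rating"})

# Attribute access and query strings both work with the renamed column
high_rated = sample[sample.User_Rating >= 4.5]
same_rows = sample.query("User_Rating >= 4.5")

print(high_rated["Name"].tolist())
```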

4. Data cleaning and manipulation

df.loc[df.Author == 'J. K. Rowling', 'Author'] = 'J.K. Rowling'
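
The fix above targets one author whose name is spelled two ways. One possible way to look for such variants, shown here only as a sketch, is to compare names after stripping spaces and periods. The series `authors` is made-up sample data, and the normalization rule is an assumption that may need adjusting for other kinds of inconsistency.

```python
import pandas as pd

# Hypothetical author column with one name spelled two ways
authors = pd.Series(
    ["J.K. Rowling", "J. K. Rowling", "Author X", "Author X"], name="Author"
)

# Normalize: lowercase and drop whitespace and periods, so spacing variants collide
key = authors.str.lower().str.replace(r"[\s.]", "", regex=True)

# For each normalized key, collect the distinct raw spellings that map to it
variants = authors.groupby(key).unique()

# Keys with more than one raw spelling are candidates for cleaning
suspects = variants[variants.apply(len) > 1]

print(suspects.to_dict())
```

Each entry in `suspects` lists the spellings that would need to be merged with a `df.loc[...]` assignment like the one above.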

5. Adding new columns to the DataFrame

# Length of the title, not counting spaces
df['name_len'] = df['Name'].apply(lambda x: len(x) - x.count(" "))

punctuations = string.punctuation
def count_punc(text):
    # Percentage of the non-space characters in the title that are punctuation
    count = sum(1 for char in text if char in punctuations)
    return round(count / (len(text) - text.count(" ")) * 100, 3)

df['punc%'] = df['Name'].apply(count_punc)
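
To make the two new features concrete, here is a small worked example on a single made-up title. The helper names `non_space_len` and `punc_percent` are hypothetical, and the guard for empty titles is an addition that the code above does not have.

```python
import string


def non_space_len(text):
    """Number of characters in text, not counting spaces."""
    return len(text) - text.count(" ")


def punc_percent(text):
    """Punctuation characters as a percentage of non-space characters."""
    n = non_space_len(text)
    if n == 0:  # guard against empty or all-space titles
        return 0.0
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / n * 100, 3)


title = "Hello, World!"  # made-up title: 13 characters, 1 space, 2 punctuation marks

print(non_space_len(title))  # 12
print(punc_percent(title))   # 16.667
```

The title has 12 non-space characters, two of which (the comma and the exclamation mark) are punctuation, so the percentage is 2 / 12 * 100, rounded to three decimals.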

6. Visualization – Pie chart showing the distribution of genres for unique books

no_dup = df.drop_duplicates('Name')
g_count = no_dup['Genre'].value_counts()
fig, ax = plt.subplots(figsize=(8, 8))
genre_col = ['navy', 'crimson']
explode = [0.1, 0]  # Explode the first slice (optional)

wedges, texts, autotexts = ax.pie(g_count.values, explode=explode, labels=g_count.index, autopct='%.2f%%',
                                  startangle=90, textprops={'size': 14}, colors=genre_col, pctdistance=0.85)

# Customize wedges and texts
for wedge in wedges:
    wedge.set_edgecolor('white')
    wedge.set_linewidth(1)

for text, autotext in zip(texts, autotexts):
    text.set_fontsize(14)
    text.set_fontweight('bold')
    autotext.set_fontsize(12)
    autotext.set_color('white')

# Add a central circle
center_circle = plt.Circle((0, 0), 0.7, color='white')
ax.add_artist(center_circle)

# Title and aspect ratio
ax.set_title('Distribution of Genre for all unique books from 2009 to 2019', fontsize=20)
ax.axis('equal')
plt.show()

Output: [Figure: donut chart of the Fiction and Non Fiction shares among unique books, 2009 to 2019]
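
The percentages drawn on the chart can also be computed directly, which is handy for quoting exact figures. This is a minimal sketch on made-up rows (`sample` stands in for `df`); it mirrors the `drop_duplicates('Name')` step used for the chart.

```python
import pandas as pd

# Hypothetical rows: "Book A" appears twice, as a title would if listed in two years
sample = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C", "Book D"],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Non Fiction"],
})

# Keep one row per title, then convert genre counts to percentages
unique_books = sample.drop_duplicates("Name")
share = (unique_books["Genre"].value_counts(normalize=True) * 100).round(2)

print(share.to_dict())
```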

7. Visualization – Pie charts showing genre distribution for each year

y1 = np.arange(2009, 2014)
y2 = np.arange(2014, 2020)
g_count = df['Genre'].value_counts()

fig, ax = plt.subplots(2, 6, figsize=(12, 6))

# Adjust the spacing between subplots
fig.subplots_adjust(hspace=0.4, wspace=0.3)

# Create a custom color palette for the pie charts
colors = ['#ff9999', '#66b3ff']

# Set the font size for the titles
title_fontsize = 14

# Iterate over the subplots and create the pie charts
for i, year in enumerate(y1):
    # Reindex so every pie lists the genres in the same order as the legend
    counts = df[df['Year'] == year]['Genre'].value_counts().reindex(g_count.index, fill_value=0)

    ax[0, i+1].pie(x=counts.values, labels=None, autopct='%.1f%%',
                   startangle=90, textprops={'size': 12, 'color': 'white'},
                   pctdistance=0.7, colors=colors, radius=1.1)

    ax[0, i+1].set_title(year, color='darkred', fontsize=title_fontsize)

for i, year in enumerate(y2):
    counts = df[df['Year'] == year]['Genre'].value_counts().reindex(g_count.index, fill_value=0)

    ax[1, i].pie(x=counts.values, labels=None, autopct='%.1f%%',
                 startangle=90, textprops={'size': 12, 'color': 'white'},
                 pctdistance=0.7, colors=colors, radius=1.1)

    ax[1, i].set_title(year, color='darkred', fontsize=title_fontsize)

# Draw the overall genre split in the first cell and set its title
ax[0, 0].pie(x=g_count.values, labels=None, autopct='%.1f%%',
             startangle=90, textprops={'size': 12, 'color': 'white'},
             pctdistance=0.7, colors=colors, radius=1.1)
overall_title = ax[0, 0].set_title('2009 - 2019\n(Overall)', color='darkgreen', fontsize=title_fontsize)

# Create a legend for the genres
fig.legend(g_count.index, loc='center right', fontsize=12)

# Remove unnecessary spines and labels
for row in ax:
    for col in row:
        col.axis('equal')
        col.spines['top'].set_visible(False)
        col.spines['right'].set_visible(False)
        col.spines['bottom'].set_visible(False)
        col.spines['left'].set_visible(False)
        col.set_xticks([])
        col.set_yticks([])

# Adjust the position of the subplots and title
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()

Output: [Figure: grid of pie charts showing the genre split overall and for each year from 2009 to 2019]
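
The same year-by-year comparison can be expressed as a table, which makes trends easier to read than a grid of pies. The sketch below uses `pd.crosstab` on made-up rows (`sample` is a hypothetical stand-in for `df`).

```python
import pandas as pd

# Hypothetical rows covering two years
sample = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2009, 2014, 2014, 2014, 2014],
    "Genre": ["Fiction", "Non Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Fiction", "Fiction", "Non Fiction"],
})

# Rows are years, columns are genres, values are percentages within each year
by_year = pd.crosstab(sample["Year"], sample["Genre"], normalize="index") * 100

print(by_year.round(1))
```

Because the genre columns stay in a fixed order for every year, the table does not depend on which genre happens to be more common in a given year.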

8. Visualization – Bar charts for top authors in fiction and non-fiction categories

# Count bestseller-list appearances per author, split by genre
author_counts = df.groupby(['Author', 'Genre']).agg({'Name': 'count'}).unstack()['Name']
best_nf_authors = author_counts['Non Fiction'].sort_values(ascending=False)[:11]
best_f_authors = author_counts['Fiction'].sort_values(ascending=False)[:11]

with plt.style.context('Solarize_Light2'):
    fig, ax = plt.subplots(1, 2, figsize=(10, 8))

    # Create a custom color palette
    colors = ['#ff9999', '#66b3ff']

    # Plot the horizontal bar chart for Non Fiction Authors
    nf_bars = ax[0].barh(y=best_nf_authors.index, width=best_nf_authors.values, color=colors[0])
    ax[0].invert_xaxis()
    ax[0].yaxis.tick_left()
    ax[0].set_xticks(np.arange(max(best_nf_authors.values) + 1))
    ax[0].set_yticklabels(best_nf_authors.index, fontsize=12, fontweight='semibold')
    ax[0].set_xlabel('Number of Appearances', fontsize=12)
    ax[0].set_title('Top Non Fiction Authors', fontsize=14)

    # Plot the horizontal bar chart for Fiction Authors
    f_bars = ax[1].barh(y=best_f_authors.index, width=best_f_authors.values, color=colors[1])
    ax[1].yaxis.tick_right()
    ax[1].set_xticks(np.arange(max(best_f_authors.values) + 1))
    ax[1].set_yticklabels(best_f_authors.index, fontsize=12, fontweight='semibold')
    ax[1].set_title('Top Fiction Authors', fontsize=14)
    ax[1].set_xlabel('Number of Appearances', fontsize=12)

    # Set the legend, passing the bar containers explicitly so each label gets its own color
    fig.legend([nf_bars, f_bars], ['Non Fiction', 'Fiction'], fontsize=12)

    # Remove spines
    ax[0].spines['right'].set_visible(False)
    ax[0].spines['top'].set_visible(False)
    ax[1].spines['left'].set_visible(False)
    ax[1].spines['top'].set_visible(False)

    # Adjust space between subplots
    plt.subplots_adjust(wspace=0.4)

    # Add a horizontal line at the bottom
    for a in ax:
        a.spines['bottom'].set_linewidth(0.5)
        a.spines['bottom'].set_color('gray')
        a.tick_params(axis='y', which='both', length=0)

    # Invert both y-axes to show the highest count at the top
    ax[0].invert_yaxis()
    ax[1].invert_yaxis()

    plt.show()

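Output: [Figure: side-by-side horizontal bar charts of the top Non Fiction and top Fiction authors by number of appearances]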
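
The per-genre ranking behind these charts can also be written with a boolean filter and `value_counts`, which some readers may find easier to follow than `groupby` plus `unstack`. This is a sketch under that assumption; `top_authors` is a hypothetical helper and `sample` is made-up data.

```python
import pandas as pd

# Hypothetical rows: one row per bestseller-list appearance
sample = pd.DataFrame({
    "Author": ["Author X", "Author X", "Author Y", "Author Z", "Author Z", "Author Z"],
    "Genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Non Fiction", "Non Fiction"],
})


def top_authors(frame, genre, n=11):
    """Authors with the most bestseller-list rows in one genre."""
    return frame.loc[frame["Genre"] == genre, "Author"].value_counts().head(n)


top_fiction = top_authors(sample, "Fiction")
top_non_fiction = top_authors(sample, "Non Fiction")

print(top_fiction.to_dict())
print(top_non_fiction.to_dict())
```

Keeping one series per genre, each with its own name, makes it harder to plot the wrong genre on a given axis.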

9. Visualization – Horizontal bar charts for top authors based on appearances, unique books, and total reviews

n_best = 20
top_authors = df.Author.value_counts().nlargest(n_best)
no_dup = df.drop_duplicates('Name')

fig, ax = plt.subplots(1, 3, figsize=(11, 10), sharey=True)
color = sns.color_palette("hls", n_best)

ax[0].hlines(y=top_authors.index, xmin=0, xmax=top_authors.values, color=color, linestyles='dashed')
ax[0].plot(top_authors.values, top_authors.index, 'go', markersize=9)
ax[0].set_xlabel('Number of appearances')
ax[0].set_xticks(np.arange(top_authors.values.max() + 1))

book_count = []
total_reviews = []

for name, col in zip(top_authors.index, color):
    book_count.append(len(no_dup[no_dup.Author == name]['Name']))
    total_reviews.append(no_dup[no_dup.Author == name]['Reviews'].sum() / 1000)

ax[1].hlines(y=top_authors.index, xmin=0, xmax=book_count, color=color, linestyles='dashed')
ax[1].plot(book_count, top_authors.index, 'go', markersize=9)
ax[1].set_xlabel('Number of unique books')
ax[1].set_xticks(np.arange(max(book_count) + 1))
ax[1].set_title('Unique books')

ax[2].barh(y=top_authors.index, width=total_reviews, color=color, edgecolor='black', height=0.7)

for name, val in zip(top_authors.index, total_reviews):
    ax[2].text(val + 2, name, val)

ax[2].set_xlabel("Total Reviews (in 1000's)")
plt.show()

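Output: [Figure: three panels for the top 20 authors showing number of appearances, number of unique books, and total reviews in thousands]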
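
The three quantities plotted above (appearances, unique books, and total reviews) can be collected into one table with `groupby` and named aggregation instead of a Python loop. The following is a sketch on made-up rows; the column names `unique_books` and `total_reviews_k` are illustrative choices, not part of the original code.

```python
import pandas as pd

# Hypothetical rows: "Book A" is listed in two years
sample = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C"],
    "Author": ["Author X", "Author X", "Author X", "Author Y"],
    "Reviews": [1000, 1000, 500, 2000],
})

# Appearances count every row, including repeat listings of the same title
appearances = sample["Author"].value_counts().rename("appearances")

# Count each title once, as the plotting code does with drop_duplicates('Name')
unique = sample.drop_duplicates("Name")
per_author = unique.groupby("Author").agg(
    unique_books=("Name", "count"),
    total_reviews_k=("Reviews", lambda s: s.sum() / 1000),
)

summary = per_author.join(appearances).sort_values("appearances", ascending=False)

print(summary)
```

Note that `drop_duplicates` keeps the first row for each title, so if a title's review count differs between years, only the first value is included in the total.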

Summary

This project on PythonGeeks analyzed the Amazon Top 50 Bestselling Books from 2009 to 2019, using a dataset of 550 books categorized as fiction and non-fiction. The objective was to understand book market trends over the past decade. The analysis involved data preprocessing, including column renaming and author name cleaning, along with the addition of features like book name length and punctuation percentage.

Visualizations, such as pie charts of the genre distribution across unique books and within each year, provided insights into genre popularity. The project also compared the top fiction and non-fiction authors with horizontal bar charts, and ranked the most frequent authors by appearances, unique books, and total reviews.

Overall, the analysis enhanced our understanding of book sales and popularity factors, benefiting publishers, authors, and marketers in the competitive book industry.

