Comparing groups within a dataset is a common part of data analysis. In this post, we will compare groups visually using the seaborn and matplotlib libraries in Python.
Libraries & Data Prep
First, we need to load our libraries and prepare our data. Below is the code for the libraries we need.
import seaborn as sns
from pydataset import data
import matplotlib.pyplot as plt
The first and last lines load the libraries we need for data visualization. The second line loads the data() function from pydataset, where our data will come from. Below is the code for loading our data.
df=data('Prestige')
df.head()

Our data is the Prestige dataset from the data() function, loaded as the object df. The .head() method displays the first five rows of our dataset. Each row is a job, described by the columns education, income, women, prestige, census, and type. We are now ready to create our first comparison.
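Before comparing groups, it can help to check how many rows fall under each job type. The sketch below uses a small made-up DataFrame named demo (not the Prestige data) so that it runs on its own; with the real data, the same calls would apply to df. It also illustrates one pandas behavior that matters for the filters used later in this post: a row with a missing type passes a != 'prof' filter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows for illustration; with the real data, use df instead
demo = pd.DataFrame({
    "type": ["prof", "bc", "wc", "prof", np.nan, "bc"],
    "income": [12000, 4000, 5000, 9000, 3000, 4500],
})

# Count rows per job type, keeping missing values visible
type_counts = demo["type"].value_counts(dropna=False)
print(type_counts)

# A missing type is "not equal" to 'prof', so it lands in the non-prof subset
non_prof = demo[demo["type"] != "prof"]
print(len(non_prof))

# To leave out rows with a missing type, filter them explicitly
non_prof_known = demo[demo["type"].notna() & (demo["type"] != "prof")]
print(len(non_prof_known))
```

If the two lengths differ on your data, the "non-prof" group includes jobs with no recorded type, which may or may not be what you want.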
Histogram Comparison
Overlapping distributions on the same axes make it easy to compare their shapes. The code below uses seaborn's kdeplot() function, which draws a kernel density estimate (a smoothed version of a histogram), to compare the income distribution by job type.
# Filter dataset for prof
sns.kdeplot(df[df.type == 'prof'].income,
            # Shade under the KDE curve and add a helpful label
            fill=True,
            label='prof')
# Filter dataset for non prof
sns.kdeplot(df[df.type != 'prof'].income,
            # Again, shade under the KDE curve and add a helpful label
            fill=True,
            label='non-prof')
# Draw the legend so the labels appear on the plot
plt.legend()
plt.show()

The first call draws the curve for professional jobs (blue by default) by filtering the rows where type is “prof” and selecting income. The second call repeats the process for the jobs that are not “prof.” The plot shows a broader spread of income for jobs whose type is professional. A t-test or ANOVA could check formally whether the groups differ, but a visual like this often helps determine whether statistical testing is worthwhile.
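As a rough sketch of the follow-up testing mentioned above, the snippet below runs Welch's t-test (which compares group means without assuming equal variances) using scipy.stats.ttest_ind. Because the plot is about spread as much as location, it also runs Levene's test, which compares variances; that test is an addition here, not something the original analysis used. The demo DataFrame holds made-up numbers for illustration only, so the printed values say nothing about the Prestige data.

```python
import pandas as pd
from scipy import stats

# Hypothetical incomes for illustration only; swap in the Prestige df for real results
demo = pd.DataFrame({
    "type": ["prof", "prof", "prof", "prof", "bc", "bc", "wc", "wc"],
    "income": [12000, 9500, 15000, 8000, 4000, 5200, 4800, 6100],
})

# Split income into the same two groups used in the plot,
# leaving out any rows where type is missing
is_prof = demo["type"] == "prof"
prof_income = demo.loc[is_prof, "income"]
other_income = demo.loc[demo["type"].notna() & ~is_prof, "income"]

# Welch's t-test: do the group means differ?
result = stats.ttest_ind(prof_income, other_income, equal_var=False)
print("t-test:", result.statistic, result.pvalue)

# Levene's test: do the group variances (spread) differ?
spread = stats.levene(prof_income, other_income)
print("Levene:", spread.statistic, spread.pvalue)
```

With only a handful of rows per group, neither test has much power, so treat this as a template rather than a result.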
Rug Plot
A rug plot serves a similar purpose to a histogram. The main difference is that a rug plot draws a tick along the x-axis for every observation, which shows exactly where the data points are located. This makes isolated points easy to spot, which can help identify outliers to remove when necessary. Below is the code and an explanation of how to create a rug plot.
sns.kdeplot(df[df.type == 'prof'].income,
            label='prof',
            # Use green for the prof group
            color='green')
sns.kdeplot(df[df.type != 'prof'].income,
            label='Other types',
            # Use red for all other job types
            color='red')
# Add the rug ticks in matching colors
# (labels are set on the KDE curves only, so the legend has one entry per group)
sns.rugplot(df[df.type == 'prof'].income,
            color='green')
sns.rugplot(df[df.type != 'prof'].income,
            color='red')
# Draw the legend so the labels appear on the plot
plt.legend()
plt.show()

The first two calls are nearly the same as in the previous example and draw the density curves. The main differences are that the curves are not filled and that the colors are set explicitly. The new code is the two calls to the rugplot() function. Its arguments are almost the same as those for kdeplot(), so there is little to discuss.
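Once the rug ticks suggest that a few points sit far from the rest, one common way to flag them numerically is the 1.5 × IQR rule. The sketch below is only one possible approach, not something the plots above require; the helper name iqr_outlier_mask and the sample incomes are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical incomes for illustration; with the real data, pass df['income']
incomes = pd.Series([4000, 4500, 5000, 5200, 6100, 6500, 7000, 25000])

mask = iqr_outlier_mask(incomes)
print(incomes[mask])    # values flagged as potential outliers
print(incomes[~mask])   # values kept
```

Whether a flagged point should actually be removed is a judgment call; a very high income may be a genuine observation rather than an error.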
Swarm Plot
The swarm plot provides a different way to visualize a distribution. Like the histogram, it can be useful for comparison. In the code below, we use a swarm plot to look at the distribution of education by job type.
# Plot beeswarm
sns.swarmplot(y='type',
              x='education',
              data=df,
              # Decrease the size of the points to avoid crowding
              size=3)
# Give a descriptive title
plt.title('Education and Type')
plt.show()

The code is straightforward. You call the swarmplot() function, pass the column names for the x and y axes, and pass the DataFrame through the data argument. The plot shows that jobs of type prof tend to have a higher level of education than the other job types.
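A quick numeric summary can back up what the swarm plot suggests. The sketch below groups education by job type and reports the count, mean, and median for each group. The demo DataFrame contains made-up values so the snippet runs on its own; with the real data, the same groupby call would apply to df.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the Prestige data
demo = pd.DataFrame({
    "type": ["prof", "prof", "bc", "bc", "wc", "wc"],
    "education": [14.5, 15.5, 8.0, 9.0, 11.0, 12.0],
})

# Summarize education within each job type
summary = demo.groupby("type")["education"].agg(["count", "mean", "median"])

# Sort so the group with the highest mean education comes first
print(summary.sort_values("mean", ascending=False))
```

Note that groupby drops rows with a missing type by default, so the counts may add up to fewer rows than the full dataset.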
Conclusion
People have their preferences for how they want to view data. The point here was not to state that one method was better than another. Rather, the goal was to raise awareness of the available options and to show how they can be created using Python.
