In this post, we will look at how to highlight data points in a scatterplot. We will cover two methods for doing this: hard coding the point to highlight and selecting it programmatically.
Libraries and Data Preparations
First, we will load our libraries and prepare our data. Below are the libraries we will use.
import seaborn as sns
import matplotlib.pyplot as plt
from pydataset import data
The first two lines import the libraries for our data visualization. The last line imports the function we will use to pull the data. Below, we load the data into an object called “df” and take a quick peek at it.
df=data('Prestige')
df.head()

The data set we are using is called “Prestige”, and we load it using the data() function. It lists various jobs along with their education, income, women, prestige, census, and type information. Next, we will look at how to highlight a specific data point in a scatterplot.
Hard Coding
Hard coding is when you manually pick a specific data point to highlight. Below is the code and output for this.
df_prof = df[df.type == 'prof']
# Color the point with the highest income orangered and every other point lightgray
prof_colors = ['orangered' if (education == 12.26) and (income == 25879) else 'lightgray'
               for education, income in zip(df_prof.education, df_prof.income)]
sns.regplot(x='education',
            y='income',
            data=df_prof,
            fit_reg=False,
            # Pass the per-point colors and the transparency to the scatter call
            scatter_kws={'facecolors': prof_colors, 'alpha': 0.7})
plt.show()

We did the following to make the plot above.
- We subset the data so that it only contains jobs of type “prof” and save this as an object called df_prof. We did this to reduce the number of data points and make it easier to see what we are doing.
- Next, we make an object called prof_colors, a list with one color per row. The if else expression assigns “orangered” to a row whose education and income match the hard-coded values and “lightgray” to every other row. The for clause loops over the education and income pairs, so the if else expression is applied to each row in turn. Since this is hard to understand, below is a visual of the prof_colors object.

Notice that the second row is labeled “orangered”. This is because this row matches the criteria for the values of education and income. We will use this object to set the colors of our dots.
- The next block of code makes the visualization. Most of this is self-explanatory. You set your x and y values to education and income. The fit_reg argument is set to False because we do not want a regression line. The scatter_kws argument passes the dot colors and the alpha value, which sets the level of transparency of the dots.
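To make the prof_colors step more concrete, here is a minimal sketch that runs the same if else pattern on three made-up rows. The lists education_values and income_values are hypothetical stand-ins for the df_prof columns. Only the second row reuses the values hard coded in the post.

```python
# Three hypothetical (education, income) rows; only the second one uses the
# values hard coded in the post (12.26 and 25879).
education_values = [15.0, 12.26, 14.5]
income_values = [12000, 25879, 9000]

# Same if/else-inside-a-comprehension pattern used to build prof_colors.
prof_colors = [
    'orangered' if (education == 12.26) and (income == 25879) else 'lightgray'
    for education, income in zip(education_values, income_values)
]

print(prof_colors)  # ['lightgray', 'orangered', 'lightgray']
```

The result is a plain list with one color per row, in the same order as the rows, which is why it can be handed to the plot as the dot colors. The exact comparison works here because the values are typed in directly. For values that come out of a calculation, a tolerance-based check such as math.isclose() may be safer.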
Setting the highlighted point manually works well if your data is static. However, if your data is dynamic, you want to highlight points programmatically so that the highlighted point changes as the data does.
Programmatically
The code is similar to the above, with a few minor changes. Below is the code, followed by the output and then the explanation.
# Copy the subset so that adding a column does not raise a SettingWithCopyWarning
df_prof = df[df.type == 'prof'].copy()
# Find the highest income
max_income = df_prof.income.max()
# Make a column that denotes which occupation has the highest income
df_prof['point_type'] = ['Highest Income' if income == max_income else 'Others' for income in df_prof.income]
# Encode the hue of the points with the generated column
sns.scatterplot(x='education',
                y='income',
                hue='point_type',
                data=df_prof)
plt.show()

Here is what we did.
- We start by subsetting the data as before for type “prof”.
- We then create an object called max_income, which holds the highest income in df_prof, found using the max() method.
- This time we create a new column in our data called “point_type” using an if else expression and a for loop over the income column. If a row's income matches the highest income, it is labeled “Highest Income”, and every other row is labeled “Others”.
- Lastly, we create our scatterplot. We set the x and y values, and we set the hue to “point_type”, the new column we just created.
With this second method, the highlighted data point updates automatically whenever the highest income in the data changes.
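As a rough illustration of why the programmatic approach follows the data, the sketch below applies the same labeling logic to two versions of a small income list. The function name label_points and the income values are hypothetical and are not part of the Prestige data.

```python
def label_points(incomes):
    """Label the maximum value 'Highest Income' and everything else 'Others'."""
    max_income = max(incomes)
    return ['Highest Income' if income == max_income else 'Others'
            for income in incomes]


# Hypothetical incomes before and after the third value is updated.
before = [12000, 25879, 9000]
after = [12000, 25879, 31000]

labels_before = label_points(before)
labels_after = label_points(after)

print(labels_before)  # ['Others', 'Highest Income', 'Others']
print(labels_after)   # ['Others', 'Others', 'Highest Income']
```

No value is hard coded, so the label moves to the new maximum without any change to the code. Note that if several rows tie for the maximum, all of them receive the “Highest Income” label.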
Conclusion
Highlighting data points is sometimes necessary when creating data visualizations. The examples above provide two different ways to do this. Which method is best depends on the context: hard coding suits static data, while the programmatic approach suits data that changes.