Python for Data Privacy VIDEO
Generating fake data is one way to protect an individual’s privacy. The video below provides examples of how to do this using Python.
Privacy of Continuous Data with Python
There are several ways that an individual’s privacy can be protected when dealing with continuous data. In this post, we will look at how protecting privacy can be accomplished using Python.
Libraries
We will begin by loading the necessary libraries. Below is the code.
from pydataset import data
import pandas as pd
The library setup is simple. We are importing the data() function from pydataset. This will allow us to load the data we will use in this post. Below we will address the data preparation. We are also importing pandas to make a frequency table later on.
Data Preparation
The data preparation is also simple. We will load the dataset called “SLID” using the data() function into an object called df. We will then view the df object using the .head() method. Below is the code followed by the output.
df=data('SLID')
df.head()

The data set has five variables. The focus of this post will be on the manipulation of the “age” variable. We will now make a histogram of the data before we manipulate it.
View of Original Histogram
Below is the code and output for the histogram of the “age” variable. The reason for making this visual is to provide a “before” picture of the data before changes are made.
df['age'].hist(bins=15)

We will now move to our first transformation which will involve changing the data to a categorical variable.
Change to Categorical
Changing continuous data to categorical is one way of protecting privacy as it removes individual values and replaces them with group values. Below is an example of how to do this with the code and the first few rows of the modified data.
df['age'] = df['age'].apply(lambda x: ">=40" if x >= 40 else "<40")
df.head()

We are overwriting the “age” variable in the code using an anonymous function. On the “age” variable, we use the .apply() method to replace values of 40 and above with “>=40” and values below 40 with “<40”. The data is now broken down into two groups: those 40 and older and those under 40. Below is a frequency table of the transformed “age” variable.
df['age'].value_counts()
age
>=40 3984
<40 3441
Name: count, dtype: int64
The .value_counts() method comes from the pandas library. There are now two groups. The table above represents a major transformation from the original histogram. Below is the code and output for a bar graph of this transformation.
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="age", data=df)
plt.show()

This was a simple example. You do not have to limit yourself to only two groups when dividing your data. The number of groups depends on the context and the purpose of the technique.
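For instance, a minimal sketch using pandas’ cut() function to bin “age” into four groups is below; the cut points and labels are my own illustrative choices, not fixed rules.
from pydataset import data
import pandas as pd

df = data('SLID')
# Bin "age" into four labeled groups; the cut points and labels
# here are illustrative choices
bins = [0, 30, 45, 60, 120]
labels = ["<30", "30-44", "45-59", ">=60"]
df['age'] = pd.cut(df['age'], bins=bins, labels=labels)
df['age'].value_counts()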
Top Coding
Top coding is a technique used to bring extremely high values down to a specified maximum value. Again, the purpose of modifying these values in our context is to protect people’s privacy. Below is the code and output for this approach.
df=data('SLID')
df.loc[df['age'] > 75, 'age'] = 75
df['age'].hist(bins=15)

The code does the following.
- We load the “SLID” dataset again so that we can modify it again from its original state.
- We then use the .loc method to change all values in “age” above 75 to 75.
- Lastly, we create our histogram for comparison with the original data.
If you look to the far right, you can see the spike in the number of data points at age 75 compared to our original histogram. This is a result of our manipulation of the data. By doing this, we can keep all of our data for other forms of analysis while also protecting the privacy of the handful of people who are over the age of 75.
Bottom Coding
Bottom coding is the same as top coding except that now you raise values below a threshold up to a minimum value. Below is the code and output for this.
df=data('SLID')
df.loc[df['age'] < 20, 'age'] = 20
df['age'].hist(bins=15)

The code is the same as before, with the only differences being the less than “<” symbol and the threshold being set to 20. If you compare this histogram to the original, you can see a huge spike in the number of values at 20.
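As an aside, pandas’ .clip() method can apply both top and bottom coding in a single step. Below is a minimal sketch that simply reuses the thresholds of 20 and 75 from the examples above.
from pydataset import data

df = data('SLID')
# .clip() raises values below 20 up to 20 and lowers values
# above 75 down to 75 in one call
df['age'] = df['age'].clip(lower=20, upper=75)
df['age'].hist(bins=15)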
Conclusion
Data protection is an important aspect of the analyst’s role. The examples provided here are just some of the many ways in which the privacy of individuals can be respected with the help of Python.
Python for Data Privacy
Data privacy is a major topic among analysts who want to protect people’s information. There are often ethical expectations that personally identifying information is protected. Whenever data is shared, you want to be sure that individual people cannot be identified within the dataset, as identification can lead to unforeseen consequences. This post will examine simple ways a data analyst can protect personal information.
Libraries & Data Preparation
There are few libraries and minimal data preparation for this example. The code and output are below.
from pydataset import data
df=data('SLID')
df.head()

The only library we need is “pydataset” which contains the dataset we will use. In the second line, we create an object called “df” which contains our data. The data we are using is called “SLID” and contains data on individuals relating to their wages, education level, age, sex, and language.
We will now move to the first way to protect privacy when working with data.
Drop Columns
Sometimes protecting people’s identity can be as easy as dropping a column. Often, the column(s) that contain names, addresses, or phone numbers can be dropped. In our example below, we are going to pretend that the “language” column can be used to identify people. Therefore, we will drop this column. Below is the code and the output for this.
# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")
# Explore obtained dataset
suppressed_language.head()

To remove the “language” column, we use the drop() method. Inside this method, we indicate the name of the column as well as the axis.
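As a side note, the drop() method also accepts a list of column names, so several identifying columns can be suppressed in one call. Below is a quick sketch that pretends “sex” is identifying as well.
from pydataset import data

df = data('SLID')
# Drop two columns at once by passing a list of names
suppressed = df.drop(['language', 'sex'], axis="columns")
suppressed.head()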
Drop Rows
It is also possible to drop rows. Dropping rows may be appropriate for outliers. If only a handful of individuals have a certain value in a column it may be possible to identify them. In the code and output below, we drop all values where education is above or equal to 14.
# Drop rows with education higher than 14
education = df.drop(df[df.education >= 14].index)
# See DataFrame
education.head()

In the code, we used the drop() method again but subsetted the data to remove rows with education values greater than or equal to 14. We also include the index attribute to indicate the removal of rows. If you look, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or above.
Data Masking
Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.
# Uniformly mask the education column
df['education'] = '****'
# See resulting DataFrame
df.head()

The code involves subsetting the education variable and setting it equal to the asterisks. This approach is similar to dropping the column. However, there may be a reason to keep the column even if there is no useful information in it.
Replace Part of String
Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.
#Modify Sex Column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):] )
#See Results
df.head()

The code involves rewriting the data in the “sex” column.
- We do this by using the apply() method in this column. Inside the apply() method we use an anonymous function. Using an anonymous function includes using the word “lambda”.
- After lambda, we set the argument to the word “text” for practical reasons since we are modifying text.
- After the colon, we tell Python to keep the first character of the string with “text[0]” and then insert four asterisks **** after it.
- Lastly, we append the rest of the string starting from the position of “le”, which we locate using the find() method.
The apply() method allows us to loop through the column like a for loop and repeat this process for every row.
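To make that looping explicit, below is a rough sketch of what .apply() is doing for us, written as an ordinary for loop over a fresh copy of the data (the df2 name is just for illustration).
from pydataset import data

df2 = data('SLID')
# The loop below does by hand what the .apply() call did above
masked = []
for text in df2['sex']:
    masked.append(text[0] + '****' + text[text.find('le'):])
df2['sex'] = masked
df2.head()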
Conclusion
Protecting privacy is critical when working with data. The ideas presented here are just some of the many ways that a data analyst can protect people’s personal information.
Bokeh-Manipulating Glyph Color in Python VIDEO
Bokeh: Modifying Glyphs VIDEO
Generating Fake Data for Privacy with Python
The privacy of individuals in a dataset can be protected through the development of fake data. Using false numbers makes it much more difficult to identify individual people within a dataset. In this post, we will look at how to generate fake numbers and names using Python.
Libraries & Data Preparation
The initial library needed is only “pydataset” which will allow us to load the data. We will use the data() function to load the “SLID” dataset into an object called “df”. Next, we will look at the data using the .head() method. Below is the code and the output.
from pydataset import data
df=data('SLID')
df.head()

We have five columns of data that address wages, education level, age, sex, and language. However, for this example, we need to take several additional steps.
We are going to create four new columns that will be manipulated in the example below. These columns will be “name”, “credit_card”, “credit_code”, and “credit_company”. Each of these columns will have a default value that we will manipulate. Below is the code and output.
df['name']="Dan"
df['credit_card']=1234567890
df['credit_code']=123
df['credit_company']='comp'
df.head()

All of this new data will serve as data that needs protection. The original data isn’t needed; it just serves as a dataset onto which we graft the privacy data. Making a dataframe from scratch is a little more involved in Python and beyond the scope of this post, so we took a shortcut by adding to preexisting data.
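That said, for the curious, below is a minimal sketch of building a small dataframe from scratch with pandas; the values simply mirror the placeholder defaults we just created.
import pandas as pd

# A small stand-alone dataframe with the same placeholder columns
fake_df = pd.DataFrame({
    'name': ['Dan'] * 3,
    'credit_card': [1234567890] * 3,
    'credit_code': [123] * 3,
    'credit_company': ['comp'] * 3
})
fake_df.head()
We will now see how to generate fake numbers and names.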
Fake Numbers
The “faker” library has a class called Faker that can generate fake data for almost any circumstance. We will demonstrate this by generating phony credit card numbers. Below is the code and output.
# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()
'6561857744400343'
To generate the false credit card number, we loaded the faker library and imported the Faker class. Then we created an instance of Faker called “fake_data”. Lastly, we used the .credit_card_number() method on the “fake_data” object.
We will now generate fake values for “credit_card”, “credit_code”, and “credit_company”.
# Mask card number with new generated data using a lambda function
Faker.seed(0)
df['credit_code'] = df['credit_code'].apply(lambda x: fake_data.credit_card_security_code())
df['credit_company'] = df['credit_company'].apply(lambda x: fake_data.credit_card_provider())
df['credit_card'] = df['credit_card'].apply(lambda x: fake_data.credit_card_number())
# See the resulting pseudonymized data
df.head()

If you compare this output to the original, you can see that the values have changed. We set the seed using Faker.seed(0) so we always get the same results. The next three lines of code use an anonymous function, which allows us to loop through our dataset. First, we subset the name of the column we want to overwrite. Second, we use the .apply() method on the same column. Inside the .apply() method, we use lambda followed by the argument x. After the x, we indicate what we want done to the column using the appropriate method from the faker library. Lastly, we display the results using the .head() method. We will now address the names of people.
Fake Names
There are at least three different methods for generating fake names: one that generates male or female names, one that generates only male names, and one that generates only female names. Below is a brief example of each.
Faker.seed(0)
print(fake_data.name())
print(fake_data.name_male())
print(fake_data.name_female())
Norma Fisher
Jorge Sullivan
Elizabeth Woods
The code above is self-explanatory. We used the print function in order to print several outputs on separate lines. We will use the .name() method in the code below to generate fake names for our “name” column.
Faker.seed(0)
df['name'] = df['name'].apply(lambda x: fake_data.name())
df.head()

The steps for changing the names are the same as what we did with the credit card information. As such, we will not reexplain it here.
Conclusion
The ability to generate fake data as shown in this post allows an incredible amount of flexibility in protecting people’s identities. However, care must be taken not to destroy information that is needed for developing insights. For example, replacing real credit card numbers with random ones could be catastrophic if that information provides insights in a given context. Therefore, any tool that is going to be used must be used with wisdom and caution.
Bokeh-Manipulating Glyph Color
In this post, we will examine how to manipulate the color of the glyphs in a Bokeh data visualization. We are doing this not necessarily for aesthetic reasons but to convey additional information. Below are the initial libraries that we need. We will load additional libraries as required.
from pydataset import data
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.io import output_file, show
pydataset will be used to load the data we need. The next two lines provide the figure for our axes and the object for storing our data. The last line includes functions for saving and displaying our visualization.
Data Preparation
For this example, data preparation is simple. We will load the dataset “Duncan” using the data() function in an object called “df”. This dataset includes data about various occupations as measured on several variables. We will then display the data using the .head() method. Below is the code and output.
df=data('Duncan')
df.head()

Color Glyphs
In the example below, we will color the glyphs based on one of the variables. We will graph education vs income and color code the glyphs based on income. Below is the code followed by the output and finally the explanation.
from bokeh.transform import linear_cmap
from bokeh.palettes import RdBu8
source = ColumnDataSource(data=df)
# Create mapper
mapper = linear_cmap(field_name="income", palette=RdBu8, low=min(df["income"]), high=max(df["income"]))
# Create the figure
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")
# Add circle glyphs
fig.circle(x="education", y="income", source=source, color=mapper,size=16)
output_file(filename="education_vs_income.html")
show(fig)

Here is what we did.
- We had to load additional libraries. linear_cmap() will be used to create the actual coloring of the glyphs. RdBu8 is the color palette for the glyphs.
- We then create the source of our data using the ColumnDataSource() function.
- We create our mapper using the linear_cmap() function. The arguments inside the function are the variable we are using (income), the palette, and the low and high values for the variable.
- Next, we create our figure. We label our x and y axis based on the variables we are using and set the title.
- We use the .circle() method to create our glyphs. Notice how we set the color argument to our “mapper” object.
- The last two lines of code are for creating our output and showing it.
Setting the glyph color to a third variable allows you to express three variables in a two-dimensional space. For example, we could have used the “prestige” variable for the coloring of the glyphs rather than income, as income was already represented on the y-axis.
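Below is a minimal sketch of that variation. The only substantive change from the code above is the field the mapper reads; the output filename is my own.
from pydataset import data
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.transform import linear_cmap
from bokeh.palettes import RdBu8
from bokeh.io import output_file, show

df = data('Duncan')
source = ColumnDataSource(data=df)
# The mapper now reads "prestige", so color carries a third variable
mapper = linear_cmap(field_name="prestige", palette=RdBu8, low=min(df["prestige"]), high=max(df["prestige"]))
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income, Colored by Prestige")
fig.circle(x="education", y="income", source=source, color=mapper, size=16)
output_file(filename="education_vs_income_by_prestige.html")
show(fig)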
Adding a Color Bar
Adding a color bar will help to explain to a reader of our visualization what the color of the glyphs means. The code below is mostly the same and is followed by the output and lastly the explanation.
from bokeh.models import ColorBar
source = ColumnDataSource(data=df)
# Create mapper
mapper = linear_cmap(field_name="income", palette=RdBu8, low=min(df["income"]), high=max(df["income"]))
# Create the figure
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")
fig.circle(x="education", y="income", source=source, color=mapper,size=16)
# Create the color_bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
# Update layout with color_bar on the right
fig.add_layout(color_bar, "right")
output_file(filename="Education_vs_prestige_color_mapped.html")
show(fig)

Here is what happened.
- We loaded a new function called ColorBar().
- We create our source data (same as the previous example)
- We create our mapper (same as the previous example)
- We create our figure and glyphs (same as the previous example)
- Next, we create our color bar using the ColorBar() function. Inside this function, we set the color_mapper argument to a transformed version of the mapper object we already created. We also can set the width of the color bar using the width argument. Everything we have done in this step is saved in an object called “color_bar”
- We then use the .add_layout() method on our “fig” object and place the “color_bar” object inside it along with the string “right”, which tells Bokeh to place the color bar on the right-hand side of the scatterplot.
There is one more example of glyph manipulation below.
Color by Category
In this last example, we will map an additional categorical variable onto our plot using color. Below is the code, output, and explanation.
# Import modules
from bokeh.transform import factor_cmap
from bokeh.palettes import Category10_5
source = ColumnDataSource(data=df)
# Create positions
TOOLTIPS=[('Education', '@education'), ('Prestige', '@prestige')]
positions = ["wc","bc","prof"]
fig = figure(x_axis_label="Education", y_axis_label="Prestige", title="Education vs Prestige", tooltips=TOOLTIPS)
# Add circle glyphs
fig.circle(x="education", y="prestige", source=source, legend_field="type",
size=16,fill_color=factor_cmap("type", palette=Category10_5, factors=positions))
output_file(filename="Education_vs_prestige_by_type.html")
show(fig)

- We need to load some additional libraries. factor_cmap() will be used to color the glyphs based on the categorical variable “type”. Category10_5 is the color palette.
- We create an object called “source” for our data using the ColumnDataSource() function.
- We create an object called “TOOLTIPS”. This object will be used to display the individual data points of a glyph when the mouse hovers over it in the visualization.
- We create an object called “positions”, which is a list of all of the job types we want to match to different colors in our plot.
- We create an object called “fig”, which uses the figure() function to create the x and y axes and the title of the plot. Inside this function, we also set the tooltips argument equal to the “TOOLTIPS” object we created previously.
- Next, we use the .circle() method on our “fig” object. Most of the arguments are self-explanatory, but notice how the argument “fill_color” is set to the function factor_cmap(). Inside this function, we indicate the variable “type” as the variable to use, set the palette, and set the factors to the “positions” object that we made earlier.
- The last two lines save the output and display it.
Conclusion
Bokeh allows you to do many cool things when creating a visualization for data purposes. This post was focused on how to manipulate the glyphs in a scatterplot. However, there is so much more that can be done beyond what was shared here.
Bokeh Display Multiple Plots VIDEO
Bokeh Tools and Tooltips VIDEO
Bokeh Display Customization in Python
In this post, we will examine how to modify the default display of a plot in Bokeh, a library for interactive data visualizations in Python. Below are the initial libraries that we need.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
The first line of code is where our data comes from. We are using the data() function from pydataset for loading our data. The next two lines are for making the plot’s figure (x and y axes) and for the output file.
Data Preparation
There is no data preparation beyond loading the dataset using the data() function. We pick the dataset “Duncan” and load it into an object called “df.” The code is below, followed by a brief view of the actual data using the .head() method.
df=data('Duncan')
df.head()

This dataset includes various occupations measured in four ways: job type, income, education, and prestige.
Default Graph’s Appearance
Before we modify the appearance of the plot, it is important to know what the default appearance of the plot is for comparison purposes. Below is the code for a simple plot followed by the actual output and then lastly an explanation.
# Create a new figure
fig = figure(x_axis_label="Education", y_axis_label="Income")
# Add circle glyphs
fig.circle(x=df["education"], y=df["income"])
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

The first line of code sets up the fig or figure. We use the figure() function to label the axes which are education and income. The second line of code creates the actual data points in the figure using the .circle() method. The last two lines create the output and display it.
So the figure above is the default appearance of a graph. Below we will look at several modifications.
Modification 1
In the code below, we are making the following changes to the plot.
- Identifying data points by job type using color
- Change the background color to black
Below is the code followed by the output and the explanation
# Import curdoc
from bokeh.io import curdoc
prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]
# Change theme to contrast
curdoc().theme = "contrast"
fig = figure(x_axis_label="Education", y_axis_label="Income")
# Add prof circle glyphs
fig.circle(x=prof["education"], y=prof["income"], color="yellow", legend_label="prof",size=10)
# Add bc circle glyphs
fig.circle(x=bc["education"], y=bc["income"], color="red", legend_label="bc",size=10)
output_file(filename="prof_vs_bc.html")
show(fig)

Here is what happened.
- We load curdoc, which allows us to modify the appearance of the plot.
- Next, we do some data preparation, separating the data for types that are “prof” and those that are “bc” into separate objects.
- We change the theme of the plot to “contrast” using curdoc().theme.
- We also create the figure as done previously.
- We use the .circle() method twice. Once to set the “prof” data points on the plot and a second time to place the “bc” data points on the plot. We also make the data points larger by setting the size and using different colors for each job type.
- The last two lines of code are for creating the output and displaying it.
You can see the difference between this second plot and the first one. This also shows the flexibility that is inherent in the use of Bokeh. Below we add one more variation to the display.
Modified Graph’s Appearance
The plot below is mostly the same except for the following
- We add a third job type “wc”
- We modify the shapes of the data points
Below is the code followed by the graph and the explanation
# Subset the data by job type
wc = df.loc[df["type"] == "wc"]
prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]
fig = figure(x_axis_label="Education", y_axis_label="Income")
# Add circle glyphs for wc
fig.circle(x=wc["education"], y=wc["income"], legend_label="wc", color="purple",size=10)
# Add square glyphs for prof
fig.square(x=prof["education"], y=prof["income"], legend_label="prof", color="red",size=10)
# Add triangle glyphs for bc
fig.triangle(x=bc["education"], y=bc["income"], legend_label="bc", color="green",size=10)
output_file(filename="education_vs_income_by_type.html")
show(fig)

The code is almost all the same. The main difference is there are now three job types and each type has a different shape for their data points. The shapes are determined by using either .circle(), .triangle(), or .square() methods.
Conclusion
There are many more ways to modify the appearance of a visualization in Bokeh. The goal here was to provide some basic examples that may lead to additional exploration.
Bokeh Tools and Tooltips
In this post, we will look at how to manipulate the different tools and tooltips that you can use to interact with data visualizations made using Bokeh in Python. Tools are the icons that are displayed by default to the right of a visual when looking at a Bokeh output. Tooltips provide interactive data based on the position of the mouse.
We will now go through the process of changing these tools and tooltips for various reasons and purposes.
Load Libraries
First, we need to load the libraries we need to make our tools. Below is the code followed by an explanation.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
We start by loading “data” from “pydataset”. This library contains the actual data we are going to use. The other libraries are all related to Bokeh: “figure” will create the details for our visualization, “output_file” will make our HTML document, and “show” will display our visualization.
Data Preparation
Data preparation is straightforward. All we have to do is load our data into an object. We will use the “Duncan” dataset, which contains data on various jobs’ income, education, and prestige. Below is the code followed by a snippet of the actual data.
df=data('Duncan')
df.head()

Default Settings for Tools
We will now look at a basic plot with the basic tools. Below is the code.
# Create a new figure
fig = figure(x_axis_label="income", y_axis_label="prestige")
# Add circle glyphs
fig.circle(x=df["income"], y=df["prestige"])
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

There is nothing new here. We create the figure for our axes first. Then we add the points in the next line of code. Lastly, we write some code to create an output. The default toolbar has seven options. Below they are explained from top to bottom.
- At the top, is the logo that takes you to the Bokeh website
- Pan tool
- Box zoom
- Wheel zoom
- Save figure
- Reset figure
- Takes you to Bokeh documentation
We will now show how to customize the available tools.
Custom Settings for Tools
In order to make a set of custom tools, we need to make some small modifications to the previous code as shown below.
# Create a list of tools
tools = ["lasso_select", "wheel_zoom", "reset","save"]
# Create figure and set tools
# Create a new figure
fig = figure(x_axis_label="income", y_axis_label="prestige",tools=tools)
# Add circle glyphs
fig.circle(x=df["income"], y=df["prestige"])
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

What is new in the code is the object called “tools”. This object contains a list of the tools we want to be available in our plot. The names of the tools are available in the Bokeh documentation. We then pass this object to the argument called “tools” in the line of code where we create the “fig” object. If you compare the second plot to the first plot, you can see we have fewer tools in the second one, as determined by our code.
Hover Tooltip
The hover tooltip allows you to place your mouse over the plot and have information displayed about what your mouse is resting upon. Being able to do this can be useful for gaining insights about your data. Below is the code and the output followed by an explanation.
# Import ColumnDataSource
from bokeh.models import ColumnDataSource
# Create source
source = ColumnDataSource(data=df)
# Create TOOLTIPS and add to figure
TOOLTIPS = [("Education", "@education"), ("Position", "@type"), ("Income", "@income")]
fig = figure(x_axis_label="education", y_axis_label="income", tooltips=TOOLTIPS)
# Add circle glyphs
fig.circle(x="education", y="income", source=source)
output_file(filename="first_tooltips.html")
show(fig)

Here is what happened.
- We loaded a new library called ColumnDataSource. This function allows us to create a data structure that is unique to Bokeh. This is not required but will appear in the future.
- We then save our dataset using the new function and call it “source”.
- Next, we create a list called “TOOLTIPS”. This list contains tuples, which are in parentheses. The first string in each tuple will be the name that appears in the hover. The second string accesses the value in the dataset. For example, if you look at the hover in the plot above, the first line says “Education” and the number 72. The string “Education” is just the first string in the tuple, and the value 72 is the value of education from the dataset for that particular data point.
- The rest of the code is a review of what has been done previously. The only difference is that we use the argument “tooltips” instead of “tools”.
Conclusion
With tooltips and tools, you can make some rather professional-looking visualizations with a minimal amount of code. That is what makes the Bokeh library so powerful.
Bar Graphs Using Bokeh and Python VIDEO
Make a Bar Graph with Bokeh in Python
Bokeh is a data visualization library available in Python with the unique ability of interaction. In this post, we will look at how to make a basic bar graph using Bokeh.
To begin we need to load certain libraries as shown below.
from pydataset import data
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_file, show
In the code above, we load the “pydataset” library to gain access to the data we will use. Next, we load “pandas” which will help us with some data preparation. The last two libraries are related to “bokeh.” The “figure” function will be used to set the actual plot, the “output_file” function will allow us to save our plot as an HTML file and the “show” function will allow us to display our plot.
Data Preparation
We need to do two things to be ready to create our bar graph. First, we need to load the data. Second, we need to calculate group means for the bar graph. Below is the code for the first step followed by the output.
df=data('Duncan')
df.head()

In the code above we use the “data” function to load the “Duncan” dataset into an object called “df”. Next, we display the output of this. The “Duncan” dataset contains data on different jobs, the type of job, income, education, and prestige. We want to graph prestige and job type as a bar graph which will require us to calculate the mean of prestige by type. The code for this is below.
# Calculate group means of prestige
positions = df.groupby('type', as_index=False)['prestige'].mean()
positions

In the code above, we use the “groupby” function on the “df” object. Inside the function, we indicate that we want to group by “type”. The “as_index” argument is set to false so that the “type” column is not set as the index (that is, as the row labels). Next, we subset the data using square brackets to only include the “prestige” column. Lastly, we indicate that we want to calculate the “mean”. The result is that there are three job types, and we have the mean prestige for each one. The job types and means from the table above are what we will use for making our visualization.
Bar Graph
We are now ready to make our bar graph. Below is the code followed by the output.
# Instantiate figure
fig = figure(x_axis_label="positions", y_axis_label="Prestige", x_range=positions["type"])
# Add bars
fig.vbar(x=positions["type"], top=positions["prestige"],width=0.9)
# Produce the html file and display the plot
output_file(filename="Prestige.html")
show(fig)

Here are the steps.
- We began by creating the “fig” object. We labeled our x and y axes and also indicated the range of the x values, which means determining the categories of our data. For our purposes, these were the unique job types in the “type” column.
- Next, we use the “vbar” function to make our bar graph. The x values were set to the “type” column from the “positions” object. The y or “top” values were set to the means of “prestige” from the “positions” object. The “width” argument was set to 0.9 to ensure there was a little whitespace between the bars.
- The “output_file” creates a saved plot and the “show” function displays the bar graph.
Conclusion
Bokeh has lots of cool tools available for the data analyst. This post was focused on bar graphs, and only the most basic information has been shared here. There is much more possible with this library.
Bokeh-Scatter Plot Basics in Python
Bokeh is another data visualization library available in Python. One of Bokeh’s unique features is that it allows for interaction. In this post, we will learn how to make a basic scatterplot in Bokeh while also exploring some of the basic interactions that are provided by default.
Data Preparation
We are going to make a scatterplot using the “Duncan” data set that is available in the “pydataset” library. Below is the initial code.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
The code above is just the needed libraries. We loaded “pydataset” because this is where our data will come from. All of the other libraries are related to Bokeh. “figure” allows us to set up our axes for the scatterplot. “output_file” allows us to create the file of our plot. Lastly, “show” allows us to display our visualization. In the code below, we will load our dataset, give it a name, and print the first few rows.
df=data('Duncan')
df.head()

In the code above, we store the “Duncan” dataset in an object called “df” using the data() function. We then display a snippet of the data using the .head() method. The “Duncan” data shares information on jobs as defined by several variables. We will now proceed to make the scatterplot.
Making the Scatterplot
We will now make our scatterplot. We have to do this in three steps.
- Make the axis
- Add the data to the plot
- Create the output file and show the results
Below is the code with the output
# Create a new figure
fig = figure(x_axis_label="education", y_axis_label="income") # labels the axes
# Add circle glyphs
fig.circle(x=df["education"], y=df["income"]) #adds the dots
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

At the top of the code, we create our axis information using the “figure” function. Here we are plotting education vs income and storing all of this in an object called “fig”. Next, we insert the data into our plot using the “circle” function. To insert the data we also have to subset the “df” dataframe for the variables that we want. Note that the data added to a plot are called “glyphs” in Bokeh. Lastly, we create an output file using a function with the same name and show the results.
To the right of your plot, there are also some interaction buttons as shown below
Here is what they do from top to bottom.
- Takes you to bokeh.org
- Pan the image
- Box zoom
- Wheel zoom
- Download image
- Resets image
- Takes you to the Bokeh documentation
There are other interactions possible but these are the default ones when you make a plot.
Conclusion
Bokeh is one of many tools used in Python for data visualization. It is a powerful tool that can be used in certain contexts. The interactive tools can also enhance the user experience.
Importing Files with Python VIDEO
The video below provides several different examples and ways that data can be imported into Python.

T-test & ANOVA with Python VIDEO
The video below shows how to conduct a t-test and an ANOVA analysis using the pingouin library from Python.

T-Test with Pingouin
In this post, we will look at how to use the pingouin package to calculate both t-test and ANOVA results. This is not a post on statistics. Rather, we are focused on how to conduct t-tests and ANOVA using Python. Therefore, an explanation of the statistics is not a part of this post.
We will be using the Duncan dataset from the pydataset package. In the code below, we are loading the needed libraries and we are also printing a portion of the Duncan dataset.
import pandas as pd
import pingouin
from pydataset import data
df=data("Duncan")
df.head()

The Duncan dataset is simple. It has stats on various jobs which include the type of job, income, education, and prestige. We want to compare job type with income. What we want to do is compare professional jobs (prof) with white-collar jobs (wc) and see if there is a difference. After doing this, we will compare all three job types (bc, wc, prof) using ANOVA.
T-Test
In the code below, we need to subset our data so that the professionals and white-collar workers are separate.
df_prof=df[df['type']=='prof']
df_wc=df[df['type']=='wc']
Now that this is complete, the code below is what is used for conducting the t-test. We are comparing professional income with white-collar income. The t-test is two-sided, which means we are looking for any difference at all. Below are the results.
pingouin.ttest(x=df_prof['income'],y=df_wc['income'],alternative="two-sided")

According to the p-value, there is no difference between the salaries of professionals when compared to white-collar workers. We will now move to ANOVA.
ANOVA
T-test only allows the user to compare two groups. ANOVA allows the user to compare multiple groups. We have three types of workers and not just two. Using ANOVA, we can compare all three at once. In addition, unlike the t-test, there is no data preparation needed in this example.
The code below is relatively simple, we are using the ANOVA function from Pingouin. The first argument is for the data, the second indicates the dependent variable, and the between argument indicates the independent variable. Below is the code and output.
pingouin.anova(data=df,dv="income",between="type")

The value we are focused on is the p-unc or p-value. The results are significant. In other words, there is a difference in at least one of the comparisons. We don’t know which one, which will require us to do a pairwise comparison. Below are two different pairwise comparisons, one without an adjustment and one with an adjustment.
Pairwise Comparison with No Adjustment
The first pairwise comparison is without an adjustment. The code below is mostly the same as for the ANOVA. The main difference is that we are using the pairwise_tests() function, and there is an additional argument called padjust, which is set to ‘none’. Below is the code and output.
pingouin.pairwise_tests(data=df,dv="income",between="type",padjust='none')

Focusing on the p-values (p-unc) again we can see that there is a difference between blue-collar workers and professionals and another difference between blue-collar workers and white-collar workers. However, there is no difference between professional and white-collar workers. Keep in mind that we already knew that there was no difference between professionals and white-collar workers from the t-test results.
Pairwise Comparison with Adjustment
In the code below, we have the same code but with a Bonferroni p-value adjustment. Adjustments become important when you have a large number of groups. The details of this are beyond the scope of this post. However, it is important to make this adjustment because otherwise, you could get false positives which could skew your results and interpretation. Below is the code and output.
pingouin.pairwise_tests(data=df,dv="income",between="type",padjust='bonf')

You may have noticed that the numbers are the same. That is because in our example we have a small number of groups. Therefore, this correction is not necessary for the data we are using.
Conclusion
The main purpose here was to show what the pingouin package can do when it comes to t-tests and ANOVA. We could have calculated means for each group and other statistics, as sketched below, but that was not the focus. Now you know some of the tools that are available in the pingouin library.
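For completeness, here is a one-step sketch of those group statistics; it uses ordinary pandas rather than anything specific to pingouin.
from pydataset import data

df = data("Duncan")
# Mean, standard deviation, and other statistics of income by job type
print(df.groupby('type')['income'].describe())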
Import Simple Files into Python
In this post, we will be using Python to import files. Importing a text file into Python is rather easy. We will look at several different examples and file types in this post.
Importing a Text File
Importing a text file is often done in Python. To do this see the code below.
file=open('Corr.txt',mode='r')
text=file.read()
file.close()
print(text)
$r
ACmean CLMean SFIMean EnrichMean
ACmean 1.0000000 0.4386146 0.2463862 0.5758464
CLMean 0.4386146 1.0000000 0.2874991 0.5730721
SFIMean 0.2463862 0.2874991 1.0000000 0.2076200
EnrichMean 0.5758464 0.5730721 0.2076200 1.0000000
$n
ACmean CLMean SFIMean EnrichMean
ACmean 172 172 172 172
CLMean 172 172 172 172
SFIMean 172 172 172 172
EnrichMean 172 172 172 172
$P
ACmean CLMean SFIMean EnrichMean
ACmean NA 1.763762e-09 0.0011214521 0.000000e+00
CLMean 1.763762e-09 NA 0.0001312634 2.220446e-16
SFIMean 1.121452e-03 1.312634e-04 NA 6.277935e-03
EnrichMean 0.000000e+00 2.220446e-16 0.0062779348 NA
attr(,"class")
[1] "rcorr"
In order to load the text file, we used the open() function to open the file in the working directory. Next, we indicated the mode as ‘r’, which means ‘read’. Everything just mentioned was saved into an object called ‘file’. Then we used the read() function on the ‘file’ object and saved the result in a new object called ‘text’. The next step involved using the close() function to complete the process. The last step was printing the content of the ‘text’ object using print().
Below is a way to complete this process faster.
with open('Corr.txt', 'r') as file:
    print(file.read())
$r
ACmean CLMean SFIMean EnrichMean
ACmean 1.0000000 0.4386146 0.2463862 0.5758464
CLMean 0.4386146 1.0000000 0.2874991 0.5730721
SFIMean 0.2463862 0.2874991 1.0000000 0.2076200
EnrichMean 0.5758464 0.5730721 0.2076200 1.0000000
$n
ACmean CLMean SFIMean EnrichMean
ACmean 172 172 172 172
CLMean 172 172 172 172
SFIMean 172 172 172 172
EnrichMean 172 172 172 172
$P
ACmean CLMean SFIMean EnrichMean
ACmean NA 1.763762e-09 0.0011214521 0.000000e+00
CLMean 1.763762e-09 NA 0.0001312634 2.220446e-16
SFIMean 1.121452e-03 1.312634e-04 NA 6.277935e-03
EnrichMean 0.000000e+00 2.220446e-16 0.0062779348 NA
attr(,"class")
[1] "rcorr"
Using the ‘with’ approach is much faster and simpler. The content of the open() function is the same, but we name the result ‘file’ by writing the name after the open() call rather than before it. Lastly, we print the file content using the read() function.
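One variation worth sketching: when a file is large, you can iterate over the file object one line at a time instead of calling read() once. This assumes the same ‘Corr.txt’ in the working directory.
with open('Corr.txt', 'r') as file:
    # Iterating over the file yields one line per pass through the loop
    for line in file:
        print(line, end='')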
Import with Numpy
Naturally, there is more than one way to import data. The example below involves the use of the NumPy library. This approach is used in particular for dealing with numerical data that might be saved as a text file. Below is an example of how to do this.
import numpy as np
text=np.loadtxt('sample.txt', delimiter=',')
print(text)
[[1. 2. 3. 4.]
[5. 6. 7. 8.]
[9. 0. 1. 2.]]
We begin by importing the numpy library as np. Next, we create an object called ‘text’ and use the loadtxt() function to load a text file called ‘sample.txt’. The argument ‘delimiter’ tells numpy how the numbers are separated in the file. Lastly, we print the ‘text’ object as an array.
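The loadtxt() function also takes arguments for skipping header rows and selecting columns. Below is a small sketch; ‘sample_header.txt’ is a hypothetical file with one header line.
import numpy as np

# skiprows=1 ignores the hypothetical header line; usecols keeps columns 0 and 2
text = np.loadtxt('sample_header.txt', delimiter=',', skiprows=1, usecols=(0, 2))
print(text)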
Import with Pandas
Pandas is another way to import data. For our example, we will look at how to import csv files. Below is the code for how to complete this.
import pandas as pd
data=pd.read_csv('sample.csv')
data.head()

In line 1 of the code, we load the pandas library as ‘pd’. In line 2, we use the read_csv() function to load our data into the ‘data’ object. Lastly, we use head() to take a peek at the first few lines of data.
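As a brief sketch, read_csv() also accepts optional arguments that often come in handy; the values below are illustrative rather than required.
import pandas as pd

# sep sets the delimiter, header=0 treats the first row as column
# names, and nrows limits how many rows are read
data = pd.read_csv('sample.csv', sep=',', header=0, nrows=5)
data.head()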
Conclusion
With this information, you now possess some basic knowledge on how to get data into Python for the purpose of being able to manipulate it. In a future post, we will look at other ways and means of importing data into Python.
RANSAC Regression with Python VIDEO
RANSAC regression is a unique style of regression. This algorithm identifies outliers and inliers using tools specific to this approach. The video below provides an overview of how it can be used in Python.

Gradient Boosting Classification with Python VIDEO
In this video, we will look at gradient boosting classification with Python. Gradient boosting is similar to AdaBoost in that it is an ensemble technique and is often associated with decision trees. The main difference is the focus on the gradient or slope in the calculations.

AdaBoost Regression with Python VIDEO
AdaBoost regression uses ensemble learning to improve the performance of numeric prediction models. The video below explains how to use AdaBoost with Python.

AdaBoost Classification with Python VIDEO
AdaBoost classification is a type of ensemble learning. What this means is that the algorithm makes multiple models that work together to make predictions. Such techniques are powerful in improving the strength of models. The video below explains how to use this algorithm within Python.

Elastic Net Regression with Python VIDEO
Elastic net regression has all the strengths of both ridge and lasso regression without the apparent weaknesses. As such, this is a great algorithm for regularized regression. The video below explains how to use this algorithm with Python.

Lasso Regression with Python VIDEO
Ridge Regression with Python VIDEO
Ridge regression belongs to a family of regression called regularization regression. This family of regression uses various mathematical techniques to reduce or remove coefficients from a regression model. In the case of ridge, this algorithm will reduce coefficients close to zero but never actually remove variables from a model. In this video, we will focus on using this algorithm in Python rather than on the mathematical details.
Hyper-Parameter Tuning with Python VIDEO
Hyper-parameter tuning is one way of taking your model development to the next level. This tool provides several ways to make small adjustments that can reap huge benefits. In the video below, we will look at tuning the hyper-parameters of a KNN model. Naturally, this tuning process can be used for any algorithm.

Cross-Validation with Python VIDEO
Cross-validation is a valuable tool for assessing a model’s ability to generalize. In the video below, we will look at how to use cross-validation with Python.

Intro to Matplotlib with Python VIDEO
Matplotlib is a data visualization module used often in Python. In this video, we will go over some introductory commands that will allow anybody to make simple manipulations to their visualizations.

Random Forest Regression with Python VIDEO
In the video below we will take a look at how to perform a random forest regression analysis with Python. Random forest is one of many tools that can be used in the field of data science to gain insights to help people.

K Nearest Neighbor Classification with Python VIDEO
Naive Bayes with Python VIDEO
Naive Bayes is an algorithm that is commonly used with text classification. However, it can also be used for separating observations into multiple categories. In this video, we will look at a simple example of the use of Naive Bayes in Python.

K-Nearest Neighbor Regression with Python VIDEO
Support Vector Machines Regression with Python VIDEO
Support Vector Machines Classification with Python VIDEO
Linear Discriminant Analysis with Python VIDEO
Factor Analysis with Python VIDEO
Factor analysis is a statistical technique used to reduce the number of dimensions in order to simplify additional analysis or confirm a construct. In this video, we will look at a very simple example of factor analysis along with a visualization.

Natural Language Process and WordClouds with Python VIDEO
KMeans Clustering with Python VIDEO
KMeans clustering is an unsupervised learning technique used to place data in various groups, as determined by the algorithm. In this video, we will go step by step through the process of using this insight tool.
