Showing posts with label datascience. Show all posts
Showing posts with label datascience. Show all posts

Wednesday, August 26, 2020

Jupyter: JUlia PYThon and R

Image
it's "ggplot2", not "ggplot", but it is ggplot()

 

Did you know that @projectJupyter's Jupyter Notebook (and JupyterLab) name came from combining 3 programming languages: JUlia, PYThon and R.

Readers of my blog do not need an introduction to Python. But what about the other 2?  

Today we will talk about R. Actually, R and Python, on the Raspberry Pi.

R Origin

R traces its origins to the S statistical programming language, developed in the 1970s at Bell Labs by John M. Chambers. He is also the author of books such as Computational Methods for Data Analysis (1977) and Graphical Methods for Data Analysis (1983). R is an open source implementation of that statistical language. It is compatible with S but also has enhancements over the original.

 

A quick getting started guide is available here: https://support.rstudio.com/hc/en-us/sections/200271437-Getting-Started

 

Image


Installing Python

As a recap, in case you don't have Python 3 and a few basic modules, the installation goes as follow (open a terminal window first):


pi@raspberrypi: $ sudo apt install python3 python3-dev build-essential

pi@raspberrypi: $ sudo pip3 install jedi pandas numpy


Installing R

Installing R is equally easy:

 

pi@raspberrypi: $ sudo apt install r-recommended

 

We also need to install a few development packages:


pi@raspberrypi: $ sudo apt install libffi-dev libcurl4-openssl-dev libxml2-dev


This will allow us to install many packages in R. Now that R is installed, we can start it:


pi@raspberrypi: $ R

Installing packages

Once inside R, we can install packages using install.packages('name') where name is the name of the package. For example, to install ggplot2 (to install tidyverse, simply replace ggplot2 with tidyverse):

> install.packages('ggplot2')


To load it:


> library(ggplot2)

And we can now use it. We will use the mpg dataset and plot displacement vs highway miles per gallon and set the color to:

>ggplot(mpg, aes(displ, hwy, colour=class))+

 geom_point()

Image


Combining R and Python

We can go at this 2 ways, from Python call R, or from R call Python. Here, from R we will call Python.

First, we need to install reticulate (the package that interfaces with Python):

> install.packages('reticulate')

And load it:

> library(reticulate)

We can verify which python binary that reticulate is using:

> py_config()

 Then we can use it to execute some python code. For example, to import the os module and use os.listdir(), from R we do ($ works a bit in a similar fashion to Python's .):

> os <- import("os")
> os$listdir(".")

Or even enter a Python REPL:

> repl_python()
>>> import pandas as pd

>>>


Type exit to leave the Python REPL.

One more trick: Radian

we will now exit R (quit()) and install radian, a command line REPL for R that is fully aware of the reticulate and Python integration:

pi@raspberrypi: $ sudo pip3 install radian


pi@raspberrypi: $ radian

This is just like the R REPL, only better. And you can switch to python very quickly by typing ~:

r$> ~

As soon as the ~ is typed, radian enters the python mode by itself:

r$> reticulate::repl_python()

>>> 

Hitting backspace at the beginning of the line switches back to the R REPL:

r$> 


I'll cover more functionality in a future post.


Francois Dion

Sunday, March 11, 2018

Are you smarter than a fifth grader?

"the editorial principle that nothing should be given both graphically and in tabular form has to become unacceptable" - John W. Tukey

Back to school

In the United States, most fifth grade students are learning about a fairly powerful type of visualization of data. In some states, it starts at an even younger age, in the 4th grade. As classwork and homework, they will produce many of these plots:

Image

They are called stem-and-leaf displays, or stem-and-leaf plots. The left side of the vertical bar is the stem, and the right side, the leaves. The key or scale is important as it indicates the multiplier. The top row in the image above has a stem of 2 and leaves 0,6 and 7, representing 20, 26 and 27. Invented by John W. Tukey in the 1970's (see the statistics section of part II and the classics section of part V of my "ex-libris" series), few people use them once they leave school. Doing stem-and-leaf plots by hand is not the most entertaining thing to do. The original plot was also limited to handling small data sets. But there is a variation on the original display that gets around these limitations.

"Data! Data! Data!"

Powerful? Why did I say that in the first paragraph?

And why should stem-and-leaf plots be of interest to students, teachers, analysts, data scientists, auditors, statisticians, economists, managers and other people teaching, learning or working with data? There are a few reasons, with the two most important being:
  • they represent not only the overall distribution of data, but the individual data points themselves (or a close approximation)
  • They can be more useful than histograms as data size increases, particularly on long tailed distributions

 

An example with annual salaries

We will look at a data set of the salaries for government employees in Texas (over 690,000 values, from an August 2016 snapshot of the data from the Texas Tribune Salary Explorer). From this we create a histogram, one of the most popular plot for looking at distributions. As can be seen, we can't really tell any detail (left is Python Pandas hist, right is R hist):

Image

It really doesn't matter the language or software package used, we get one very large bar with almost all the observations, and perhaps (as in R or seaborn), a second tiny bar next to it. A box plot (another plot popularized by John Tukey) would have been a bit more useful here adding some "outliers" dots. And, how about a stem-and-leaf plot? We are not going to sort and draw something by hand with close to 700,000 values...

Fortunately, I've built a package (python modules plus a command line tool) that handles stem-and-leaf plots at that scale (and much, much larger). It is available from http://stemgraphic.org and also from github (the code has been available as open source since 2016) and pypi (pip install stemgraphic).
So how does it look for the same data set?

Image

Now we can see a lot of detail. Scale was automatically found to be optimal as 10000, with consecutive stems ranging from 0 to 35 (350000). We can read numbers directly, without having to refer to a color coded legend or other similar approach. At the bottom, we see a value of 0.00 (who works and is considered employed for $0 annual income? apparently, quite a few in this data set), and a maximum of $5,266,667.00 (hint, sports related), we see a median of about $42K and we see multiple classes of employees, ranging from non managerial, to middle management, upper management and beyond ($350,000+). We've limited the display here to 500 observations, and that is what the aggregate count on the leftmost column tells us. Notice also how we have a convenient sub-binning going on, allowing us to see which $1000 ranges are more common. All this from one simple display. And of course we can further trim, zoom, filter or limit what data or slice of data we want to inspect.

Knowing your data (particularly at scale) is a fundamental first step to turning it into insight. Here, we were able to know our data a lot better by simply using the function stem_graphic() instead of hist() (or use the included stem command line tool - compatible with Windows, Mac OS and Linux).

Tune in next episode...

Customers already using my software products for data governance, anomaly detection and data quality are already familiar with it. Many other companies, universities and individuals are using stemgraphic in one way or another. For everybody else, hopefully this has raised your interest, you'll master this visualization in no time, and you'll be able to answer the title question affirmatively...

Stemgraphic has another dozen types of visualizations, including some interactive and beyond numbers, adding support for categorical data and for text (as of version 0.5.x). In the following months I'll talk a bit more about a few of them.


Francois Dion
@f_dion

N.B. This article was originally published on LinkedIn at:

https://www.linkedin.com/pulse/you-smarter-than-fifth-grader-francois-dion/

Tuesday, February 27, 2018

Stemgraphic v.0.5.x: stem-and-leaf EDA and visualization for numbers, categoricals and text


 Stemgraphic open source


In 2016 at PyDataCarolinas, I open-sourced my stem-and-leaf toolkit for exploratory data analysis and visualization. Later, in October 2016 I had posted the link to the video.


Image

Stemgraphic.alpha


Image
With the 0.5.x releases, I've introduced the categorical and text support. In the next few weeks, I'll be introducing some of the features, particularly those found in the new stemgraphic.alpha module of the stemgraphic package, such as back-to-back plots and stem-and-leaf heatmaps:

Image

Image


But if you want to get started, check out stemgraphic.org, and the github repo (especially the notebooks).

Github Repo

https://github.com/fdion/stemgraphic


Francois Dion
@f_dion

Monday, February 26, 2018

Readings in Communication

"Ex-Libris" part VI: Communication

Part 6 of my "ex-libris" of a Data Scientist is now available. This one is about communication.
When I started this series, I introduced this diagram.
Image

I also quoted Johh Tukey:
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise" 

This is quite important since a data science project has to start with a question and come up with an answer.

But how do we communicate at the time of formulating the question? How about at the time of providing an answer? By any means necessary.

Do check out the full article with the list of books:

"ex-libris" part vi


See also

Part I was on "data and databases": "ex-libris" of a Data Scientist - Part i
Part II, was on "models": "ex-libris" of a Data Scientist - Part II

Part III, was on "technology": "ex-libris" of a Data Scientist - Part III
Part IV, was on "code": "ex-libris" of a Data Scientist - Part IV
Part V was on "visualization". Bonus after that will be on management / leadership.
Francois Dion
@f_dion

Friday, August 11, 2017

Readings in Visualization

"Ex-Libris" part V: Visualization


Part 5 of my "ex-libris" of a Data Scientist is now available. This one is about visualization.

Starting from a historical perspective, particularly of statistical visualization, and covering a few classic must have books, the article then goes on to cover graphic design, cartography, information architecture and design and concludes with many recent books on information visualization (specific Python and R books to create these were listed in part IV of this series). In all, about 66 books on the subject.

Just follow the link to the LinkedIn post to go directly to it:



Image
From Jacques Bertin’s Semiology of Graphics

"Le plus court croquis m'en dit plus long qu'un long rapport", Napoleon Ier

See also

Part I was on "data and databases": "ex-libris" of a Data Scientist - Part i
Part II, was on "models": "ex-libris" of a Data Scientist - Part II

Part III, was on "technology": "ex-libris" of a Data Scientist - Part III
Part IV, was on "code": "ex-libris" of a Data Scientist - Part IV
Part VI will be on communication. Bonus after that will be on management / leadership.
Francois Dion
@f_dion

P.S.
Je vais aussi avoir une liste de publications en francais
En el futuro cercano voy a hacer una lista en espanol tambien

Monday, May 8, 2017

Readings in Data Science models

Ex-libris


As I've previously mentioned, I recently started a 6 part series on LinkedIn called "ex-libris" of a Data Scientist. 

Part I was on "data and databases": "ex-libris" of a Data Scientist - Part i

I just posted part II, "models": "ex-libris" of a Data Scientist - Part II

It does cover machine learning, but before going there, I cover Metrics, Operations Research, Econometrics and Time Series and Statistics. Even more fundamentally, I start with Math.

Yes, if you slept through linear algebra or calculus (or analysis as it is called in certain parts of the world), check the list out. Book suggestions and links to videos and other resources.


Also, a reminder: Python specific books will show up in part IV.



Francois Dion
@f_dion

Thursday, April 13, 2017

Readings in data and databases

Image
Recent readings (can you guess/decipher some of them?)

I've been fairly quiet on this particular blog this year. Beside a lot of data science work, I've done presentations at meetups and conferences, including a recent tutorial on "Getting to know your data at scale" at the IEEE SouthEastCon 2017. Notebooks will be posted on github soon.

But, in the meantime...

Ex-libris

Something else I've been doing is publishing a few articles here and there. Just recently, I started a 6 part series on LinkedIn called "ex-libris" of a Data Scientist. I think many readers of this blog will appreciate this series, and particularly this first installment on "data and databases":

"ex-libris" of a Data Scientist - Part i

It covers a good variety of books on the subject, some pretty much must read for whatever corner of the computer science world you live in. Also of interest will be the Postgres, Hadoop and graph database pointers and a list of over 20 curated must read papers in the field.

Python specific books will show up in part IV.


Francois Dion
@f_dion

Thursday, October 20, 2016

Stemgraphic, a new visualization tool

PyData Carolinas 2016

At PyData Carolinas 2016 I presented the talk Stemgraphic: A Stem-and-Leaf Plot for the Age of Big Data.

Intro

The stem-and-leaf plot is one of the most powerful tools not found in a data scientist or statistician’s toolbox. If we go back in time thirty some years we find the exact opposite. What happened to the stem-and-leaf plot? Finding the answer led me to design and implement an improved graphical version of the stem-and-leaf plot, as a python package. As a companion to the talk, a printed research paper was provided to the audience (a PDF is now available through artchiv.es)

The talk




Thanks to the organizers of PyData Carolinas, videos of all the talks and tutorials have been posted on youtube. In just 30 minutes, this is a great way to learn more about stemgraphic and the history of the stem-and-leaf plot for EDA work. This updated version does include the animated intro sequence, but unfortunately the sound was recorded from the microphone, and not the mixer. You can see the intro sequence in higher audio and video quality on the main page of the website below.

Stemgraphic.org

I've created a web site for stemgraphic, as I'll be posting some tutorials and demo some of the more advanced features, particularly as to how stemgraphic can be used in a data science pipeline, as a data wrangling tool, as an intermediary to big data on HDFS, as a visual validation for building models and as a superior distribution plot, particularly when faced with non uniform distributions or distributions showing a high degree of skewness (long tails).

Github Repo

https://github.com/fdion/stemgraphic


Francois Dion
@f_dion
 



Tuesday, October 11, 2016

PyData Carolinas 2016 Tutorial: Datascience on the web

PyData Carolinas 2016

Don Jennings and I presented a tutorial at PyData Carolinas 2016: Datascience on the web.

The plan was as follow:

Description

Learn to deploy your research as a web application. You have been using Jupyter and Python to do some interesting research, build models, visualize results. In this tutorial, you’ll learn how to easily go from a notebook to a Flask web application which you can share.

Abstract

Jupyter is a great notebook environment for Python based data science and exploratory data analysis. You can share the notebooks via a github repository, as html or even on the web using something like JupyterHub. How can we turn the work we have done in the notebook into a real web application?
In this tutorial, you will learn to structure your notebook for web deployment, how to create a skeleton Flask application, add a model and add a visualization. While you have some experience with Jupyter and Python, you do not have any previous web application experience.
Bring your laptop and you will be able to do all of these hands-on things:
  1. get to the virtual environment
  2. review the Jupyter notebook
  3. refactor for reuse
  4. create a basic Flask application
  5. bring in the model
  6. add the visualization
  7. profit!
Now that is has been presented, the artifacts are a github repo and a youtube video.

Github Repo


https://github.com/fdion/pydata/

After the fact

The unrefactored notebook is here while the refactored one is here.
Once you run through the whole refactored notebook, you will have train and test sets saved in data/ and a trained model in trained_models/. To make these available in the tutorial directory, you will have to run the publish.sh script. On a unix like environment (mac, linux etc):
chmod a+x publish.sh
./publish.sh

Video

The whole session is now on youtube: Francois Dion & Don Jennings Datascience on the web

Francois Dion
@f_dion

Sunday, September 18, 2016

Something for your mind: Polymath Podcast launched


Some episodes
will have more Art content, some will have more Business content, some will have more Science content, and some will be a nice blend of different things. But for sure, the show will live up to its name and provide you with “something for your mind”. It might raise more questions than it answers, and that is fine too.

Episode 000
Listen to Something for your mind on http://Artchiv.es

Francois Dion
@f_dion

Saturday, March 5, 2016

The return of the Los Alamos Memo 10742 -

Image
Modern rendering of the original 1947 Memo 10742

The mathematician prankster


Can you imagine yourself receiving this memo in your inbox in Washington in 1947? There's a certain artistic je ne sais quoi in this memo...

This prank was made by J Carson Mark and Stan Ulam.  A&S was Administration and Services.

And Ulam, well known for working on the Manhattan project, also worked on really interesting things in mathematics. Specifically, a collaboration with Nicholas Constantine Metropolis and John Von Neumann. You might know this as the Monte Carlo method (so named due to Ulam's uncle always asking for money to go and gamble in a Monte Carlo casino...). Some people have learned about a specific Monte Carlo simulation (the first) known as Buffon's needle.

Copying the prankster

When I stumbled upon this many years ago, I decided that it would make a fantastic programming challenge for a workshop and/or class. I first tried it in a Java class, but people didn't get quite into it. Many years later I redid it as part of a weekly Python class I was teaching at a previous employer.

The document is the output of a Python script. In order to make the memo look like it came from the era, I photocopied it. It still didn't look quite right, so I then scanned that into Gimp, bumped the Red and Blue in the color balance tool to give it that stencil / mimeograph / ditto look.


Your assignment


Here is what I asked the students:

"Replicate either:
a) the whole memo
or
b) the list of numbers 
Whichever assignment you choose, the numbers must be generated programmatically."

That was basically it. So, go ahead and try it. In Python. Or in R, or whatever you fancy and post a solution as a comment.

We will come back in some days (so everybody gets a chance to try it) and present some possible methods of doing this. Oh, and why the title of "the return of the Los Alamos Memo"? Well, I noticed I had blogged about it before some years back, but never detailed it...

Learning more on Stan Ulam


See the wikipedia entry and also:

LOS ALAMOS SCIENCE NO. 15, 1987



[EDIT: Part 2 is at: los-alamos-10742-making-of.html]

Francois Dion
@f_dion

Monday, December 28, 2015

The 10 colors of Pi

Picking up patterns

Or the lack thereof. The brain is pretty good at spotting patterns and anomalies. But we have to help it with something that can be easily abstracted. Numbers are not good for that.

Of reds and greens and blues

Colors are good helpers. Shades, hues. Unfortunately, many people are affected by colorblindness. It is said that in some segments of the population, up to 8% of men and 0.4% of women experience congenital color deficiency, with the most common being red-green color blindness.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
sns.set_context("talk")

Seaborn palettes

According to Seaborn's documentation (http://web.stanford.edu/~mwaskom/software/seaborn/tutorial/color_palettes.html), there is a ready made color palette called colorblind. There is not a lot of details on this, so I thought I'd experiment with this and ask for feedback. Unfortunately, only 6 colors are available until it starts recycling itself. That is clearly not good if we want to see patterns in numbers that range from 0 to 9 on each digit.
Another interesting choice is cubehelix. It works in color and grayscale. You can learn more about it here: http://www.mrao.cam.ac.uk/~dag/CUBEHELIX/
First thing first, let's set it as the default palette, and display it.
In [2]:
sns.set_palette(sns.color_palette("cubehelix", 10))
sns.palplot(sns.color_palette())
Image

An infinite source of entertainment

At the very least, an infinite source of digits: pi
So we will be plotting a large grid, with each square representing a digit of pi and filled in the color corresponding to the color in the Seaborn palette. Let's grab a pi digit generator. There is one here that's been around since the days of python 2.5:
https://www.daniweb.com/programming/software-development/code/249177/pi-generator-update
In [3]:
def pi_generate():
    """
    generator to approximate pi
    returns a single digit of pi each time iterated
    """
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < m * t:
            yield m
            q, r, t, k, m, x = 10*q, 10*(r-m*t), t, k, (10*(3*q+r))//t - 10*m, x
        else:
            q, r, t, k, m, x = q*k, (2*q+r)*x, t*x, k+1, (q*(7*k+2)+r*x)//(t*x), x+2

The 10 colors of Pi

We are now ready to do our actual visualization of pi.
For legal size paper, 55x97 is a good size (plus the ratio is the square root of pi). For poster, I use 154 x 204 = 31416 digits :)
It'll run for a while, even at the reduced size, probably a good time to go and get your favorite drink...
In [4]:
width = 55  # 154
height = 97  # 204 
digit = pi_generate()

fig = plt.figure(figsize=((width + 2) / 3., (height + 2) / 3.))
ax = fig.add_axes((0.05, 0.05, 0.9, 0.9),
                  aspect='equal', frameon=False,
                  xlim=(-0.05, width + 0.05),
                  ylim=(-0.05, height + 0.05))

for axis in (ax.xaxis, ax.yaxis):
    axis.set_major_formatter(plt.NullFormatter())
    axis.set_major_locator(plt.NullLocator())
    
for j in range(height-1,-1,-1):
    for i in range(width):
        pi_digit = next(digit)
        ax.add_patch(Rectangle((i, j),
                               width=1,
                               height=1,
                               ec=sns.color_palette()[pi_digit],
                               fc=sns.color_palette()[pi_digit],
                              )
                    )
        ax.text(i + 0.5, j + 0.5,
                pi_digit, color='k',
                fontsize=10,
                ha='center', va='center')
ax.text(0,-1,"'THE 10 COLORS OF PI' by Francois Dion", fontsize=15)
Out[4]:
<matplotlib.text.Text at 0x10966f1d0>
Image

The original jupyter notebook can be downloaded here:
https://github.com/fdion/infographics_research/blob/master/The%2010%20colors%20of%20Pi.ipynb

Francois Dion
@f_dion