Mark the Graph: Python

Tuesday, May 12

Plotting time-series DataFrames in pandas

Pandas provides a convenience method for plotting DataFrames: DataFrame.plot. There is also a quick guide here.

Unfortunately, when it comes to time series data, I don't always find the convenience method convenient. I often have a sparse DataFrame with lots of NaNs, which are not ignored by the convenience method. Additionally, I don't like the way that matplotlib places the lines hard against the left and right-hand sides of the canvas. I like a little bit of space at each end of the chart. Finally, I like playing with the tick marks and tick labels to get the right density of information on the x-axis.

Rather than use the inconvenient convenience method, I regularly find myself writing a short function to produce a plot layout I find a little more aesthetically pleasing. An example chart (from Mark the Ballot) and the associated Python code follow.

import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, YearLocator, MonthLocator
plt.style.use('ggplot')

def plot(df, filename, heading=None):

    fig, ax = plt.subplots(figsize=(8, 4))

    min_date = None
    max_date = None
    for col_name in df.columns.values:

        # plot the column
        col = df[col_name]
        col = col[col.notnull()] # drop NAs
        dates = [zzz.to_timestamp().date() for zzz in col.index]
        ax.plot_date(x=dates, y=col, fmt='-', label=col_name,
            tz=None, xdate=True, ydate=False, linewidth=1.5)

        # establish the date range for the data
        if min_date:
            min_date = min(min_date, min(dates))
        else:
            min_date = min(dates)
        if max_date:
            max_date = max(max_date, max(dates))
        else:
            max_date = max(dates)

    # give a bit of space at each end of the plot - aesthetics
    span = max_date - min_date
    extra = int(span.days * 0.03) * datetime.timedelta(days=1)
    ax.set_xlim([min_date - extra, max_date + extra])

    # format the x tick marks
    ax.xaxis.set_major_formatter(DateFormatter('%Y'))
    ax.xaxis.set_minor_formatter(DateFormatter('\n%b'))
    ax.xaxis.set_major_locator(YearLocator())
    ax.xaxis.set_minor_locator(MonthLocator(bymonthday=1, interval=2))

    # grid, legend and yLabel
    ax.grid(True)
    ax.legend(loc='best', prop={'size':'x-small'})
    ax.set_ylabel('Percent')

    # heading
    if heading:
        fig.suptitle(heading, fontsize=12)
    fig.tight_layout(pad=1.5)

    # footnote
    fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', 
        va='bottom', fontsize=8, color='#999999')

    # save to file
    fig.savefig(filename, dpi=125)
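
The end-padding step can be pulled out of the plotting function, which makes it easier to check on its own. What follows is a minimal sketch, assuming a DataFrame with a monthly PeriodIndex as in the function above. The name padded_date_limits and the sample data are hypothetical, made up for illustration.

```python
import datetime
import numpy as np
import pandas as pd

def padded_date_limits(df, fraction=0.03):
    """Return (left, right) dates spanning all non-null data in df,
    widened by `fraction` of the span at each end."""
    starts, ends = [], []
    for col_name in df.columns:
        col = df[col_name].dropna()          # ignore NaNs, as in plot()
        if col.empty:
            continue
        dates = [p.to_timestamp().date() for p in col.index]
        starts.append(min(dates))
        ends.append(max(dates))
    min_date, max_date = min(starts), max(ends)
    span_days = (max_date - min_date).days
    extra = datetime.timedelta(days=int(span_days * fraction))
    return min_date - extra, max_date + extra

# hypothetical sparse data: each column is missing values at one end
idx = pd.period_range('2014-01', periods=12, freq='M')
df = pd.DataFrame({'a': np.arange(12.0), 'b': np.arange(12.0)}, index=idx)
df.iloc[:3, 0] = np.nan
df.iloc[-2:, 1] = np.nan

left, right = padded_date_limits(df)
print(left, right)
```

The two values returned could be passed straight to ax.set_xlim().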

Sunday, May 3

Using python statsmodels for OLS linear regression

This is a short post about using the python statsmodels package for calculating and charting a linear regression.

Let's start with some dummy data, which we will enter using IPython. We fake up normally distributed data around y ~ x + 10.

In [1]: import numpy as np

In [2]: x = np.random.randn(100)

In [3]: y = x + np.random.randn(100) + 10

We can plot this simply ...

In [4]: import matplotlib.pyplot as plt

In [5]: fig, ax = plt.subplots(figsize=(8, 4))

In [6]: ax.scatter(x, y, alpha=0.5, color='orchid')
Out[6]: 

In [7]: fig.suptitle('Example Scatter Plot')
Out[7]: 

In [8]: fig.tight_layout(pad=2); 

In [9]: ax.grid(True)

In [10]: fig.savefig('filename1.png', dpi=125)

That was easy. Next we will add a regression line, using the statsmodels package to calculate it. Lines 11 to 15 are where we model the regression. In lines 16 to 20 we calculate and plot the regression line.

The key trick is at line 12: we need to add the intercept term explicitly. Without this step, the regression model would be y ~ x rather than y ~ x + c. Similarly, at line 17, we include an intercept term in the data we provide to the predicting method at line 18. The sm.add_constant() function prepends a column of ones for the constant term in the regression model, returning a two-column numpy array. The first column is ones; the second column is our original data from above.

In [11]: import statsmodels.api as sm

In [12]: x = sm.add_constant(x) # constant intercept term

In [13]: # Model: y ~ x + c

In [14]: model = sm.OLS(y, x)

In [15]: fitted = model.fit()

In [16]: x_pred = np.linspace(x[:, 1].min(), x[:, 1].max(), 50) # data column only; x[:, 0] is the constant

In [17]: x_pred2 = sm.add_constant(x_pred)

In [18]: y_pred = fitted.predict(x_pred2)

In [19]: ax.plot(x_pred, y_pred, '-', color='darkorchid', linewidth=2)
Out[19]: []

In [20]: fig.savefig('filename2.png', dpi=125)
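
The effect of the constant column can also be seen without statsmodels. The sketch below is a numpy stand-in for the OLS fit, not the statsmodels implementation: it builds the same two-column design matrix by hand and solves the least-squares problem with np.linalg.lstsq. The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)            # fixed seed so the run is repeatable
x = rng.randn(100)
y = x + rng.randn(100) + 10

# column of ones first, data second: the layout described above
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

# without the column of ones the fitted line is forced through the origin
slope_only = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(intercept, slope, slope_only)
```

With the constant column the estimates land near the true values of 10 and 1. Without it, the single coefficient has to absorb the offset as best it can.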

If we wanted key data from the regression, the following would do the job, after line 15:

print(fitted.params)     # the estimated parameters for the regression line
print(fitted.summary())  # summary statistics for the regression

We can add a confidence interval for the regression. There is a 95 per cent probability that the true regression line for the population lies within the confidence interval for our estimate of the regression line calculated from the sample data. We will calculate this from scratch, largely because I am not aware of a simple way of doing it within the statsmodels package.

To get the necessary critical value from the t-distribution, I have imported the scipy stats package at line 27, and calculated the value at line 28.

In [22]: y_hat = fitted.predict(x) # x is an array from line 12 above

In [23]: y_err = y - y_hat

In [24]: mean_x = x.T[1].mean()

In [25]: n = len(x)

In [26]: dof = n - fitted.df_model - 1

In [27]: from scipy import stats

In [28]: t = stats.t.ppf(1-0.025, df=dof)

In [29]: s_err = np.sum(np.power(y_err, 2))

In [30]: conf = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((x_pred-mean_x),2) /
   ....:     ((np.sum(np.power(x.T[1],2))) - n*(np.power(mean_x,2))))))

In [31]: upper = y_pred + abs(conf)

In [32]: lower = y_pred - abs(conf)

In [33]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.4)
Out[33]: 

In [34]: fig.savefig('filename3.png', dpi=125)
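
For reference, the textbook confidence interval for the mean response in simple linear regression is written out below. The sums run over the n sample points, x_0 is a point on the prediction grid (x_pred in the session above), and the t quantile is the value computed at line 28. The mapping of symbols to the session's variable names is my reading of the code, not something stated in the original.

```latex
\hat{y}(x_0) \;\pm\; t_{0.975,\,n-2}
\sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}
\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}\right)}
```

Here the residual sum of squares is s_err, the sample mean of x is mean_x, and the denominator of the last fraction is the sum of squared deviations of the sample x values from their mean.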

The final step is a prediction interval. There is a 95 per cent probability that the real value of y in the population for a given value of x lies within the prediction interval. There is a statsmodels method in the sandbox we can use.

In [35]: from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [36]: sdev, lower, upper = wls_prediction_std(fitted, exog=x_pred2, alpha=0.05)

In [37]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.1)
Out[37]: 

In [38]: fig.savefig('filename4.png', dpi=125)

Saturday, February 21

Load ABS files to MySQL

When I am working in R, I tend to have my working data in a MySQL database. I found that R did not always play nicely (and quickly) with complex Microsoft Excel files.

Previously, I had quite a complex bit of python code to read Excel files and upload them to MySQL. I have now retooled the way in which I load files from the Australian Bureau of Statistics (ABS) to MySQL using Python pandas. The code is much simpler.

First, I store my MySQL database username and password in a file MysqlConnect.py (it is used by a number of different programs). It lives in the bin directory (../bin from where I do my work). And, just in case you are wondering: no, it is not my password.

host     = 'localhost'
user     = 'root'
password = 'BigRedCar'
database = 'dbase1'

Now let's move on to the function to load ABS files into MySQL. It lives in the bin directory (../bin from where I do my work), in a file named LoadABSToMySQL.py.

import pandas as pd
import pymysql
from sqlalchemy import create_engine
import os.path
import re

# local imports - a file that contains database login details
import MysqlConnect as MSC

def LoadABSToMySQL(pathName):
    """ Read an Excel file from the Australian Bureau of Statistics
        and load it into a MySQL database"""

    # --- 1 --- open MySQL
    s = 'mysql+pymysql://'+MSC.user+':'+MSC.password+'@'+MSC.host+'/'+MSC.database
    engine = create_engine(s)

    # --- 2 --- identify proposed table name from file name
    (head,tail) = os.path.split(pathName)
    tail = re.split(r'\.', tail) # raw string, so the backslash reaches the regex
    tablename = tail[0]

    # --- 3 --- open the XL file
    wb = pd.ExcelFile(pathName)

    # --- 4 --- load XL workbooks into a single DataFrame
    df = pd.DataFrame()
    for name in wb.sheet_names:

        # -- ignore junk
        if not 'Data' in name:
            continue

        # -- read
        tmp = wb.parse(sheetname=name, header=9, index_col=0, na_values=['', '-', ' '])

        # -- amalgamate
        df = pd.merge(left=df, right=tmp, how='outer', left_index=True, right_index=True)
        tmp = None

    # --- 5 --- write this DataFrame to MySQL
    df.to_sql(tablename, engine, index=True, if_exists='replace')
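
Steps 2, 4 and 5 can be tried without a MySQL server or an ABS workbook. The sketch below swaps in an in-memory SQLite database from the standard library as a stand-in for MySQL, and two small hand-made DataFrames as stand-ins for the 'Data' sheets. The helper name table_name_from_path is hypothetical, and an outer join is used in place of the pd.merge loop.

```python
import os.path
import sqlite3
import pandas as pd

def table_name_from_path(path_name):
    """Strip the directory and the extension from a file path."""
    return os.path.splitext(os.path.basename(path_name))[0]

tablename = table_name_from_path('./raw-data/5206001_key_aggregates.xls')

# two made-up sheets that share an index but not every row
sheet1 = pd.DataFrame({'A': [1.0, 2.0]}, index=['2014-03', '2014-06'])
sheet2 = pd.DataFrame({'B': [3.0, 4.0]}, index=['2014-06', '2014-09'])

# outer join on the index keeps every row; gaps become NaN
df = sheet1.join(sheet2, how='outer')

# SQLite stand-in for the MySQL engine
con = sqlite3.connect(':memory:')
df.to_sql(tablename, con, index=True, if_exists='replace')
back = pd.read_sql('SELECT * FROM "{}"'.format(tablename), con)
con.close()

print(back)
```

The table name is quoted in the SELECT because it starts with a digit.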

Finally, an example code snippet to load some of the ABS National Accounts files to MySQL. This file sits in my national accounts directory and has the rather unimaginative name py-load.py. The ABS Microsoft Excel files live in the ./raw-data sub-directory.

import sys
sys.path.append( '../bin' )

from LoadABSToMySQL import LoadABSToMySQL

dataDirectory = './raw-data/'
dataFiles = [
    '5206001_key_aggregates',
    '5206002_expenditure_volume_measures',
    '5206003_expenditure_current_price',
    '5206004_expenditure_price_indexes',
    '5206006_industry_gva',
    '5206008_household_final_consumption_expenditure',
    '5206022_taxes',
    '5206023_social_assistance_benefits',
    '5206024_selected_analytical_series'
]
dataSuffix = '.xls'

for f in dataFiles :
    LoadABSToMySQL(dataDirectory + f + dataSuffix)

To run this python load file, I have a BASH shell script, which I use on my iMac. It has another unimaginative name: run-load.sh.

#!/bin/bash

# mac os x fix ...
cd "$(dirname "$0")"

python ./py-load.py

Sunday, December 28

Updated cheat sheets: python, pandas and matplotlib

I have updated my cheat sheets for python, pandas and matplotlib.

You can find them here.

Sunday, August 3

Pandas 0.14.1

I have just upgraded to Pandas 0.14.1.

It was a pain. At first, none of my graphics programs worked. It looks like a change to the API for parsing Microsoft Excel files was the problem. I am not sure whether my previous approach was wrong (but worked serendipitously), or the API was deliberately changed to break old code (an unusual practice for a point release). If someone knows, I'd appreciate something in the comments below.

What follows are the classes I use to upload Australian Bureau of Statistics (ABS) and Reserve Bank of Australia (RBA) data, with the updates to the parsing stage commented.

And yes, I use Python 2.7, not Python 3 (it's what comes with the Apple Mac).

### ABSExcelLoader.py
### written in python 2.7 and pandas 0.14.1

import pandas as pd
assert( pd.__version__ >= '0.14.1' )

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for
           each Data work sheet in an ABS Excel spreadsheet"""

        wb = pd.ExcelFile(pathName)
        returnDict = {}

        for name in wb.sheet_names:
            if not 'Data' in name:
                continue

            # ExcelFile.parse: API behaviour change with pandas 0.14.1
            #df = wb.parse(sheetname=name, skiprows=8, header=9, index_col=0, na_values=['', '-', ' '])
            df = wb.parse(sheetname=name, skiprows=9, header=0, index_col=0, na_values=['', '-', ' '])

            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df

            if verbose:
                print ("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print (df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return returnDict
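
The "skip some rows, then take the next row as the header" reading of skiprows and header can be tried on a small text file. The sketch below uses pd.read_csv on an in-memory CSV as a stand-in for the Excel parser, on the assumption that the two arguments mean the same thing there. The nine preamble lines and the column names are made up.

```python
import io
import pandas as pd

# nine rows of preamble, then a header row, then the data
raw = "\n".join(
    ["metadata line {}".format(i) for i in range(9)]
    + ["Date,GDP,CPI", "2014-03,100.0,1.0", "2014-06,101.5,1.2"]
)

# skiprows=9 drops the preamble; header=0 is then the first remaining row
df = pd.read_csv(io.StringIO(raw), skiprows=9, header=0, index_col=0,
                 na_values=['', '-', ' '])

# same conversion of the index to periods as in the loader above
periods = pd.PeriodIndex(list(df.index), freq='M')
df.index = periods

print(df)
```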

### RBAExcelLoader.py
### written in python 2.7 and pandas 0.14.1

import pandas as pd
assert( pd.__version__ >= '0.14.1' )

class RBAExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a pandas DataFrame for an RBA Excel spreadsheet"""
        wb = pd.ExcelFile(pathName)
        sheetname = 'Data'

        # ExcelFile.parse: API behaviour change with pandas 0.14.1
        #df = wb.parse(sheetname, skiprows=9, header=10, index_col=0, na_values=['', '-', ' '])
        df = wb.parse(sheetname, skiprows=10, header=0, index_col=0, na_values=['', '-', ' '])

        periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
        df.set_index(keys=periods, inplace=True)

        if verbose:
            print("\nFile: '{}', sheet: '{}'".format(pathName, sheetname))
            print('Columns: {}'.format(df.columns.tolist()))
            print('Top left hand corner ...')
            print('------------------------')
            print(df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return df
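
A caution on the version assert in both loaders: comparing version strings compares them character by character, so '0.9.0' counts as greater than '0.14.1'. A minimal sketch of a numeric comparison follows. The helper version_tuple is hypothetical and only handles plain dotted versions with an optional suffix.

```python
import re

def version_tuple(version):
    """Leading numeric components of a version string,
    e.g. '0.14.1' gives (0, 14, 1)."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

print('0.9.0' >= '0.14.1')                                # True: string comparison
print(version_tuple('0.9.0') >= version_tuple('0.14.1'))  # False: numeric comparison
```

In a loader this would be used as assert version_tuple(pd.__version__) >= (0, 14, 1).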

Monday, June 23

Henderson Moving Average

I have posted my R code for a Henderson moving average here. This is the same code in python.

## Henderson.py
## calculate a Henderson moving average

import pandas as pd
import numpy as np

def hmaSymmetricWeights(n):
    """ derive an n-term array of symmetric 'Henderson Moving Average' weights
        formula from ABS (2003), 'A Guide to Interpreting Time Series', page 41.
        returns a numpy array of symmetric Henderson weights indexed from 0 to n-1"""

    # calculate the constant denominator and terms
    m = int((n-1)//2) # the mid point - n must be odd
    m1 = (m+1)*(m+1)
    m2 = (m+2)*(m+2)
    d = float(8*(m+2)*(m2-1)*(4*m2-1)*(4*m2-9)*(4*m2-25))
    m3 = (m+3)*(m+3)

    # calculate the weights
    w = np.repeat(np.nan, n) # Actually indexed from 0 to n-1
    for j in range(m+1):
        j2 = j*j
        v = (315*(m1-j2)*(m2-j2)*(m3-j2)*(3*m2-11*j2-16))/d
        w[(m+j)] = v
        if j > 0:
            w[(m-j)] = v

    w.flags.writeable = False # let's make it quasi-immutable
    return (w)


def hmaAsymmetricWeights(m, w):
    """calculate the asymmetric end-weights

        w --> an array of symmetrical henderson weights (from above function)
        m --> the number of asymmetric weights sought; where m < len(w);

        returns a numpy array of asymmetrical weights, indexed from 0 to m-1;

        formula from Mike Doherty (2001), 'The Surrogate Henderson Filters in X-11',
        Aust, NZ J of Stat. 43(4), 2001, pp901-999; see formula (1) on page 903"""

    n = len(w) # the number of weights

    # - some quick sanity checks
    if m >= n:
        raise ValueError('The m argument must be less than n')
    if m <= int((n-1)//2):
        raise ValueError('The m argument must be greater than (n-1)/2')

    # --- let's build up Doherty's formula (1) from the top of page 903

    # - the second chunk of the formula
    sumResidual = w[range(m, n)].sum() / float(m)

    # - the last chunk of the formula
    sumEnd = 0.0
    for i in range(m+1, n+1):
        sumEnd += (float(i)-((m+1.0)/2.0)) * w[i-1] # w indexed from 0 to n-1

    # - the beta squared / sigma squared - formula at the bottom of page 904
    ic = 1.0
    if n >= 13 and n < 15:
        ic = 3.5
    elif n >= 15:
        ic = 4.5
    b2s2 = (4.0/np.pi)/(ic*ic)

    # - the gnarly bit in the middle of the formula
    denominator = 1.0 + ((m*(m-1.0)*(m+1.0) / 12.0 ) * b2s2)
    u = np.repeat(np.nan, m) # return series - created empty
    for r in range(m): # r ranges 0 to m-1; but the formula assumes 1 to m
        numerator = ((r+1.0) - (m+1.0)/2.0) * b2s2
        # - finally putting it all together
        u[r] = w[r] + sumResidual + ( numerator / denominator ) * sumEnd

    u.flags.writeable = False # let's make it quasi-immutable
    return (u)


def Henderson(s, n):
    """ Calculate an n-term Henderson Moving Average for the Series s
        Note: we blithely assume s is ordered, contiguous and without missing data"""

    # - some simple sanity checks
    if not isinstance(s, pd.core.series.Series):
        raise TypeError('The s argument should be a pandas Series')
    if not isinstance(n, int):
        raise TypeError('The n argument must be an integer')
    if n < 5:
        raise ValueError('The n argument must be >= 5')
    if n % 2 == 0:
        raise ValueError('The n argument must be odd')
    if len(s) < n:
        raise ValueError('The s argument should be a Series longer than n')

    # - calculate the symmetric weights
    w = hmaSymmetricWeights(n)

    # preliminaries
    r = pd.Series(np.repeat(np.nan, len(s)), index=s.index) # the empty return vehicle
    m = int((n-1)//2)
    l = len(s)

    # - and now move over the length of the series ...
    for i in range(len(s)) :
        if i < m:
            # --- head section of series
            u = hmaAsymmetricWeights(m+i+1, w)[::-1] # reverse - asymmetric to the left
            r.iloc[i] = (s.iloc[0:(i+m+1)] * u).sum()
        elif i + m >= l:
            # --- tail section of series
            u = hmaAsymmetricWeights(m+l-i, w)
            r.iloc[i] = (s.iloc[(i-m):l] * u).sum()
        else:
            # --- middle section of series
            r.iloc[i] = (s.iloc[(i-m):(i+m+1)] * w).sum()

    return (r)


### - test code
#--------------
# Check against Table 1 in B Quenneville and B Lefrancois (2001),
# "Implicit Forecasts in Musgrave Asymmetric Averages",
# Proceedings of the Annual Meeting of the American Statistical Association,
# August 5-9, 2001.
#--------------
#w = hmaSymmetricWeights(9)
#print(w)
#print(w.sum()) # should be one
#u = hmaAsymmetricWeights(7, w)
#print(u)
#print(u.sum()) # should be one
#--------------
#print (Henderson(pd.Series(range(30))+pd.Series(np.random.randn(30)), 13))
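
A quick way to sanity-check the symmetric weights is to compute them in one vectorised step and confirm that they sum to one and leave a straight line unchanged. The sketch below restates the formula used in hmaSymmetricWeights with numpy arrays. The name henderson_weights and the test series are illustrative only.

```python
import numpy as np

def henderson_weights(n):
    """Symmetric n-term Henderson weights (n odd), vectorised over j."""
    m = (n - 1) // 2
    m1, m2, m3 = (m + 1) ** 2, (m + 2) ** 2, (m + 3) ** 2
    d = 8.0 * (m + 2) * (m2 - 1) * (4 * m2 - 1) * (4 * m2 - 9) * (4 * m2 - 25)
    j2 = np.arange(-m, m + 1) ** 2          # squared offsets from the mid point
    return 315.0 * (m1 - j2) * (m2 - j2) * (m3 - j2) * (3 * m2 - 11 * j2 - 16) / d

w5 = henderson_weights(5)
print(np.round(w5, 5))

# applying the weights to a straight line leaves the interior unchanged
line = 2.0 + 3.0 * np.arange(20)
smoothed = np.convolve(line, w5, mode='valid')
print(np.allclose(smoothed, line[2:-2]))
```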

Saturday, January 25

Another Python Cheat Sheet

This time for matplotlib.

Sunday, January 5

Pandas DataFrame cheat sheet and the Python v R debate

This has taken a lot longer than I thought it would. But I now have a rough, early draft of a cheat sheet on the Python pandas DataFrame object. When dealing with such a rich environment, it is challenging to decide what to include and exclude in a four-page set of notes. As always, comments and suggestions for improvement are welcome. Doubtless there are many typographical errors that another set of eyes and a ruthless proofreading would eradicate.

I have also been thinking about the Python versus R debate. Which is the better tool for data analysis? Clearly, this is one of those questions to which there is no ultimate truth. Your answer will depend on what sort of analysis you do. My analytical interest is primarily data capture and visualization. While I do some simple statistical analyses, I do not use R in a deep statistical way. I do, however, depend heavily on Hadley Wickham's excellent graphics package ggplot2.

Before coming to the question, I should expose my relative experience with the two languages. I have been using R as my primary analytical language for a number of years. Over the same period I have used Python as well; albeit less frequently, and largely for data munging (loading data from various data sources into a MySQL server from which R can access it).

On the criterion of data munging, Python beats R hands down. Python is simply better able to manage dirty data in all its forms and is better for automating the cleaning and loading operations.

Another area where I think Python beats R is coding time. My impression is that I code at least twice as fast in Python. I spend much less time debugging code. I became a very defensive coder in R. Every function I wrote religiously tested the arguments to see that the values were of the right type and within the expected ranges. It was not unusual to start a function with 6 to 12 stopifnot() statements. Even so, I simply code faster in Python. There are a few reasons for this. First, while both languages are very expressive (compared with C, C++ or Java), I find Python the more expressive language. List comprehensions and generator expressions are powerful tools for tight code. While environments in R come close, they are nowhere near as natural to use as dictionaries in Python. Second, Python's much stronger typing better protects me from my poor typing skills (no pun intended). Third, my learning curve with Python was much shorter than for R. But again, this may just be a product of my background (as someone coming from the C, C++, Objective-C and Java programming paradigms).

On graphics, I think Hadley Wickham's ggplot2 beats the competition from Python in the form of matplotlib. But work is afoot to replicate ggplot2 in the Python environment. When that work is well progressed I might just change ships.

Another area where R leads is idiomatic coherence. While the pandas DataFrame object is an immensely rich environment, it feels cobbled together and a little rough at the edges (it does not feel coherently designed from the ground up). Take the myriad of indexing options: [], .loc[], .iloc[], .ix[], .iat[], .at[], .xs(), which feel like the maze of twisty little passages, all alike, but each a little different. And then there are the confusing rules (for example, single indexes on a DataFrame return columns but single sliced indexes return rows). Furthermore, the Pythonic notion of container truthiness was not maintained by pandas (in the rest of Python, empty containers are False while non-empty containers are True, regardless of what they contain). I could go on. But, simply put, data.frames in R are more coherent with the rest of the R language than the DataFrame object is with the rest of Python.
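
A short sketch of the two rules mentioned above, using a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

column = df['a']    # a single label selects a column (a Series)
rows = df[0:2]      # a slice selects rows (a DataFrame)

# plain Python containers have a truth value ...
print(bool([]), bool([0]))   # False True

# ... but asking a DataFrame for one raises an exception
try:
    bool(df)
    truthiness = 'allowed'
except ValueError:
    truthiness = 'ValueError'
print(truthiness)
```

The df.empty attribute is the pandas way to ask whether a DataFrame has no data.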

And another point of comparison in R's favour: the R help system is much more helpful than its counterpart from Python and pandas.

Finally, there is something of the nanny state that annoys the hell out of me in both Hadley Wickham's ggplot2 and Wes McKinney's DataFrame. Hadley won't let you plot a chart with two y-axes. Wes won't let you plot a time series as a bar chart. I can see the arguments for both protections. But really, is prohibition needed? Ironically, you can commit Hadley's unforgivable sin under Wes' DataFrame. And Hadley will happily let you plot a time series as a bar chart. It is time both Hadley and Wes embraced liberalism.

Wednesday, January 1

Reading ABS Excel files in pandas

Experimental code follows for reading Excel files from the Australian Bureau of Statistics.

### ABSExcelLoader.py
### written in python 2.7 and pandas 0.12

import pandas as pd

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for
           each Data work sheet in an ABS Excel spreadsheet"""

        wb = pd.ExcelFile(pathName)
        returnDict = {}

        for name in wb.sheet_names:
            if not 'Data' in name:
                continue

            df = wb.parse(name, skiprows=8, header=9, index_col=0)
            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df

            if verbose:
                print ("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print (df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return returnDict

Saturday, December 28

Python cheat sheet

I am working on two Python cheat sheets. The first one is ready for review. It covers the basics of Python. The second one, still in development, covers the pandas DataFrame object.

As always - corrections and suggestions for improvement welcome.