Introduction to Scientific Programming with Python¶
Session 4: NumPy, SciPy & matplotlib
Denis Tome' -- [email protected] -- http://www0.cs.ucl.ac.uk/staff/D.Tome/
Importing numpy, scipy, and matplotlib¶
%matplotlib inline
Start the Ipython process with¶
ipython notebook or ipython qtconsole, depending on your prefered interface.
Start your notebook with the following command - it enables the interactive plotting in your notebook or qtconsole:
or use
%matplotlib qt
if you are using the qtconsole interface
Get plotting tools¶
we need both plotting tools and numpy for this tutorial
import matplotlib.pyplot as plt
import numpy as np
Basic Plotting¶
- Basic line and scatter plots can be created using the plot function from the pylab package
- Below, the first list determines the x-coordinates and the second list the y -coordinates.
plt.plot([1,2,3],[4,2,6])
[<matplotlib.lines.Line2D at 0x107b67f98>]
- Use the clf() function to clear a figure, or the figure() function to start a new one.
- Commonly used markers are ".", "o", "x" and line types "-", "--", "-". but have a look at the help of the plot command if you need more.
plt.plot([1,2,3,4],[1,2,6,7], 'o');
Note the ; at the end of the plot statement above: it stops the statement from returning and so the [<matplotlib.lines.Line2D at 0xb64b0b8>] is not displayed.
help(plt.plot)
Help on function plot in module matplotlib.pyplot:
plot(*args, **kwargs)
Plot lines and/or markers to the
:class:`~matplotlib.axes.Axes`. *args* is a variable length
argument, allowing for multiple *x*, *y* pairs with an
optional format string. For example, each of the following is
legal::
plot(x, y) # plot x and y using default line style and color
plot(x, y, 'bo') # plot x and y using blue circle markers
plot(y) # plot y using x as index array 0..N-1
plot(y, 'r+') # ditto, but with red plusses
If *x* and/or *y* is 2-dimensional, then the corresponding columns
will be plotted.
If used with labeled data, make sure that the color spec is not
included as an element in data, as otherwise the last case
``plot("v","r", data={"v":..., "r":...)``
can be interpreted as the first case which would do ``plot(v, r)``
using the default line style and color.
If not used with labeled data (i.e., without a data argument),
an arbitrary number of *x*, *y*, *fmt* groups can be specified, as in::
a.plot(x1, y1, 'g^', x2, y2, 'g-')
Return value is a list of lines that were added.
By default, each line is assigned a different style specified by a
'style cycle'. To change this behavior, you can edit the
axes.prop_cycle rcParam.
The following format string characters are accepted to control
the line style or marker:
================ ===============================
character description
================ ===============================
``'-'`` solid line style
``'--'`` dashed line style
``'-.'`` dash-dot line style
``':'`` dotted line style
``'.'`` point marker
``','`` pixel marker
``'o'`` circle marker
``'v'`` triangle_down marker
``'^'`` triangle_up marker
``'<'`` triangle_left marker
``'>'`` triangle_right marker
``'1'`` tri_down marker
``'2'`` tri_up marker
``'3'`` tri_left marker
``'4'`` tri_right marker
``'s'`` square marker
``'p'`` pentagon marker
``'*'`` star marker
``'h'`` hexagon1 marker
``'H'`` hexagon2 marker
``'+'`` plus marker
``'x'`` x marker
``'D'`` diamond marker
``'d'`` thin_diamond marker
``'|'`` vline marker
``'_'`` hline marker
================ ===============================
The following color abbreviations are supported:
========== ========
character color
========== ========
'b' blue
'g' green
'r' red
'c' cyan
'm' magenta
'y' yellow
'k' black
'w' white
========== ========
In addition, you can specify colors in many weird and
wonderful ways, including full names (``'green'``), hex
strings (``'#008000'``), RGB or RGBA tuples (``(0,1,0,1)``) or
grayscale intensities as a string (``'0.8'``). Of these, the
string specifications can be used in place of a ``fmt`` group,
but the tuple forms can be used only as ``kwargs``.
Line styles and colors are combined in a single format string, as in
``'bo'`` for blue circles.
The *kwargs* can be used to set line properties (any property that has
a ``set_*`` method). You can use this to set a line label (for auto
legends), linewidth, anitialising, marker face color, etc. Here is an
example::
plot([1,2,3], [1,2,3], 'go-', label='line 1', linewidth=2)
plot([1,2,3], [1,4,9], 'rs', label='line 2')
axis([0, 4, 0, 10])
legend()
If you make multiple lines with one plot command, the kwargs
apply to all those lines, e.g.::
plot(x1, y1, x2, y2, antialiased=False)
Neither line will be antialiased.
You do not need to use format strings, which are just
abbreviations. All of the line properties can be controlled
by keyword arguments. For example, you can set the color,
marker, linestyle, and markercolor with::
plot(x, y, color='green', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=12).
See :class:`~matplotlib.lines.Line2D` for details.
The kwargs are :class:`~matplotlib.lines.Line2D` properties:
agg_filter: unknown
alpha: float (0.0 transparent through 1.0 opaque)
animated: [True | False]
antialiased or aa: [True | False]
axes: an :class:`~matplotlib.axes.Axes` instance
clip_box: a :class:`matplotlib.transforms.Bbox` instance
clip_on: [True | False]
clip_path: [ (:class:`~matplotlib.path.Path`, :class:`~matplotlib.transforms.Transform`) | :class:`~matplotlib.patches.Patch` | None ]
color or c: any matplotlib color
contains: a callable function
dash_capstyle: ['butt' | 'round' | 'projecting']
dash_joinstyle: ['miter' | 'round' | 'bevel']
dashes: sequence of on/off ink in points
drawstyle: ['default' | 'steps' | 'steps-pre' | 'steps-mid' | 'steps-post']
figure: a :class:`matplotlib.figure.Figure` instance
fillstyle: ['full' | 'left' | 'right' | 'bottom' | 'top' | 'none']
gid: an id string
label: string or anything printable with '%s' conversion.
linestyle or ls: ['solid' | 'dashed', 'dashdot', 'dotted' | (offset, on-off-dash-seq) | ``'-'`` | ``'--'`` | ``'-.'`` | ``':'`` | ``'None'`` | ``' '`` | ``''``]
linewidth or lw: float value in points
marker: :mod:`A valid marker style <matplotlib.markers>`
markeredgecolor or mec: any matplotlib color
markeredgewidth or mew: float value in points
markerfacecolor or mfc: any matplotlib color
markerfacecoloralt or mfcalt: any matplotlib color
markersize or ms: float
markevery: [None | int | length-2 tuple of int | slice | list/array of int | float | length-2 tuple of float]
path_effects: unknown
picker: float distance in points or callable pick function ``fn(artist, event)``
pickradius: float distance in points
rasterized: [True | False | None]
sketch_params: unknown
snap: unknown
solid_capstyle: ['butt' | 'round' | 'projecting']
solid_joinstyle: ['miter' | 'round' | 'bevel']
transform: a :class:`matplotlib.transforms.Transform` instance
url: a url string
visible: [True | False]
xdata: 1D array
ydata: 1D array
zorder: any number
kwargs *scalex* and *scaley*, if defined, are passed on to
:meth:`~matplotlib.axes.Axes.autoscale_view` to determine
whether the *x* and *y* axes are autoscaled; the default is
*True*.
.. note::
In addition to the above described arguments, this function can take a
**data** keyword argument. If such a **data** argument is given, the
following arguments are replaced by **data[<arg>]**:
* All arguments with the following names: 'x', 'y'.
- Other useful functions (see below)
- Try it out!
plt.plot([1,2,3],[1,2,3])
plt.title("A line")
plt.xlabel("X-axis")
plt.ylabel("Y-axis");
Example: Plotting a function¶
- Let’s say you want to plot the function (in the math sense) $f (x) = 2x^2 + 3$ for $x$ in the range $[−3, 3]$. How would you do it?
# Define a Python function that computes the function value
def f(x):
return 2*x**2 + 3
print('f(0):', f(0))
print('f(-3):', f(-3))
print('f(3):', f(3))
plt.plot([-3, 0, 3], [f(-3), f(0), f(3)]);
f(0): 3 f(-3): 21 f(3): 21
- We can use range() and list comprehensions
xs = range(-3, 4)
plt.plot(xs, [f(x) for x in xs],'o-');
- Need smaller intervals ⇒ use linspace or arange
print('np.linspace\n', np.linspace(2.0, 3.0, num=5))
print('np.arange\n', np.arange(2, 3, 0.5))
np.linspace [ 2. 2.25 2.5 2.75 3. ] np.arange [ 2. 2.5]
xs_1 = np.arange(-3,3,0.1)
plt.plot(xs_1, [f(x) for x in xs_1]);
xs_2 = np.linspace(-3,3,100)
plt.plot(xs_2, [f(x) for x in xs_2]);
- Wouldn’t it be nice if we could just write
plt.plot(xs_2, 2*xs_2**2 + 3);
- Oh, wait, we can!
- What is going on? The magic here is that
arangeandlinspacedoes not return a list likerange, but anumpy.ndarray. - If we try it with a list, it doesn’t work
xs = range(-3,4)
plt.plot(xs, 2*xs**2 + 3)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-69-c9b93f34ec9e> in <module>() 1 xs = range(-3,4) ----> 2 plt.plot(xs, 2*xs**2 + 3) TypeError: unsupported operand type(s) for ** or pow(): 'range' and 'int'
Example: Linear Regression¶
- Given pairs $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}),\ldots, (x^{(N)}, y^{(N)})\}$, where $x^{(i)} \in \mathbb{R}^D,y^{(i)} \in \mathbb{R}$
- Goal find $w$ such that $y^{(i)} \approx f(x^{(i)}) := w^T x^{(i)}$
- Measure of (mean squared) error $$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^N (y^{(i)} - f(x^{(i)}))^2$$
Rewrite as matrices/vectors $$y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{pmatrix}, \quad X = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\x^{(N)} \end{pmatrix}= \begin{pmatrix} x^{(1)}_1 & \cdots & x^{(1)}_{D} \\x_{1}^{(2)} & \cdots & x_{D}^{(2)} \\\vdots & \ddots & \vdots \\ x^{(N)}_{1} & \cdots &x^{(N)}_{D} \end{pmatrix}$$
Solution: $w_{ls} = X^{\dagger}y = (X^TX)^{-1}X^T y$
Simpler scalar case:¶
If each of the $x^{(i)}$ is just a scalar, i.e. it is in $\mathbb{R}$, the solutions simplifies to
$$\Rightarrow w= \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i x^2} $$
Turn the math into python¶
- The simplest case with scalar $x$s can be written as
def f(x):
return 2 * x
# Training data (x,y) pairs)
X = np.arange(10)
y = [f(xi) for xi in X]
# plot training data
plt.plot(X,y, 'o')
plt.title("Training data")
plt.show() # causes the plot to be shown NOW
# linear regression in 1D -> SIMPLEST SCENARIO
w = ( sum([X[i]*y[i] for i in range(len(X))])/ float(sum([x**2 for x in X])) )
print("Learned w:", w)
Learned w: 2.0
- How does the learned function look like?
plt.plot(X,y, 'o')
plt.plot(X, w*X, 'r-')
plt.title("Linear Regression")
plt.legend(["Training data", "Learned linear function"]);
The ndarray type¶
- The numpy.ndarray type represents a multidimensional, homogeneous array of fixed-size items
- You can think of this as a table, where each cell is of the same type and is indexed by a tuple of integer indices
- Vectors (1-D) and matrices (2-D) are the most common examples, but higher-dimensional arrays are also often useful
- For example, you can represent a video as a 4-D array (x, y, frame, channel)
- A 1-D ndarray is similar to a list of numbers
- However it makes it very easy to apply the same (mathematical) operation to all elements in the array
- They also make it very easy to compute aggregates, like sums, means, products, sum-products, etc.
r = np.arange(1,4,1)
print('Element-wise operations')
print("r:", r)
print(3*r + 2)
print(np.sqrt(r))
print('\nOperations considering all the elements')
print(np.sum(r))
print(np.mean(r))
print(np.prod(r))
Element-wise operations r: [1 2 3] [ 5 8 11] [ 1. 1.41421356 1.73205081] Operations considering all the elements 6 2.0 6
- ndarray objects have two important attributes:
- The data type (dtype) of the objects in the cells
- The shape of the array, describing the dimensionality and the size of the individual dimensions
a = np.zeros(shape = (3, 5), dtype = np.int64)
print("Array a:\n %r \n" % a)
print('Type of the variable:\n %r \n' % type(a))
print('Type of the objects in the cells:\n %r \n' % a.dtype)
print('Shape of the array:\n', a.shape)
Array a:
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
Type of the variable:
<class 'numpy.ndarray'>
Type of the objects in the cells:
dtype('int64')
Shape of the array:
(3, 5)
Accessing Elements¶
- Access to the individual elements is done using the familiar [ ] syntax
- IMPORTANT: The first element is the the one in position 0, not 1 (like Matlab)
a = np.zeros(shape = (3, 5), dtype = np.int64)
print('Element in position [0,0]:', a[0,0])
print(a)
Element in position [0,0]: 0 [[0 0 0 0 0] [0 0 0 0 0] [0 0 0 0 0]]
a[0,0] = 1
a[2,3] = 5
print(a)
[[1 0 0 0 0] [0 0 0 0 0] [0 0 0 5 0]]
Creating arrays¶
- Arrays can be created in various ways:
- from (nested) list objects (using array) by creating an empty array and then assigning values (using empty/zeros)
- using built-in functions for special arrays (ones, eye, rand, randn, . . . )
a = np.array([3, 2, 1, 2])
print(a)
[3 2 1 2]
b = np.array([2.0, 4.1, 3.14])
print(b)
[ 2. 4.1 3.14]
c = np.array([[3, 1, 2], [2, 3, 4]])
print(c)
[[3 1 2] [2 3 4]]
- Note that numpy tries to guess the data type:
print(a.dtype)
print(b.dtype)
int64 float64
- We can force a particular data type by providing a dtype
a = np.array([3,2,1,2], dtype=np.float64)
print(a.dtype)
print(a)
float64 [ 3. 2. 1. 2.]
b = np.array([3,2,1,2], dtype=np.uint8)
print(b.dtype)
print(b)
uint8 [3 2 1 2]
- CAREFUL ABOUT OVER/UNDERFLOWS!
b[0] = 256
print(b)
[0 2 1 2]
Data types¶
- Numpy provides a range of data types
- floating point data types:
float32,float64 - integer data types:
int64,int32, . . . ,uint8 - object data type: object – any Python object
- Unless you are sure you need something else, use
float64. This is the default data type in numpy. - Exceptions to this rule:
- use
int32,int64when you need to store (large) integers (e.g. counts) - use objects when you need to store other Python objects (dicts, lists, etc.)
Creating special arrays¶
- Numpy provides various functions for creating useful special arrays. The most common ones are:
zeros– create an array filled with zerosones– create an array filled with onesempty– create an array, but don’t initialize it’s elements.arange– similar to rangeeye– identity matrix of a given sizerandom.rand,random.randn– random arrays (uniform, normal)
np.zeros((2,3))
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
np.ones((2,3))
array([[ 1., 1., 1.],
[ 1., 1., 1.]])
np.empty((2,3))
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
np.arange(5,10)
array([5, 6, 7, 8, 9])
np.eye(2)
array([[ 1., 0.],
[ 0., 1.]])
np.random.rand(2,3)
array([[ 0.95800881, 0.27233108, 0.54525544],
[ 0.51234556, 0.17209455, 0.03634038]])
np.random.randn(2,3)
array([[-0.50142349, -1.42633263, -0.39798825],
[ 1.54177749, 0.01696024, 0.06331506]])
plt.hist(np.random.randn(10000));
plt.hist(np.random.rand(100000));
Dimensionality¶
- Numpy arrays support the slice notation familiar from strings and lists
- Can be used to index contiguous sub-arrays
- (Very similar to the notation used in Matlab)
a = np.eye(3)
print(a)
[[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]]
a[0,:]
array([ 1., 0., 0.])
- which has exactly the same effect of writing
a[0]
array([ 1., 0., 0.])
- To get a set of elements
- from position $i$
- to position $j-1$
a[0:2,0:2]
array([[ 1., 0.],
[ 0., 1.]])
a[1:2,0:2]
array([[ 0., 1.]])
a[0:1,:]
array([[ 1., 0., 0.]])
- We can also assign new values to a subarray
print(a)
[[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]]
a[0:2,0:2] = 5
print(a)
[[ 5. 5. 0.] [ 5. 5. 0.] [ 0. 0. 1.]]
a[1:3,1:3] = np.ones((2, 2))
print(a)
[[ 5. 5. 0.] [ 5. 1. 1.] [ 0. 1. 1.]]
Accessing Subarrays – Boolean indexing¶
- Comparison operatorions on arrays return boolean arrays of the same size
a = np.eye(2)
a == 1
array([[ True, False],
[False, True]], dtype=bool)
a > 0
array([[ True, False],
[False, True]], dtype=bool)
a <= 0
array([[False, True],
[ True, False]], dtype=bool)
- These boolean arrays can then be used for indexing the original array
a = np.eye(2)
print(a)
[[ 1. 0.] [ 0. 1.]]
a[a == 1]
array([ 1., 1.])
a[a == 1] = 2
print(a)
[[ 2. 0.] [ 0. 2.]]
Elementwise operations¶
- Numpy contains many standard mathematical functions that operate elementwise on arrays
- These are called universal functions (ufuncs)
- Included are things like add, multiply, log, exp, sin, . . .
- See http://docs.scipy.org/doc/numpy/reference/ufuncs.html
- These are very fast (in constrast to doing the same by loops)
a = np.ones((2, 2))*2
print(a)
[[ 2. 2.] [ 2. 2.]]
np.add(a, a)
array([[ 4., 4.],
[ 4., 4.]])
a * a
array([[ 4., 4.],
[ 4., 4.]])
np.log(a)
array([[ 0.69314718, 0.69314718],
[ 0.69314718, 0.69314718]])
DON'T DO THIS, it is very slow (for large arrays):
result = np.empty(a.shape)
for i in range(a.shape[0]):
for j in range(a.shape[1]):
result[i,j] = a[i,j]*2
print(result)
[[ 4. 4.] [ 4. 4.]]
Broadcasting¶
- Broadcasting is how numpy handles operations between arrays of different, but compatible shapes
- For example, you might want to add a column vector to all columns of a matrix
a = np.ones((2, 2))
b = 2 * np.ones((2, 1))
print("a:")
print(a)
print("b")
print(b)
print("a+b")
print(a + b)
a: [[ 1. 1.] [ 1. 1.]] b [[ 2.] [ 2.]] a+b [[ 3. 3.] [ 3. 3.]]
a = np.ones((3, 2))
b = np.array([4, 5])
print("a:", a.shape)
print(a)
print("b:", b.shape)
print(b)
a: (3, 2) [[ 1. 1.] [ 1. 1.] [ 1. 1.]] b: (2,) [4 5]
print(a + b)
[[ 5. 6.] [ 5. 6.] [ 5. 6.]]
print(a * b)
[[ 4. 5.] [ 4. 5.] [ 4. 5.]]
Reductions¶
- One of the most common operation one needs to do to an array is to sum its values along one of it’s dimensions.
- Again, this is much faster than looping over the array by hand.
a = np.array([[1, 2], [3, 4]])
print(a)
[[1 2] [3 4]]
np.sum(a)
10
np.sum(a, axis = 0)
array([4, 6])
np.sum(a, 1)
array([3, 7])
- There are other reductions the behave the same way as sum, for example
print(a)
[[1 2] [3 4]]
np.prod(a)
24
np.mean(a)
2.5
np.mean(a, 0)
array([ 2., 3.])
np.cumsum(a)
array([ 1, 3, 6, 10])
Vector/Matrix operations¶
- Many algorithms in scientific computing can be cast in a form that makes use of only a few linear algebra primitves.
- Probably the most common such primitives are the dot-product between vectors, matrix-vector products, as well as matrix-matrix products.
- The product between a matrix $A$ of size $M \times N$ and a matrix $B$ of size $N \times K $, written as $\langle A, B \rangle$ or just $AB$ is defined as $$(AB)_{ij} :=\sum_k A_{ik} B_{kj}$$
- Numpy uses a single functions for all these operations: dot, both for arrays and matrices:
a = np.ones(2) * 3
print("a")
print(a)
a [ 3. 3.]
b = np.array([[1, 2]]).T
print(b)
[[1] [2]]
A = np.random.rand(2,2)
print(A)
[[ 0.07784416 0.39286506] [ 0.71671045 0.5990694 ]]
np.dot(a,A)
array([ 2.38366385, 2.97580337])
np.dot(A,b)
array([[ 0.86357428],
[ 1.91484926]])
np.dot(A,A)
array([[ 0.28763021, 0.26593569],
[ 0.48515103, 0.64045464]])
Regression revisited¶
- Numpy makes your life much easier
# Training data (x,y) pairs)
X = np.array(range(10))
y = 2 * X
# linear regression in 1D
w = X.T.dot(y) / X.T.dot(X)
print("Learned w:", w)
Learned w: 2.0
Linear Regression on noisy data¶
- Use
randnto add (Gaussian) noise $\mathcal{N}(0,1)$ to data - Real Life data almost always contains noise!
- The normal distribution often is a good approximation to that, cf https://en.wikipedia.org/wiki/Gaussian_noise
X = np.array(range(10))
y = 2 * X + np.random.randn(10)
w = X.T.dot(y) / X.T.dot(X)
plt.plot(X,y, 'o')
plt.plot(X, w * X, 'r-')
plt.title("Linear Regression on noisy data")
plt.legend(["Noisy training data", "Learned linear function"])
<matplotlib.legend.Legend at 0x1086e4208>
- Compare with the old code
- The numpy one is not only easier to read but also much (!) faster
- Worth diving into numpy documentation and built-in functions
w1 = X.T.dot(y) / X.T.dot(X)
w2 = ( sum([X[i]*y[i] for i in range(len(X))])/ float(sum([x**2 for x in X])) )
print(w1, w2)
2.02851564299 2.02851564299
Scipy Overview¶
- SciPy is large collection of (sub-)packages that contains a variety of functions that are useful for scientific computing
- Impossible to cover all, check out the documentation and examples at http://wiki.scipy.org/Cookbook
- In particular, there are functions for
- Special functions (scipy.special)
- Integration (scipy.integrate)
- Optimization (scipy.optimize)
- Interpolation (scipy.interpolate)
- Fourier Transforms (scipy.fftpack)
- Signal Processing (scipy.signal)
- Linear Algebra (scipy.linalg)
- Statistics (scipy.stats)
- Multi-dimensional image processing (scipy.ndimage)
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
c1, c2 = 5.0, 2.0
noise_p = 0.05
xi = 0.1*np.arange(1,11)
yi = c1*np.exp(-xi) + c2*xi
# Adding noise to the original points
zi = yi + noise_p * np.max(yi) * np.random.randn(len(yi))
# Matrix containing the input data
A = np.hstack([np.exp(-xi)[:, np.newaxis], xi[:,np.newaxis]])
# Solving the problem
c, resid, rank, sigma = linalg.lstsq(A, zi)
# See how the learned model behaves
xi2 = np.r_[0.1:1.0:100j]
yi2 = c[0]*np.exp(-xi2) + c[1]*xi2
plt.plot(xi, zi, 'x', xi, yi, '--', xi2, yi2)
plt.axis([0, 1.1, 3.0, 5.5])
plt.xlabel('$x_i$')
plt.legend(["Noisy training data", "Training data", "Learned function"])
plt.title('Data fitting with linalg.lstsq')
plt.show()