Last updated 9/15/2025
Erich Purpur
Research Librarian for Science & Engineering
[email protected]
Brown Science & Engineering Library room I054
I'm part of a group called Research Data Services, and I do these things:
1. Serve as Liaison to various engineering and social sciences departments
2. Teach workshops and classes (like this one)
3. Help people with research projects
4. Internal Library Projects
5. Random other things as they come up
| Workshop | Date | Time |
|---|---|---|
| Intro to Python pt 1 | Tuesday 9/2 | 11:00am - 12:30pm |
| Intro to Python pt 2 | Tuesday 9/9 | 11:00am - 12:30pm |
| Python Data Analysis & Visualization | Tuesday 9/16 | 11:00am - 12:30pm |
| AI Prompt Engineering for Python | Thursday 9/18 | 1:00 - 2:30pm |
| Workshop | Date | Time |
|---|---|---|
| Using Large Language Models Locally | Thursday 9/4 | 1:00 - 2:30pm |
| AI Prompt Engineering w/ CLEAR Framework | Thursday 9/11 | 1:00 - 2:30pm |
| AI Prompt Engineering for Python | Thursday 9/18 | 1:00 - 2:30pm |
See the rest of UVA Library's workshop series
I can't think of a research project, class assignment, or other kind of analysis that doesn't involve data. Can you?
In today's workshop we will walk through the steps of using Python to read a dataset and manipulate it. This involves working the data into the shape and size we want (aka data wrangling) before looking at it in a meaningful way (i.e., visualizing the data).
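Here is a minimal sketch of the read-then-wrangle steps, assuming pandas is installed. The city names, column names, and numbers are made up for illustration, and the CSV text is held in memory so the example runs without a file on disk.

```python
import io
import pandas as pd

# Made-up data standing in for a CSV file on disk (hypothetical columns)
csv_text = io.StringIO(
    "city,year,rainfall_in\n"
    "Charlottesville,2023,45.1\n"
    "Charlottesville,2024,\n"
    "Richmond,2023,43.9\n"
    "Richmond,2024,47.2\n"
)

# Read the dataset; pd.read_csv accepts a file path or a file-like object
df = pd.read_csv(csv_text)

# Wrangle: drop rows with a missing measurement, keep only the columns we need
clean = df.dropna(subset=["rainfall_in"])[["city", "rainfall_in"]]

# Summarize: average rainfall per city
mean_rain = clean.groupby("city")["rainfall_in"].mean()
print(mean_rain)
```

With a real file you would pass its path to `pd.read_csv` instead of the in-memory text.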
Data visualization is the act of taking information (data) and placing it into a visual context, such as a map or graph. Data visualizations make data easier for the human brain to understand. You can more easily detect patterns, trends, and outliers in groups of data.
Good data visualizations should give meaning to your data and clearly communicate what is happening in your analysis. Excel has been a go-to data visualization tool for many years. Often, data visualization does not need to be fancy. As long as your audience understands your work, it is effective data visualization.
Pandas is an open source Python library providing high-performance, easy-to-use data structures and data analysis tools. We will use pandas to work with our data before feeding it into matplotlib.
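As a rough sketch of that handoff, the example below builds a small pandas DataFrame, adds a derived column, and passes two columns to matplotlib. The column names and values are invented for illustration and are not part of the workshop dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numbers, only for illustration
scores = pd.DataFrame(
    {"hours_studied": [1, 2, 3, 4, 5], "exam_score": [52, 60, 71, 78, 90]}
)

# pandas does the data work: add a derived True/False column
scores["passed"] = scores["exam_score"] >= 70

# matplotlib does the drawing; pandas columns can be passed directly as x and y
fig, ax = plt.subplots()
ax.scatter(scores["hours_studied"], scores["exam_score"])
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
```

In a Jupyter notebook the figure appears below the cell once it finishes running.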
Excerpt taken from Matplotlib's website:
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code.
From Matplotlib's Wikipedia page: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications.
From this GitHub blog:
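To make "object-oriented API" a bit more concrete, here is a small sketch assuming matplotlib is installed. The years and visit counts are made up. The point is that the plot lives in objects (a Figure and an Axes) and you build it up by calling methods on them.

```python
import matplotlib.pyplot as plt

# Made-up values, only for illustration
years = [2021, 2022, 2023, 2024]
visits = [120, 150, 170, 210]

# plt.subplots() returns a Figure (the whole canvas) and an Axes (one plot)
fig, ax = plt.subplots()

# Drawing and labeling happen through methods on the Axes object
ax.plot(years, visits, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Visits")
ax.set_title("Visits per year")
```

Keeping a handle on `fig` and `ax` is what makes it straightforward to put several plots in one figure or to adjust a single plot later.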
Matplotlib was conceived by John Hunter in 2002, originally as a patch to IPython to enable interactive MATLAB-style plotting via the IPython command line. Version 0.1 of Matplotlib was released in 2003. It became the plotting package of choice by the Space Telescope Science Institute, which financially supported Matplotlib's development at that time.
Matplotlib currently has a large, active developer base and is now widely used in the scientific Python world. But it is not the only data visualization tool available!
The ggplot Python library grew out of ggplot2, an R package. ggplot2 (in R) is generally regarded as the more sophisticated graphics tool, with more high-end functionality. It is not clear to me whether ggplot for Python includes everything ggplot2 offers in R.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. It is generally treated as an extension of matplotlib, particularly for statistical visualization.
Bokeh is different in that it does not depend on matplotlib and is geared toward generating visualizations in the web browser. It is meant to make interactive web visualizations.
There is no right or wrong answer. It depends on what you are doing, what you are familiar with, or other influences in your life. Matplotlib is a good jack-of-all-trades package for relatively basic plotting and graphing. It also integrates nicely with NumPy and pandas, two other very common scientific packages.
All these packages have large user communities and good documentation. My advice is to choose one you like and stick with it unless you find it does not have the functionality you are looking for.
Reasons to use any given data visualization package/tool in python:
- You are already familiar with it
- Your advisor/professor already likes one and you live with that decision
- You inherited code that is already using that package
- You found a code example you liked online for a specific package
A freely available software distribution containing various data science software and accompanying libraries. Inside are Spyder (IDE), RStudio, Jupyter Notebooks, JupyterLab, etc.
Another topic to cover is Jupyter Notebooks and/or JupyterLab. Jupyter Notebook is a web-based interactive computational environment for creating notebook-like documents. It supports several languages, such as Python, R, and Julia. JupyterLab is the next-generation user interface, which includes Jupyter Notebooks.
In my opinion, they seem almost exactly the same. We will be using Jupyter Notebooks today.
Honestly, you don't need to remember most of it. Here are the resources I use when looking for answers:
ChatGPT
- ChatGPT has quickly made huge changes to the programming landscape. It is a hugely powerful tool if you use it the right way! Advising new programmers on how to use ChatGPT (or other AI tools) is a somewhat slippery slope, so I will refer to some best practices. My personal opinion is that you should use AI minimally when you are starting out. Once you have a better grasp of the fundamentals, you can bring in AI and greatly increase your speed. Never accept ChatGPT code verbatim! Always double-check it before including it in your workflows.
- How to Effectively Learn to Program w/ ChatGPT
- Corey Schafer's "How to use ChatGPT"
Stack Overflow is a huge Q&A site with a large user community. Odds are very high that someone has asked your question before, so just google something like "how to make scatter plot matplotlib python". I'm pretty certain a Stack Overflow thread will be one of the first few search results.
Stack Overflow etiquette: Don't just ask questions right away. Odds are high that for widely used packages, like matplotlib, a question and answer already exist. It is good practice to use that answer (and upvote it) if you like it.
If you do ask a question, make sure it is specific and reproducible. People will downvote you and moderators will close the question if it is vague, incoherent, not reproducible, or unclear in some other way. Stack Overflow's purpose is to act as a reference guide, not as a forum to debate open-ended questions such as "what is better, matplotlib or ggplot?". Go on Reddit if you want to do that.
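For a sense of what "specific and reproducible" can look like in practice, here is a sketch of the kind of snippet you might paste into a question. The scenario (overlapping x-axis labels) and the data are hypothetical. What matters is the shape: all imports included, a tiny inline dataset instead of your own files, and only the code needed to show the problem.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small inline dataset so anyone can run this without access to my files
df = pd.DataFrame(
    {
        "department": [
            "Mechanical and Aerospace Engineering",
            "Electrical and Computer Engineering",
            "Civil and Environmental Engineering",
        ],
        "majors": [12, 7, 9],
    }
)

# The smallest amount of plotting code that still shows the problem
fig, ax = plt.subplots()
ax.bar(df["department"], df["majors"])

# Problem: the long department names overlap on the x-axis.
# Expected: labels I can read. Already tried: making the figure wider.
```

Anyone can copy, paste, and run this and see the same thing you see, which makes it much more likely you will get a useful answer.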