5 Python Libraries Every Data Engineer Should Know

From web scraping to AWS integration, these five Python libraries help data engineers at every skill level.

Dec 6th, 2024 5:00am by Jack Wallen

Featued image for: 5 Python Libraries Every Data Engineer Should Know

Featured image via Unsplash+.

Data is everywhere and has become crucial to businesses and developers all over the globe. One language that does exceptionally well with data is Python. Every data scientist knows this, and often has to depend on Python to get the job done.

Out of the box, Python has plenty of core features, but every serious data engineer knows that third-party libraries are a must to get the most out of the data your business has collected.

You’ll find libraries that are useful for data that cover various use cases and project needs, including data flow and pipelines, data analysis, cloud libraries, big data libraries, data parsing, machine learning and much more.

But which should you use? Let’s start with those that are best suited for beginners and work our way up to more advanced libraries.

Libraries for Beginners

Let’s first talk about libraries that are best suited for beginners who are just starting their journey with data engineering and Python.

Beautiful Soup 4

If you need to scrape information from websites, then Beautiful Soup 4 is the library you want. The official description of Beautiful Soup 4 is “a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching and modifying the parse tree.”

The steps for web scraping look like this:

Your application sends an HTTP request to the URL of the webpage you want to scrape.
The target server returns the HTML content of the webpage.
A parser (such as html5lib) is used to create a nested-tree structure of the HTML data.
BeautifulSoup then traverses the parse tree to extract the required data.

The data can then be extracted with Beautiful Soup 4 like this:

Requests

Another library for beginners is Requests, which is a simple, elegant HTTP library that allows you to send HTTP/1.1 requests without the need to add manual query strings to URLs. Request is a great library for retrieving data from RESTful APIs, fetch web pages for scraping, sending data to server endpoints and more. Requests provides a user-friendly API for making HTTP requests; supports HTTP methods such as GET, POST, PUT and DELETE; handles authentication, cookies and sessions; and supports SSL verification, timeouts and connection pooling.

The requests library can be employed very simply. Here’s an example:

Libraries for Intermediate Data Engineers

Let’s now take a look at some libraries for intermediate data engineers.

Airflow

Apache Airflow is a library that is used to author, schedule and monitor batch-oriented workflows. This library is a powerful tool for managing workflows in data engineering to make it possible for users to automate and monitor data pipelines effectively, and connect with virtually any technology. Airflow can be run on POSIX-compliant operating systems, and is regularly tested on modern Linux distributions and more recent releases of macOS.

Airflow includes a web interface to help manage the state of your workflows. It can be deployed in many ways, from a single process on a single machine to a distributed setup for very large workflows.

The airflow library in code looks something like this:

The above does the following:

Imports the necessary libraries.
Defines a simple function, print_hello(), to print a message.
Creates a dictionary of the following default arguments: the owner of the DAG, whether it depends on past runs, the start date and the number of retries in case of failure.
Creates a DAG instance named hello_airflow to run daily.
Defines three tasks: start_task (A dummy task to indicate the start of the workflow), hello_task (calls the print_hello function) and end_task (another dummy task to indicate the end of the workflow).
Chains the tasks together using the >> operator so that start_task runs first, followed by hello_task, and finally end_task.

Boto3

If you need to integrate your Python app with Amazon S3, EC2, Amazon DynamoDB or Amazon Lambda, you’re going to need Boto3, which is the AWS software development kit for Python.

Boto3 makes it possible to leverage AWS services in Python applications so you can easily build and manage cloud-based solutions.

Features of Boto3 include:

Includes two main interfaces: Resource API (a high-level abstraction for working with AWS services in a more typical Pythonic way) and client API (a low-level interface that provides direct access to AWS service APIs).
Simple configuration: Boto3 simplifies the process of configuring AWS credentials and settings.
Comprehensive documentation: You’ll find tons of documentation to hep guide you through the installation, configuration and usage of Boto3.
Community support: There’s a large community with plenty of resources, tutorials and examples online.

Here’s an example of how the boto3 library is used in Python code:

Libraries for Advanced Data Engineers

Pandas

Pandas is one of the most popular data manipulation and analysis libraries available. Pandas supports reading and writing data in several formats (such as CSV, Excel, SQL and more), and includes functions for filtering, grouping, merging and reshaping data.

Although basic Pandas usage can be employed by beginner and intermediate users, to really get the most out of this library, you’ll need to have a more advanced understanding of the language.

Features of Pandas include:

Data structures: Series (one-dimensional labeled array capable of holding any data type) and DataFrames (two-dimensional labeled data structures with columns that can be of different types; similar to a spreadsheet or SQL table).
Data manipulation: Data cleaning (for handling missing data, filtering and transforming datasets) and data transformation (for reshaping and pivoting datasets, meringue, and joining data from different sources).
Data analysis: Statistical functions (built-in methods for performing statistical operations, such as mean, median and standard deviation) and group-by operations (the ability to group data and perform aggregate functions on the new groups).
Data input/output: File handling (read from and write to various file formats, such as CSV, Excel, JSON and SQL databases).
Time series analysis: Date and time functions (specialized functions for working with time series data, such as date range generation and frequency conversion).
Integration with visualization libraries: Although Pandas isn’t a visualization library, it can integrate with libraries such as Matplotlib and Seaborn for data plotting.

An example of the pandas library in Python code might look something like this:

And there you have it — five Python libraries every data engineer should know. Yes, there are plenty more, but these five should serve as a solid launching point.

Jack Wallen is what happens when a Gen Xer mind-melds with present-day snark. Jack is a seeker of truth and a writer of words with a quantum mechanical pencil and a disjointed beat of sound and soul. Although he resides...