Direct Path Loads: Fast data ingestion with Python and Oracle Database
The fastest way to load very large datasets with Python into Oracle Database is to use python-oracledb’s direct_path_load() method. This can radically improve your ETL workflow performance.
This blog benchmarks various ways to load data into Oracle Database using Python.
The Background
ETL (“Extract, Transform, and Load”) pipelines are common in Python, especially in the world of data analysis and AI. Performance is critical, so how can you improve data loading times?
Direct Path Loading is a database feature most commonly known from SQL*Loader, but also exposed by JDBC and ODP.NET, and now by python-oracledb. It allows data being inserted into an Oracle Database table to bypass code layers such as the database buffer cache, and no INSERT statements are used. Direct Path Loads allow very fast ingestion of huge amounts of data.
Direct Path Load support was introduced in python-oracledb 3.4 with a direct_path_load() method on the connection object. It lets you pass in a list of sequences or a DataFrame. For example, with a list:
DATA = [
    (1, "First row"),
    (2, "Second row"),
    (3, "Third row"),
]
connection.direct_path_load(
    schema_name="HR",
    table_name="mytab",
    column_names=["id", "name"],
    data=DATA
)

The API is simple. You pass in the schema, table name, column names, and the data. There is only one additional argument available in this initial release: a batch_size parameter that splits the processing of the supplied data into chunks of that number of rows, letting you tune performance more easily without having to explicitly loop and make multiple calls to direct_path_load(). The method is supported only in python-oracledb Thin mode; there are no plans to add it to Thick mode.
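For illustration, here is a minimal, self-contained sketch that uses batch_size. The credentials, connect string, table shape, and chunk size are placeholders of my own, not values from this blog's benchmark:

import oracledb

# Placeholder credentials and connect string - adjust for your environment
connection = oracledb.connect(user="hr", password="your_password",
                              dsn="localhost/freepdb1")

# Some sample rows for a two-column table (id number, name varchar2)
DATA = [(i, f"Row {i}") for i in range(1, 50_001)]

# batch_size splits the supplied rows into chunks of that many rows, so the
# whole list can be passed in a single call without manual looping.
# direct_path_load() commits the loaded data, so no explicit commit() is needed.
connection.direct_path_load(
    schema_name="HR",
    table_name="mytab",
    column_names=["id", "name"],
    data=DATA,
    batch_size=10_000,
)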
The API documentation is here. The User Guide is here.
As a consequence of the specialized Database architecture, there are a few restrictions on when Direct Path Loads can be used: check the documentation on SQL*Loader Direct Path Loads and on the Oracle Call Interface Direct Path Load Interface.
The CSV file, the database table, the benchmark
I decided to measure not just insertion time, but the whole time to ingest a CSV file into an Oracle Database table. This is a task I know a lot of you are doing.
The CSV files: I used files with 100,000 lines, with 1,000,000 lines, and with 2,000,000 lines. The CSV files were like:
1,"23-Sep-2025","String for row 1"
2,"23-Sep-2025","String for row 2"
. . .

The simple code I used to create the files is here.
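As an illustration only (the real generator is the linked script), a minimal sketch that writes a file in this format could look like:

# Hypothetical generator sketch - the blog's actual script is linked above
NUM_ROWS = 100_000

with open("sample.csv", "w") as f:
    for i in range(1, NUM_ROWS + 1):
        f.write(f'{i},"23-Sep-2025","String for row {i}"\n')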
The table: The database table was created like:
create table mytab (id number, dt date, name varchar2(50));

The code: Each test read the whole file into Python memory, and wrote it to the database in one go. My code was actually structured to allow loading in smaller batches so I could experiment with streaming data. In a real-world scenario you might have too much data to hold in memory, or to efficiently send across the network in one operation. But, for this blog, I kept it simple and did everything in one shot. Wherever you see BATCH_SIZE or BLOCK_SIZE in the code snippets, assume it is set to a value large enough to read and insert all rows at once.
The complete code can be found here.
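For reference, here is a sketch of how those two settings could be chosen for the "one shot" runs; the names and values are my assumption, not taken from the linked code:

import os

NUM_ROWS = 2_000_000                        # rows in the CSV file being loaded
BATCH_SIZE = NUM_ROWS                       # rows per executemany()/direct_path_load() call
BLOCK_SIZE = os.path.getsize("sample.csv")  # bytes per PyArrow read block, i.e. the whole file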
The data ingestion choices
There are three ways I used to read data from a CSV file into Python:
- With the Pandas read_csv() method which reads into a DataFrame
- With the standard Python library csv module which returns each row as a list of strings
- With PyArrow’s CSV methods which read into a PyArrow Table interoperable with a DataFrame
After data was in Python memory, there were five ways to insert into the database:
- Calling the Pandas to_sql() method
- Passing a list to python-oracledb’s executemany() method
- Passing a DataFrame to python-oracledb’s executemany() method
- Passing a list to python-oracledb’s direct_path_load() method
- Passing a DataFrame to python-oracledb’s direct_path_load() method
Obviously, not all combinations are possible.
The Solutions
I tried five solutions, which are listed here in slowest-to-fastest order. The complete code can be found here. The results are shown at the bottom.
1. Loading CSV Data with Pandas
Though simple to code, using Pandas to load data is the slowest solution — and not just because of extra initial overhead checking the schema. So unless you need specific Pandas data loading functionality, or performance really isn’t important, you should not use it.
Code is like:
csv_reader = pandas.read_csv(
    "sample.csv",
    header=None,
    names=["id", "dt", "name"],
    parse_dates=['dt'],
    chunksize=BATCH_SIZE)

for d in csv_reader:
    d.to_sql("mytab", engine, if_exists='append', index=False)

To be roughly comparable with other solutions, I pre-created the table, so I used the append mode of Pandas.
With large data sets, the overhead Pandas incurs checking the schema before inserting data is relatively lower, but it is still a factor if ultimate speed is your goal.
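The to_sql() snippet above assumes a SQLAlchemy engine has already been created. A minimal sketch of doing that with SQLAlchemy's python-oracledb dialect, using placeholder credentials, might be:

import sqlalchemy

# Placeholder credentials and connect string - adjust for your environment
engine = sqlalchemy.create_engine(
    "oracle+oracledb://:@",
    connect_args={"user": "hr", "password": "your_password",
                  "dsn": "localhost/freepdb1"},
)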
2. Using Python’s csv module to read into a list, and executemany() to insert
Using the standard Python csv module and calling Oracle’s efficient executemany() method used to be the “best-practice” solution but now comes in as second slowest.
Code is like:
cursor.setinputsizes(None, None, 50)
sql = "insert into mytab (id, dt, name) values (:1, :2, :3)"
data = []
csv_reader = csv.reader(open("sample.csv", "r"), delimiter=",")
for line in csv_reader:
    data.append((float(line[0]), datetime.strptime(line[1], "%d-%b-%Y"), line[2]))
    if len(data) % BATCH_SIZE == 0:
        cursor.executemany(sql, data)
        data = []
if data:
    cursor.executemany(sql, data)
connection.commit()

I carefully used setinputsizes() so that python-oracledb’s executemany() call knew how much memory to allocate for each of the three fields / bind variables and didn’t have to do slow re-allocations as more data was parsed. The first column is numeric, so I passed None to use the default type handling. The second bind variable of the call is also None since the default python-oracledb date handling knows the size of dates. The third CSV field is a string, so I chose the column’s maximum size.
The dates are stored in the CSV file as strings “23-Sep-2025”. My code converts them to datetime objects for insertion. Without this explicit conversion, performance can be slow due to the type mismatch with the database type. See my earlier blog Application and database type mismatches slow down data loads.
3. Using PyArrow’s CSV loader and passing the DataFrame to executemany()
Using PyArrow’s CSV loader has some type handling efficiencies, so performance of this solution was another step better for me.
Code is like:
sql = "insert into mytab (id, dt, name) values (:1, :2, :3)"
read_options = pyarrow.csv.ReadOptions(
column_names=["id", "dt", "name"],
block_size=BLOCK_SIZE
)
convert_options = pyarrow.csv.ConvertOptions(
timestamp_parsers=["%d-%b-%Y"],
column_types={
"id":pyarrow.int64(),
"dt":pyarrow.timestamp("us"),
"name":pyarrow.string()
}
)
csv_reader = pyarrow.csv.open_csv(
"sample.csv",
read_options=read_options,
convert_options=convert_options)
for df in csv_reader:
if df is None:
break
cursor.executemany(sql, df)
connection.commit()For the read options, I set the column names since my CSV file didn’t have a header row.
When reading in batches, PyArrow takes a buffer size, not a row count. As previously noted, I set this to a size so that all the data was read and inserted in one go.
Since executemany() can take a DataFrame, I didn't need to do any conversion of the data to a list. This resulted in compact code iterating over the data.
4. Using Python’s csv module to read into a list, and direct_path_load() to insert
Now the fun begins: Using the standard Python csv module to construct a list, and then passing that to the new direct_path_load() method is getting speedy.
Code is like:
data = []
csv_reader = csv.reader(open("sample.csv", "r"), delimiter=",")
for line in csv_reader:
    data.append((float(line[0]), datetime.strptime(line[1], "%d-%b-%Y"), line[2]))
    if len(data) % BATCH_SIZE == 0:
        connection.direct_path_load(
            schema_name="HR",
            table_name="mytab",
            column_names=["id", "dt", "name"],
            data=data)
        data = []
if data:
    connection.direct_path_load(
        schema_name="HR",
        table_name="mytab",
        column_names=["id", "dt", "name"],
        data=data)

The code is very similar to solution #2, but calls direct_path_load() instead of executemany().
There is no explicit commit() call here, since direct_path_load() will commit the data.
5. Using PyArrow’s CSV loader and passing the DataFrame to direct_path_load()
The new best-practice solution takes advantage of the PyArrow CSV loader to read into a DataFrame format, and then passes that DataFrame to direct_path_load().
Code is like:
read_options = pyarrow.csv.ReadOptions(
    column_names=["id", "dt", "name"],
    block_size=BLOCK_SIZE
)
convert_options = pyarrow.csv.ConvertOptions(
    timestamp_parsers=["%d-%b-%Y"],
    column_types={
        "id": pyarrow.int64(),
        "dt": pyarrow.timestamp("us"),
        "name": pyarrow.string()
    }
)
csv_reader = pyarrow.csv.open_csv(
    "sample.csv", read_options=read_options, convert_options=convert_options)

for df in csv_reader:
    if df is None:
        break
    connection.direct_path_load(
        schema_name="HR",
        table_name="mytab",
        column_names=["id", "dt", "name"],
        data=df)

The code is similar to solution #3, but calls direct_path_load() instead of executemany().
For the read options, I set the column names since my CSV file didn’t have a header row.
When reading in batches, PyArrow takes a buffer size, not a row count. As previously noted, I set this to the maximum size so that all the data was read and inserted in one go.
Since direct_path_load() can take a DataFrame, I didn't need to do any conversion of the data to a list. This resulted in compact code iterating over the data.
There is no explicit commit() call here, since direct_path_load() will commit the data.
Results in a Picture
Here are my results. Less is better. The numbers are averages over a few runs. In each group, the bars represent the solutions in the order described above.
For the three file sizes, loading with Pandas (solution #1) was slowest, while loading with PyArrow’s csv module and calling direct_path_load() (solution #5) was the fastest. The benefit increased as the data size increased. For my 2,000,000 row file, Direct Path Loading was 4 times faster than using Pandas, and 3 times faster than solution #2, our old recommendation.
On the numbers themselves, I am using an x86_64 version 23 database, emulating the architecture on an arm64 Mac, so it is inherently slow. I was also running Python on the same Mac. Your results will vary for all the normal reasons, so do your own benchmarking. Those reasons include how slow your machine is, the network speed, how much other work your database is doing, how much data you are loading, and what data types are involved. When testing, you may also want to measure the impact on the database itself. Even if the elapsed time of direct_path_load() is slower than executemany() for a particular data set (e.g. a small one), there may still be a reduced impact on the database, which may improve overall system performance.
Summary
The fastest way to load very large datasets into Oracle Database with Python is to use direct_path_load(). When loading a CSV file, you can also take advantage of PyArrow’s CSV methods to read into a DataFrame format. Direct Path Loading performance will vary with many factors, including the data types involved. Direct Path Loads are fast but have a few database restrictions, so review the database documentation before assuming you can magically drop them into an existing ETL pipeline. Unlike with executemany(), there is currently no way to filter out noisy data, so make sure your data is clean before trying to load it.
When evaluating which Oracle Database data loading solution to use, don’t forget to check out Oracle’s specialized SQL*Loader tool, and also External Tables.
Let us know what you think of Direct Path Loading. If you want to try (or improve) the five solutions shown in this blog, the code is here.
Thanks for using python-oracledb.
Python-oracledb Resources
Python-oracledb is an open source package implementing the Python Database API specification, with many additions to support advanced Oracle Database features. By default, it is a ‘Thin’ driver that is immediately usable without needing any additional install, e.g. no Instant Client is required. Python-oracledb is used by frameworks, ORMs, SQL generation libraries, and other projects.

