
Tuesday, May 28, 2013

Batch Writing, and Dynamic vs Parametrized SQL, how well does your database perform?

One of the most effective database optimizations is batch writing. Batch writing is supported by most modern databases, is part of the JDBC standard, and is supported by most JPA providers.

Normal database access consists of sending each DML (insert, update, delete) statement to the database in a separate database/network access. Each database access has a certain amount of overhead to it, and the database must process each statement independently. Batch writing has two forms, dynamic and parametrized. Parametrized is the most common, and normally provides the best benefit, as dynamic can have parsing issues.

To understand batch writing, you must first understand parametrized SQL. SQL execution is composed of two parts, the parse and the execute. The parse consists of turning the string SQL representation into the database representation. The execute consists of executing the parsed SQL on the database. Databases and JDBC support bind parameters, so that the arguments to the SQL (the data) do not have to be embedded in the SQL. This avoids the cost of converting the data into text, and allows the same SQL statement to be reused across multiple executions. This allows for a single parse and multiple executes, aka "parametrized SQL". Most JDBC DataSource implementations and JPA providers support parametrized SQL and statement caching, which effectively avoids ever having a parse in a running application.

Example dynamic SQL

INSERT INTO EMPLOYEE (ID, NAME) VALUES (34567, 'Bob Smith')

Example parametrized SQL

INSERT INTO EMPLOYEE (ID, NAME) VALUES (?, ?)

Parametrized batch writing involves executing a single DML statement, but with a set of bind parameters for multiple homogenous statements, instead of bind parameters for a single statement. This effectively allows a large batch of homogenous inserts, updates, or deletes to be processed by the database and network as a single operation, instead of n operations. The database only needs to perform the minimal amount of work, as there is only a single statement, so at most only a single parse. It is also compatible with statement caching, so no statement parsing needs to occur at all. The limitation is that all of the statements' SQL must be identical. So, it works really well for, say, inserting 1,000 Orders, as the insert SQL is the same for each Order, only the bind parameters differ. But it does not help for inserting 1 Order, or for inserting 1 Order, 1 OrderLine, and 1 Customer. Also, all of the statements must be part of the same database transaction.

Dynamic batch writing involves chaining a bunch of heterogeneous dynamic SQL statements into a single block, and sending the entire block to the database in a single database/network access. This is beneficial in that there is only a single network access, so if the database is remote or across a slow network, this can make a big difference. The drawback is that parameter binding is not allowed, and the database must parse this huge block of SQL when it receives it. In some cases the parsing costs can outweigh the network benefits. Also, dynamic SQL is not compatible with statement caching, as each SQL statement is different.

JDBC standardizes batch writing through its Statement and PreparedStatement batch APIs (as of JDBC 2.0, which was JDK 1.2, aka a long time ago). The JDBC batch API requires different JDBC code, so if you are using raw JDBC, you need to rewrite your code to switch between batching and non-batching APIs. Most JDBC drivers now support these APIs, but some do not actually send the DML to the database as a batch, they just emulate the APIs. So how do you know if you are really getting batch writing? The only real way is to test it, and measure the performance difference.
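In raw JDBC, parametrized batch writing looks roughly like the following sketch, using the EMPLOYEE insert from the example above (the method signature and parallel arrays are just for illustration, and error handling is minimal):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsertExample {
    // Insert a batch of employee rows using the JDBC 2.0 batch API.
    public static void insertEmployees(Connection connection, long[] ids, String[] names)
            throws SQLException {
        connection.setAutoCommit(false);
        // One parametrized statement, reused for every row.
        PreparedStatement statement = connection.prepareStatement(
                "INSERT INTO EMPLOYEE (ID, NAME) VALUES (?, ?)");
        try {
            for (int i = 0; i < ids.length; i++) {
                statement.setLong(1, ids[i]);
                statement.setString(2, names[i]);
                statement.addBatch(); // queue the row instead of executing it now
            }
            statement.executeBatch(); // send all queued rows as a single batch
            connection.commit();
        } finally {
            statement.close();
        }
    }
}

Whether executeBatch() really results in a single database/network access is up to the driver, which is exactly why it needs to be measured.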

The JPA specification does not standardize batch writing configuration, but most JPA providers support it. Normally batch writing is enabled in JPA through persistence unit properties, so turning it on or off is a simple matter of configuration, and requires no coding changes. Some JPA providers may not support batch writing when using optimistic locking, and may not re-order SQL to enable it to be batched, so even with batch writing enabled, you may still not be getting batching writing. Always test your application with batch writing on and off, and measure the difference to ensure it is actually functioning.

EclipseLink supports both parametrized and dynamic batch writing (since EclipseLink 1.0). In EclipseLink, batch writing is enabled through the "eclipselink.jdbc.batch-writing" persistence unit property. EclipseLink provides three options, "JDBC", "Buffered", and "Oracle-JDBC". The "JDBC" option should always be used.

"Buffered" is for JDBC drivers that do not support batch writing, and chains dynamic SQL statements into a single block itself. "Buffered" does not support parametrized SQL, and is not recommended.

"Oracle-JDBC" uses the Oracle database JDBC API that predates the JDBC standard API, and is now obsolete. Previous to EclipseLink 2.5, this option allowed batch writing when using optimistic locking, but now the regular "JDBC" option supports optimistic locking.

EclipseLink 2.5 supports batch writing with optimistic locking on all (compliant) database platforms, whereas previously it was only supported on selected database platforms. EclipseLink 2.5 also provides an "eclipselink.jdbc.batch-writing" query hint to disable batch writing for native queries that cannot be batched (such as DDL or stored procedures on some database platforms).

EclipseLink supports parametrized SQL through the "eclipselink.jdbc.bind-parameters" and "eclipselink.jdbc.cache-statements" persistence unit properties. However, these don't normally need to be set, as parameter binding is the default, so you would only set the property to disable binding. Statement caching is not on by default, and is only relevant to EclipseLink if you are using EclipseLink's connection pooling; if you are using a JDBC or Java EE DataSource, then you must configure statement caching in your DataSource configuration.

When batch writing is enabled in EclipseLink, by default it is parametrized batch writing. To enable dynamic batch writing, you must disable parameter binding. The same applies to buffered batch writing.

Supporting batch writing is not incredibly difficult, and most JPA providers support it; ordering the SQL such that it can be batched is the difficult part. During a commit or flush operation, EclipseLink automatically groups SQL by table to ensure homogenous SQL statements can be batched (while still maintaining referential integrity constraints and avoiding deadlocks). Most JPA providers do not do this, so even if they support batch writing, a lot of the time the SQL does not benefit from batching.

To enable batch writing in EclipseLink, add the following persistence unit property:

"eclipselink.jdbc.batch-writing"="JDBC"
You can also configure the batch size using the "eclipselink.jdbc.batch-writing.size" persistence unit property. The default size is 100.
"eclipselink.jdbc.batch-writing.size"="1000"

Batch writing is very database and JDBC driver dependent, so I was interested in which databases and drivers it worked with, and what the benefit was. I made two tests, one that does a batch of 50 inserts, and one that does a batch of 100 updates (using optimistic locking). I tried all of the batch writing options, as well as not using any batching.

Note, this is not a database benchmark, I am not comparing the databases between each other, only to themselves.

Each database is running on different hardware, some are local, and some are across a network, so do not compare one database to another. The data of interest is the percentage benefit enabling batch writing has over not using batch writing. For the insert test I also measured the difference between using parametrized versus dynamic SQL, and parametrized SQL without statement caching. The result is the number of transactions processed in 10 seconds (run 5 times, and averaged), so a bigger number is a better result.

Database: MySQL Version: 5.5.16
Driver: MySQL-AB JDBC Driver Version: mysql-connector-java-5.1.22

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 483            | 0%
dynamic-sql, no batch                   | 499            | 3%
parametrized-sql, no statement caching  | 478            | -1%
dynamic-sql, batch                      | 499            | 3%
parametrized-sql, batch                 | 509            | 5%

Update Test

Option                                  | Average Result | % Difference from non batched
parametrized-sql                        | 245            | 0%
dynamic-sql, batch                      | 244            | 0%
parametrized-sql, batch                 | 248            | 1%

So the results seem to indicate batch writing has no effect whatsoever (5% is within the variance). What this really means is that the MySQL JDBC driver does not actually use batch processing, it just emulates the JDBC batch APIs and executes statements one by one underneath.

MySQL does have batch processing support though, it just requires different SQL. The MySQL JDBC driver does support this, but requires the rewriteBatchedStatements=true JDBC connect property to be set. This can easily be set by modifying your connect URL, such as:

jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true

MySQL: rewriteBatchedStatements=true

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 504            | 0%
dynamic-sql, no batch                   | 508            | 0%
parametrized-sql, no statement caching  | 483            | -4%
dynamic-sql, batch                      | 1292           | 156%
parametrized-sql, batch                 | 2181           | 332%

Update Test

Option                                  | Average Result | % Difference from non batched
parametrized-sql                        | 250            | 0%
dynamic-sql, batch                      | 669            | 167%
parametrized-sql, batch                 | 699            | 179%

So, it appears batch writing does make a big difference in MySQL, if configured correctly (why the JDBC driver does not do this by default, I have no idea). Parametrized batch writing does the best, being 332% faster for inserts, and 179% faster for updates. Dynamic batch writing also performs quite well. Interestingly, there appears to be little difference between dynamic and parametrized SQL on MySQL (my guess is either MySQL is really fast at parsing, or does little optimization for prepared statements).

PostgreSQL Version: 9.1.1
PostgreSQL 8.4 JDBC4

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 479            | 0%
dynamic-sql, no batch                   | 418            | -12%
parametrized-sql, no statement caching  | 428            | -10%
dynamic-sql, buffered                   | 1127           | 135%
dynamic-sql, batch                      | 1127           | 135%
parametrized-sql, batch                 | 2037           | 325%

Update Test

Option                                  | Average Result | % Difference from non batched
parametrized-sql                        | 233            | 0%
dynamic-sql, batch                      | 395            | 69%
parametrized-sql, batch                 | 707            | 203%

The results show batch writing makes a big difference on PostgreSQL. Parametrized batch writing performs the best, being 325% faster for inserts, and 203% faster for updates. Dynamic batch writing also performs quite well. For PostgreSQL I also measured the performance of EclipseLink's buffered batch writing, which performs the same as dynamic JDBC batch writing, so I assume the driver is doing the same thing. Parametrized SQL outperforms dynamic SQL by about 10%, but parametrized SQL without statement caching performs similar to dynamic SQL.

Oracle Database 11g Enterprise Edition Release 11.1.0.7.0
Oracle JDBC driver Version: 11.2.0.2.0

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 548            | 0%
dynamic-sql, no batch                   | 494            | -9%
parametrized-sql, no statement caching  | 452            | -17%
dynamic-sql, buffered                   | 383            | -30%
dynamic-sql, batch                      | 489            | -10%
parametrized-sql, batch                 | 3308           | 503%

Update Test

Option                                  | Average Result | % Difference from non batched
parametrized-sql                        | 282            | 0%
dynamic-sql, batch                      | 258            | -8%
parametrized-sql, batch                 | 1672           | 492%

The results show parametrized batch writing makes a big difference on Oracle, being 503% faster for inserts, and 492% faster for updates. Dynamic batch writing does not provide any benefit, because Oracle's JDBC driver just emulates dynamic batch writing and executes statements one by one, so it has the same performance as dynamic SQL. Buffered batch writing actually has worse performance than not batching at all, because of the parsing cost of the huge block of dynamic SQL. This may vary in different configurations; if the database is remote or across a slow network, I have seen it provide a benefit.

Parametrized SQL with statement caching provides about a 10% benefit over dynamic SQL, and points out that to benefit from parametrized SQL you need to use statement caching, otherwise the performance can be worse than dynamic SQL. Of course there are other benefits to parametrized SQL, as it removes CPU processing from the server, which may not help much in this single-threaded case, but can make a huge difference in a multi-threaded case where the database is a bottleneck.

Apache Derby Version: 10.9.1.0 - (1344872)
Apache Derby Embedded JDBC Driver Version: 10.9.1.0 - (1344872)
(local)

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 3027           | 0%
dynamic-sql, no batch                   | 24             | -99%
parametrized-sql, no statement caching  | 50             | -98%
dynamic-sql, batch                      | 24             | -99%
parametrized-sql, batch                 | 3252           | 7%

Update Test

Option                                  | Average Result | % Difference from non batched
parametrized-sql                        | 1437           | 0%
dynamic-sql, batch                      | 6              | -99%
parametrized-sql, batch                 | 2172           | 51%

The results show parametrized batch writing makes a difference on Derby, being 7% faster for inserts, and 51% faster for updates. The difference is not as large as on the other databases because my database was local. For a networked database, it would be a bigger difference, but this does show that batch writing can provide a benefit even for local databases, so it is not just a network optimization. The really interesting results from Derby are the horrible performance of the dynamic and non-cached statements. This shows that Derby has a huge parsing cost, so if you are using Derby, using parametrized SQL with statement caching is really important.

DB2/NT64 Version: SQL09070
IBM Data Server Driver for JDBC and SQLJ Version: 4.0.100

The results are basically similar to Oracle, in that parametrized batch writing gives a big performance benefit. Dynamic batch writing has worse performance than no batching with parametrized SQL, and dynamic SQL and parametrized SQL without statement caching also result in worse performance.

Microsoft SQL Server Version: 10.50.1617
Microsoft SQL Server JDBC Driver 2.0 Version: 2.0.1803.100

The results were similar to PostgreSQL, showing both parametrized and dynamic batch writing providing a significant benefit. Parametrized batch writing performed the best, and parametrized SQL outperformed both dynamic SQL and parametrized SQL without statement caching.

** UPDATE **

It was requested that I also test H2 and HSQL, so here are the results.

Database: H2 Version: 1.3.167 (2012-05-23)
Driver: H2 JDBC Driver Version: 1.3.167 (2012-05-23)
(local)

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 4757           | 0%
dynamic-sql, no batch                   | 3210           | -32%
parametrized-sql, no statement caching  | 4757           | 0%
dynamic-sql, buffered                   | 1935           | -59%
dynamic-sql, batch                      | 3293           | -30%
parametrized-sql, batch                 | 5753           | 20%

The results show H2 performs 20% faster with parametrized batch writing. H2 is an in-memory database (backed by a persistent log file), so it is not expected to benefit as much, as there is no network involved. Dynamic batch writing and dynamic SQL perform worse than parametrized SQL. Interestingly, using statement caching with parametrized SQL makes no difference. My assumption is that H2 is always caching prepared statements in its connection, so the user does not need to do their own statement caching.

Database: HSQL Database Engine Version: 1.8.1
Driver: HSQL Database Engine Driver Version: 1.8.1
(local)

Insert Test
Option                                  | Average Result | % Difference from non batched
parametrized-sql, no batch              | 7319           | 0%
dynamic-sql, no batch                   | 5054           | -30%
parametrized-sql, no statement caching  | 6776           | -7%
dynamic-sql, batch                      | 5500           | -24%
parametrized-sql, batch                 | 9176           | 25%

The results show HSQL performs 25% faster with parametrized batch writing. HSQL is an in-memory database (backed by a persistent log file), so it is not expected to benefit as much, as there is no network involved. Dynamic batch writing and dynamic SQL perform worse than parametrized SQL.

Thursday, June 9, 2011

How to improve JPA performance by 1,825%

The Java Persistence API (JPA) provides a rich persistence architecture. JPA hides much of the low-level drudgery of database access, freeing the application developer from worrying about the database, and allowing them to concentrate on developing the application. However, this abstraction can lead to poor performance, if the application programmer does not consider how their implementation affects database usage.

JPA provides several optimization features and techniques, and some pitfalls waiting to snag the unwary developer. Most JPA providers also provide a plethora of additional optimization features and options. In this blog entry I will explore the various optimization options and techniques, and a few of the common pitfalls.

The application is a simulated database migration from a MySQL database to an Oracle database. Perhaps there are more optimal ways to migrate a database, but it is surprising how good JPA's performance can be, even when processing hundreds of thousands or even millions of records. Perhaps it is not a straightforward migration, or the application's business logic is required, or perhaps the application has already been persisted through JPA, so using JPA to migrate the database is just the easiest option. Regardless, this fictitious use case is a useful demonstration of how to achieve good performance with JPA.

The application consists of an Order processing database. The model contains a Customer, Order and OrderLine. The application reads all of the Orders from one database, and persists them to the second database. The source code for the example can be found here.

The initial code for the migration is pretty simple:

EntityManagerFactory emf = Persistence.createEntityManagerFactory("order");
EntityManager em = emf.createEntityManager();
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
Query query = emOld.createQuery("Select o from Order o");
List orders = query.getResultList();
em.getTransaction().begin();
// Reset old Ids, so they are assigned from the new database.
for (Order order : orders) {
    order.setId(0);
    order.getCustomer().setId(0);
}
for (Order order : orders) {
    em.persist(order);
    for (OrderLine orderLine : order.getOrderLines()) {
        em.persist(orderLine);
    }
}
em.getTransaction().commit();
em.close();
emOld.close();
emf.close();     
emfOld.close();

The example test runs this migration using 3 variables for the number of Customers, Orders per Customer, and OrderLines per Order. So, 1000 customers, each with 10 orders, and each with 10 order lines, would be 111,000 objects.

The test was run on a virtualized 64 bit Oracle Sun server with 4 virtual cores and 8 gigs of RAM. The databases run on similar machines. The test is single threaded, running in Oracle Sun JDK 1.6. The tests are run using EclipseLink JPA 2.3, and migrating from a MySQL database to an Oracle database.

This code functions fine for a small database migration. But as the database size grows, some issues become apparent. It actually handles 100,000 objects surprisingly well, taking about 2 minutes. This is surprisingly good, given it is thoroughly unoptimized and persisting all 100,000 objects in a single persistence context and transaction.

Optimization #1 - Agent

EclipseLink implements LAZY for OneToOne and ManyToOne relationships using byte code weaving. EclipseLink also uses weaving to perform many other optimizations, such as change tracking and fetch groups. The JPA specification provides the hooks for weaving in EJB 3 compliant application servers, but in Java SE or other application servers weaving is not performed by default. To enable EclipseLink weaving in Java SE for this example the EclipseLink agent is used. This is done using the Java -javaagent:eclipselink.jar option. If dynamic weaving is unavailable in your environment, another option is to use static weaving, for which EclipseLink provides an ant task and command line utility.
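For example, launching the migration in Java SE with dynamic weaving enabled would look something like this (the jar and main class names are only placeholders for this sketch):

java -javaagent:eclipselink.jar -classpath orders.jar example.Migration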

Optimization #2 - Pagination

In theory, at some point you should run out of memory by bringing the entire database into memory in a single persistence context. So next I increased the size to 1 million objects, and this gave the expected out of memory error. Interestingly, this was with only using a heap size of 512 meg. If I had used the entire 8 gigs of RAM, I could, in theory, have persisted around 16 million objects in a single persistence context. If I gave the virtualized machine the full 98 gigs of RAM available on the server, perhaps it would even be possible to persist 100 million objects. Perhaps we are beyond the day when it does not make sense to pull an entire database into RAM, and perhaps this is no longer such a crazy thing to do. But, for now, let's assume it is an idiotic thing to do, so how can we avoid this?

JPA provides a pagination feature that allows a subset of a query to be read. This is supported in JPA through the Query setFirstResult and setMaxResults APIs. So instead of reading the entire database in one query, the objects will be read page by page, and each page will be persisted in its own persistence context and transaction. This avoids ever having to read the entire database, and also should, in theory, make the persistence context more optimized by reducing the number of objects it needs to process together.

Switching to using pagination is relatively easy to do for the original orders query, but some issues crop up with the relationship to Customer. Since orders can share the same customer, it is important that each order does not insert a new customer, but uses the existing customer. If the customer for the order was already persisted on a previous page, then the existing one must be used. This requires the usage of a query to find the matching customer in the new database, which introduces some performance issues we will discuss later.

The updated code for the migration using pagination is:

EntityManagerFactory emf = Persistence.createEntityManagerFactory("order");
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
Query query = emOld.createQuery("Select o from Order o order by o.id");
int pageSize = 500;
int firstResult = 0;
query.setFirstResult(firstResult);
query.setMaxResults(pageSize);
List orders = query.getResultList();
boolean done = false;
while (!done) {
    if (orders.size() < pageSize) {
        done = true;
    }
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    Query customerQuery = em.createNamedQuery("findCustomByName");
    // Reset old Ids, so they are assigned from the new database.
    for (Order order : orders) {
        order.setId(0);
        customerQuery.setParameter("name", order.getCustomer().getName());
        try {
            Customer customer = (Customer)customerQuery.getSingleResult();
            order.setCustomer(customer);
        } catch (NoResultException notPersistedYet) {
            // Customer does not yet exist, so null out id to have it persisted.
            order.getCustomer().setId(0);
        }
    }
    for (Order order : orders) {
        em.persist(order);
        for (OrderLine orderLine : order.getOrderLines()) {
            em.persist(orderLine);
        }
    }
    em.getTransaction().commit();
    em.close();
    firstResult = firstResult + pageSize;
    query.setFirstResult(firstResult);
    if (!done) {
        orders = query.getResultList();
    }
}
emOld.close();
emf.close();     
emfOld.close();

Optimization #3 - Query Cache

This will introduce a lot of queries for customer by name (10,000 to be exact), one for each order. This is not very efficient, and can be improved through caching. In EclipseLink there is both an object cache and a query cache. The object cache is enabled by default, but objects are only cached by Id, so this does not help us on the query using the customer's name. So, we can enable a query cache for this query. A query cache is specific to the query, and caches the query results keyed on the query name and its parameters. A query cache is enabled in EclipseLink through using the query hint "eclipselink.query-results-cache"="true". This should be set where the query is defined, in this case in the orm.xml. This will reduce the number of queries for customer to 1,000, which is much better.
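In the orm.xml the named query with the query cache hint looks like this (the same definition appears in the full orm.xml at the end of this post):

<named-query name="findCustomByName">
    <query>Select c from Customer c where c.name = :name</query>
    <hint name="eclipselink.query-results-cache" value="true"/>
</named-query>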

There are other solutions to using the query cache. EclipseLink also supports in-memory querying. In-memory querying means evaluating the query against the objects in the object cache, instead of accessing the database. In-memory querying is enabled through the query hint "eclipselink.cache-usage"="CheckCacheOnly". If you enabled a full cache on customer, then as you persisted the orders all of the existing customers would be in the cache, and you would never need to access the database. Another manual solution is to maintain a Map in the migration code keying the new customers by name. For all of the above solutions, if the cache is a fixed size (the query cache defaults to a size of 100), you would never need all of the customers in memory at the same time, so there would be no memory issues.
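As a sketch, the in-memory querying alternative would set the cache-usage hint on the customer query instead (this assumes the Customer cache is large enough to hold all of the customers already persisted):

Query customerQuery = em.createNamedQuery("findCustomByName");
// Evaluate the query against the object cache instead of going to the database.
customerQuery.setHint("eclipselink.cache-usage", "CheckCacheOnly");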

Optimization #4 - Batch Fetch

The most common performance issue in JPA is in the fetch of relationships. If you query n orders, and access their order-lines, you get n queries for order-line. This can be optimized through join fetching and batch fetching. Join fetching joins the relationship in the original query and selects from both tables. Batch fetching executes a second query for the related objects, but fetches them all at once, instead of one by one. Because we are using pagination, this makes optimizing the fetch a little more tricky. Join fetch will still work, but since order-lines are join fetched, and there are 10 order-lines per order, the page size that was 500 orders is now only 50 orders (and their 500 order-lines). We can resolve this by increasing the page size to 5000, but given that in a real application the number of order-lines is not fixed, this becomes a bit of a guess. But the page size was just a heuristic number anyway, so no real issue. Another issue with join fetching with pagination is that the first and last objects may not have all of their related objects, if they fall in between pages. Fortunately EclipseLink is smart enough to handle this, but it does require 2 extra queries for the first and last order of each page. Join fetching also has the drawback that it selects more data when a OneToMany is join fetched. Join fetching is enabled in JPQL using join fetch o.orderLines.

Batch fetching normally works by joining the original query with the relationship query, but because the original query used pagination, this will not work. EclipseLink supports three types of batch fetching, JOIN, EXISTS, and IN. IN works with pagination, so we can use IN batch fetching. Batch fetch is enabled through the query hint "eclipselink.batch"="o.orderLines", and "eclipselink.batch.type"="IN". This will reduce the n queries for order-line to 1. So for each batch/page of 500 orders, there will be 1 query for the page of orders, and 1 query for the order-lines, and 50 queries for customer.
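Applied to the orders query, the batch fetch hints look like this (the same hints appear in the fully optimized code below):

Query query = emOld.createQuery("Select o from Order o order by o.id");
// Fetch all of the page's order-lines in one additional query, using IN batch fetching.
query.setHint("eclipselink.batch", "o.orderLines");
query.setHint("eclipselink.batch.type", "IN");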

Optimization #5 - Read Only

The application is migrating from the MySQL database to the Oracle database. So is only reading from MySQL. When you execute a query in JPA, all of the resulting objects become managed as part of the current persistence context. This is wasteful in JPA, as managed objects are tracked for changes and registered with the persistence context. EclipseLink provides a "eclipselink.read-only"="true" query hint that allows the persistence context to be bypassed. This can be used for the migration, as the objects from MySQL will not be written back to MySQL.

Optimization #6 - Sequence Pre-allocation

We have optimized the first part of the application, reading from the MySQL database. The second part is to optimize the writing to Oracle.

The biggest issue with the writing process is that the Id generation is using an allocation size of 1. This means that for every insert there will be an update and a select for the next sequence number. This is a major issue, as it is effectively doubling the amount of database access. By default JPA uses a pre-allocation size of 50 for TABLE and SEQUENCE Id generation, and 1 for IDENTITY Id generation (a very good reason to never use IDENTITY Id generation). But frequently applications are unnecessarily paranoid of holes in their Id values and set the pre-allocation value to 1. By changing the pre-allocation size from 1 to 500, we reduce about 1000 database accesses per page.
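As an annotation-based sketch (the example itself configures this in the orm.xml shown at the end, using table generators named ORD_SEQ and CUST_SEQ), increasing the allocation size looks like this:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.TableGenerator;

@Entity
public class Order {
    // Pre-allocate 500 Ids per access to the sequence table, instead of 1.
    @Id
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "ORD_SEQ")
    @TableGenerator(name = "ORD_SEQ", allocationSize = 500)
    private long id;
    // Other mappings omitted in this sketch.
}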

Optimization #7 - Cascade Persist

I must admit I intentionally added the next issue to the original code. Notice in the for loop persisting the orders, I also loop over the order-lines and persist them. This would be required if the order did not cascade the persist operation to order-line. However, I also made the orderLines relationship cascade, as well as order-line's order relationship. The JPA spec defines somewhat unusual semantics for its persist operation, requiring that the cascade persist be called every time persist is called, even if the object is an existing object. This makes cascading persist a potentially dangerous thing to do, as it could trigger a traversal of your entire object model on every persist call. This is an important point, and I added this issue purposefully to highlight it, as it is a common mistake made in JPA applications. The cascade persist causes each persist call on an order-line to persist its order, and every order-line of that order again. This results in an n^2 number of persist calls. Fortunately there are only 10 order-lines per order, so this only results in 100 extra persist calls per order. It could have been much worse if the customer defined a relationship back to its orders; then you would have 1000 extra calls per order. The persist does not need to do anything, as the objects are already persisted, but the traversal can be expensive. So, in JPA you should either mark your relationships cascade persist, or call persist in your code, but not both. In general I would recommend only cascading persist for logically dependent relationships (i.e. things that would also cascade remove).
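A sketch of the recommended shape, assuming OrderLine maps its owning relationship in an attribute named order: cascade persist only the logically dependent orderLines relationship, and persist only the order in the migration loop (as the optimized code below does):

import java.util.List;
import javax.persistence.CascadeType;
import javax.persistence.Entity;
import javax.persistence.OneToMany;

@Entity
public class Order {
    // Order-lines are logically dependent on their order, so cascading persist here is reasonable.
    @OneToMany(mappedBy = "order", cascade = CascadeType.PERSIST)
    private List<OrderLine> orderLines;
    // Id and other mappings omitted in this sketch.
}

With this mapping, the migration loop only calls em.persist(order); the order-lines are persisted by the cascade, avoiding the n^2 persist calls.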

Optimization #8 - Batch Writing

Many databases provide an optimization that allows a batch of write operations to be performed as a single database access. There is both parametrized and dynamic batch writing. For parametrized batch writing, a single parametrized SQL statement can be executed with a batch of parameter values instead of a single set of parameter values. This is very optimal as the SQL only needs to be executed once, and all of the data can be passed optimally to the database.

Dynamic batch writing requires dynamic (non-parametrized) SQL that is batched into a single big statement and sent to the database all at once. The database then needs to process this huge string and execute each statement. This requires the database to do a lot of work parsing the statement, so it is not always optimal. It does reduce the database access, so if the database is remote or poorly connected with the application, this can result in an improvement.

In general parametrized batch writing is much more optimal, and on Oracle it provides a huge benefit, whereas dynamic does not. JDBC defines the API for batch writing, but not all JDBC drivers support it; some support the API but then execute the statements one by one, so it is important to test that your database supports the optimization before using it. In EclipseLink batch writing is enabled using the persistence unit property "eclipselink.jdbc.batch-writing"="JDBC".

Another important aspect of using batch writing is that you must have the same SQL (DML actually) statement being executed in a grouped fashion in a single transaction. Some JPA providers do not order their DML, so you can end up ping-ponging between two statements, such as the order insert and the order-line insert, making batch writing ineffective. Fortunately EclipseLink orders and groups its DML, so usage of batch writing reduces the database access from 500 order inserts and 5000 order-line inserts to 55 (the default batch size is 100). We can increase the batch size using "eclipselink.jdbc.batch-writing.size"; increasing the batch size to 1000 reduces the database accesses to 6 per page.

Optimization #9 - Statement caching

Every time you execute an SQL statement, the database must parse that statement and execute it. Most of the time an application executes the same set of SQL statements over and over. By using parametrized SQL and caching the prepared statement you can avoid the cost of having the database parse the statement.

There are two levels of statement caching. One done on the database, and one done on the JDBC client. Most databases maintain a parse cache automatically, so you only need to use parametrized SQL to make use of it. Caching the statement on the JDBC client normally provides the bigger benefit, but requires some work. If your JPA provider is providing you with your JDBC connections, then it is responsible for statement caching. If you are using a DataSource, such as in an application server, then the DataSource is responsible for statement caching, and you must enable it in your DataSource config. In EclipseLink, when using EclipseLink's connection pooling, you can enable statement caching using the persistence unit property "eclipselink.jdbc.cache-statements"="true". EclipseLink uses parametrized SQL by default, so this does not need to be configured.

Optimization #10 - Disabling Caching

By default EclipseLink maintains a shared 2nd level object cache. This normally is a good thing, and improves read performance significantly. However, in our application we are only inserting into Oracle, and never reading, so there is no point to maintaining a shared cache. We can disable this using the EclipseLink persistence unit property "eclipselink.cache.shared.default"="false". However, we are reading customer, so we can enable caching for customer using, "eclipselink.cache.shared.Customer"="true".

Optimization #11 - Other Optimizations

EclipseLink provides several other more specific optimizations. I would not really recommend all of these in general as they are fairly minor, and have certain caveats, but they are useful in use cases such as migration where the process is well defined.

These include the following persistence unit properties:
  • "eclipselink.persistence-context.flush-mode"="commit" - Avoids the cost of flushing on every query execution.
  • "eclipselink.persistence-context.close-on-commit"="true" - Avoids the cost of resuming the persistence context after the commit.
  • "eclipselink.persistence-context.persist-on-commit"="false" - Avoids the cost of traversing and persisting all objects on commit.
  • "eclipselink.logging.level"="off" - Avoids some logging overhead.
The fully optimized code:
EntityManagerFactory emf = Persistence.createEntityManagerFactory("order-opt");
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
System.out.println("Migrating database.");
Query query = emOld.createQuery("Select o from Order o order by o.id");
// Optimization #2 - batch fetch
// #2 - a - join fetch
//Query query = emOld.createQuery("Select o from Order o join fetch o.orderLines");
// #2 - b - batch fetch (batch fetch is more optimal as it avoids duplication of Order data)
query.setHint("eclipselink.batch", "o.orderLines");
query.setHint("eclipselink.batch.type", "IN");
// Optimization #3 - read-only
query.setHint("eclipselink.read-only", "true");
// Optimization #4 - pagination
int pageSize = 500;
int firstResult = 0;
query.setFirstResult(firstResult);
query.setMaxResults(pageSize);
List orders = query.getResultList();
boolean done = false;
while (!done) {
    if (orders.size() < pageSize) {
        done = true;
    }
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    Query customerQuery = em.createNamedQuery("findCustomByName");
    // Reset old Ids, so they are assigned from the new database.
    for (Order order : orders) {
        order.setId(0);
        customerQuery.setParameter("name", order.getCustomer().getName());
        try {
            Customer customer = (Customer)customerQuery.getSingleResult();
            order.setCustomer(customer);
        } catch (NoResultException notPersistedYet) {
            // Customer does not yet exist, so null out id to have it persisted.
            order.getCustomer().setId(0);
        }
    }
    for (Order order : orders) {
        em.persist(order);
        // Optimization #5 - avoid n^2 persist calls
        //for (OrderLine orderLine : order.getOrderLines()) {
        //    em.persist(orderLine);
        //}
    }
    em.getTransaction().commit();
    em.close();
    firstResult = firstResult + pageSize;
    query.setFirstResult(firstResult);
    if (!done) {
        orders = query.getResultList();
    }
}
emOld.close();
emf.close();     
emfOld.close();
The optimized persistence.xml:
<persistence-unit name="order-opt" transaction-type="RESOURCE_LOCAL">
    <!--  Optimization #7, 8 - sequence preallocation, query result cache -->
    <mapping-file>META-INF/order-orm.xml</mapping-file>
    <class>model.Order</class>
    <class>model.OrderLine</class>
    <class>model.Customer</class>
    <properties>
        <!-- Change this to access your own database. -->
        <property name="javax.persistence.jdbc.driver" value="oracle.jdbc.OracleDriver" />
        <property name="javax.persistence.jdbc.url" value="jdbc:oracle:thin:@ottvm028.ca.oracle.com:1521:TOPLINK" />
        <property name="javax.persistence.jdbc.user" value="jsutherl" />
        <property name="javax.persistence.jdbc.password" value="password" />
        <property name="eclipselink.ddl-generation" value="create-tables" />
        <!--  Optimization #9 - statement caching -->
        <property name="eclipselink.jdbc.cache-statements" value="true" />
        <!--  Optimization #10 - batch writing -->
        <property name="eclipselink.jdbc.batch-writing" value="JDBC" />
        <property name="eclipselink.jdbc.batch-writing.size" value="1000" />
        <!--  Optimization #11 - disable caching for batch insert (caching only improves reads, so only adds overhead for inserts) -->
        <property name="eclipselink.cache.shared.default" value="false" />
        <!--  Except for Customer which is shared by orders -->
        <property name="eclipselink.cache.shared.Customer" value="true" />
        <!--  Optimization #12 - turn logging off -->
        <!-- property name="eclipselink.logging.level" value="FINE" /-->
        <property name="eclipselink.logging.level" value="off" />
        <!--  Optimization #13 - close EntityManager on commit, to avoid cost of resume -->
        <property name="eclipselink.persistence-context.close-on-commit" value="true" />
        <!--  Optimization #14 - avoid auto flush cost on query execution -->
        <property name="eclipselink.persistence-context.flush-mode" value="commit" />
        <!--  Optimization #15 - avoid cost of persist on commit -->
        <property name="eclipselink.persistence-context.persist-on-commit" value="false" />
    </properties>
</persistence-unit>
The optimized orm.xml:
<?xml version="1.0" encoding="UTF-8"?>
<entity-mappings version="2.1"
    xmlns="http://www.eclipse.org/eclipselink/xsds/persistence/orm"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <named-query name="findCustomByName">
        <query>Select c from Customer c where c.name = :name</query>
        <hint name="eclipselink.query-results-cache" value="true"/>
    </named-query>
    <entity class="model.Order">
        <table-generator name="ORD_SEQ" allocation-size="500"/>
    </entity>
    <entity class="model.Customer">
            <table-generator name="CUST_SEQ" allocation-size="500"/>
    </entity>

</entity-mappings>
So, what is the result? The original unoptimized code took on average 133,496 milliseconds (~2 minutes) to process ~100,000 objects. The fully optimized code took only 6,933 milliseconds (~7 seconds). This is very good, and means it could process 1 million objects in about 1 minute. The optimized code is a 1,825% improvement on the original code.

But, how much did each optimization affect this final result? To answer this question I ran the test 3 times with the fully optimized version, but with each optimization removed. This worked out better than starting with the unoptimized version and adding each optimization separately, as some optimizations get masked by the lack of others. So, in the table below, the bigger the % difference, the better the optimization (that was removed) was.

Optimization                       | Average Result (ms) | % Difference
None                               | 133,496             | 1,825%
All                                | 6,933               | 0%
1 - no agent                       | 7,906               | 14%
2 - no pagination                  | 8,679               | 25%
3 - no read-only                   | 8,323               | 20%
4a - join fetch                    | 11,836              | 71%
4b - no batch fetch                | 17,344              | 150%
5 - no sequence pre-allocation     | 30,396              | 338%
6 - no persist loop                | 7,947               | 14%
7 - no batch writing               | 75,751              | 992%
8 - no statement cache             | 7,233               | 4%
9 - with cache                     | 7,925               | 14%
10 - other                         | 7,332               | 6%

This shows that batch writing was the best optimization, followed by sequence pre-allocation, then batch fetching.