Java Persistence Performance: index

Showing posts with label index. Show all posts

Monday, June 10, 2013

Cool performance features of EclipseLink 2.5

The main goal of the EclipseLink 2.5 release was the support of the JPA 2.1 specification, as EclipseLink 2.5 was the reference implementation for JPA 2.1. For a list of JPA 2.1 features look here, or here.

Most of the features that went into the release were to support JPA 2.1 features, so there was not a lot of development time for other features. However, I was still able to sneak in a few cool new performance features. The features are not well documented yet, so I thought I would outline them here.

Indexing Foreign Keys

The first feature is auto indexing of foreign keys. Most people incorrectly assume that databases index foreign keys by default. Well, they don't. Primary keys are auto indexed, but foreign keys are not. This means any query based on the foreign key will be doing full table scans. This is any OneToMany, ManyToMany or ElementCollection relationship, as well as many OneToOne relationships, and most queries on any relationship involving joins or object comparisons. This can be a major perform issue, and you should always index your foreign keys fields.

EclipseLink 2.5 makes indexing foreign key fields easy with a new persistence unit property:

"eclipselink.ddl-generation.index-foreign-keys"="true"

This will have EclipseLink create an index for all mapped foreign keys if EclipseLink is used to generate the persistence unit's DDL. Note that DDL generation is now standard in JPA 2.1, so to enable DDL generation in EclipseLink 2.5 you can now use:

"javax.persistence.schema-generation.database.action"="create"

EclipseLink 2.5 and JPA 2.1 also support several new DDL generation features, including allowing user scripts to be executed. See, DDL generation for more information.

Query Cache Invalidation

EclipseLink has always supported a query cache. Unlike the object cache, the query cache is not enabled by default, but must be enabled through the query hint "eclipselink.query-results-cache". The main issue with the query cache, is that the results of queries can change when objects are modified, so the query cache could become out of date. Previously the query cache did support time-to-live and daily invalidation through the query hint "eclipselink.query-results-cache.expiry", but would not be kept in synch with changes as they were made.

In EclipseLink 2.5 automatic invalidation of the query cache was added. So if you had a query "Select e from Employee e" and had enabled query caching, every execution of this query would hit the cache and avoid accessing the database. Then if you inserted a new Employee, in EclipseLink 2.5 the query cache for all queries for Employee will automatically get invalidated. The next query will access the database, and get the correct result, and update the cache so all subsequent queries will once again obtain cache hits. Since the query cache is now kept in synch, the new persistence unit property "eclipselink.cache.query-results"="true" was added to enable the query cache on all named queries. If, for some reason, you want to allow stale data in your query cache, you can disable invalidation using the QueryResultsCachePolicy.setInvalidateOnChange() API.

Query cache invalidation is also integrated with cache coordination, so even if you modify an Employee on another server in your cluster, the query cache will still be invalidated. The query cache invalidation is also integrated with EclipseLink's support for Oracle Database Change Notification. If you have other applications accessing your database, you can keep the EclipseLink cache in synch with an Oracle database using the persistence unit property "eclipselink.cache.database-event-listener"="DCN". This support was added in EclipseLink 2.4, but in EclipseLink 2.5 it will also invalidate the query cache.

Tuners

EclipseLink 2.5 added an API to make it easier to provide tuning configuration for a persistence unit. The SessionTuner API allows a set of tuning properties to be configured in one place, and provides deployment time access to the EclipseLink Session and persistence unit properties. This makes it easy to have a development, debug, and production configuration of your persistence unit, or provide different configurations for different hardware. The SessionTuner is set through the persistence unit property "eclipselink.tuning".

Concurrent Processing

The most interesting performance feature provided in EclipseLink 2.5 is still in a somewhat experimental stage. The feature allows for a session to make use of concurrent processing.

There is no public API to configure it as of yet, but if you are interested in experimenting it is easy to set through a SessionCustomizer or SessionTuner.


public class MyCustomizer implements SessionCustomizer {
  public void customize(Session session) {
    ((AbstractSession)session).setIsConcurrent(true);
  }
}

Currently this enables two main features, one is the concurrent processing of result sets. The other is the concurrent loading of load groups.

In any JPA object query there are three parts. The first is the execution of the query, the second is the fetching of the data, and the third is the building of the objects. Normally the query is executed, all of the data is fetched, then the objects are built from the data. With concurrency enabled two threads will be used instead, one to fetch the data, and one to build the objects. This allows two things to be done at the same time, allowing less overall time (but the same amount of CPU). This can provide a benefit if you have a multi-CPU machine, or even if you don't, it allows the client to be doing processing at the same time as the database machine.

The second feature allows all of the relationships for all of the resulting objects to be queried and built concurrently (only when using a shared cache). So, if you queried 32 Employees and also wanted each Employee's address, the address queries could all be executed and built concurrently, resulting in significant less response time. This requires the usage of a LoadGroup to be set on the query. LoadGroup defines a new API setIsConcurrent() to allow concurrency to be enabled (this defaults to true when a session is set to be concurrent).

A LoadGroup can be configured on a query using the query hint "eclipselink.load-group", "eclipselink.load-group.attribute", or through the JPA 2.1 EntityGraph query hint "javax.persistence.loadgraph".

Note that for concurrency to improve your application's performance you need to have spare CPU time. So, to benefit the most you need multiple-CPUs. Also, concurrency will not help you scale an application server that is already under load from multiple client requests. Concurrency does not use less CPU time, it just allows for the CPUs to be used more efficiently to improve response times.

Thursday, March 7, 2013

But what if I'm not querying by id? (database and cache indexes)

Most data models define a sequence generated numeric primary key. This is the most efficient key to use as it is a single efficient guaranteed unique value. Some applications also use a UUID, which is a little less efficient in terms of time and space, but has its advantages in distributed systems and databases.

In JPA the primary key is defined as the JPA Id, and since JPA's object cache is indexed by Id, any queries by this Id obtain cache hits, and avoid database access. This is great when querying by id, and for traversing relationships across foreign keys based on the id, but what about queries not using the id?

Sequence generated ids are great for computers and databases, but are not very useful to people. How many times how you been to a store or website, and to look up your record they asked your for your sequence generated id?

Probably not very often, you are more likely to be asked for your phone number, email address, ssn or other such key that is easy to remember. These data keys can be considered alternative or secondary keys, and are very common in databases. Querying using alternative keys are very common, so it is important that they perform optimally.

The most important thing to ensure is that the columns are properly indexed in the database, otherwise each query will require a full table scan. The second thing to ensure is that the columns are indexed in the JPA object cache.

EclipseLink Database Indexes

To create a database index on a column you can use your own DDL script, or use the @Index annotation in EclipseLink (since EclipseLink 2.2). JPA 2.1 will also defined its own @Index annotation. For the EclipseLink index annotation you can just put this on the attributes you would like to index.

@Entity
@Index(columnNames={"F_NAME", "L_NAME"})
public class Employee {
  @Id
  private long id;
  @Index
  @Column(name="F_NAME")
  private String firstName;
  @Index
  @Column(name="L_NAME")
  private String lastName;
  @Index(unique=true)
  private String ssn;
  @Index
  private String phoneNumber;
  @Index
  private String emailAddress;
  ...
}

JPA 2.1 Database Indexes

The upcoming JPA 2.1 spec (draft) also defines support for database indexes. EclipseLink 2.5 is the reference implementation for JPA 2.1, so in EclipseLink 2.5 (dev builds) you can also create database indexes using the JPA 2.1 annotations. This is a little bit more complex, as you cannot define the @Index on attributes, only inside table annotations.

@Entity
@Table(indexes={
  @Index(name="EMP_SSN_INDEX", unique=true, columnList={"SSN"}),
  @Index(name="EMP_PHONE_INDEX", columnList="PHONENUMBER"),
  @Index(name="EMP_EMAIL_INDEX", columnList="EMAILADDRESS"),
  @Index(name="EMP_F_NAME_INDEX", columnList="F_NAME"),
  @Index(name="EMP_L_NAME_INDEX", columnList="L_NAME"),
  @Index(name="EMP_NAME_INDEX", columnList={"F_NAME", "L_NAME"}) })
public class Employee {
  @Id
  private long id;
  @Column(name="F_NAME")
  private String firstName;
  @Column(name="L_NAME")
  private String lastName;
  private String ssn;
  private String phoneNumber;
  private String emailAddress;
  ...
}

EclipseLink Cache Indexes

EclipseLink also supports indexes on the object cache (since EclipseLink 2.4). This allows JPQL and Criteria queries on indexed fields to obtain cache hits, and avoid all database access. The @CacheIndex annotation is used to index an attribute, or set of columns. When a set of columns are indexed, any queries using the attributes that map to those columns will use the index.

Cache indexes only provide a benefit to queries that expect a single result. Indexing a field such as firstName would provide no benefit, as there are many results with the same first name, and EclipseLink could never be certain is has them all loaded in the cache, so must access the database.

@Entity
@CacheIndex(columnNames={"F_NAME", "L_NAME"})
public class Employee {
  @Id
  private long id;
  @Column(name="F_NAME")
  private String firstName;
  @Column(name="L_NAME")
  private String lastName;
  @CacheIndex
  private String ssn;
  @CacheIndex
  private String phoneNumber;
  @CacheIndex
  private String emailAddress;
  ...
}

So, is indexing worth it? I created an example benchmark that shows the difference between indexed and non-indexed queries. These results were obtained accessing an Oracle database across a LAN from a desktop machine. Results are queries per second, so a bigger number is better. The test consisted of querying a Customer object by name, with a database size of 1,000 customers.

Config	Average Result (q/s)	% Difference
No index, no cache	7,155	0%
No index, cache	8,095	13%
Database index, cache	9,900	38%
Database index, cache index	137,120	1,816%

The results show that the object cache indexed by id provides a little benefit (13%), as it avoids having to rebuild the object, but still has to access the database. Note the test did not access any relationships on Customer, if it had this object cache would still provide a major benefit in avoiding relationship queries. The database index provides a better benefit, (38%), this is a factor of the size of the table, the bigger the table the bigger the benefit. The object cache index provides the best benefit, almost a 20x improvement.

This shows it is important to properly index your database and your cache. Avoid indexing everything, as there is a cost in maintaining indexes. For the database, index any commonly queried columns, for the cache, index any secondary key fields.

See the EclipseLink UserGuide for more info on caching.

The source code for the benchmarks used in this post can be found here, or downloaded here.

Subscribe to: Comments ( Atom )

Pages