JPA has always been considered slower than plain JDBC or lighter frameworks like MyBatis, and bulk inserts in particular are often considered the culprit.
I recently had to implement a small report archiving system for a client (a financial organization). The goal of the project was to archive status reports sent to customers so that they could be searched for and retrieved.
The input to the archiving system is an XML file containing metadata for all reports and one big PDF file that contains all of them. The metadata is parsed and inserted into a database, which can then be searched for a specific report or a set of reports. For each report there is an index into the large PDF file (a start page and a number of pages) that can be used to extract that report's pages from the large PDF.
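As a rough sketch of that extraction step (assuming iText 5's PdfCopy API; the file names, method name and parameters are just illustrative):

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;

// Copies one report's page range out of the large PDF into its own file
void extractReport(String sourcePdf, String targetPdf, int startPage, int pageCount) throws Exception {
    PdfReader reader = new PdfReader(sourcePdf);
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileOutputStream(targetPdf));
    document.open();
    for (int page = startPage; page < startPage + pageCount; page++) {
        copy.addPage(copy.getImportedPage(reader, page));
    }
    document.close();
    reader.close();
}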
The project included a batch process to import the data into the archive system as well as a web service front end that supports looking up reports and extracting the report from the large PDF file.
The batch process needs to support archiving about 400k reports in one run. Each report creates a ReportArchive entity in the database, which also has a one-to-many relationship to an additional metadata entity.
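Simplified, the entity model looked roughly like this (ReportMetadata is a placeholder name for the child entity, the fields are illustrative, and I'm assuming identity-style key generation for the auto-increment column):

import java.util.ArrayList;
import java.util.List;
import javax.persistence.*;

@Entity
public class ReportArchive {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)  // DB2 auto-increment sequence (removed later in Optimization #2)
    private long reportSequence;

    @Column(unique = true)
    private String reportId;

    // index into the large PDF file
    private int startPage;
    private int pageCount;

    // the one-to-many relationship to the additional metadata entity
    @OneToMany(mappedBy = "report", cascade = CascadeType.ALL)
    private List<ReportMetadata> metadata = new ArrayList<ReportMetadata>();

    // getters and setters omitted
}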
While JPA is not the most efficient technology for bulk inserts, it is the standard API used in all of the customer's Java projects, so it was the default choice.
I used StAX to parse the XML file and iText to extract pages from the large PDF file, with EclipseLink as the JPA provider. What I want to talk about here, though, are the simple optimizations that improved JPA insert performance by a factor of more than 13 (!!).
Initial Run
The initial run used a simple JPA persistence implementation with a commit on each insert of a ReportArchive entity. The main entity, ReportArchive, had CascadeType.ALL on its child entity.
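In code, the initial loop looked roughly like this (a sketch; javax.persistence imports are omitted, and the persistence unit name and parsedReports collection are placeholders):

EntityManagerFactory emf = Persistence.createEntityManagerFactory("report-archive");
EntityManager em = emf.createEntityManager();

for (ReportArchive report : parsedReports) {
    em.getTransaction().begin();
    em.persist(report);               // cascades to the child metadata entities
    em.getTransaction().commit();     // one commit per report
}

em.close();
emf.close();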
I had a sample batch file with 5786 reports and it took 7:43 minutes to insert those reports.
Optimization #1
I decided to perform a commit every 500 inserts in order to improve performance.
I also configured EclipseLink to use JDBC batching with the following configuration settings in persistence.xml:
<property name="eclipselink.jdbc.batch-writing" value="JDBC"/>
<property name="eclipselink.jdbc.batch-writing.size" value="1000"/>
The result: performance improved to 5:09 minutes for inserting the same 5786 reports.
I also tried to change the commit batch size and played with various sizes from 200 to 1000, but at least in my limited testing 500 seemed to be the optimal size.
Optimization #2
The ReportArchive entity had a unique key on the report ID, but its primary key was a report sequence, a database auto-increment field (the database was DB2). I decided to remove the report sequence and use the report ID as the primary key, since it is unique in any case.
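In the entity, the change boils down to dropping the generated key and making the report ID the @Id field (a sketch, assuming the report ID is a string):

@Entity
public class ReportArchive {

    // Optimization #2: the report ID is unique anyway, so use it as the primary key
    // instead of the DB2 auto-increment sequence. This avoids reading back a
    // database-generated key after each insert, which also tends to play badly
    // with JDBC batch writing.
    @Id
    private String reportId;

    // remaining fields unchanged
}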
The result: performance improved to 3:29 minutes for the same 5786 reports.
I also tried to optimize further by calling clear() on the EntityManager between batch commits; however, that did not produce any performance improvement.
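For completeness, the clear() call sat right after each batch commit, roughly like this:

if (++count % BATCH_SIZE == 0) {
    em.getTransaction().commit();
    em.clear();                 // detach the persisted entities to keep the persistence context small
    em.getTransaction().begin();
}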
Optimization #3
I added caching of SQL statements and turned EclipseLink's logging off by adding the following configuration settings to persistence.xml:
<property name="eclipselink.jdbc.cache-statements" value="true"/>
<property name="eclipselink.logging.level" value="off"/>
The result: performance improved dramatically to 35 seconds (!!!) for the same 5786 reports.
Summary
To dramatically improve JPA insert performance with EclipseLink:
- Avoid database auto-increment columns as your primary keys if possible.
- Perform commits in batches - 500 was the magic number for my application.
- Add the following EclipseLink configuration settings (a combined persistence.xml sketch follows below):
<property name="eclipselink.jdbc.batch-writing" value="JDBC"/>
<property name="eclipselink.jdbc.batch-writing.size" value="1000"/>
<property name="eclipselink.jdbc.cache-statements" value="true"/>
<property name="eclipselink.logging.level" value="off"/>