The results displayed are generated using the debug build of shark, with assertions, that I made on the 6th of October compared against release builds of the pure zero cpp interpreter and the optimised zero assembler interpreter, both from Icedtea6-1.6.1.
Did we gain anything from having a jumping shark? By taking a quick peek at the graph you can quite quickly spot some 15X+ speed improvements so yes! yeah!, the shark JIT indeed got some sharp toots in its yaws! I am quite delighted to see that some parts of the benchmark got a 25x+ speed boost!
There are still some rough spots that can be identified that of course needs some polishing, so let me share some ideas on how to make the Shark JIT on ARM really shine.
As can be seen in the chart shark uses the zero cpp interpreter before the methods are jited and the extra overhead on running the JIT causes the zero interpreter to run slower during program launch on a single core ARM cpu, this penalty are removed once the initial warm-up have complete (somewhere around 300 to 500 compiled methods). New multi-core ARM Cortex-A9 CPU do not have this penalty since the compiler process are run in a separate thread and can be scheduled on a CPU of its own.
Some quick ways to fix the warm-up issue:
0. First of all I want to state that these results where generated using a debug build of shark, I have a build machine working on creating a release builds as I type so hopefully I will be able to generate some improved benchmark scores in the near future, especially to deal with the warm-up penalty.
1. A quick way to reduce the warmup penalty would be to make shark able to use the new assembler optimized interpreter instead of the pure cpp interpreter found in Icedtea6-1.6.1 and this could become a reality quite soon since they both share the same in memory structures. Also by using the new assembler optimizations would make Shark JIT more usable as a client JVM where initial GUI performance are crucial, and in this GUI area the assembler interpreter really shine.
2. I have also identified some parts in the LLVM JIT that could be quickly improved to make the LLVM JIT jitting faster. Basically I want to make the LLVM tablegen generate better lookuptables to speed up the instruction lowering, currently shark spends quite a large deal of time here running the LLVM ExecutionEngine::getPointerToFunction(). I think by generating some improved formatter code for the LLVM tablegen backend could quite quickly improve the autogenerated .inc files used for the target instruction lowering.
3. Examine the posibility to implement a bytecode cache in Shark to jumpstart the JIT even further. By making the JIT able to load precalculated LLVM IR or in memory representations of the methods would reduce some of the JIT overhead on program launch.
4. Add a PassManager framework to Shark to simplify the LLVM IR before it reaches the JIT. The tricky part are to select what passes to use and in what order to use them. If done correctly then this might both lower jitting time and improve the generated machine code quality.