zafena development

October 9, 2009

The results displayed were generated using a debug build of Shark (with assertions enabled) that I made on the 6th of October, compared against release builds of the pure Zero C++ interpreter and the optimised Zero assembler interpreter, both from IcedTea6-1.6.1.

Did we gain anything from having a jumping shark? A quick peek at the graph reveals some 15x+ speed improvements, so yes! The Shark JIT has indeed got some sharp teeth in its jaws! I am quite delighted to see that some parts of the benchmark got a 25x+ speed boost!
There are still some rough spots that of course need polishing, so let me share some ideas on how to make the Shark JIT on ARM really shine.

As can be seen in the chart, Shark uses the Zero C++ interpreter before methods are JIT compiled, and the extra overhead of running the JIT causes the Zero interpreter to run slower during program launch on a single-core ARM CPU. This penalty goes away once the initial warm-up has completed (somewhere around 300 to 500 compiled methods). The new multi-core ARM Cortex-A9 CPUs do not suffer this penalty, since the compiler runs in a thread of its own and can be scheduled on a separate CPU core.
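
For anyone who wants to watch the warm-up happen themselves: HotSpot's standard -XX:+PrintCompilation flag prints one line per method as it gets compiled, and since Shark plugs into HotSpot's normal compiler interface it should show the 300-to-500-method ramp described above. The jar name here is just a placeholder:

    java -XX:+PrintCompilation -jar your-benchmark.jar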

Some quick ways to fix the warm-up issue:
0. First of all I want to state that these results were generated using a debug build of Shark. I have a build machine working on release builds as I type, so hopefully I will be able to post improved benchmark scores in the near future, which should help in particular with the warm-up penalty.
1. A quick way to reduce the warm-up penalty would be to make Shark able to use the new assembler-optimised interpreter instead of the pure C++ interpreter found in IcedTea6-1.6.1, and this could become a reality quite soon since they both share the same in-memory structures. Using the assembler interpreter would also make the Shark JIT more usable as a client JVM, where initial GUI performance is crucial, and in the GUI area the assembler interpreter really shines.
2. I have also identified some parts of the LLVM JIT that could be improved fairly quickly to make it compile faster. Basically I want to make LLVM's TableGen generate better lookup tables to speed up instruction lowering; currently Shark spends quite a large share of its compile time here, inside LLVM's ExecutionEngine::getPointerToFunction(). I think generating improved formatter code for the TableGen backend could quite quickly improve the autogenerated .inc files used for target instruction lowering. (A small timing sketch follows after this list.)
3. Examine the possibility of implementing a bytecode cache in Shark to jumpstart the JIT even further. Making the JIT able to load precalculated LLVM IR, or in-memory representations of the methods, would remove some of the JIT overhead at program launch. (See the second sketch after this list.)
4. Add a PassManager framework to Shark to simplify the LLVM IR before it reaches the JIT. The tricky part is selecting which passes to use and in what order to run them. Done correctly, this might both lower compile time and improve the quality of the generated machine code. (See the third sketch after this list.)
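
To back up idea 2, here is a minimal sketch of how one could time the lowering step, assuming the LLVM 2.6-era C++ API that is current as I write this. The names ee, func and time_lowering are mine, not Shark's; Shark's real compile path does more around this call.

    // Minimal timing sketch, assuming the LLVM 2.6-era C++ API.
    #include <sys/time.h>
    #include <cstdio>
    #include "llvm/Function.h"
    #include "llvm/ExecutionEngine/ExecutionEngine.h"

    static double now_seconds() {
      struct timeval tv;
      gettimeofday(&tv, 0);
      return tv.tv_sec + tv.tv_usec / 1e6;
    }

    // "ee" and "func" stand in for Shark's ExecutionEngine and the
    // llvm::Function it has just built from the Java bytecode.
    void *time_lowering(llvm::ExecutionEngine *ee, llvm::Function *func) {
      double start = now_seconds();
      // This call triggers instruction selection and lowering through
      // the TableGen-generated .inc tables, the hot spot named in idea 2.
      void *code = ee->getPointerToFunction(func);
      fprintf(stderr, "jitted %s in %f s\n",
              func->getNameStr().c_str(), now_seconds() - start);
      return code;
    }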
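
For idea 3, a rough sketch of what an IR cache could look like, again against the LLVM 2.6-era bitcode API. Shark has no such cache today; the function names and the cache path handling are purely illustrative, and the raw_fd_ostream constructor signature drifts a little between LLVM releases.

    #include <string>
    #include "llvm/Module.h"
    #include "llvm/LLVMContext.h"
    #include "llvm/Bitcode/ReaderWriter.h"
    #include "llvm/Support/MemoryBuffer.h"
    #include "llvm/Support/raw_ostream.h"

    // At VM shutdown: dump the module holding the compiled methods' IR.
    bool save_ir_cache(const llvm::Module *module, const std::string &path) {
      std::string error;
      llvm::raw_fd_ostream out(path.c_str(), error);
      if (!error.empty())
        return false;
      llvm::WriteBitcodeToFile(module, out);
      return true;
    }

    // At VM startup: reload the precalculated IR so the JIT can skip
    // rebuilding it from bytecode for methods it has seen before.
    llvm::Module *load_ir_cache(llvm::LLVMContext &context,
                                const std::string &path) {
      std::string error;
      llvm::MemoryBuffer *buf = llvm::MemoryBuffer::getFile(path.c_str(), &error);
      if (!buf)
        return 0;
      llvm::Module *module = llvm::ParseBitcodeFile(buf, context, &error);
      delete buf;
      return module;  // 0 on parse failure
    }

The hard part is of course not the serialisation but the invalidation: the cache would have to be keyed on the exact class files, or stale IR could get loaded for code that has changed.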
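
And for idea 4, this is roughly what a first PassManager hookup could look like with LLVM 2.6's FunctionPassManager. The pass selection below is just the classic cleanup set from the LLVM tutorial, not a tuned choice for Shark's IR; picking and ordering passes for real is exactly the tricky part mentioned above.

    #include "llvm/Module.h"
    #include "llvm/ModuleProvider.h"
    #include "llvm/PassManager.h"
    #include "llvm/Function.h"
    #include "llvm/Target/TargetData.h"
    #include "llvm/Transforms/Scalar.h"
    #include "llvm/ExecutionEngine/ExecutionEngine.h"

    // Run a few cheap cleanup passes on one function before it is jitted.
    void optimize_before_jit(llvm::Module *module,
                             llvm::ExecutionEngine *ee,
                             llvm::Function *func) {
      llvm::ExistingModuleProvider provider(module);
      llvm::FunctionPassManager fpm(&provider);

      // Tell the passes about the target's data layout.
      fpm.add(new llvm::TargetData(*ee->getTargetData()));
      // Turn stack slots into SSA registers.
      fpm.add(llvm::createPromoteMemoryToRegisterPass());
      // Simple peephole and algebraic simplifications.
      fpm.add(llvm::createInstructionCombiningPass());
      // Remove redundant computations.
      fpm.add(llvm::createGVNPass());
      // Delete unreachable blocks and merge trivial ones.
      fpm.add(llvm::createCFGSimplificationPass());

      fpm.doInitialization();
      fpm.run(*func);
      fpm.doFinalization();

      provider.releaseModule();  // the provider must not delete our module
    }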

5 Comments »

  1. [...] one last thing I want to do before I step aside for a couple of months. Xerxes Rånby posted some benchmarks of Zero, Shark, and the assembler interpreter on ARM; Shark is gratifyingly faster than everything [...]

    Pingback by gbenson.net :: Long overdue update — October 9, 2009 @ 18:07

  2. Ah, good work! And luckily I don’t think I care about most of the parts that Shark is handling less well unless network I/O gets snarled up the same way in JNI stuff! B^>

    Rgds

    Damon

    Comment by Damon Hart-Davis — October 10, 2009 @ 09:36

  3. As to optimising start-up… The tiered compilation stuff that Steve Goldman of Sun was working on would of course help with that if you got a light C1 quickly and then C2 when needed:

    http://blogs.sun.com/fatcatair/category/Java

But also I'd suggested to him to simply remember which methods had been hot at startup but didn't run long enough after being compiled to get much benefit from compilation. On a subsequent run those would be compiled as soon as (and thus before) they were called (C1 level I guess) like an old-fashioned JIT.

    Much safer and less data and dependencies to store than a code cache, and the worst that can really happen is some excess compilation for things that turned out not to be early/start-up hotspots *this* time, eg because the code was changed.

    Java could really benefit from having such work folded back in IMHO…

    Rgds

    Damon

    Comment by Damon Hart-Davis — October 10, 2009 @ 22:12

  4. I can happily announce that the warm-up time I have been worried about has been entirely removed by combining the Shark JIT with the optimised ASM interpreter on ARM.
    Edward Nevill is the one to thank for making all this happen so quickly; all sources can be found in the icedtea6 source tree and mailing list.

    I have made some ARMv5 binaries that can be run on Ubuntu Jaunty systems like the SheevaPlug and OpenRD-Client.
    Give it a spin and tell us what you think!
    http://labb.zafena.se/shark-testing/armv5/

    Cheers and thanks Ed!
    Xerxes

    Comment by xerxes — November 5, 2009 @ 18:16

  5. Testing started… Please fit protective clothing and retreat to underground bunkers! B^>

    Rgds

    Damon

    Comment by Damon Hart-Davis — November 5, 2009 @ 18:37
