zafena development

October 9, 2009

The results displayed were generated using the debug build of Shark (with assertions enabled) that I made on the 6th of October, compared against release builds of the pure Zero C++ interpreter and the optimised Zero assembler interpreter, both from IcedTea6-1.6.1.

Did we gain anything from having a jumping shark? A quick peek at the graph reveals some 15x+ speed improvements, so yes! The Shark JIT has indeed got some sharp teeth in its jaws! I am quite delighted to see that some parts of the benchmark got a 25x+ speed boost!
There are still some rough spots that need polishing, so let me share some ideas on how to make the Shark JIT on ARM really shine.

As can be seen in the chart, Shark uses the Zero C++ interpreter before methods get JIT-compiled, and the extra overhead of running the JIT causes the Zero interpreter to run slower during program launch on a single-core ARM CPU. This penalty disappears once the initial warm-up has completed (somewhere around 300 to 500 compiled methods). New multi-core ARM Cortex-A9 CPUs do not suffer from this penalty, since the compiler runs in a separate thread and can be scheduled on a core of its own.

Some quick ways to fix the warm-up issue:
0. First of all I want to state that these results were generated using a debug build of Shark. I have a build machine working on a release build as I type, so hopefully I will be able to generate some improved benchmark scores in the near future, especially regarding the warm-up penalty.
1. A quick way to reduce the warm-up penalty would be to make Shark use the new assembler-optimised interpreter instead of the pure C++ interpreter found in IcedTea6-1.6.1; this could become a reality quite soon since they both share the same in-memory structures. Using the assembler interpreter would also make the Shark JIT more usable as a client JVM, where initial GUI performance is crucial, and in this GUI area the assembler interpreter really shines.
2. I have also identified some parts of the LLVM JIT that could be improved to make it compile faster. Basically, I want to make the LLVM tablegen generate better lookup tables to speed up instruction lowering; currently Shark spends a large deal of time here, inside LLVM's ExecutionEngine::getPointerToFunction(). Generating improved formatter code in the LLVM tablegen backend could quite quickly improve the autogenerated .inc files used for target instruction lowering.
3. Examine the possibility of implementing a bytecode cache in Shark to jump-start the JIT even further. Making the JIT able to load precalculated LLVM IR or in-memory representations of the methods would reduce some of the JIT overhead at program launch.
4. Add a PassManager framework to Shark to simplify the LLVM IR before it reaches the JIT. The tricky part is selecting which passes to use and in what order to run them. If done correctly, this might both lower JIT compile time and improve the quality of the generated machine code (see the sketch below).
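To make item 4 a bit more concrete, here is a minimal sketch of how such a pass pipeline could be wired up against the LLVM 2.6-era C++ API. The function name and the particular pass selection are my own assumptions for illustration, not code taken from Shark:

// Hypothetical sketch: a small FunctionPassManager for Shark-generated IR.
// The pass selection below is only an example; picking the right passes
// and their ordering is exactly the tricky part mentioned above.
#include "llvm/PassManager.h"
#include "llvm/ModuleProvider.h"
#include "llvm/Target/TargetData.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/ExecutionEngine/ExecutionEngine.h"

llvm::FunctionPassManager *create_shark_pass_manager(llvm::ModuleProvider *mp,
                                                     llvm::ExecutionEngine *ee) {
  llvm::FunctionPassManager *fpm = new llvm::FunctionPassManager(mp);

  // The passes need to know the target's data layout.
  fpm->add(new llvm::TargetData(*ee->getTargetData()));

  // A cheap, conservative pipeline to keep the extra JIT overhead low.
  fpm->add(llvm::createCFGSimplificationPass());    // prune unreachable blocks
  fpm->add(llvm::createInstructionCombiningPass()); // simple peephole cleanups
  fpm->add(llvm::createGVNPass());                  // remove redundant loads/exprs

  fpm->doInitialization();
  return fpm;
}

// Then, just before asking the JIT for native code:
//   fpm->run(*function);
//   void *code = ee->getPointerToFunction(function);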

October 6, 2009

picture of the day!

The picture that made my day!

Ok.. so what happened?

xerxes@babbage-karmic:/wd/icedtea6/openjdk/build/linux-arm/bin$ ./java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (IcedTea6 1.7pre-r2a3725ce72d4) (build 1.6.0_0-b16)
OpenJDK Shark VM (build 14.0-b16-product, mixed mode)

xerxes@babbage-karmic:/wd/icedtea6/openjdk/build/linux-arm/bin$ cat /proc/cpuinfo
Processor    : ARMv7 Processor rev 1 (v7l)
BogoMIPS    : 799.53
Features    : swp half thumb fastmult vfp edsp
CPU implementer    : 0x41
CPU architecture: 7
CPU variant    : 0x2
CPU part    : 0xc08
CPU revision    : 1
Hardware    : Freescale MX51 Babbage Board
Revision    : 51011
Serial        : 0000000000000000

xerxes@babbage-karmic:/wd/llvm$ svn info
URL: http://llvm.org/svn/llvm-project/llvm/trunk
Repository Root: http://llvm.org/svn/llvm-project
Repository UUID: 91177308-0d34-0410-b5e6-96231b3b80d8
Revision: 82896
Node Kind: directory
Schedule: normal
Last Changed Author: edwin
Last Changed Rev: 82896
Last Changed Date: 2009-09-27 11:08:03 +0000 (Sun, 27 Sep 2009)

xerxes@babbage-karmic:/wd/llvm$ quilt diff
Index: llvm/lib/Target/ARM/ARMInstrInfo.td
===================================================================
--- llvm.orig/lib/Target/ARM/ARMInstrInfo.td    2009-10-06 12:35:26.000000000 +0000
+++ llvm/lib/Target/ARM/ARMInstrInfo.td    2009-10-06 12:36:03.000000000 +0000
@@ -645,7 +645,7 @@
 IIC_Br, "mov lr, pc\n\tbx $func",
 [(ARMcall_nolink GPR:$func)]>,
 Requires<[IsARM, IsNotDarwin]> {
-    let Inst{7-4}   = 0b0001;
+    let Inst{7-4}   = 0b0011;
 let Inst{19-8}  = 0b111111111111;
 let Inst{27-20} = 0b00010010;
 }

The last patch to LLVM is currently a hack: basically it makes LLVM emit ARM BLX instructions instead of BX instructions for ARM::CALL_NOLINK. Unlike BX, BLX stores the return address in the lr register before branching. So why did this little hack make it work?

In order to understand that, one has to find out what made Shark on ARM crash before...

Let's rewind time a few days...

Hi, I have been enjoying myself inside gdb for some days, and I have now at least found the reason why the CPU
ends up in garbage memory when running Shark on ARM.

The problem can be illustrated like this:

frame manager invokes the jitted code
entry_zero.hpp:57 invokes jit code at 0x67c9e990

jitted code runs
0x67c9e990:    push    {r4, r5, r6, r7, r8, r9, r10, r11, lr}
0x67c9e994:    sub    sp, sp, #12    ; 0xc
0x67c9e998:    ldr    r12, [r3, #756]
0x67c9e99c:    ldr    lr, [r3, #764]
0x67c9e9a0:    sub    r4, lr, #56    ; 0x38
0x67c9e9a4:    cmp    r4, r12
0x67c9e9a8:    bcc    0x67c9ebd0
0x67c9e9ac:    mov    r5, r3
0x67c9e9b0:    str    r2, [sp, #4]
0x67c9e9b4:    mov    r6, r0
0x67c9e9b8:    str    r4, [r5, #764]
0x67c9e9bc:    str    r4, [r4, #20]
0x67c9e9c0:    ldr    r0, [pc, #640]    ; 0x67c9ec48
0x67c9e9c4:    str    r0, [r4, #28]
0x67c9e9c8:    ldr    r0, [r5, #768]
0x67c9e9cc:    str    r0, [r4, #32]
0x67c9e9d0:    add    r0, r4, #32    ; 0x20
0x67c9e9d4:    str    r0, [r5, #768]
0x67c9e9d8:    str    r6, [r4, #16]
0x67c9e9dc:    ldr    r7, [r1]
0x67c9e9e0:    ldr    r0, [r1, #4]
0x67c9e9e4:    str    r0, [sp]
0x67c9e9e8:    ldr    r8, [r1, #8]
0x67c9e9ec:    ldr    r9, [r1, #12]
0x67c9e9f0:    ldr    r0, [r1, #16]
0x67c9e9f4:    str    r0, [sp, #8]
0x67c9e9f8:    ldr    r10, [r1, #20]
0x67c9e9fc:    ldr    r2, [pc, #584]    ; 0x67c9ec4c   <------ jit code calls a jvm function stored in this address
0x67c9ea00:    mov    r0, r1
0x67c9ea04:    bx    r2 <---------------------------   problem!  should have been blx!

(gdb) x 0x67c9ec4c
0x67c9ec4c:    0x40836d9c
(gdb) x 0x40836d9c
0x40836d9c <_ZN13SharedRuntime17OSR_migration_endEPi>:    0xe92d41f0
(gdb)

So let's check out _ZN13SharedRuntime17OSR_migration_endEPi

0x40836d9c <_ZN13SharedRuntime17OSR_migration_endEPi+0>:    push    {r4, r5, r6, r7, r8, lr}    <------  lr is backed up here..  but the bx above did not update lr..
0x40836da0 <_ZN13SharedRuntime17OSR_migration_endEPi+4>:    ldr    r4, [pc, #284]    ; 0x40836ec4 <_ZN13SharedRuntime17OSR_migration_endEPi+296>
0x40836da4 <_ZN13SharedRuntime17OSR_migration_endEPi+8>:    ldr    r7, [pc, #284]    ; 0x40836ec8 <_ZN13SharedRuntime17OSR_migration_endEPi+300>
0x40836da8 <_ZN13SharedRuntime17OSR_migration_endEPi+12>:    ldr    r6, [pc, #284]    ; 0x40836ecc <_ZN13SharedRuntime17OSR_migration_endEPi+304>
0x40836dac <_ZN13SharedRuntime17OSR_migration_endEPi+16>:    add    r4, pc, r4
0x40836db0 <_ZN13SharedRuntime17OSR_migration_endEPi+20>:    ldr    r12, [r4, r7]
0x40836db4 <_ZN13SharedRuntime17OSR_migration_endEPi+24>:    ldr    r1, [r4, r6]
0x40836db8 <_ZN13SharedRuntime17OSR_migration_endEPi+28>:    ldr    r5, [r12]
0x40836dbc <_ZN13SharedRuntime17OSR_migration_endEPi+32>:    ldrb    r2, [r1]
0x40836dc0 <_ZN13SharedRuntime17OSR_migration_endEPi+36>:    add    r3, r5, #1    ; 0x1
0x40836dc4 <_ZN13SharedRuntime17OSR_migration_endEPi+40>:    cmp    r2, #0    ; 0x0
0x40836dc8 <_ZN13SharedRuntime17OSR_migration_endEPi+44>:    sub    sp, sp, #24    ; 0x18
0x40836dcc <_ZN13SharedRuntime17OSR_migration_endEPi+48>:    str    r3, [r12]
0x40836dd0 <_ZN13SharedRuntime17OSR_migration_endEPi+52>:    mov    r7, r0
0x40836dd4 <_ZN13SharedRuntime17OSR_migration_endEPi+56>:    bne 0x40836e74 <_ZN13SharedRuntime17OSR_migration_endEPi+216>
0x40836dd8 <_ZN13SharedRuntime17OSR_migration_endEPi+60>:    ldr    r2, [pc, #240]    ; 0x40836ed0 <_ZN13SharedRuntime17OSR_migration_endEPi+308>
0x40836ddc <_ZN13SharedRuntime17OSR_migration_endEPi+64>:    ldr    r12, [r4, r2]
0x40836de0 <_ZN13SharedRuntime17OSR_migration_endEPi+68>:    ldrb    r3, [r12]
0x40836de4 <_ZN13SharedRuntime17OSR_migration_endEPi+72>:    cmp    r3, #0    ; 0x0
0x40836de8 <_ZN13SharedRuntime17OSR_migration_endEPi+76>:    beq 0x40836e20 <_ZN13SharedRuntime17OSR_migration_endEPi+132>
0x40836dec <_ZN13SharedRuntime17OSR_migration_endEPi+80>:    ldr    r6, [pc, #224]    ; 0x40836ed4 <_ZN13SharedRuntime17OSR_migration_endEPi+312>
0x40836df0 <_ZN13SharedRuntime17OSR_migration_endEPi+84>:    ldr    r5, [r4, r6]
0x40836df4 <_ZN13SharedRuntime17OSR_migration_endEPi+88>:    add    r0, r4, r6
0x40836df8 <_ZN13SharedRuntime17OSR_migration_endEPi+92>:    tst    r5, #1    ; 0x1
0x40836dfc <_ZN13SharedRuntime17OSR_migration_endEPi+96>:    beq 0x40836e8c <_ZN13SharedRuntime17OSR_migration_endEPi+240>
0x40836e00 <_ZN13SharedRuntime17OSR_migration_endEPi+100>:    ldr    r5, [pc, #208]    ; 0x40836ed8 <_ZN13SharedRuntime17OSR_migration_endEPi+316>
0x40836e04 <_ZN13SharedRuntime17OSR_migration_endEPi+104>:    ldr    r3, [r4, r5]
0x40836e08 <_ZN13SharedRuntime17OSR_migration_endEPi+108>:    cmp    r3, #0    ; 0x0
0x40836e0c <_ZN13SharedRuntime17OSR_migration_endEPi+112>:    movne r0, r3
0x40836e10 <_ZN13SharedRuntime17OSR_migration_endEPi+116>:    ldrne r6, [r3]
0x40836e14 <_ZN13SharedRuntime17OSR_migration_endEPi+120>:    ldrne r12, [r6, #16]
0x40836e18 <_ZN13SharedRuntime17OSR_migration_endEPi+124>:    movne lr, pc
0x40836e1c <_ZN13SharedRuntime17OSR_migration_endEPi+128>:    bxne    r12
0x40836e20 <_ZN13SharedRuntime17OSR_migration_endEPi+132>:    add    r6, sp, #20    ; 0x14
0x40836e24 <_ZN13SharedRuntime17OSR_migration_endEPi+136>:    mov    r0, r6
0x40836e28 <_ZN13SharedRuntime17OSR_migration_endEPi+140>:    bl 0x40596c84 <NoHandleMark>
0x40836e2c <_ZN13SharedRuntime17OSR_migration_endEPi+144>:    mov    r0, sp
0x40836e30 <_ZN13SharedRuntime17OSR_migration_endEPi+148>:    bl 0x4057909c <JRT_Leaf_Verifier>
0x40836e34 <_ZN13SharedRuntime17OSR_migration_endEPi+152>:    ldr    r3, [pc, #160]    ; 0x40836edc <_ZN13SharedRuntime17OSR_migration_endEPi+320>
0x40836e38 <_ZN13SharedRuntime17OSR_migration_endEPi+156>:    mov    r5, sp
0x40836e3c <_ZN13SharedRuntime17OSR_migration_endEPi+160>:    ldr r12, [r4, r3]
0x40836e40 <_ZN13SharedRuntime17OSR_migration_endEPi+164>:    ldrb r0, [r12]
0x40836e44 <_ZN13SharedRuntime17OSR_migration_endEPi+168>:    cmp    r0, #0    ; 0x0
0x40836e48 <_ZN13SharedRuntime17OSR_migration_endEPi+172>:    movne r0, r7
0x40836e4c <_ZN13SharedRuntime17OSR_migration_endEPi+176>:    blne 0x4039b20c <_Z15trace_heap_freePv>
0x40836e50 <_ZN13SharedRuntime17OSR_migration_endEPi+180>:    mov    r0, r7
0x40836e54 <_ZN13SharedRuntime17OSR_migration_endEPi+184>:    bl 0x407b6a94 <_ZN2os4freeEPv>
0x40836e58 <_ZN13SharedRuntime17OSR_migration_endEPi+188>:    mov    r0, sp
0x40836e5c <_ZN13SharedRuntime17OSR_migration_endEPi+192>:    bl 0x40578c5c <~JRT_Leaf_Verifier>
0x40836e60 <_ZN13SharedRuntime17OSR_migration_endEPi+196>:    mov    r0, r6
0x40836e64 <_ZN13SharedRuntime17OSR_migration_endEPi+200>:    bl 0x40596b04 <~NoHandleMark>
0x40836e68 <_ZN13SharedRuntime17OSR_migration_endEPi+204>:    add    sp, sp, #24    ; 0x18
0x40836e6c <_ZN13SharedRuntime17OSR_migration_endEPi+208>:    pop {r4, r5, r6, r7, r8, lr}
0x40836e70 <_ZN13SharedRuntime17OSR_migration_endEPi+212>:    bx    lr <------  and whoo, let's enjoy a trip to garbage memory!

So when the function that the JIT called returns through the stale lr, we find ourselves executing
garbage memory.

The small hack fixed this issue quite well, but it breaks ARMv4T compatibility for the moment, since BLX was only introduced with ARMv5T.

My next task would be to fix this properly in LLVM.
