
I'm stumbling onto a strange Java performance irregularity, especially easy to hit with Docker on Linux. A super straightforward implementation of the sieve of Eratosthenes, using a Java BitSet for storage, sometimes drops in performance by 50%; even more with JDK8. Any Java/JDK/JVM experts here who can shed some light on the mystery for me? I've been at it a while, but it is quite a bit outside my field of knowledge, and I am out of ideas for how to proceed. The linked blog article plus the Git repository are an attempt to summarize/minimize things.
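
For readers who haven't opened the repo: the code under discussion is roughly of the shape below. This is only a sketch for context, not the repo's exact implementation:

    import java.util.BitSet;

    // Sieve of Eratosthenes: a set bit at index i means "i is composite".
    class SieveSketch {
        static int countPrimes(int limit) {
            BitSet composite = new BitSet(limit + 1);
            for (int p = 2; (long) p * p <= limit; p++) {
                if (!composite.get(p)) {
                    for (int m = p * p; m <= limit; m += p) {
                        composite.set(m);
                    }
                }
            }
            int count = 0;
            for (int i = 2; i <= limit; i++) {
                if (!composite.get(i)) {
                    count++;
                }
            }
            return count;
        }

        public static void main(String[] args) {
            // 78498 primes below one million -- the "count: 78498" in the results further down
            System.out.println(countPrimes(1_000_000));
        }
    }

The benchmark in the repo repeats such a pass for a fixed time budget and reports how many passes it managed.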


Do you have performance traces of good and bad runs to compare? Instead of trying to come up with more theories you should start by looking at where the time is being spent.

I'm not experienced with doing micro-optimisations in Java, but I assume you can profile it to find out which individual operations are taking up the time just via a sampling trace.

Given it's common on a particular OS setup, I wouldn't be surprised if it's in the System.currentTimeMillis() call that you're doing every loop.
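
On the profiling suggestion: on a reasonably recent JDK (11+) the built-in Flight Recorder gives you a sampling profile without extra tooling. A sketch, where PrimeSieveJava stands in for whatever the main class is actually called:

    # record a profile while the benchmark runs
    java -XX:StartFlightRecording=duration=60s,filename=sieve.jfr PrimeSieveJava

    # dump the sampled stacks to see where the time goes
    jfr print --events jdk.ExecutionSample sieve.jfr

Opening sieve.jfr in JDK Mission Control shows the same data with a nicer UI.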


Yes, I should try to profile each run and compare the profiles of the bad and good ones.

Would be quite funny if it's the System.currentTimeMillis() call where this happens. I'd try to figure out a way to remove that from the picture...
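
One simple way to take the timer out of the hot loop is to run a fixed number of passes and read the clock only twice. A rough sketch, where runSieve() is a hypothetical stand-in for one pass of the benchmark:

    class FixedPassTimer {
        // Hypothetical stand-in for one sieve pass.
        static void runSieve() {
            // ... sieve work ...
        }

        public static void main(String[] args) {
            int passes = 10_000;
            long start = System.nanoTime();
            for (int i = 0; i < passes; i++) {
                runSieve(); // no time-related calls inside the loop
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(passes + " passes in " + elapsedMs + " ms");
        }
    }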


I'm not familiar with Java profiling. What's the alternative?


I've now made an experiment that removes System.currentTimeMillis() as a suspect: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...


> System.currentTimeMillis() is pretty platform-dependent, so it's a good place to look


You could try running it with the Linux perf command.

`perf stat -d java ...`

If your CPU supports performance counters it should give you L1 cache misses/branch misses/etc. That might give you some insight into what is different between the runs. Someone else mentioned it could be the memory alignment. I think Java might allocate with 8-byte alignment, and maybe something funny goes on with the L1 caching if the BitSet allocation is not 16-byte or 32-byte aligned.

If you are running it within Docker you might need to use: `--security-opt seccomp:unconfined --cap-add SYS_PTRACE --cap-add CAP_SYS_ADMIN`
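
On the alignment idea: if you want to see where the BitSet's backing long[] actually ends up, the OpenJDK JOL library (an external dependency, org.openjdk.jol:jol-core) can print object addresses and sizes. A rough sketch; addresses can shift when the GC moves objects, so treat the output as indicative:

    import java.util.BitSet;

    import org.openjdk.jol.info.GraphLayout;

    class AlignmentPeek {
        public static void main(String[] args) {
            BitSet bits = new BitSet(1_000_000);
            // Prints the BitSet and its backing long[] with addresses,
            // from which the array's alignment can be read off.
            System.out.println(GraphLayout.parseInstance(bits).toPrintable());
        }
    }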


This would be my first go-to as well. See if you are retiring fewer instructions, and if so, see if the counters tell you what the CPU is stalling on versus the faster runs.


I’m not a Java expert, but could it potentially be due to the OS CPU/memory scheduler?

Try running the workload with pinned CPUs assigned and not across NUMA nodes.
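
On Linux that can be done from the command line with taskset and/or numactl. The core and node numbers below are just placeholders, as is the class name:

    # pin the JVM to two specific cores
    taskset -c 2,3 java PrimeSieveJava

    # keep both CPU and memory on one NUMA node (multi-socket machines)
    numactl --cpunodebind=0 --membind=0 java PrimeSieveJava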


> Try running the workload with pinned CPUs assigned

It will be this.

I've run into the same weirdness on other things and this always solves it. Some cores are better at some things than others.


It was indeed this.


I think your Ryzen CPU is being throttled. It is not able to sustain its burst speed. You should keep an eye on CPU temp and clock.

My benchmarks on hardware that has less single-thread power than yours:

Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz - Linux - Java 8

0 - Passes: 3200, count: 78498, Valid: true
1 - Passes: 3197, count: 78498, Valid: true
2 - Passes: 3200, count: 78498, Valid: true
3 - Passes: 3202, count: 78498, Valid: true
4 - Passes: 3207, count: 78498, Valid: true
5 - Passes: 3207, count: 78498, Valid: true
6 - Passes: 3202, count: 78498, Valid: true
7 - Passes: 3205, count: 78498, Valid: true
8 - Passes: 3209, count: 78498, Valid: true
9 - Passes: 3178, count: 78498, Valid: true

Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz - Linux - Java 18

0 - Passes: 3445, count: 78498, Valid: true
1 - Passes: 3443, count: 78498, Valid: true
2 - Passes: 3408, count: 78498, Valid: true
3 - Passes: 3449, count: 78498, Valid: true
4 - Passes: 3439, count: 78498, Valid: true
5 - Passes: 3442, count: 78498, Valid: true
6 - Passes: 3450, count: 78498, Valid: true
7 - Passes: 3445, count: 78498, Valid: true
8 - Passes: 3438, count: 78498, Valid: true
9 - Passes: 3447, count: 78498, Valid: true


UPDATING again now that the solution is found, thanks to user CraigJPerry. The blog post is updated with some details: https://blog.agical.se/en/posts/java-bitset-performance-myste...


Up until a few years ago, Java couldn't see the correct amount of memory allocated to the container. https://blog.softwaremill.com/docker-support-in-new-java-8-f...
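
For what it's worth, newer JDK 8 builds (8u191+) and JDK 10+ do read the container's limits, and the heap can also be sized against that limit explicitly. An illustrative invocation (class name is a placeholder):

    # cap the heap at 75% of the container's memory limit
    java -XX:MaxRAMPercentage=75.0 PrimeSieveJava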


Updating this to thank you all for this amazing help!

I've been mostly afk, and haven't been able to act on all the clues yet, but I have conducted some of the experiments and tried to summarize them in a reproducible way here: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...


Are you testing on an x64 laptop? It could possibly be CPU throttling. Maybe try running a version of the benchmark written in a non-JVM language to see if you get the same behaviour?


It's a mini-pc. Could be throttling. There are benchmarks written in tons of languages here: https://github.com/PlummersSoftwareLLC/Primes I've so far only seen this behaviour with the JVM involved.


Would also be interesting to see if this is a Java issue or a JVM implementation issue: i.e., does an OpenJ9-based JVM (available under the weird IBM Semeru name here [1]) have similar behaviour?

[1]: https://developer.ibm.com/languages/java/semeru-runtimes/


A datapoint I just discovered. On JDK7 the performance is stable at the 0.5X level. It starts to toggle between 0.5X and 1X on JDK8.


Could it be that "0.5x" speed is actually 1x, and you're seeing your algorithm run at 2x sometimes on JDK8+ due to some optimization?

(Not a JVM user, so take with a grain of salt).


Yes, 1X and 0.5X are a bit arbitrary. I mean 1X as the speed I know it can run at. So it could well be that with JDK7 the optimizations could not happen, and the question then is why the optimizations do not always happen.

Making each run longer, 1 minute instead of 5 seconds, displays the same performance toggling:

0 - Passes: 72705, count: 78498, Valid: true
1 - Passes: 33140, count: 78498, Valid: true
2 - Passes: 71867, count: 78498, Valid: true
3 - Passes: 33137, count: 78498, Valid: true
4 - Passes: 72264, count: 78498, Valid: true
5 - Passes: 72062, count: 78498, Valid: true
6 - Passes: 33021, count: 78498, Valid: true
7 - Passes: 33053, count: 78498, Valid: true
8 - Passes: 71902, count: 78498, Valid: true
9 - Passes: 33024, count: 78498, Valid: true

(JDK17 in this case)
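
One cheap way to probe the "optimizations do not always happen" theory is to compare the JIT compilation log of a fast run against a slow one. The class name below is a placeholder:

    # prints one line per JIT compilation event; diff a fast run against a slow run
    java -XX:+PrintCompilation PrimeSieveJava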


I see you have a bright future in marketing: the actual facts didn’t change at all, but it now sounds twice as fast!


Are you certain it’s the jvm? Have you tried this with another lang/vm?


Not certain. But there are a lot of implementations very similar to mine in tons of other languages and VMs. None display this, AFAIK. Also, I get different behavior with JVM/JDK 7.



