Hacker News | aengelke's comments

That's not generally correct. Compile-time is a concern for several databases.

Most systems submit many of the same queries over and over again.

Ad-hoc one-off queries can usually accept a higher up-front compile cost, because the main results usually take much longer anyway, versus worrying about an extra 100ms of compilation.

Maybe it was too strong to say it's not a concern at all, but it's nothing like PG, where every single request needs to be replanned and potentially JIT-compiled unless the client manually prepares the statement and keeps the connection open.


> It's very difficult to do low-latency queries if you cannot cache the compiled code

This is not too difficult, it just requires a different execution style. Salesforce's Hyper for example very heavily relies on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has an extremely low start-up latency; an interpreter exists but is practically never used.

Postgres is very robust and very powerful, but simply not designed for fast execution of queries.

Disclosure: I work in the group that develops Umbra.

[1]: https://umbra-db.com/


If I recall the research papers regarding Umbra correctly, it also uses AsmJit as a JIT backend, which means that, theoretically, the compilation times would be comparable if you only consider code-emission overhead.

The problem will always be queries where the compilation is orders of magnitude more expensive than the query itself -- an indexed lookup of one or a few entries, for instance. Accessing indexed entries like these is very well optimized by SQL query engines and possibly makes no sense to JIT-compile.


I'm a bit late, but: Umbra hasn't used AsmJit in many years; it was too slow.

Interesting... AsmJit is pretty fast at compilation, but about 3x slower than sljit. The only way I can see to make it fast enough in theory (i.e., without slowing down point-lookup queries and such) would be to fuse planning with code generation -- essentially a single-pass plan builder plus compiler. Not sure if Umbra tries to do that, and AsmJit is not the best choice for it anyway, but with sljit it could be on par with an interpreter even for the fastest queries, I believe. Pretty hard (likely impossible) to implement, though; planning is inherently a non-linear process...
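To illustrate what fusing planning with code generation could look like (purely a sketch -- the names `Op`, `Step`, and `compile_scan_filter` are hypothetical and unrelated to Umbra or sljit): each planning decision immediately appends executable steps, so no separate plan tree is ever materialized. A tiny stack interpreter stands in for generated machine code:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

enum class Op { LoadCol, PushConst, Gt, And };

struct Step { Op op; int64_t arg; };

// Single pass: while "deciding" which columns and predicates the scan
// needs, we immediately emit executable steps -- no plan tree in between.
std::vector<Step> compile_scan_filter(
        const std::vector<std::pair<int, int64_t>>& preds) {
    std::vector<Step> code;
    bool first = true;
    for (auto [col, k] : preds) {            // planning decision...
        code.push_back({Op::LoadCol, col});  // ...emits code right away
        code.push_back({Op::PushConst, k});
        code.push_back({Op::Gt, 0});
        if (!first) code.push_back({Op::And, 0});
        first = false;
    }
    return code;
}

// Stand-in for the generated code: evaluates the filter against one row.
bool eval(const std::vector<Step>& code, const std::vector<int64_t>& row) {
    std::vector<int64_t> st;
    for (const Step& s : code) {
        switch (s.op) {
            case Op::LoadCol:   st.push_back(row[s.arg]); break;
            case Op::PushConst: st.push_back(s.arg); break;
            case Op::Gt: { int64_t b = st.back(); st.pop_back();
                           int64_t a = st.back(); st.pop_back();
                           st.push_back(a > b); break; }
            case Op::And: { int64_t b = st.back(); st.pop_back();
                            int64_t a = st.back(); st.pop_back();
                            st.push_back(a && b); break; }
        }
    }
    return st.back() != 0;
}
```

The non-linear parts of planning (join ordering, index selection) are exactly what does not fit this one-pass shape, which is the catch mentioned above.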

Because pg_jitter uses AsmJit's Compiler, which also allocates registers. That's much more work than using hardcoded physical registers, as in the SLJIT case. There is always a cost for such comfort.

I think AsmJit's strength is the completeness of its backends, as you can emit nice SIMD code with it (like AVX-512). The performance could of course be better, and that's achievable -- making it 2x faster should be possible.


There are other issues with that automatic allocation. I tested all three backends on very large queries (hundreds of KBs per query). Performance of all of them (+LLVM, but -sljit) was abysmal -- the compiler overhead ran from seconds to tens(!) of seconds. They have some non-linear components in their optimization algorithms, while sljit scaled linearly and was almost as fast as for smaller queries. So yes, automatic allocation gives higher run-time performance, but the cost of that performance grows non-linearly with code size and complexity, while you can still get good performance with manual allocation. I also don't believe you can make AsmJit 2x faster without sacrificing that auto-allocation algorithm.

AsmJit has only one place where a lot of time is spent: bin-packing. It's the least optimized part and has quadratic complexity (at the moment), which starts to show when you have hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function with 16MB of code, 65k labels, and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is just the time to call `emit()`).

If this function is optimized, or switched to some other implementation when there are tens of thousands of virtual registers, you would get orders-of-magnitude faster compilation.

But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For comparison, we are talking about 25ms for a single function with 1MB of machine code, and sub-millisecond times when you generate tens of KB of machine code.
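For intuition on why quadratic work over virtual registers blows up at this scale (an illustrative toy, not AsmJit's actual bin-packing algorithm): counting interferences between live ranges pairwise is O(n^2), while an event sweep gets the same answer in O(n log n):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Interval { int start, end; };  // half-open live range [start, end)

// Quadratic: compare every pair of live ranges. Fine for a few hundred
// virtual registers, painful at hundreds of thousands.
long pairs_quadratic(const std::vector<Interval>& iv) {
    long pairs = 0;
    for (size_t i = 0; i < iv.size(); ++i)
        for (size_t j = i + 1; j < iv.size(); ++j)
            if (iv[i].start < iv[j].end && iv[j].start < iv[i].end)
                ++pairs;
    return pairs;
}

// O(n log n) sweep: sort start/end events; every range that starts
// interferes with each range currently live.
long pairs_sweep(const std::vector<Interval>& iv) {
    std::vector<std::pair<int, int>> ev;  // (coordinate, +1 start / -1 end)
    for (const Interval& r : iv) {
        ev.push_back({r.start, +1});
        ev.push_back({r.end, -1});
    }
    // At equal coordinates, ends (-1) sort before starts (+1), which is
    // exactly right for half-open intervals.
    std::sort(ev.begin(), ev.end());
    long pairs = 0, live = 0;
    for (auto [pos, delta] : ev) {
        if (delta > 0) { pairs += live; ++live; }
        else           { --live; }
    }
    return pairs;
}
```

Real register allocation needs far more than interference counts, but the scaling difference between the two loops is the point.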

So from my perspective, the ability to generate SIMD code that the CPU executes fast in inner loops is much more valuable than anything else. Any CPU-bound workload deserves this. The question is how CPU-bound the workload is. I would imagine databases like Postgres are more memory-bound if you are processing huge rows and accessing only a tiny part of each row -- that's why columnar databases are so popular, though of course they have different problems.

I worked on one project that tried to deal with this by using buckets and hashing: there would be 16 buckets, and each column would be hashed into one of them, to place columns closer to each other, so the query engine only needs to load the buckets used in the query. But we are talking about gigabytes of raw throughput per core in this case.
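A minimal sketch of that bucket idea (hypothetical code, not the actual project; `bucket_of` and its hash are made up): hash each column ID into one of 16 buckets, and a query only loads the buckets its columns fall into:

```cpp
#include <cstdint>
#include <set>
#include <vector>

constexpr int kBuckets = 16;

// Hypothetical mapping: any cheap mixing function would do here.
int bucket_of(uint32_t column_id) {
    uint32_t h = column_id * 2654435761u;  // Knuth multiplicative hash
    return static_cast<int>(h % kBuckets);
}

// Which buckets does a query touching these columns need to load?
// At most min(#columns, kBuckets) buckets, instead of the whole row.
std::set<int> buckets_for_query(const std::vector<uint32_t>& columns) {
    std::set<int> needed;
    for (uint32_t c : columns) needed.insert(bucket_of(c));
    return needed;
}
```

The win is purely locality: columns that are queried together tend to share buckets, so far fewer bytes stream through the cache per row.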


I have a test with a 200KB query that AsmJit takes 7 seconds to compile (that's not too bad -- both LLVM and MIR take ~20s), while sljit does it in 50ms. 200KB is a pathological case, but it's not unheard of in the area I'm working in. It's realistic, although rare. In the last 10-15 years, most OLTP workloads have become CPU-bound, because the active datasets of most real databases fully fit in memory. There are exceptions, of course.

That's interesting -- 200KB should not be a big deal for it. Maybe it uses something that I usually don't, like many function calls or an insane number of branches... I would be interested in that case, but I'm not sure whether I would be able to blindly improve AsmJit without a comprehensive test.

Definitely good to know though. When it comes to low-latency compilation my personal goal is to make it even faster when generating small functions.


SLJIT is a bit smarter than just using hardcoded registers. It's multi-platform anyway, so it uses registers when they are available on the target platform; if not, it falls back to memory. That's why performance can differ between Windows and Linux on x64, for example -- a different number of available registers.

Indeed, but this also means that you get drastically different performance on platforms that have more physical registers versus platforms that have fewer. For example, x86-64 only has 16 GP registers, while AArch64 has 32 -- if you use 25 registers without any analysis and just go to the stack for 10 of them, the difference could be huge.

But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.


It's definitely different, and for Postgres specifically, they may complement each other. SLJIT can be used for low-latency queries where codegen time is more important than optimizations, and also for other platforms like s390x / PPC / SPARC, etc. AsmJit can be used for SIMD optimizations on x86-64 and ARM64. MIR is kind of in the middle: it does automatic register allocation and doesn't support SIMD, but it's also multi-platform. The only thing that doesn't fit well here is LLVM :). It has advantages in some edge cases, but... it really needs a separate provider; the current one is bad. I'll probably create another LLVM backend for pg_jitter in the future to utilize it properly...

Good point about SIMD opportunities, though -- it's something the other two JITs lack.

> What's the rationale?

Gift cards are used by phishers. At our institution, we routinely get personalized spam mails (in the name of the recipient's group lead, sent via GMail -- this is not low-effort) that ask whether the recipient is available and, when they (accidentally) respond, ask for Apple gift cards.


My coworkers report these to me every single business day. They’re usually like:

> Hey, it’s me, your CEO. I’m in a meeting with our big customer and I need an urgent favor. Thanks! You’re a life saver.

> - Mr. CEO


I fully agree, but:

> these are the string instructions like REP MOVSB

AArch64 nowadays has somewhat similar CPY* and SET* instructions. Does that make AArch64 CISC? :-) (Maybe REP SCASB/CMPSB/LODSB (the latter being particularly useless) is a better example.)


> LEA happens to be the unique instruction where the memory operand is not dereferenced

Not quite unique: the now-deprecated Intel MPX instructions had similar semantics, e.g. BNDCU or BNDMK. BNDLDX/BNDSTX are even weirder as they don't compute the address as specified but treat the index part of the memory operand separately.


Been there, done that during my PhD (code: [1]). It works reasonably well, except for compile times (for which I implemented a caching strategy). However, due to calling conventions, using LLVM isn't going to give the best possible performance. Some features like signal handling are extremely hard to implement with LLVM (therefore, I didn't). Although the overall performance results have been good, it's not an approach that I could strongly recommend.

[1]: https://github.com/aengelke/instrew


The same site hosts [1], but that's not nearly as nice as the 32-bit version. It's also a bit outdated.

[1]: https://www-user.tu-chemnitz.de/~heha/hs/chm/x86.chm/x64.htm


Thanks. Looks like the original now has some clarifications, including more detail regarding the REX prefixes: https://wiki.osdev.org/X86-64_Instruction_Encoding


sandpile.org is your friend.


> I’d suggest starting with arm

I agree: AArch64 is a nice instruction set to learn. (Source: I taught ARMv7, AArch64, x86-64 to first-year students in the past.)

> how simple instruction encoding is on arm64

Having written encoders, decoders, and compilers for AArch64 and x86-64, I disagree. While AArch64 is, in my opinion, very well designed (also better than RISC-V), it's certainly not simple. Here are some of my favorite complexities:

- Many instructions have (sometimes very) different encodings. While x86 has a more complex encoding structure, most encodings follow the same structure and are therefore remarkably similar.

- A huge number of instruction operand types: memory + register, memory + unsigned scaled offset, memory + signed offset, optionally with pre/post-increment, but every instruction supports a different subset; vector, vector element, vector table, vector table element; sometimes a general-purpose register encodes a stack pointer, sometimes a zero register; various immediate encodings; ...

- Logical immediate encoding. Clever, but also very complex. (To be sure that I implemented the decoding correctly, I brute-force test all inputs...)

- Register constraints: MUL (by element) with 16-bit integers has a register constraint on the lowest 16 registers. CASP requires an even-numbered register. LD64B requires an even-numbered register less than 24 (it writes Xt..Xt+7).

- Many more instructions: AArch64 SIMD (even excluding SVE) has more instructions than x86 up to and including AVX-512. SVE/SME takes this to another level.
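The logical-immediate point above can be made concrete. Here is a sketch of the decoding of the N:immr:imms fields used by AND/ORR/EOR (immediate), following the DecodeBitMasks pseudocode from the Arm ARM, simplified to always produce a 64-bit mask -- and brute-force testing against a reference is indeed the sanest way to validate something like this:

```cpp
#include <cstdint>
#include <optional>

// Decode an AArch64 logical immediate (n: 1 bit, immr/imms: 6 bits each).
// Returns std::nullopt for reserved (invalid) encodings.
std::optional<uint64_t> decode_logical_imm(unsigned n, unsigned immr,
                                           unsigned imms) {
    // Element size: highest set bit of the 7-bit value N:NOT(imms).
    unsigned combined = (n << 6) | (~imms & 0x3F);
    if (combined == 0) return std::nullopt;
    int len = 0;
    while (combined >> (len + 1)) ++len;   // index of highest set bit
    if (len < 1) return std::nullopt;      // element size 1 is reserved
    unsigned esize = 1u << len;            // 2, 4, 8, 16, 32, or 64
    unsigned levels = esize - 1;
    unsigned s = imms & levels;
    unsigned r = immr & levels;
    if (s == levels) return std::nullopt;  // all-ones element is reserved

    // Element of s+1 ones, rotated right by r within esize bits...
    uint64_t welem = (1ull << (s + 1)) - 1;  // s+1 <= 63, shift is safe
    uint64_t emask = (esize == 64) ? ~0ull : ((1ull << esize) - 1);
    uint64_t elem = welem;
    if (r != 0)
        elem = ((welem >> r) | (welem << (esize - r))) & emask;

    // ...replicated across all 64 bits.
    uint64_t result = 0;
    for (unsigned i = 0; i < 64; i += esize) result |= elem << i;
    return result;
}
```

For example, n=0, immr=0, imms=0x30 selects an 8-bit element with a single one bit, giving the repeating pattern 0x0101010101010101.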


A32 is simpler, but looking at A64 instructions certainly raises the question: "they still call this a RISC?"


Actually, nowadays Arm describes the ISA as a load-store architecture. The RISC vs. CISC debate is, in my opinion, pretty pointless nowadays, and I'd prefer it if we just stopped using these words to describe ISAs.


Hey, if you can't write "ADD [r0], r1" but instead have to do "LDR r2, [r0]; ADD r2, r1; STR r2, [r0]", it means it's still RISC.


TPDE co-author here. Nice work, this was easier than expected; so we'll have better upstream ORC support soon [1].

The benchmark is suboptimal in multiple ways:

- Multi-threading just makes things slower. When enabling multi-threading, LLJIT clones every module into a new context before compilation, which is much more expensive than the compilation itself. There's also no way to disable this. This causes a ~1.5x (LLVM)/~6.5x (TPDE) slowdown (very rough measurement on my laptop).

- The benchmark compares against the optimizing LLVM back-end, not the non-optimizing back-end, which would be a fairer comparison (code: JTMB.setCodeGenOptLevel(CodeGenOptLevel::None);). Additionally, enabling FastISel helps (command line -fast-isel; setting the TargetOption EnableFastISel seems to have no effect). This gives LLVM a 1.6x speedup.

- The benchmark is not really representative, as it causes FastISel fallbacks to SelectionDAG in some very large basic blocks -- i24 occurs rather rarely in real-world code. This is the reason why the speedup from the unoptimizing LLVM back-end is so low. Replacing i24 with i16 gives LLVM another 2.2x speedup. (Hint: to get information on FastISel fallbacks, enable FastISel and pass the command line options "-fast-isel-report-on-fallback -pass-remarks-missed=sdagisel" to LLVM. This is really valuable when optimizing for compile times.)

So we get ~140ms (TPDE), ~730ms (LLVM -O0), or 5.2x improvement. This is nowhere near the 10-20x speedup that TPDE typically achieves. Why? The new bottleneck is JITLink, which is featureful but slow -- profiling indicates that it consumes ~55% of the TPDE "compile time" (so the net compile time speedup is ~10x). TPDE therefore ships its own JIT mapper, which has fewer features but is much faster.

LLVM is really powerful, and while it's not particularly fast to begin with, the JIT API makes it extremely difficult to keep it from being extra-slow, even for LLVM experts.

[1]: https://github.com/tpde2/tpde/commit/29bcf1841c572fcdc75dd61...


Please note that the post didn't mention the word "benchmark" a single time ;) It does a "basic performance measurement" of "our csmith example". Anyway, thanks for your notes; they are very welcome and valid.

Comparing TPDE against the default optimization level in ORC is not fair (because that is indeed -O2), but that's what we get off the shelf. I tested the explicit FastISel setting and it didn't help on the LLVM side, as you said. I didn't try the command-line option, though -- thanks for the tip! (Especially -pass-remarks-missed will be useful.)

And yeah, csmith doesn't really generate representative code, but again, that was not stated either. I didn't dive into JITLink, as it would be a whole post on its own, but yes, feature-completeness prevailed over performance here as well -- that seems characteristic of LLVM and isn't so surprising :)

Last but not least, yes, multi-threading isn't working as well as the post indicates. This seems related to the fix that JuliaLang made for the TaskDispatcher [1]. I will correct this in the post and see which other points can be addressed in the repo.

Looking forward to your OrcCompileLayer in TPDE!

[1] https://github.com/JuliaLang/julia/pull/58950


Or rather: There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

(source: https://martinfowler.com/bliki/TwoHardThings.html)

