
I had wondered why the L1 caches are not growing while L2 and L3 capacities continue to grow: a significant limitation on L1 cache size is actually the interaction between associativity and page size (i.e. the 4KB pages the OS allocates to processes).

Because a 4 KB page holds 64 cache lines, you can have at most 64 cache sets. With an 8-way associative cache this works out to 32 KB. Using 128 sets would cause aliasing, but with 64 sets the cache index is built entirely from the LSBs that index within the page (i.e. bits not used in the TLB lookup). Thus, the only ways to increase L1 capacity are to:
- totally abandon 4KB pages in favor of (e.g.) 2MB pages (not likely)
- increase cache associativity (likely imo)
- stop using virtual index + physical tag (not likely imo)
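For concreteness, the arithmetic above can be sketched in a few lines of Python (the constants are the conventional x86 values; the function name is just for illustration):

```python
# Sketch of the VIPT (virtually indexed, physically tagged) size constraint.
PAGE_SIZE = 4096   # bytes: the common 4 KB page
LINE_SIZE = 64     # bytes per cache line on modern x86

# Index bits must come from the page-offset bits, so the index is identical
# in the virtual and physical address (no aliasing, no TLB on the index path).
max_sets = PAGE_SIZE // LINE_SIZE          # 4096 / 64 = 64 sets

def max_l1_capacity(ways):
    """Largest alias-free VIPT L1 for a given associativity."""
    return max_sets * ways * LINE_SIZE

print(max_l1_capacity(8))    # 8-way  -> 32768 bytes (32 KB, the classic size)
print(max_l1_capacity(12))   # 12-way -> 49152 bytes (48 KB)
```

Note the capacity scales only with ways once the set count is pinned at 64.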



I think a simpler argument is that for L1 you want fast, not big. Same thing with registers (a form of cache at a lower level). Why did MIPS only have 32 registers?

Design Principle 2: Smaller is faster. [1]

BTW, if you look at Agner Fog's latency tables [2], mov r,mem (load) went from 3 cycles in Haswell to 2 cycles in Skylake. So Intel has been concentrating on faster, which is nice.

And by way of comparison, AMD increased their μop cache size in Ryzen, but only slightly: way size went from 6 μops to 8. This matches their increase in execution units.

[1] Patterson and Hennessy. Computer Organization and Design, 5th edition, p. 67.

[2] http://www.agner.org/optimize/instruction_tables.pdf


At a high level it's true that smaller is faster, but it's also true that those L1s could have grown by adding sets (not ways) and achieved the same latency. L2 has grown, but stayed iso-latency. This seems to say that "smaller is faster" does not always hold.

Always impressed that Agner Fog takes the time to publish his results. Pretty amazing. But I think focusing your thinking on the register count in MIPS or the uarch for some random opcode doesn't get at the real constraints on L1 cache design at all. One could say that x86 should be even faster, because hey, it has far fewer than 32 registers (or did historically at least).

My response is this: yes, the L1 has to be small to be fast, but it has been stuck at 32KB forever now. It could have grown! So it's not as simple as small is fast.


L1 size is probably constrained by a trade-off and by competition for area between different CPU parts. If it could be increased with an overall positive effect on performance (while still being economically competitive to build), I have no doubt Intel would do it... A big L1 is probably not crucial on a modern x86 arch, because very deep OOO queues, HT, speculative execution, prefetching, and all the other improvements to IPC and overall package perf need to retain some efficiency even when the L1 can't keep up anyway.

I also vaguely remember the Mill CPU guy talking about cache size constraints due simply to the speed of light, but given that node size has continued to decrease over the last decade while frequency has nearly stopped increasing, this might be less of an issue than basic area optimizations. Or it might be an interesting consideration on the Mill only because it is a radically different architecture and needs different area ratios.

Only wild guesses though; I haven't even tried to confirm any of this with any kind of research or back-of-the-envelope calculations.


x86_64 has 16 integer registers but Haswell has a 192 entry ROB. Skylake has 224. So Intel does increase these numbers. It's just that there has to be a good reason. In the 90s maybe something like clock speed could win a marketing spec battle. Not today.

I think at 6 transistors per bit we really aren't talking about a lot of die area. Still I'm stone cold certain the Intel architects would increase L1 cache size if that was beneficial, if it modeled out. (However they may want to keep performance similar+predictable unless there's a solid win.)

Agner is showing they've reduced L1 latency. So this smaller is faster seems to have gotten them something.

So you really have to work backwards and ask why they didn't/don't. There may be more than one reason; but they don't, and haven't in quite some time.

I'm an old-school assembly/compiler hack. I read Agner and the Intel Optimization Manual a lot. VTune, IACA and the PMCs. Someone has to do it.


I think maybe we were talking past each other. Yes there is more than one reason.

It's far easier to add capacity by adding sets, as opposed to ways. But they can't add sets in the L1 because of the aliasing problem. When they do increase L1 capacity, if nothing else has changed, then it will be by adding ways.
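A quick way to see the aliasing boundary is to look at which address bits the index needs (a toy sketch; bit numbering assumes 64-byte lines and 4 KB pages):

```python
# Which address bits form the set index, for a given set count?
# With 4 KB pages, bits 0-11 are the untranslated page offset.
LINE = 64
PAGE = 4096

def index_bits(sets):
    n = sets.bit_length() - 1          # log2(sets)
    lo = LINE.bit_length() - 1         # lowest index bit: 6 (above line offset)
    return lo, lo + n - 1              # inclusive bit range

print(index_bits(64))    # (6, 11)  -- entirely inside the page offset: safe
print(index_bits(128))   # (6, 12)  -- bit 12 is translated by the TLB: aliasing
```

With 128 sets, bit 12 enters the index, and that bit can differ between the virtual and physical address, which is exactly the aliasing problem described above.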


Increasing the register count spends opcode bits. That leads to fewer available instructions, or at a minimum constrains opcode optimization.


As we saw with AMD64, x86 is a variable-length ISA, up to 15 bytes per instruction, allowing for quite a flexible (and complex) encoding. With a fixed-width RISC, yeah, registers are going to eat into opcode space. And in both cases, register renaming will allow more renamed registers (180) than architectural registers.

BTW, renamed != ROB. I got that wrong above.


In variable-length architectures it will constrain opcode optimization, making your binaries larger (requiring more cache). It's not as big a problem as in fixed-length instruction machines, but adding named registers is never free.


It seems you know this subject very well :) https://github.com/etep/resume

On an unrelated subject, do you know if the next desktop generation (coffeelake?) will support AVX512 or should I just buy a skylake-x?


Yes (partially). See the table in this article:

https://www.kitguru.net/components/cpu/anton-shilov/intel-sk...


Coffee Lake is not the same as Cannon Lake. That chart is from 2015 and suggested Cannon Lake would be released in early 2017.

I doubt we'll see Cannon Lake this year, since they released Kaby Lake in early 2017...

https://en.wikipedia.org/wiki/Coffee_Lake https://en.wikipedia.org/wiki/Cannonlake


Since I don't see any flaw in your reasoning here, the obvious question becomes - why haven't they just moved to 16-way associative L1 yet? What are the hurdles?


The reason he gave is not the main reason L1s don't grow. The main reason is latency.

Increasing cache size grows latency in two ways: every doubling of the cache size adds one mux to the select path, and every doubling increases the wire delay from the most distant element by ~sqrt(2). Both of these add to the critical path, and spending more time on them would require increasing the cache latency.
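In rough numbers, that scaling argument can be put into a toy model (the unit delays here are arbitrary and purely illustrative; real values depend on the process and circuit design):

```python
import math

# Toy model: relative L1 access latency vs. number of cache-size doublings.
# Each doubling adds one mux stage to the select path, and the wire delay to
# the farthest element grows by ~sqrt(2) per doubling of area.
def rel_latency(doublings, mux_delay=1.0, base_wire=1.0):
    mux = doublings * mux_delay
    wire = base_wire * math.sqrt(2) ** doublings
    return mux + wire

for d in range(4):
    print(d, round(rel_latency(d), 2))
```

The point is just that both terms grow monotonically with size, so any capacity increase must be paid for somewhere on the critical path.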

The size of a cache is always a tradeoff against its latency. If this were not true, there would be only a single cache level that was both large and fast. But making something both large and fast is impossible, so instead we have stacked cache levels, starting with a very fast but small cache followed by increasingly slower and larger ones.


Hi, it is the main reason L1 hasn't grown.

By your reasoning, no cache should be able to grow, because then their latency would increase too much. But instead, all other CPU caches are growing basically with iso-latency. The reason this is possible is technology scaling. Anyway...

But yes, the L1 does have to be small and fast, but it doesn't have to be that small to be that fast. It has to be that small because of virtual indexing combined with the cost of adding ways breaking other design constraints (possibly a latency constraint, fine). But you could grow the L1 by adding sets and get your required latency.


> By your reasoning, no cache should be able to grow, because then their latency would increase too much. But instead, all other CPU caches are growing basically with iso-latency. The reason this is possible is technology scaling. Anyway...

The problem is that technology scaling only gives you roughly enough to keep up with the speed of the CPU. Cache latencies measured as nanoseconds keep going down, but cache latencies measured in clock cycles are pretty stagnant at the same sizes. And when Intel added some more L3, they also relaxed latency to it, and when they recently cut the latency to it a little, they did so by cutting the amount.

> But yes, the L1 does have to be small and fast, but it doesn't have to be that small to be that fast. It has to be that small because of virtual indexing combined with the cost of adding ways breaking other design constraints (possibly a latency constraint, fine). But you could grow the L1 by adding sets and get your required latency.

No, you couldn't. The added latency of increasing the size would be simply too much. I know for a fact that the L1 load latency is currently one of the most important critical paths in Intel CPUs -- any increase in L1 size would mean that you have to reduce clock speeds.


If it was easy to build high performance caches with high associativity, we would certainly see higher associativity. Ideally, you want a fully associative cache, but it's too expensive. In CPU caches, once a set is selected, all N associative ways are compared simultaneously. So growing associativity costs area and power for extra comparators. This growth could cause timing issues, i.e. add latency to the memory access or cause CPU freq. to be lowered.
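A software caricature of that lookup (a toy model, not any real design; the eviction policy is a placeholder, and the loop stands in for the hardware's parallel tag comparators):

```python
# Toy N-way set-associative cache: the index selects one set, then all N
# way tags are compared; hardware does the comparisons simultaneously,
# which is why each extra way costs comparator area and power.
LINE = 64
SETS = 64
WAYS = 8

class Cache:
    def __init__(self):
        # each set holds up to WAYS (tag -> data) entries
        self.sets = [dict() for _ in range(SETS)]

    def _split(self, addr):
        index = (addr // LINE) % SETS     # low bits, inside the page offset
        tag = addr // (LINE * SETS)       # remaining high bits
        return index, tag

    def lookup(self, addr):
        index, tag = self._split(addr)
        return self.sets[index].get(tag)  # hw: WAYS parallel tag compares

    def fill(self, addr, data):
        index, tag = self._split(addr)
        ways = self.sets[index]
        if len(ways) >= WAYS and tag not in ways:
            ways.pop(next(iter(ways)))    # evict (placeholder policy)
        ways[tag] = data
```

Going fully associative would mean every line in the cache needs its own comparator on every access, which is why it is reserved for tiny structures like TLBs.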


You can also build a memory system that is able to deal with the aliasing. Then you don't have the dependence on page size.



