SPARC CPUs do functionally have an on chip stack via register windows.
Basically they have something like 256 registers set up so the output registers overlap the input registers of the next callframe. e.g.
iiirrrooo
      iiirrrooo
            iiirrrooo
Where i == input register, r == general purpose register, and o == output register. When code calls another function it places the first N arguments in the output registers. The called function reads arguments out of the input registers.
This results in very high call and return performance, until you exhaust the available register windows. At that point you fault, and the kernel has to manually copy the register window stack to memory. Similarly, on return you may have reached the top of the on-chip stack, in which case you fault and the kernel has to copy the windows back from memory into the register windows.
For SPARC this was apparently an ok tradeoff as the "big iron" machines of the past generally did not recurse heavily.
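A minimal sketch of the overflow/underflow behavior described above, as a toy Python model. It only counts traps and does not model register contents. The window count of 8, the class name `WindowedRegisterFile`, and the helper `run` are assumptions made for illustration; real SPARC implementations differ in window count and in how many windows are usable.

```python
class WindowedRegisterFile:
    """Toy model: counts overflow/underflow traps, not register contents."""

    def __init__(self, num_windows=8):
        self.num_windows = num_windows  # hypothetical on-chip window count
        self.resident = 1               # windows currently held on chip
        self.spilled = 0                # windows the trap handler saved to memory
        self.overflow_traps = 0
        self.underflow_traps = 0

    def call(self):
        """Entering a function (SPARC `save`) claims one more window."""
        if self.resident == self.num_windows:
            # No free window: trap, and the handler spills the oldest one.
            self.overflow_traps += 1
            self.spilled += 1
            self.resident -= 1
        self.resident += 1

    def ret(self):
        """Leaving a function (SPARC `restore`) releases the current window."""
        self.resident -= 1
        if self.resident == 0 and self.spilled > 0:
            # The caller's window lives in memory: trap and fill it back in.
            self.underflow_traps += 1
            self.spilled -= 1
            self.resident += 1


def run(depth, num_windows=8):
    """Nest `depth` calls, unwind them all, and report (overflows, underflows)."""
    rf = WindowedRegisterFile(num_windows)
    for _ in range(depth):
        rf.call()
    for _ in range(depth):
        rf.ret()
    return rf.overflow_traps, rf.underflow_traps
```

In this model `run(7)` takes no traps, while `run(8)` takes one overflow and one underflow trap: a single extra level of nesting is enough to cross from the fast path to the slow one.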
One of the biggest problems with such an architecture is that you cannot predict whether a given function call will overflow the on-chip physical registers, and therefore will be expensive. So a simple change like adding a wrapper function, or a compiler choosing a different inline heuristic can affect the performance of your code in strange and unpredictable ways.
Context switches also get more expensive, because you have to swap out the entire windowed register file, and not just what is visible to software. (That can be mitigated by additional hardware support, but the complexity is high.)
Yeah, that's why it was considered acceptable for big iron/servers of the time.
There was not a huge amount of context switching (basically few processes relative to the number of CPUs in the system), and the types of programs were known.
They knew that, in general, most code they ran would not blow the stack. The cost, of course, was that if you ran code that was atypical for their designed purpose, your perf would be clobbered.
Xtensa (cf. ESP-32 et al. -- it's actually a pretty popular embedded architecture) also has a register window scheme, simultaneously more and less complicated than SPARC's.
Likewise the ia64 registers were rotating, with a hardware stack engine.
To answer the main question: this wasn't completely abandoned in the industry. Some CPUs still do it. It just doesn't really provide much of a competitive benefit to the architectures that do it, so it has largely fallen out of use.
You got the arguments backward. SPARC uses the proper order (source,dest)
So it would be something like:
ld [%l0],%o0 ; Load whatever is pointed to by %l0 into %o0
call some_function ; Stores the old pc in %o7
nop ; Delay slot of the call
; Result is now in %o0
some_function:
save ; Move the register window
; The argument is now in %i0
mov %i0,%l0 ; This does not overwrite the caller's l0
ret ; Return, essentially a jump to %i7+8
restore ; Called due to delay slot, restores the window
x86 does not rotate its registers (though the 8087 did). I said "ia64", which is the Itanium architecture. It actually had a mix of fixed, caller-save registers and rotating argument registers, and an outrageously complicated hardware engine for moving them to and from the stack.
Xtensa is simpler than Itanium, in that the windows are fixed 4-register sets and the spilling and filling is done in an exception handler. But it's more complicated than SPARC for sure. The details are in the "Xtensa Instruction Set Architecture Reference Manual", which is a PDF that is commonly available on the internet but not AFAIK actually distributed by Cadence itself.
In modern CPUs, the last few items on the stack rarely leave the vicinity of the CPU and its caches. The programmer's model shows them as being stored in addressable memory, but that's part of the illusion of superscalar cached CPUs.
I believe newer x86 actually has a set of registers that cache the top of the stack and is as fast as the other registers, but it's hard to find information about this feature. One of the ways in which this is visible is that using the push/pop instructions is faster than manually writing to memory at the stack pointer and then adjusting the stack pointer.
"The reason" why memory is slower than registers (apart from size obviously) is that register dependencies are static and can be easily tracked by the core. Memory is way harder since you don't know all the addresses in advance (any other random store could actually be storing to the stack).
That's why they have to use costly (associative) structures like the store queue.
The way cores track the stack is usually for return address prediction because it's a huge win for little cost and almost no well behaved program overwrites return addresses manually (low mispredict rate).
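A rough sketch of the return address prediction idea, assuming a small fixed-depth circular buffer that overwrites its oldest entry when calls nest too deeply. The depth of 4, the made-up addresses, and the names `ReturnAddressStack` and `count_mispredicts` are illustrative only and do not describe any specific core.

```python
class ReturnAddressStack:
    """Toy fixed-depth predictor; entries wrap around and overwrite the oldest."""

    def __init__(self, depth=4):
        self.entries = [None] * depth
        self.top = 0  # index of the next free slot, taken modulo depth

    def on_call(self, return_address):
        self.entries[self.top % len(self.entries)] = return_address
        self.top += 1

    def predict_return(self):
        self.top -= 1
        return self.entries[self.top % len(self.entries)]


def count_mispredicts(call_depth, ras_depth=4):
    """Nest `call_depth` calls, then return from all of them, counting bad guesses."""
    ras = ReturnAddressStack(ras_depth)
    actual = []  # the real return addresses, as the in-memory stack would hold them
    for level in range(call_depth):
        addr = 0x1000 + 4 * level  # made-up return addresses
        actual.append(addr)
        ras.on_call(addr)
    mispredicts = 0
    while actual:
        if ras.predict_return() != actual.pop():
            mispredicts += 1
    return mispredicts
```

The predictor never has to be right: the real return address still comes from memory, so a wrong guess costs a mispredict rather than a wrong result. In this model, nesting within the buffer depth predicts every return, and `count_mispredicts(6)` gets the two outermost returns wrong.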
As far as I know, all x86 cores emit 2 uops for a push or a pop (load/store + sp adjust). I guess it can save you some frontend bandwidth but that's about it.
> a set of registers that cache the top of the stack and is as fast as the other registers
I could have sworn I'd read something like this - the 'top' words of the stack always being kept in a very low-level cache - but I was unable to find a source on it.
It's a mechanism from a few real-world RISC CPUs that was used in MMIX, i.e. register windows which form a stack (RISC CPUs otherwise usually didn't have a stack other than by convention). One CPU that used that approach was SPARC, although the difference is that MMIX has variable-length windows.
This is not only an excellent answer but also a fantastic history of the development of the stack.
> "I've found no particular evidence that the stack pointer was made a full 16 bits because they felt any need for a stack to be that large. It's clear that at least some experienced microprocessor developers (the MOS 6502 team) felt that an 8 bit stack pointer (256 byte stack) was plenty. It's possible that the 8080 designers disagreed, or it's possible that they felt they couldn't force a particular area to be RAM, as the 6502 designers could. (Even more than the MC6800, the 6502 design strongly encouraged page $00 to be RAM, so forcing page $01 to be RAM was no hardship.) Or perhaps it just didn't occur to them that registers pointing into memory could be any less than 16 bits."
To add a little bit of detail to the preceding paragraph:
A page in 6502 parlance is a contiguous block of 256 bytes in address space. Page $00 are the first 256 bytes of address space, page $01 are bytes $0000 to $00FF and so on. Page $00 (so called zeropage) is treated specially by the processor and is therefore required to be backed by RAM (not ROM or IO).
The 6502 has a 16 bit address bus but only an 8 bit stack pointer. The stack was fixed at 256 bytes (a page) in size and fixed in location at $0100-$01FF (page $01).
So the point made above is that because the 6502 already required the zeropage to be RAM it was easy to require the RAM for the stack in a fixed place too, while the designers of other contemporary processors didn't have that luxury.
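A small sketch of how an 8 bit stack pointer maps onto the fixed stack page, as a toy Python model rather than an emulator. The class name `Stack6502` and the initial pointer value of $FF are assumptions; starting at $FF is a common software convention, not something the hardware guarantees.

```python
class Stack6502:
    """Toy model of the 6502 hardware stack: 8-bit SP, fixed to page $01."""

    STACK_PAGE = 0x0100

    def __init__(self):
        self.memory = bytearray(0x10000)  # the full 16-bit address space
        self.sp = 0xFF                    # assumed starting value (set by software)

    def push(self, value):
        # The effective address is always $0100 + SP; then SP decrements.
        self.memory[self.STACK_PAGE | self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF    # 8-bit wraparound: $00 - 1 -> $FF

    def pull(self):
        # SP increments first; then the byte at $0100 + SP is read.
        self.sp = (self.sp + 1) & 0xFF    # 8-bit wraparound: $FF + 1 -> $00
        return self.memory[self.STACK_PAGE | self.sp]
```

Because the pointer is only 8 bits wide, the 257th push silently wraps around and overwrites the byte at $01FF instead of leaving page $01.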
> Page $00 are the first 256 bytes of address space, page $01 are bytes $0000 to $00FF and so on.
Er, you mean page $01 is bytes $0100 to $01FF, right? And 6502 in general didn't use flat addressing, IIRC; eg `12FE,x` would address 12FE,12FF,1200,1201,... with increasing x register, rather than 1300,etc.
> Er, you mean page $01 is bytes $0100 to $01FF, right?
Yes, yes, thanks for the correction.
> And 6502 in general didn't use flat addressing, IIRC;
Not so sure about that.
> eg `12FE,x` would address 12FE,12FF,1200,1201,... with increasing x register, rather than 1300,etc.
This is correct but I think this is more due to the fact that the index register wraps around, and I would still call it flat addressing. A better argument to not call it flat would be the use of bank switching, which I think was common in 6502 based designs. But, yeah, this is probably just splitting hairs over terminology and I agree with you. Thanks for the corrections.
IA-64 (Itanium) had a register stack that was used for a lot of the things memory stacks are used for[1]. It's basically a set of virtual registers into a large register file that's then spilled out to main memory as necessary if the stack grows large enough.
So I don't think it's an idea that's completely gone away; it's probably just that until relatively recently the complexity involved in doing it with modern code (with deep stacks and large stack frames) was not really a worthwhile use of die space. And now x86 and ARM are so thoroughly dominant that other paradigms have difficulty getting traction.
My first computer was a wire wrapped home brew with a Signetics 2650 cpu, and a handful of 74LS logic and 1kx4 static ram chips. The Signetics 2650 cpu had an 8 level on chip return stack. It was a good microprocessor for the time, easy to interface, and generally nice to program. But, there was no way to extend the stack or even access it so it could only be used when the call depth, including interrupts, could be guaranteed never to exceed seven levels. It enjoyed some success as an embedded cpu, but was never used in personal computers, probably because of the stack limitation.
Curiously, this is an idea whose time has come again. Keeping the return address space (call stack) separate from the display containing addressable data could avoid some nasty security issues that plague most architectures.
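A toy illustration of that claim, assuming a frame layout where a fixed-size buffer sits directly below a 4-byte return address. The address `0x1234`, the layout, and all function names are made up for the sketch; it models only a linear buffer overflow, not the other ways a program can manipulate its own control flow.

```python
RETURN_ADDRESS = 0x1234  # made-up address for illustration


def unchecked_copy(memory, start, payload):
    """Mimics a C-style copy with no bounds check on the destination buffer."""
    for offset, byte in enumerate(payload):
        memory[start + offset] = byte


def return_target_unified(payload, buffer_size=8):
    """Buffer and return address share one addressable stack."""
    stack = bytearray(buffer_size + 4)
    stack[buffer_size:] = RETURN_ADDRESS.to_bytes(4, "little")
    unchecked_copy(stack, 0, payload)
    return int.from_bytes(stack[buffer_size:buffer_size + 4], "little")


def return_target_split(payload, buffer_size=8):
    """Return address lives on a separate stack that data writes cannot address."""
    return_stack = [RETURN_ADDRESS]          # reachable only via call/return
    data_stack = bytearray(buffer_size + 4)  # same size, but holds data only
    unchecked_copy(data_stack, 0, payload)
    return return_stack.pop()
```

With a 12-byte payload whose last four bytes encode `0xDEAD`, the unified version returns to `0xDEAD` while the split version still returns to `0x1234`. The overflow still corrupts whatever data follows the buffer in both cases.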
Seems to me the issue of overflows corrupting return addresses could be avoided if the stack just grew upwards instead of downwards. Wouldn't prevent data corruption from overflows, but then neither would what you're suggesting...
Forth CPUs worked like that. But it would break setjmp/longjmp, exception unwinding, debuggers, "green threads", goroutines, "async", coroutines - everything that messes with control flow from within the process. All that stuff assumes return points are in the same address space as the other data.
Some of that is kernel prerogative. Still could be coded. Just avoid user code manipulating stacks. The introduction of instructions to support those features is conceivable. Still while avoiding "unrestricted addressing of the entire call stack at will".
I don't think it actually avoids those nasty security issues--just makes attacking them somewhat harder. Userspace still needs to read and write the call stack to support features such as debugging and unwind support, not to mention green threads, so you can probably still leverage existing libraries in malicious ways (à la return-oriented programming) to do the stack manipulation necessary to get you to return-oriented programming attacks.
The call stack need only be inaccessible to normal data operations. If it is solely user-addressable with call and return instructions, the vulnerability is gone.
That's kind of the whole point - call/return can be vastly more secure than 'part of a data space' as it is now. Benefits would accrue. And challenges.
How would you implement something like _Unwind_Resume, which requires returning to a different return address that's based on but not identical to the one on the stack?
The Saturn CPU that was used in HP calculators back in the 90s had a hardware return stack, with only eight levels. It was only possible to use six levels safely, as the interrupt routine used the remaining two.
The hardware stack enabled nice tricks back in the day, and the whole CPU was a fun little beast.
4-bit bus and addressable units (nibbles), plenty of 64-bit registers, 20-bit addresses...
The x87 had (and still has) a stack. Itanium tried something as well.
The problem is that it does not really fit compiled code. The compiler usually has no clue how deep the stack is when a function is called (esp. with function pointers, virtual methods, etc.), so it will not or cannot generate optimal code for this. Plus you get the trouble with interrupts.
However, modern CPUs understand how the stack is used and will actually maintain an on-chip stack for you. That is, if you are using the standard function prologue and you don't play any tricks with the stack-modifying instructions.
Some current real-time processors still have an on-chip stack for return addresses, not entirely unlike the one the 8008 had. One example is the ARM Cortex-R which can store up to 4 addresses on its dedicated return stack.
The AT&T Hobbit chips were designed to run C, and were used in the AT&T EO portable tablets. https://en.wikipedia.org/wiki/AT%26T_Hobbit It was entirely stack-based. Another 'stack' machine was the HP-3000.