On the PC side, start by reading some basics like https://archive.org/details/URP_8th_edition/ (newer editions require logging in and borrowing).

>What I'm looking for is a description of how a CPU tells a GPU to start executing a program. Through what means do they communicate - a bus? How does such a communication instance look like?

A long time ago you would memory-map the framebuffer and just write directly to it.
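To make "just write directly to it" concrete, here is a minimal sketch in C. It assumes a 320x200, 8-bit linear framebuffer (the VGA mode 13h layout); `put_pixel` and `fill_rect` are names I made up, and `fb` is any buffer you pass in, so it runs without real video memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 320x200, one byte per pixel, rows stored back to back. */
enum { FB_WIDTH = 320, FB_HEIGHT = 200 };

/* On real hardware `fb` would point at the mapped video memory.
   `volatile` keeps the compiler from dropping or merging the stores. */
void put_pixel(volatile uint8_t *fb, int x, int y, uint8_t color)
{
    assert(fb != NULL);
    if (x < 0 || x >= FB_WIDTH || y < 0 || y >= FB_HEIGHT)
        return;                               /* clip instead of scribbling */
    fb[(size_t)y * FB_WIDTH + (size_t)x] = color;
}

/* Without an accelerator the CPU touches every single pixel itself. */
void fill_rect(volatile uint8_t *fb, int x0, int y0, int w, int h, uint8_t color)
{
    for (int y = y0; y < y0 + h; y++)
        for (int x = x0; x < x0 + w; x++)
            put_pixel(fb, x, y, color);
}
```

The inner loop in `fill_rect` is exactly the work the 2D accelerators took over.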

Then the first 2D acceleration showed up in 1987 in the form of the IBM 8514 (later cloned by ATI/Matrox/S3/Tseng and others). You wrote commands one at a time using I/O port access to a FIFO, polling for idle/full, with no direct access to the framebuffer http://www.os2museum.com/wp/the-8514a-graphics-accelerator/
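Roughly what that polled FIFO loop looks like, as a sketch in C against a simulated device. Nothing here is the real 8514 register map: the port numbers, the status bit, the FIFO depth and the drain rate are all invented, and `port_in16`/`port_out16` are stand-ins for the x86 IN/OUT instructions so the code runs anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Everything here is made up for illustration: port numbers, status bit,
   FIFO depth and the drain rate of the simulated device. */
enum { PORT_STATUS = 0x100, PORT_CMD = 0x102 };
enum { FIFO_DEPTH = 8, STATUS_FIFO_FULL = 0x1 };

typedef struct {
    int count;          /* commands currently queued in the FIFO */
    unsigned consumed;  /* commands the simulated hardware has finished */
    unsigned polls;     /* status reads issued by the CPU */
    uint16_t last_cmd;  /* most recent command written */
} sim_device;

/* Stand-in for the x86 IN instruction on the status port. */
uint16_t port_in16(sim_device *dev, uint16_t port)
{
    assert(port == PORT_STATUS);
    uint16_t status = (dev->count == FIFO_DEPTH) ? STATUS_FIFO_FULL : 0;
    dev->polls++;
    /* Pretend the hardware retires one command every third status read. */
    if (dev->polls % 3 == 0 && dev->count > 0) {
        dev->count--;
        dev->consumed++;
    }
    return status;
}

/* Stand-in for the x86 OUT instruction on the command port. */
void port_out16(sim_device *dev, uint16_t port, uint16_t value)
{
    assert(port == PORT_CMD);
    assert(dev->count < FIFO_DEPTH);  /* a write to a full FIFO would be lost */
    dev->last_cmd = value;
    dev->count++;
}

/* The driver side: poll until there is room, then push one command. */
void submit_commands(sim_device *dev, const uint16_t *cmds, int n)
{
    for (int i = 0; i < n; i++) {
        while (port_in16(dev, PORT_STATUS) & STATUS_FIFO_FULL) {
            /* busy-wait: the CPU does nothing useful here */
        }
        port_out16(dev, PORT_CMD, cmds[i]);
    }
}
```

Once the FIFO fills up, `polls` grows faster than the number of commands submitted, which is the CPU time wasted on waiting.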

The next evolution was MMIO (memory-mapped I/O). You no longer executed dedicated CPU I/O instructions (assembler IN/OUT); the I/O ports were simply addresses in memory. You still had FIFOs and wrote one command at a time http://www.o3one.org/hwdocs/video/voodoo_graphics.pdf
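With MMIO the same idea becomes plain loads and stores through a `volatile` pointer. The sketch below is only an illustration: the `gpu_regs` layout, the "free FIFO entries in the low 6 bits of status" convention and `draw_triangle` are my assumptions, not the register map from the linked Voodoo document. On real hardware `regs` would point at the card's mapped register window; here it can point at an ordinary struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block. Each field is one 32-bit device register. */
typedef struct {
    volatile uint32_t status;        /* low 6 bits: free FIFO entries (assumed) */
    volatile int32_t  vertex[6];     /* ax, ay, bx, by, cx, cy */
    volatile uint32_t color;
    volatile uint32_t triangle_cmd;  /* a write here kicks off the draw */
} gpu_regs;

enum { STATUS_FIFO_FREE_MASK = 0x3F, WRITES_PER_TRIANGLE = 8 };

void draw_triangle(gpu_regs *regs, const int32_t xy[6], uint32_t color)
{
    assert(regs && xy);
    /* Still polling, but now it is an ordinary memory read. */
    while ((regs->status & STATUS_FIFO_FREE_MASK) < WRITES_PER_TRIANGLE) {
        /* wait for room in the FIFO */
    }
    for (int i = 0; i < 6; i++)
        regs->vertex[i] = xy[i];     /* parameters first ... */
    regs->color = color;
    regs->triangle_cmd = 1;          /* ... the command write goes last */
}
```

The ordering matters: the parameter registers are filled first and the write to the command register is what tells the chip to go.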

Then someone threw DMA into the mix. Now you could DMA contents of a circular buffer filled with your commands http://www.bitsavers.org/components/s3/DB019-B_ViRGE_Integra...
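A sketch of such a circular command buffer in C, under my own assumptions: a 16-entry ring, 32-bit commands, and `device_consume` standing in for the card's DMA engine. None of the names come from the S3 documentation. The CPU only ever advances `tail`, the device only ever advances `head`.

```c
#include <assert.h>
#include <stdint.h>

enum { RING_SIZE = 16 };  /* entries; power of two so wrapping is a mask */

typedef struct {
    uint32_t cmds[RING_SIZE]; /* lives in system memory, fetched by DMA */
    uint32_t head;            /* next slot the device reads (device-owned) */
    uint32_t tail;            /* next slot the CPU writes (CPU-owned) */
} cmd_ring;

/* One slot is left unused so that head == tail always means "empty". */
uint32_t ring_free(const cmd_ring *r)
{
    assert((RING_SIZE & (RING_SIZE - 1)) == 0);
    return (r->head - r->tail - 1) & (RING_SIZE - 1);
}

/* CPU side: copy a batch into the ring. Returns 0 if it does not fit. */
int ring_submit(cmd_ring *r, const uint32_t *cmds, uint32_t n)
{
    if (n > ring_free(r))
        return 0;
    for (uint32_t i = 0; i < n; i++) {
        r->cmds[r->tail] = cmds[i];
        r->tail = (r->tail + 1) & (RING_SIZE - 1);
    }
    /* Real hardware: write the new tail to a device register here. */
    return 1;
}

/* Stand-in for the device: fetch up to `max` commands from the ring. */
uint32_t device_consume(cmd_ring *r, uint32_t max, uint32_t *out)
{
    uint32_t n = 0;
    while (n < max && r->head != r->tail) {
        out[n++] = r->cmds[r->head];
        r->head = (r->head + 1) & (RING_SIZE - 1);
    }
    return n;
}
```

The win over the FIFO approach is that the CPU writes a whole batch into ordinary memory at full speed and touches the device once per batch instead of once per command.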

Finally we got command lists/command buffers/bundles copied directly to the GPU.

Nowadays you have multiple command lists/command buffers/bundles going in parallel https://developer.nvidia.com/blog/advanced-api-performance-c...

On the hardware side, the 8/16-bit ISA bus was a shared parallel connection to the CPU bus at a fixed clock (4.77-10MHz, 4 clocks per transfer, ~5MB/s max speed).

It took until 1992 to get the next commonly used solution: a "rogue" consortium of companies tired of IBM shit designed the VESA Local Bus (a true hack), which slapped expansion cards directly onto the raw 32-bit CPU bus of 486 processors. Cheap, no licensing fees, extremely fast (40MHz x 32bit = potentially faster than later PCI), easy to implement.

This got replaced with the advent of the Pentium (64-bit external CPU data bus) and the introduction of PCI. PCI is still a shared parallel bus, but this time 32 bits at 33MHz with packetized transactions.

AGP was "just" a faster PCI on its own dedicated separate controller (no contention with other PCI devices) with optimized addressing (sideband). 32 bits at 66MHz, then x2 DDR, x4 QDR, x8 ODR. The last one means there are 8 transfers taking place within one clock cycle, for a nice 2GB/s.
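The bandwidth figures in the last few paragraphs all come from the same arithmetic: bytes per transfer times clock times transfers per clock. A small C helper to check them (the function name `peak_mb_per_s` is mine, and these are theoretical peaks using the rounded clocks quoted above, not measured throughput):

```c
#include <assert.h>

/* Peak bandwidth in MB/s = (bus width in bytes) * (clock in MHz)
   * (transfers per clock). Use a fraction for buses that need several
   clocks per transfer, e.g. 0.25 for "4 clocks per transfer". */
double peak_mb_per_s(int bus_bits, double clock_mhz, double transfers_per_clock)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    return (bus_bits / 8) * clock_mhz * transfers_per_clock;
}
```

Plugging in the numbers from this comment: 16-bit ISA at 10MHz with 4 clocks per transfer is `peak_mb_per_s(16, 10, 0.25)` = 5MB/s, VLB at 40MHz is `peak_mb_per_s(32, 40, 1)` = 160MB/s, PCI is `peak_mb_per_s(32, 33, 1)` = 132MB/s, and AGP x8 is `peak_mb_per_s(32, 66, 8)` = 2112MB/s, i.e. the "nice 2GB/s".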

PCI-E is a faster, bidirectional, serial, point-to-point PCI with the ability to combine links into bundles (x1-x16). PCI-E devices live on a network switch and don't block each other from talking simultaneously. You could think of PCI-E as every PCI device getting its own dedicated bidirectional AGP connector.

Some vintage hands-on coding examples:

2D Tseng Labs ET4000 coding https://www.youtube.com/watch?v=K8kZ4BFxOtc

2D Cirrus Logic https://www.youtube.com/watch?v=WoAE7x-u1g0

"How 3D acceleration started 20 years ago: S3/Virge register level programming" https://www.youtube.com/watch?v=fXJ11_wG_0U

"Acceleration code working on real S3 Virge/DX" https://www.youtube.com/watch?v=Hsg1N4IqXac

"Direct hardware accelerated 3d in 20kB code" https://www.youtube.com/watch?v=n509_wN02u8

"Bare metal hardware 3d texturing in 23kb of code w/ S3/Virge" https://www.youtube.com/watch?v=UgvBGXiw6LY

"Testing our latest low-level hardware 3d code on real S3/Virge hardware" https://www.youtube.com/watch?v=px--LWdRoYA

"Live coding and testing more low-level 3D w/ S3/Virge" https://www.youtube.com/watch?v=l3lH0cIZUSA

"Finishing low-level hardware S3/Virge acceleration demo" https://www.youtube.com/watch?v=JmfeB2LEDbc

"3dfx Voodoo: Low-level & bare-metal driver-less code" https://www.youtube.com/watch?v=LDT6KlfOG2k

"Finally 3dfx Voodoo triangles" https://www.youtube.com/watch?v=ZWaDqY4gqhw

"More GPU programming Voodoo case study" https://www.youtube.com/watch?v=AYZvNyxFHqk

"Quite final 3dfx Voodo low-level code working" https://www.youtube.com/watch?v=2ADQgIEWrx4