AFAIK CUDA offers unified memory, which basically gives you a shared virtual address space and page-faults data in from host memory on demand. There's also the IOMMU more generally.
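For reference, the unified-memory model looks roughly like this: `cudaMallocManaged` is the actual API, and the same pointer is valid on both host and device, with the driver migrating pages back and forth on fault (the kernel here is just illustrative).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    // One allocation, accessible from both CPU and GPU.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 1.0f;  // touched on the host first

    scale<<<(n + 255) / 256, 256>>>(x, n);    // pages migrate to the GPU on fault
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);              // pages migrate back to the host
    cudaFree(x);
    return 0;
}
```

No explicit `cudaMemcpy` anywhere; that's the whole point of the managed model.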
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs, drives, and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally the drive could receive network messages that get plumbed straight to it (I'm only partially joking).