The report does not detail hardware -- though it does state that SDXL has 2.6B parameters in its UNet component, compared to 860M for SD 1.4/1.5 and 865M for SD 2.0/2.1. So SDXL's UNet is roughly 3x larger.
In January, MosaicML claimed that a model comparable to Stable Diffusion v2 could be trained in 13 days using 79,000 A100-hours.
Some rough inference about SDXL's training cost can be made from this; I'd be interested to hear someone with more insight provide perspective here.
bitsandbytes is only used during training with these models, though (for the 8-bit AdamW optimizer). Naively quantizing both the weights and the activations to a range of 256 values, when the model ultimately needs to output a range of 256 values, creates noticeable artifacts, since the quantized values are not going to map 1-to-1.
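To make the point concrete, here is a toy sketch (mine, not from either project) of naive per-tensor uniform 8-bit quantization. Mapping FP values onto only 256 evenly spaced levels bounds the error per weight at half a quantization step, but that error is nonzero everywhere and accumulates through the network:

```python
import numpy as np

# Illustrative sketch: naive per-tensor uniform 8-bit quantization
# round-trip on toy "weights". Values and sizes are made up.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)

scale = (w.max() - w.min()) / 255.0
q = np.round((w - w.min()) / scale).astype(np.uint8)  # quantize to 0..255
w_hat = q.astype(np.float32) * scale + w.min()        # dequantize

# Per-weight error is bounded by half a quantization step, but every
# weight is perturbed, and the perturbations compound layer by layer.
max_err = np.abs(w - w_hat).max()
print(max_err, scale / 2)
```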
Draw Things recently released an 8-bit quantized SD model with output comparable to the FP16 version. It uses a k-means-based LUT and separates the weights into blocks to minimize quantization error.
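A minimal sketch of that general approach (my own illustration -- the block size, cluster count, and k-means details are assumptions, not Draw Things' actual settings): split the weights into blocks, run 1-D k-means per block so the centroids become a small lookup table, and store each weight as an index into its block's LUT:

```python
import numpy as np

def kmeans_1d(x, k=16, iters=20):
    # Tiny 1-D k-means; the final centroids are the block's LUT entries.
    centroids = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = x[idx == j].mean()
    return centroids, idx

def quantize_blockwise(w, block=256, k=16):
    # Per-block LUTs adapt to the local weight distribution, which is
    # what keeps the error lower than one global uniform grid.
    luts, codes = [], []
    for start in range(0, len(w), block):
        lut, idx = kmeans_1d(w[start:start + block], k)
        luts.append(lut)
        codes.append(idx.astype(np.uint8))
    return luts, codes

def dequantize_blockwise(luts, codes):
    return np.concatenate([lut[idx] for lut, idx in zip(luts, codes)])

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1024).astype(np.float32)
luts, codes = quantize_blockwise(w)
w_hat = dequantize_blockwise(luts, codes)
print(np.abs(w - w_hat).max())
```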
I was going to search the internet about it, but then I realized you are the author (and I don't think there is anything online). I imagine the activations are left in FP16 and the weights are converted to FP16 during inference, right?
Yes, computation is carried out in FP16, so there are no compute-efficiency gains -- though there could be latency reductions from memory-bandwidth savings. Those savings aren't realized yet, because no custom kernels have been introduced yet.
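Roughly, the inference path described above looks like this (a sketch under my own assumptions, not the actual Draw Things kernels): weights are stored as uint8 LUT indices, gathered back to FP16 right before the matmul, so the arithmetic itself is pure FP16 and the win is in how little memory the stored weights occupy:

```python
import numpy as np

# Toy 256-entry LUT and a small uint8-coded weight matrix (made up).
lut = np.linspace(-0.05, 0.05, 256).astype(np.float16)
codes = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)

x = np.ones((1, 64), dtype=np.float16)  # activations stay FP16
w_fp16 = lut[codes]                     # gather: uint8 indices -> FP16 weights
y = x @ w_fp16                          # the matmul runs entirely in FP16
print(y.dtype)
```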