Grayskull is supposed to be A100 performance for $1000, with some cool features (horizontally scalable by plugging cards into each other over Ethernet, C++ programmable, sparse computation, etc.).
I really wonder how well it's going to perform given its 600 TOPS / 16 GB DRAM / 200 Gbit/s setup. I was told that in Transformer training, memory bandwidth is key, and neither the 16 GB single-card memory capacity nor the bandwidth sounds terribly attractive. But most important of all, judging from their FAQ, the driver is likely to be proprietary and to require Internet access, which is worrying.
Over the last couple of weeks, I've seriously considered purchasing an AMD Instinct MI50 (32 GB HBM2, ~1 TB/s memory bandwidth), which goes for under $1000. I know it lacks Tensor Cores and can only offer 53 TOPS, which sounds silly compared to Grayskull's 600 TOPS. However, isn't it the case that for the most part those cores sit idle, waiting on memory? At any rate, you're not going to be able to run, say, Llama 30B on a single card: it won't fit into 16 GB, but it should fit comfortably in a 32 GB system. Perhaps most importantly, the amdgpu driver is actually open source and, unlike other vendors' drivers, allows PCIe passthrough. Considering all of the above, it almost seems like a no-brainer for a trusted-computing setup.
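A rough back-of-envelope sketch of that sizing claim (my arithmetic, not from the thread; weights only, ignoring KV cache and activations): 30B parameters blow past 16 GB unless quantized to roughly 4 bits, and need roughly 8-bit quantization to fit in 32 GB.

```python
# Approximate weight-memory footprint of a 30B-parameter model at
# different precisions. Figures are my assumptions for illustration.
GIB = 1024 ** 3

def weight_gib(params: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the weights, nothing else."""
    return params * bytes_per_param / GIB

params_30b = 30e9
print(f"fp16: {weight_gib(params_30b, 2):.1f} GiB")    # ~55.9 GiB
print(f"int8: {weight_gib(params_30b, 1):.1f} GiB")    # ~27.9 GiB
print(f"int4: {weight_gib(params_30b, 0.5):.1f} GiB")  # ~14.0 GiB
```

So a 30B model doesn't fit a single 16 GB card even at 8-bit, while a 32 GB card takes it at 8-bit with headroom to spare.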
The idea is that the Tenstorrent cards dynamically prune out codepaths that aren't used.
Transformers naturally have a big part of the network that goes unused for any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmapping the model.
So my understanding is that the Tenstorrent cards are drastically more efficient, even on "dense" models like Transformers, because of the sparsity of any specific forward pass.
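A toy illustration of the mmap point (my sketch, not llama.cpp's code): with a memory-mapped file, the OS faults in only the pages actually touched, so resident memory can look far smaller than the file itself.

```python
# Map a 64 MiB stand-in "model file" and read a single byte from it.
# Only the page(s) backing that offset get faulted in; the rest of the
# mapping stays on disk until (and unless) touched.
import mmap
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)      # 64 MiB of zeros
    f.seek(10 * 1024 * 1024)
    f.write(b"\x2a")                  # one "weight" we will actually read

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    value = mm[10 * 1024 * 1024]      # faults in just this page
    mm.close()

os.remove(path)
print(value)  # prints 42 (0x2a)
```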
Also: I wouldn't bet on AMD accelerators for ML. They've disappointed every time. I would trust Jim Keller, whose every project in the last decade ended up being impactful.
I think the internet access is just to download the driver. It's not some sort of DRM setup where it needs to be always-online.
I understand this pruning is the difference between "silicon" and "software-assisted" TOPS in their documentation, but I still don't see how exactly that addresses the fact that to fit a 30B-parameter model into memory, you need at least that much memory. So to go 30B and up, you would need at least two, if not three, cards, and I couldn't find any details on how the interlink is going to be implemented, except that it's apparently done via Ethernet, limiting it to 100 Gbit/s. That, again, seems like a hard bandwidth limitation, out of place next to the impressive compute.

Also: "online installer required" is not a good look, at least for me personally and for the security model of my system. AMD cards, however, are affordable commodity hardware offering 32 GB of HBM2, and the driver is open source, so I wouldn't discount them even considering their lacklustre performance. Cloud VMs do nicely for most use-cases, but as soon as hard security comes into focus, you just can't afford to have unauditable blobs from NVIDIA, or any other vendor for that matter.

I'm still looking forward to learning more about Grayskull, especially about its memory capacity/bandwidth limitations and what they mean for ever-growing large language models. Hopefully, they can open-source their driver stack.
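For scale, a quick sketch of that interlink concern (both figures are my assumptions, not confirmed Grayskull specs: a 100 Gbit/s Ethernet link versus roughly 1 TB/s of on-card HBM2):

```python
# Compare interlink bandwidth to on-card memory bandwidth.
ETHERNET_GBIT_S = 100   # assumed Ethernet interlink speed
HBM2_GB_S = 1000        # ~1 TB/s, typical of HBM2 cards like the MI50

ethernet_gb_s = ETHERNET_GBIT_S / 8
print(f"Ethernet link: {ethernet_gb_s:.1f} GB/s")            # 12.5 GB/s
print(f"HBM2 is ~{HBM2_GB_S / ethernet_gb_s:.0f}x faster")   # ~80x
# Shuffling one card's full 16 GB across the link:
print(f"16 GB over the link: {16 / ethernet_gb_s:.2f} s")    # 1.28 s
```

In other words, if a multi-card model has to move tensors over the link every step, the link is nearly two orders of magnitude slower than local memory.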
Notice they require 32 GB of host RAM and recommend 64 GB.
Presumably the architecture keeps the model in CPU RAM and shuffles it dynamically to the PCIe card network? I'm guessing here.
Whatever quirks come out of their hardware, I want to keep an eye on it.
I also think the comparison is skewed by the current generation of models, which are built and trained to maximize GPU or TPU bandwidth; results could improve if someone architected a model around Grayskull's advantages. Given that PyTorch runs on it, I don't think that would be too hard to do.
> Transformers naturally have a big part of the network that is unused on any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.
From what I've read, that was just an error in how memory consumption was reported after switching to the mmap version; it wasn't actually more memory-efficient in the end.