Grayskull is supposed to be A100 performance for $1000, with some cool features (horizontally scalable by plugging cards into each other over Ethernet, C++ programmable, sparse computation, etc.).
I really wonder how well it's going to perform given its 600 TOPS / 16 GB DRAM / 200 Gbit/s setup. I was told that in Transformer training, memory bandwidth is key, and neither the 16 GB single-card memory capacity nor the bandwidth sounds terribly attractive. But most important of all, judging from their FAQ, the driver is likely to be proprietary and to require Internet access, which is worrying.
Over the last couple of weeks, I've seriously considered purchasing an AMD Instinct MI50 (32 GB HBM2, ~1 TB/s memory bandwidth), which goes for under $1000. I know it lacks Tensor Cores and can only offer 53 TOPS, which sounds silly compared to Grayskull's 600 TOPS. However, isn't it the case that for the most part those cores sit idle, waiting on memory? At any rate, you're not going to be able to run, say, Llama 30B on a single card: it won't fit into 16 GB, but it should fit comfortably in a 32 GB system. Perhaps most importantly, the amdgpu driver is actually open source and, unlike other vendors' drivers, allows PCIe passthrough. Considering all of the above, it almost seems like a no-brainer for a trusted-computing setup.
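A rough back-of-envelope sketch of that sizing claim (my arithmetic, not from the thread; weights only, ignoring KV cache and activations): 30B parameters blow past 16 GB unless quantized to roughly 4 bits, and need roughly 8-bit quantization to fit in 32 GB.

```python
# Approximate weight-memory footprint of a 30B-parameter model at
# different precisions. Figures are my assumptions for illustration.
GIB = 1024 ** 3

def weight_gib(params: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the weights, nothing else."""
    return params * bytes_per_param / GIB

params_30b = 30e9
print(f"fp16: {weight_gib(params_30b, 2):.1f} GiB")    # ~55.9 GiB
print(f"int8: {weight_gib(params_30b, 1):.1f} GiB")    # ~27.9 GiB
print(f"int4: {weight_gib(params_30b, 0.5):.1f} GiB")  # ~14.0 GiB
```

So a 30B model doesn't fit a single 16 GB card even at 8-bit, while a 32 GB card takes it at 8-bit with headroom to spare.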
The idea is that the Tenstorrent cards dynamically prune out codepaths that aren't used.
Transformers naturally have a big part of the network that goes unused for any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmapping the model.
So my understanding is that the Tenstorrent cards are drastically more efficient, even on "dense" models like Transformers, because of the sparsity of any specific forward pass.
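A toy illustration of the mmap point (my sketch, not llama.cpp's code): with a memory-mapped file, the OS faults in only the pages actually touched, so resident memory can look far smaller than the file itself.

```python
# Map a 64 MiB stand-in "model file" and read a single byte from it.
# Only the page(s) backing that offset get faulted in; the rest of the
# mapping stays on disk until (and unless) touched.
import mmap
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)      # 64 MiB of zeros
    f.seek(10 * 1024 * 1024)
    f.write(b"\x2a")                  # one "weight" we will actually read

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    value = mm[10 * 1024 * 1024]      # faults in just this page
    mm.close()

os.remove(path)
print(value)  # prints 42 (0x2a)
```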
Also: I wouldn't bet on AMD accelerators for ML. They've disappointed every time. I would trust Jim Keller, whose every project in the last decade ended up being impactful.
I think the internet access is just to download the driver. It's not some sort of DRM setup where it needs to be always-online.
I understand this pruning is the difference between "silicon" and "software-assisted" TOPS in their documentation, but I still don't see how exactly that addresses the fact that to fit a 30B-parameter model into memory, you need at least that much memory. So to go 30B and up, you would need at least two, if not three, cards, and I couldn't find any details on how the interlink is going to be implemented, except that it's apparently done via Ethernet, limiting it to 100 Gbit/s. That, again, seems like a hard bandwidth limitation, out of place next to the impressive compute.

Also: "online installer required" is not a good look, at least for me personally and for the security model of my system. AMD cards, however, are affordable commodity hardware offering 32 GB of HBM2, and the driver is open source, so I wouldn't discount them even considering their lacklustre performance. Cloud VMs do nicely for most use-cases, but as soon as hard security comes into focus, you just can't afford to have unauditable blobs from NVIDIA, or any other vendor for that matter.

I'm still looking forward to learning more about Grayskull, especially about its memory capacity/bandwidth limitations and what they mean for ever-growing large language models. Hopefully, they can open-source their driver stack.
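For scale, a quick sketch of that interlink concern (both figures are my assumptions, not confirmed Grayskull specs: a 100 Gbit/s Ethernet link versus roughly 1 TB/s of on-card HBM2):

```python
# Compare interlink bandwidth to on-card memory bandwidth.
ETHERNET_GBIT_S = 100   # assumed Ethernet interlink speed
HBM2_GB_S = 1000        # ~1 TB/s, typical of HBM2 cards like the MI50

ethernet_gb_s = ETHERNET_GBIT_S / 8
print(f"Ethernet link: {ethernet_gb_s:.1f} GB/s")            # 12.5 GB/s
print(f"HBM2 is ~{HBM2_GB_S / ethernet_gb_s:.0f}x faster")   # ~80x
# Shuffling one card's full 16 GB across the link:
print(f"16 GB over the link: {16 / ethernet_gb_s:.2f} s")    # 1.28 s
```

In other words, if a multi-card model has to move tensors over the link every step, the link is nearly two orders of magnitude slower than local memory.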
Notice they require 32 GB of host RAM and recommend 64 GB.
Presumably the architecture keeps the model in CPU RAM and shuffles it dynamically to the PCIe card network? I'm guessing here.
Whatever quirks come out of their hardware, I want to keep an eye on it.
I also think the comparison is skewed by the current generation of models, which are built and trained to maximize GPU or TPU bandwidth; results could improve if someone architected a model around Grayskull's advantages. Given that PyTorch runs on it, I don't think that would be too hard to do.
> Transformers naturally have a big part of the network that is unused on any particular token flowing through the model. You could see that by how little RAM ended up being used in llama.cpp when they moved to mmaping the model.
From what I've read, that was just an error in how memory consumption was reported after switching to the mmap version; it wasn't actually more memory-efficient in the end.