...And a massive pile of cash/compute hardware.

kiney · on April 17, 2024

Not that massive, we're talking six figures. There was a blog post about this a while back on the front page of HN.

moffkalast · on April 17, 2024

Six figures is a massive pile of cash.

squigz · on April 18, 2024

It's... really not, considering the audience here. Even less massive if 2-3 engineers get together to do it.

moffkalast · on April 18, 2024

It is, considering what you get for it, and it's not low-end six figures but most likely seven. The JetMoE team released their training cost estimate, and it took them $100k to train what's effectively a 2.2B model on 1.25T tokens. Compare that to the still-tiny Mistral 7B, which is 3x larger and was trained on 4x more data, and you get a figure closer to $1.7M. These are the absolute smallest production-viable LLMs.

For something like Mixtral 8x22B with 40B active params you'd be looking at the $10M range, and if something gets screwed up during training you can be left with a dud and nothing to show for it, like Llama-2-34B. It's like buying millions worth of lootboxes and hoping something good drops.
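
For anyone who wants to poke at the numbers, here's a rough sketch of that back-of-envelope math. It assumes cost scales linearly with both active parameters and training tokens, which is my simplification and not something the JetMoE team published. The 4x token multiplier for the 40B-active case is also a placeholder assumption, since the thread doesn't give a token count for that model.

```python
# Back-of-envelope training cost, scaled from the JetMoE figure quoted above.
# ASSUMPTION: cost is proportional to (active params) x (training tokens).

BASELINE_COST_USD = 100_000   # JetMoE estimate quoted in this thread
BASELINE_PARAMS_B = 2.2       # "effectively a 2.2B model"


def scaled_cost(param_ratio: float, token_ratio: float) -> float:
    """Scale the baseline cost by how much bigger the model and dataset are."""
    return BASELINE_COST_USD * param_ratio * token_ratio


# "3x larger ... 4x more data", taken at face value
cost_7b_class = scaled_cost(param_ratio=3, token_ratio=4)

# 40B active params; the 4x token ratio here is a hypothetical placeholder
cost_40b_active = scaled_cost(param_ratio=40 / BASELINE_PARAMS_B, token_ratio=4)

print(f"7B-class:   ${cost_7b_class:,.0f}")
print(f"40B active: ${cost_40b_active:,.0f}")
```

With those multipliers the 7B-class estimate lands around $1.2M and the 40B-active one around $7M, so the same order of magnitude as the figures above. Real runs also pay for failed attempts, evals, and idle hardware, none of which this models.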

htrp · on April 17, 2024

for finetuning or parameter training from scratch?

kiney · on April 17, 2024

from scratch: https://research.myshell.ai/jetmoe

kaibee · on April 17, 2024

That's for an 8B model.

cptcobalt · on April 17, 2024

This is oversimplifying it, but there isn't much more inherent complexity in training an 8B or larger model beyond more money, more compute, more data, and more time. Overall, the principles are similar.

lostmsu · on April 17, 2024

Assuming cost grows linearly with the number of parameters, that's 7.5 figures instead of 6 for an 8x22B model.
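
A quick sketch of that estimate, for the curious. Two assumptions of mine: "fractional figures" is read as log10(cost) + 1, and the 8x22B total is taken as the naive 8 x 22B count, which is an upper bound because the experts share some layers.

```python
import math

# Figures taken from the thread; everything else is an assumption.
BASELINE_COST_USD = 100_000   # JetMoE training cost
BASELINE_PARAMS_B = 8         # JetMoE total parameters, per the thread
TARGET_PARAMS_B = 8 * 22      # naive total for an 8x22B model (upper bound)


def fractional_figures(cost_usd: float) -> float:
    """ASSUMPTION: 'N figures' on a log scale means log10(cost) + 1."""
    return math.log10(cost_usd) + 1


# Linear scaling in parameter count only (training tokens held fixed)
target_cost = BASELINE_COST_USD * TARGET_PARAMS_B / BASELINE_PARAMS_B
target_figures = fractional_figures(target_cost)

print(f"Estimated cost: ${target_cost:,.0f}")
print(f"Fractional figures: {target_figures:.2f}")
```

That comes out a bit over $2M, so between 7 and 7.5 figures on this reading. It holds the token count fixed, so a longer training run would push it higher.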