
...And a massive pile of cash/compute hardware.


Not that massive; we're talking six figures. There was a blog post about this on the HN front page a while back.


Six figures is a massive pile of cash.


It's... really not, considering the audience here. Even less massive if 2-3 engineers get together to do it.


It is, considering what you get for it, and it's not low-end six figures; it's most likely seven. The JetMoE team released their training cost estimate, and it took them $100k to train what's effectively a 2.2B model on 1.25T tokens. Compare that to the still-tiny Mistral 7B, which is 3x larger and was trained on 4x more data, and you get a figure closer to $1.7M. These are the absolute smallest production-viable LLMs.

For something like Mixtral 8x22B with 40B active params you'd be looking at the $10M range, and if something gets screwed up during training you can be left with a dud and nothing to show for it, like Llama-2-33B. It's like buying millions of dollars' worth of loot boxes and hoping something good drops.
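
A rough sketch of the arithmetic behind estimates like these, assuming training cost scales with (active parameters x training tokens), which is a common proxy for training compute. The function name `scaled_cost` is made up for illustration, and the token count for Mistral 7B is simply "4x JetMoE's 1.25T" taken from the comment above, not a published number.

```python
def scaled_cost(base_cost, base_params, base_tokens, params, tokens):
    """Scale a known training cost by the ratio of (params * tokens)."""
    return base_cost * (params / base_params) * (tokens / base_tokens)

# Baseline from the comment: JetMoE, ~$100k for 2.2B active params on 1.25T tokens.
JETMOE = dict(base_cost=100_000, base_params=2.2e9, base_tokens=1.25e12)

# Mistral 7B: roughly 3x the params and 4x the data (ASSUMED from the comment).
mistral_7b = scaled_cost(**JETMOE, params=7e9, tokens=4 * 1.25e12)
print(f"Mistral 7B estimate: ${mistral_7b / 1e6:.2f}M")
```

This naive proxy lands around $1.3M, the same order of magnitude as the ~$1.7M quoted above. It ignores hardware utilization, failed runs, and everything else that isn't raw compute, so treat it as a floor rather than a quote.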


For fine-tuning, or for training the parameters from scratch?



That's for an 8B model.


This is oversimplifying it, but there isn't much more inherent complexity in training an 8B or larger model beyond more money, more compute, more data, and more time. Overall, the principles are similar.


Assuming cost grows linearly with the number of parameters, that's 7.5 figures instead of 6 for an 8x22B model.
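
For what it's worth, the "figures" arithmetic can be sketched like this, treating the number of figures as log10(cost) + 1. The $100k base cost for the 8B run and counting 8x22B as 8 * 22B = 176B nominal parameters are both assumptions for illustration; `figures` and `linear_scale` are hypothetical helper names.

```python
import math

def figures(cost):
    """Fractional 'number of figures' of a dollar amount: $100k -> 6.0, $1M -> 7.0."""
    return math.log10(cost) + 1

def linear_scale(base_cost, base_params, params):
    """Assume cost grows linearly with parameter count (same data, same efficiency)."""
    return base_cost * params / base_params

# ASSUMPTIONS: the 8B run costs $100k (bottom of six figures), and
# 8x22B is counted as 8 * 22B = 176B nominal parameters.
cost = linear_scale(100_000, 8e9, 8 * 22e9)
print(f"${cost:,.0f} -> {figures(cost):.2f} figures")
```

With a $100k base this comes out a bit above 7.3 figures; a base cost nearer $150k gives roughly 7.5.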



