It is, considering what you get for it, and it's not low-end six figures; it's most likely seven. The JetMoE team released their training cost estimate, and it took them $100k to train what's effectively a 2.2B model on 1.25T tokens. Compare that to the still tiny Mistral 7B, which is 3x larger and was trained on 4x more data, and you get a figure closer to $1.7M. These are the absolute smallest production-viable LLMs.
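Back-of-the-envelope version of that scaling, if you want to play with the numbers. This is only a sketch: it assumes training cost grows linearly with active params times tokens (the usual compute ≈ 6·N·D rule of thumb) at the same hardware efficiency, and `scaled_cost` is just a made-up helper name. With the rounded 3x and 4x multipliers it lands around $1.2M, so the same low-seven-figure ballpark; the exact number moves with whatever ratios and utilization you plug in.

```python
def scaled_cost(base_cost_usd, param_ratio, token_ratio):
    """Scale a known training cost, assuming cost is proportional to params x tokens."""
    return base_cost_usd * param_ratio * token_ratio


# JetMoE's reported ~$100k is the baseline; 3x params and 4x tokens are the
# rounded multipliers from the comment above, not official Mistral figures.
jetmoe_cost = 100_000
mistral_estimate = scaled_cost(jetmoe_cost, param_ratio=3, token_ratio=4)
print(f"${mistral_estimate:,.0f}")  # prints $1,200,000
```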
For something like Mixtral 8x22B with 40B active params you'd be looking at the $10M range, and if something gets screwed up during training you can be left with a dud and nothing to show for it, like Llama-2-34B. It's like buying millions' worth of lootboxes and hoping something good drops.
This is oversimplifying it, but there isn't much more inherent complexity in training an 8B or larger model beyond more money, more compute, more data, and more time. Overall, the principles are similar.