
blog post: https://x.ai/blog/grok-os

  * 314B parameters (86B active at a time)
  * mixture of 8 experts (2 active at a time)
  * weights and architecture licensed under Apache 2.0
(edit:) announcement blog post from last year with benchmarks compared to Claude 2, GPT-3.5 and GPT-4: https://x.ai/blog/grok

(edit2:) TL;DR: somewhat comparable to GPT-3.5, Mixtral, and Qwen-1.5-72B in capability, but way larger than the other open-weight models
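
(edit3:) back-of-envelope on the expert split, assuming the figures above and that all 8 experts are equal-sized (the release doesn't say):

  # total = shared + 8 * expert ; active = shared + 2 * expert
  total, active = 314e9, 86e9
  expert = (total - active) / 6   # ~38B per expert
  shared = total - 8 * expert     # ~10B shared (presumably attention/embeddings)
  print(f"expert: ~{expert/1e9:.0f}B, shared: ~{shared/1e9:.0f}B")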



Is a model so huge that's only at the level of GPT-3.5 actually good? That seems incredibly inefficient to me.


OpenAI is valued at 90 billion and all they do is make GPT; Twitter is valued at 40 billion and this was essentially a vanity side-project by a cowboy CEO. Presuming the benchmarks and the general "it's about the level of 3.5" impression are accurate, it's inefficient, but not incredibly inefficient imho.


xAI is a separate entity, and not an X/Twitter subsidiary.


> Twitter is valued at 40 billion

WAS valued at 44B.

Now?

Maybe 5 billion.


LOL @ $5 billion, but if that were the valuation, you'd be making the parent's point stronger.


Last I heard they lost 15% of their users, so let's call it 36 billion.


They weren't even worth 44B when Elon took the keys - he specifically tried to back out of the deal because 44B was an insane peak-'21 asset-bubble price. In truth they were probably worth more like 10-15B at that moment. And now that a bunch of advertisers have left due to you-know-who, it's probably about 10B.



Twitter didn't have direct competitors other than Mastodon when it was taken private at 44B. Now there are Threads, Bluesky, and a bigger Mastodon.


Honestly, none of those look like meaningful competitors at the moment.


None of these matter


twitter was valued around 30 billion when musk tried getting out of buying it (then the market cap went up when it became clear that he would be forced to pay full price)


Since it's MoE, quantized it could run on cheaper hardware with just consumer networking in between, instead of needing Epyc/Xeon levels of PCIe lanes, NVLink, or InfiniBand-type networking. It could even run with people pooling smaller systems over slow internet links.
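
Rough weight-memory math for that (a sketch; weights only, ignoring KV cache and activations):

  # Approximate weight memory for 314B params at common quantizations
  params = 314e9
  for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
      gb = params * bits / 8 / 1e9
      print(f"{name}: ~{gb:.0f} GB")  # fp16 ~628 GB, int8 ~314 GB, int4 ~157 GB

And since only 2 of 8 experts fire per token, each node could hold a couple of experts and only ship activations around, which is why slow links might be tolerable.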


It’s designed to be actively searching real-time posts on X. Apples and oranges.


Why is that relevant to the size?

Post search on X is done as it is with any other data from any other source: you use RAG and function calling to insert the context.

< 7B open-source models can function-call very well. In fact, Nous Hermes 2 Pro (7B) is benchmarking better at that than GPT-3.5.

Not related to the size, if I'm not mistaken.
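
For illustration, a minimal OpenAI-style tool definition for such a search function (the tool name and fields are hypothetical):

  # The model emits a call to this tool; your code runs the actual search
  # and feeds the results back in as context (i.e., RAG).
  search_tool = {
      "type": "function",
      "function": {
          "name": "search_posts",  # hypothetical
          "description": "Search recent posts for a query",
          "parameters": {
              "type": "object",
              "properties": {
                  "query": {"type": "string"},
                  "limit": {"type": "integer", "default": 20},
              },
              "required": ["query"],
          },
      },
  }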


Isn't that... the same thing as search?


The data pipeline isn't included in this release, and we already know it is a pretty simple RAG pipeline using qdrant, https://twitter.com/qdrant_engine/status/1721097971830260030.

Nothing about using data in "real time" dictates that the model parameters need to be this large, and the size is likely quite inefficient for their "non-woke" instructional use-case.
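
For reference, the retrieval step in a pipeline like that is only a few lines with the qdrant client (the collection name and vector size below are made up):

  from qdrant_client import QdrantClient

  client = QdrantClient("localhost", port=6333)
  hits = client.search(
      collection_name="tweets",  # hypothetical collection of embedded posts
      query_vector=[0.0] * 768,  # stand-in: embed the user's query here
      limit=20,
  )
  context = "\n".join(h.payload["text"] for h in hits)  # prepended to the prompt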


Agreed. We have been building our real-time GPT flows for news & social as part of Louie.AI - think monitoring and investigations... Long-term, continuous training will become amazing, but for the next couple of years, most of our users would prefer GPT4 or Groq plus much smarter RAG vs what's here. More strongly, the interesting part is how the RAG is done. Qdrant is cool, but it's just a DB with a simple vector index, so nothing in Grok's release is tech we find relevant to our engine.

Eg, there is a lot of noise in social data, and worse, misinfo/spam/etc, so we spend a lot of energy on adversarial data integration. Likewise, queries are often neurosymbolic, like on a date range or with inclusion/exclusion criteria. Pulling the top 20 most similar tweets to a query and running them through a slow, dumb, & manipulated LLM would be a bad experience. We have been pulling in ideas from agents, knowledge graphs, digital forensics & SNA, code synthesis, GNNs, etc. for our roadmap, which feels quite different from what is being shown here.
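
(To illustrate the date-range / exclusion point: it maps to metadata filters on the vector query rather than pure top-k similarity. A qdrant sketch with hypothetical field names, just for illustration - not our actual stack:)

  from qdrant_client import QdrantClient
  from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

  client = QdrantClient("localhost", port=6333)
  # Hypothetical payload fields: "ts" (unix seconds) and "is_spam" (bool)
  flt = Filter(
      must=[FieldCondition(key="ts", range=Range(gte=1704067200, lte=1706745600))],  # Jan 2024
      must_not=[FieldCondition(key="is_spam", match=MatchValue(value=True))],
  )
  hits = client.search(
      collection_name="tweets",  # hypothetical
      query_vector=[0.0] * 768,  # stand-in for a real query embedding
      query_filter=flt,
      limit=20,
  )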

We do have pure LLM work, but it's more about fine-tuning smaller or smarter models, and we find that to be a tiny % of the part people care about. Ex: spam classifications flowing into our RAG/KG pipelines or small-model training matter more to us than flowing into big-model training. Long-term, I do expect growing emphasis on the big models we use, but that is a more nuanced discussion.

(We have been piloting w gov types and are preparing for next cohorts, in case useful on real problems for anyone.)


According to their benchmarks it is superior to GPT-3.5


I love the citation for the image in the article

> The cover image was generated using Midjourney based on the following prompt proposed by Grok: A 3D illustration of a neural network, with transparent nodes and glowing connections, showcasing the varying weights as different thicknesses and colors of the connecting lines.


Mixtral is also comparable to GPT-3.5, and open.

At 8x7B it's also a fraction of the size. Are there any benchmarks comparing Mixtral to Grok?


Mixtral announcement is here: https://mistral.ai/news/mixtral-of-experts/

Mixtral looks more economical in capability relative to size (similarly for Qwen-1.5-72B)


How is it that OpenAI was touted like it was some massive years-long effort that blew all AI research out of the water and now we have so many competitors popping up one after another?


You don't need to be a cutting edge research scientist to train a SOTA LLM. You just need money for scaling. OpenAI's "secret" was just their willingness to spend tens/hundreds of millions without guaranteed returns, and RLHF/instruct fine tuning, both of which are out of the bag now.


Disagree. It took more than 12 months from the release of GPT-4 to someone else producing a model of equivalent quality, and that definitely wasn't due to a shortage of investment from the competition.

There's a huge amount of depth in training a really good LLM. Not helped by the fact that iteration is incredibly expensive - it might take several months (and millions of dollars) before you can tell if your new model is working well or if there was some mistake in the pipeline that led to a poor-quality result.

Almost all of the world-class LLMs outside of OpenAI/DeepMind have been trained by people who previously worked at those organizations - giving them invaluable experience such that they could avoid the most expensive mistakes while training their new models.


Don't overlook the training data (used for both training and instruction fine-tuning); it is one of the most crucial aspects, if not the most critical, given the significant differences observed between models with similar architectures.


While I do agree there is some amount of secret sauce, keep in mind the training takes several months. So for someone to see the success of GPT-4, decide they want to invest that amount of money to train the same, raise the money, find someone competent to supervise the training, train the model for several months, then test and integrate it could easily take a year even if there were no secret sauce.


That only remains an advantage if they can continue climbing the gradient from their lead position. If they hit a snag in scaling, methodology, or research, everyone else on the planet catches up, and then it's anyone's game again.


There's still no model of equivalent quality to GPT-4.


Claude 3 Opus is reporting superior metrics, particularly in its coding ability, and in the LLM Arena it is statistically tied with GPT-4.


When it comes to LLMs, metrics are misleading and easy to game. Actually talking to it and running it through novel tasks that require the ability to reason very quickly demonstrates that it is not on par with GPT-4. As in, it can't solve things step-by-step that GPT-4 can one-shot.


This was exactly my experience. I have very complex prompts and I test them on new models, and nothing I've tried performs as well as GPT-4 (Claude 3 Opus included).


It's a bit better at writing jokes. GPT is stiff and unfunny - which is why the twitter spambots using it to generate text are so obvious.


Claude opus is better in my experience


LLM training is arcane and expensive to experiment with. So OpenAI had to waste a lot of time and GPU-hours on things that didn't work to learn the tricks that did work.

Most of the competitors have lineage straight back to OpenAI, e.g. the lead of x.ai was previously at OpenAI and DeepMind. Likewise with Mistral and especially Anthropic.


OpenAI still seems to be at the top, with Anthropic perhaps close behind, in terms of capability, comparing GPT-4 and Claude 3 Opus.

This Grok-1 is a large model (~314B) that matches GPT-3.5, released 2 years ago, and is at about the same level as much smaller models like Mixtral (~47B) and Qwen-1.5 (~72B). Do you think it's competitive?


Egg of Columbus.

Also, the general architecture is well documented, ChatGPT (specifically the chat interface, not GPT-3, not InstructGPT) is what made a lot of people care, and actually reproducing it requires someone wanting to in the first place.



