Benchmarks here: https://huggingface.co/databricks/dolly-v2-12b#benchmark-metric...

omneity · on April 12, 2023

> As outlined above, these results demonstrate that dolly-v2-12b is not state of the art, and in fact underperforms dolly-v1-6b in some evaluation benchmarks. We believe this owes to the composition and size of the underlying fine tuning datasets, but a robust statement as to the sources of these variations requires further study.

Taking a moment to appreciate the integrity of the team.

ingenieroariel · on April 12, 2023

Ditto. This is "release early, release often" without necessarily meaning "move fast and break things". Other teams can do the equivalent of Alpaca to LLaMA, and we can all learn for the next round.

xatalytic · on April 12, 2023

One of the creators here - yeah, the thing we have our eyes on is the vector, not the point.

It’s astounding how adaptable these open models are, even with just a quarter of the Alpaca data. We’re a team of machine learning engineers and hackers, not an AI science lab, but that’s kind of the point, frankly - this whole exercise appears to be far easier than it might at first seem.

itake · on April 12, 2023

Why are they not doing metrics against GPT-3.5 and GPT-4? My understanding is Dolly performs significantly worse.

thewataccount · on April 12, 2023

I haven't played with the model just yet - but just eyeballing its performance, it's significantly worse. I'm surprised they don't have Pythia on there, as that's what it's based on, from my understanding.

At their performance level, the most important comparison is to GPT-NeoX, and I do appreciate that they aren't making the "95% of GPT4" claims that some fine-tuned LLaMA models are.

EDIT: For Databricks people: I'd love to see this compared with Pythia, LLaMA, Alpaca, and Vicuna/GPT4All if possible.

ankitmathur · on April 12, 2023

Out of curiosity: what's an example of a metric that you would use to evaluate the ability of the model? For example, just looking qualitatively, giving Pythia a prompt like "How do I tie a tie?" produces content that doesn't even reasonably respond to it. And yet many benchmarks have no problem with that.