Hacker News

Curious why you think this? PaLM2 looks great, and Google has been productizing cutting edge AI pretty fast for years.


PaLM 2 can't even solve "Write three sentences ending in the word Apple."

It's worse than GPT-3.5. Go see for yourself at bard.google.com, which is running on PaLM 2 everywhere but the EU as of yesterday.


Ah yes, the famous benchmark for all LLMs. I just tried your novel example with GPT-3.5 and it couldn't solve it either:

> After lunch, I like to snack on a juicy and crisp apple to satisfy my sweet tooth.

> In the fall, many families enjoy going to apple orchards to pick their own apples and make homemade apple pies.

> The new MacBook Pro features a powerful M1 chip and a stunning Retina display, making it the perfect tool for creative professionals who work with Apple software.


Eh, I think as "human evaluated" metrics go, it's a decent test of how well it can parse a reasonably complex sentence and reply accurately.

For me:

GPT4 3/3: I couldn't resist the temptation to take a bite of the juicy, red apple. Her favorite fruit was not a pear, nor an orange, but an apple. When asked what type of tree to plant in our garden, we unanimously agreed on an apple.

GPT3.5 2/3: "After a long day of hiking, I sat under the shade of an apple tree, relishing the sweet crunch of a freshly picked apple." "As autumn approached, the air filled with the irresistible aroma of warm apple pie baking in the oven, teasing my taste buds." "The teacher asked the students to name a fruit that starts with the letter 'A,' and the eager student proudly exclaimed, 'Apple!'"

Bard 0/3: Sure, here are three sentences ending in the word "apple": I ate an apple for breakfast. The apple tree is in bloom. The apple pie was delicious. Is there anything else I can help you with?

Bard definitely seems to fumble the hardest, and it's pretty funny how it brackets the response too: "Here are three sentences ending with the word apple!" Nope.
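The manual 3/3-style scoring in the parent comments can be sketched as a quick check. This is just a hypothetical grader for eyeballing outputs, not anything the models themselves use:

```python
import re

def ends_in_apple(sentence: str) -> bool:
    """True if the sentence's final word is 'apple' (case-insensitive),
    ignoring trailing punctuation and quote marks."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return bool(words) and words[-1].lower() == "apple"

def score(sentences) -> str:
    """Return a 'k/n' string counting how many sentences pass."""
    hits = sum(ends_in_apple(s) for s in sentences)
    return f"{hits}/{len(sentences)}"

# Bard's three sentences from the comment above:
bard = [
    "I ate an apple for breakfast.",
    "The apple tree is in bloom.",
    "The apple pie was delicious.",
]
print(score(bard))  # 0/3 -- every sentence ends in some other word
```

Running it on the quoted outputs reproduces the hand scores: Bard gets 0/3 because each sentence merely *contains* "apple" rather than ending with it.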

Edit: Interestingly enough, Bard seems to outperform GPT-3.5 and at least match GPT-4 on my pet test prompt: asking it, "What’s that Dante quote that goes something like 'before me there were no something, and only something something'?" GPT-3.5 struggled to find it; GPT-4 finds it relatively quickly. Bard initially told me that quote isn't in the poem, but when I reiterated that I couldn't remember the whole thing, it found it immediately and sourced the right translation. It answered as if it were reading out of a specific translation, too: "The source I used was..." Is there agent behavior under the hood of Bard, or is that just how the model is trained to communicate?


I guess PaLM 2 is competitive with GPT-3.5, so for people not willing to pay it will be an attractive offering.

I'm not sure that counts as 'great' though.


What makes you think it's comparable to GPT-3.5 and not to GPT-4? Have we seen much public performance data?


They claim it is already being used in Bard. Also, if you read the paper, it does much worse on the important benchmarks.




