
Eh, I think as "human evaluated" metrics go, it's a decent test of how well it can parse a reasonably complex sentence and reply accurately.

For me:

GPT4 3/3: I couldn't resist the temptation to take a bite of the juicy, red apple. Her favorite fruit was not a pear, nor an orange, but an apple. When asked what type of tree to plant in our garden, we unanimously agreed on an apple.

GPT3.5 2/3: "After a long day of hiking, I sat under the shade of an apple tree, relishing the sweet crunch of a freshly picked apple." "As autumn approached, the air filled with the irresistible aroma of warm apple pie baking in the oven, teasing my taste buds." "The teacher asked the students to name a fruit that starts with the letter 'A,' and the eager student proudly exclaimed, 'Apple!'"

Bard 0/3: Sure, here are three sentences ending in the word "apple": I ate an apple for breakfast.The apple tree is in bloom. The apple pie was delicious. Is there anything else I can help you with?

Bard definitely fumbles the hardest, and it's pretty funny how it brackets the response, too: "Sure, here are three sentences ending in the word 'apple'"... nope.
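If you wanted to run this test at scale instead of eyeballing each reply, a crude checker is easy to sketch. This is just an illustration (`score_response` is a made-up helper, and the sentence splitting is deliberately naive, so it will miss cases like Bard's missing space in "breakfast.The"):

```python
import re

def score_response(response: str, target: str = "apple") -> tuple[int, int]:
    """Return (sentences ending in `target`, total sentences) for a model reply."""
    # Crude split: break on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?"])\s+', response) if s.strip()]
    hits = 0
    for s in sentences:
        # Strip quotes/punctuation from the last word before comparing.
        last_word = re.sub(r'[^a-z]', '', s.split()[-1].lower())
        if last_word == target:
            hits += 1
    return hits, len(sentences)
```

For example, `score_response("I ate an apple. The tree is tall. He likes an apple.")` gives 2 hits out of 3 sentences, which matches what a human grader would say for that input.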

Edit: Interestingly enough, Bard seems to outperform GPT3.5 and at least match 4 on my pet test prompt, asking it "What’s that Dante quote that goes something like “before me there were no something, and only something something." 3.5 struggled to find it, 4 finds it relatively quickly, and Bard initially told me that quote isn't in the poem, but when I reiterated that I couldn't remember the whole thing, it found it immediately and sourced the right translation. It answered as if it were reading out of a specific translation, too: "The source I used was..." Is there agent behavior under the hood of Bard, or is that just how the model is trained to communicate?


