Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Out of curiosity: what's an example of a metric that you would use to evaluate the ability of the model? For example, just looking qualitatively, asking a prompt like "How do I tie a tie?" to Pythia produces content that isn't even reasonably responding to that. And yet many benchmarks have no problem with that


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: