And I think this is a common problem actually — figuring out what to measure and how to measure it – it's not black and white. What I do is have a few dimensions to measure it against (this may or may not fit your use case): relevance, instruction following, clarity, hallucination rate, etc. but even then, it becomes hard to measure things like 'clarity'.