So it fails in situations where there are precisely correct answers, and thrives in vagueness. I suppose that shouldn't surprise me.
You could think about coupling it with an inference engine: let the inference engine win whenever it can generate a result, and otherwise fall back to the ChatGPT output. That might fix it to some degree.
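Roughly what I have in mind, as a minimal sketch rather than a real design. Everything here is hypothetical: `answer`, `toy_engine` and `toy_llm` are made-up names, the "inference engine" is just a lookup table standing in for an actual rule system, and the "LLM" is a stub rather than a call to ChatGPT.

```python
from typing import Callable, Optional


def answer(
    question: str,
    infer: Callable[[str], Optional[str]],
    generate: Callable[[str], str],
) -> tuple[str, str]:
    """Return (source, text), preferring the inference engine when it has a result."""
    exact = infer(question)
    if exact is not None:
        # The engine could derive an answer, so it wins outright.
        return ("engine", exact)
    # No derivable answer: fall back to the generative model.
    return ("llm", generate(question))


# Hypothetical stand-in for an inference engine: returns None when it
# cannot derive anything.
def toy_engine(question: str) -> Optional[str]:
    facts = {"2 + 2": "4", "capital of France": "Paris"}
    return facts.get(question)


# Hypothetical stand-in for the language model call.
def toy_llm(question: str) -> str:
    return f"(model's best guess for: {question})"


if __name__ == "__main__":
    print(answer("2 + 2", toy_engine, toy_llm))
    print(answer("write me a limerick", toy_engine, toy_llm))
```

Returning the source alongside the text lets the caller tell a derived answer from a guess. The hard part this skips is getting a natural-language question into a form the engine can actually reason over.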