There's something I'm not sure I understand. It seems there is a set of expectations that you are evaluating against. Are these expectations also written in the prompt?
The expectations/rules are usually written in the prompt. However, we see that as prompts get bigger, the model has too much to keep track of and stops following all of the instructions.
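
To make the idea of evaluating against expectations concrete, here is a minimal sketch in which each expectation is checked separately against the model's output, so a missed instruction shows up by name. Everything in it is an assumption for illustration: the `Expectation` class, the `evaluate` function, and the three sample rules are hypothetical and not taken from any particular tool or from the setup discussed here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Expectation:
    """One rule the model output is expected to satisfy (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


def evaluate(output: str, expectations: list[Expectation]) -> dict[str, bool]:
    """Run every expectation against the output and report pass/fail per rule."""
    return {e.name: e.check(output) for e in expectations}


# Illustrative rules only; real expectations would mirror what your prompt asks for.
EXPECTATIONS = [
    Expectation("under_50_words", lambda text: len(text.split()) <= 50),
    Expectation("no_apology", lambda text: "sorry" not in text.lower()),
    Expectation("ends_with_question", lambda text: text.strip().endswith("?")),
]

if __name__ == "__main__":
    results = evaluate("Sorry, I can't help with that.", EXPECTATIONS)
    failed = [name for name, ok in results.items() if not ok]
    print(failed)  # ['no_apology', 'ends_with_question']
```

Keeping the rules as a separate list means the same expectations can be counted and tracked one by one, which makes it visible when a long prompt causes some of them to be dropped.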