Have you tried any of the state-of-the-art out there?
ChatGPT is easily manipulated into violating all of its own guardrails with a few instructions.
Knowledge without understanding.
It clearly doesn't understand its own rules well enough to realize that, given the rule "don't do C", being asked to do A->B->C is the same as being asked to do C.