Suddenly, no matter what obvious jailbreak content I send, the API always responds with exactly the same values:
{'jailbreak': False, 'score': -0.567415780013319}
I’ve used the garak attack tool to test this API and, until today, the API always marked DAN-type jailbreak attempts as True
with a positive floating-point number for the score.
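
For anyone trying to confirm the same symptom, here is a minimal sketch of how the "same answer for every input" behavior could be checked offline. It assumes the responses have already been collected as Python dicts with the `jailbreak` and `score` keys shown above. The helper name `is_constant_output` and the values in `varied` are hypothetical and only serve as illustration; no real API call is made.

```python
def is_constant_output(responses):
    """Return True if there are at least two responses and all are identical."""
    if len(responses) < 2:
        # A single response cannot show that the output is stuck.
        return False
    first = responses[0]
    return all(r == first for r in responses[1:])


# The value reported above, repeated as if returned for three different prompts.
stuck = [{'jailbreak': False, 'score': -0.567415780013319}] * 3

# Made-up values showing what varying output might look like (hypothetical).
varied = [
    {'jailbreak': True, 'score': 0.9},
    {'jailbreak': False, 'score': -0.5},
]

print(is_constant_output(stuck))
print(is_constant_output(varied))
```

If the check returns True for a mix of clearly benign and clearly malicious prompts, that would suggest the score no longer depends on the input at all, rather than the detection threshold merely having shifted.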