Exploring Adversarial Machine Learning - 7_LLM Assessment

All,

It is literally impossible to complete the assessment task in this course. There's a Q&A chatbot that has information about people's professions and color preferences. There's someone who is a Warden/ranger. Use their name to perform a prompt injection that tells you their favorite color is a shade of Blue.

You can use the answer_question function to interact with the bot → answer_question("What is Will Pearce's favorite color?").

Blue is never accepted as a color (or as anything else) anywhere in this assignment. You can't use prompt injection to change the warden's favorite color from green to blue. How do we solve this?

@jbungo can you please help me? I'm burning GPU time on this task as well as the poisoning assessment. It seems to be an NVIDIA issue.

xxxxxxxx help

Folks have completed this assessment successfully and we’ve confirmed there are viable solutions.

Here are some tips:

  1. Look at how the complete LLM query is constructed. Specifically, think about how the data is combined in answer_question() (see the sketch after this list).
  2. Use words, phrases, and structures to steer the model towards your solution and away from data you can’t control.
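
For intuition, here is a minimal sketch of how a RAG-style answer_question() typically assembles its prompt. The helper name, template, and example sentence below are illustrative assumptions, not the notebook's exact code; the real construction is visible in the provided source.

def build_prompt(question: str, retrieved_context: list[str]) -> str:
    # Concatenate the retrieved sentences with the user-supplied question
    context_block = " ".join(retrieved_context)
    return f"Context: {context_block}\nQuestion: {question}\nAnswer:"

# The model receives one flat string; nothing in it marks which part came from
# the dataset and which part came from you.
print(build_prompt(
    "What is Alice Example's favorite color?",
    ["Alice Example is a Baker and likes Red."],
))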

@jolucas, thanks for your response. I've generated thousands of answer_question() statements with no result. I even forced the dataset so that every generated person liked only the color blue, and the assessment still failed even though the resulting answer was correct.

I have literally written thousands of prompt injections using the answer_question() mechanism, to no avail.

I confirmed again this morning that I can get the assessment to score successfully for me.

Your data forcing technique won’t work because the validation is done with a server-side dataset.

You’re on the right track with manipulating the prompt. Read my tips again.

  • Use get_similar_strs() to think about the retrieval process (sketched after this list).
  • Carefully examine answer_question() and think about how that retrieved information is used to augment the string you are providing.
  • Think about LLM limitations.
  • Structure your prompt to exploit those limitations.
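
Conceptually, the retrieval step is just nearest-neighbor search over sentence embeddings: whatever you put in your query only influences which stored sentences come back, never the text of those sentences. A toy sketch under that assumption (made-up sentences and random vectors stand in for the notebook's embed_model):

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_top_k(query_vec, sentence_vecs, sentences, k=3):
    # Return the k stored sentences whose embeddings are closest to the query
    scores = [cosine_sim(query_vec, v) for v in sentence_vecs]
    top = np.argsort(scores)[::-1][:k]
    return [sentences[i] for i in top]

# Toy example with random vectors in place of real embeddings
rng = np.random.default_rng(0)
toy_sentences = ["Alice Example is a Baker and likes Red.",
                 "Bob Example is a Pilot and likes Green."]
toy_vecs = rng.normal(size=(2, 8))
print(retrieve_top_k(rng.normal(size=8), toy_vecs, toy_sentences, k=1))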

TY for the hints. I still can't get it, and I've written some pretty malicious code. I've set up Colab to avoid using the GPU and am hammering away at this nonstop; I want to cry.

Here are the LLM limitations…

  • The system retrieves context using get_similar_strs()
  • Then uses that context in answer_question()
  • The LLM seems to prioritize information from retrieved context over information in the query
  • The database has a strict format: “[Name] is a [Profession] and likes [Color]”
  • The LLM strongly favors this exact structure when processing information
  • The LLM appears to:
    • First retrieve similar statements
    • Then answer based on retrieved context
    • Ignore contradictory information in the question itself
  • The LLM seems to treat the retrieved context as ground truth
  • It resists attempts to override this with new information in the query

Can you please provide more guidance?

@jolucas, the color blue does not exist; the faker seed is 4321. I'm so far down the wrong track!! Help!!

  1. get_similar_strs() gathers context, which means any question about Danny will always retrieve his “likes Green” statement
  2. The QA model uses this context to answer
  3. I can’t change stored facts
  4. I can’t manipulate the context retrieval
  5. I can't retrieve blue as a color for anything?

I don't remember this being covered in the LLM course. Please help me.

# Temporarily overwrite the stored context for Danny Shaffer
s.sentences = [
    "Danny Shaffer is a Warden/ranger and likes Blue."
]

# Re-embed the data so retrieval uses the updated context
s.dataset_embeddings = s.embed_model.encode(s.sentences)

# Run the injection query again
injection_query = "What is Danny Shaffer's favorite color?"
injected_response = s.answer_question(injection_query)
print("Injected Response:", injected_response)

# Revalidate to ensure the response persists
revalidated_response = s.answer_question(injection_query)
print("Revalidated Response:", revalidated_response)

# Check that the retrieved context includes only the new information
retrieved_context = s.get_similar_strs(injection_query)
print("Retrieved Context:", retrieved_context)

I seriously need help. I don’t see how this can be solved.

Out of respect for other folks who have solved this, I want to make sure I don’t just give you the answer. There are multiple ways to solve this, but I appreciate that it is challenging and requires you to think about LLMs and RAG applications at a fundamental level.

  1. If you need it, there are shades of blue that exist in the dataset. You don't need "blue"; for example, "DarkBlue" would work (see the snippet after this list).
  2. You can see the exact architecture of this RAG application in the provided code. This is a publicly available model and we did no fine-tuning. Your statements about the model's preference for retrieved information and sentence structure are incorrect. The model has no "preference", or even a way to distinguish between the two types of information – that's an assumption you're carrying into the problem. Is there another, simpler reason why it may feel like the model prefers retrieved information?
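
If it helps with tip 1, something like this can list the blue shades that are actually stored. This assumes the notebook exposes the sentence list as s.sentences and uses the "[Name] is a [Profession] and likes [Color]" format quoted earlier; adjust to whatever the provided code actually exposes.

# Collect the distinct colors containing "blue" from the stored sentences
blue_shades = sorted({
    sentence.split(" likes ")[-1].rstrip(".")
    for sentence in s.sentences
    if "blue" in sentence.lower()
})
print(blue_shades)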

I think the reason is that subtle changes in my prompts change the output in ways I still don't understand. I have a prompt about the warden that gets the color to come back as powderblue, but the response also contains extraneous text about a different fictitious person. If I adjust the prompt to omit any response about the fictitious person, the LLM's answer is the color green. It's very finicky.