It is literally impossible to complete the assessment task in this course. There's a Q&A chatbot that has information about people's professions and color preferences. There's someone who is a Warden/ranger. The task is to use their name to perform a prompt injection so the bot tells you their favorite color is a shade of blue.
You can use the answer_question function to interact with the bot, e.g. answer_question("What is Will Pearce's favorite color?").
Blue never comes back as a color, in any form, in this assignment. I can't use prompt injection to change the warden's favorite color from green to blue. How do we solve this?
@jolucas, thanks for your response. I've generated thousands of answer_question() calls with no result. I even forced the dataset so that every generated person only liked the color blue, and the assessment still failed even though the resulting answer was correct.
TY for the hints. I still can't get it, and I've written some pretty malicious code. I've set up Colab to avoid using the GPU and am hammering away at this nonstop. I want to cry.
Here are the LLM limitations I've observed (sketched in code after this list):
- The system retrieves context using get_similar_strs(), then uses that context in answer_question().
- The LLM seems to prioritize information from the retrieved context over information in the query.
- The database has a strict format: "[Name] is a [Profession] and likes [Color]", and the LLM strongly favors this exact structure when processing information.
- The LLM appears to:
  - first retrieve similar statements,
  - then answer based on the retrieved context,
  - and ignore contradictory information in the question itself.
- The LLM seems to treat the retrieved context as ground truth and resists attempts to override it with new information in the query.
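Here is roughly how I understand the flow. get_similar_strs() and answer_question() are the course's functions, but the bodies below are simplified stand-ins I wrote, and the rows and colors are placeholders, not the real dataset:

```python
# Simplified stand-ins for the course's get_similar_strs() and answer_question();
# the function bodies and the dataset rows below are assumptions, not the actual code.

DATABASE = [
    "Will Pearce is a Warden and likes Green",         # placeholder rows in the
    "Ada Example is a Blacksmith and likes DarkBlue",  # "[Name] is a [Profession] and likes [Color]" format
]

def get_similar_strs(query, k=2):
    """Rank database rows by naive word overlap with the query (stand-in retriever)."""
    q_words = set(query.lower().split())
    return sorted(
        DATABASE,
        key=lambda row: len(q_words & set(row.lower().split())),
        reverse=True,
    )[:k]

def answer_question(question):
    """Retrieve context, then combine it with the question into a single prompt."""
    context = "\n".join(get_similar_strs(question))
    prompt = (
        "Answer the question using the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )
    return prompt  # the real function presumably sends this prompt to the LLM

print(answer_question("What is Will Pearce's favorite color?"))
```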
Out of respect for other folks who have solved this, I want to make sure I don’t just give you the answer. There are multiple ways to solve this, but I appreciate that it is challenging and requires you to think about LLMs and RAG applications at a fundamental level.
If you need it, shades of blue do exist in the dataset. You don't need "blue"; for example, "DarkBlue" would work.
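Assuming the dataset's color names follow the CSS/X11 naming convention (names like "DarkBlue" suggest this, but it is an assumption), matplotlib's built-in color table is a quick way to list the blue shades:

```python
# Assumption: the dataset's colors use CSS/X11-style names like "DarkBlue".
# If so, matplotlib's CSS4 color table lists the available blue shades.
from matplotlib.colors import CSS4_COLORS

blue_shades = sorted(name for name in CSS4_COLORS if "blue" in name)
print(blue_shades)  # includes 'darkblue', 'powderblue', 'steelblue', ...
```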
You can see the exact architecture of this RAG application in the provided code. This is a publicly available model and we did no finetuning. Your statements about the model's preference for retrieved information and sentence structure are incorrect. The model has no "preference" or even a way to distinguish between the two types of information; that's an assumption that you're carrying into the problem. Is there another, simpler reason it may feel like the model has a preference for retrieved information?
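Concretely, a typical RAG prompt (illustrative only, not necessarily the course's actual template) just pastes the retrieved rows and the question into one string:

```python
# Illustrative template, not the course's actual code. Retrieved rows and the
# user's question end up in one undifferentiated string, so the model itself
# has no way to tell which sentence came from the database and which from you.
retrieved = "Will Pearce is a Warden and likes Green"  # placeholder row
question = "What is Will Pearce's favorite color?"

prompt = f"Context:\n{retrieved}\nQuestion: {question}\nAnswer:"
print(prompt)
```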
I think the reason is that subtle changes in my prompts change the output in ways I still don't understand. I have a prompt about the warden that gets the color back as powderblue, but the response also contains extraneous text about a different, fictitious person. If I adjust the prompt so the response omits the fictitious person, the LLM's answer is the color green. It's very finicky.
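My best guess at why it's so sensitive, sketched below with a made-up word-overlap scorer rather than the real retriever: slightly different wording changes which rows come back as "similar", and therefore which context the answer is built from.

```python
# Toy word-overlap scorer -- an assumption, not the course's real similarity
# function. It only illustrates that small wording changes can alter which
# rows get retrieved, and hence the context the LLM answers from.
db = [
    "Will Pearce is a Warden and likes Green",          # placeholder rows
    "Ada Example is a Blacksmith and likes PowderBlue",
]

def overlap(query, row):
    return len(set(query.lower().split()) & set(row.lower().split()))

queries = [
    "What is Will Pearce's favorite color?",
    "Which Blacksmith likes PowderBlue and what is Will's favorite color?",
]
for q in queries:
    best = max(db, key=lambda row: overlap(q, row))
    print(f"{q!r} -> retrieves {best!r}")
```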