An Easy Introduction to Multimodal Retrieval Augmented Generation

Originally published at:

A retrieval-augmented generation (RAG) application has far greater utility if it can work with a wide variety of data types—tables, graphs, charts, and diagrams—and not just text. This requires a framework that can understand and generate responses by coherently interpreting textual, visual, and other forms of information. In this post, we discuss the challenges of…

Hi, Figure 4 in this article needs a correction. Figure 4 is the same as Figure 2 but with a different caption. Seems like a human error to me.

What about tables in a given PDF? Is there any other article on the NVIDIA blog that focuses on a table ingestion pipeline for RAG applications?

Good catch! I’ve updated Fig. 4. Let us know if you find any other bugs…

What about when you have engineering documents where the figures directly relate to the text?
For example, a document might have numbered or lettered parts, and then elsewhere it would refer to part XYZ in figure 123 and provide instructions related to it.

For Fig. 5, under what scenario will the user query go directly to the LLM (and bypass the guardrails and RAG)?

@jia.yi.wee, Along with the newly processed context, we present the user query back to the LLM before providing a response.
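To make that step concrete, here is a minimal sketch of how the retrieved context and the original user query are typically combined into a single prompt for the LLM. The prompt template and the `call_llm` placeholder are illustrative assumptions, not the exact implementation from the article.

```python
# Sketch of the final RAG generation step: the processed context chunks and
# the user query are assembled into one augmented prompt, which is then sent
# to the LLM. `call_llm` stands in for any chat-completion API.

def build_prompt(context_chunks, user_query):
    """Assemble the augmented prompt presented to the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )

def answer(context_chunks, user_query, call_llm):
    # The query is always presented together with the retrieved context.
    return call_llm(build_prompt(context_chunks, user_query))
```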

@gabemv Ideally, if the referencing text has useful information related to the query, the embedding model should take care of retrieving that particular chunk. Often, the pre-processing flow needs to be customized based on the kind of documents you work with. I would look more closely at how the text is being chunked.
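As one possible way to customize chunking for documents like the engineering manuals described above, here is a hedged sketch that tags each paragraph-level chunk with the figure numbers it mentions, so retrieval can later pull every chunk tied to the same figure. The regex and the paragraph granularity are illustrative assumptions on my part, not from the article.

```python
import re

# Matches references like "Figure 123", "Fig. 123", or "fig 123" and
# captures the figure number.
FIGURE_REF = re.compile(r"\b[Ff]ig(?:ure)?\.?\s*(\d+)")

def chunk_with_figure_tags(text):
    """Split text into paragraph chunks, tagging each with figure numbers.

    Chunks that mention the same figure can then be retrieved together,
    keeping part instructions linked to the figure they describe.
    """
    chunks = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        figures = sorted(set(FIGURE_REF.findall(para)))
        chunks.append({"text": para, "figures": figures})
    return chunks
```

A downstream retriever could filter or boost chunks whose `figures` list overlaps with the figure mentioned in the user's query.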