Retrieval 1 - Steps to Perform RAG

As seen in the link below, LLMs are trained on large amounts of data.

https://gaillms.blogspot.com/2024/01/training-llm-dataset.html

But this data does not include proprietary data or data that is not freely available on the internet. When we give prompts to an LLM, it replies based on the data it was trained on. The problem arises when we want answers about data the LLM was not trained on. Two options are generally considered for this purpose.

1. Model Fine Tuning

2. Retrieval Augmented Generation (RAG)

Based on our requirements, we can decide between the two. Our topic of discussion is RAG.

Retrieval

It refers to retrieving external data such as PDFs, HTML pages, videos, etc.

Generation

It refers to generating output using the external data retrieved.

Why do we need RAG?

Consider a PDF document with 1000 pages about my company. This is the context within which we pose questions to the LLM; that is, the prompt expects answers based on the document.
For example, if we need to find from the document when the company was founded, the following prompt is given:
Prompt: "When was the company founded?"
We need to pass the entire 1000 pages along with the prompt to the LLM. The LLM will return the response
Response: "The company was founded in the year 2011."

Pitfalls in this approach

Working with LLMs this way has two problems:

1. Cost
The cost is determined by the number of tokens in the prompt plus the number of tokens in the response, and this gets expensive as the size of the prompt grows. Here the prompt tokens include not just the question but all 1000 pages.
2. Context window
Every LLM has a context window that caps the maximum number of tokens it can accept. This limit, too, must accommodate the 1000 pages we pass along with the prompt.
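
To get a feel for the cost side, we can count the tokens in a prompt before sending it. A minimal sketch using the tiktoken library (the encoding name and the per-token price are illustrative assumptions, not actual rates):

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models
# (assumption: check the tokenizer for the exact model you use).
enc = tiktoken.get_encoding("cl100k_base")

document_text = "..."  # imagine the full text of the 1000-page PDF here
prompt = "When was the company founded?"

n_tokens = len(enc.encode(prompt + "\n" + document_text))

# Hypothetical price of $0.50 per million input tokens, for illustration only.
print(f"Input tokens: {n_tokens}, estimated cost: ${n_tokens * 0.50 / 1_000_000:.4f}")
```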

OpenAI

We can check the pricing and context length of each model here:
https://platform.openai.com/docs/models 

What does RAG do?

To avoid these pitfalls we make use of RAG. We know that the answer to the prompt can be found in one or two pages of the document. Let's see a very simple implementation; the steps used in RAG are as follows.

1. Document Loading

We take the raw file and load it into a Document object (the text plus its metadata).
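
The step names in this post mirror LangChain's components, so the sketches below use LangChain; the library choice, file name, and parameters are assumptions, not the only way to do this. A minimal loading sketch:

```python
# pip install langchain-community pypdf
from langchain_community.document_loaders import PyPDFLoader

# "company.pdf" is a hypothetical 1000-page document about the company.
loader = PyPDFLoader("company.pdf")
docs = loader.load()  # one Document per page: page_content + metadata

print(len(docs), docs[0].metadata)
```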

2. Text Splitters

We split the Document into smaller chunks based on some logic, e.g. one or two paragraphs each.
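
A sketch using LangChain's RecursiveCharacterTextSplitter (the chunk size and overlap are assumptions; tune them for your document):

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Continuing with `docs` from the loading step.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # roughly 1-2 paragraphs of characters per chunk (assumption)
    chunk_overlap=100,  # small overlap so sentences aren't cut off mid-thought
)
chunks = splitter.split_documents(docs)
print(len(chunks))
```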

3. Embedding

The chunks from the previous step are converted into vectors using an embedding model. This is termed document embedding.
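
A sketch with OpenAI embeddings via LangChain (the model name is an assumption; any embedding model works, as long as the same one is used for the query later):

```python
# pip install langchain-openai  (requires OPENAI_API_KEY in the environment)
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed one chunk by hand just to see the shape of the output.
vector = embeddings.embed_query(chunks[0].page_content)
print(len(vector))  # dimensionality of the embedding vector
```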

4. Vector Store

Store these embedding vectors in a specialized database built to store and retrieve vectors.
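
A sketch using FAISS as an in-memory vector store (FAISS is one option among many, e.g. Chroma or Pinecone):

```python
# pip install faiss-cpu
from langchain_community.vectorstores import FAISS

# Embeds every chunk with the embedding model above and indexes the vectors.
vector_store = FAISS.from_documents(chunks, embeddings)
```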

5. Retrievers

Their task is to retrieve the relevant chunks (stored as vectors) from the vector store. First, the prompt (query) is embedded with the same embedding model. Second, using some similarity score (like cosine similarity) between the embedded query and the chunks in the vector store, the m closest chunks are retrieved.
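
A sketch of a retriever over the FAISS store above (k = 3 is an assumption, matching the example in the next step):

```python
# Retrieve the 3 chunks most similar to the query.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

relevant_chunks = retriever.invoke("When was the company founded?")
for doc in relevant_chunks:
    print(doc.page_content[:80])  # preview each retrieved chunk
```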

Prompting the LLM

The few (assume 3) relevant chunks from the 1000-page PDF that might contain the answer are passed to the LLM along with the query. The response is generated with the 3 chunks as context. As you may have guessed by now, if the context (the 3 chunks) does not contain the right answer, everything goes wrong.
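
A sketch tying it all together with a chat model (the model name and prompt wording are assumptions):

```python
from langchain_openai import ChatOpenAI

question = "When was the company founded?"
context = "\n\n".join(doc.page_content for doc in relevant_chunks)

llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke(
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(response.content)  # e.g. "The company was founded in the year 2011."
```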

The rest of the blog focuses on performing the individual steps.





