Retrieval-Augmented Generation


Retrieval-Augmented Generation (RAG) is a Natural Language Processing (NLP) technique that combines a pre-trained language model with an information retrieval system to generate responses. It is especially useful when the language model needs information that is not in its training data.

RAG starts with a user's question or query. An information retrieval system then fetches relevant documents or passages, using keyword matching, semantic search, or other techniques. The retrieved documents serve as context for the language model, which generates a fluent response grounded in that context. This grounding makes the response more likely to be factually correct, but does not guarantee it.

When retrieval is based on semantic search, RAG proceeds in the following steps:

1. **Embedding Creation**: First, each document in the database is transformed into a semantic embedding using an embedding model. This model could be a pre-trained language model, a transformer model, or any other model capable of creating meaningful embeddings.

2. **Embedding Storage**: These embeddings are then stored in a database or an index. This allows for efficient retrieval of documents based on their embeddings.

3. **Query Embedding**: When a user asks a question, the question is also transformed into a semantic embedding using the same embedding model.

4. **Document Retrieval**: The system then retrieves the documents whose embeddings are most similar to the question's embedding. This is typically done with a nearest-neighbor search that ranks documents by a similarity measure such as cosine similarity.

5. **Answer Generation**: The retrieved documents are then provided as context to a language model, which generates a response.
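
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the "index" is a plain list, and `build_prompt` is a hypothetical helper that only assembles the text a language model would receive in step 5. The sample documents are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "The Eiffel Tower is located in Paris.",
    "Llama 2 is a family of large language models.",
    "Cosine similarity compares the direction of two vectors.",
]

# Steps 1-2: embed every document and store the embeddings in an index.
index = [(doc, embed(doc)) for doc in documents]

def retrieve(question, k=2):
    # Step 3: embed the question with the same embedding function.
    q = embed(question)
    # Step 4: rank documents by similarity and keep the k nearest.
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question, passages):
    # Step 5 (input side): the retrieved passages become the model's context.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "Where is the Eiffel Tower?"
top = retrieve(question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system, `embed` would call an embedding model, the list would be replaced by a vector database or index, and `prompt` would be passed to a language model to produce the answer.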

Parametric versus source memory

The effectiveness of RAG lies in its combination of parametric and source memory. Parametric memory is the static knowledge encoded in the model's parameters (weights and biases) during training. Source memory, often called non-parametric memory, is the external knowledge that the model accesses during inference, such as the documents retrieved in response to a user's question. Because source memory can be updated without retraining the model, this combination lets RAG generate responses that are fluent and coherent and that are also more likely to be factually accurate and up to date.
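
The practical consequence is that source memory can change while the model's weights stay fixed. The following sketch illustrates this under simplifying assumptions: `SourceMemory` is a hypothetical helper, retrieval uses plain word overlap as a stand-in for embedding similarity, and the documents are invented. No language model is involved; the point is only that the context available for the same question changes once a document is added.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class SourceMemory:
    """External document store consulted at inference time (hypothetical helper)."""

    def __init__(self):
        self.docs = []

    def add(self, doc):
        # Updating source memory is just adding a document; no training happens.
        self.docs.append(doc)

    def search(self, question):
        # Return the document sharing the most words with the question, if any.
        q = tokens(question)
        scored = [(len(q & tokens(d)), d) for d in self.docs]
        best_score, best_doc = max(scored, default=(0, None))
        return best_doc if best_score > 0 else None

memory = SourceMemory()
memory.add("The Eiffel Tower is located in Paris.")

question = "Which release added streaming support?"
before = memory.search(question)   # nothing relevant is stored yet

memory.add("Release 2.0 added streaming support.")
after = memory.search(question)    # the new document is now retrievable

print(before)
print(after)
```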

Semantic embedding

A critical component of RAG is semantic embedding, a technique that transforms words, phrases, sentences, or documents into vectors of real numbers that capture their semantic content. In RAG, semantic embeddings drive the document retrieval step. Each document in the database is transformed into an embedding by an embedding model and stored in a database or index for efficient retrieval. When a user asks a question, the question is embedded with the same model, and the documents whose embeddings are most similar to it are retrieved.
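
One common way to measure how similar two embeddings are is cosine similarity. As a worked formula, writing $\mathbf{q}$ for the question embedding and $\mathbf{d}$ for a document embedding, both of dimension $n$ (these symbols are introduced here for illustration):

$$
\operatorname{sim}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert} = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}
$$

The value lies between $-1$ and $1$, and a higher value means the two vectors point in more similar directions. Retrieval then returns the documents with the highest similarity to the question.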

The advantage of semantic embeddings is their ability to retrieve documents based on semantic content rather than just keyword matching. This allows the system to retrieve relevant documents even if they don't contain the exact words from the user's question, making RAG a more powerful and flexible approach compared to traditional information retrieval methods.
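
The difference from keyword matching can be illustrated with a small sketch. The three-dimensional vectors below are made up by hand purely for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions, and its scores would differ. The helper names `cosine` and `keyword_overlap` are likewise assumptions of this example.

```python
import math

# Hand-made 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "How do I fix my car?": [0.9, 0.1, 0.0],
    "Automobile repair guide": [0.8, 0.2, 0.1],
    "Car-shaped birthday cake recipe": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_overlap(a, b):
    """Number of words two texts share after simple normalization."""
    def words(s):
        return set(s.lower().replace("?", "").replace("-", " ").split())
    return len(words(a) & words(b))

query = "How do I fix my car?"
docs = ["Automobile repair guide", "Car-shaped birthday cake recipe"]

# Keyword matching prefers the document that repeats the word "car".
by_keywords = max(docs, key=lambda d: keyword_overlap(query, d))

# Embedding similarity prefers the document that is close in meaning.
by_embedding = max(docs, key=lambda d: cosine(toy_vectors[query], toy_vectors[d]))

print("keyword match:  ", by_keywords)
print("embedding match:", by_embedding)
```

With these toy vectors, keyword matching picks the cake recipe because it shares the word "car", while the embedding comparison picks the repair guide even though it shares no words with the question.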


James Briggs. "Better Llama 2 with Retrieval Augmented Generation (RAG)". James Briggs's YouTube channel, July 29, 2023.