RAG#

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.

The benefits of RAG include:

  • Updated information: provide LLMs with the latest information (LLMs are trained on data only up to a certain cutoff date).

  • Domain-specific knowledge: give LLMs access to information not seen during training, without the need to fine-tune them.

  • Factual grounding: reduce hallucinations by giving LLMs access to a knowledge base.

RAG diagram

Pipeline of the RAG process#

The key components of RAG are:

  • Embeddings model: A model trained to generate an embedding for a given input. This embedding is a high-dimensional vector that captures information about the input, such as its semantics. This way, we can determine the similarity between two inputs by calculating the distance between their embeddings (see the sketch after this list).

  • Vector Store: A specific kind of database for the efficient storage, indexing and querying of vector embeddings, i.e., numerical representations of unstructured data such as text, images or audio.

  • Retriever: Given an input, it gets the most similar chunks from the vector store by comparing the vector embeddings.

  • Generator (LLM): It receives the original query and the retrieved data, and generates the answer.
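
To make the similarity idea concrete, here is a minimal, hypothetical sketch (not part of BAF) that compares two toy embeddings with cosine similarity; the vectors and names are made up for illustration:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: close to 1.0 means very similar, close to 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models produce hundreds or thousands of dimensions)
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(dog, puppy))  # high value: semantically close
print(cosine_similarity(dog, car))    # low value: semantically distant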

BAF allows you to integrate this process into your agent. Our implementation uses LangChain, which is a framework for developing apps with LLMs. It comes with a wide library of resources we can use to customize our RAG’s components.

Let’s see how to seamlessly integrate a RAG component into our agent. You can also check the RAG agent for a complete example.

First of all, we create our agent:

from besser.agent.core.agent import Agent

agent = Agent('greetings_agent')
agent.load_properties('config.ini')
websocket_platform = agent.use_websocket_platform(use_ui=True)

Now, we have to create the RAG components using the LangChain library.

We will need:

Embeddings#

An embeddings model. We can use the OpenAI embeddings as an example:

from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key='api-key')
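
As a quick check, the embeddings model can be queried directly (illustrative snippet; it requires a valid OpenAI API key):

# embed_query returns the embedding of a single text as a list of floats
vector = embeddings.embed_query('What is RAG?')
print(len(vector))  # dimensionality of the embedding (e.g., 1536 for many OpenAI models)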

Note

  • Full list of LangChain Embeddings implementations.

Vector Store#

A VectorStore. We can use the Chroma vector store as an example:

from langchain_community.vectorstores import Chroma

vector_store: Chroma = Chroma(
    embedding_function=embeddings,
    persist_directory='vector_store'  # directory where we store the vector store, optional
)
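
Once populated (see Import data below), the vector store can also be queried directly; this is essentially what the retriever does under the hood (illustrative snippet, assuming documents have already been added):

# Retrieve the 4 chunks whose embeddings are closest to the query
docs = vector_store.similarity_search('What is RAG?', k=4)
for doc in docs:
    print(doc.page_content)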

Note

  • Full list of LangChain Vector Store implementations.

Text Splitter#

A TextSplitter to divide the documents into smaller chunks (only necessary if we want to load data into the vector store). LangChain provides different splitters for specific splitting criteria:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
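
For illustration, the splitter can be applied to a raw string; long texts are broken into chunks of at most chunk_size characters:

# split_text returns a list of string chunks; the 100-character overlap
# helps preserve context across chunk boundaries
chunks = splitter.split_text('A very long document about RAG... ' * 200)
print(len(chunks))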

Note

  • Full list of LangChain Text Splitter implementations.

LLM#

An LLM, using the BAF LLM wrappers:

from besser.agent.nlp.llm.llm_openai_api import LLMOpenAI

gpt = LLMOpenAI(agent=agent, name='gpt-4o-mini')
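
The wrapper can also be tried on its own. A minimal sketch, assuming the wrapper exposes a predict method and that the OpenAI API key is set in the agent properties (check the LLM documentation for the exact signature):

# Illustrative standalone call to the LLM wrapper (predict method assumed)
answer = gpt.predict('Summarize what RAG is in one sentence.')
print(answer)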

RAG#

Now we can create the RAG:

from besser.agent.nlp.rag.rag import RAG

rag = RAG(
    agent=agent,
    vector_store=vector_store,
    splitter=splitter,
    llm_name='gpt-4o-mini',
    k=4,  # Number of chunks to retrieve
    num_previous_messages=0  # Number of previous messages to add to the query
)

Note

The API docs contain full details on the RAG parameters.

Import data#

If you want to load data into the vector store, you can use our built-in method, which relies on LangChain’s PDF loader:

rag.load_pdfs('./pdfs')

Or use any of LangChain’s document loaders, for instance:

from langchain_community.document_loaders import TextLoader

loader = TextLoader("./index.md")
documents = loader.load()
chunked_documents = splitter.split_documents(documents)
vector_store.add_documents(chunked_documents)

Execution#

Finally, let’s use the RAG within a state (it can be used in both the body and the fallback body):

from besser.agent.core.session import Session
from besser.agent.nlp.rag.rag import RAGMessage

def rag_body(session: Session):
    # Option 1: it uses the last user message by default
    rag_message: RAGMessage = session.run_rag()
    # Option 2: use a custom message as input
    rag_message: RAGMessage = session.run_rag(message='custom message')
    # Option 3: run RAG without the session
    rag_message: RAGMessage = rag.run(message='custom message')

    # Reply with the generated answer
    session.reply(rag_message.answer)
    # Or use a specific method to reply with RAG messages (it displays the answer and the retrieved documents)
    websocket_platform.reply_rag(session, rag_message)

A RAGMessage is the object returned by the RAG. It contains the generated answer, the retrieved documents, and additional metadata.

The WebSocket platform includes a method to reply with this kind of message, and our Streamlit UI can display them within expander containers that show the retrieved documents to the user.
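
To complete the picture, the body can be wired into the agent like any other state body. A minimal sketch, assuming the usual BAF state API (the state name is illustrative; see the RAG agent example for a complete workflow):

# Hypothetical state wiring for the rag_body defined above
rag_state = agent.new_state('rag_state', initial=True)
rag_state.set_body(rag_body)

agent.run()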

API References#