
Building a Scalable Chatbot Backend with Vertex AI and Retrieval-Augmented Generation (RAG)

Learn how to build a powerful, scalable chatbot backend using Google Cloud's Vertex AI and Retrieval-Augmented Generation (RAG) techniques. This blog covers the integration of Gemini models, the setup of a RAG corpus, and deploying via Google Cloud Functions, with key insights on overcoming data access challenges.


In a recent project, I developed a backend for a chatbot leveraging Google Cloud's Vertex AI and its Generative AI capabilities, specifically using Gemini models. Additionally, I implemented a Retrieval-Augmented Generation (RAG) technique to enhance the chatbot's responses with contextually relevant information from a predefined corpus. In this post, I'll walk you through the key components of this project, including how I learned and applied RAG, how I used Google Cloud Functions to expose a single HTTP endpoint for the frontend, and some important lessons learned along the way.

Overview of the Project

The goal was to create a backend service capable of handling user prompts sent from a frontend application and responding with contextually enriched answers generated by a Gemini model. To achieve this, I combined Vertex AI's generative models with RAG, a process that retrieves information from a structured corpus before generating a response. The integration with Google Cloud Functions made it easy to expose this functionality through a simple HTTP endpoint.

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a method that enhances generative models by incorporating relevant information retrieved from a specific data source. The key advantage of RAG is that it allows the model to generate more accurate and contextually aware responses by utilizing external data, which is particularly useful in scenarios where the model's knowledge may be outdated or incomplete.

To implement RAG in this project, I followed the Google Cloud Vertex AI documentation on RAG. The documentation provided a comprehensive guide on how to configure a RAG corpus and how to use it effectively with the generative models available in Vertex AI.
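
Conceptually, the flow is simple: retrieve the passages most relevant to the user's prompt, prepend them as context, and let the model answer with that context in view. Here is a minimal, library-free sketch of the idea; retrieve and generate are hypothetical placeholders for whatever retriever and model call you use, not part of the Vertex AI SDK:

def answer_with_rag(prompt, retrieve, generate, top_k=3):
    # retrieve() returns the top_k text chunks most relevant to the prompt (placeholder).
    # generate() is the call to the generative model (placeholder).
    context_chunks = retrieve(prompt, top_k=top_k)
    augmented_prompt = (
        "Answer using the context below.\n\n"
        "Context:\n" + "\n\n".join(context_chunks) +
        f"\n\nQuestion: {prompt}"
    )
    return generate(augmented_prompt)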

Setting Up the RAG Corpus

The first step in implementing RAG was to create and configure a RAG corpus. This involved specifying the embedding model that would be used for vectorizing the data and defining the data source from which information would be retrieved. Here’s how I set it up:

import os

from vertexai.preview import rag


def _create_rag():
    display_name = os.getenv('RAG_DISPLAY_NAME')
    paths = [os.getenv('LANDING_PAGE_PATH')]  # Supports Google Cloud Storage and Google Drive links

    # Configure the embedding model, for example "text-embedding-004".
    embedding_model_config = rag.EmbeddingModelConfig(
        publisher_model="publishers/google/models/text-embedding-004"
    )

    # Create the corpus that will hold the vectorized documents.
    rag_corpus = rag.create_corpus(
        display_name=display_name,
        embedding_model_config=embedding_model_config,
    )

    # Import files into the RagCorpus, chunking them as they are embedded.
    rag.import_files(
        rag_corpus.name,
        paths,
        chunk_size=512,  # Optional
        chunk_overlap=100,  # Optional
        max_embedding_requests_per_min=900,  # Optional
    )

    return rag_corpus

This function initializes a RAG corpus using the specified embedding model and returns it. The import_files call then loads data into the corpus from the defined paths (e.g., Google Cloud Storage) and splits it into manageable chunks, with optional overlap, so that retrieval later returns focused, self-contained passages.

After creating the RAG corpus, I stored its resource name in an environment variable. This allowed the Cloud Function to reference the pre-existing corpus instead of creating a new one on every invocation, streamlining the process and reducing latency.
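
In practice this was a one-time step run locally rather than inside the function. A sketch of that step, assuming _create_rag() returns the corpus object as shown above:

# One-time setup (run locally, not in the Cloud Function):
corpus = _create_rag()
print(corpus.name)  # Full resource name of the corpus; store this value in the RAG_ID environment variable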

A Key Gotcha: Accessing Paths for Importing Data

One challenge I encountered was accessing the data paths used to import files into the RAG corpus. Initially, I used Google Drive links as the data source, but I quickly ran into issues: the Cloud Function running the RAG import lacked the necessary permissions and couldn't complete the authentication flow required to read from Google Drive.

Solution: Moving Data to Google Cloud Storage

To resolve this, I moved the data to a Google Cloud Storage bucket. This change made it easier for the Cloud Function to access the data because Google Cloud Storage integrates seamlessly with Google Cloud Functions.

However, simply moving the data wasn't enough. I had to ensure that the Cloud Function's service account had the correct permissions to read from the bucket. To do this, I assigned the Storage Legacy Bucket Reader role (roles/storage.legacyBucketReader) to the service account, which granted the read permissions the function needed to access the bucket paths I passed in.
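
If you prefer to grant that role programmatically rather than through the console, a sketch using the google-cloud-storage client looks roughly like this; the bucket name and service account email below are placeholders:

from google.cloud import storage


def grant_bucket_reader(bucket_name: str, service_account_email: str) -> None:
    # Grant the Cloud Function's service account read access to the bucket.
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.legacyBucketReader",
        "members": {f"serviceAccount:{service_account_email}"},
    })
    bucket.set_iam_policy(policy)


# Example (placeholder values):
# grant_bucket_reader("my-rag-data-bucket", "chatbot-fn@my-project.iam.gserviceaccount.com")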

Here's a snippet that reflects the correct setup:

paths = [os.getenv('GCS_BUCKET_PATH')]  # Now using Google Cloud Storage paths

This simple change—moving from Google Drive to Google Cloud Storage and adjusting the service account permissions—resolved the access issues and allowed the RAG import process to function smoothly.

Retrieving Relevant Information with RAG

Once the RAG corpus was set up, the next step was to create a retrieval tool that could be used to fetch relevant data during inference. This retrieval tool is essential for the chatbot to pull in contextually relevant information before generating a response.

import os

from vertexai.generative_models import Tool
from vertexai.preview import rag


def retrieve_tool():
    # Create a RAG retrieval tool backed by the pre-existing corpus.
    rag_retrieval_tool = Tool.from_retrieval(
        retrieval=rag.Retrieval(
            source=rag.VertexRagStore(
                rag_resources=[
                    rag.RagResource(
                        rag_corpus=os.getenv('RAG_ID'),  # Resource name of the existing corpus
                    )
                ],
                similarity_top_k=3,  # Optional: number of chunks to retrieve
                vector_distance_threshold=0.5,  # Optional: maximum distance for a match
            ),
        )
    )
    return rag_retrieval_tool

This retrieval tool searches the RAG corpus based on the prompt's content and returns the most relevant chunks. similarity_top_k controls how many chunks are retrieved, and vector_distance_threshold filters out matches that are too far from the query in embedding space, so I could tune how much, and how relevant, context the chatbot pulled in before answering.
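
Before wiring the tool into the model, it can be useful to query the corpus directly and inspect what the retriever returns for a given question. A sketch using the preview SDK's direct query helper, assuming rag.retrieval_query is available in your SDK version and reusing the same RAG_ID environment variable:

import os

from vertexai.preview import rag

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=os.getenv('RAG_ID'))],
    text="What services does the landing page describe?",  # Example query
    similarity_top_k=3,
    vector_distance_threshold=0.5,
)
print(response)  # Inspect the retrieved chunks and their source URIs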

Integrating with Google Cloud Functions

To make the backend accessible to the frontend application, I leveraged Google Cloud Functions to expose a single HTTP endpoint. This approach simplified the deployment process and allowed for easy scaling and maintenance.

Google Cloud Functions lets developers write simple, event-driven functions that can be triggered via HTTP requests. Following the Cloud Functions documentation, I created a function that handles incoming chatbot prompts, passes them through the RAG process, and returns the generated response.

Here’s the main function that ties everything together:

import json
import logging
import os

import functions_framework
import vertexai
from vertexai.generative_models import Content, GenerativeModel, Part

# Project and region for Vertex AI (assumed to be provided as environment variables)
PROJECT_ID = os.getenv('PROJECT_ID')
LOCATION = os.getenv('LOCATION')

logger = logging.getLogger(__name__)


@functions_framework.http
def run_inference(request):
    request_json = request.get_json(silent=True)

    if request_json and "prompt" in request_json:
        prompt = request_json["prompt"]
        # Rebuild the conversation history, if the frontend sent one.
        if "history" in request_json:
            history = [
                Content(role=turn["role"], parts=[Part.from_text(turn["message"])])
                for turn in request_json["history"]
            ]
        else:
            history = []
        logger.info(f"Received request for prompt: {prompt}")
        vertexai.init(project=PROJECT_ID, location=LOCATION)
        rag_retrieval_tool = retrieve_tool()
        model = GenerativeModel(model_name="gemini-1.5-flash-001", tools=[rag_retrieval_tool])

        # Generate a response grounded in the RAG corpus.
        chat = model.start_chat(history=history)
        response = chat.send_message(prompt)
        prompt_response = response.text
    else:
        prompt_response = "No prompt provided."

    return json.dumps({"response_text": prompt_response})

This function handles the HTTP request by extracting the user's prompt and any conversation history, initializing the Vertex AI environment, and generating a response using the RAG-enhanced Gemini model. The response is then returned as a JSON object.
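
From the frontend's perspective, the contract is a single POST with a JSON body. Here is a minimal sketch of calling the deployed function with Python's requests library; the URL is a placeholder for your function's trigger URL:

import requests

FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/run_inference"  # Placeholder

payload = {
    "prompt": "What services do you offer?",
    "history": [
        {"role": "user", "message": "Hi there"},
        {"role": "model", "message": "Hello! How can I help?"},
    ],
}

resp = requests.post(FUNCTION_URL, json=payload, timeout=60)
print(resp.json()["response_text"])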

Conclusion

By combining RAG with Vertex AI's generative models and deploying the backend via Google Cloud Functions, I was able to create a powerful, scalable chatbot backend. This project not only deepened my understanding of RAG and its practical applications but also introduced me to the benefits of using Google Cloud Functions for serverless deployments.

One of the critical lessons learned was the importance of correctly configuring data access for the RAG corpus. Ensuring the Cloud Function had the necessary permissions to read from Google Cloud Storage was a crucial step that resolved access issues and streamlined the development process.

If you're looking to build a chatbot or any application that requires intelligent, contextually aware responses, I highly recommend exploring RAG with Vertex AI. The flexibility and power it provides can significantly enhance the quality of your application's responses.