Monday, December 23, 2024

A practical guide to making your AI chatbot smarter with RAG

Hands on If you’ve been following enterprise adoption of AI, you’ve no doubt heard the term “RAG” tossed around.

Short for retrieval augmented generation, the technology has been heralded by everyone from Nvidia’s Jensen Huang to Intel’s savior-in-chief Pat Gelsinger as the thing that’s going to make AI models useful enough to warrant investment in relatively pricey GPUs and accelerators.

The idea behind RAG is simple: Instead of relying on a model that’s been pre-trained on a finite amount of public information, you can take advantage of an LLM’s ability to parse human language to interpret and transform information held within an external database.

Critically, this database can be updated independently of the model, allowing you to improve or freshen up your LLM-based app without needing to retrain or fine-tune the model every time new information is added or old data is removed.

But before we demo how RAG can be used to make pre-trained LLMs such as Llama3 or Mistral more useful and capable, let’s talk a little more about how it works.

At a very high level, RAG uses an embedding model to convert a user’s prompt into a numeric format. This so-called embedding is then matched against information stored in a vector database. This database can contain all manner of information, such as a business’s internal processes, procedures, or support docs. If a match is found, the prompt and the matching information are then passed on to a large language model (LLM), which uses them to generate a response.
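
As a rough illustration of that flow, the sketch below substitutes a bag-of-words counter for the embedding model and a plain Python list for the vector database. The sample documents, the `min_score` threshold, and the prompt template are all assumptions made up for this example; a real deployment would use a trained embedding model and a proper vector store.

```python
import math
import re
from collections import Counter

# Toy "vector database": in practice these would be chunks of your own docs.
DOCS = [
    "To reset your VPN password, open the self-service portal and choose Reset.",
    "Expense reports must be filed within 30 days of purchase.",
    "Podman can be installed on Rocky Linux with dnf install podman.",
]


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts (a real model returns dense floats)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(prompt: str, docs: list[str], min_score: float = 0.1):
    """Return the best-matching document, or None if nothing is close enough."""
    query = embed(prompt)
    best = max(docs, key=lambda d: cosine(query, embed(d)))
    return best if cosine(query, embed(best)) >= min_score else None


def build_llm_input(prompt: str, docs: list[str]) -> str:
    """Prepend the retrieved context to the prompt before it goes to the LLM."""
    context = retrieve(prompt, docs)
    if context is None:
        return prompt  # no match: the model falls back on its training data
    return f"Use this context to answer.\n\nContext: {context}\n\nQuestion: {prompt}"
```

Calling `build_llm_input("How do I install Podman on Rocky Linux?", DOCS)` returns the question with the Podman line attached as context, which is the text the LLM would actually see.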

It essentially makes the output from the LLM much more focused on the specific context of the given database, as opposed to having the model solely rely on what it learned during its general-purpose training. That should, ideally, result in more relevant and accurate answers, making it all more useful.

Now obviously, there’s a lot more going on behind the scenes, and if you are really curious we recommend checking out Hugging Face’s extensive post on the subject. But the main takeaway is that RAG allows pre-trained LLMs to generate responses beyond the scope of their training data.

Turning an AI chatbot into your RAG-time pal

There are a number of ways to augment pre-trained models using RAG depending on your use case and end goal. Not every AI application needs to be a chatbot. However, for the purposes of this tutorial, we’re going to be looking at how we can use RAG to turn an off-the-shelf LLM into an AI personal assistant capable of scouring our internal support docs and searching the web.

To do this, we’ll be using a combination of the Ollama LLM runner, which we looked at a while back, and the Open WebUI project.

As its name suggests, Open WebUI is a self-hosted web GUI for interacting with various LLM-running things, such as Ollama, or any number of OpenAI-compatible APIs. It also can be deployed as a Docker container which means it should run just fine on any system that supports that popular container runtime.

More importantly for our purposes, Open WebUI is one of the easiest platforms for demoing RAG on LLMs like Mistral, Meta’s Llama3, Google’s Gemma, or whatever model you prefer.

Prerequisites

  1. You’ll need a machine that’s capable of running modest LLMs such as Llama3-8B at 4-bit quantization. For this we recommend a compatible GPU with at least 6 GB of vRAM (Ollama supports Nvidia and select AMD cards; you can find a full list here), but you may be able to get by with less by switching to a smaller model like Gemma 2B. For Apple Silicon Macs, we recommend one with at least 16 GB of memory.
  2. This guide assumes you already have Ollama set up and running on a compatible system. If you don’t, you can find our guide here, which should have you up and running in less than ten minutes.
  3. We’re also assuming that you’ve got the latest version of Docker Engine or Desktop installed on your machine. If you need help with this, we recommend checking out the docs here.
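
The memory recommendation in the first prerequisite follows from some simple arithmetic: the weights occupy roughly the parameter count multiplied by the bits per weight, divided by eight, in bytes. Here is a minimal sketch of that estimate. The helper name is ours, and it counts weights only; the context cache and runtime overhead come on top, and real quantized files tend to be somewhat larger than this idealized figure.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


print(weight_memory_gb(8, 4))   # an 8B model at 4-bit: about 4 GB of weights
print(weight_memory_gb(8, 16))  # the same model at 16-bit: about 16 GB
print(weight_memory_gb(2, 4))   # a 2B model at 4-bit: about 1 GB
```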

Deploying Open WebUI using Docker

The easiest way to get Open WebUI running on your machine is with Docker. This avoids having to wrangle the wide variety of dependencies required for different systems so we can get going a little faster.

Assuming Docker Engine or Desktop is installed on your system — we’re using Ubuntu Linux 24.04 for our testing, but Windows and macOS should also work — you can spin up a new Open WebUI container by running the following command:

docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Depending on your system you may need to run this command with elevated privileges. For a Linux box you’d use sudo docker run or in some cases doas docker run.

If you plan to use Open WebUI in a production environment that’s open to the public, we recommend taking a closer look at the project’s deployment docs here, as you may want to deploy both Ollama and Open WebUI as containers. However, doing so will require passing through your GPU to a Docker container, which is beyond the scope of this tutorial.

Note: Windows and macOS users will need to enable host networking under the “Features in Development” tab in the Docker Desktop settings panel.

Mac and Windows users will need to enable host networking in Docker Desktop before spinning up the Open WebUI container

After about a minute the container should be running and you can access the dashboard by visiting http://localhost:8080. If you’re running Open WebUI on a different machine or server, you’ll need to replace localhost with its IP address or hostname, and make sure port 8080 is open on its firewall or otherwise reachable by your browser.

If everything worked correctly, you should be greeted with Open WebUI’s login page, where you can click the sign up button to create an account. The first account you create will automatically be promoted to the admin user.

The first user you create in Open WebUI will automatically be promoted to administrator

Connecting Open WebUI to Ollama

Open WebUI is only the front end, and it needs to connect via an API locally with Ollama or remotely using OpenAI to function as a chatbot. When we created our Open WebUI container it should have configured itself to look for the Ollama webserver at http://127.0.0.1:11434. However, if Ollama is running on a different port or machine you can adjust this under connections in the settings menu.
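
If you want to confirm that Ollama is answering on that address before digging through settings, a short script can ask its REST API which models are installed. This is a sketch that assumes the default address used above; `/api/tags` is Ollama's endpoint for listing local models, while the function names are ours.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://127.0.0.1:11434"  # assumed: the default address used above


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_BASE_URL, timeout: float = 5.0) -> list[str]:
    """Ask a running Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    try:
        print(list_local_models())
    except OSError as err:  # connection refused, timeout, DNS failure, etc.
        print(f"Could not reach Ollama at {OLLAMA_BASE_URL}: {err}")
```

An empty list means the server is up but no models have been pulled yet; an error suggests the address or port needs adjusting.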

Open WebUI should automatically connect to Ollama on its default port, and if it doesn’t, you can manually set its API address in settings

Downloading a model

Now that we’ve got Open WebUI talking to Ollama, we can test and make sure it’s actually working by downloading a model and asking it a question.

From the Open WebUI homepage, start by clicking “select model,” then type in the name and tag of the model you’d like to use and click “pull” to download it to your system.

Downloading a model is rather straightforward. Just enter the name of the LLM you want and press ‘pull’

You can find a full list of models available on Ollama’s website here, but for the purposes of this tutorial we’re going to use a 4-bit quantized version of Meta’s recently announced Llama3 8B model. Depending on the speed of your connection and the model you choose, this could take a few minutes.

If you’re having trouble running Llama3-8B, your GPU may not have enough vRAM. Try using a smaller model like Gemma:2B instead.

Next, let’s query the chatbot with a random question to make sure Open WebUI and Ollama are actually talking to one another.

If everything is set up properly, the model should rattle off a response to your prompts just as soon as it’s been loaded into vRAM
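
The same sanity check can be run without the browser by sending a prompt straight to Ollama's `/api/generate` endpoint. The sketch below assumes Ollama is on its default address and that a model tagged `llama3:8b` has been pulled; swap in whatever tag you downloaded. The helper names are ours.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_generate_request(model, prompt), timeout=timeout) as resp:
        return json.load(resp)["response"]


# Example (requires a running Ollama server with the model pulled):
# print(ask("llama3:8b", "Why is the sky blue?"))
```

The generous timeout is deliberate: the first request has to wait for the model to load into vRAM.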

Integrating RAG

Now that we’ve got a working chatbot, we can start adding documents to our RAG vector database. To do this, head over to the “Workspace” tab and open “Documents.” From there you can upload all manner of documents, including PDFs.

You can upload your docs on the Documents page under the Workspace tab

In this example, we’ve uploaded a PDF support document containing instructions for installing and configuring the Podman container runtime in a variety of scenarios.

Open WebUI defaults to using the Sentence-Transformers/all-MiniLM-L6-v2 model to convert your documents into embeddings, which are what get searched for passages relevant to your prompt before they are handed to Llama3 or whatever LLM you’re using. In “Document Settings” (located under “Admin Settings” in the latest release of Open WebUI) you can change this to use one of Ollama’s or OpenAI’s embedding models instead. However, for this tutorial we’re going to stick with the default.
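
Before a document is embedded, it is typically split into smaller overlapping chunks so that each embedding covers a focused passage. The sketch below shows one simple character-based way to do that. The chunk size and overlap values are arbitrary assumptions for illustration and are not Open WebUI's actual settings or splitting logic.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.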

You can also change the embedding model under ‘Document Settings’ if you want to try something different

Putting it to the test

Now that we’ve uploaded our documents, Open WebUI can use Llama3, or whatever model you prefer, to answer queries about information that the neural network may not have been trained on.

To test this out, we’ll first ask the chatbot a question relevant to the document we uploaded earlier. In this case we’ll be asking Llama3: “How do I install Podman on a RHEL-based distro like Rocky Linux?”

Unless we tell the model to reference our doc, it’ll make something up on its own

In this case, Llama3 quickly responds with a generic answer that, for the most part, looks accurate. This shows how widely trained Llama3 is, but it’s not actually using RAG to generate answers yet.

To do that, we need to tell the model which docs we’d like to search by typing “#” at the start of our query and selecting the file from the drop down.

To query a document, start your prompt with a ‘#’ and select the file from the drop down

Now when we ask the same question, we get a far more condensed version of the instructions that not only more closely reflects the content of our Podman support document, but also includes additional details that we’ve deemed useful, such as installing podman-compose so we can use docker-compose files to spin up Podman containers.

With the document selected, the model response is based on the information available in it

You can tell the model is using RAG to generate this response because Open WebUI shows the document that it based its response on. And, if we click on it, we can look at the specific embeddings used.

Tagging documents

Naturally, having to name the specific file you’re looking for every time you ask a question isn’t all that helpful if you don’t already know which doc to search. To get around this, we can tell Open WebUI to query all documents with a specific tag, such as “Podman,” or “Support.”

We apply these tags by opening up our “Documents” panel under the “Workspace” tab. From there, click the edit button next to the document we’d like to tag, then add the tag in the dialogue box before clicking save.

If you want to query multiple documents you can tag them with a common phrase, such as ‘support’

We can now query all documents with that tag by typing “#” followed by the tag at the start of our prompt. For example, since we tagged the Podman doc as “Support” we’d start our prompt with “#Support”.
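
Conceptually, the “#” prefix narrows the search to a subset of the document library before any retrieval happens. The following is a hypothetical model of that selection step, not Open WebUI's code: the `Doc` class, the file names, and the matching rules are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    name: str
    tags: set[str] = field(default_factory=set)


def select_docs(prompt: str, library: list[Doc]) -> tuple[list[Doc], str]:
    """Split a leading '#reference' off the prompt and return the docs it selects."""
    if not prompt.startswith("#"):
        return [], prompt  # no reference: nothing is searched
    reference, _, question = prompt[1:].partition(" ")
    wanted = reference.lower()
    hits = [
        doc for doc in library
        if doc.name.lower() == wanted or wanted in {tag.lower() for tag in doc.tags}
    ]
    return hits, question.strip()


library = [
    Doc("podman.pdf", {"Support", "Podman"}),
    Doc("vpn.pdf", {"Support"}),
    Doc("expenses.pdf", {"Finance"}),
]
hits, question = select_docs("#Support How do I install Podman?", library)
print([doc.name for doc in hits], "|", question)
```

Here the “Support” tag selects both the Podman and VPN documents, so retrieval runs across the pair rather than a single named file.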

Your personal Perplexity

Open WebUI’s implementation of RAG isn’t limited to uploaded documents. With a few tweaks you can use a combination of RAG and large language models to search and summarize the web, similar to the Perplexity AI service.

Perplexity works by converting your prompt into a search query, and then summarizing what it believes to be the most relevant results, with footnotes linking back to its sources. We can do something very similar using Ollama and Open WebUI: search Google or some other search provider, take its top three results, and use them to generate a cited answer to our prompt.

In this tutorial we’ll be using Google’s Programmable Search Engine (PSE) API to create a web-based RAG system for querying El Reg articles, but you can configure yours to search the entire web or specific sites. To do this we’ll need to get both a PSE API key and an Engine ID. You can find Google’s documentation on how to generate both here.

Next, we’re going to take the PSE API key and Engine ID, enable Web Search under the “Web Search” section of Open WebUI’s “Admin Settings” page, select “google_pse” as our search engine, enter our API key and Engine ID in the relevant fields, and click save.

To take advantage of web search based RAG, you’ll need to obtain an API key and Engine ID for your search provider

In this section we can also adjust the number of sites to check for information relevant to our prompt.
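
To see what the API key and Engine ID are used for, it helps to look at the request they end up in. The sketch below builds a query against Google's Custom Search JSON API, which backs Programmable Search Engine, and pulls the links out of a response. It illustrates the API itself rather than how Open WebUI calls it, and the key and engine ID shown are placeholders.

```python
import json
import urllib.parse
import urllib.request

PSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query: str, api_key: str, engine_id: str, num_results: int = 3) -> str:
    """Build a Custom Search JSON API URL; `cx` is the engine ID, `num` caps results."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num_results}
    return f"{PSE_ENDPOINT}?{urllib.parse.urlencode(params)}"


def top_links(response: dict) -> list[str]:
    """Pull the result URLs out of a parsed API response."""
    return [item["link"] for item in response.get("items", [])]


def search(query: str, api_key: str, engine_id: str, num_results: int = 3) -> list[str]:
    """Run the search and return the result URLs (needs valid credentials)."""
    url = build_search_url(query, api_key, engine_id, num_results)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return top_links(json.load(resp))


# Placeholder credentials, for illustration only:
print(build_search_url("rag tutorial", "YOUR_API_KEY", "YOUR_ENGINE_ID"))
```

The `num_results` argument plays the same role as the result count setting mentioned above: more pages give the model more context, but also more text to fetch and embed per query.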

Testing it out

Once we’ve done that, all we need to do to make use of our personal Perplexity is to tell Open WebUI to search the web for us. In a new chat, click the “+” button and check “search web”, then enter your prompt as you normally would.

Open WebUI’s web search function isn’t enabled by default, so be sure to enable it before entering your prompt

In this example, we’re asking Llama3 a question about an event that occurred after the model was trained, and which it would therefore have no knowledge of. However, because the model is only summarizing an online article, it’s able to respond.

The sources used to generate the model’s response are listed at the bottom

Now, it’s important to remember that it’s still an LLM interpreting these results and thus it still can and will make mistakes or potentially hallucinate. In this example, Llama3 seems to have pulled the relevant details, but as you can see, its search didn’t exclude forum posts that are also indexed by Google.

It could just as easily have pulled and summarized a comment or opinion with incorrect, misleading, or biased information, so you still have to check your sources. That, or block-list URLs you don’t want included in your queries.
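
A block-list of that kind boils down to comparing each result's hostname against a set of unwanted domains before the page is fetched and summarized. Here is a minimal sketch of the idea; the blocked domain is a made-up placeholder, and this is not a description of any Open WebUI setting.

```python
from urllib.parse import urlparse

BLOCKED_HOSTS = {"forums.example.com"}  # hypothetical domain, for illustration


def is_blocked(url: str, blocked: set[str] = BLOCKED_HOSTS) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


def filter_results(urls: list[str], blocked: set[str] = BLOCKED_HOSTS) -> list[str]:
    """Drop search results that point at block-listed hosts."""
    return [url for url in urls if not is_blocked(url, blocked)]
```

Matching on the parsed hostname rather than on a substring of the URL avoids accidentally blocking pages that merely mention the domain in their path or query string.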

The Register aims to bring you more on using LLMs and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we’d love to hear about them in the comments section below. ®


Full disclosure: Nvidia loaned The Register an RTX A6000 Ada Generation graphics card for us to use to develop stories such as this one after we expressed an interest in producing coverage of practical AI applications. Nvidia had no other input.
