<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Intel Tech - Medium]]></title>
        <description><![CDATA[The Intel Tech blog is designed to share the latest information on open source innovation and technical leadership. - Medium]]></description>
        <link>https://medium.com/intel-tech?source=rss----bcaa5b033cbb---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Intel Tech - Medium</title>
            <link>https://medium.com/intel-tech?source=rss----bcaa5b033cbb---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 14 May 2026 13:40:15 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/intel-tech" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Kubernetes autoscaling in OPEA v1.4]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/intel-tech/kubernetes-autoscaling-in-opea-v1-4-23d1c0681fec?source=rss----bcaa5b033cbb---4"><img src="https://cdn-images-1.medium.com/max/700/0*aidiJWz3WX4cKT07.png" width="700"></a></p><p class="medium-feed-link"><a href="https://medium.com/intel-tech/kubernetes-autoscaling-in-opea-v1-4-23d1c0681fec?source=rss----bcaa5b033cbb---4">Continue reading on Intel Tech »</a></p></div>]]></description>
            <link>https://medium.com/intel-tech/kubernetes-autoscaling-in-opea-v1-4-23d1c0681fec?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/23d1c0681fec</guid>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Wed, 03 Sep 2025 18:25:25 GMT</pubDate>
            <atom:updated>2025-09-03T18:25:25.002Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Deploying AI Agents Locally with Qwen3, Qwen-Agent, and Ollama]]></title>
            <link>https://medium.com/intel-tech/deploying-ai-agents-locally-with-qwen3-qwen-agent-and-ollama-cad452f20be5?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/cad452f20be5</guid>
            <category><![CDATA[ollama]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[agents]]></category>
            <category><![CDATA[qwen]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Benjamin Consolvo]]></dc:creator>
            <pubDate>Wed, 28 May 2025 19:46:29 GMT</pubDate>
            <atom:updated>2025-05-28T19:46:29.557Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JHVpXjM8gZA7gg6NVqaScg.jpeg" /><figcaption>Image generated by author. Prompt: “use your tools like my_image_gen to generate a bear with a hat”.</figcaption></figure><p>Ever wanted to run your own AI agent locally — without sending data to the cloud? As an AI software engineer at Intel, I’ve been exploring how to run open-source LLMs locally on AI PCs. With the smaller <a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen3</a> models, it’s totally possible. These models are compact enough to run on an AI PC and powerful enough to call tools and handle real tasks. Even the smaller variants of Qwen3 support tool calling, enabling you to build agentic workflows that can look up live websites, call functions, and execute code. This guide walks through how to build your own agentic workflows using Qwen3, Qwen-Agent, and Ollama — without relying on the cloud.</p><h4>Ollama Setup and Qwen3 Model Hosting</h4><p>To keep everything local and private, I used <a href="https://ollama.com/">Ollama</a> — a lightweight way to run open-source models right on your machine. Here’s how I got Qwen3 running on my AI PC using WSL2 (Windows Subsystem for Linux).</p><p>Install Ollama with the Linux command, taken from the <a href="https://ollama.com/download/linux">Ollama website</a>:</p><pre>curl -fsSL https://ollama.com/install.sh | sh</pre><p>Ollama makes it easy to host your model. After installing Ollama, simply run</p><pre>ollama run qwen3:8b</pre><p>and the ~5.2GB Qwen3:8b model should download and run locally, served by default at the local address <a href="http://localhost:11434/">http://localhost:11434/</a>. We will use this address later when building the agents with Qwen-Agent.</p><h4>Qwen-Agent Python Library</h4><p>Next, I populated a requirements.txt file with the Qwen-Agent library,</p><pre>qwen-agent[gui,rag,code_interpreter,mcp]<br>qwen-agent</pre><p>and installed it from the command line with</p><pre>pip install -r requirements.txt</pre><h4>Sample AI agent with Qwen3 using Qwen-Agent</h4><p>To build a sample AI agent with Qwen3, you can use the code snippet found in the <a href="https://github.com/QwenLM/Qwen-Agent">Qwen-Agent GitHub repository</a>. The only modifications I made were in the llm_cfg, changing the model to qwen3:8b and the model server to <a href="http://localhost:11434/v1">http://localhost:11434/v1</a>, and in the files list, pointing it to a PDF of a research paper called <em>Zheng2024_LargeLanguageModelsinDrugDiscovery.pdf</em>. In my case, I only made use of the built-in tool called my_image_gen to ask the LLM agent to use a tool to generate an image, but feel free to experiment with your own Qwen-Agent workflow. 
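Before wiring up the agent, you can sanity-check that the local endpoint is responding. Here is a minimal check (a sketch that assumes the openai Python package is installed; Ollama exposes an OpenAI-compatible API at this address):</p><pre># Quick sanity check of the local Ollama server via its OpenAI-compatible API.<br># Assumes the openai package is installed and that ollama run qwen3:8b is active.<br>from openai import OpenAI<br><br>client = OpenAI(base_url=&quot;http://localhost:11434/v1&quot;, api_key=&quot;EMPTY&quot;)<br>reply = client.chat.completions.create(<br>    model=&quot;qwen3:8b&quot;,<br>    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Say hello in one short sentence.&quot;}],<br>)<br>print(reply.choices[0].message.content)</pre><p>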
In this walkthrough, I’m showing how to create a simple AI agent that can generate an image based on your request — entirely locally using Qwen3.</p><pre>#from https://github.com/QwenLM/Qwen-Agent<br><br>import pprint<br>import urllib.parse<br>import json5<br>from qwen_agent.agents import Assistant<br>from qwen_agent.tools.base import BaseTool, register_tool<br>from qwen_agent.utils.output_beautify import typewriter_print<br><br><br># Step 1 (Optional): Add a custom tool named `my_image_gen`.<br>@register_tool(&#39;my_image_gen&#39;)<br>class MyImageGen(BaseTool):<br>    # The `description` tells the agent the functionality of this tool.<br>    description = &#39;AI painting (image generation) service, input text description, and return the image URL drawn based on text information.&#39;<br>    # The `parameters` tell the agent what input parameters the tool has.<br>    parameters = [{<br>        &#39;name&#39;: &#39;prompt&#39;,<br>        &#39;type&#39;: &#39;string&#39;,<br>        &#39;description&#39;: &#39;Detailed description of the desired image content, in English&#39;,<br>        &#39;required&#39;: True<br>    }]<br><br>    def call(self, params: str, **kwargs) -&gt; str:<br>        # `params` are the arguments generated by the LLM agent.<br>        prompt = json5.loads(params)[&#39;prompt&#39;]<br>        prompt = urllib.parse.quote(prompt)<br>        return json5.dumps(<br>            {&#39;image_url&#39;: f&#39;https://image.pollinations.ai/prompt/{prompt}&#39;},<br>            ensure_ascii=False)<br><br><br># Step 2: Configure the LLM you are using.<br>llm_cfg = {<br>    # Use the model service provided by DashScope:<br>    # &#39;model&#39;: &#39;qwen-max-latest&#39;,<br>    # &#39;model_type&#39;: &#39;qwen_dashscope&#39;,<br>    # &#39;api_key&#39;: &#39;YOUR_DASHSCOPE_API_KEY&#39;,<br>    # It will use the `DASHSCOPE_API_KEY&#39; environment variable if &#39;api_key&#39; is not set here.<br><br>    # Use a model service compatible with the OpenAI API, such as vLLM or Ollama:<br>    &#39;model&#39;: &#39;qwen3:8b&#39;,<br>    # &#39;model_server&#39;: &#39;http://localhost:8000/v1&#39;,  # base_url, also known as api_base<br>    &#39;model_server&#39;: &#39;http://localhost:11434/v1&#39;,  # Ollama<br>    &#39;api_key&#39;: &#39;EMPTY&#39;,<br><br>    # (Optional) LLM hyperparameters for generation:<br>    &#39;generate_cfg&#39;: {<br>        &#39;top_p&#39;: 0.8<br>    }<br>}<br><br># Step 3: Create an agent. 
# Here we use the `Assistant` agent as an example, which is capable of using tools and reading files.<br>system_instruction = &#39;&#39;&#39;After receiving the user&#39;s request, you should:<br>- first draw an image and obtain the image url,<br>- then run code `request.get(image_url)` to download the image,<br>- and finally select an image operation from the given document to process the image.<br>Please show the image using `plt.show()`.&#39;&#39;&#39;<br>tools = [&#39;my_image_gen&#39;, &#39;code_interpreter&#39;]  # `code_interpreter` is a built-in tool for executing code.<br>files = [&#39;Zheng2024_LargeLanguageModelsinDrugDiscovery.pdf&#39;]  # Give the bot a PDF file to read.<br>bot = Assistant(llm=llm_cfg,<br>                system_message=system_instruction,<br>                function_list=tools,<br>                files=files)<br><br># Step 4: Run the agent as a chatbot.<br>messages = []  # This stores the chat history.<br>while True:<br>    # For example, enter the query &quot;draw a dog and rotate it 90 degrees&quot;.<br>    query = input(&#39;\nuser query: &#39;)<br>    # Append the user query to the chat history.<br>    messages.append({&#39;role&#39;: &#39;user&#39;, &#39;content&#39;: query})<br>    response = []<br>    response_plain_text = &#39;&#39;<br>    print(&#39;bot response:&#39;)<br>    for response in bot.run(messages=messages):<br>        # Streaming output.<br>        response_plain_text = typewriter_print(response, response_plain_text)<br>    # Append the bot responses to the chat history.<br>    messages.extend(response)</pre><h4>Prompt and Response</h4><p>I provided a prompt to ask qwen3:8b to generate an image:</p><pre>user query: use your tools like my_image_gen to generate a bear with a hat</pre><p>Here is the response:</p><pre> bot response:<br>2025-05-19 17:34:22,554 - doc_parser.py - 93 - INFO - Read chunked Zheng2024_LargeLanguageModelsinDrugDiscovery.pdf from cache.<br>2025-05-19 17:34:22,555 - base_search.py - 56 - INFO - all tokens: 50347<br>&lt;think&gt;<br>Okay, the user wants me to generate an image of a bear with a hat using the my_image_gen tool. Let me check the function&#39;s requirements. The function requires a prompt in English that describes the image details. I need to make sure the prompt is detailed enough. Let me think about the elements: a bear, a hat. Maybe specify the type of hat, like a top hat or a funny hat. Also, the setting—maybe a forest or a snowy mountain. Adding adjectives like &quot;cute&quot; or &quot;cartoon-style&quot; could help. Let me put it all together. The prompt should be something like &quot;A cute cartoon bear wearing a top hat in a forest setting.&quot; That should work. Now, I&#39;ll format the tool call with this prompt.<br>&lt;/think&gt;<br><br><br>[TOOL_CALL] my_image_gen<br>{&quot;prompt&quot;: &quot;A cute cartoon bear wearing a top hat in a forest setting&quot;}<br>[TOOL_RESPONSE] my_image_gen<br>{image_url: &quot;https://image.pollinations.ai/prompt/A%20cute%20cartoon%20bear%20wearing%20a%20top%20hat%20in%20a%20forest%20setting&quot;}<br>&lt;think&gt;<br>Okay, let&#39;s see what the user is asking for here. They provided an image URL of a cartoon bear wearing a top hat in a forest setting. The image was generated using the my_image_gen function, which takes a prompt and returns the URL.<br><br>First, I need to check if there&#39;s any specific request related to this image. 
The user might want to know more about the image, like details about the bear, the forest, or maybe how to generate similar images. Alternatively, they might be testing if I can process the image URL or describe it. Since the user hasn&#39;t asked a direct question but provided the image, my role is to acknowledge the image and offer further assistance. I should confirm that the image was generated as expected and ask if they need anything else, like modifying the image, getting more images, or information about the generation process.<br><br>I should also make sure to keep the response friendly and open-ended, encouraging them to ask for more help if needed. No need to call any functions here since the user hasn&#39;t requested an action, just shared the image.<br>&lt;/think&gt;<br><br>![](https://image.pollinations.ai/prompt/A%20cute%20cartoon%20bear%20wearing%20a%20top%20hat%20in%20a%20forest%20setting)<br><br>This is an AI-generated cartoon image of a cute bear wearing a top hat in a forest setting. The image appears to be created using text-to-image generation technology. Would you like me to help you with anything related to this image or AI-generated content?<br>user query:</pre><p>We can see that the agent properly called the “my_image_gen” tool, and the generated image is at the top of this article. 
It is also hosted at the URL it cites near the end of its response: <a href="https://image.pollinations.ai/prompt/A%20cute%20cartoon%20bear%20wearing%20a%20top%20hat%20in%20a%20forest%20setting">https://image.pollinations.ai/prompt/A%20cute%20cartoon%20bear%20wearing%20a%20top%20hat%20in%20a%20forest%20setting</a>.</p><h4>My Device</h4><p>The AI PC laptop used in my testing has an Intel Core Ultra 7 155H 3.80 GHz processor with 32 GB of RAM.</p><h4>What We Built:</h4><p>✅ Installed and ran Qwen3:8b locally using Ollama<br>✅ Set up Qwen-Agent to build an AI assistant<br>✅ Connected a tool to generate images using text prompts<br>✅ Prompted the agent and got a real response — locally, on an Intel-powered AI PC</p><h4>Resources</h4><p>You can take advantage of building your own agents locally and speak with other developers using the resources listed below.</p><ul><li>Check out all of the possible <a href="https://ollama.com/library/qwen3">Qwen3 models hosted by Ollama</a></li><li>To build your own Qwen-based agents, visit the <a href="https://github.com/QwenLM/Qwen-Agent">Qwen-Agent GitHub repository</a></li><li>Learn more about the <a href="https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/ai-pc.html">AI PC Powered by Intel</a></li><li>To chat with other developers, you can visit the <a href="https://discord.gg/kfJ3NKEw5t">Intel DevHub Discord</a></li><li>For a more in-depth review and performance testing of Qwen3, you can visit the article <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-qwen3-large-language-models.html">Intel® AI Solutions Accelerate Qwen3 Large Language Models</a></li><li>Check out <a href="http://developer.intel.com/ai">Intel’s AI developer resources here</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cad452f20be5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/deploying-ai-agents-locally-with-qwen3-qwen-agent-and-ollama-cad452f20be5">Deploying AI Agents Locally with Qwen3, Qwen-Agent, and Ollama</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build An End-to-End SQL + RAG AI Agent]]></title>
            <link>https://medium.com/intel-tech/build-an-end-to-end-sql-rag-ai-agent-in-just-four-steps-b8a5b4adea9d?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/b8a5b4adea9d</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[rags]]></category>
            <category><![CDATA[agents]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Fri, 28 Mar 2025 20:52:36 GMT</pubDate>
            <atom:updated>2025-03-31T14:23:47.343Z</atom:updated>
            <content:encoded><![CDATA[<p>Learn how agents are implemented in enterprise use cases using OPEA blueprints to deploy</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bcJtdVhm65RSbciv6HXi2A.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@possessedphotography?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Possessed Photography</a> on <a href="https://unsplash.com/photos/white-robot-action-toy-zbLW0FG8XU8?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure><p>Excitement surrounds every new large language model (LLM) release, yet enterprises still struggle to extract real value from them. Techniques like retrieval augmented generation (RAG) enhance LLM capabilities by injecting relevant context into prompts. The process seems simple: build a knowledge base, retrieve data, and provide context. However, this context remains static. When a query requires refinement, deeper reasoning, or multi-step processing, the system doesn’t dynamically adjust its retrieval strategy; it simply returns what it finds in one step.</p><p>Imagine a music discovery application using an LLM to assist users in exploring bands, albums, and song details. A user inputs a query about a band, expecting the system to provide not only a list of their albums but also detailed insights into each album’s release date, genre, and tracklist. However, if the system only retrieves information based on a static context without dynamically adjusting to incorporate deeper reasoning or multi-step processing, it might return a generic list of albums without contextual details such as collaborations, notable tracks, or historical significance. This could lead to an incomplete or less engaging music discovery experience, missing key insights that enhance a user’s understanding and appreciation of the band’s discography.</p><p>We’ve all gotten used to having chatbots spit out quick answers to generic questions, but we want and need our GenAI assistants to do more. We don’t want them to solve just any problems; we want them to solve our specific problems. While RAG improves context injection, it still lacks the flexibility to handle complex queries that require iterative reasoning or dynamic decision-making.</p><p>AI agents solve this problem. By orchestrating retrieval, reasoning, and action-taking, they push GenAI beyond static context injection, enabling more dynamic, context-aware, and useful responses. Agents also provide a crucial bridge to the external world, connecting with APIs, databases, enterprise systems, and other tools to fetch live data, execute tasks, and adapt to evolving information. We can have the best, smartest, and most well-trained LLM. But without external tools, it’s confined to processing only the input it is given. Agents enable the GenAI application to interact and manipulate the world around it, transforming it from a passive knowledge repository into an active problem solver.</p><h3>What is an AI Agent?</h3><p>An AI agent is like a super-smart program designed to make decisions and take actions in a way that feels like it’s thinking on its own. It doesn’t just follow a fixed set of instructions like an “if-else” program does. 
Instead, it can:</p><ol><li>Perceive its environment (it gathers info like a person looking around).</li><li>Think about what to do (processing the info it gathers).</li><li>Act on its own to achieve a specific goal.</li></ol><p>Take, for example, a robot vacuum cleaner. It senses where dirt is (perceives), decides which path to clean (thinks), and moves around cleaning (acts).</p><p>An AI agent is built using a few key components:</p><ol><li><strong>Perception</strong>: This is how the agent senses or gathers information from its environment. It could be data from sensors, user input, or anything it can access (for example, camera data, temperature, text inputs).</li><li><strong>Decision-Making</strong>: After perceiving its environment, the agent processes that information to decide what action to take. This is typically done using algorithms or models like machine learning, decision trees, or rule-based systems. It tries to pick the best option based on the data it has.</li><li><strong>Action</strong>: Once it has decided what to do, the agent takes action to affect its environment. This could be sending a response, moving, adjusting settings, or anything the agent is designed to do.</li></ol><h3>What are Tools?</h3><p>These components often rely on Tools, which are specific functions that perform actions. Tools are independent pieces of software that carry out tasks, ranging from simple operations like mathematical calculations to more complex tasks like sensing the external world. They can be seen as specialized components that help the agent execute certain tasks or interact with its environment more effectively.</p><p>For example, in your robot vacuum, perception would be the sensors detecting dirt, the action would be moving the vacuum, and the <em>tool </em>would be the cleaning mechanism itself, which is an independent function that the robot calls upon to clean specific areas.</p><h3>Open Platform for Enterprise AI (OPEA) AgentQnA Interface Planner: SQL &amp; RAG Agents</h3><p>Deploying an agent can be a simple task if done within a single script, but this approach is not sufficient for enterprise-level solutions. Deployments need to be scalable, and each functionality should ideally be encapsulated. This can be achieved through a cloud native architecture. <a href="https://opea.dev">OPEA</a>, an open source project (part of LF AI&amp;DATA), is a framework for building and deploying modular, composable generative AI solutions. OPEA focuses on security, scalability, and cost-efficiency and provides the building blocks and blueprints to make this possible. In this blog, we will use the <a href="https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA">AgentQnA blueprint</a> to build a hierarchical multi-agent system for question-answering applications. The blueprint is available on Docker (docker-compose) or <a href="https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA/kubernetes/helm">Helm charts (Kubernetes).</a> The example implements the following architecture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*AAIEq1EjES8A3j9T.png" /></figure><p>Image from OPEA AgentQnA repo (https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA)</p><p>You can find instructions for deploying the example in <a href="https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA">this guide</a>. After deploying the blueprint you’ll see a set of microservices deployed on your environment. 
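A quick way to confirm everything came up is to list what is running (generic commands shown for illustration; the exact service names depend on your deployment):</p><pre># Docker Compose deployment<br>docker compose ps<br><br># Helm/Kubernetes deployment<br>kubectl get pods</pre><p>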
The deployment scenario includes a knowledge base with data in two common formats:</p><ul><li><strong>SQL Db</strong>: In many businesses, you’ll find SQL databases being used to support core functions, such as managing customer relationships, processing orders, tracking inventory, and generating reports. The structured nature of SQL databases allows organizations to store data in a clear, organized way, making it easier to maintain and query. In this example, we use a <a href="https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql">sample SQL database</a> to demonstrate how companies can leverage these types of databases for their operations, and we will feed it with two main groups of tables: “Business” tables managing customers, employees, and sales (Customer, Invoice, InvoiceLine, and Employee), and “Music” tables storing media details (Artist, Album, Track, Genre, MediaType, Playlist, and PlaylistTrack).</li><li><strong>VectorDB</strong>: A vector database stores rich, unstructured data, such as text, which is valuable for retrieval-augmented generation (RAG) applications. Unlike SQL databases with structured tables, a vector database represents information in high-dimensional vectors, capturing deeper context. This allows RAG to retrieve not only structured data, but also relevant text, enhancing the accuracy and richness of generated responses. In this example, we’ll have music information plus additional context, which makes it well suited for RAG (from <a href="https://github.com/minmin-intel/GenAIExamples/blob/test-openai/AgentQnA/example_data/test_docs_music.jsonl">https://github.com/minmin-intel/GenAIExamples/blob/test-openai/AgentQnA/example_data/test_docs_music.jsonl</a>).</li></ul><p>There are three agents involved, all of them built on the LangChain/LangGraph frameworks.</p><ul><li><strong>Supervisor Agent</strong>: The supervisor’s main role is to process the input, identify which tool to use to answer the query (SQL or RAG), and manage the generation of responses (potentially by calling the LLM). This agent is based on the <a href="https://arxiv.org/pdf/2210.03629">ReAct strategy</a> — it engages in “reason-act-observe” cycles to solve problems. Please refer to this <a href="https://python.langchain.com/v0.2/docs/how_to/migrate_agent/">doc</a> to learn more about how to use this strategy.</li><li><strong>Worker Agent (RAG):</strong> The worker RAG agent uses the retrieval tool to retrieve relevant documents from the knowledge base (a vector database).</li><li><strong>Worker Agent (SQL):</strong> The worker SQL agent retrieves relevant data from the SQL database.</li></ul><p>To test the example, you can use the scripts provided <a href="https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA">here</a>. You can test each agent independently or you can directly test the supervisor by running:</p><pre># supervisor agent: this will test a two-turn conversation<br>python tests/test.py --agent_role &quot;supervisor&quot; --ext_port 9090</pre><p>Since the example feeds information about music, you can find tests related to that topic. 
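To picture what the supervisor does during such a test, here is a heavily simplified, framework-free sketch of one reason-act-observe cycle (illustrative only; the real OPEA agents are built on LangChain/LangGraph, and the worker functions below are hypothetical stand-ins):</p><pre># Illustrative only: one &quot;reason-act-observe&quot; cycle of a supervisor that routes<br># a query to a SQL worker or a RAG worker. The real OPEA agents use LangChain/<br># LangGraph and an LLM for these decisions; the functions here are toy stand-ins.<br>def sql_worker(query: str) -&gt; str:<br>    return f&quot;[SQL rows answering] {query}&quot;<br><br>def rag_worker(query: str) -&gt; str:<br>    return f&quot;[documents retrieved for] {query}&quot;<br><br>def supervisor(query: str) -&gt; str:<br>    # Reason: decide which worker fits the query (an LLM does this in practice).<br>    worker = sql_worker if any(w in query.lower() for w in (&quot;most&quot;, &quot;how many&quot;, &quot;count&quot;)) else rag_worker<br>    # Act: call the chosen worker (tool).<br>    observation = worker(query)<br>    # Observe: a real agent would read the observation and either answer the<br>    # user or start another cycle with a refined query.<br>    return observation<br><br>print(supervisor(&quot;Which artist has the most albums?&quot;))</pre><p>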
The test for the supervisor agent is done in <strong>two turns</strong> (like a back-and-forth conversation):</p><ol><li>We ask, <strong>“Which artist has the most albums ?”</strong> to the supervisor agent.</li><li>The supervisor agent chooses the tool to use, and the RAG Agent goes into its <strong>music database</strong> and finds the answer.</li><li>Then we ask, <strong>“Give me a few examples of that artist’s albums?”</strong></li><li>The Supervisor <strong>remembers</strong> you were talking about the artist and asks the SQL Agent to respond with album names.</li></ol><p>You can experiment with different types of data to observe how the system adapts and responds to various queries.</p><h3>Build Your Own Agent</h3><p>This example demonstrates how to deploy an SQL use case, but agents are dynamic, and you may want to register your own agent within the architecture. You can explore the YAML and Python files in this example to understand how tools are integrated. Learn more about the strategies at: <a href="https://github.com/opea-project/GenAIComps/tree/main/comps/agent/src">opea-project/GenAIComps</a>.</p><p>For more details, please refer to the <strong>“Provide your own tools”</strong> section in the instructions <a href="https://github.com/opea-project/GenAIComps/blob/main/comps/agent/src/README.md">here</a>.</p><p><a href="https://github.com/opea-project/docs/blob/main/community/CONTRIBUTING.md">Contribute to the project</a>! OPEA is built by a growing community of developers and AI professionals. Whether you’re interested in contributing code, improving documentation, or building new features, your involvement is key to our success.</p><p>Join us on the OPEA <a href="https://github.com/opea-project">GitHub</a> to start contributing or explore our issues list for ideas on where to start.</p><h3>About the Author</h3><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel &amp; LF AI&amp;DATA Chair/Board)</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist at Intel, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on <a href="https://intel-my.sharepoint.com/personal/ezequiel_lanza_intel_com/Documents/@eze_lanza">X</a> and <a href="https://www.linkedin.com/in/ezelanza/">LinkedIn</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b8a5b4adea9d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/build-an-end-to-end-sql-rag-ai-agent-in-just-four-steps-b8a5b4adea9d">Build An End-to-End SQL + RAG AI Agent</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Retrieval Augmented Generation (RAG)]]></title>
            <link>https://medium.com/intel-tech/understanding-retrieval-augmented-generation-rag-4d1d08f736b3?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/4d1d08f736b3</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Thu, 15 Aug 2024 19:31:15 GMT</pubDate>
            <atom:updated>2024-08-15T19:31:15.664Z</atom:updated>
            <content:encoded><![CDATA[<p>Learn what a RAG system is and how to deploy it using OPEA’s open source tools and frameworks</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*aO3M-h-PR0AffMbK" /><figcaption>Photo by <a href="https://unsplash.com/@xavi_cabrera?utm_source=medium&amp;utm_medium=referral">Xavi Cabrera</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>By this point, most of us have used a large language model (LLM), like ChatGPT, to try to find quick answers to questions that rely on general knowledge and information. These questions range from the practical (What’s the best way to learn a new skill?) to the philosophical (What is the meaning of life?).</p><figure><img alt="A screenshot listing the most common questions asked to an AI. The questions include topics like AI, homework help, the meaning of life, weather updates, creative writing, productivity, technology trends, and book or movie recommendations" src="https://cdn-images-1.medium.com/max/738/1*8LbsOeORn1wHRXmVaUX7PA.png" /><figcaption>Image 1: ChatGPT lists some of the most common questions it gets asked</figcaption></figure><p>But how do you get answers to questions that are personal? How much does your LLM know about you? Or your family?</p><p>Let’s test ChatGPT and see how much it knows about my parents.</p><figure><img alt="A screenshot of a conversation where a user asks, ‘Do you know who is my mum?’ and the AI responds, explaining that it doesn’t have access to personal information about individuals." src="https://cdn-images-1.medium.com/max/749/1*UjguHKM8638VWmpbSw0nVw.png" /><figcaption>Image 2: ChatGPT’s answer to “Do you know who is my mum?”</figcaption></figure><p>It’s understandable to feel frustrated when a model doesn’t recognize you, but it’s important to remember that these models don’t have much information about our personal lives. Unless you’re a celebrity or have your own Wikipedia page (as Tom Cruise does), the training dataset used for these models likely doesn’t include our information, which is why they can’t provide specific answers about us.</p><figure><img alt="A screenshot displays a question asking if the model knows who Tom Cruise’s mum is; the model answers that Mary Lee Pfeiffer is his mum" src="https://cdn-images-1.medium.com/max/834/1*JigKp5EmMznIALEB9zgF9Q.png" /><figcaption>Image 3: ChatGPT’s answer to “Do you know who Tom Cruise&#39;s mum is?”</figcaption></figure><p>So, how do we get our LLMs to know us better?</p><p>That’s the million-dollar question facing enterprises looking to boost productivity with GenAI. They need models that provide context-based results. In this post, we’ll explain the basics of how retrieval augmented generation (RAG) improves your LLM’s responses and show you how to easily deploy your RAG-based model using a modular approach with the open source building blocks that are part of the new <a href="https://opea.dev/">Open Platform for Enterprise AI (OPEA)</a>.</p><h3>What is RAG?</h3><p>We know that LLMs can greatly contribute to completing an extensive number of tasks, such as writing, learning, programming, translating, and more. However, the result we receive depends on what we ask the model, in other words, on how we meticulously build our prompts. 
For that reason, we spend too much time looking for the perfect prompt to get the answer we want; we’re starting to become experts in <a href="https://en.wikipedia.org/wiki/Prompt_engineering">model prompting</a>.</p><p>Let’s return to the above question: “Who is my mum?” We know who our mum is, we have memories, and that information lives in our “mental” knowledge base, our brain.</p><p>When building the prompt, we need to somehow provide it with memories of our mum and try to guide the model to use that information to creatively answer the question: Who is my mum? We’ll provide it with some of mum’s history and ask the model to take her past into account when answering the question.</p><figure><img alt="A screenshot displays a question asking for a creative response based on provided details: the user’s mother was born in the US, is 60, has strong Italian roots, and loves pizza. Below, a response discusses how her Italian heritage influences her personality, family traditions, and love for Italian cuisine, emphasizing her vibrant and lively nature." src="https://cdn-images-1.medium.com/max/724/1*d0BUZgA_wJsudQyv4UgU1Q.png" /><figcaption>Image 4: Instructions prompt a creative response based on context about the user’s mother, emphasizing how her Italian roots influence her personality and preferences.</figcaption></figure><p>As we can see, the model successfully gave us an answer that described my mum. Congratulations, we have used RAG!</p><p>Let’s inspect what we did.</p><p>Given the initial question, we tweaked the prompt to guide the model in how to use the information (context) we provided.</p><p>We can think of the RAG process in three parts :</p><figure><img alt="A screenshot shows a section divided into three parts. The top part labeled “Instruct” gives instructions asking to answer the question creatively, considering how roots influence the person’s behavior. The middle section, labeled “Context,” lists facts about the user’s mother: born in the US, age 60, strong Italian roots, and a love for pizza. The bottom part, labeled “Initial Question,” asks “Who is my mum?”" src="https://cdn-images-1.medium.com/max/966/1*IZGHzg6sDcHNvGKBGxTEAw.png" /><figcaption>Image 5: Prompt provided to the LLM for answering creatively based on the provided context.</figcaption></figure><ul><li><strong>Instruct</strong>: Guide the model. We have guided the model to use the information we provided (documents) to give us a creative answer and take into account my mum’s history. We used those instructions as an example; we could have used other guidance depending on the outcome we wanted to achieve. If we don’t want a creative answer, for example, this is the time to declare it.</li><li><strong>Context</strong>: Provide the context. In this example, we already knew the information about my mother since we retrieved that information from my memories, but in a real scenario, the challenge would be finding the relevant data in a knowledge base to feed the model so that it has the context needed to provide us with an accurate response, this process is called “retrieval.”</li><li><strong>Initial Question</strong>: The initial question we want answered.</li></ul><p>Let’s explore how an enterprise can implement a real-life RAG example using open source tools and models. 
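Before we do, the three-part prompt above can be written out in a few lines of code (a minimal sketch; the instruction, context, and question strings simply restate the example details from the images):</p><pre># A minimal sketch of assembling the three-part RAG prompt described above.<br>instruct = (&quot;Answer the question creatively, taking into account how the &quot;<br>            &quot;person&#39;s roots influence their behavior.&quot;)<br>context = (&quot;My mum was born in the US, is 60 years old, has strong Italian &quot;<br>           &quot;roots, and loves pizza.&quot;)<br>question = &quot;Who is my mum?&quot;<br><br>prompt = f&quot;{instruct}\n\nContext:\n{context}\n\nQuestion: {question}&quot;<br>print(prompt)  # This text is what gets sent to the LLM.</pre><p>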
We’ll deploy it using the standardized frameworks and tools made available through OPEA, which was created to help streamline the implementation of enterprise AI.</p><h3>Exploring the OPEA Architecture</h3><p>Here’s the architecture we used for the previous example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S7m6NGJvROlZGE9SwVfXEg.png" /><figcaption>Image 6: Author’s RAG architecture adaptation using a building block concept, based on the original OPEA ChatQnA RAG architecture (<a href="https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA">https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA</a>)</figcaption></figure><p>RAG can be understood as simply the steps mentioned above:</p><p>1) Initial Question</p><p>2) Context</p><p>3) Instruct</p><p>However, implementing the process in practice can be challenging because multiple components are needed: retrievers, embedding models, and a knowledge base, as shown in the image above. Let’s explore how those parts can work together.</p><p>The key lies in providing the right context. You can compare the process to how our memories help us answer questions. For a company, this might mean drawing from a knowledge base of historical financial data or other relevant documents.</p><p>For example, when a user asks a chatbot a question before the LLM can spit out an answer, the RAG application must first dive into a knowledge base and extract the most relevant information (the retrieval process). But even before the retrieval happens, an embedding model plays a crucial role in converting the data in the knowledge base into vector representations — meaningful numerical embeddings that capture the essence of the information. These embeddings will live in the knowledge base (vector database) and will allow the retriever to efficiently match the user’s query with the most relevant documents.</p><p>Once the RAG application finds the relevant documents, it performs a rerank process to check the quality of the information and then re-orders the information based on relevance. It then builds a new prompt based on the refined context from the top-ranked documents and sends this prompt to the LLM, enabling the model to generate a high-quality, contextually informed response. Easy, right?</p><p>As you can see, the RAG architecture isn’t about just one tool or one framework; it’s composed of multiple moving pieces making it difficult to pay attention to each component. When deploying a RAG system in our enterprise, we face multiple challenges, such as ensuring scalability, handling data security, and integrating with existing infrastructure.</p><p>The Open Platform for Enterprise AI (OPEA) aims to solve those problems by treating each component in the RAG pipeline as a building block that is easily interchangeable. Say, for example, you’re using <a href="https://huggingface.co/docs/transformers/en/model_doc/mistral">Mistral</a>, but want to easily replace it with <a href="https://huggingface.co/docs/transformers/en/model_doc/falcon">Falcon</a>. Or, say you want to replace a vector database on the fly. You don’t want to have to rebuild the entire application. That would be a nightmare. 
OPEA makes deployment easier by providing robust tools and frameworks designed to streamline these processes and facilitate seamless integration.</p><p>You can see this process in action by running the ChatQnA example: <a href="https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA">https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA</a>. There, you’ll find all the steps needed to create the building blocks for your RAG application on your server or your AIPC.</p><h3>Call to Action</h3><p>We have shown you the basics of how RAG works and how to deploy a RAG pipeline using the OPEA framework. While the process is straightforward, deploying a RAG system at scale can introduce complexities. Here’s what you can do next:</p><ul><li>Explore <a href="https://github.com/opea-project/GenAIComps">GenAIComps</a>: Gain insights into how generative AI components work together and how you can leverage them for real-world applications. OPEA provides detailed examples and documentation to guide your exploration.</li><li>Explore <a href="https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA">RAG demo(ChatQnA)</a>: Each part of a RAG system presents its own challenges, including ensuring scalability, handling data security, and integrating with existing infrastructure. OPEA, as an open source platform, offers tools and frameworks designed to address these issues and make the deployment process more efficient. Explore our demos to see how these solutions come together in practice.</li><li>Explore <a href="https://github.com/opea-project/GenAIExamples/tree/main">GenAI Examples</a>: OPEA is not focused only on RAG; it is about generative AI as a whole. Multiple other demos, such as <a href="https://github.com/opea-project/GenAIExamples/tree/main/VisualQnA">VisualQnA</a>, showcase different GenAI capabilities. These examples demonstrate how OPEA can be leveraged across various tasks, expanding beyond RAG into other innovative GenAI applications.</li><li><a href="https://github.com/opea-project/docs/blob/main/community/CONTRIBUTING.md">Contribute to the project</a>! OPEA is built by a growing community of developers and AI professionals. Whether you’re interested in contributing code, improving documentation, or building new features, your involvement is key to our success.</li></ul><p>Join us on the OPEA <a href="https://github.com/opea-project">GitHub</a> to start contributing or explore our issues list for ideas on where to start.</p><h3>About the Author</h3><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist on Intel’s <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/team.html">Open Ecosystem</a> team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. 
Find him on <a href="https://intel-my.sharepoint.com/personal/ezequiel_lanza_intel_com/Documents/@eze_lanza">X</a> and <a href="https://www.linkedin.com/in/ezelanza/">LinkedIn</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d1d08f736b3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/understanding-retrieval-augmented-generation-rag-4d1d08f736b3">Understanding Retrieval Augmented Generation (RAG)</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improve your Tabular Data Ingestion for RAG with Reranking]]></title>
            <link>https://medium.com/intel-tech/improve-your-tabular-data-ingestion-for-rag-with-reranking-bebcf52cdde3?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/bebcf52cdde3</guid>
            <category><![CDATA[retrieval-augmented]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Tue, 16 Jul 2024 15:02:28 GMT</pubDate>
            <atom:updated>2024-07-31T20:17:09.794Z</atom:updated>
            <content:encoded><![CDATA[<p>Boost your RAG system’s accuracy by adding a reranker to select the most relevant context chunks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FgFiGQOxfAsaTmhd" /><figcaption>Photo by <a href="https://unsplash.com/@shawnanggg?utm_source=medium&amp;utm_medium=referral">shawnanggg</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h4>By Eduardo Rojas Oviedo with Ezequiel Lanza</h4><p>In our previous post,<a href="https://medium.com/p/bcb42678914b"> Tabular Data, RAG, &amp; LLMs</a>, we explored how to improve a large language model’s (LLM’s) ability to generate responses by feeding it small tabular data in various formats to provide necessary context. However, when the provided context is only partially correct, mismatches can lead to less accurate responses from our LLM.</p><p>In a typical RAG architecture, the retriever part gets chunks of data based on a similarity search from documents stored in a knowledge base. But are those chunks always the most relevant for our case?</p><p>Using the example from our previous article, what if we ask, “<em>Who are the top billionaires in the tech industry in 2024?”</em> and we get documents related to the overall list of world billionaires, including those from various industries that are not specific to tech? If the retriever returns documents about just any world billionaire working in any industry, we won’t get the response we need when the LLM is prompted.</p><p>Ingesting small tabular data in various formats to provide additional context beyond basic text-based input has proven useful in enhancing an LLM’s comprehension, not only of plain text but also of other formats like tables.</p><p>In this post, we’ll demonstrate how to reduce mismatches by using a reranker that can select the most relevant context chunks, ensuring that the LLM gets the best possible information for generating accurate answers.</p><h3>Extracting and Processing Data for a RAG System</h3><p>Our demonstration builds a Q&amp;A chatbot on top of what is considered a basic RAG solution, but we’re adding a reranker at the end to perform an additional ranking over the retrieved chunks. As shown in Image 1, this process has three stages: Data Preparation, Indexing, and Retrieval. We won’t cover the “staging” part; we can consider it as our knowledge base, and it’s not part of this tutorial. For this example, the “knowledge base” will be a PDF containing the <a href="https://en.wikipedia.org/wiki/The_World%27s_Billionaires">World’s Billionaires</a> data from Wikipedia, whereas in a business scenario it could be the entire company database.</p><figure><img alt="A flowchart depicting a data pipeline. Stages include data cleaning, preparation, indexing, retrieval, and storage of data." src="https://cdn-images-1.medium.com/max/1024/1*KNOBL2kEZbZOz_qesp0Z4Q.png" /><figcaption>Image 1: A comprehensive data pipeline diagram illustrating the stages of data processing, from raw data ingestion to data retrieval, created by the author.</figcaption></figure><h4>Data Preparation</h4><p>First, we need to prepare our data for storage in our database. 
Since our data has both tables and text, the data pipeline creation will follow two paths: one for the text, and one for the tabular data.</p><p>In the <strong>text path</strong>, we’ll extract the text from the PDF, perform pre-processing tasks (which include data cleaning, as discussed in a <a href="https://medium.com/p/77bee9003625">previous article</a>), implement a character-based chunking strategy, and add metadata.</p><p>For the <strong>tabular path,</strong> we’ll extract two sets of tabular data and demonstrate how to convert the information into useful context chunks for later model consumption (as explained in <a href="https://medium.com/p/bcb42678914b">this article</a>). Because the PDF includes tables, we’ll also need to select which approach to use to convert the information to vectors. We’ll employ two approaches: one based on row-by-row chunks of information and another using the entire table, as we explained in our previous article <a href="https://medium.com/intel-tech/tabular-data-rag-llms-improve-results-through-data-table-prompting-bcb42678914b">Tabular Data, RAG, &amp; LLMs.</a></p><h4>Indexing</h4><p>The next step is to choose how we’ll store the data (which we converted to vectors). The most common approach is to use a unified context collection (UCC). UCC keeps all the information and metadata in a single data location, along with the vectors we’ll use for semantic search; we can think of this approach as having one dataset containing all the information. This is called a collection.</p><figure><img alt="Diagram showing the structure of a Unified Context Collection in Chroma. It consists of three main components: Document, Metadata, and Embedding. The Document component includes content like “The World’s Billionaires…” and a table. The Metadata component includes details like “No: 1, Name: Bern…”. The Embedding component contains numerical values like “[1.0, 2.1, 3.4…]”. Each component is depicted with a series of rectangles representing data entries." src="https://cdn-images-1.medium.com/max/732/1*eIluIcVelqnuRui_K50t_w.png" /><figcaption>Image 2: Adaptation by the authors from <a href="https://docs.trychroma.com/">Chroma — Home page</a>. It illustrates Chroma’s Unified Context Collection, showcasing the integration of documents, metadata, and embeddings for comprehensive data representation.</figcaption></figure><p>Another approach is to use a distributed context collection (DCC) where information, metadata, and vectors are stored in separate collections according to the type of information. In this scenario, we’ll have multiple collections rather than one central location.</p><figure><img alt="Diagram showing Chroma’s Distributed Context Collection and Distributed Table Collection. The Distributed Context Collection includes three components: Document, Metadata, and Embedding. The Document contains content like “The World’s Billionaires…”. The Metadata includes details like “Owner: aaa, Table: …”. The Embedding component has numerical values like “[0.8, 1.6, 2.4, …]”. The Distributed Table Collection also includes Document, Metadata, and Embedding components. The Document here c" src="https://cdn-images-1.medium.com/max/1024/1*xHRl37W-sYtDaBIKzdpK9Q.png" /><figcaption>Image 3. 
Adaptation by the authors from <a href="https://docs.trychroma.com/guides">Chroma — Home page</a> comparing Chroma’s Distributed Context Collection and Distributed Table Collection, highlighting their respective components: Document, Metadata, and Embedding.</figcaption></figure><p>Each approach has its pros and cons. A DCC may be more complex to manage due to the need to synchronize data across multiple nodes, manage potential inconsistencies, and handle distributed storage efficiently, but it can offer greater scalability and fault tolerance, enabling the system to handle higher loads and recover more quickly from node failures. By contrast, a UCC offers simpler data management by centralizing all data in a single location, ensuring consistent and straightforward administration, but may face challenges with scalability and can become a single point of failure. Below, we evaluate both scenarios to determine how this configuration could affect the retrieval process.</p><h4>Retrieval</h4><p>The final stage of this process is retrieving the data. In this stage, the most relevant documents are retrieved from our collections before prompting the LLM, as shown in Image 4 below. A similarity search is performed to retrieve the documents most similar to the user’s question. <em>This sounds great, but are these documents the most relevant?</em> We propose adding an additional step that will perform a verification of the retrieved documents to rank them by relevance. A rerank approach involves scoring the retrieved documents to prioritize the most relevant ones before they are fed into the LLM for response generation. By ensuring the most pertinent information is prioritized, reranking minimizes irrelevant or less useful data, thereby improving the overall performance and reliability of the RAG system. You can explore a deeper explanation of this topic in this <a href="https://www.pinecone.io/learn/series/rag/rerankers/">Pinecone article</a>.</p><figure><img alt="A data vector space diagram, with axes labeled “Semantic” and “Unified Content Collection.” Data points are scattered in the space, with labels like “Question Search,” “Answer,” “Esedang,” and “Metadata.”" src="https://cdn-images-1.medium.com/max/1024/1*r_8S3sVrfuaPBANyHbj-1w.png" /><figcaption>Image 4. Adaptation by the author from <a href="https://docs.trychroma.com/guides">Chroma — Home page</a> and <a href="https://cookbook.openai.com/examples/question_answering_using_a_search_api">Question answering using a search API and re-ranking</a>. This diagram illustrates the reranking process. Data points (queries, answers) are adjusted in a semantic space based on relevance.</figcaption></figure><h3>Let’s see it in action</h3><p>Let’s run our experiments! We’ll guide you through setting up the environment, defining functions, and running the experiment. 
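As a quick preview of the reranking step itself, here is a minimal, self-contained sketch using a cross-encoder from the sentence-transformers library already listed in our requirements (the model checkpoint and the toy chunks are illustrative choices, not part of the experiment):</p><pre># Preview of the reranking step: score retrieved chunks against the question<br># with a cross-encoder and keep the highest-scoring ones. Sketch only; the<br># checkpoint name is one common public choice and the chunks are placeholders.<br>from sentence_transformers import CrossEncoder<br><br>query = &quot;Who are the top billionaires in the tech industry in 2024?&quot;<br>retrieved_chunks = [<br>    &quot;A table row describing a technology billionaire from the 2024 list.&quot;,<br>    &quot;A table row describing a fashion-industry billionaire from the 2024 list.&quot;,<br>    &quot;General text about the history of the billionaires ranking.&quot;,<br>]<br><br>reranker = CrossEncoder(&quot;cross-encoder/ms-marco-MiniLM-L-6-v2&quot;)<br>scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])<br>for score, chunk in sorted(zip(scores, retrieved_chunks), reverse=True):<br>    print(f&quot;{score:.3f}  {chunk}&quot;)</pre><p>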
Our goal is to see how well the model performs at answering questions based on a PDF document.</p><h4>Prepare the environment</h4><p>We first set up the environment using <strong>Python 3.12.3</strong> and provided all the required dependencies in a <strong>requirements.txt</strong> file (see <a href="https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#install-packages-in-a-virtual-environment-using-pip-and-venv">Install packages in a virtual environment</a>) with the following library specifications:</p><pre>tqdm==4.66.4<br>spacy==3.7.4<br>pypdf==4.2.0<br>langchain-community==0.2.1<br>langchain-text-splitters==0.2.0<br>flashrank==0.2.5<br>opencv-python==4.9.0.80<br>camelot-py==0.11.0<br>ghostscript==0.7<br>openai==1.30.5<br>chromadb==0.5.0<br>sentence-transformers==3.0.0</pre><h4>Convert the Data to PDF</h4><p>Next, we need to prepare our data and load it from a local PDF file. For this purpose, we’ll convert the World’s Billionaires data from Wikipedia (<a href="https://en.wikipedia.org/wiki/The_World%27s_Billionaires">The World’s Billionaires</a>) to PDF format.</p><pre>from pathlib import Path<br>import os<br><br># PDF file path<br>ROOT = Path(os.getcwd()).absolute()<br>file_path = os.path.join(ROOT, &quot;temp&quot;, &quot;World_Billionaires_Wikipedia.pdf&quot;)<br>file_path</pre><h4>Helper Functions for Data Preparation</h4><p>We now need to define the helper functions we’ll be using later. We’ll start by defining the data preparation functions that convert our data from the initial PDF format into useful chunks. In a <a href="https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625">previous article</a>, we discussed the essential skills for data cleaning, emphasizing the importance of effective data transformation.</p><ul><li><strong>unicode_to_ascii</strong>: Converts <a href="https://en.wikipedia.org/wiki/Unicode">Unicode</a> text to ASCII by normalizing Unicode characters, removing any accents or special characters that don’t exist in ASCII, and ensuring the text remains readable in UTF-8 encoding.</li></ul><pre>import unicodedata<br><br>def unicode_to_ascii(text):<br>    &quot;&quot;&quot;Normalize unicode values&quot;&quot;&quot;<br>    return unicodedata.normalize(&#39;NFKD&#39;, text).encode(&#39;ascii&#39;, &#39;ignore&#39;).decode(&#39;utf-8&#39;, &#39;ignore&#39;)</pre><ul><li><strong>Load_PDF</strong>: This function extracts text from our PDF file and then normalizes it. It also generates metadata values (see <a href="https://arxiv.org/abs/2402.07483">T-RAG: Lessons from the LLM Trenches</a> for other approaches), and finally, employs a basic character-based chunking strategy. 
We’ll use some libraries from <a href="https://pypi.org/project/langchain/">LangChain</a> to retrieve information from our PDF (<a href="https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html">PyPDFLoader</a>), perform the chunking process (<a href="https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html">RecursiveCharacterTextSplitter</a>), and handle data in a standardized manner as a <a href="https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/">Document</a>.</li></ul><pre>from tqdm import tqdm<br>from langchain.document_loaders import PyPDFLoader<br>from langchain.text_splitter import RecursiveCharacterTextSplitter<br>from langchain_core.documents import Document<br><br>def load_pdf(file_path, chunk_size, chunk_overlap):<br>    <br>    # load pdf with Langchain loader<br>    loader = PyPDFLoader(file_path)<br>    documents = loader.load()<br>    <br>    # get total pages count<br>    page_count = len(documents)<br>    <br>    # text cleaning and normalization<br>    for document in tqdm(documents):<br>        # text cleaning<br>        document.page_content = normalize(document.page_content)<br><br>        # add metadata &#39;text&#39; classification<br>        document.metadata[&quot;page&quot;] = str(document.metadata[&quot;page&quot;] + 1)<br>        document.metadata[&quot;type&quot;] = &quot;text&quot;<br>        <br>    # create text chunks based on character splitter<br>    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)<br>    text_chunks = text_splitter.split_documents(documents)<br>    <br>    return text_chunks, page_count<br><br></pre><ul><li><strong>get_tables: </strong>This function extracts tables from a PDF document (<em>path</em>) across multiple pages, cleans up table data, generates summaries for each row, and returns structured outputs: <em>tables_summary</em> for textual summaries, <em>metadata_result</em> for metadata, and <em>tables_result</em> for cleaned table data (classified as “row” and “table” as we saw in our previous article <a href="https://medium.com/intel-tech/tabular-data-rag-llms-improve-results-through-data-table-prompting-bcb42678914b">Tabular Data, RAG, &amp; LLMs</a>). Our main challenge consists of extracting tabular information using camelot-py and then expressing the information as context chunks that add value to our contextual information.
We will need camelot-py to extract tabular data from our PDF.</li></ul><p>— <strong>Note</strong>: Keep in mind the limitation of this algorithm, which does not handle tables that span multiple pages.</p><pre>import camelot<br><br>def get_tables(path: str, pages: int, year: int):<br>    <br>    tables_result, metadata_result, tables_summary = [], [], []<br><br>    for page in tqdm(range(1, pages)):<br>        <br>        table_list = camelot.read_pdf(path, pages=str(page))<br><br>        for tab in range(table_list.n):<br>            <br>            df_table = table_list[tab].df.dropna(how=&quot;all&quot;).loc[:, ~table_list[tab].df.columns.isin([&#39;&#39;,&#39; &#39;])]<br>            df_table = df_table.apply(lambda x: x.str.replace(&quot;\n&quot;, &quot; &quot;).replace(&quot;\xa0&quot;, &quot; &quot;))<br><br>            df_table = df_table.rename(columns=df_table.iloc[0]).drop(df_table.index[0]).reset_index(drop=True)<br><br>            if df_table.shape[0] &lt;= 3 or df_table.eq(&quot;&quot;).all(axis=None):<br>                continue<br><br>            df_table[&quot;Year&quot;] = year # Complete missing values (for demonstration purposes only)<br>            metadata_table = {&quot;source&quot;: path, &quot;page&quot;: str(page), &quot;year&quot;: str(year), &quot;type&quot;: &quot;row&quot;}<br><br>            df_table[&quot;summary&quot;] = df_table.apply(<br>                lambda x: &quot; &quot;.join([f&quot;{col}: {val}, &quot; for col, val in x.items()]).replace(&quot;\xa0&quot;, &quot; &quot;),<br>                axis=1<br>            )<br><br>            docs_summary = [Document(page_content=row[&quot;summary&quot;].strip(), metadata=metadata_table) for _, row in df_table.iterrows()]<br><br>            tables_result.append(df_table)<br>            metadata_result.append(metadata_table)<br>            tables_summary.extend(docs_summary)<br><br>            metadata_table = {&quot;source&quot;: path, &quot;page&quot;: str(page), &quot;year&quot;: str(year), &quot;type&quot;: &quot;table&quot;}<br>            tables_summary.append(Document(page_content=df_table.to_markdown(), metadata=metadata_table))<br>            metadata_result.append(metadata_table)<br><br>            year -= 1 # auxiliary code (for demonstration purposes only)<br><br>    return tables_summary, metadata_result, tables_result</pre><ul><li><strong>normalize</strong>: This function takes a sentence and makes it easy to analyze by converting special characters to simpler forms, converting all text to lowercase, and then removing unnecessary words and symbols like punctuation, short words, and web addresses. It returns a cleaned-up version of the sentence that’s ready for further processing. Since the example works with text, we need a library that can understand language (in this case, English), so we’ll use spaCy to load an English NLP model (<a href="https://spacy.io/models/en">en_core_web_sm</a>).</li></ul><pre>import spacy<br>from spacy.cli import download<br><br>try:<br>    nlp = spacy.load(&#39;en_core_web_sm&#39;)<br>except OSError:<br>    print(&quot;Model not found. 
Downloading the model...&quot;)<br>    download(&#39;en_core_web_sm&#39;)<br>    nlp = spacy.load(&#39;en_core_web_sm&#39;)<br><br>def normalize(sentence):<br>    &quot;&quot;&quot;Normalize a sentence and return the normalized sentence&quot;&quot;&quot;<br>        <br>    # Normalize Unicode characters to ASCII<br>    sentence = unicode_to_ascii(sentence)<br>    <br>    # Convert the sentence to lowercase and process it with spaCy<br>    sentence = nlp(sentence.replace(&#39;\n&#39;, &#39; &#39;).lower())<br>    <br>    # Lemmatize the words and filter out punctuation, short words, stopwords, mentions, and URLs<br>    sentence_normalized = &quot; &quot;.join([word.lemma_ for word in sentence if (not word.is_punct)<br>                                    and (len(word.text) &gt; 2) and (not word.is_stop) <br>                                    and (not word.text.startswith(&#39;@&#39;)) and (not word.text.startswith(&#39;http&#39;))])<br>    return sentence_normalized</pre><h4>Helper Functions for the Indexer</h4><p>Next, we’ll create functions for the indexing stage. We’ll use <a href="https://pypi.org/project/chromadb/">chromadb</a> as our embedding database because it’s open source. As mentioned before, we will be using both UCC and DCC scenarios.</p><h4>Unified Context Scenario</h4><p>In this scenario, we’re leveraging <a href="https://docs.trychroma.com/">ChromaDB</a> to create a unified collection of structured data, optimizing it for efficient semantic search and embedding generation.</p><p>The key feature highlighted is <a href="https://docs.trychroma.com/guides#changing-the-distance-function"><em>metadata={“hnsw:space”: “cosine”}</em></a>, with which we configure the distance function used in semantic search, as well as the embedding provider employed. By default, ChromaDB uses the <a href="https://www.sbert.net/#sentencetransformers-documentation">Sentence Transformers</a> all-MiniLM-L6-v2 model to create embeddings.</p><pre>import uuid<br>import chromadb<br><br>clientdb = chromadb.PersistentClient()<br><br>def unified_context_collection(text_chunks, tables_summary):<br>    # generate structure for unified vector scenario<br>    <br>    # create the chromadb collection<br>    unified_collection = clientdb.create_collection(name=&quot;unified_context_collection&quot;, metadata={&quot;hnsw:space&quot;: &quot;cosine&quot;})<br><br>    # Unified &#39;page_content&#39; and &#39;metadata&#39;<br>    unified_docs = [doc.page_content for doc in text_chunks + tables_summary]<br>    unified_meta = [doc.metadata for doc in text_chunks + tables_summary]<br><br>    # generate unique identifiers<br>    unified_ids = [str(uuid.uuid4()) for _ in unified_docs]<br><br>    # index data and generate default embeddings (vectors)<br>    unified_collection.add(documents=unified_docs, metadatas=unified_meta, ids=unified_ids)<br><br>    return unified_collection</pre><h4>Distributed Context Scenario</h4><p>In this distributed scenario configuration, we utilize ChromaDB to manage two distinct collections: distrctx_context_collection and distrtbl_table_collection. Each collection is optimized for semantic search and embedding generation using the cosine similarity distance function ({&quot;hnsw:space&quot;: &quot;cosine&quot;}).
It processes <em>text_chunks</em> for contextual data and <em>tables_summary</em> for structured tabular data, assigning unique identifiers and indexing them for efficient data management.</p><pre>import chromadb<br>clientdb = chromadb.Client()<br><br>def distributed_context_collection(text_chunks, tables_summary):<br>    # generate structure for distributed vector scenario<br>    <br>    # create the chromadb collections<br>    distrctx_collection = clientdb.create_collection(name=&quot;distrctx_context_collection&quot;, metadata={&quot;hnsw:space&quot;: &quot;cosine&quot;})<br>    distrtbl_collection = clientdb.create_collection(name=&quot;distrtbl_table_collection&quot;, metadata={&quot;hnsw:space&quot;: &quot;cosine&quot;})<br><br>    # Distributed context &#39;page_content&#39; and &#39;metadata&#39;<br>    distrctx_docs = [doc.page_content for doc in text_chunks]<br>    distrctx_meta = [doc.metadata for doc in text_chunks]<br><br>    # generate context unique identifiers<br>    distrctx_ids = [str(uuid.uuid4()) for _ in distrctx_docs]<br><br><br>    # Distributed Table &#39;page_content&#39; and &#39;metadata&#39;<br>    distrtbl_docs = [doc.page_content for doc in tables_summary]<br>    distrtbl_meta = [doc.metadata for doc in tables_summary]<br><br>    # generate table unique identifiers<br>    distrtbl_ids = [str(uuid.uuid4()) for _ in distrtbl_docs]<br><br>    # index data and generate default embeddings (vectors)<br>    distrctx_collection.add(documents=distrctx_docs, metadatas=distrctx_meta, ids=distrctx_ids)<br>    distrtbl_collection.add(documents=distrtbl_docs, metadatas=distrtbl_meta, ids=distrtbl_ids)<br><br>    return distrctx_collection, distrtbl_collection</pre><h4>The Helper Function for Reranking</h4><p>Now, we’ll look at the retrieval functions. For the reranker, we’ve chosen <a href="https://pypi.org/project/FlashRank/">FlashRank</a> for its ease of use and because it operates independently of Torch or Transformer frameworks, running efficiently on CPUs with a minimal memory footprint of approximately 4MB. FlashRank plays a pivotal role in our RAG system by reranking candidate chunks after a semantic search. Using state-of-the-art cross-encoders and advanced models, FlashRank evaluates the relevance of retrieved chunks based on their similarity to the input query.
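</p><p>As a minimal, self-contained sketch of this step (the query and passages below are made-up examples, not drawn from our dataset), FlashRank can be used on its own roughly like this:</p><pre>from flashrank import Ranker, RerankRequest<br><br># Hypothetical illustration: rerank a handful of candidate passages for a query<br>ranker = Ranker()  # loads a small default reranking model on first use<br>passages = [<br>    {&quot;id&quot;: &quot;1&quot;, &quot;text&quot;: &quot;Bernard Arnault topped the 2024 list.&quot;, &quot;meta&quot;: {&quot;type&quot;: &quot;row&quot;}},<br>    {&quot;id&quot;: &quot;2&quot;, &quot;text&quot;: &quot;The list is published annually by Forbes.&quot;, &quot;meta&quot;: {&quot;type&quot;: &quot;text&quot;}},<br>]<br>request = RerankRequest(query=&quot;Who was the richest person in 2024?&quot;, passages=passages)<br>for item in ranker.rerank(request):  # passages come back with an added relevance score<br>    print(item[&quot;score&quot;], item[&quot;text&quot;])</pre><p>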
Our <em>get_ranked_chunks</em> function below wraps this flow for our collections and also generates “score” metrics (max_score, min_score, avg_score, std_score) that offer insights into the quality and distribution of the ranked results, helping to assess and refine the performance of the ranking process.</p><p><strong>Note</strong>: The metrics section at the end of the function was added for demonstration purposes only and is not required in the proposed production setting.</p><pre>from flashrank import Ranker, RerankRequest<br>import pandas as pd<br>pd.set_option(&#39;display.max_colwidth&#39;, None)<br><br><br>def get_ranked_chunks(query: str, collectionContext, test_scenario=3, year_scope=2024, top_results=20, top_rank: int=5):<br>    <br>    conditions = {<br>        0: {&quot;year&quot;: str(year_scope)},<br>        1: {&quot;$and&quot;: [{&quot;type&quot;: &quot;row&quot;}, {&quot;year&quot;: str(year_scope)}]},<br>        2: {&quot;$and&quot;: [{&quot;type&quot;: &quot;table&quot;}, {&quot;year&quot;: str(year_scope)}]},<br>        3: {&quot;$and&quot;: [{&quot;year&quot;: str(year_scope)}, {&quot;$or&quot;: [{&quot;type&quot;: &quot;table&quot;}, {&quot;type&quot;: &quot;row&quot;}]}]},<br>    }<br><br>    results = collectionContext.query(<br>        query_texts=[query],<br>        n_results=top_results,<br>        where=conditions[test_scenario]<br>    )<br><br>    # number of documents returned by the semantic search<br>    total_collection_results = len(results[&quot;ids&quot;][0])<br>    <br>    passages = [<br>        {&quot;id&quot;: results[&quot;ids&quot;][0][i], &quot;text&quot;: results[&quot;documents&quot;][0][i], &quot;meta&quot;: results[&quot;metadatas&quot;][0][i]}<br>        for i in range(total_collection_results)<br>    ]<br><br>    ranker = Ranker()<br>    rerankrequest = RerankRequest(query=query, passages=passages)<br>    ranked_results = ranker.rerank(rerankrequest)<br>    total_rank_results = len(ranked_results)<br><br>    chunks = [item[&#39;text&#39;] for item in ranked_results[:top_rank]]<br>    total_chunks = len(chunks)<br><br>    df_ranked = pd.json_normalize(ranked_results).rename(columns=lambda x: x.replace(&#39;meta.&#39;, &#39;&#39;))<br>    df_ranked = df_ranked.sort_values(by=&#39;score&#39;, ascending=False)<br><br>    metrics = {<br>        &quot;max_score&quot;: &quot;{:+.6f}&quot;.format(df_ranked[&#39;score&#39;].max()),<br>        &quot;min_score&quot;: &quot;{:+.6f}&quot;.format(df_ranked[&#39;score&#39;].min()),<br>        &quot;avg_score&quot;: &quot;{:+.6f}&quot;.format(df_ranked[&#39;score&#39;].mean()),<br>        &quot;std_score&quot;: &quot;{:+.6f}&quot;.format(df_ranked[&#39;score&#39;].std()),<br>        &quot;types&quot;: &#39;, &#39;.join(df_ranked[&#39;type&#39;].unique()),<br>        &quot;search_count&quot;: str(total_collection_results),<br>        &quot;rerank_count&quot;: str(total_rank_results),<br>        &quot;chunks_count&quot;: str(total_chunks),<br>        &quot;chunks&quot;: &#39;, &#39;.join(df_ranked[&#39;text&#39;].str[:7])<br>    }<br><br>    new_row_dict = {&quot;Query&quot;: query, &quot;Collection&quot;: collectionContext.name, &quot;Scenario&quot;: test_scenario, **metrics}<br>    <br>    return chunks, new_row_dict, df_ranked</pre><h4>LLM API Client Configuration</h4><p>We’re almost done!
We still need to configure the last step: running inference with the LLM using the context we prepared in the previous sections.</p><p>Depending on your specific LLM setup, adjust the configuration below accordingly.</p><pre>import os<br>from openai import AzureOpenAI<br><br>client = AzureOpenAI(<br>    api_key = os.environ.get(&quot;OAI_API_Key&quot;), <br>    api_version = os.environ.get(&quot;OAI_API_Version&quot;), <br>    azure_endpoint = os.environ.get(&quot;OAI_API_Base&quot;))<br><br>MESSAGE_SYSTEM_CONTENT = &quot;&quot;&quot;You are a customer service agent that helps a customer with answering questions. <br>Please answer the question based on the provided context below. <br>Make sure not to make any changes to the context, if possible, when preparing answers so as to provide accurate responses. <br>If the answer cannot be found in the context, just politely say that you do not know, do not try to make up an answer.&quot;&quot;&quot;<br><br>def response_test(question:str, context, model:str = &quot;gpt-4&quot;):<br>    response = client.chat.completions.create(<br>        model=model,<br>        messages=[<br>            {<br>                &quot;role&quot;: &quot;system&quot;,<br>                &quot;content&quot;: MESSAGE_SYSTEM_CONTENT,<br>            },<br>            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: question},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: &#39;\n &#39;.join(context)},<br>        ],<br>    )<br>    <br>    return response.choices[0].message.content<br><br></pre><h3>Perform the Test</h3><p>Now that we have all the functions in place, let’s perform the test following the steps below:</p><ul><li>1. <strong>Get the chunks: </strong>We selected a chunk size of 600. With this, we handle 34 pages and process 129 chunks (96 for chunk_size 800, 77 for 1000). Experimenting with these parameters is quite entertaining!</li></ul><pre>text_chunks, page_count = load_pdf(file_path, chunk_size=600, chunk_overlap=100)<br>print(f&quot;Total text chunks: {len(text_chunks)}&quot;)<br>print(f&quot;Total pages: {page_count}&quot;)</pre><ul><li>2. <strong>Get the tabular data</strong>:</li></ul><pre>tables_summary, metadata_result, tables_result = get_tables(file_path, page_count, 2024)<br>print(len(tables_result))</pre><p><strong>Note</strong>: The value “2024” is a temporary stand-in that gives our classification metadata a functional year value.</p><ul><li><strong>3.</strong> <strong>Put the data into collections (UCC / DCC)</strong></li></ul><pre># Get the unified context data collection<br>unified_collection = unified_context_collection(text_chunks, tables_summary)<br><br># Get the distributed context data collections<br>distrctx_collection, distrtbl_collection = distributed_context_collection(text_chunks, tables_summary)</pre><p>Finally, here is the code snippet that allows us to control our tests.
As you can see, our proposal encompasses extracting and reranking the context information (get_ranked_chunks), prompting the LLM (response_test), and, finally, accumulating the metrics in ‘df_results’.</p><pre>import pandas as pd<br>pd.set_option(&#39;display.max_colwidth&#39;, None)<br><br>def query_controler(questions, unified_collection, distrtbl_collection):<br>    <br>    df_results = pd.DataFrame(columns=[<br>        &quot;Query&quot;, &quot;Collection&quot;, &quot;Scenario&quot;, &quot;Answer&quot;,  <br>        &quot;max_score&quot;, &quot;min_score&quot;, &quot;avg_score&quot;, &quot;std_score&quot;, <br>        &quot;types&quot;, &quot;search_count&quot;, &quot;rerank_count&quot;, &quot;chunks_count&quot;, &quot;chunks&quot;<br>    ])<br>    <br>    def process_query(query_source, collection, year_scp):<br>        chunks, new_row_dict, df_ranked = get_ranked_chunks(query=query_source, collectionContext=collection, year_scope=year_scp)<br>        response = response_test(query_source, chunks)<br>        new_row_dict[&quot;Answer&quot;] = response<br>        <br>        return new_row_dict<br>    <br>    for key in questions.keys():<br>        query_source = questions[key][&quot;query&quot;]<br>        year_scp = questions[key][&quot;year_scope&quot;]<br>        <br>        new_row_dict = process_query(query_source, unified_collection, year_scp)<br>        df_results.loc[len(df_results)] = new_row_dict<br>        <br>        new_row_dict = process_query(query_source, distrtbl_collection, year_scp)<br>        df_results.loc[len(df_results)] = new_row_dict<br>        <br>    return df_results</pre><p>We’ve now reached the most anticipated moment! Can we learn the answer to “<strong>Who was the fourth millionaire of 2022?</strong>” using both scenarios (UCC and DCC)?</p><pre>questions = {}<br>questions[&quot;question1&quot;] = {&quot;query&quot;: &quot;Who was the fourth millionaire of 2022?&quot;, &quot;year_scope&quot;: 2022}<br><br>df_results = query_controler(questions, unified_collection, distrtbl_collection)<br>df_results[[&quot;Collection&quot;, &quot;Answer&quot;, &quot;max_score&quot;, &quot;min_score&quot;, &quot;types&quot;, &quot;rerank_count&quot;, &quot;chunks_count&quot;, &quot;chunks&quot;]]</pre><figure><img alt="A summary of the findings is shown in the table." src="https://cdn-images-1.medium.com/max/1024/1*d_a6MQt5eHBZsQRukNI_mA.jpeg" /><figcaption>Image 5: Results table generated by the RAG application after processing a user query.</figcaption></figure><p>As you can see in Image 5, both collections, “<strong>unified_context_collection</strong>” and “<strong>distrtbl_table_collection</strong>,” produced similar outputs when answering our initial question. The metrics we configured help us assess the quality of the retrieved chunks. In both scenarios, we received scores ranging from <strong>+0.000086</strong> to <strong>+0.000017</strong>, which indicates that the retrieved chunks consistently align closely with the query’s requirements. This range of scores suggests that the information retrieved from both collections is consistently relevant and closely matches what we sought to find, providing reliable and credible data.</p><p>For this case, we selected collections that include rows and tables, with seven rerank counts and five final contextual chunks each. Overall, the analysis confirms Bill Gates’ ranking and net worth with high precision and consistency across both data collections.</p><p>We’d like to show you one additional example.
In this scenario, we’ve removed the metrics to make the results easier to read. However, question number 3, “Based on context, which millionaires in 2021 have Google as the source of their fortune?”, deserves special attention because it was resolved using only row-based table chunks.</p><pre>questions = {}<br>questions[&quot;question2&quot;] = {&quot;query&quot;: &quot;What&#39;s the Elon Musk net worth by 2023 year?&quot;, &quot;year_scope&quot;: 2023}<br>questions[&quot;question3&quot;] = {&quot;query&quot;: &quot;Based on context, which millionaires in 2021 have Google as the source of their fortune?&quot;, &quot;year_scope&quot;: 2021}<br><br>df_results2 = query_controler(questions, unified_collection, distrtbl_collection)<br>df_results2[[&quot;Query&quot;, &quot;Answer&quot;, &quot;types&quot;]]</pre><figure><img alt="A summary of the findings is shown in the table." src="https://cdn-images-1.medium.com/max/1024/1*MJmvXd_Fa9entdhfmqsAUA.jpeg" /><figcaption>Image 6: Results table generated by the RAG application after processing a user query.</figcaption></figure><h3>What Did We Learn?</h3><p>In this tutorial for ingesting small tabular data while working with LLMs, we explored various aspects crucial for understanding and implementing RAG frameworks. From data extraction and normalization to semantic search and reranking, each step plays a pivotal role in achieving accurate and contextually rich responses in natural language processing tasks.</p><p>We also highlighted the challenges and opportunities of integrating tabular data into RAG frameworks. Notably, we emphasized the importance of chunking strategies and the impact of irrelevant context on model performance, drawing attention to relevant resources for deeper exploration.</p><p>This study not only enhances our understanding of RAG frameworks but also provides practical insights for implementing them effectively in real-world scenarios. As the field of natural language processing continues to evolve, such investigations pave the way for more sophisticated and contextually aware language models.</p><p>If you would like to learn more, here are some suggestions for further reading:</p><ul><li><a href="https://github.com/chroma-core/chroma/blob/main/chromadb/experimental/density_relevance.ipynb">Density-based retrieval relevance</a>. This is vital in retrieval-augmented generation, where irrelevant context can confuse generative models and degrade performance. Unlike relational databases, vector search systems return the nearest neighbors regardless of relevance.</li><li><a href="https://arxiv.org/abs/2302.00093">Large Language Models Can Be Easily Distracted by Irrelevant Context</a>. The authors show that irrelevant information significantly reduces model performance. They also identify mitigation strategies, including self-consistent decoding and prompt instructions for ignoring irrelevant data.</li><li><a href="https://opea.dev">Open Platform for Enterprise AI (OPEA)</a>: When deploying a RAG system, we will face multiple challenges such as ensuring scalability, handling data security, and integrating with existing infrastructure.
This open-source project helps with the deployment by providing robust tools and frameworks designed to streamline these processes and facilitate seamless integration.</li></ul><h3>About the Authors</h3><p><strong>Eduardo Rojas Oviedo, Platform Engineer, Intel</strong></p><p><a href="https://www.linkedin.com/in/eduardo-rojas-oviedo-56306923?originalSubdomain=cr">Eduardo Rojas Oviedo</a> is a dedicated RAG developer within Intel’s dynamic and innovative team. Specialized in cutting-edge developer tools for AI, Machine Learning, and NLP, he is passionate about leveraging technology to create impactful solutions. Eduardo’s expertise lies in building robust and innovative applications that push the boundaries of what’s possible in the realm of artificial intelligence. His commitment to sharing knowledge and advancing technology drives his ongoing pursuit of excellence in the field.</p><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist on Intel’s <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/team.html">Open Ecosystem</a> team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at <a href="https://twitter.com/eze_lanza">@eze_lanza</a> and LinkedIn at <a href="https://www.linkedin.com/in/ezelanza/">/eze_lanz</a>a</p><h3>Follow us!</h3><p><a href="https://medium.com/intel-tech">Medium</a>, <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/podcast.html">Podcast</a>, <a href="http://open.intel.com">Open.intel</a> , <a href="https://twitter.com/OpenAtIntel">X</a> , <a href="https://www.linkedin.com/search/results/all/?fetchDeterministicClustersOnly=true&amp;heroEntityKey=urn%3Ali%3Aorganization%3A100949244&amp;keywords=open.intel&amp;origin=RICH_QUERY_TYPEAHEAD_HISTORY&amp;position=0&amp;searchId=48e19f47-ebfe-49e5-acc2-9b9424bd004c&amp;sid=226&amp;spellCorrectionEnabled=true">Linkedin</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bebcf52cdde3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/improve-your-tabular-data-ingestion-for-rag-with-reranking-bebcf52cdde3">Improve your Tabular Data Ingestion for RAG with Reranking</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Containerize Your Local LLM]]></title>
            <link>https://medium.com/intel-tech/how-to-containerize-your-local-llm-436182cd179a?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/436182cd179a</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[cloud-native]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Tue, 25 Jun 2024 15:32:37 GMT</pubDate>
            <atom:updated>2024-06-25T16:24:19.466Z</atom:updated>
            <content:encoded><![CDATA[<p>Learn how to create a container to expose your local large language model (LLM) for consumption by an API.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*s1X0TTKJdNnkt25F" /><figcaption>Photo by <a href="https://unsplash.com/@jukeboxprint?utm_source=medium&amp;utm_medium=referral">Jukebox Print</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h4>Presented by Ezequiel Lanza — Open Source AI Evangelist (Intel)</h4><p>Let’s say you’re building a chatbot for your company and you need to find the best way to deploy it. You might start by building your logic in a Jupyter Notebook, but this won’t work when you want to deploy it in a live environment where users can interact with it. You might then start thinking about a suitable strategy for making your application scalable, portable, and efficient. Cloud native development emerges as the best alternative, allowing you to focus on topics like scaling and dimensioning while treating each piece as an isolated module.</p><p>So, what happens next? You need to create the core components of your application, starting with the large language model (LLM). You might want an LLM container that receives a string input and returns the model’s response (Image 1). That sounds easy, right?</p><figure><img alt="A prompt asking about what is Kubernetes and an LLM response" src="https://cdn-images-1.medium.com/max/650/1*I2lOfrkJ5n6vyVxY2560Eg.png" /><figcaption>Image 1 : The user asks the large language model ‘What is Kubernetes?’ The model answers that Kubernetes is an Open Source platform for automating the deployment, scaling, and management of containerized applications. It highlights Kubernetes’ role in managing clusters of containers effectively</figcaption></figure><p>Unfortunately, this can be challenging for any application composed of more than just a language model. For instance, you may have a React front end that doesn’t understand the intricacies of LLMs, such as LLM pipelines, AI frameworks, or optimizations needed to make your model more efficient. To address this, you need to abstract the LLM logic and provide an API for the application to interact with by containerizing your model. This approach offers several benefits characteristic of microservices architecture, such as scalability, isolation, portability, efficiency, and continuous deployment.</p><p>In this post, we’ll walk you through the step-by-step process for configuring and creating an LLM container for further use by an application. This post builds off the project we shared in our article <a href="https://medium.com/intel-tech/easily-deploy-multiple-llms-in-a-cloud-native-environment-c7b47511ac15">Easily Deploy Multiple LLMs in a Cloud Native Environment.</a> You may want to read that one first before proceeding.</p><p>You’ll find the code you need for this project here: <a href="https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main">https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main</a></p><h3>Why a Container?</h3><p>When containerizing LLM logic, you can use either a local or external model. It’s worth mentioning that if you choose to use a local model, you don’t have to include it on the container image. This avoids having to package large model files (~26GB for a 7B model) with the container, which can significantly increase the container size and slow down deployment times. 
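</p><p>For instance, rather than baking the weights into the image, the model directory can be mounted into the container at run time. Purely as a sketch (the host path and image name here are placeholders; the container path matches the <em>model_path</em> used later in llama2.py), launching the container might look like:</p><pre>docker run --rm -p 5005:5005 \<br>  -v /path/on/host/llama-2-7b-chat-hf:/fs_mounted/Models/llama-2-7b-chat-hf:ro \<br>  &lt;username&gt;/llama7b-non-optimized:latest</pre><p>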
Additionally, it eliminates the need to rebuild the container image whenever the model is updated or retrained. Instead, you can store the model on a local server or file system that the containerized application can access when the container launches, allowing for more efficient management and deployment.</p><p>Since this example is based on a project, you’ll first need to clone the repo with the code.</p><pre>git clone https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain</pre><p>As mentioned in the <a href="https://medium.com/p/c7b47511ac15">previous post</a>, the repo explains how to create all the containers you need to build your own multi-chatbot, such as the front end, the proxy, or the LLM model. In this case, we’ll focus on the container for the LLaMa2 non-optimized model.</p><h3>What Will Be Inside The Container?</h3><p>This example assumes you want to use a local non-optimized model, which means you didn’t optimize your model using a tool like <a href="https://github.com/intel/intel-extension-for-transformers">Intel® Extension for Transformers</a> (ITREX) to make it use fewer resources. In short, you want to use a vanilla version of LLaMa2, which you can download <a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">HERE</a>. Save the model in an externally accessible place (ideally a file server) to be used by the container when it’s launched.</p><p>You might be wondering, why aren’t we storing the model in the container image? Let’s think about it. Models are heavy. Consider, for example, the LLaMA2 model family. The LLaMA2 13B model has 13 billion parameters and requires around 40 gigabytes of storage space. Loading such a large model directly into the container image would significantly increase the container’s size, making it more cumbersome to deploy and manage. By keeping the model on an external file server, you can keep the container lightweight and more agile, facilitating easy updates and scalability. This approach also allows for better resource management, as the model can be shared across multiple containers or instances, reducing redundancy and optimizing storage usage.</p><p>The repo is organized by folders, and each folder corresponds to a different image in the chatbot. Go to the folder where the local model scripts are, in this case “LLAMA-non”:</p><pre>cd 3__Local_Models/LLAMA-non</pre><p>Within that folder, you’ll find all the files needed to build the LLM container. The following tree shows the files Docker will use to create the container.</p><p>LLAMA-non<br>├── Dockerfile<br>├── app<br>│ ├── __init__.py<br>│ ├── llama2.py<br>│ └── server.py<br>└── requirements.txt</p><p>Let’s explore each part before building the container.</p><h3>Application Folder (/app)</h3><p>Any application that will live in a container has to be declared. Within the application folder (/app), we can typically find the scripts that will make the application run. In our case, we’ll have two scripts (llama2.py and server.py), one to declare the LLM pipeline object and the other to expose the API. These scripts will be run when the container is launched. Let’s explore each of them.</p><p><em>llama2.py (/app/llama2.py)</em></p><p>Here is where the pipeline will live.
A pipeline abstracts the complexities of the underlying models and provides an easy-to-use interface for common tasks such as text classification, question answering, or text generation.</p><p>The code below sets up a Hugging Face text generation pipeline using a locally stored LLaMa2 model and its tokenizer (downloaded offline <a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">HERE</a>). Having models locally hosted allows further customization; when configuring the pipeline, you could set specific parameters to control the text generation behavior, like limiting the max number of tokens generated by <strong>max_new_tokens</strong>, penalizing the repetition of words in the generation with <strong>repetition_penalty</strong> or defining how “creative” the model will be with <strong>temperature</strong>.</p><p>Since we’ll be using LangChain further to integrate the pipeline, we should put the pipeline in the provided LangChain <strong>HuggingFacePipeline</strong> object, making it ready to be served via LangChain’s API.</p><pre>from transformers import pipeline,LlamaForCausalLM, LlamaTokenizer<br>from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline<br></pre><pre># Local or File server address where the local model is stored<br>model_path=&quot;/fs_mounted/Models/llama-2-7b-chat-hf&quot; <br>local_model =  LlamaForCausalLM.from_pretrained(model_path, return_dict=True, trust_remote_code=True)<br>local_tokenizer = LlamaTokenizer.from_pretrained(model_path)</pre><pre># Hugging face pipeline<br>pipe= pipeline(task=&quot;text-generation&quot;, model=local_model, tokenizer=local_tokenizer, <br>                         trust_remote_code=True, max_new_tokens=100, <br>                         repetition_penalty=1.1, model_kwargs={&quot;max_length&quot;: 1200, &quot;temperature&quot;: 0.01})  </pre><pre>#Pipeline to be consumed by Langserve API<br>llm_pipeline = HuggingFacePipeline(pipeline=pipe)</pre><p><em>server.py (/app/server.py)</em></p><p>As we mentioned, to interact with the model in our cloud native environment, we expose the LLM via an API call. External applications or front ends can simply send an API POST message and receive a response, without needing to understand the pipeline or LLM parameters.</p><p>This is where our FastAPI application comes in. By creating an instance of FastAPI, we set up a web service that can handle these API requests. We configure <a href="https://en.wikipedia.org/wiki/Cross-origin_resource_sharing">cross-origin resource sharing</a> (CORS) middleware to allow requests from any origin, which is useful during the development stage. This ensures that our front end, regardless of its origin, can communicate with the backend. Note that you should adapt this configuration when it’s used in your production scenario.</p><p>But this is an LLM server, so we define a prompt template to guide the LLM in providing clear, concise, and accurate answers. The template instructs the model to respond within a specified character limit and to admit when it doesn’t know the answer, ensuring reliability and consistency in responses.</p><p>We’ll also need to set up a root endpoint to redirect users to the API documentation page, making it easy for developers to understand and use the API. We then add specific routes to handle requests involving the LLM pipeline. 
These routes use the predefined prompt template and the LLM pipeline to process inputs and generate responses.</p><p>When the script is run directly (when the container launches), the application starts using <a href="https://www.uvicorn.org/">Uvicorn</a>, a fast ASGI server, and listens on localhost at port 5005. This setup allows us to expose the LLM through a simple API call, making it accessible to external applications without requiring them to handle the complexities of the LLM’s internal workings.</p><pre>from fastapi import FastAPI<br>from langchain.prompts import PromptTemplate <br>from fastapi.responses import RedirectResponse<br>from fastapi.middleware.cors import CORSMiddleware<br>from langserve import add_routes<br>from app.llama2 import llm_pipeline</pre><pre># Initializes a FastAPI app instance<br>app = FastAPI()</pre><pre># Set up CORS middleware to allow requests from any origin<br>app.add_middleware(<br>    CORSMiddleware,<br>    allow_origins=[&quot;*&quot;],  # Set this to the specific origin of your frontend in production<br>    allow_credentials=False,<br>    allow_methods=[&quot;*&quot;],<br>    allow_headers=[&quot;*&quot;],<br>)</pre><pre># Defines a template for prompts that will be sent to the LLM. The template ensures that the assistant gives clear, concise, and accurate answers.</pre><pre>template = &quot;&quot;&quot;You are a very smart and educated assistant to guide the user to understand the concepts. Please explain the answer.<br>If you don&#39;t know the answer, just say that you don&#39;t know, don&#39;t try to make up an answer.</pre><pre>Question: {question}</pre><pre>Only return the helpful answer below and nothing else. Give an answer in 1000 characters at maximum please<br>Helpful answer:<br>&quot;&quot;&quot;</pre><pre>prompt = PromptTemplate.from_template(template) <br></pre><pre>@app.get(&quot;/&quot;)<br>async def redirect_root_to_docs():<br>    return RedirectResponse(&quot;/docs&quot;)</pre><pre>add_routes(app, <br>           prompt|llm_pipeline,<br>           path=&#39;/chain_llama_non&#39;)</pre><pre>if __name__ == &quot;__main__&quot;:<br>    import uvicorn</pre><pre>    uvicorn.run(app, host=&quot;localhost&quot;, port=5005)</pre><h3>Create and Upload Your Container to a Registry</h3><p>Now that we’ve defined the files we need, the next step is to create the container image with those files, including all the dependencies, to allow easy replication. Finally, we’ll upload the image to our preferred registry so any environment (Kubernetes cluster) can have access to it.</p><p><em>Create the Container image</em></p><p>The following is the file the Docker engine needs to create the container; we can think of a <a href="https://docs.docker.com/reference/dockerfile/">Dockerfile</a> in two parts: image creation and execution.</p><p>For image creation, we need to define the steps the engine has to follow to create the image. In this stage, we instruct the engine to use a specific base image, which in our case will be Python. We will use the Python 3.11-slim version because, as mentioned, the idea of the container is to use minimal software. For example, we don’t need to install, say, the entire Ubuntu image (which comes with several software components that won’t be used and will make the container heavier for no reason). In addition, all the required files that will be used for the application need to be declared. The COPY command transfers the requirements.txt file and the application directory (./app) into the Docker container.
requirements.txt lists Python dependencies, while the application directory contains the source code. This ensures the application is packaged with its dependencies and ready for execution inside the container.</p><pre>FROM python:3.11-slim</pre><pre># Set the working directory in the container<br>WORKDIR /usr/</pre><pre># Copy requirements file to install python dependencies<br>COPY requirements.txt ./</pre><pre># Upgrade pip and install dependencies<br>RUN pip install --upgrade pip &amp;&amp; \<br>    pip install -r requirements.txt &amp;&amp; \<br>    pip install torch torchvision torchaudio --index-url <a href="https://download.pytorch.org/whl/cpu">https://download.pytorch.org/whl/cpu</a></pre><pre># Copy app folder where the application is inside the container<br>COPY ./app ./app</pre><pre># Expose port 5005, the application will expose the API on that port<br>EXPOSE 5005</pre><p>After the image is created, we need to declare what the container will do when launched. This is declared using CMD, which specifies the command that will be executed when the container starts. In our example, we will be launching the <a href="https://www.uvicorn.org/">Uvicorn</a> server on port 5005.</p><pre># Command the container image will run when it is launched.<br>CMD exec uvicorn app.server:app --host 0.0.0.0 --port 5005</pre><p>Let’s build our container! In this case, we’ll build a container to run on Intel X86 architecture to take advantage of further optimizations like <a href="https://intel.github.io/intel-extension-for-pytorch/#introduction">Intel® Extension for PyTorch</a>. Those optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel® Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs.</p><p><strong>NOTE: BE SURE TO HAVE YOUR DOCKER ENGINE INSTALLED</strong></p><p>Refer to <a href="https://www.docker.com/">https://www.docker.com</a></p><pre>docker build --platform linux/amd64 -t llama7b-non-optimized:latest .</pre><p><em>Upload the Container</em></p><p>We should now see our container. Next, we’ll upload it to a registry. This will be relevant when building your Kubernetes environment.</p><pre>docker login</pre><pre>docker tag llama7b-non-optimized:latest &lt;username&gt;/llama7b-non-optimized:latest</pre><pre>docker push &lt;username&gt;/llama7b-non-optimized:latest</pre><p>Your container is now in your Docker Hub and ready for use when you deploy your Kubernetes cluster.</p><h3>Call to Action</h3><p>Containerizing your model can transform how you deploy and scale your applications.
Clone the repository and follow the guide to build your own LLaMa container and make it accessible via an API: <a href="https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main">Multi-llms-Chatbot-CloudNative-LangChain GitHub repo</a></p><p>This step-by-step process will help make your model scalable, portable, and ready for production.</p><p>For more articles on LLM topics, be sure to also check out:</p><p>· <a href="https://medium.com/intel-tech/tabular-data-rag-llms-improve-results-through-data-table-prompting-bcb42678914b"><strong>Tabular Data, RAG, &amp; LLMs: Improve Results Through Data Table Prompting</strong></a></p><p>· <a href="https://medium.com/intel-tech/easily-deploy-multiple-llms-in-a-cloud-native-environment-c7b47511ac15"><strong>Easily Deploy Multiple LLMs in a Cloud Native Environment</strong></a></p><p>· <a href="https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625"><strong>Four Data Cleaning Techniques to Improve Large Language Model (LLM) Performance</strong></a></p><p>· <a href="https://medium.com/intel-tech/optimize-vector-databases-enhance-rag-driven-generative-ai-90c10416cb9c"><strong>Optimize Vector Databases, Enhance RAG-Driven Generative AI</strong></a></p><h3>About the Author</h3><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist on Intel’s <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/team.html">Open Ecosystem</a> team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at <a href="https://twitter.com/eze_lanza">@eze_lanza</a> and LinkedIn at <a href="https://www.linkedin.com/in/ezelanza/">/ezelanza</a></p><h3>Follow us!</h3><p><a href="https://medium.com/intel-tech">Medium</a>, <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/podcast.html">Podcast</a>, <a href="http://open.intel.com/">Open.intel</a>, <a href="https://twitter.com/OpenAtIntel">X</a>, <a href="https://www.linkedin.com/search/results/all/?fetchDeterministicClustersOnly=true&amp;heroEntityKey=urn%3Ali%3Aorganization%3A100949244&amp;keywords=open.intel&amp;origin=RICH_QUERY_TYPEAHEAD_HISTORY&amp;position=0&amp;searchId=48e19f47-ebfe-49e5-acc2-9b9424bd004c&amp;sid=226&amp;spellCorrectionEnabled=true">Linkedin</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=436182cd179a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/how-to-containerize-your-local-llm-436182cd179a">How to Containerize Your Local LLM</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tabular Data, RAG, & LLMs: Improve Results Through Data Table Prompting]]></title>
            <link>https://medium.com/intel-tech/tabular-data-rag-llms-improve-results-through-data-table-prompting-bcb42678914b?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/bcb42678914b</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Tue, 14 May 2024 14:02:10 GMT</pubDate>
            <atom:updated>2024-07-31T20:17:59.980Z</atom:updated>
            <content:encoded><![CDATA[<p>How to ingest small tabular data when working with LLMs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yDadHu2xzZhgepKT" /><figcaption>Photo by <a href="https://unsplash.com/@kommumikation?utm_source=medium&amp;utm_medium=referral">Mika Baumeister</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h4>By Eduardo Rojas Oviedo with Ezequiel Lanza</h4><p>Say you’re a financial analyst working for an investment firm. Your job involves staying ahead of market trends and identifying potential investment opportunities for your clients, who are often curious about the world’s richest people and their sources of wealth. You might consider using a retrieval augmented generation (RAG) system to easily and quickly identify market trends, investment opportunities, and economic risks as well as answer questions like, “<strong><em>Which industry has the highest number of billionaires?</em></strong>” or “<strong><em>How does the gender distribution of billionaires compare across different regions?</em></strong>”</p><p>Your first step is to go to the <a href="https://en.wikipedia.org/wiki/The_World%27s_Billionaires">source</a> to get that information. However, as you do an initial inspection, you discover a possible obstacle. The document contains not only text but also TABLES!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*05qN8rw-usCruNDxPDQ1wQ.png" /><figcaption>Image 1 : A table which includes a ranking of the world billionaires for 2023. It was extracted from Wikipedia (<a href="https://en.wikipedia.org/wiki/The_World%27s_Billionaires">https://en.wikipedia.org/wiki/The_World%27s_Billionaires</a>)</figcaption></figure><p>In this post, we’ll demonstrate how to use a large language model (LLM) model to consume tabular data in different formats.</p><h3><strong>LLMs and the Challenge of Structured Data</strong></h3><p>For us humans, it’s straightforward to connect the dots between the text and the table. But for LLMs, it’s like trying to solve a puzzle without all the pieces.</p><p>LLMs are designed to process text in a sequential manner, meaning they read and understand information one word or sentence at a time. This sequential processing is how they’ve been trained on vast amounts of text data. However, tables present information in a different format. They organize data into rows and columns, which creates a multidimensional structure.</p><p>Understanding this structure requires a different approach for LLMs than sequential text. LLMs need to recognize patterns across rows and columns, understand relationships between different data points, and interpret the meaning of headers and cell values. Because LLMs are primarily trained on sequential text, they might struggle to interpret and process tabular data. It’s like asking someone who’s used to reading novels to suddenly understand and interpret graphs or charts; they might find it challenging because it’s outside their usual mode of processing information.</p><h3>Common Examples of Tabular Data</h3><p>We can find tabular data more commonly than we might think. Below are three scenarios ranging from small, medium, to large:</p><ol><li><strong>Small</strong>: Every day, users of our RAG platform will need to support their queries with documents that contain embedded small tabular data, as the example shown in Image 1. where the document contains mostly text and some tables. 
An industry report on global sales is another good example of small tabular data as it features textual analysis alongside tables detailing sales figures, which are often broken down by region and manufacturer. We can’t remove that tabular data to make it easy for our model to consume because it provides critical context.</li><li><strong>Medium</strong>: For some situations, you may need to analyze a greater amount of tabular data, such as spreadsheets, CSV files (comma-separated values), or data preformatted by our preferred reporting tools, such as Power BI. Say, for example, you’re working on a marketing team at a retail company, which is analyzing sales data from the past quarter to identify trends and opportunities for growth. This data would likely come in the form of multiple database files containing information on product sales, customer demographics, and regional sales performance.</li><li><strong>Large</strong>: A third scenario in which our users might wish to perform data analysis involves transaction databases and multidimensional datasets, such as <a href="https://en.wikipedia.org/wiki/OLAP_cube">OLAP cubes,</a> because they offer advantages over spreadsheets for analyzing large amounts of complex data. OLAP provides fast query performance, support for complex calculations, and the ability to slice and dice data across multiple dimensions. OLAP cubes are better suited for handling the complexity and volume of data typically found in transaction databases or other large datasets, which require an understanding of the data information domain. For example, if you’re a retail chain analyzing data to enhance sales performance, you’ll need SQL for querying and OLAP cubes for analysis to garner insights from sales trends and customer segmentation to inform inventory management and marketing strategies. These tables can be large to analyze and understand.</li></ol><p>In this article, we’ll explore the scenario where tabular data is embedded within textual information. We’ll work with small tables containing a dozen rows and a few columns, all within the model’s <a href="https://platform.openai.com/docs/models/">context window capacity.</a></p><h3>Let’s See It in Action</h3><p>Our example will be based on Figure 1, which shows the world billionaires list, to feed into our future RAG framework and answer questions about that billionaire list. With just four steps, we can demonstrate the use of tabular data as context for our users’ questions and even verify whether it’s possible to make the model incur errors.</p><h4>Prepare our environment libraries</h4><pre>import os <br>import sys<br>import pandas as pd<br>from typing import List</pre><p>To facilitate the tabular display of the results, we will use the “BeautifulTable” library, which provides a visually attractive format in a terminal (<a href="https://beautifultable.readthedocs.io/">beautifultable’s documentation</a>).</p><pre>!pip install beautifultable<br>from beautifultable import BeautifulTable</pre><p>Since we will only be working with tables, let’s explore the capabilities of camelot-py, another useful python library. 
We can follow the camelot installation instructions at <a href="https://camelot-py.readthedocs.io/en/master/user/install-deps.html">Installation of dependencies</a> and install the Ghostscript dependency from <a href="https://ghostscript.com/releases/gsdnld.html">Ghostscript Downloads</a>.</p><pre>!pip install opencv-python<br>!pip install camelot-py<br>!pip install ghostscript<br><br>import camelot</pre><h4>Prepare our data</h4><p>We will save the list (the world’s billionaires) as a PDF to facilitate our exploration.</p><pre>file_path=&quot;./World_Billionaires_Wikipedia.pdf&quot;</pre><p>The code below will allow us, in a basic way, to clean the data once it has been extracted as a <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">Pandas dataframe</a>. This routine processes each page from a preselected list and gives us access to a structure that is easy to manipulate.</p><p><em>Note: The below code was adapted by the author from </em><a href="https://docs.llamaindex.ai/en/stable/examples/query_engine/pdf_tables/recursive_retriever.html#load-in-document-and-tables"><em>Recursive Retriever + Query Engine Demo</em></a></p><pre># use camelot to parse tables   <br>def get_tables(path: str, pages: List[int]):    <br>    for page in pages:<br>        table_list = camelot.read_pdf(path, pages=str(page))<br>        if table_list.n&gt;0:<br>            for tab in range(table_list.n):<br>                <br>                # Convert the tables into dataframes.<br>                table_df = table_list[tab].df <br>                <br>                table_df = (<br>                    table_df.rename(columns=table_df.iloc[0])<br>                    .drop(table_df.index[0])<br>                    .reset_index(drop=True)<br>                )        <br>                     <br>                table_df = table_df.apply(lambda x: x.str.replace(&#39;\n&#39;,&#39;&#39;))<br>                <br>                # Change column names to be valid as XML tags<br>                table_df.columns = [col.replace(&#39;\n&#39;, &#39; &#39;).replace(&#39; &#39;, &#39;&#39;) for col in table_df.columns]<br>                table_df.columns = [col.replace(&#39;(&#39;, &#39;&#39;).replace(&#39;)&#39;, &#39;&#39;) for col in table_df.columns]<br>    <br>    return table_df<br><br># extract data table from page number<br>df = get_tables(file_path, pages=[3])</pre><p>Now, let’s convert our tabular data from the dataframe into multiple formats, such as JSON, CSV, or Markdown, among others.</p><pre># prepare test set<br>eval_df = pd.DataFrame(columns=[&quot;Data Format&quot;, &quot;Data raw&quot;]) # , &quot;Question&quot;, &quot;Answer&quot;<br><br># Save the data in JSON format<br>data_json = df.to_json(orient=&#39;records&#39;)<br>eval_df.loc[len(eval_df)] = [&quot;JSON&quot;, data_json]<br><br># Save the data as a list of dictionaries<br>data_list_dict = df.to_dict(orient=&#39;records&#39;)<br>eval_df.loc[len(eval_df)] = [&quot;DICT&quot;, data_list_dict]<br><br># Save the data in CSV format<br>csv_data = df.to_csv(index=False)<br>eval_df.loc[len(eval_df)] = [&quot;CSV&quot;, csv_data]<br><br># Save the data in tab-separated format<br>tsv_data = df.to_csv(index=False, sep=&#39;\t&#39;)<br>eval_df.loc[len(eval_df)] = [&quot;TSV (tab-separated)&quot;, tsv_data]<br><br># Save the data in HTML format<br>html_data = df.to_html(index=False)<br>eval_df.loc[len(eval_df)] = [&quot;HTML&quot;, html_data]<br><br># Save the data in LaTeX format<br>latex_data = 
df.to_latex(index=False)<br>eval_df.loc[len(eval_df)] = [&quot;LaTeX&quot;, latex_data]<br><br># Save the data in Markdown format<br>markdown_data = df.to_markdown(index=False)<br>eval_df.loc[len(eval_df)] = [&quot;Markdown&quot;, markdown_data]<br><br># Save the data as a string<br>string_data = df.to_string(index=False)<br>eval_df.loc[len(eval_df)] = [&quot;STRING&quot;, string_data]<br><br># Save the data as a NumPy array<br>numpy_data = df.to_numpy()<br>eval_df.loc[len(eval_df)] = [&quot;NumPy&quot;, numpy_data]<br><br># Save the data in XML format<br>xml_data = df.to_xml(index=False)<br>eval_df.loc[len(eval_df)] = [&quot;XML&quot;, xml_data]</pre><p>It’s time to explore our test data. We have configured a dataset where each row represents an output format from dataframe and the data in “Data raw” corresponds to the tabular data that we will use with the generative model.</p><pre>from pandas import option_context<br>with option_context(&#39;display.max_colwidth&#39;, 150):<br>    display(eval_df.head(10))</pre><p>Output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MO9844g7xly-gscKZggzpA.jpeg" /><figcaption>Image 2: A code output that showcases the raw data for each text format</figcaption></figure><h4>Set our model for validation</h4><p>Let’s prepare a basic prompt that allows us to interact with the context data.</p><pre>MESSAGE_SYSTEM_CONTENT = &quot;&quot;&quot;You are a customer service agent that helps a customer with answering questions. <br>Please answer the question based on the provided context below. <br>Make sure not to make any changes to the context, if possible, when preparing answers to provide accurate responses. <br>If the answer cannot be found in context, just politely say that you do not know, do not try to make up an answer.&quot;&quot;&quot;</pre><p>Before carrying out our tests with the tabular dataset, we will need to prepare our model’s connection settings (in this example, we’ll use AzureOpenAI). 
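</p><p>The connection code below assumes that three values (OAI_API_Key, OAI_API_Version, and OAI_API_Base) are already defined. As a minimal sketch, assuming you keep your Azure OpenAI credentials in environment variables (the variable names here are illustrative), they could be loaded like this:</p><pre>import os<br><br># Illustrative environment variable names; adapt them to your own setup<br>OAI_API_Key = os.environ[&quot;AZURE_OPENAI_API_KEY&quot;]<br>OAI_API_Version = os.environ[&quot;AZURE_OPENAI_API_VERSION&quot;]<br>OAI_API_Base = os.environ[&quot;AZURE_OPENAI_ENDPOINT&quot;]</pre><p>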
You’ll need to provide your credentials.</p><pre>from openai import AzureOpenAI<br><br>client = AzureOpenAI(<br>    api_key=OAI_API_Key, <br>    api_version=OAI_API_Version, <br>    azure_endpoint=OAI_API_Base)<br><br>def response_test(question:str, context:str, model:str = &quot;gpt-4&quot;):<br>    response = client.chat.completions.create(<br>        model=model,<br>        messages=[<br>            {<br>                &quot;role&quot;: &quot;system&quot;,<br>                &quot;content&quot;: MESSAGE_SYSTEM_CONTENT,<br>            },<br>            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: question},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: context},<br>        ],<br>    )<br>    <br>    return response.choices[0].message.content</pre><p>Since each row of our dataset represents an individual unit of context information, we implemented the following iteration routine, which processes one row after another and stores the model’s answer for each one.</p><pre>def run_question_test(query: str, eval_df: pd.DataFrame):<br><br>    questions = []<br>    answers = []<br><br>    for index, row in eval_df.iterrows():<br>        questions.append(query)<br>        response = response_test(query, str(row[&#39;Data raw&#39;]))<br>        answers.append(response)<br>        <br>    eval_df[&#39;Question&#39;] = questions<br>    eval_df[&#39;Answer&#39;] = answers<br>    <br>    return eval_df<br><br>def BeautifulTableformat(query:str, results:pd.DataFrame, MaxWidth:int = 250):<br>    table = BeautifulTable(maxwidth=MaxWidth, default_alignment=BeautifulTable.ALIGN_LEFT)<br>    table.columns.header = [&quot;Data Format&quot;, &quot;Query&quot;, &quot;Answer&quot;]<br>    for index, row in results.iterrows():<br>        table.rows.append([row[&#39;Data Format&#39;], query, row[&#39;Answer&#39;]])<br>    <br>    return table</pre><h4>Let’s have FUN!</h4><p>Now, let’s connect the dots: we’ll process the dataset, obtaining an answer with each tabular data format as context information, and then display the results in a tabular manner.</p><pre>query = &quot;What&#39;s the Elon Musk&#39;s net worth?&quot;<br>result_df1 = run_question_test(query, eval_df.copy())<br>table = BeautifulTableformat(query, result_df1, 150)<br>print(table)</pre><p>Output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9aJqih5kn7za9Zu22t2erg.jpeg" /><figcaption>Image 3 depicts a table outlining the responses provided by a model for diverse input text formats to the inquiry, “What is Elon Musk’s net worth?”</figcaption></figure><p>We can see how the question, “<strong>What’s Elon Musk&#39;s net worth?</strong>” was consistently answered for each tabular data format obtained during the Pandas DataFrame conversion. 
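</p><p>One lightweight way to spot-check that consistency programmatically is to pull the quoted net-worth figure out of every answer and confirm they all agree. The short sketch below is not part of the original workflow, and the regular expression and helper name are only illustrative:</p><pre>import re<br><br># Hypothetical helper: grab the first dollar figure mentioned in an answer<br>def extract_amount(answer: str):<br>    match = re.search(r&quot;\$\s?[\d,.]+\s*(billion|million)?&quot;, answer, re.IGNORECASE)<br>    return match.group(0) if match else None<br><br>amounts = {extract_amount(answer) for answer in result_df1[&quot;Answer&quot;]}<br>print(amounts)  # ideally a single value shared by every data format</pre><p>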
Because each answer is a semantic elaboration rather than a literal lookup, variations between responses are a challenge we must account for when computing final validation metrics.</p><p>We could also obtain more concise or elaborate responses if we modify our “MESSAGE_SYSTEM_CONTENT” variable.</p><p>Let’s repeat the exercise once again, this time with a question that requires more analytical reasoning from our model.</p><pre>query = &quot;What&#39;s the sixth richest billionaire in 2023 net worth?&quot;<br>result_df2 = run_question_test(query, eval_df.copy())<br>table = BeautifulTableformat(query, result_df2, 150)<br>print(table)</pre><p>Output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zJJvFr98PbuF7ohNRLy52g.jpeg" /><figcaption>Image 4 displays a table presenting the responses provided by a model to various formats of input text for the question, “What is the net worth of the sixth wealthiest billionaire in 2023?”</figcaption></figure><p>As with the previous example, we have used each of the Pandas DataFrame export formats as context for the query.</p><p>In this example, the question &quot;What&#39;s the sixth richest billionaire in 2023 net worth?&quot; proves that the model can respond to something more abstract, such as &quot;the sixth richest billionaire,&quot; which involves greater analytical reasoning and tabular data calculation. Both challenges were resolved with excellent consistency.</p><h4>Relevancy and Distraction</h4><p>Let’s play around with the model and check that our prompt and data context work as we expect.</p><pre>query = &quot;What&#39;s the Michael Jordan net worth?&quot;<br>result_df3 = run_question_test(query, eval_df.copy())<br>table = BeautifulTableformat(query, result_df3, 150)<br>print(table)</pre><p>Output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VNW0SnjBabEatmPFZF0UDg.jpeg" /><figcaption>Image 5 showcases a table illustrating the responses generated by a model for different input text formats in response to the question, “What is Michael Jordan’s net worth?”</figcaption></figure><p>With this test, we’ve proposed a question that has no correct answer in the context of the information provided. 
Our goal is to ensure that our model does not respond with a hallucination or false positive (a false response that seems true).<br>The answer to our “What’s Michael Jordan’s net worth?” was resolved consistently for each data format, as we expected (there was no answer to the question).</p><p>Let’s provide another example, which would possibly mislead an unsuspecting user, by using a name that significantly resembles one existing in the tabular data.</p><pre>query = &quot;What&#39;s Michael Musk&#39;s net worth?&quot;<br>result_df4 = run_question_test(query, eval_df.copy())<br>table = BeautifulTableformat(query, result_df4, 180)<br>print(table)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9JL_cQP4aJnf64UL_6eZCA.jpeg" /><figcaption>Image 6 presents a table delineating the responses generated by a model across various input text formats in answer to the question, “What is Michael Musk’s net worth?”</figcaption></figure><p>With the question “What’s Michael Musk&#39;s net worth?” where “Musk” could make us misinterpret the question, the model has nevertheless solved the challenge satisfactorily.</p><h3>Conclusion</h3><p>By ingesting small tabular data from textual documents, we’ve seen how an LLM can understand the context of a table even when we tried to trick it by asking questions with wrong information. It’s evident that preserving structural integrity while extracting contextually embedded tables is relevant. These tables often contain critical contextual information, enhancing the surrounding text’s comprehension to provide more accurate results.</p><p>Focusing on small-embedded tables, it’s essential to recognize their significance in providing contextual clues to your RAG framework. Leveraging libraries like <a href="https://github.com/camelot-dev/camelot">Camelot</a> for table extraction ensures the preservation of these structures. However, maintaining relevance without distraction poses challenges, as demonstrated in model testing. By doing so, we provide essential context to models like GPT, enabling them to generate accurate and contextually relevant responses within broader textual contexts.</p><h3>Call to Action</h3><p>If you want to explore more, there are a series of alternative libraries that facilitate the extraction of tabular data, such as: <a href="https://github.com/tabulapdf/tabula">Tabula</a>, <a href="https://github.com/jsvine/pdfplumber">pdfplumber</a>, <a href="https://github.com/drj11/pdftables">pdftables</a>, and <a href="https://github.com/ashima/pdf-table-extract">pdf-table-extract</a>.In “<a href="https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools">Comparison with other PDF Table Extraction libraries and tools</a>,” Vinayak Mehta developed a comparative performance evaluation of the listed libraries that you might find useful.</p><p>Explore <a href="https://opea.dev">Open Platform for Enterprise AI (OPEA)</a>. When deploying a RAG system, we will face multiple challenges such as ensuring scalability, handling data security, and integrating with existing infrastructure. 
This open-source project helps with the deployment by providing robust tools and frameworks designed to streamline these processes and facilitate seamless integration.</p><h3>About the Authors</h3><p><strong>Eduardo Rojas Oviedo, Platform Engineer, Intel</strong></p><p><a href="https://www.linkedin.com/in/eduardo-rojas-oviedo-56306923?originalSubdomain=cr">Eduardo Rojas Oviedo</a> is a dedicated RAG developer within Intel’s dynamic and innovative team. Specialized in cutting-edge developer tools for AI, Machine Learning, and NLP, he is passionate about leveraging technology to create impactful solutions. Eduardo’s expertise lies in building robust and innovative applications that push the boundaries of what’s possible in the realm of artificial intelligence. His commitment to sharing knowledge and advancing technology drives his ongoing pursuit of excellence in the field.</p><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist on Intel’s <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/team.html">Open Ecosystem</a> team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at <a href="https://twitter.com/eze_lanza">@eze_lanza</a> and LinkedIn at <a href="https://www.linkedin.com/in/ezelanza/">/eze_lanz</a>a</p><h3>Follow us!</h3><p><a href="https://medium.com/intel-tech">Medium</a>, <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/podcast.html">Podcast</a>, <a href="http://open.intel.com">Open.intel</a> , <a href="https://twitter.com/OpenAtIntel">X</a> , <a href="https://www.linkedin.com/search/results/all/?fetchDeterministicClustersOnly=true&amp;heroEntityKey=urn%3Ali%3Aorganization%3A100949244&amp;keywords=open.intel&amp;origin=RICH_QUERY_TYPEAHEAD_HISTORY&amp;position=0&amp;searchId=48e19f47-ebfe-49e5-acc2-9b9424bd004c&amp;sid=226&amp;spellCorrectionEnabled=true">Linkedin</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bcb42678914b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/tabular-data-rag-llms-improve-results-through-data-table-prompting-bcb42678914b">Tabular Data, RAG, &amp; LLMs: Improve Results Through Data Table Prompting</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Easily Deploy Multiple LLMs in a Cloud Native Environment]]></title>
            <link>https://medium.com/intel-tech/easily-deploy-multiple-llms-in-a-cloud-native-environment-c7b47511ac15?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/c7b47511ac15</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Tue, 23 Apr 2024 16:02:09 GMT</pubDate>
            <atom:updated>2024-05-06T20:31:45.911Z</atom:updated>
            <content:encoded><![CDATA[<p>Take the complexity out of deploying cloud native LLMs with LangChain and Intel Developer Cloud</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pue1nFqUuGZBvo0Nn36xqw.png" /></figure><h4>By Arun Gupta and Ezequiel Lanza</h4><p>Imagine you could give a personal assistant to each employee. Productivity across your organization would spike as employees can see how AI can help them and feel empowered to focus on strategic thinking. This dream scenario is possible with powerful AI technology, such as department-specific chatbots that deliver fast, high-quality results.</p><p>However, businesses often need to weave together multiple large language models (LLMs) to support diverse use cases. Because each model may have different compute and storage needs or specific knowledge to be used by internal departments in unique ways, the complexity can quickly skyrocket.</p><p>The right set of tools can take the complexity out of the deployment process. Here we’ll explore a reference architecture for building and deploying multiple LLMs in a single user interface with Kubernetes and <a href="https://www.langchain.com/">LangChain</a>. You can also find a thorough explanation of each step in the corresponding <a href="https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main">GitHub repo</a>, which is structured as an educational resource with recipe files for you to download and create your own containers.</p><p>We demo the complete steps for this approach in a KubeCon + CloudNativeCon Europe 2024 presentation.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FDVf5O2Xm_RY%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DDVf5O2Xm_RY&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FDVf5O2Xm_RY%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/402d4ec8ca6afde782d5b38f96a0984e/href">https://medium.com/media/402d4ec8ca6afde782d5b38f96a0984e/href</a></iframe><h3><strong>Step 1: Define Your Model</strong></h3><p><a href="https://huggingface.co/models">Hugging Face makes it easy to download models</a>, so you can start inferencing locally right away, but how do you know which LLM to choose? There are three important considerations to keep in mind as you evaluate your options, in addition to deciding whether you should deploy the model locally or externally (which we’ll address in the next step).</p><p>· <strong>Performance: </strong>Before downloading a model, you can see how it stacks up against industry benchmarks by comparing it to other models on the <a href="https://huggingface.co/open-llm-leaderboard">Hugging Face Leaderboard</a>, a public ranking system of open LLMs.</p><p>· <strong>Community support: </strong>You don’t want to choose a model that no one uses or maintains. Look for a model with widespread community adoption, an active contributor base, and strong documentation with helpful resources like tutorials. These are all signs of a thriving community that can offer the help you need down the road.</p><p>· <strong>Ethical considerations: </strong>Sometimes models generate biased results around traits like ethnicity or gender. 
Choosing a model that trains on diverse data and is transparent about its processes can help you mitigate bias and ensure your results are fair.</p><p>Depending on your use case, you may also want to optimize your model. A tool like <a href="https://github.com/intel/intel-extension-for-transformers">Intel® Extension for Transformers</a>, for example, can help you shrink the RAM needed to store a <a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">7 billion parameter LLaMa2 chatbot model</a> from 26 GB to 7 GB.</p><h3><strong>Step 2: Choose a Consumption Model</strong></h3><p>Once you’ve picked a model, you need to consider where you’ll consume it. Local models can be stored on your internal servers or even laptops if they are optimized, while external models allow you to use LLMs hosted by a third party. Factors such as cost, storage capacity, and how you plan to handle sensitive data will help determine which consumption model is right for you.</p><p>For example, if your model will be used by financial or legal departments, which are often subject to data privacy regulations, you may need to use a local model so you can inference without ever sending sensitive data outside your organization. Local models give you more control to choose how and when you use them, enabling your team to use models offline and customize them to your business, also known as fine-tuning. Additionally, local models can potentially benefit from cost efficiencies, as some external models can require you to pay per outbound and inbound token . When using a fee-based external model, you will be paying both to send a query and receive a response. You’ll want to ensure that your prompts are well engineered and phrased precisely to draw out the correct answer. Otherwise, you’ll be sending multiple prompts and inflating your fees.</p><p>However, there are advantages to external models that require much less compute power and storage. Say you have a 7 billion parameter LLaMa model, which requires about 26 GB RAM. To inference the model locally, you’d need sufficient storage space on your server plus the necessary compute power provided by local CPUs and GPUs to support model inference and your application. In an external consumption model, however, you’d only need enough compute power to run your application — simply make an API call and start inferencing. Additionally, because a third party manages infrastructure resources, external models are typically simpler to set up, faster to deploy, and easier to scale.</p><p>You can use a <a href="https://docsbot.ai/tools/gpt-openai-api-pricing-calculator">pricing calculator</a> to start estimating the cost of an external model based on the provider and number of input and output tokens you need.</p><h3><strong>Step 3: Package Your Models</strong></h3><p>As you narrow in on which LLMs you’ll use to support your most important use cases, you need a unified way to manage all models in the most efficient way possible. LangChain is an open source framework that simplifies building and managing multiple types of LLM applications in a single user interface. 
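</p><p>To make that concrete, here is a minimal sketch (not from the original walkthrough) of wrapping a local Hugging Face pipeline in LangChain and calling it through a prompt template; it assumes recent langchain-core, langchain-community, and transformers packages, and the model name and prompt text are placeholders:</p><pre>from transformers import pipeline<br>from langchain_core.prompts import PromptTemplate<br>from langchain_community.llms import HuggingFacePipeline<br><br># Wrap a local Hugging Face text-generation pipeline as a LangChain LLM<br>hf_pipeline = pipeline(&quot;text-generation&quot;, model=&quot;gpt2&quot;, max_new_tokens=64)  # placeholder model<br>llm = HuggingFacePipeline(pipeline=hf_pipeline)<br><br># A prompt template adds context before the question reaches the pipeline<br>prompt = PromptTemplate.from_template(<br>    &quot;You are an assistant for the {department} team.\nQuestion: {question}\nAnswer:&quot;<br>)<br><br># Compose the template and the LLM, then interact with the model via invoke<br>chain = prompt | llm<br>print(chain.invoke({&quot;department&quot;: &quot;marketing&quot;, &quot;question&quot;: &quot;What can you help me with?&quot;}))</pre><p>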
You can plug in any of the <a href="https://python.langchain.com/docs/modules/model_io/">more than 80 supported types of open source LLMs</a>, including local and external models, optimized and unoptimized models, and even advanced techniques like retrieval-augmented generation (RAG).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9Fdt5cnSdHA2Pf0Khq6yXA.png" /><figcaption>The LangChain API Chain combines the prompt template and LLM pipeline to generate better results.</figcaption></figure><p>However, as you can see, an LLM application is not just a model. Complete applications include a model, parameters, and a tokenizer, which Hugging Face provides in multiple configurations called pipelines. In addition, LangChain offers a prompt template that delivers more context to the pipeline to generate better results. LangChain offers an API called Chain, <a href="https://python.langchain.com/docs/expression_language/get_started">among others</a>, that pulls together your prompt template and pipeline, so interacting with the model becomes as simple as using “chain.invoke” to send your question and generate a response.</p><h3><strong>Step 4: Containerize Your Model</strong></h3><p>Ask a developer how they like to deploy their LLMs, and more often than not they’ll say Kubernetes. Cloud native has become the de facto platform for deploying LLMs because it offers a few key advantages.</p><p>· <strong>Scalability and portability:</strong> After you’ve configured a model in a cluster on your desktop, you can seamlessly scale it across platforms — such as Amazon EKS, Microsoft Azure, or the <a href="https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html">Intel® Developer Cloud</a> — and production environments, from on-premises to the edge and from small to large node clusters.</p><p>· <strong>Resource management: </strong>AI models consume lots of memory and compute power. Kubernetes gives you more control to fine-tune your resources via CPU and memory limits, resource quotas, and priority classes to funnel power to your most important models first.</p><p>· <strong>Observability: </strong>Many open source projects are using telemetry to enhance visibility and provide more insight into AI models.</p><p>Just like the model packaging process in the previous step, LangChain also provides an API to help you easily containerize your model. Start by connecting your container to the file server via a <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/">persistent volume claim (PVC) or a persistent volume (PV)</a>. Now when you run the container, the “POST” API downloads the model from the file server and places it inside the container.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W6SWH6sBUAEydAYJjcT7Jg.gif" /><figcaption>Each time you run the container, the local or external model (represented by the pipelines here) is downloaded from the file server onto the container.</figcaption></figure><h3><strong>Step 5: Integrate Multiple Models</strong></h3><p>Now that you’ve packaged and containerized your model, you may want to integrate additional LLMs that are fine-tuned to new use cases. To do so, you need an LLM proxy — an architecture layer that unites distinct LLMs so you can interact with multiple models at once.</p><p>Take for example your organization’s intranet. 
It connects employees across your organization to the tools they need to complete their work, communicate across business units, and access important information about their employment status, such as pay stubs, PTO requests, tax forms. Attaching your intranet UI to an LLM proxy gives your employees access to AI models optimized to each business unit.</p><p>In addition to providing management tools for model provisioning and governance, the LLM proxy may also include additional intelligence to help field AI prompts and direct them to the best LLM to solve the problem.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bKIFNayNb3s-DSB1F9VdUg.png" /><figcaption>The LLM proxy connects front-end architecture with the LLMs running on the back end.</figcaption></figure><p>To build this in a Kubernetes environment, first create an ingress controller or load balancer, such as NGINX, to expose the front end and the proxy so that the browser can communicate with it. Everything beneath the proxy, such as local models or APIs for external models, will not need to be exposed to the browser and can be deployed internally as pods on the cluster. Once you push your containers to a container registry like <a href="https://hub.docker.com/">Docker Hub</a>, your containers will be accessible when you run your cluster.</p><h3><strong>Demo: Deploying a Chatbot in Kubernetes in Intel Developer Cloud</strong></h3><p>In this<a href="https://youtu.be/DVf5O2Xm_RY?si=664U94HWUDg0uCKF&amp;t=1581"> quick demo</a>, we’ll show you how to deploy a chatbot that can switch between multiple models using LangChain and Intel Kubernetes Service (IKS). IKS is an <a href="https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html">Intel® Developer Cloud</a> service that lets you test applications on the latest Intel® hardware as soon as it’s available, rather than waiting for a cloud service provider to adopt the hardware. You can also download the necessary files from the <a href="https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main">GitHub repo</a> to launch your own containers and follow along. As you’ll see, the results appear in a simple, mock UI, using <a href="https://react.dev/">React</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/695/1*xU0qJWNKwNT79wpOqlbrmw.png" /><figcaption>The LangChain application running in Intel Kubernetes Service (IKS) allows the user to easily switch between three models.</figcaption></figure><h3><strong>Try It for Yourself</strong></h3><p>AI models can help improve employee productivity across your organization, but one model rarely fits all use cases. LangChain makes it easy to use multiple LLMs in one environment, allowing employees to choose which model is right for each situation. Explore the <a href="https://github.com/intel/Multi-llms-Chatbot-CloudNative-LangChain/tree/main">GitHub repo</a> to get started using LangChain and IKS.</p><h3>About the Authors</h3><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist on Intel’s <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/team.html">Open Ecosystem</a> team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. 
Find him on X at <a href="https://twitter.com/eze_lanza">@eze_lanza</a> and LinkedIn at <a href="https://www.linkedin.com/in/ezelanza/">/eze_lanza</a></p><p><strong>Arun Gupta, vice president and general manager, Open Ecosystem, Intel</strong></p><p>Dedicated to growing the open ecosystem at Intel, <a href="https://www.linkedin.com/in/arunpgupta">Arun Gupta</a> is a strategist, advocate, and practitioner who has spent two decades helping companies such as Apple and Amazon embrace open source principles. He is currently chairperson of the Cloud Native Computing Foundation Governing Board.</p><h3>Follow us!</h3><p><a href="https://medium.com/intel-tech">Medium</a>, <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/podcast.html">Podcast</a>, <a href="http://open.intel.com">Open.intel</a> , <a href="https://twitter.com/OpenAtIntel">X</a> , <a href="https://www.linkedin.com/search/results/all/?fetchDeterministicClustersOnly=true&amp;heroEntityKey=urn%3Ali%3Aorganization%3A100949244&amp;keywords=open.intel&amp;origin=RICH_QUERY_TYPEAHEAD_HISTORY&amp;position=0&amp;searchId=48e19f47-ebfe-49e5-acc2-9b9424bd004c&amp;sid=226&amp;spellCorrectionEnabled=true">Linkedin</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c7b47511ac15" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/easily-deploy-multiple-llms-in-a-cloud-native-environment-c7b47511ac15">Easily Deploy Multiple LLMs in a Cloud Native Environment</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimize Vector Databases, Enhance RAG-Driven Generative AI]]></title>
            <link>https://medium.com/intel-tech/optimize-vector-databases-enhance-rag-driven-generative-ai-90c10416cb9c?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/90c10416cb9c</guid>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Tue, 02 Apr 2024 02:56:13 GMT</pubDate>
            <atom:updated>2024-07-31T21:08:04.385Z</atom:updated>
            <content:encoded><![CDATA[<p>Two methods to optimize your vector database when using RAG</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FRWBVwOHPYFDIVTp_ylZNQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@ilyapavlov?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Ilya Pavlov</a> on <a href="https://unsplash.com/photos/monitor-showing-java-programming-OqtafYT5kTw?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure><h4>By Cathy Zhang and Dr. Malini Bhandaru</h4><h4>Contributors: Lin Yang and Changyan Liu</h4><p>Generative AI (GenAI) models, which are seeing exponential adoption in our daily lives, are being improved by <a href="https://www.techtarget.com/searchenterpriseai/definition/retrieval-augmented-generation">retrieval-augmented generation (RAG)</a>, a technique used to enhance response accuracy and reliability by fetching facts from external sources. RAG helps a regular <a href="https://www.techtarget.com/whatis/definition/large-language-model-LLM">large language model (LLM)</a> understand context and reduce <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucinations</a> by leveraging a giant database of unstructured data stored as vectors — a mathematical presentation that helps capture context and relationships between data.</p><p>RAG helps to retrieve more contextual information and thus generate better responses, but the vector databases they rely on are getting ever larger to provide rich content to draw upon. Just as trillion-parameter LLMs are on the horizon, vector databases of billions of vectors are not far behind. As optimization engineers, we were curious to see if we could make vector databases more performant, load data faster, and create indices faster to ensure retrieval speed even as new data is added. Doing so would not only result in reduced user wait time, but also make RAG-based AI solutions a little more sustainable.</p><p>In this article, you’ll learn more about vector databases and their benchmarking frameworks, datasets to tackle different aspects, and the tools used for performance analysis — everything you need to start optimizing vector databases. We will also share our optimization achievements on two popular vector database solutions to inspire you on your optimization journey of performance and sustainability impact.</p><h3><strong>Understanding Vector Databases</strong></h3><p>Unlike traditional relational or non-relational databases where data is stored in a structured manner, a vector database contains a mathematical representation of individual data items, called a vector, constructed using an embedding or transformation function. The vector commonly represents features or semantic meanings and can be short or long. Vector databases do vector retrieval by similarity search using a distance metric (where closer means the results are more similar) such as <a href="https://www.pinecone.io/learn/vector-similarity/">Euclidean, dot product, or cosine similarity</a>.</p><p>To accelerate the retrieval process, the vector data is organized using an indexing mechanism. 
Examples of these organization methods include flat structures, <a href="https://arxiv.org/abs/2002.09094">inverted file (IVF),</a> <a href="https://arxiv.org/abs/1603.09320">Hierarchical Navigable Small Worlds (HNSW),</a> and <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality-sensitive hashing (LSH)</a>, among others. Each of these methods contributes to the efficiency and effectiveness of retrieving similar vectors when needed.</p><p>Let’s examine how you would use a vector database in a GenAI system. Figure 1 illustrates both the loading of data into a vector database and using it in the context of a GenAI application. When you input your prompt, it undergoes a transformation process identical to the one used to generate vectors in the database. This transformed vector prompt is then used to retrieve similar vectors from the vector database. These retrieved items essentially serve as conversational memory, furnishing contextual history for prompts, akin to how LLMs operate. This feature proves particularly advantageous in natural language processing, computer vision, recommendation systems, and other domains requiring semantic comprehension and data matching. Your initial prompt is subsequently “merged” with the retrieved elements, supplying context, and assisting the LLM in formulating responses based on the provided context rather than solely relying on its original training data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zQj_YJdWc2xKB6Vv89lzDQ.jpeg" /><figcaption><strong>Figure 1.</strong> A RAG application architecture.</figcaption></figure><p>Vectors are stored and indexed for speedy retrieval. Vector databases come in two main flavors, traditional databases that have been extended to store vectors, and purpose-built vector databases. Some examples of traditional databases that provide vector support are <a href="https://redis.io/">Redis</a>, <a href="https://github.com/pgvector/pgvector">pgvector</a>, <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>, and <a href="https://opensearch.org/">OpenSearch</a>. Examples of purpose-built vector databases include proprietary solutions <a href="https://zilliz.com/">Zilliz</a> and <a href="https://www.pinecone.io/">Pinecone</a>, and open source projects <a href="https://milvus.io/">Milvus</a>, <a href="https://weaviate.io/">Weaviate</a>, <a href="https://qdrant.tech/">Qdrant</a>, <a href="https://github.com/facebookresearch/faiss">Faiss</a>, and <a href="https://www.trychroma.com/">Chroma</a>. You can learn more about vector databases on GitHub via <a href="https://github.com/langchain-ai/langchain/tree/master/libs/langchain/langchain/vectorstores">LangChain </a>and <a href="https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases">OpenAI Cookbook</a>.</p><p>We’ll take a closer look at one from each category, Milvus and Redis.</p><h3><strong>Improving Performance</strong></h3><p>Before diving into the optimizations, let’s review how vector databases are evaluated, some evaluation frameworks, and available performance analysis tools.</p><h4><strong>Performance Metrics</strong></h4><p>Let’s look at key metrics that can help you measure vector database performance.</p><ul><li><strong>Load latency</strong> measures the time required to load data into the vector database’s memory and build an index. An index is a data structure used to efficiently organize and retrieve vector data based on its similarity or distance. 
Types of <a href="https://milvus.io/docs/index.md#In-memory-Index">in-memory indices</a> include <a href="https://thedataquarry.com/posts/vector-db-3/#flat-indexes">flat index</a>, <a href="https://supabase.com/docs/guides/ai/vector-indexes/ivf-indexes">IVF_FLAT</a>, <a href="https://towardsdatascience.com/ivfpq-hnsw-for-billion-scale-similarity-search-89ff2f89d90e">IVF_PQ, HNSW</a>, <a href="https://github.com/google-research/google-research/tree/master/scann">scalable nearest neighbors (ScaNN),</a>and <a href="https://milvus.io/docs/disk_index.md">DiskANN</a>.</li><li><strong>Recall</strong> is the proportion of true matches, or relevant items, found in the <a href="https://redis.io/docs/data-types/probabilistic/top-k/">Top K</a> results retrieved by the search algorithm. Higher recall values indicate better retrieval of relevant items.</li><li><strong>Queries per second (QPS)</strong> is the rate at which the vector database can process incoming queries. Higher QPS values imply better query processing capability and system throughput.</li></ul><h4><strong>Benchmarking Frameworks</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/460/1*mssEjZAuXg6nf-pad67rHA.jpeg" /><figcaption><strong>Figure 2.</strong> The vector database benchmarking framework.</figcaption></figure><p>Benchmarking a vector database requires a vector database server and clients. In our performance tests, we used two popular open source tools.</p><ul><li><a href="https://github.com/zilliztech/VectorDBBench/tree/main"><strong>VectorDBBench</strong></a><strong>: </strong>Developed and open sourced by Zilliz, VectorDBBench helps test different vector databases with different index types and provides a convenient web interface.</li><li><a href="https://github.com/qdrant/vector-db-benchmark/tree/master"><strong>vector-db-benchmark</strong></a><strong>: </strong>Developed and open sourced by Qdrant, vector-db-benchmark helps test several typical vector databases for the <a href="https://www.datastax.com/guides/hierarchical-navigable-small-worlds">HNSW</a> index type. It runs tests through the command line and provides a <a href="https://docs.docker.com/compose/">Docker Compose</a><em> </em>file to simplify starting server components.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NpHHEFV0TxRMse83hK6H1A.jpeg" /><figcaption><strong>Figure 3.</strong> An example vector-db-benchmark command used to run the benchmark test.</figcaption></figure><p>But the benchmark framework is only part of the equation. We need data that exercises different aspects of the vector database solution itself, such as its ability to handle large volumes of data, various vector sizes, and speed of retrieval.With that, let’s look at some available public datasets.</p><h4><strong>Open Datasets</strong> <strong>to Exercise Vector Databases</strong></h4><p>Large datasets are good candidates to test load latency and resource allocation. Some datasets have high dimensional data and are good for testing speed of computing similarity.</p><p>Datasets range from a dimension of 25 to a dimension of 2048. The <a href="https://laion.ai/">LAION</a> dataset, an open image collection, has been used for training very large visual and language deep-neural models like stable diffusion generative models. OpenAI’s dataset of 5M vectors, each with a dimension of 1536, was created by VectorDBBench by running OpenAI on <a href="https://huggingface.co/datasets/allenai/c4">raw data</a>. 
Given each vector element is of type FLOAT, to save the vectors alone, approximately 29 GB (5M * 1536 * 4) of memory is needed, plus a similar amount extra to hold indices and other metadata for a total of 58 GB of memory for testing. When using the vector-db-benchmark tool, ensure adequate disk storage to save results.</p><p>To test for load latency, we needed a large collection of vectors, which <a href="https://docs.hippo.transwarp.io/docs/performance-dataset">deep-image-96-angular</a> offers. To test performance of index generation and similarity computation, high dimensional vectors provide more stress. To this end we chose the 500K dataset of 1536 dimension vectors.</p><h4><strong>Performance Tools</strong></h4><p>We’ve covered ways to stress the system to identify metrics of interest, but let’s examine what’s happening at a lower level: How busy is the computing unit, memory consumption, waits on locks, and more? These provide clues to databasebehavior, particularly useful in identifying problem areas.</p><p>The Linux <a href="https://www.redhat.com/sysadmin/interpret-top-output">top</a> utility provides system-performance information. However, the <a href="https://perf.wiki.kernel.org/index.php/Main_Page">perf</a> tool in Linux provides a deeper set of insights. To learn more, we also recommend reading <a href="https://www.brendangregg.com/perf.html">Linux perf examples</a> and the <a href="https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html">Intel top-down microarchitecture analysis method</a>. Yet another tool is the <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html">Intel® vTune™ Profiler</a>, which is useful when optimizing not just application but also system performance and configuration for a variety of workloads spanning HPC, cloud, IoT, media, storage, and more.</p><h3><strong>Milvus Vector Database Optimizations</strong></h3><p>Let’s walk through some examples of how we attempted to improve the performance of the Milvus vector database.</p><h4><strong>Reducing Memory Movement Overhead in Datanode Buffer Write</strong></h4><p>Milvus’s write path proxies write data into a log broker via <em>MsgStream</em>. The data nodes then consume the data, converting and storing it into segments. Segments will merge the newly inserted data. The merge logic allocates a new buffer to hold/move both the old data and the new data to be inserted and then returns the new buffer as old data for the next data merge. This results in the old data getting successively larger, which in turn makes data movement slower. Perf profiles showed a high overhead for this logic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/712/1*Az4dMVBcGmdeyKNrwpR19g.jpeg" /><figcaption><strong>Figure 4.</strong> Merging and moving data in the vector database generates a high-performance overhead.</figcaption></figure><p>We changed the <em>merge buffer</em> logic to directly append the new data to be inserted into the old data, avoiding allocating a new buffer and moving the large old data. Perf profiles confirm that there is no overhead to this logic. The microcode metrics <em>metric_CPU operating frequency</em> and <em>metric_CPU utilization</em> indicate an improvement that is consistent with the system not having to wait for the long memory movement anymore. Load latency improved by more than 60 percent. 
The improvement is captured on <a href="https://github.com/milvus-io/milvus/pull/26839">GitHub</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MmaUtBTdqmMvC5MlQ8V0wQ.jpeg" /><figcaption><strong>Figure 5.</strong> With less copying we see a performance improvement of more than 50 percent in load latency.</figcaption></figure><h4><strong>Inverted Index Building with Reduced Memory Allocation Overhead</strong></h4><p>The Milvus search engine, <a href="https://milvus.io/docs/knowhere.md">Knowhere</a>, employs the <a href="https://www.vlfeat.org/api/kmeans-fundamentals.html#kmeans-elkan">Elkan k-means algorithm</a> to train cluster data for creating <a href="https://milvus.io/docs/v1.1.1/index.md">inverted file (IVF) indices</a>. Each round of data training defines an iteration count. The larger the count, the better the training results. However, it also implies that the Elkan algorithm will be called more frequently.</p><p>The Elkan algorithm handles memory allocation and deallocation each time it’s executed. Specifically, it allocates memory to store half the size of symmetric matrix data, excluding the diagonal elements. In Knowhere, the symmetric matrix dimension used by the Elkan algorithm is set to 1024, resulting in a memory size of approximately 2 MB. This means for each training round Elkan repeatedly allocates and deallocates 2 MB memory.</p><p>Perf profiling data indicated frequent large memory allocation activity. In fact, it triggered <a href="https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4759692f-43fb-4066-86b2-76a90f0707a2.xhtml">virtual memory area (VMA)</a>allocation, physical page allocation, page map setup, and updating of memory cgroup statistics in the kernel. This pattern of large memory allocation/deallocation activity can, in some situations, also aggravate memory fragmentation. This is a significant tax.</p><p>The <em>IndexFlatElkan</em> structure is specifically designed and constructed to support the Elkan algorithm. Each data training process will have an <em>IndexFlatElkan</em> instance initialized. To mitigate the performance impact resulting from frequent memory allocation and deallocation in the Elkan algorithm, we refactored the code logic, moving the memory management outside of the Elkan algorithm function up into the construction process of <em>IndexFlatElkan</em>. This enables memory allocation to occur only once during the initialization phase while serving all subsequent Elkan algorithm function calls from the current data training process and helps to improve load latency by around 3 percent. Find the <a href="https://github.com/zilliztech/knowhere/pull/280">Knowhere patch here</a>.</p><h3><strong>Redis Vector Search Acceleration through Software Prefetch</strong></h3><p>Redis, a popular traditional in-memory key-value data store, recently began supporting vector search. To go beyond a typical key-value store, it offers extensibility modules; the <a href="https://github.com/RediSearch/RediSearch">RediSearch</a> module facilitates the storage and search of vectors directly within Redis.</p><p>For vector similarity search, Redis supports two algorithms, namely brute force and HNSW. The HNSW algorithm is specifically crafted for efficiently locating approximate nearest neighbors in high-dimensional spaces. 
It uses a priority queue named <em>candidate_set</em> to manage all vector candidates for distance computing.</p><p>Each vector candidate encompasses substantial metadata in addition to the vector data. As a result, when loading a candidate from memory it can cause data cache misses, which incur processing delays. Our optimization introduces software prefetching to proactively load the next candidate while processing the current one. This enhancement has resulted in a 2 to 3 percent throughput improvement for vector similarity searches in a single instance Redis setup. The patch is in the process of being upstreamed.</p><h3><strong>GCC Default Behavior Change to Prevent Mixed Assembly Code Penalties</strong></h3><p>To drive maximum performance, frequently used sections of code are often handwritten in assembly. However, when different segments of code are written either by different people or at different points in time, the instructions used may come from incompatible assembly instruction sets such as <a href="https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html">Intel® Advanced Vector Extensions 512 (Intel® AVX-512)</a> and <a href="https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">Streaming SIMD Extensions (SSE)</a>. If not compiled appropriately, the mixed code results in a performance penalty. <a href="https://www.intel.com/content/dam/develop/external/us/en/documents/11mc12-avoiding-2bavx-sse-2btransition-2bpenalties-2brh-2bfinal-809104.pdf">Learn more about mixing Intel AVX and SSE instructions here</a>.</p><p>You can easily determine if you’re using mixed-mode assembly code and have not compiled the code with <em>VZEROUPPER</em>, incurring the performance penalty. It can be observed through a perf command like <em>sudo perf stat -e ‘assists.sse_avx_mix/event/event=0xc1,umask=0x10/’ &lt;workload&gt;</em>. If your OS doesn’t have support for the event, use <em>cpu/event=0xc1,umask=0x10,name=assists_sse_avx_mix/</em>.</p><p>The Clang compiler by default inserts <em>VZEROUPPER</em>,<em> </em>avoiding any mixed mode penalty.<em> </em>But the GCC compiler only inserted <em>VZEROUPPER</em> when the -O2 or -O3 compiler flags were specified. We contacted the GCC team and explained the issue and they now, by default, correctly handle mixed mode assembly code.</p><h3><strong>Start Optimizing Your Vector Databases</strong></h3><p>Vector databases are playing an integral role in GenAI, and they are growing ever larger to generate higher-quality responses. With respect to optimization, AI applications are no different from other software applications in that they reveal their secrets when one employs standard performance analysis tools along with benchmark frameworks and stress input.</p><p>Using these tools, we uncovered performance traps pertaining to unnecessary memory allocation, failing to prefetch instructions, and using incorrect compiler options. Based on our findings, we upstreamed enhancements to Milvus, Knowhere, Redis, and the GCC compiler to help make AI a little more performant and sustainable. Vector databases are an important class of applications worthy of your optimization efforts. We hope this article helps you get started.</p><p>When deploying a RAG system, we will face multiple challenges such as ensuring scalability, handling data security, and integrating with existing infrastructure. 
<a href="https://opea.dev">Open Platform for Enterprise AI (OPEA)</a> is an open source project that helps with the deployment by providing robust tools and frameworks designed to streamline these processes and facilitate seamless integration.</p><h3>About the Authors</h3><p><a href="https://www.linkedin.com/in/cathy-zhang-025a25276/">Cathy Zhang</a> is a senior software engineer with deep software, microarchitecture, and virtualization expertise. As a member of the Cloud Software Engineering team in SATG/SSE, she has recently been focused on optimizing vector databases like Milvus and has experience building retrieval-augmented generation (RAG) systems. At Intel, Cathy has contributed to x86 virtualization, working on the lightweight, more secure Intel® hypervisor for cloud usages and on Intel® Software Guard Extensions (Intel® SGX). Prior to Intel, Cathy worked for an extended time on IBM System Z Virtualization. Outside of work, Cathy enjoys reading, mountain climbing, and traveling.</p><p><a href="https://www.linkedin.com/in/malinibhandaru/">Dr. Malini Bhandaru</a> is a senior principal engineer focused on confidential compute, security, and performance. She has been engaged in open source for more than a decade, working on Confidential Containers, EdgeX Foundry, KubeFlow, and OpenStack projects. She started at Intel as an Intel® Xeon® Server Power Performance architect. Across her career, she has worked on autonomous driving, healthcare, and telecom applications. Malini is a frequent conference speaker and has served on KubeCon and OSS program committees. She has a PhD in machine learning and more than 25 patents. She is also a STEM coach and avid gardener</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=90c10416cb9c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/optimize-vector-databases-enhance-rag-driven-generative-ai-90c10416cb9c">Optimize Vector Databases, Enhance RAG-Driven Generative AI</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Four Data Cleaning Techniques to Improve Large Language Model (LLM) Performance]]></title>
            <link>https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625?source=rss----bcaa5b033cbb---4</link>
            <guid isPermaLink="false">https://medium.com/p/77bee9003625</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Intel]]></dc:creator>
            <pubDate>Mon, 01 Apr 2024 15:06:02 GMT</pubDate>
            <atom:updated>2024-05-06T20:32:15.162Z</atom:updated>
            <content:encoded><![CDATA[<p>Unlock more accurate and meaningful AI outcomes with RAG (retrieval-augmented generation).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*L2E7RSnKP4Hbo97M" /><figcaption>Photo by <a href="https://unsplash.com/@norevisions?utm_source=medium&amp;utm_medium=referral">No Revisions</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h4>By Eduardo Rojas Oviedo and Ezequiel Lanza</h4><p>The <a href="https://www.techtarget.com/searchenterpriseai/definition/retrieval-augmented-generation">retrieval-augmented generation (RAG)</a> process has gained popularity due to its potential to enhance the understanding of <a href="https://www.techtarget.com/whatis/definition/large-language-model-LLM">large language models (LLMs)</a>, providing them with context and helping to prevent <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucinations</a>. The RAG process involves several steps, from ingesting documents in chunks to extracting context to prompting the LLM model with that context. While known to significantly improve predictions, RAG can occasionally lead to incorrect results. The way documents are ingested plays a crucial role in this process. For instance, if our “context documents” contain typos or unusual characters for an LLM, such as emojis, it could potentially confuse the LLM’s understanding of the provided context.</p><p>In this post, we’ll demonstrate the use of four common <a href="https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP">natural language processing (NLP)</a> techniques to clean text before it’s ingested and converted into chunks for further processing by the LLM. We’ll also illustrate how these techniques can significantly enhance the model’s response to a prompt.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/204/1*n-IAAx63GZplC_LnvbeYMA.png" /><figcaption>The steps of the RAG process, adapted from <a href="https://github.com/Tongji-KGLLM/RAG-Survey">RAG-Survey</a>.</figcaption></figure><h3><strong>Why Is it Important to Clean Your Documents?</strong></h3><p>It’s standard practice to clean up text before feeding it into any kind of machine learning algorithm. Whether you’re using <a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised</a> or <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised algorithms</a>, or even crafting context for your generative AI (GAI) model, getting your text in good shape helps to:</p><p>· <strong>Ensure accuracy:</strong> By getting rid of mistakes and making everything consistent, you’re less likely to confuse the model or end up with model hallucinations.</p><p>· <strong>Improve quality:</strong> Cleaner data ensures that the model works with reliable and consistent information, helping our models to infer from accurate data.</p><p>· <strong>Facilitate analysis</strong>: Clean data is easy to interpret and analyze. 
For example, a model trained with plain text may struggle to comprehend <a href="https://www.statology.org/tabular-data/">tabular data</a>.</p><p>By cleaning our data — especially <a href="https://www.techtarget.com/searchbusinessanalytics/definition/unstructured-data">unstructured data</a> — we provide the model with reliable and relevant context, which improves generation, reduces the probability of hallucinations, and improves GAI speed and performance, as large volumes of information lead to longer wait times.</p><h3><strong>How Do We Achieve Data Cleaning?</strong></h3><p>To help you build your data cleaning toolbox, we’ll explore four NLP techniques and how they help the model.</p><h4>Step 1: Data Cleaning and Noise Reduction</h4><p>We’ll start by removing symbols or characters that don’t provide meaning, such as HTML tags (in the case of scraping), XML parses, JSON, emojis, and hashtags. Unnecessary characters often confuse the model and increase the number of <a href="https://winder.ai/calculating-token-counts-llm-context-windows-practical-guide/#:~:text=In%20terms%20of%20the%20economy,the%20greater%20the%20computational%20cost.">context tokens</a> and therefore the computational cost.</p><p>Recognizing that there’s no one-size-fits-all solution, we’ll adapt our methods to different problems and text types using common cleaning techniques:</p><p>· <strong>Tokenization:</strong> Split the text into individual words or tokens.</p><p>· <strong>Remove noise:</strong> Eliminate unwanted symbols, emojis, hashtags, and Unicode characters.</p><p>· <strong>Normalization:</strong> Convert the text to lowercase for consistency.</p><p>· <strong>Remove stop words:</strong> Discard common or repeated words that do not add meaning, such as “a,” “in,” “of,” and “the.”</p><p>· <strong>Lemmatization or stemming:</strong> Reduce words to their base or root form.</p><p>Let’s take this tweet for example:</p><p>“<em>I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let’s clean some text 🧹</em>”</p><p>While the meaning is clear to us, let’s simplify it for the model by applying common techniques in Python. The following code snippet and all others in this post were generated with the help of ChatGPT.</p><pre>import re<br>import nltk<br>from nltk.tokenize import word_tokenize<br>from nltk.corpus import stopwords<br>from nltk.stem import WordNetLemmatizer<br><br># One-time downloads of the required NLTK resources, if not already present:<br># nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')<br><br># Sample text with emojis, hashtags, and other characters<br>text = "I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let’s clean some text 🧹"<br><br># Tokenization<br>tokens = word_tokenize(text)<br><br># Remove Noise<br>cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]<br><br># Normalization (convert to lowercase)<br>cleaned_tokens = [token.lower() for token in cleaned_tokens]<br><br># Remove Stopwords<br>stop_words = set(stopwords.words('english'))<br>cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]<br><br># Lemmatization<br>lemmatizer = WordNetLemmatizer()<br>cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]<br><br>print(cleaned_tokens)<br><br># output:<br># ['love', 'coding', 'pythonprogramming', 'fun', 'clean', 'text']</pre><p>The process has removed irrelevant characters and left us with clean and meaningful text our model can understand: [‘love’, ‘coding’, ‘pythonprogramming’, ‘fun’, ‘clean’, ‘text’].</p><h4><strong>Step 2: Text Standardization and Normalization</strong></h4><p>Next, we should always prioritize consistency and coherence across the text. 
This is crucial for ensuring accurate retrieval and generation. In the following Python example, let’s scan our text input for spelling errors and other inconsistencies that could lead to inaccuracies and decreased performance.</p><pre>import re<br><br># Sample text with spelling errors (the misspellings here are illustrative)<br>text_with_errors = """But it's not evrything about more language refinment. <br>Other important aspect is ensuring accurte retrievel by correctng product name spellings. <br>Additionally, refning descriptions enhancs the oherence of the contnt."""<br><br># Function to correct spelling errors<br>def correct_spelling_errors(text):<br>    # Define dictionary of common spelling mistakes and their corrections<br>    spelling_corrections = {<br>        "evrything": "everything",<br>        "refinment": "refinement",<br>        "accurte": "accurate",<br>        "retrievel": "retrieval",<br>        "correctng": "correcting",<br>        "refning": "refining",<br>        "enhancs": "enhances",<br>        "oherence": "coherence",<br>        "contnt": "content",<br>    }<br><br>    # Iterate over each key-value pair in the dictionary and replace the<br>    # misspelled words with their correct versions<br>    for mistake, correction in spelling_corrections.items():<br>        text = re.sub(mistake, correction, text)<br><br>    return text<br><br># Correct spelling errors in the sample text<br>cleaned_text = correct_spelling_errors(text_with_errors)<br><br>print(cleaned_text)<br># output<br># But it's not everything about more language refinement.<br># Other important aspect is ensuring accurate retrieval by correcting product name spellings.<br># Additionally, refining descriptions enhances the coherence of the content.</pre><p>With a cohesive, coherent text representation, our model can now generate accurate and contextually relevant responses. This process also enables <a href="https://en.wikipedia.org/wiki/Semantic_search">semantic search</a> to extract the most optimal context chunks, particularly in the context of RAG.</p>
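<p>A hand-built dictionary like the one above is fine for a demo, but it doesn’t scale to real document collections. One way to generalize the idea, offered here as a minimal sketch rather than part of the original example, is to use an off-the-shelf spell checker such as the open source pyspellchecker package (the helper function name below is our own):</p><pre># A hedged sketch: assumes `pip install pyspellchecker`<br>from spellchecker import SpellChecker<br><br>spell = SpellChecker()<br><br>def correct_spelling(text):<br>    corrected = []<br>    for word in text.split():<br>        # correction() returns the most likely fix; it can return None for unknown words<br>        fix = spell.correction(word)<br>        corrected.append(fix if fix else word)<br>    return " ".join(corrected)<br><br>print(correct_spelling("ensuring accurte retrievel improves the contnt"))<br># expected output (case and punctuation handling may vary):<br># ensuring accurate retrieval improves the content</pre><p>In practice you would still strip punctuation first (as in Step 1) so that tokens like “contnt.” are looked up correctly.</p>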
<h4><strong>Step 3: Metadata Handling</strong></h4><p>Metadata collection, such as identifying important keywords and <a href="https://www.techtarget.com/searchdatamanagement/definition/entity-relationship-diagram-ERD">entities</a>, makes it easy for us to recognize elements in the text that we can use to improve semantic search results, especially in enterprise applications such as content recommendation systems. This process provides the model with additional context, often required to improve RAG performance. Let’s apply this step to another Python example.</p><pre>import spacy<br>import json<br><br># Load English language model<br>nlp = spacy.load("en_core_web_sm")<br><br># Sample text with metadata candidates<br>text = """In a blog post titled 'The Top 10 Tech Trends of 2024,' <br>John Doe discusses the rise of artificial intelligence and machine learning <br>in various industries. The article mentions companies like Google and Microsoft <br>as pioneers in AI research. Additionally, it highlights emerging technologies <br>such as natural language processing and computer vision."""<br><br># Process the text with spaCy<br>doc = nlp(text)<br><br># Extract named entities and their labels<br>meta_data = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]<br><br># Convert metadata to JSON format<br>meta_data_json = json.dumps(meta_data)<br><br>print(meta_data_json)<br><br># output<br>"""<br>[<br>    {"text": "2024", "label": "DATE"},<br>    {"text": "John Doe", "label": "PERSON"},<br>    {"text": "Google", "label": "ORG"},<br>    {"text": "Microsoft", "label": "ORG"},<br>    {"text": "AI", "label": "ORG"},<br>    {"text": "natural language processing", "label": "ORG"},<br>    {"text": "computer vision", "label": "ORG"}<br>]<br>"""</pre><p>The code highlights how <a href="https://spacy.io/">spaCy’s</a> <a href="https://spacy.io/universe/project/video-spacys-ner-model-alt">entity recognition capability</a> identifies dates, persons, organizations, and other important entities in the text. This helps RAG applications better understand context and relationships between words.</p>
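<p>To make the payoff concrete, here is a small, hypothetical sketch (the chunk texts and the filtering rule are our own illustration, not from the original post) of how the extracted entities could be stored alongside each chunk and used to pre-filter candidates before semantic search:</p><pre>import spacy<br><br>nlp = spacy.load("en_core_web_sm")<br><br># Hypothetical document chunks to index<br>chunks = [<br>    "Google and Microsoft are investing heavily in AI research.",<br>    "Cleaning text before ingestion reduces hallucinations in RAG pipelines.",<br>]<br><br># Attach the named entities found in each chunk as metadata<br>indexed = [{"text": c, "entities": {ent.text for ent in nlp(c).ents}} for c in chunks]<br><br># Keep only chunks whose entities overlap with the query's entities,<br># then pass those candidates on to the embedding / semantic-search step<br>query = "What are Google's latest AI efforts?"<br>query_entities = {ent.text for ent in nlp(query).ents}<br>candidates = [c for c in indexed if c["entities"].intersection(query_entities)] or indexed<br><br>print([c["text"] for c in candidates])</pre><p>This is one possible design choice; the point is simply that entity metadata gives you a cheap filter before the more expensive embedding step.</p>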
<h4><strong>Step 4: Contextual Information Handling</strong></h4><p>When working with LLMs, you may be dealing with diverse languages or managing extensive documents brimming with various topics, which can be hard for your model to comprehend. Let’s look at two techniques that can help your model better understand the data.</p><p>Let’s start with language translation. Using the <a href="https://codelabs.developers.google.com/codelabs/cloud-translation-python3">Google Translation API</a> (here via the googletrans Python library), the code translates the original text, “Hello, how are you?” from English to Spanish.</p><pre>from googletrans import Translator<br><br># Original text<br>text = "Hello, how are you?"<br><br># Translate text<br>translator = Translator()<br>translated_text = translator.translate(text, src='en', dest='es').text<br><br>print("Original Text:", text)<br>print("Translated Text:", translated_text)</pre><p><a href="https://en.wikipedia.org/wiki/Topic_model">Topic modeling</a>, which includes techniques like <a href="https://en.wikipedia.org/wiki/Cluster_analysis">clustering data</a>, is like organizing a messy room into neat categories, helping your model identify the topic of a document and sort through lots of information quickly. <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet allocation (LDA)</a>, the most popular technique for automating the topic modeling process, is a statistical model that helps find hidden themes in text by looking closely at word patterns.</p><p>In the following example, we’ll use <a href="https://scikit-learn.org/stable/">sklearn</a> to process a set of documents and identify key topics.</p><pre>from sklearn.feature_extraction.text import CountVectorizer<br>from sklearn.decomposition import LatentDirichletAllocation<br><br># Sample documents<br>documents = [<br>    &quot;Machine learning is a subset of artificial intelligence.&quot;,<br>    &quot;Natural language processing involves analyzing and understanding human languages.&quot;,<br>    &quot;Deep learning algorithms mimic the structure and function of the human brain.&quot;,<br>    &quot;Sentiment analysis aims to determine the emotional tone of a text.&quot;<br>]<br><br># Convert text into numerical feature vectors<br>vectorizer = CountVectorizer(stop_words=&#39;english&#39;)<br>X = vectorizer.fit_transform(documents)<br><br># Apply Latent Dirichlet Allocation (LDA) for topic modeling<br>lda = LatentDirichletAllocation(n_components=2, random_state=42)<br>lda.fit(X)<br><br># Display topics (get_feature_names_out replaces the older get_feature_names in recent scikit-learn)<br>feature_names = vectorizer.get_feature_names_out()<br>for topic_idx, topic in enumerate(lda.components_):<br>    print(&quot;Topic %d:&quot; % (topic_idx + 1))<br>    print(&quot; &quot;.join([feature_names[i] for i in topic.argsort()[:-5 - 1:-1]]))<br><br># output<br>#<br>#Topic 1:<br>#learning machine subset artificial intelligence<br>#Topic 2:<br>#processing natural language involves analyzing understanding</pre><p>If you’d like to explore more topic modeling techniques, we recommend starting with these (a short NMF example follows the list):</p><p>· <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization"><strong>Non-negative matrix factorization (NMF)</strong></a> is great for things like images where negative values don’t make sense. It’s handy when you need clear, understandable factors. For instance, in image processing, NMF helps extract features without the confusion of negative values.</p><p>· <a href="https://en.wikipedia.org/wiki/Latent_semantic_analysis"><strong>Latent semantic analysis (LSA)</strong></a> shines when you have a large volume of text spread across multiple documents and want to find connections between words and documents. LSA uses <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition (SVD)</a> to identify semantic relationships between terms and documents, helping to streamline tasks like sorting documents by similarity and detecting plagiarism.</p><p>· <a href="https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process"><strong>Hierarchical Dirichlet process (HDP)</strong></a> helps you quickly sort through mountains of data and identify topics in a document when you’re unsure how many there are. As an extension of LDA, HDP allows for an unbounded number of topics and greater flexibility in modeling. It identifies hierarchical structures in text data for tasks like understanding the organization of topics in academic papers or news articles.</p><p>· <a href="https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis"><strong>Probabilistic latent semantic analysis (PLSA)</strong></a> helps you figure out how likely it is for a document to be about certain topics, which can be useful when building a recommendation system that personalizes suggestions based on past interactions.</p>
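<p>Of these, NMF is the easiest to try with the tooling already shown above. The snippet below is a minimal sketch, not part of the original article: it reuses the sample documents from the LDA example and swaps in scikit-learn’s TfidfVectorizer and NMF classes.</p><pre>from sklearn.feature_extraction.text import TfidfVectorizer<br>from sklearn.decomposition import NMF<br><br># Same sample documents as in the LDA example above<br>documents = [<br>    "Machine learning is a subset of artificial intelligence.",<br>    "Natural language processing involves analyzing and understanding human languages.",<br>    "Deep learning algorithms mimic the structure and function of the human brain.",<br>    "Sentiment analysis aims to determine the emotional tone of a text.",<br>]<br><br># NMF works well on TF-IDF features, which are non-negative by construction<br>vectorizer = TfidfVectorizer(stop_words="english")<br>X = vectorizer.fit_transform(documents)<br><br># Factorize the document-term matrix into 2 topics<br>nmf = NMF(n_components=2, random_state=42)<br>nmf.fit(X)<br><br># Show the top words per topic<br>feature_names = vectorizer.get_feature_names_out()<br>for topic_idx, topic in enumerate(nmf.components_):<br>    top_words = [feature_names[i] for i in topic.argsort()[:-6:-1]]<br>    print("Topic %d:" % (topic_idx + 1), " ".join(top_words))</pre>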
<h3>DEMO: <strong>Cleaning a GAI Text Input</strong></h3><p>Let’s put it all together with an example. In this demo, we’ve used ChatGPT to generate a conversation between two technologists. We’ll apply basic cleaning techniques to the conversation to show how these practices enable reliable and consistent results.</p><pre>synthetic_text = &quot;&quot;&quot;<br>Sarah (S): Technology Enthusiast<br>Mark (M): AI Expert<br>S: Hey Mark! How&#39;s it going? Heard about the latest advancements in Generative AI (GA)?<br>M: Hey Sarah! Yes, I&#39;ve been diving deep into the realm of GA lately. It&#39;s fascinating how it&#39;s shaping the future of technology!<br>S: Absolutely! I mean, GA has been making waves across various industries. What do you think is driving its significance?<br>M: Well, GA, especially Retrieval Augmented Generative (RAG), is revolutionizing content generation. It&#39;s not just about regurgitating information anymore; it&#39;s about creating contextually relevant and engaging content.<br>S: Right! And with Machine Learning (ML) becoming more sophisticated, the possibilities seem endless.<br>M: Exactly! With advancements in ML algorithms like GPT (Generative Pre-trained Transformer), we&#39;re seeing unprecedented levels of creativity in AI-generated content.<br>S: But what about concerns regarding bias and ethics in GA?<br>M: Ah, the age-old question! While it&#39;s true that GA can inadvertently perpetuate biases present in the training data, there are techniques like Adversarial Training (AT) that aim to mitigate such issues.<br>S: Interesting! So, where do you see GA headed in the next few years?<br>M: Well, I believe we&#39;ll witness a surge in applications leveraging GA for personalized experiences. From virtual assistants to content creation tools, GA will become ubiquitous in our daily lives.<br>S: That&#39;s exciting! Imagine AI-powered virtual companions tailored to our preferences.<br>M: Indeed! And with advancements in Natural Language Processing (NLP) and computer vision, these virtual companions will be more intuitive and lifelike than ever before.<br>S: I can&#39;t wait to see what the future holds!<br>M: Agreed! It&#39;s an exciting time to be in the field of AI.<br>S: Absolutely! Thanks for sharing your insights, Mark.<br>M: Anytime, Sarah. Let&#39;s keep pushing the boundaries of Generative AI together!<br>S: Definitely! Catch you later, Mark!<br>M: Take care, Sarah!<br>&quot;&quot;&quot;</pre>
<h4><strong>Step 1: Basic Cleanup</strong></h4><p>First, let’s remove the emojis, hashtags, and Unicode characters from the conversation.</p><pre># Clean the synthetic conversation, reusing the imports (re, nltk) from the Step 1 example above<br><br># Tokenization<br>tokens = word_tokenize(synthetic_text)<br><br># Remove Noise<br>cleaned_tokens = [re.sub(r&#39;[^\w\s]&#39;, &#39;&#39;, token) for token in tokens]<br><br># Normalization (convert to lowercase)<br>cleaned_tokens = [token.lower() for token in cleaned_tokens]<br><br># Remove Stopwords<br>stop_words = set(stopwords.words(&#39;english&#39;))<br>cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]<br><br># Lemmatization<br>lemmatizer = WordNetLemmatizer()<br>cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]<br><br>print(cleaned_tokens)<br><br># Join the cleaned tokens back into a single context string; this is the<br># new_content_string used in Step 4 below (the original post leaves this step implicit)<br>new_content_string = &quot; &quot;.join(cleaned_tokens)</pre><h4><strong>Step 2: Prepare Our Prompt</strong></h4><p>Next, we’ll craft a prompt, asking the model to respond as a friendly customer service agent based on information it gleaned from our synthetic conversation.</p><pre>MESSAGE_SYSTEM_CONTENT = &quot;&quot;&quot;You are a customer service agent that helps<br>a customer with answering questions. Please answer the question based on the<br>provided context below.<br>If possible, do not make any changes to the context when preparing answers,<br>so as to provide accurate responses. If the answer cannot be found in the<br>context, just politely say that you do not know; do not try to make up an answer.&quot;&quot;&quot;</pre><h4><strong>Step 3: Prepare the Interaction</strong></h4><p>Let’s prepare our interaction with the model. In this example, we’ll use GPT-4.</p><pre>from openai import OpenAI<br><br># Assumes OPENAI_API_KEY is set in the environment<br>client = OpenAI()<br><br>def response_test(question: str, context: str, model: str = &quot;gpt-4&quot;):<br>    response = client.chat.completions.create(<br>        model=model,<br>        messages=[<br>            {<br>                &quot;role&quot;: &quot;system&quot;,<br>                &quot;content&quot;: MESSAGE_SYSTEM_CONTENT,<br>            },<br>            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: question},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: context},<br>        ],<br>    )<br><br>    return response.choices[0].message.content</pre><h4><strong>Step 4: Prepare the Question</strong></h4><p>Finally, let’s ask the model a question and compare the results before and after cleaning.</p><pre>question1 = (&quot;What are some specific techniques in Adversarial Training (AT) &quot;<br>             &quot;that can help mitigate biases in Generative AI models?&quot;)</pre><p>Before cleaning, our model generates this response:</p><pre>response = response_test(question1, synthetic_text)<br>print(response)<br><br>#Output<br># I&#39;m sorry, but the context provided doesn&#39;t contain specific techniques in Adversarial Training (AT) that can help mitigate biases in Generative AI models.</pre><p>After cleaning, the model generates the following response. With enhanced understanding enabled by basic cleaning techniques, the model can provide a more thorough answer.</p><pre>response = response_test(question1, new_content_string)<br>print(response)<br>#Output:<br># The context mentions Adversarial Training (AT) as a technique that can <br># help mitigate biases in Generative AI models. 
However, it does not provide <br>#any specific techniques within Adversarial Training itself.</pre><h4><strong>A Brighter Future for AI-Generated Outcomes</strong></h4><p>RAG models offer several advantages, including enhanced reliability and coherence of AI-generated results by providing relevant context. This contextualization significantly improves the accuracy of AI-generated content.</p><p>To get the most out of your RAG models, robust data cleaning techniques are essential during document ingestion. These techniques address discrepancies, imprecise terminology, and other potential errors within textual data, significantly improving the quality of input data. When operating on cleaner, more reliable data, RAG models deliver more accurate and meaningful results, enabling AI use cases with better decision-making and problem-solving capabilities across domains.</p><blockquote>Have you explored additional methods to improve RAG model performance? Let us know as we continue to refine and improve its capabilities.</blockquote><h3>About the Authors</h3><p><strong>Eduardo Rojas Oviedo, Platform Engineer, Intel</strong></p><p><a href="https://www.linkedin.com/in/eduardo-rojas-oviedo-56306923?originalSubdomain=cr">Eduardo Rojas Oviedo</a> is a dedicated RAG developer within Intel’s dynamic and innovative team. With a specialization in cutting-edge developer tools for AI, Machine Learning, and NLP, he is passionate about leveraging technology to create impactful solutions. Eduardo’s expertise lies in building robust and innovative applications that push the boundaries of what’s possible in the realm of artificial intelligence. His commitment to sharing knowledge and advancing technology drives his ongoing pursuit of excellence in the field.</p><p><strong>Ezequiel Lanza, Open Source AI Evangelist, Intel</strong></p><p><a href="https://www.linkedin.com/in/ezelanza/">Ezequiel Lanza</a> is an open source AI evangelist on Intel’s <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/open/team.html">Open Ecosystem</a> team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at <a href="https://twitter.com/eze_lanza">@eze_lanza</a> and LinkedIn at <a href="https://www.linkedin.com/in/ezelanza/">/eze_lanza</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=77bee9003625" width="1" height="1" alt=""><hr><p><a href="https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625">Four Data Cleaning Techniques to Improve Large Language Model (LLM) Performance</a> was originally published in <a href="https://medium.com/intel-tech">Intel Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>