Cracking the Code of Food Intolerance Reports with AI

From Idea to Innovation: The Technical Journey of Health Cleo

Day 1: The Spark on Jeju Island

It all started on a bustling street in Jeju, South Korea. I stood there, overwhelmed by the vibrant array of street food, but paralyzed by uncertainty. As someone with multiple food intolerances, I found myself in a familiar predicament: how could I safely navigate this culinary paradise without risking my health?


I pulled out my food intolerance report, a crumpled piece of paper filled with numbers and medical jargon. As I squinted at the tiny print, trying to remember which foods were safe and which weren't, a thought struck me: why isn't there an app for this?


Day 2: Research and Problem Definition

Back home, I dove into research. I discovered I wasn't alone in my struggles:

  1. Inconsistent Medical Reports: Each lab uses different formats, reference ranges, and measurement units.
  2. Information Overload: The average food intolerance report contains data on 100+ food items.
  3. Language Barriers: For travellers, deciphering foreign food labels adds another layer of complexity.
  4. Cost: Personal nutritionists are expensive, typically charging USD 100–200 per session.


The problem was clear: we needed a solution to interpret varied medical reports, remember complex dietary restrictions, and provide real-time food safety advice.

Day 3: Conceptualizing the Solution


I envisioned an AI-powered app that could:

  1. Parse any food intolerance report
  2. Create a personalized dietary profile
  3. Analyze food products in real-time
  4. Provide safe eating recommendations

But how to build it? I needed a powerful AI backbone that could handle complex language processing, image recognition, and data analysis. After evaluating various options, I settled on Upstage AI's suite of APIs paired with TiDB as a vector database. The combination of document parsing, OCR, embeddings, and language models backed by a vector store seemed perfect for our needs.

Day 4: Laying the Groundwork

With the concept clear and the tools chosen, it was time to start coding. I began by setting up the project structure and defining our data models using Pydantic:

Pydantic Models

from pydantic import BaseModel
from typing import List, Dict

class ReferenceRange(BaseModel):
    elevated: str
    borderline: str
    normal: str

class FoodItem(BaseModel):
    name: str
    value: float
    category: str

class FoodIntoleranceReport(BaseModel):
    reference_range: ReferenceRange
    food_items: List[FoodItem]

class ProductAnalysis(BaseModel):
    product_name: str
    ingredients: List[str]
    suitability: Dict[str, str]
    overall_rating: str
    explanation: str

Using Pydantic not only ensured type safety but also served as self-documenting code, clearly defining the structure of our data.
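
As a quick sanity check, here's how a parsed report might instantiate these models (the values are illustrative, not from a real report):

report = FoodIntoleranceReport(
    reference_range=ReferenceRange(
        elevated="> 30 U/mL",
        borderline="12.5-29.9 U/mL",
        normal="< 12.5 U/mL",
    ),
    food_items=[
        FoodItem(name="Peanut", value=45.2, category="Elevated"),
        FoodItem(name="Wheat", value=8.1, category="Normal"),
    ],
)
print(report.model_dump_json(indent=2))  # on Pydantic v1, use report.json(indent=2)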

Day 5: Tackling Report Inconsistency

The first major challenge was parsing inconsistent medical reports. I leveraged Upstage's document parser and OCR capabilities, combined with custom HTML parsing and prompt engineering. Before diving into the parsing code, a quick detour into the storage layer that backs it all: TiDB.

TiDB Integration: Powering Scalable Vector Storage

Health Cleo leverages TiDB, a distributed SQL database, to provide scalable and efficient vector storage for our food intolerance data. This integration allows us to handle large volumes of data while maintaining quick retrieval times, crucial for real-time product analysis.

Setting Up TiDB Connection

We initialize the TiDB connection in our FoodIntoleranceAnalysisService class:

from langchain_community.vectorstores import TiDBVectorStore
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, JSON

class FoodIntoleranceAnalysisService:
    def __init__(self, upstage_api_key: str, tavily_api_key: str, tidb_connection_string: str):
        # ... other initializations ...
        self.tidb_connection_string = tidb_connection_string
        self.vector_store = self._initialize_vector_store()

    def _initialize_vector_store(self):
        return TiDBVectorStore(
            connection_string=self.tidb_connection_string,
            embedding_function=self.embeddings,
            table_name="upstage-tidb-store",
            distance_strategy="cosine"
        )

This setup allows us to use TiDB as our vector store, leveraging its distributed nature for efficient storage and retrieval of embedding vectors.
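
The snippet above references self.embeddings, which is initialized elsewhere in __init__. A minimal sketch of that initialization, assuming Upstage's embedding model via the langchain_upstage package (the exact model name may differ):

from langchain_upstage import UpstageEmbeddings

# Assumed initialization; verify the model name against Upstage's docs
self.embeddings = UpstageEmbeddings(
    upstage_api_key=upstage_api_key,
    model="solar-embedding-1-large",
)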

Storing Extracted Data

When processing PDF reports, we now store the extracted tables in TiDB:

def process_pdf(self, pdf_path: str) -> FoodIntoleranceReport:
    # ... existing code for loading and processing PDF ...

    chunked_tables = self.text_splitter.split_text("\n".join(all_tables))

    # Store the extracted tables in TiDB
    # (Document comes from langchain_core.documents)
    documents_to_insert = []
    for chunk in chunked_tables:
        documents_to_insert.append(
            Document(
                page_content=chunk,
                metadata={
                    "source": pdf_path,
                    "chunk_type": "table"
                }
            )
        )

    self.vector_store.add_texts(
        [doc.page_content for doc in documents_to_insert],
        [doc.metadata for doc in documents_to_insert],
    )

    # ... rest of the processing ...

This approach allows us to persistently store the extracted information, making it available for future queries without needing to reprocess the PDF.
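
Because each chunk's metadata records its source, a later session could check whether a report was already ingested before re-parsing it. A hypothetical sketch, assuming TiDBVectorStore supports metadata filters on similarity_search:

# Hypothetical de-duplication check before reprocessing a PDF
existing = self.vector_store.similarity_search(
    "reference range", k=1, filter={"source": pdf_path}
)
if existing:
    # Chunks from this PDF are already stored in TiDB; skip re-parsing
    pass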

Querying the Vector Store

We've updated our querying method to use the TiDB vector store:

def _query_vectorstore(self, query: str) -> str:
    relevant_docs = self.vector_store.similarity_search_with_score(query, k=3)

    context = "\n".join([doc.page_content for doc, _ in relevant_docs])

    prompt = f"""
    Based on the following context, answer the question:

    Context:
    {context}

    Question: {query}

    Answer:
    """

    # HumanMessage comes from langchain_core.messages
    response = self.llm([HumanMessage(content=prompt)])
    return response.content

This method performs a similarity search in the TiDB vector store, retrieves the most relevant documents, and uses them as context for the language model to answer queries.
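
In practice this lets the rest of the service ask natural-language questions against the stored report. A hypothetical call, where service is a FoodIntoleranceAnalysisService instance with a report already ingested:

answer = service._query_vectorstore("Which foods fall in the Elevated range?")
print(answer)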

Benefits of TiDB Integration

  1. Scalability: TiDB's distributed nature allows Health Cleo to handle growing volumes of food intolerance data efficiently.
  2. Performance: The vector indexing capabilities of TiDB enable fast similarity searches, crucial for real-time product analysis.
  3. Persistence: Storing processed data in TiDB allows for quick retrieval in subsequent sessions, improving the overall user experience.
  4. Consistency: As a transactional database, TiDB ensures that our data remains consistent even as we update and query it concurrently.

By integrating TiDB, Health Cleo gains a robust, scalable backend that can grow with the application's needs, ensuring fast and reliable access to food intolerance data as we expand our user base and feature set.

PDF Lab Report parsing

import re
from typing import List

from bs4 import BeautifulSoup

# ReferenceRange is the Pydantic model defined on Day 4

def extract_tables_from_html(html_content: str) -> List[str]:
    soup = BeautifulSoup(html_content, 'html.parser')
    return [str(table) for table in soup.find_all('table')]

def parse_reference_range(text: str) -> ReferenceRange:
    elevated = re.search(r'Elevated[^\d]*(\d+(?:\.\d+)?)', text)
    borderline = re.search(r'Borderline[^\d]*(\d+(?:\.\d+)?)[^\d]*(\d+(?:\.\d+)?)', text)
    normal = re.search(r'Normal[^\d]*(\d+(?:\.\d+)?)', text)

    return ReferenceRange(
        elevated=f"> {elevated.group(1)} U/mL" if elevated else "",
        borderline=f"{borderline.group(1)}-{borderline.group(2)} U/mL" if borderline else "",
        normal=f"< {normal.group(1)} U/mL" if normal else ""
    )

def categorize_food_item(value: float, reference_range: ReferenceRange) -> str:
    # Use a decimal-aware pattern: r'\d+' alone would split 12.5 into 12 and 5
    number = r'\d+(?:\.\d+)?'
    elevated_threshold = float(re.search(number, reference_range.elevated).group())
    borderline_range = [float(x) for x in re.findall(number, reference_range.borderline)]

    if value >= elevated_threshold:
        return "Elevated"
    elif len(borderline_range) == 2 and borderline_range[0] <= value <= borderline_range[1]:
        return "Borderline"
    else:
        return "Normal"

Let's break down the key components:

  1. HTML Parsing: The extract_tables_from_html function uses BeautifulSoup to parse the HTML content and extract all table elements. This is crucial for handling the varied formats of medical reports.

  2. Reference Range Parsing: The parse_reference_range function uses regular expressions to extract the elevated, borderline, and normal ranges from the text. This flexibility allows us to handle different reporting formats.

  3. Food Item Categorization: The categorize_food_item function determines whether a food item's intolerance level is elevated, borderline, or normal based on the parsed reference range.

  4. PDF Processing: The process_pdf function ties everything together. It uses Upstage's LayoutAnalysisLoader to convert the PDF to HTML, extracts tables, chunks the content, and stores it in a TiDB vector database for efficient querying.
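
To make point 4 concrete, here's a minimal sketch of the PDF-to-HTML step, assuming Upstage's loader from the langchain_upstage package (the exact class name and options may differ from our production code):

from langchain_upstage import UpstageLayoutAnalysisLoader

def process_pdf(self, pdf_path: str) -> FoodIntoleranceReport:
    # Convert the PDF to HTML so that tables survive as <table> elements
    loader = UpstageLayoutAnalysisLoader(
        pdf_path, output_type="html", api_key=self.upstage_api_key
    )
    docs = loader.load()

    # Collect the tables and hand them to the chunking and TiDB storage steps shown earlier
    all_tables = []
    for doc in docs:
        all_tables.extend(extract_tables_from_html(doc.page_content))
    # ... chunking, vector storage, reference-range and food-item extraction ...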

Output:

PDF parsing output

Day 6: Implementing Prompt Engineering and RetrievalQA

To extract meaningful information from the parsed reports, I implemented a sophisticated prompt engineering system using LangChain's PromptTemplate and RetrievalQA:

Analyze Product

def _extract_reference_range(self, tables: List[str]) -> ReferenceRange:
    reference_range_query = PromptTemplate(
        input_variables=["tables"],
        template="""
        Analyze the following table content and extract the reference range for Elevated, Borderline, and Normal food intolerance levels:

        {tables}

        Provide the results in the following format:
        Elevated: > X U/mL
        Borderline: Y-Z U/mL
        Normal: < W U/mL
        """
    )

    results = []
    for chunk in tables:
        result = self._query_vectorstore(reference_range_query.format(tables=chunk))
        results.append(result)

    combined_result = " ".join(results)
    return parse_reference_range(combined_result)

def _query_vectorstore(self, query: str) -> str:
    qa_chain = RetrievalQA.from_chain_type(
        llm=self.llm,
        chain_type="stuff",
        retriever=self.vector_store.as_retriever(search_kwargs={"k": 3}),
    )
    return qa_chain.run(query)

def _check_ingredient_suitability(self, ingredients: List[str]) -> Dict[str, str]:
    suitability_query = PromptTemplate(
        input_variables=["ingredients", "food_items"],
        template="""
        Based on the following food intolerance data:
        {food_items}

        Analyze the suitability of these ingredients:
        {ingredients}

        For each ingredient, classify it as:
        - Suitable: if it's not in the list or has a "Normal" value
        - Caution: if it's in the list with a "Borderline" value
        - Avoid: if it's in the list with an "Elevated" value
        - Unknown: if there's not enough information

        Provide the result in the following format:
        Ingredient: Classification
        """
    )

    # Split ingredients into chunks if necessary
    ingredient_chunks = self.text_splitter.split_text(", ".join(ingredients))
    food_items_text = "\n".join([f"{item.name}: {item.value} U/mL - {item.category}" for item in self.food_items])

    all_suitability = {}
    for chunk in ingredient_chunks:
        result = self._query_vectorstore(suitability_query.format(
            ingredients=chunk,
            food_items=food_items_text
        ))
        chunk_suitability = self._parse_suitability_result(result)
        all_suitability.update(chunk_suitability)

    return all_suitability

Here's what's happening:

  1. Prompt Engineering: The PromptTemplate in _extract_reference_range creates a structured query for the language model. This prompt engineering ensures that we get consistent, formatted responses regardless of the input variation.

  2. RetrievalQA: The _query_vectorstore function sets up a RetrievalQA chain. This powerful feature combines the vector store (which contains our parsed report data) with the language model to answer complex queries about the report content.

  3. Chunked Processing: We process the tables in chunks, allowing us to handle large reports efficiently while maintaining context.

This approach allows us to extract structured information from unstructured or semi-structured medical reports, a key innovation in dealing with the inconsistency problem.
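
One helper used above, _parse_suitability_result, isn't shown. A minimal sketch of how it might parse the "Ingredient: Classification" lines returned by the LLM:

def _parse_suitability_result(self, result: str) -> Dict[str, str]:
    # Parse lines of the form "Ingredient: Classification" into a dict,
    # keeping only the classifications our prompt asked for
    suitability = {}
    valid = {"Suitable", "Caution", "Avoid", "Unknown"}
    for line in result.splitlines():
        if ':' not in line:
            continue
        ingredient, classification = line.split(':', 1)
        classification = classification.strip()
        if classification in valid:
            suitability[ingredient.strip()] = classification
    return suitability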

Day 6: Building the Product Analysis Pipeline

Next, I tackled the challenge of real-time product analysis. This required integrating multiple technologies:

def analyze_product_from_image(self, image_path: str) -> ProductAnalysis:
    ocr_text = self.ocr_product_image(image_path)
    context = self.tavily.search(query=f"Find product name for any such product that carries similar texts and brand '{ocr_text}'")

    product_name_query = PromptTemplate(
        input_variables=["context"],
        template="Extract the most likely product name from the following context:\n\n{context}\n\nProduct Name:"
    ).format(context=context['results'][0]['content'])

    product_name = self._query_vectorstore(product_name_query)
    ingredients = self.get_product_ingredients(product_name)

    return self.analyze_product(product_name, ingredients)

def get_product_ingredients(self, product_name: str) -> List[str]:
    result = self.api.product.text_search(product_name, page=1, page_size=1)
    if result['products']:
        # Try to get the ingredients in English first
        ingredients_text = result['products'][0].get('ingredients_text_en', '')
        # If not found or empty, fall back to the default ingredients_text
        if not ingredients_text:
            ingredients_text = result['products'][0].get('ingredients_text', '')

        # Return the list of ingredients, split by comma and stripped of extra spaces
        return [ing.strip() for ing in ingredients_text.split(',') if ing.strip()]
    return []

Product analysis output

This pipeline combines OCR, web search, language-model querying, and our custom analysis logic to provide instant, accurate product assessments. Since raw OCR output carries no contextual knowledge, we use the Tavily API to identify the product behind the recognized text, then use Upstage's LLM to extract the most likely candidate product name.
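
The ocr_product_image helper isn't shown above. A hedged sketch of what it might look like against Upstage's document OCR endpoint (the URL and response shape are assumptions; check Upstage's current API docs):

import requests

def ocr_product_image(self, image_path: str) -> str:
    # Assumed endpoint and response field; verify against Upstage's docs
    url = "https://api.upstage.ai/v1/document-ai/ocr"
    headers = {"Authorization": f"Bearer {self.upstage_api_key}"}
    with open(image_path, "rb") as f:
        response = requests.post(url, headers=headers, files={"document": f})
    response.raise_for_status()
    return response.json().get("text", "")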

Day 7: Ensuring Reliability with Groundedness Checks

To prevent AI hallucinations - a critical concern when dealing with health information - I implemented Upstage's groundedness check:

def _generate_overall_rating(self, suitability: Dict[str, str]) -> Tuple[str, str]:
    # ... (previous code)

    groundedness_check = UpstageGroundednessCheck(api_key=self.upstage_api_key)
    request_input = {
        "context": suitability_text,
        "answer": explanation,
    }
    groundedness_response = groundedness_check.invoke(request_input)

    if groundedness_response == "notGrounded":
        explanation += " (Note: This explanation may not be fully grounded in the provided data.)"
    elif groundedness_response == "notSure":
        explanation += " (Note: The accuracy of this explanation could not be confirmed.)"

    return rating, explanation

This ensures that every piece of advice given is firmly rooted in factual data.
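
For reference, the check can also be exercised standalone. A small illustration (the context and answer strings here are made up):

from langchain_upstage import UpstageGroundednessCheck

checker = UpstageGroundednessCheck(api_key="YOUR_UPSTAGE_API_KEY")
result = checker.invoke({
    "context": "Peanut: 45.2 U/mL - Elevated",
    "answer": "This product should be avoided because it contains peanuts.",
})
print(result)  # one of "grounded", "notGrounded", "notSure"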

Day 7: Optimization and Testing

The final day went into optimizing performance and thorough testing. I implemented an efficient chunking strategy for handling large documents:

from langchain.text_splitter import RecursiveCharacterTextSplitter

self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=200,
    length_function=self.llm.get_num_tokens
)

This allows us to process large documents in manageable pieces while maintaining context.
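
As a quick illustration (the input text is arbitrary), each chunk stays within the token budget while overlapping its neighbor, so table rows split at a boundary still appear whole in one chunk:

# Hypothetical usage inside the service
long_text = "\n".join(all_tables)  # e.g. the extracted report tables
chunks = self.text_splitter.split_text(long_text)
for i, chunk in enumerate(chunks):
    print(i, self.llm.get_num_tokens(chunk))  # each should be <= 3000 tokens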

The Result: Health Cleo

After an intense week, Health Cleo was born. It's not just an app; it's a personal AI nutritionist that:

  1. Understands any food intolerance report
  2. Provides real-time product analysis
  3. Offers multilingual support for global travellers
  4. Ensures reliability through groundedness checks

Streamlit App Video

But more than that, it's a testament to the power of innovative engineering combined with cutting-edge AI. By leveraging Upstage AI's suite of APIs and implementing clever software design patterns, we've created a solution that turns the complex world of food intolerances into manageable, actionable information.

Running Streamlit demo


There is a run.sh script; add your API keys and launch it from your terminal.

# make a new virtual env using python 3.11 (tested on MacOS 14.6.1)
# python3.11 -m venv .venv
# source .venv/bin/activate

# pip install -r requirements.txt

cd cleo

# put your API keys at the top of app.py

streamlit run app.py --server.port 8501 --server.address 0.0.0.0

Looking Ahead

As I reflect on this journey, I'm excited about the future. There's so much more we can do:

  1. Integration with wearable devices like Rabbit R1, Humane AI pin for real-time capture, search and query
  2. Expansion of our food database to include more regional specialities
  3. Development of AR features for in-store shopping assistance


Health Cleo is just the beginning. We're working on integrating with wearable devices, expanding our database to include more regional and cultural specialities, and adding support for other lab tests, such as allergy reports. Imagine walking through a grocery store and, with a glance at a product through your AR glasses, receiving instant feedback on whether it's safe for you to consume.

Here's to the future of personalized nutrition, powered by AI and driven by human ingenuity!

Built With

  • langchain
  • llm
  • ocr
  • python
  • streamlit
  • tidb
  • upstage