<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ayushg on Medium]]></title>
        <description><![CDATA[Stories by Ayushg on Medium]]></description>
        <link>https://medium.com/@ayushgs?source=rss-2c7dbe499fff------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*Oh7pvvgqNQF5t4dH</url>
            <title>Stories by Ayushg on Medium</title>
            <link>https://medium.com/@ayushgs?source=rss-2c7dbe499fff------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 06 Apr 2026 10:00:49 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ayushgs/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Classification with GenAI for Enterprises: GPT vs Customized LLMs]]></title>
            <link>https://medium.com/@ayushgs/classification-with-genai-for-enterprises-gpt-vs-customized-llms-3f10d15c5f31?source=rss-2c7dbe499fff------2</link>
            <guid isPermaLink="false">https://medium.com/p/3f10d15c5f31</guid>
            <category><![CDATA[gpt]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[llama-3]]></category>
            <category><![CDATA[generative-ai-use-cases]]></category>
            <category><![CDATA[large-language-models]]></category>
            <dc:creator><![CDATA[Ayushg]]></dc:creator>
            <pubDate>Wed, 16 Apr 2025 09:36:06 GMT</pubDate>
            <atom:updated>2025-04-16T09:36:06.361Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tGFzTpKeoiUpTrLvmbFGfw.png" /></figure><h3>Introduction</h3><p>Enterprise applications rely on classification tasks throughout their technology stack. These can include intent detection to route user requests to the right service, product category classification in retail, or document type identification when extracting data from unstructured files. The number of classes in these tasks typically ranges from 10 to 100. During our work with enterprises, we’ve frequently observed that public LLMs like GPT become less effective as the number of classes increases. This experiment aims to validate this performance decline and evaluate whether customized LLMs provide better results.</p><h3>Objective</h3><p>While public LLMs like OpenAI’s GPT-4o and Claude Sonnet 3.5/3.7 are attractive for their seamless integration, they often underperform in real-world production environments. At scale, these models become prohibitively expensive to use. For example, a marketing technology company processing 2.7M NLP classifications on marketing documents chose to continue using classic NLP models despite GenAI’s accuracy benefits, as OpenAI’s estimated costs would reach $0.5M per month. These models also present challenges in regulated industries like fintech and healthcare due to data privacy concerns, and they typically respond more slowly than lighter, optimized alternatives. A viable alternative is customizing LLMs through fine-tuning smaller open-weight models and embedding business knowledge to achieve accurate classifications.</p><p>Since the privacy and cost advantages of customized LLMs are well established, this experiment focuses on evaluating how classification accuracy changes as the number of classes increases. 
We also compare the performance of proprietary, general-purpose models against customized, fine-tuned LLMs to determine if domain-specific training provides meaningful advantages.</p><h3>Data Preparation</h3><p>For this experiment, we used an open-source dataset available on Hugging Face: <a href="https://huggingface.co/datasets/gretelai/synthetic_text_to_sql">gretelai/synthetic_text_to_sql</a>. This dataset contains high-quality, synthetic Text-to-SQL samples created using <strong>Gretel Navigator</strong> and spans 100 distinct domains or verticals, with each vertical containing nearly 900–1000 samples.</p><p>Out of the full dataset, we focused on two key columns:</p><ul><li><strong>domain</strong>: This column includes around 100 unique domain values, which we used as classification labels.</li><li><strong>sql_prompt</strong>: The actual user input, which serves as the input for our classification task.</li></ul><p>To build a meaningful and scalable classification problem, we needed a subset of domains that were closely related. For this, we applied a <strong>clustering-based approach</strong>:</p><ol><li>We embedded the sql_prompt texts using a Sentence Transformer model, <strong>&quot;all-MiniLM-L6-v2&quot;</strong>, an open-source, lightweight model suitable for semantic understanding.</li><li>We then applied <strong>K-Means clustering</strong> with n_clusters=3 on the embedded representations to group similar domains based on prompt semantics.</li><li>From the resulting clusters, we selected one cluster containing 58 domains with strong semantic overlap. After manually reviewing the cluster, we removed 8 loosely connected classes to finalize a set of <strong>50 well-aligned classes</strong> for our experiments.</li></ol><p>Using this refined group, we created benchmark datasets with varying numbers of classes: 5, 10, 20, 25, and 50. For each class count, we sampled <strong>10 examples per class</strong>. 
This means the dataset size scaled as follows:</p><ul><li>5 classes → 50 samples</li><li>10 classes → 100 samples</li><li>20 classes → 200 samples</li></ul><p>and so on.</p><p>Importantly, we ensured <strong>class continuity across groups</strong>. The classes included in the 5-class set are also part of the 10-class set, and so forth. This approach helps maintain experimental integrity and allows us to reliably assess the impact of increasing class diversity on model performance.</p><p>The data, excluding the benchmark samples for each of the selected classes, was used to prepare the training dataset. Since each class had around 1,000 examples, we created a total training set of around 40,000 samples to fine-tune the Llama model and compare its performance against the GPT model.</p><h3>GPT-4o Evaluation</h3><p>Using benchmark datasets with varying class sizes, we crafted a clear and consistent prompt format to evaluate GPT-4o’s classification performance. The prompt provided unambiguous instructions for the model to select the most appropriate domain from a provided list of domain names for a given SQL-related question. We structured the output as a simple JSON format to ensure reliable evaluation.</p><p>Here’s an example of the prompt structure:</p><pre>Given a domain list: [&#39;forestry&#39;, &#39;marine biology&#39;, &#39;aquaculture&#39;, &#39;agriculture&#39;, &#39;wildlife conservation&#39;], select the most appropriate domain for the provided question. 
Your response should be a JSON with the &quot;domain&quot; key containing the domain name and nothing else.<br><br>Sample Format:<br>Sample Question: Find the latest movie which &quot;Gabriele Ferzetti&quot; acted in.<br>Response: {&quot;domain&quot;: &quot;imdb&quot;}<br><br>Now, process the following question accordingly.<br>Question: {{question}}</pre><p>For each benchmark dataset (with 5, 10, 20, 25, and 50 classes), the corresponding list of domains in the prompt was updated to match the set of classes used. The same structure was maintained across all experiments to ensure consistency.</p><p>The results with GPT-4o showed a clear trend: as the number of classes increased, the model’s accuracy in correctly identifying the appropriate domain <strong>declined to the point of being unusable</strong>. This decline highlights the challenge of scaling general-purpose models for fine-grained classification tasks.</p><p>The performance drop is visualized in the graph below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NOjmGokzzGAb8ZEacAgCMQ.png" /><figcaption>Fig 1: Accuracy of GPT-4o Model Varying Across Increasing Number of Classes</figcaption></figure><h3>Fine-tuning and Evaluating Meta-Llama-3.1–8B-Instruct</h3><p>In the second experiment, we evaluated how a customized model, <strong>Meta-Llama-3.1–8B-Instruct</strong>, performs when fine-tuned on our domain classification task. 
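Assuming a chat-style supervised fine-tuning data format (the post does not show the exact template used for the Llama runs; this sketch simply mirrors the GPT-4o prompt above), each (sql_prompt, domain) pair can be converted into a training record like so:

```python
import json

def to_chat_record(sql_prompt: str, domain: str, domain_list: list) -> dict:
    """Format one (sql_prompt, domain) pair as a chat-style fine-tuning record.

    The instruction text mirrors the GPT-4o prompt shown earlier; the exact
    template used for fine-tuning is an assumption, not taken from the post.
    """
    instruction = (
        f"Given a domain list: {domain_list}, select the most appropriate "
        "domain for the provided question. Your response should be a JSON "
        'with the "domain" key containing the domain name and nothing else.'
    )
    return {
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": f"Question: {sql_prompt}"},
            # The target completion is the same JSON the model must emit at inference.
            {"role": "assistant", "content": json.dumps({"domain": domain})},
        ]
    }

record = to_chat_record(
    "List all salmon farms with a biomass above 10 tonnes.",
    "aquaculture",
    ["forestry", "marine biology", "aquaculture"],
)
```

Records in this shape can be written to a JSONL file and passed to standard supervised fine-tuning tooling (for example, Hugging Face TRL).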
We first trained this base model on the dataset with 50 classes, and then ran inference with different numbers of classes.</p><p>We trained three versions of the model, each with a different amount of training data, to study the impact of training-data volume on accuracy:</p><ul><li><strong>Model A</strong>: Fine-tuned with 2,000 samples</li><li><strong>Model B</strong>: Fine-tuned with 5,000 samples</li><li><strong>Model C</strong>: Fine-tuned with 40,000 samples</li></ul><p>All models were evaluated using the same benchmark datasets and inference format as in the GPT-4o experiment. This ensured a fair comparison of performance across both approaches.</p><p>The table below summarizes the accuracy scores for GPT-4o and all three fine-tuned models across the benchmark datasets:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ag5dVwRoKza9pLTUSgfDeA.png" /><figcaption><em>Table 1: Accuracy Comparison of Various Experiments with Increasing Number of Classes</em></figcaption></figure><p>The following graph presents a visual representation of the accuracy trends highlighted in the comparison table above:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*phB18_NTitn0II1pFNYI2A.png" /><figcaption>Fig 2: Accuracy Trend Across Varying Class Distributions During Inference</figcaption></figure><p>As the graph and comparison table show, the fine-tuned models performed significantly better than the general-purpose model and maintained consistent accuracy as the number of classes increased. As expected, <strong>accuracy improved with more training data</strong>. 
Model C (trained on 40K samples) consistently achieved the highest accuracy across all class groupings (5, 10, 20, 25, 50).</p><p>Note that while we fine-tuned a single model and evaluated it across multiple class groupings, <strong>training separate models for specific class ranges</strong> could yield even higher accuracy.</p><h3>Conclusion</h3><p>Our experiments clearly highlight a key insight: <strong>general-purpose LLMs struggle with accuracy as task complexity increases</strong>, especially in classification problems with a growing number of classes. This limitation becomes critical in enterprise applications where precision drives downstream decisions.</p><p>In contrast, <strong>customized LLMs, fine-tuned on domain-specific data, consistently outperform general models</strong>, making them a more reliable choice for production use. While fine-tuning requires initial development effort, tools like Genloop make this process more accessible and scalable.</p><p>As complexity grows, so does the performance gap, <strong>favoring tailored solutions over generic ones</strong>. For enterprises, the path forward is clear: <strong>invest in quality data and model customization to unlock the true potential of AI in production systems</strong>.</p><h3>About Genloop</h3><p><a href="https://genloop.ai">Genloop</a> delivers customized LLMs that provide unmatched cost, control, simplicity, and performance for production enterprise applications. Please visit <a href="http://genloop.ai/">genloop.ai</a> or email <a href="mailto:founder@genloop.ai">founder@genloop.ai</a> for more details. 
Schedule a <a href="https://genloop.ai/request-demo">free consultation call</a> with our GenAI experts for personalized guidance and recommendations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AGEFKc5gZGfvWmHSU09tww.png" /><figcaption><a href="https://genloop.ai/request-demo">https://genloop.ai/request-demo</a></figcaption></figure>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Text to SQL: The Ultimate Guide for 2025]]></title>
            <link>https://medium.com/@ayushgs/text-to-sql-the-ultimate-guide-for-2025-3fa4e78cbdf9?source=rss-2c7dbe499fff------2</link>
            <guid isPermaLink="false">https://medium.com/p/3fa4e78cbdf9</guid>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[llama-3]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[Ayushg]]></dc:creator>
            <pubDate>Thu, 13 Feb 2025 13:15:40 GMT</pubDate>
            <atom:updated>2025-02-17T14:10:20.341Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_cj31SIay6herhM1uUsIVg.png" /></figure><h3>Introduction</h3><p>Data is the backbone of software companies, and SQL is the backbone of data management. Over 80% of the world’s structured data is stored in relational databases, with SQL serving as the primary language for accessing and managing it. Every day, millions of data analysts, data scientists, and developers rely on SQL to query and manipulate data. In essence, SQL acts as a bridge between you and your data.</p><h3>The Challenge and Solution</h3><p>Translating natural language questions into SQL queries is a complex task that often delays insights. This process is particularly difficult for non-technical users, making it hard for them to gather insights and make data-driven decisions. As a result, crucial opportunities are sometimes missed, or decisions are delayed.</p><p>Large Language Models (LLMs) offer a promising solution by automating the generation of SQL queries. By doing so, they help bridge the gap between data and its stakeholders. Text-to-SQL technology has emerged as one of the top three applications of Generative AI since 2023, enabling companies to unlock insights faster and empower a wider range of users to interact with data effectively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4l1lLwZzXC9Xhx2nMTK92w.png" /><figcaption>A Text-to-SQL example (above), and Evolutionary Process of Text-to-SQL Research (below). Source: <a href="https://arxiv.org/abs/2406.08426">arXiv:2406.08426</a> (Next-Generation Database Interfaces)</figcaption></figure><p>A lot has changed in the space of Generative AI (GenAI) since 2023. There have been significant improvements in both closed and open-weight models, as well as new methods for customizing these models using techniques like fine-tuning with <a href="/637ef12a4ca447ac94c2297211ba6757?pvs=25">QLoRA</a>. 
In this blog, we will explore how to implement text-to-SQL in 2025. Genloop has been collaborating with enterprise clients on text-to-SQL solutions, and here, we provide a comprehensive guide on the approaches you can try. We will also objectively compare these methods to help you choose the best approach for your needs.</p><h3>Understanding the Basics: Glossary</h3><ul><li><strong>Table Identification</strong>: The process of accurately determining which table(s) should be used to retrieve data for a specific question or query.</li><li><strong>Business Rules</strong>: A set of business knowledge or rules that are relevant to the specific use case. This includes domain knowledge, business formulas, and customer-specific rules.</li><li><strong>Context</strong>: All the information and metadata about the database required to retrieve the necessary data for answering a query. This includes tables, schemas, columns, descriptions, etc.</li><li><strong>SQL Writer</strong>: A module that takes the relevant context and generates an SQL query. This is typically powered by an LLM or an agent.</li><li><strong>SQL Executor</strong>: A module that executes an SQL query and returns the corresponding data. It may also include mechanisms to validate or correct SQL queries.</li><li><strong>SQL Data</strong>: The data retrieved by executing the SQL query. This is used to prepare a response to the user’s input or question.</li><li><strong>Decorator</strong>: The final module that processes the SQL data and user input to prepare a natural language response. This module often includes an LLM and features like chart creation. In some cases, it may also validate the correctness of the response.</li></ul><h3>Different Approaches to Text to SQL</h3><h3>1. 
Prompting the Smartest LLM</h3><p>The simplest solution is just to give all the schemas, columns, tables, and table descriptions (collectively called context) to the biggest and smartest model with the hope that it will be smart enough to reason and arrive at the right SQL query. This SQL query when executed gives us the SQL data, which the decorator uses to prepare the final response.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1WJr-13fG2iH3uZAi6NpDA.png" /><figcaption>Architecture of Text to SQL: Prompting the Smartest LLM</figcaption></figure><p>While this approach requires the LLM to handle significant complexity, it might yield reasonable accuracy for smaller schemas and straightforward queries. However, as schemas grow larger and queries become more complex — as they typically do in real-world production systems — this method often falls short. The context window of the LLM becomes a limiting factor, and such models are expensive, slow, and require transferring sensitive business data to third-party servers.</p><p><strong>Performance Metrics:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vrz6LXeXNh5KqPeRNobE-g.png" /><figcaption>Attribute Scoring: Prompting the Smartest LLM</figcaption></figure><p><strong>Bonus: A Sample Prompt Template</strong></p><pre>### System Prompt Template: SQL Query Generator<br><br>You are a highly intelligent assistant specialized in generating SQL queries. <br>Your role is to analyze the user query and generate a correct and optimized SQL query based on the provided database information. <br>The output should follow standard SQL syntax and consider all constraints, table relationships, and requirements mentioned in the input.<br><br>- --<br><br>#### Inputs Provided<br><br>1. User Query: A natural language description of the data the user wants to retrieve.<br><br>2. 
Database Information:<br><br>- Tables and Columns: A list of tables with their descriptions and columns.<br><br>- Relationships: Descriptions of relationships between tables (e.g., foreign keys, joins).<br><br>- Sample Data (if available): Example data from tables for context.<br><br>- --<br><br>#### Database Information Format<br><br>Each table is provided as follows:<br><br>- Table Name: `&lt;table_name&gt;`<br>- Description: `&lt;table_description&gt;`<br>- Columns:<br>- `&lt;column_name&gt;`: `&lt;data_type&gt;` (&lt;constraints&gt;)<br>- ...<br><br>#### Instructions for SQL Query Generation<br><br>1. Understand the User Query: Parse the query to identify the intent and required data.<br><br>2. Analyze Table Information: Identify the relevant tables, columns, and relationships needed to construct the query.<br><br>3. Generate the Query:<br><br>- Use appropriate SQL commands (`SELECT`, `INSERT`, `UPDATE`, `DELETE`, etc.).<br><br>- Include filters, joins, grouping, or sorting as needed.<br><br>- Ensure the query adheres to the schema and data types.<br><br>4. Validate Query Structure:<br><br>- Ensure the SQL query is syntactically correct.<br><br>- Optimize for performance by minimizing unnecessary joins and filters.<br><br>5. Output: Return the SQL query as plain text. Provide a brief explanation of the query if needed.</pre><h3>2. RAG with Fast General-Purpose LLMs</h3><p>In the first approach, we just dumped the complete information to an LLM hoping it would figure everything out. 
The second approach is to fetch only the relevant information through RAG (Retrieval Augmented Generation) techniques, break the problem into simpler tasks, and have those simpler tasks performed in orchestration through relatively faster general-purpose LLMs like <a href="https://platform.openai.com/docs/models#gpt-4o">GPT-4o</a>, <a href="https://www.anthropic.com/news/claude-3-5-sonnet">Claude</a>, <a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/">Meta Llama 3.3 70B</a>, <a href="https://api-docs.deepseek.com/news/news1226">DeepSeek-v3</a>, etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ruK9DaPIk8Nh0vlUe1FcNQ.png" /><figcaption>Architecture of Text to SQL: RAG with Fast General-Purpose LLMs</figcaption></figure><p>The first step is to select the relevant information from the complete context. This can be done either through vector search-based retrieval, where the question maps to a related canonical question with pre-selected tables, schemas, and examples, or by identifying the relevant tables through a Table Intent Classification LLM. Once the right tables are selected, the rules and schema governing those tables are collected and supplied as the context for downstream processing.</p><p>After the relevant context is found, a general-purpose LLM is used to generate an SQL query from the given context and query. Since the context is already filtered by RAG, general-purpose LLMs are able to do a better job.</p><p>The process of selecting the best model for this could be iterative. We suggest starting with the smarter models, prioritizing accuracy (for example, GPT-4o). 
Once satisfied, you can try even smaller models (like GPT-4o-mini) and evaluate their accuracy for your use case on your test set.</p><p><strong>Performance Metrics:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p7rxiD5e4-cKBbbqx1mnLw.png" /><figcaption>Attribute Scoring: RAG with Fast General-Purpose LLMs</figcaption></figure><p><strong>Bonus: A Sample Prompt Template for Table Intent LLM</strong></p><pre>### Prompt: Classify Relevant Tables for SQL Query<br><br>You are a highly intelligent assistant specializing in SQL query generation. <br>Your task is to analyze the user query and classify which table(s) from the database should be used to construct the SQL query. <br>Consider the table descriptions and their columns to make your decision.<br><br>- --<br><br>#### Input Format<br><br>- User Query:<br>- A natural language description of the data the user wants to retrieve.<br><br>- Database Information:<br>- A list of tables, their descriptions, and columns.<br><br>----<br>- Example Database:<br><br>1. Table Name: `customers`<br><br>- Description: Contains customer details.<br><br>- Columns:<br><br>- `customer_id`: `int` (Primary Key)<br><br>- `name`: `varchar`<br><br>- `email`: `varchar`<br><br>2. Table Name: `orders`<br><br>- Description: Contains order information, including customer IDs and order dates.<br><br>- Columns:<br><br>- `order_id`: `int` (Primary Key)<br><br>- `customer_id`: `int` (Foreign Key referencing `customers.customer_id`)<br><br>- `order_date`: `date`<br><br>3. Table Name: `products`<br><br>- Description: Contains product details.<br><br>- Columns:<br><br>- `product_id`: `int` (Primary Key)<br><br>- `name`: `varchar`<br><br>- `price`: `decimal`<br><br>---<br><br>#### Task<br><br>Analyze the user query to determine which table(s) should be used. Consider the following:<br><br>1. Does the user query mention attributes or concepts tied to specific columns or descriptions?<br><br>2. 
Are multiple tables required due to relationships (e.g., joins)?<br><br>3. Exclude irrelevant tables.<br><br><br>#### Output Format<br>Output the result as a JSON with keys &#39;tables&#39; (the relevant tables) and &#39;explanation&#39; (brief explanation for why the tables are relevant). Return &#39;NA&#39; if no tables are relevant.</pre><h3>3. Agents with General-Purpose LLMs</h3><p>AI agents are autonomous software systems that perceive their environment and take actions to achieve specific goals. They can operate independently, learn from experience, and make decisions based on their programming and environmental inputs.</p><p>A simple RAG with a general-purpose model cannot recover from errors. This is where agents come to the rescue. A simple agentic system could have two agents. The first is a context agent that collects the most relevant context to share; RAG is a simplified implementation of this context agent. The other is the SQL agent, which uses workers like the SQL Writer (LLM) and the SQL Executor to produce the SQL data used for generating the response.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MscuwCcQ01qSDU8cIJN6Yw.png" /><figcaption>Architecture of Text to SQL: Agents with General-Purpose LLMs</figcaption></figure><p>A real example of such an approach is the <a href="https://github.com/GoogleCloudPlatform/Open_Data_QnA">Open Data QnA</a> by Google.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*feoiLpHsp_jRKxujLLVp0A.png" /><figcaption>Open Data QnA Architecture. Source: <a href="https://github.com/GoogleCloudPlatform/Open_Data_QnA">https://github.com/GoogleCloudPlatform/Open_Data_QnA</a></figcaption></figure><p>Such a system is able to successfully recover from many mistakes and common syntactical errors because of the agentic planning and interaction. 
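This recover-on-error loop can be sketched as follows. Here `write_sql` is a stub standing in for the SQL Writer LLM, and an in-memory SQLite database stands in for the warehouse; all names are illustrative assumptions, not the Open Data QnA implementation:

```python
import sqlite3

def run_with_recovery(question, schema, write_sql, max_attempts=3):
    """Agent-style loop: draft SQL, execute it, and on failure feed the
    database error back to the writer for a corrected attempt."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    feedback = None
    for _ in range(max_attempts):
        sql = write_sql(question, schema, feedback)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The error message becomes feedback for the next drafting attempt.
            feedback = f"Query failed with: {exc}. Please fix the SQL."
    raise RuntimeError("Could not produce a working query")

# Stub writer: the first attempt references a wrong column, then "corrects"
# itself after seeing the executor's error message (a real system would call
# the LLM again with the feedback appended to the prompt).
def stub_writer(question, schema, feedback):
    if feedback is None:
        return "SELECT customer_name FROM customers"  # no such column
    return "SELECT name FROM customers"

schema = "CREATE TABLE customers (id INTEGER, name TEXT); INSERT INTO customers VALUES (1, 'Ada');"
rows = run_with_recovery("List all customers", schema, stub_writer)
```

The loop itself is trivial; the accuracy gain in real systems comes from giving the model the concrete execution error as additional context on each retry.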
Query enhancement also improves: the agents follow business rules more faithfully because of the multi-step nature of the system, as opposed to the single-shot nature of the RAG approach.</p><p>However, this approach takes longer to produce a response. The increase in accuracy comes at the expense of cost and latency. This is because the models themselves have not become smarter; the agentic design simply gives the system more steps and time to reach the answer and to correct its own errors.</p><p><strong>Performance Metrics:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VT8C9fqtEL69NA2ekzRobQ.png" /><figcaption>Attribute Scoring: Agents with General-Purpose LLMs</figcaption></figure><p><strong>Bonus</strong><em>:</em></p><p><em>Various tools help simplify agent building and execution. </em><a href="https://github.com/microsoft/autogen"><em>AutoGen</em></a><em>, </em><a href="https://github.com/openai/swarm"><em>OpenAI Swarm</em></a><em>, </em><a href="https://ai.pydantic.dev/"><em>Pydantic AI</em></a><em>, and </em><a href="https://www.crewai.com/"><em>Crew AI</em></a><em> are some frameworks to get started with agentic workflows</em>.</p><h3>4. Customized Contextual LLMs</h3><p>While the previous approaches can typically deliver 70–85% accuracy, customizing large language models (LLMs) to build personalized contextual models can help you push beyond that threshold. 
If you’re experiencing challenges with any of the attributes mentioned above or need more tailored results, this approach is your best bet.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q_0ApSvQL2Lxa0RPJ5-9Jg.png" /><figcaption>Architecture of Text to SQL: Customized Contextual LLMs</figcaption></figure><p>The approach is to fine-tune open-weight models such as Llama, Mistral, and Qwen, tailoring them to better understand the business and domain-specific needs.</p><p>Here’s how it works: when a user query comes in, the first customized LLM is responsible for establishing the correct context for downstream tasks. It not only understands the user query but also identifies relevant data tables and devises an execution plan. This plan is then passed to the SQL Writer LLM, which generates a query specific to your domain, accounting for business rules such as how growth is calculated or the formulas used for quarterly sales. Afterward, the SQL data is processed by the Decorator LLM and returned as a response. Since the entire execution is highly contextual, it rarely needs any error recovery.</p><p>By adopting this model, enterprises gain full ownership of their AI, enjoying unmatched cost-effectiveness, control, and performance. With ownership, the model can be deployed behind the company’s firewall, ensuring complete data security. Moreover, because the model is fine-tuned for the business, it delivers unparalleled accuracy. The smaller size of these models also means a much faster path to response, up to 250x quicker than general LLMs.</p><p>This approach has proven highly effective in our enterprise engagements. We will dive deeper into the benefits and workflow details in an upcoming blog post. 
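The context LLM, SQL Writer, and Decorator flow described above reduces to a simple chain. In this sketch every callable is a stub standing in for a fine-tuned model, and all names and values are illustrative:

```python
def answer(question, context_llm, sql_writer, executor, decorator):
    """Chain the customized models described above. Each stage's output is
    the next stage's input; no error-recovery loop is needed here."""
    plan = context_llm(question)       # relevant tables + execution plan
    sql = sql_writer(question, plan)   # business-rule-aware SQL
    data = executor(sql)               # run against the warehouse
    return decorator(question, data)   # natural-language response

# Minimal stubs to show the data flow end to end.
result = answer(
    "What was Q3 revenue?",
    context_llm=lambda q: {"tables": ["sales"], "plan": "sum revenue for Q3"},
    sql_writer=lambda q, p: "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q3'",
    executor=lambda sql: [(42000,)],
    decorator=lambda q, rows: f"Q3 revenue was {rows[0][0]}",
)
```

In production, each lambda would be replaced by a call to the corresponding fine-tuned model or database client; the chain structure itself stays this simple.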
However, it’s important to note that while this approach offers exceptional value, it requires significant development effort to build, scale, and maintain these LLMs.</p><p><strong>Performance Metrics:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X6bV0kYPTWIl5bIp8STnuQ.png" /><figcaption>Attribute Scoring: Customized Contextual LLMs</figcaption></figure><h3>Conclusion</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Qo0XNVbsMULQYHI3J09UuA.png" /><figcaption>Final Comparison Table of All Approaches</figcaption></figure><p>If you have a straightforward use case, a RAG-based general model might suffice. However, for use cases at significant scale, where rental AI costs skyrocket, or if you require the highest accuracy standards or have privacy concerns, <strong>Approach 4: Customized Contextual Models</strong> is essential. Make an informed decision using the <a href="https://genloop.ai/should-you-fine-tune">Should I Fine-Tune</a> tool.</p><h3>About Genloop</h3><p><a href="https://genloop.ai">Genloop</a> delivers customized LLMs that provide unmatched cost, control, simplicity, and performance for production enterprise applications. Please visit <a href="http://genloop.ai/">genloop.ai</a> or email <a href="mailto:founder@genloop.ai">founder@genloop.ai</a> for more details. Schedule a <a href="https://genloop.ai/request-demo">free consultation call</a> with our GenAI experts for personalized guidance and recommendations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NJ8oe33YvQhZfokWB-o_AQ.png" /><figcaption><a href="https://genloop.ai/request-demo">https://genloop.ai/request-demo</a></figcaption></figure>]]></content:encoded>
        </item>
    </channel>
</rss>