<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Sherlock Xu on Medium]]></title>
        <description><![CDATA[Stories by Sherlock Xu on Medium]]></description>
        <link>https://medium.com/@sherlockxu?source=rss-45268e99b1------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*aCgWxGZnbTvCjwfbFtHvIg.jpeg</url>
            <title>Stories by Sherlock Xu on Medium</title>
            <link>https://medium.com/@sherlockxu?source=rss-45268e99b1------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 24 Apr 2026 15:22:13 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@sherlockxu/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[A Guide to Model Composition]]></title>
            <link>https://medium.com/bentoml/a-guide-to-model-composition-09fbff8e62a5?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/09fbff8e62a5</guid>
            <category><![CDATA[composition-model]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[compound-ai-system]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Tue, 30 Jul 2024 08:18:02 GMT</pubDate>
            <atom:updated>2024-07-30T08:18:02.083Z</atom:updated>
            <content:encoded><![CDATA[<p>This blog post was originally published at <em>The New Stack: </em><a href="https://thenewstack.io/a-guide-to-model-composition/">https://thenewstack.io/a-guide-to-model-composition/</a></p><p>Consider an AI-powered image recognition app designed to identify and classify wildlife photos. You upload a picture taken during a hike, and within moments, the app not only identifies the animal in the photo but also provides detailed information about its species, habitat, and conservation status. This kind of app can be built through <strong>model composition</strong> — a technique where multiple AI models collaborate to analyze and interpret the image from various perspectives.</p><p>Model composition in this context might involve a sequence of specialized models: one for detecting the animal in the image, another for classifying it into broad categories (e.g., bird, mammal, and reptile), and yet another set of models that work together to determine the specific species. This layered approach offers a nuanced analysis that exceeds the capabilities of a single AI model.</p><h3>What is model composition?</h3><p>At its core, model composition is a strategy in machine learning that combines multiple models to solve a complex problem that cannot be easily addressed by a single model. This approach leverages the strengths of each individual model, providing more nuanced analyses and improved accuracy. Model composition can be seen as assembling a team of experts, where each member brings specialized knowledge and skills to the table, working together to achieve a common goal.</p><p>Many real-world problems are too complicated for a one-size-fits-all model. 
By orchestrating multiple models, each trained to handle specific aspects of a problem or data type, we can create a more comprehensive and effective solution.</p><p>There are several ways to implement model composition, including but not limited to:</p><ul><li><strong>Sequential processing</strong>: Models are arranged in a pipeline, where the output of one model serves as the input for the next. This is often used in tasks like data preprocessing, feature extraction, and then classification or prediction.</li><li><strong>Parallel processing</strong>: Multiple models run in parallel, each processing the same input independently. Their outputs are then combined, either by averaging, voting, or through a more complex aggregation model, to produce a final result. This is commonly used in ensemble methods.</li></ul><p>An important concept related to model composition is the <strong>inference graph</strong>. An inference graph visually represents the flow of data through various models and processing steps in a model composition system. It outlines how models are connected, the dependencies between them, and how data transforms and flows from input to final prediction. The graphical representation helps us design, implement, and understand complex model composition. 
Here is an inference graph example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3cc03hvb8ekwcYNJ" /></figure><ol><li>The service accepts a text input, such as “I have an idea!”</li><li>It simultaneously sends the prompt to three separate text generation models, which run in parallel to produce results using different algorithms or datasets.</li><li>The results from these three models are then sent to a text classification model.</li><li>The classification model assesses each piece of generated text and assigns a classification score to them (for example, based on the content’s sentiment).</li><li>Finally, the service aggregates the generated text along with their respective classification scores and returns them as JSON.</li></ol><h3>When should I compose models?</h3><p>Model composition is a practical solution to a wide array of challenges in machine learning. Here are some key use cases where model composition plays a crucial role.</p><h4>Multi-modal applications</h4><p>In today’s digital world, data comes in various forms: text, images, audio, and more. A multi-modal application combines models specialized in processing different types of data. A typical example of composing models to create multi-modal applications is <a href="https://arxiv.org/abs/2301.12597">BLIP-2</a>, which is designed for tasks that involve both text and images.</p><p>BLIP2 integrates three distinct models, each providing a unique capability to the system:</p><ul><li>A frozen large language model (LLM): Provides strong language generation and zero-shot transfer abilities.</li><li>A frozen pre-trained image encoder: Extracts and encodes visual information from images.</li><li>A lightweight Querying Transformer model (Q-Former): Bridges the modality gap between the LLM and the image encoder. 
It integrates visual information from the encoder with the LLM, focusing on the most relevant visual details for generating text.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kKAE_Zv4TXbGpxtI" /><figcaption>BLIP-2 architecture. Source: <a href="https://arxiv.org/abs/2301.12597">The original BLIP-2 paper</a></figcaption></figure><h4>Ensemble modeling</h4><p>Ensemble modeling is a technique used to improve the predictive performance of machine learning models. It does so by combining the predictions from multiple models to produce a single, more accurate result. The core idea is that by aggregating the predictions of several models, you can often achieve better performance than any single model could on its own. The models in an ensemble may be of the same type (e.g., all decision trees) or different types (e.g., a combination of neural networks, decision trees, and logistic regression models). Key techniques in ensemble modeling include:</p><ul><li><strong>Bagging</strong>: Train multiple models on different subsets of the training data and then average their predictions, useful for reducing variance.</li><li><strong>Boosting</strong>: Sequentially train models, where each model attempts to correct errors made by the previous ones.</li><li><strong>Stacking</strong>: Train multiple base models, then train a meta-model on their outputs; the meta-model learns how best to combine their predictions, leveraging the strengths of each base model to improve overall performance.</li></ul><p>A real-world use case of ensemble modeling is a weather forecasting system, where accuracy is important for planning and safety across industries and activities. An ensemble model for weather prediction might integrate outputs from various models, each trained on different data sets, using different algorithms, or focusing on different aspects of weather phenomena. Some models might be more capable of predicting precipitation, while others perform better at forecasting temperature or wind speed.
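As a minimal sketch of the aggregation step (using hypothetical stand-in values rather than real forecasting models), averaging and majority voting might look like this:

```python
# Minimal sketch of ensemble aggregation. The three "model outputs" below
# are hypothetical placeholders; in practice each value would come from a
# separately trained forecasting model.

def average_predictions(predictions: list[float]) -> float:
    """Combine regression outputs (e.g., temperature forecasts) by averaging."""
    return sum(predictions) / len(predictions)

def majority_vote(predictions: list[str]) -> str:
    """Combine classification outputs (e.g., rain / no rain) by voting."""
    return max(set(predictions), key=predictions.count)

# Temperature forecasts (in Celsius) from three models
temps = [21.0, 23.5, 22.0]
print(average_predictions(temps))   # about 22.17

# Rain predictions from three models
votes = ["rain", "no rain", "rain"]
print(majority_vote(votes))         # "rain"
```

Real systems often replace the simple average with weighted averaging or a learned meta-model, but the fan-in shape stays the same.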
By aggregating these predictions, an ensemble approach can provide a more accurate and nuanced forecast.</p><h4>Pipeline processing</h4><p>Machine learning tasks often require a sequence of processing steps to transform raw data into actionable insights. Implementing model composition can help you structure these tasks as pipelines, where each step is handled by a different model optimized for a specific function.</p><p>One of the common use cases is an automated document analysis system, capable of processing, understanding, and extracting meaningful information from documents. The system might use a series of models, each dedicated to a phase in the processing pipeline:</p><ul><li><strong>Preprocessing</strong>: The first step might require an OCR (Optical Character Recognition) model that extracts text from scanned documents or images. This model is specialized in recognizing and converting varied fonts and handwriting styles into machine-readable text.</li><li><strong>Prediction</strong>: Following text extraction, a text classification model can be used to categorize the document based on its content, such as a legal document, a technical manual, and a financial report. This classification step is important for routing the document to appropriate downstream processes.</li><li><strong>Post-processing</strong>: After classification, a summarization model can be used to generate a concise summary of the document’s content. This summary provides quick insights into the document, informing decision-making and prioritization.</li></ul><p>In addition to sequential pipelines, you can also implement parallel processing for multiple models to run concurrently on the same data (as shown in the first image). 
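A minimal sketch of this parallel pattern, with hypothetical stand-in functions in place of real segmentation and detection models:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for independently trained models; in a real
# system these would be inference calls to the actual models.
def segmentation_model(data: str) -> str:
    return f"segmentation({data})"

def detection_model(data: str) -> str:
    return f"detection({data})"

def run_in_parallel(data: str) -> list[str]:
    """Run several models concurrently on the same input and collect results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(m, data) for m in (segmentation_model, detection_model)]
        return [f.result() for f in futures]

print(run_in_parallel("image.png"))
# ['segmentation(image.png)', 'detection(image.png)']
```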
This is useful in scenarios like:</p><ul><li><strong>Ensemble modeling</strong>: Predictions from multiple models are aggregated to improve accuracy.</li><li><strong>Computer vision tasks</strong>: Models for image segmentation and object detection may run in parallel to provide a comprehensive analysis of an image, combining insights into the image’s structure with identification of specific objects.</li></ul><h3>What are the benefits of model composition?</h3><p>Model composition provides a number of operational and developmental advantages. Here are some key benefits:</p><h4>Improved accuracy and performance</h4><p>In some cases, the synergy of multiple models working together can result in improved accuracy and performance. Each model in the composition may focus on a specific aspect of the problem, such as different data types or particular features of the data, ensuring that the combined system covers more of the problem space than any single model could. This is especially true in ensemble modeling, as aggregating the results from multiple models can help cancel out their individual biases and errors, leading to more accurate predictions.</p><h4>Dedicated infrastructure and resource allocation</h4><p>Model composition allows you to deploy the involved models across varied hardware devices, optimizing the use of computational resources. They can be assigned to run on the most appropriate infrastructure — whether it’s CPU, GPU, or edge devices — based on their processing needs and the availability of resources. This dedicated allocation also ensures that each part of the system can be scaled separately.</p><h4>Customization and flexibility</h4><p>One of the most significant advantages of model composition is the flexibility it offers. Models can be easily added, removed, or replaced within the system, allowing developers to adapt and evolve their applications as new technologies emerge or as the requirements change.
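One way to picture this flexibility (a toy sketch, not tied to any framework; the stage functions are hypothetical placeholders for real models): treat the composition as an ordered list of stages, so swapping a model is a one-line change.

```python
# Toy sketch of a modular composition: stages are interchangeable callables.
# All stage functions are hypothetical placeholders for real models.

def ocr_v1(doc: str) -> str:
    return f"text_from({doc})"

def ocr_v2(doc: str) -> str:  # a newer model, drop-in compatible
    return f"better_text_from({doc})"

def classify(text: str) -> str:
    return f"label_of({text})"

def run(pipeline, data):
    """Pass data through each stage in order."""
    for stage in pipeline:
        data = stage(data)
    return data

pipeline = [ocr_v1, classify]
print(run(pipeline, "scan.png"))   # label_of(text_from(scan.png))

# Swapping in the improved OCR model touches only one line:
pipeline[0] = ocr_v2
print(run(pipeline, "scan.png"))   # label_of(better_text_from(scan.png))
```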
This modular approach simplifies updates and maintenance, ensuring that the system can quickly adapt to new challenges and opportunities.</p><h4>Faster development and iteration</h4><p>Model composition supports a parallel development workflow, where teams can work on different models or components of the system simultaneously. This helps accelerate the development process, which means quicker iterations and more rapid prototyping. It also enables teams to provide more agile responses to feedback and changing requirements, as individual models can be refined or replaced without disrupting the entire system.</p><h4>Resource optimization</h4><p>By intelligently distributing workloads across multiple models, each optimized for specific tasks or hardware, you can maximize resource utilization. This optimization can lead to more efficient processing, reduced latency, and lower operational costs, particularly in complex applications that require substantial computational power. Effective resource optimization also means that your application can scale more gracefully, accommodating increases in data volume or user demand.</p><h3>Composing multiple models with BentoML</h3><p>Different model serving or deployment frameworks may adopt different approaches to model composition. Among them, <a href="https://github.com/bentoml/BentoML">BentoML</a>, an open-source model serving framework, provides simple service APIs to help you wrap models, establish interservice communication, and expose the composed models as REST API endpoints.</p><p>The code example below demonstrates how to use BentoML to compose multiple models. In BentoML, each Service is defined as a Python class. You use the @bentoml.service decorator to mark it as a Service and allocate CPU or GPU resources to it.
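The orchestration pattern from the inference graph shown earlier can be sketched with plain asyncio, using hypothetical async stubs in place of the actual model Services so the flow runs without BentoML or any model weights:

```python
import asyncio

# Hypothetical stubs standing in for text generation and classification
# Services; real implementations would run model inference instead.
async def generate_a(prompt: str) -> str:
    return f"A:{prompt}"

async def generate_b(prompt: str) -> str:
    return f"B:{prompt}"

async def classify(text: str) -> float:
    return float(len(text))  # stand-in for a sentiment score

async def inference_graph(prompt: str) -> list[dict]:
    # Fan out: run the generation models concurrently on the same prompt
    generations = await asyncio.gather(generate_a(prompt), generate_b(prompt))
    # Score each generated text with the classification model
    scores = await asyncio.gather(*(classify(g) for g in generations))
    # Aggregate generated text and scores into a JSON-serializable result
    return [{"generated": g, "score": s} for g, s in zip(generations, scores)]

result = asyncio.run(inference_graph("I have an idea!"))
print(result)
```

The real BentoML example follows the same fan-out/fan-in shape, with `asyncio.gather` awaiting calls to the model Services.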
When you deploy it to <a href="https://link.bentoml.com/bentocloud-a-guide-to-model-composition">BentoCloud</a>, different Services can run on dedicated instance types and be separately scaled.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/0*FuKZ4SfaCJ6CfPIb" /></figure><p>In this BentoML service.py file, GPT2 and DistilGPT2 are initialized as separate BentoML Services to generate text. The BertBaseUncased Service then takes the generated text and classifies it, providing a score that represents sentiment. The InferenceGraph Service orchestrates these individual Services, using asyncio.gather to generate text from both GPT-2 models concurrently and then classify the output using the BERT model.</p><p>Once deployed to BentoCloud, Services can run on separate instance types as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CVjRgHPqvHWVExLi" /></figure><p>Monitor the performance:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*omFMSd1hqIhopNKC" /></figure><p>For detailed explanations, see <a href="https://github.com/bentoml/BentoML/tree/main/examples/inference_graph">this example project</a>.</p><h3>Frequently asked questions</h3><p>Before I wrap up, let’s look at some frequently asked questions about model composition.</p><h4>What is the difference between ensemble modeling and multi-modal applications?</h4><p>These two machine learning concepts serve different purposes and are applied in different contexts.</p><ul><li><strong>Purpose and application:</strong> Ensemble modeling improves prediction accuracy by combining multiple models. Multi-modal applications integrate and interpret data from multiple sources or types to make better decisions or predictions.</li><li><strong>Models vs. Data:</strong> Ensemble modeling focuses on using multiple models to enhance predictions.
Multi-modal applications focus on integrating different types of data (e.g., text, image, audio).</li><li><strong>Implementation</strong>: Multi-modal systems often require data preprocessing and feature extraction techniques to handle different data types effectively. Ensemble modeling, on the other hand, needs strategies for combining model predictions, which might involve direct averaging or more complicated voting systems.</li></ul><h4>I am using a single model for my application. Should I move to multiple models?</h4><p>It’s important to note that while model composition offers different benefits as mentioned above, it’s not always necessary. If a single model can efficiently and accurately accomplish the task at hand, I recommend you just stick with it. The decision to compose multiple models and the design of the processing pipeline should be guided by your specific requirements.</p><h4>How does model composition affect production deployment?</h4><p>The integration of multiple models into a single application affects production deployment in several key ways:</p><p>Increased complexity</p><ul><li><strong>Configuration and management</strong>: Each model in the composition may require its configuration, dependencies, and environment. Managing them across multiple models adds complexity to the deployment process.</li><li><strong>Service orchestration</strong>: Composing multiple models often requires careful orchestration to ensure that data flows correctly between models and that each model is executed in the correct order or in parallel as required.</li></ul><p>Resource allocation</p><ul><li><strong>Hardware requirements</strong>: As mentioned above, different models may have different hardware requirements. Some models might need GPUs for inference, while others can run on CPUs. 
The serving and deployment framework you select should support flexible resource allocation to meet the needs.</li><li><strong>Scaling strategies</strong>: Scaling multiple models in production may not be as straightforward as scaling a single model. Different components of the application may have varying loads, requiring dynamic scaling strategies that can adjust resources for individual models based on demand.</li></ul><p>Monitoring and maintenance</p><ul><li><strong>Monitoring</strong>: Keeping track of the performance and health of different models in production requires comprehensive monitoring solutions that can provide insights into each model’s performance, resource usage, and potential bottlenecks.</li><li><strong>Versioning and updates</strong>: Updating one model in a composite application can have cascading effects on other models. Proper version control and testing strategies must be in place to manage updates without disrupting the application’s overall performance.</li></ul><p>Deployment strategies</p><ul><li><strong>Microservices architecture</strong>: Adopting a microservices architecture can simplify the deployment of multiple models by encapsulating each model as a separate service. This approach simplifies scaling, updates, and management but requires flexible service orchestration tools.</li><li><strong>Containerization</strong>: Using containers for deploying AI models can help manage dependencies and environments for each model. Container orchestration tools like Kubernetes can help manage the deployment, scaling, and networking of containerized models.</li></ul><p>Model composition can affect deployment by requiring more resources and potentially more complex deployment strategies. 
However, as shown in the example above, platforms like BentoML and BentoCloud can help developers build AI applications composed of multiple models by allowing them to package, deploy, and scale multi-model services efficiently.</p><h3>Final thoughts</h3><p>While the benefits of model composition are clear, from enhanced performance to the ability to process multiple data types, it’s important to acknowledge the complexity it introduces, especially related to production deployment. Successful implementation requires careful planning, resource management, and the adoption of modern deployment practices and tools to navigate the challenges of configuration, scaling, and maintenance.</p><hr><p><a href="https://medium.com/bentoml/a-guide-to-model-composition-09fbff8e62a5">A Guide to Model Composition</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Serving A LlamaIndex RAG App as REST APIs]]></title>
            <link>https://medium.com/@sherlockxu/serving-a-llamaindex-rag-app-as-rest-apis-4b2cdb93e925?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/4b2cdb93e925</guid>
            <category><![CDATA[rags]]></category>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[llamaindex]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Tue, 28 May 2024 11:03:59 GMT</pubDate>
            <atom:updated>2024-05-28T11:05:32.215Z</atom:updated>
<content:encoded><![CDATA[<p>Creating REST APIs for a retrieval-augmented generation (RAG) system provides a flexible and scalable way to integrate RAG with a wide range of applications. In this blog post, we will cover the basics of how to serve a RAG system built with <a href="https://github.com/run-llama/llama_index">LlamaIndex</a> as REST APIs using <a href="https://github.com/bentoml/BentoML">BentoML</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FhFmbzO5KHaWHLcEhvgDig.jpeg" /></figure><p>This is Part 1 of our blog series on Private RAG Deployment with BentoML. You will progressively build upon an example RAG app and expand it into a fully private RAG system with open-source and custom fine-tuned models. Topics to cover in the blog series:</p><ul><li>Part 1: Serving A LlamaIndex RAG App as REST APIs</li><li>Part 2: Self-Hosting LLMs and Embedding Models for RAG</li><li>Part 3: Multi-Model Orchestration for Advanced RAG systems</li></ul><h3>Concepts</h3><p>Before we serve the RAG service, let’s briefly introduce RAG and LlamaIndex.</p><h4>RAG</h4><p>Simply put, RAG is designed to help LLMs provide better answers to queries by equipping them with a customized knowledge base. This allows them to return relevant information even if they haven’t been trained directly on that data.
Here’s a brief overview of how a typical RAG system operates:</p><ol><li>The RAG system breaks the input data into manageable chunks.</li><li>An embedding model translates these chunks into vectors.</li><li>These vectors are stored in a database, ready for retrieval.</li><li>Upon receiving a query, the RAG system retrieves the most relevant chunks based on their vector similarities to the query.</li><li>The LLM synthesizes the retrieved information to generate a contextually relevant response.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*g0iDdRb_HEXoicb6.png" /><figcaption>Image source: <a href="https://gradientflow.com/techniques-challenges-and-future-of-augmented-language-models/">Techniques, Challenges, and Future of Augmented Language Models</a></figcaption></figure><p>For more information about RAG, you can refer to our previous articles.</p><ul><li><a href="https://www.bentoml.com/blog/understanding-retrieval-augmented-generation-part-1">Understanding Retrieval-Augmented Generation: Part 1</a></li><li><a href="https://www.bentoml.com/blog/understanding-retrieval-augmented-generation-part-2">Understanding Retrieval-Augmented Generation: Part 2</a></li><li><a href="https://www.bentoml.com/blog/building-rag-with-open-source-and-custom-ai-models">Building RAG with Open-Source and Custom AI Models</a></li></ul><h4>LlamaIndex</h4><p><a href="https://github.com/run-llama/llama_index">LlamaIndex</a> is a Python library that enhances the capabilities of LLMs by integrating custom data sources, such as APIs and documents. It provides efficient data ingestion, indexing, and querying, making it an ideal tool for building compound Python programs like RAG. Therefore, we will use it together with BentoML across this RAG blog series.</p><h3>What are we building?</h3><p>Production use cases often require an API serving system to expose your RAG code. 
Although web frameworks can help, they become limiting as you start adding model inference components to your server. For more information, see <a href="https://www.bentoml.com/blog/building-rag-with-open-source-and-custom-ai-models">Building RAG with Open-Source and Custom AI Models</a>.</p><p>In this blog post, we will build a REST API service with an /ingest_text endpoint for knowledge ingestion and a /query endpoint for handling user queries. The /ingest_text API lets you submit a text file to populate your RAG system&#39;s knowledge base so that you can interact with the /query API to answer questions.</p><h3>Setting up the environment</h3><p>You can find all the source code of this blog series in the <a href="https://github.com/bentoml/rag-tutorials">bentoml/rag-tutorials</a> repo. Clone the entire project and go to the 01-simple-rag directory.</p><pre>git clone https://github.com/bentoml/rag-tutorials.git<br>cd rag-tutorials/01-simple-rag</pre><p>We recommend you create a virtual environment to manage dependencies and avoid conflicts with your local environment:</p><pre>python -m venv rag-serve<br>source rag-serve/bin/activate</pre><p>Install all the dependencies.</p><pre>pip install -r requirement.txt</pre><p>By default, LlamaIndex uses OpenAI’s text embedding model and large language model APIs.
Set your OpenAI API key as an environment variable to allow your RAG to authenticate with OpenAI’s services.</p><pre>export OPENAI_API_KEY=&quot;your_openai_key_here&quot;</pre><h3>Serving a LlamaIndex RAG service</h3><p>First, let’s define a class for indexing documents using the LlamaIndex framework:</p><pre># Define a directory to persist index data<br>PERSIST_DIR = &quot;./storage&quot;<br><br>class RAGService:<br><br>    def __init__(self):<br>        # Set OpenAI API key from environment variable<br>        openai.api_key = os.environ.get(&quot;OPENAI_API_KEY&quot;)<br><br>        from llama_index.core import Settings<br>        # Configure text splitting to parse documents into smaller chunks<br>        self.text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)<br>        Settings.node_parser = self.text_splitter<br><br>        # Initialize an empty index<br>        index = VectorStoreIndex.from_documents([])<br>        # Persist the empty index initially<br>        index.storage_context.persist(persist_dir=PERSIST_DIR)<br>        # Load index from storage if it exists<br>        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)<br>        self.index = load_index_from_storage(storage_context)</pre><p><strong>Note</strong>: You can find the complete source code <a href="https://github.com/bentoml/rag-tutorials/blob/main/01-simple-rag/service.py">here</a>; in this article, we focus only on the core snippets. The code uses local storage for demo purposes and is not scalable.
You will need to integrate a vector database, such as Milvus, Pinecone, or Weaviate, for better performance at scale.</p><p>In the above code:</p><ul><li>VectorStoreIndex and StorageContext manage the storage and retrieval of indexed data.</li><li>SentenceSplitter breaks down the text into manageable chunks for better indexing and retrieval.</li></ul><p>Next, define a function for the RAG system to receive documents, which will be indexed and stored, and another one that responds to user queries by retrieving relevant information from the indexed data.</p><pre>    def ingest_text(self, txt) -&gt; str:<br>        # Create a Document object from the text<br>        with open(txt) as f:<br>            text = f.read()<br><br>        doc = Document(text=text)<br>        self.index.insert(doc)<br>        # Persist changes to the index<br>        self.index.storage_context.persist(persist_dir=PERSIST_DIR)<br>        return &quot;Successfully Loaded Document&quot;<br>        <br>    def query(self, query: str) -&gt; str:<br>        query_engine = self.index.as_query_engine()<br>        response = query_engine.query(query)<br>        return str(response)</pre><p>To test this class, simply create a RAGService object and call the methods.</p><pre># Instantiate the RAGService<br>rag_service = RAGService()<br><br># Ingest text from a file<br>ingest_result = rag_service.ingest_text(&quot;path/to/your/file.txt&quot;)<br>print(ingest_result)  # Expected output: &quot;Successfully Loaded Document&quot;<br><br># Query for information<br>query_result = rag_service.query(&quot;Your query question goes here&quot;)<br>print(query_result)  # Expected output: Retrieved information based on the query</pre><p>The code should work well and the next step is to create an API for ingesting knowledge and another one for asking questions. 
This is where BentoML comes in.</p><p>BentoML generates API endpoints from function names and type hints, using the decorated functions as callbacks to handle incoming API requests and produce responses. To serve this LlamaIndex app as an API server with BentoML, you only need to add a few decorators:</p><pre>PERSIST_DIR = &quot;./storage&quot;<br>        <br># Mark a class as a BentoML Service via decorator<br>@bentoml.service<br>class RAGService:<br><br>    def __init__(self):<br>        ...<br><br>    # Generate a REST API from the callback function<br>    @bentoml.api<br>    def ingest_text(self, txt: Annotated[Path, bentoml.validators.ContentType(&quot;text/plain&quot;)]) -&gt; str:<br>        ...<br><br>    # Generate a REST API from the callback function<br>    @bentoml.api<br>    def query(self, query: str) -&gt; str:<br>        ...</pre><p><strong>Note</strong>: See <a href="https://docs.bentoml.com/en/latest/guides/services.html">the BentoML Services doc</a> to learn more about @bentoml.service and @bentoml.api.</p><p>Start this BentoML Service by running:</p><pre>$ bentoml serve service:RAGService<br><br>2024-04-26T08:49:13+0000 [INFO] [cli] Starting production HTTP BentoServer from &quot;service:RAGService&quot; listening on http://localhost:3000 (Press CTRL+C to quit)</pre><p>The server is now accessible at <a href="http://localhost:3000">http://localhost:3000</a>.</p><h3>Querying the RAG API service</h3><p>To begin querying your RAG APIs, create an <a href="https://docs.bentoml.com/en/latest/guides/clients.html">API client</a> and <a href="https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt">ingest a file</a> into your RAG system with the ingest_text API:</p><pre>import bentoml<br>from pathlib import Path<br><br>with bentoml.SyncHTTPClient(&quot;http://localhost:3000&quot;) as client:<br>    result = client.ingest_text(<br>        txt=Path(&quot;paul_graham_essay.txt&quot;),<br>    
)</pre><p>Now, the text content of paul_graham_essay.txt has been chunked and embedded in your RAG system. Try submitting a query to ask a question about this document:</p><pre>import bentoml<br><br>with bentoml.SyncHTTPClient(&quot;http://localhost:3000&quot;) as client:<br>    result: str = client.query(<br>        query=&quot;What did Paul Graham do growing up?&quot;,<br>    )<br>    print(result)</pre><p>Example output:</p><pre>Paul Graham spent a lot of time at the Carnegie Institute as a kid and visited it in 1988. While looking at a painting there, he realized that paintings were something that could last and be made by individuals. This realization sparked his interest in art and the possibility of becoming an artist.</pre><p>BentoML generates a standard REST API server. You may choose to use any HTTP API client to interact with the endpoint. For example, you can send requests via curl:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pgtClu9xA7mPNxNg.gif" /></figure><h3>Deploying the RAG service for production</h3><p>Now, you can deploy this RAG app for production, which means it will be running in a scalable and reliable environment to handle real-world usage and traffic. Before deployment, create a bentofile.yaml file (already in <a href="https://github.com/bentoml/rag-tutorials/blob/main/01-simple-rag/bentofile.yaml">the project directory</a>) to set runtime configurations for your RAG. They will be packaged as a standardized distribution archive in BentoML, or a Bento. All the build options can <a href="https://docs.bentoml.com/en/latest/guides/build-options.html">be found here</a>. 
Remember to set the OPENAI_API_KEY environment variable.</p><pre>service: &quot;service.py:RAGService&quot;<br>labels:<br>  owner: bentoml-team<br>include:<br>  - &quot;*.py&quot;<br>exclude:<br>  - &quot;storage/&quot;<br>python:<br>  requirements_txt: &quot;./requirement.txt&quot;<br>docker:<br>  distro: debian<br>envs:<br>  - name: OPENAI_API_KEY<br>    value: &quot;sk-*******************&quot; # Add your key here</pre><p>You can then choose to deploy the LlamaIndex RAG app with Docker or BentoCloud.</p><h4>Docker</h4><p>Use bentoml build to build a Bento.</p><pre>bentoml build</pre><p>Make sure <a href="https://docs.docker.com/engine/install/">Docker</a> is running and then run the following command:</p><pre>bentoml containerize rag_service:latest</pre><p>Verify that the Docker image has been created successfully.</p><pre>$ docker images<br><br>REPOSITORY                     TAG                IMAGE ID       CREATED          SIZE<br>rag_service                    73ikq6ayikzzze5l   0bb88768ea6e   11 seconds ago   917MB</pre><p>Run the Docker image locally by following the instructions on the printed messages.</p><pre>$ docker run --rm -p 3000:3000 rag_service:73ikq6ayikzzze5l</pre><h4>BentoCloud</h4><p>Compared with Docker, <a href="https://www.bentoml.com/">BentoCloud</a> provides a fully-managed infrastructure optimized for running inference with AI models. 
This AI inference platform offers more advanced features like autoscaling, GPU inference, built-in observability, and model orchestration, which will be covered in the subsequent blog posts in this series.</p><p><a href="https://l.bentoml.com/sign-up-from-blog-serving-a-llamaindex-rag-app-as-rest-apis">Sign up for BentoCloud</a> first, then <a href="https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html">log in to it</a> via the BentoML CLI.</p><pre>bentoml cloud login \<br>    --api-token &#39;your-api-token&#39; \<br>    --endpoint &#39;your-bentocloud-endpoint-url&#39;</pre><p>Run bentoml deploy in the project directory (where bentofile.yaml exists) to deploy the LlamaIndex RAG app.</p><pre>bentoml deploy .</pre><p>Once it is up and running, use the ingest_text endpoint to ingest a text file and then send a query to the query endpoint.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RxqwZ7Z2qPjThKSD.gif" /></figure><h3>Conclusion</h3><p>With BentoML, you can easily serve a LlamaIndex RAG app as a RESTful API server. The entire process only takes two steps: 1) Structure your RAG code into a stateful class; 2) Add type hints and BentoML decorators to generate REST APIs for serving.</p><p>With your RAG service as a REST API, you can deploy it with Docker or BentoCloud for production, and integrate other systems, such as web applications, with your RAG code over APIs. Note that this example uses a local file system for storage, which may limit horizontal scalability. For production, we recommend using BentoCloud along with hosted vector database services for enhanced scalability and performance.</p><p>While OpenAI models offer powerful capabilities, some use cases may require more custom solutions to meet specific needs for data privacy and security, cost, latency and reliability. 
In the next blog post, we will explain how to replace them with open-source embedding and language models to build private RAG.</p><h3>More on BentoML and LlamaIndex</h3><p>Check out the following resources to learn more:</p><ul><li>[Blog] <a href="https://www.bentoml.com/blog/building-rag-with-open-source-and-custom-ai-models">Building RAG with Open-Source and Custom AI Models</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/scaling-ai-model-deployment">Scaling AI Models Like You Mean It</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/deploying-a-large-language-model-with-bentoml-and-vllm">Deploying A Large Language Model with BentoML and vLLM</a></li><li>[Doc] <a href="https://docs.llamaindex.ai/en/stable/getting_started/starter_example/">LlamaIndex Starter Tutorial (OpenAI)</a></li><li>[Doc] <a href="https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/">LlamaIndex Indexing</a></li><li>[Doc] <a href="https://docs.bentoml.org/en/latest/guides/services.html">BentoML Services</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4b2cdb93e925" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building RAG with Open-Source and Custom AI Models]]></title>
            <link>https://medium.com/bentoml/building-rag-with-open-source-and-custom-ai-models-9bae28546053?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/9bae28546053</guid>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[rags]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Mon, 06 May 2024 08:19:06 GMT</pubDate>
            <atom:updated>2024-05-06T08:19:06.939Z</atom:updated>
            <content:encoded><![CDATA[<p>Retrieval-Augmented Generation (RAG) is a widely used application pattern for Large Language Models (LLMs). It uses information retrieval systems to give LLMs extra context, which aids in answering user queries not covered in the LLM’s training data and helps to prevent hallucinations. In this blog post, we draw from our experience working with <a href="http://www.bentoml.com/">BentoML</a> customers to discuss:</p><ul><li>Common challenges in making a RAG system ready for production</li><li>How to use open-source or custom fine-tuned models to enhance RAG performance</li><li>How to build scalable AI systems comprising multiple models and components</li></ul><p>By the end of this post, you’ll learn the basics of how open-source and custom AI/ML models can be applied in building and improving RAG applications.</p><p><strong>Note</strong>: This blog post is based on <a href="https://www.youtube.com/watch?v=2tm0b8_TVr8">this video</a>, with additional details.</p><h3>Simple RAG system</h3><p>A simple RAG system consists of 5 stages:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/761/0*KGtYwy6qQ4ByXdZu.png" /></figure><ul><li><strong>Chunking</strong>: RAG begins with turning your structured or unstructured dataset into text documents, and breaking down text into small pieces (chunks).</li><li><strong>Embed documents</strong>: A text embedding model steps in, turning each chunk into vectors representing their semantic meaning.</li><li><strong>VectorDB</strong>: These embeddings are then stored in a vector database, serving as the foundation for data retrieval.</li><li><strong>Retrieval</strong>: Upon receiving a user query, the vector database helps retrieve chunks relevant to the user’s request.</li><li><strong>Response Generation</strong>: With context, an LLM synthesizes these pieces to generate a coherent and informative response.</li></ul><p>Implementing a simple RAG system with a text embedding model and an 
LLM might initially only need a few lines of Python code. However, dealing with real-world datasets and improving performance for the system require more than that.</p><h3>Challenges in production RAG</h3><p>Building a RAG for production is no easy feat. Here are some of the common challenges:</p><h3>Retrieval performance</h3><ul><li><strong>Recall</strong>: Not all chunks that are relevant to the user query are retrieved.</li><li><strong>Precision</strong>: Not all chunks retrieved are relevant to the user query.</li><li><strong>Data ingestion</strong>: Complex documents, semi-structured and unstructured data.</li></ul><h3>Response synthesis</h3><ul><li><strong>Safeguarding</strong>: Determining whether the user query is toxic or offensive, and how to handle it.</li><li><strong>Tool use</strong>: Use tools such as browsers or search engines to assist the response generation.</li><li><strong>Context accuracy</strong>: Retrieved chunks lacking necessary context or containing misaligned context.</li></ul><h3>Response evaluation</h3><ul><li><strong>Synthetic dataset for evaluation</strong>: LLMs can be used to create evaluation datasets for measuring the RAG system’s responses.</li><li><strong>LLMs as evaluators</strong>: LLMs also serve as evaluators themselves.</li></ul><h3>Improving the RAG pipeline with custom AI models</h3><p>To build a robust RAG system, you need to take into account a set of building blocks or baseline components. These elements or decisions form the foundation upon which your RAG system’s performance is built.</p><h3>Text embedding model</h3><p>Common models like text-embedding-ada-002, while popular, may not be the best performers across all languages and domains. 
Their one-size-fits-all approach often falls short when you have nuanced requirements for specialized fields.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*P0uRGg8miQl87a7T.png" /><figcaption>Source: Hugging Face <a href="https://huggingface.co/spaces/mteb/leaderboard">Massive Text Embedding Benchmark (MTEB) Leaderboard</a></figcaption></figure><p>On this note, fine-tuning an embedding model on a domain-specific dataset often enhances the retrieval accuracy. This is due to the improvement of embedding representations for the specific context during the fine-tuning process. For instance, while a general embedding model might associate the word “Bento” closely with “Food” or “Japan”, a model fine-tuned for AI inference would more likely connect it with terms like “Model Serving”, “Open Source Framework”, and “AI Inference Platform”.</p><h3>Large language model</h3><p>While GPT-4 leads the pack in performance, not all applications require such firepower. Sometimes, a more modest and well-optimized model can deliver the speed and cost-effectiveness needed, especially when provided with the right context. In particular, consider the following questions when choosing the LLM for your RAG:</p><ul><li><strong>Security and privacy</strong>: What level of control do you need over your data?</li><li><strong>Latency requirement</strong>: What is your TTFT (Time to first token) and TPOT (Time per output token) requirement? Is it serving real-time chat applications or offline data processing jobs?</li><li><strong>Reliability</strong>: For mission-critical applications, a dedicated deployment that you control often provides more reliable response time and generation quality.</li><li><strong>Capabilities</strong>: What tasks do you need your LLM to perform? 
For simple tasks, can it be replaced by a smaller, specialized model?</li><li><strong>Domain knowledge</strong>: Does an LLM trained on general web content understand your specific domain knowledge?</li></ul><p>These questions are important no matter whether you are self-hosting open-source models or using commercial model endpoints. The right model should align with your data policies, budget plan, and the specific demands of your RAG application.</p><h3>Context-aware chunking</h3><p>Most simple RAG systems rely on fixed-size chunking, dividing documents into equal segments with some overlap to ensure continuity. This method, while straightforward, can sometimes strip away the rich context embedded in the data.</p><p>By contrast, context-aware chunking breaks down text data into more meaningful pieces, considering the actual content and its structure. Instead of splitting text at fixed intervals (like word count), it identifies logical breaks in the text using NLP techniques. These breaks can occur at the end of sentences, paragraphs, or when topics shift. 
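Such sentence-aware splitting can be sketched in a few lines of plain Python. The helper below is only an illustration: a crude regex stands in for a real NLP sentence segmenter, and whole sentences are packed into chunks up to a character budget instead of being cut at a fixed offset.</p>

```python
import re

def chunk_by_sentence(text: str, max_chars: int = 200):
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Crude sentence segmentation: split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            # Budget exceeded: close the chunk at a sentence boundary.
            chunks.append(current)
            current = sent
        else:
            current = (current + " " + sent) if current else sent
    if current:
        chunks.append(current)
    return chunks
```

<p>A production system would swap the regex for a proper sentence or topic segmenter, but the packing logic stays the same.</p><p>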
This ensures each chunk captures a complete thought or idea, and makes it possible to add additional metadata to each chunk, for implementing metadata filtering or Small-to-Big retrieval.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mJ9uBIhr4a9ppqZm.jpg" /></figure><p>As your RAG system can understand the overall flow and ideas within a document with context-aware chunking, it is capable of creating chunks that capture not just isolated sentences but also the broader context they belong to.</p><h3>Parsing complex documents</h3><p>The real world throws complex documents at us — product reviews, emails, recipes, and websites that not only contain textual content but are also enriched with structure, images, charts, and tables.</p><p>Traditional Optical Character Recognition (OCR) tools such as EasyOCR and Tesseract are proficient in transcribing text but often fall short when it comes to understanding the layout and contextual significance of the elements within a document.</p><p>For those grappling with the complexities of modern documents, consider integrating the following models and tools into your RAG systems:</p><ul><li><strong>Layout analysis: LayoutLM (and v2, v3)</strong> have been pivotal in advancing document layout analysis. <a href="https://arxiv.org/abs/2204.08387">LayoutLMv3</a>, in particular, integrates text and layout with image processing without relying on conventional CNNs, streamlining the architecture and leveraging masked language and image modeling, making it highly effective in understanding both text-centric and image-centric tasks.</li><li><strong>Table detection and extraction: </strong><a href="https://github.com/microsoft/table-transformer"><strong>Table Transformer (TATR)</strong></a> is specifically designed for detecting, extracting, and recognizing the structure of tables within documents. 
It operates similarly to object detection models, using a DETR-like architecture to achieve high precision in both table detection and functional analysis of table contents.</li><li><strong>Document question-answering systems:</strong> Building a Document Visual Question Answering (DocVQA) system often requires multiple models, such as models for layout analysis, OCR, entity extraction, and finally, models trained to answer queries based on the document’s content and structure. Tools like <a href="https://arxiv.org/abs/2111.15664">Donut</a> and the latest versions of LayoutLMv3 can be helpful in developing robust DocVQA systems.</li><li><strong>Fine-tuning:</strong> Existing open-source models are a great place to start, but additional fine-tuning on your specific documents, to handle their unique content or structure, can often lead to better performance.</li></ul><h3>Metadata filtering</h3><p>Incorporating these models into your RAG systems, especially when combined with NLP techniques, allows for the extraction of rich metadata from documents. This includes elements like the sentiment expressed in text, the structure or summarization of a document, or the data encapsulated in a table. Most modern vector databases support storing metadata alongside text embeddings, as well as using metadata filtering during retrieval, which can significantly enhance the retrieval accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wpyXzoUeouATAQzq.png" /></figure><h3>Reranking models</h3><p>While embedding models are a powerful tool for initial retrieval in RAG systems, they can sometimes return a large number of documents that might be generally relevant, but not necessarily the most precise answers to a user’s query. 
This is where <strong>reranking models</strong> come into play.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wW_RB11F13BZRjX5.png" /><figcaption>Image source: <a href="https://www.pinecone.io/learn/series/rag/rerankers/">Rerankers and Two-Stage Retrieval</a></figcaption></figure><p>Reranking models introduce a two-step retrieval process that significantly improves precision:</p><ol><li><strong>Initial retrieval:</strong> An embedding model acts as a first filter, scanning the entire database and identifying a pool of potentially relevant documents. This initial retrieval is fast and efficient.</li><li><strong>Reranking:</strong> The reranking model then takes over, examining the shortlisted documents from the first stage. It analyzes each document’s content in more detail, considering its specific relevance to the user’s query. Based on this analysis, the reranking model reorders the documents, placing the most relevant ones at the top (sometimes at both ends of the context window for maximum relevance).</li></ol><p>While reranking provides superior precision, it adds an extra step to the retrieval process. Many may think this can increase latency. 
However, reranking also means you don’t need to send all retrieved chunks to the LLM, leading to faster generation time.</p><p>For more information, see this article <a href="https://www.pinecone.io/learn/series/rag/rerankers/">Rerankers and Two-Stage Retrieval</a>.</p><h3>Cross-modal retrieval</h3><p>While traditional RAG systems primarily focus on text data, <a href="https://arxiv.org/abs/2305.05665">research</a> like <a href="https://arxiv.org/abs/2305.05665">ImageBind: One Embedding Space To Bind Them All</a> is opening doors to a more versatile approach: Cross-modal retrieval.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mc2jAU955fpYREk6.jpg" /><figcaption>Image source: <a href="https://arxiv.org/abs/2305.05665">ImageBind: One Embedding Space To Bind Them All</a></figcaption></figure><p>Cross-modal retrieval transcends traditional text-based limitations, supporting interplay between different types of data, such as audio and visual content. For example, when a RAG system incorporates models like <a href="https://arxiv.org/abs/2201.12086">BLIP</a> for visual reasoning, it’s able to understand the context within images, improving the textual data pipeline with visual insights.</p><p>While still in its early stages, multi-modal retrieval holds great potential to expand what RAG systems can achieve.</p><h3>Recap: AI models in RAG systems</h3><p>As we improve our RAG system for production, the complexity increases accordingly. Ultimately, we may find ourselves orchestrating a group of AI models, each playing its part in the workflow of data processing and response generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tvlo5SWj0-mBpIFY.png" /></figure><p>As we address these complexities, we also need to pay attention to the infrastructure for deploying AI models. 
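Before moving on, the two-stage retrieval pattern described earlier can be made concrete with a toy sketch. Everything below is illustrative: the two scoring functions are trivial stand-ins for a real embedding model and a real reranker, included only to show the shape of the pipeline.</p>

```python
def two_stage_retrieve(query, docs, fast_score, rerank_score,
                       shortlist_size=3, top_k=1):
    """Cheap scoring over the whole corpus, expensive scoring on a shortlist."""
    # Stage 1: initial retrieval with the fast scorer.
    shortlist = sorted(docs, key=lambda d: fast_score(query, d),
                       reverse=True)[:shortlist_size]
    # Stage 2: the costlier reranker reorders only the shortlist.
    return sorted(shortlist, key=lambda d: rerank_score(query, d),
                  reverse=True)[:top_k]

def word_overlap(query, doc):
    # Stand-in for embedding similarity: count shared words.
    return len(set(query.lower().split()).intersection(doc.lower().split()))

def exact_phrase_rerank(query, doc):
    # Stand-in for a cross-encoder: strongly reward a verbatim phrase match.
    return word_overlap(query, doc) + (10 if query.lower() in doc.lower() else 0)
```

<p>The expensive scorer never sees the full corpus, which is what keeps the added reranking latency bounded.</p><p>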
In the next part of this blog post, we’ll explore these infrastructure challenges and introduce how BentoML is contributing to this space.</p><h3>Scaling RAG services with multiple custom AI models</h3><h4>Serving embedding models</h4><p>One of the most frequent challenges is efficiently serving the embedding model. BentoML can help improve its performance in the following ways:</p><ul><li><strong>Asynchronous non-blocking invocation</strong>: BentoML allows you to <a href="https://docs.bentoml.com/en/latest/guides/services.html#convert-synchronous-to-asynchronous">convert synchronous inference methods of a model to asynchronous calls</a>, providing non-blocking implementation and improving performance in IO-bound operations.</li><li><strong>Shared model replica across multiple API workers</strong>: BentoML supports running shared model replicas across multiple <a href="https://docs.bentoml.com/en/latest/guides/workers.html">API workers</a>, each assigned with a specific GPU. This can maximize parallel processing, increase throughput, and reduce overall inference time.</li><li><strong>Adaptive batching</strong>: Within a BentoML Service, there is a dispatcher that manages how batches should be optimized by dynamically adjusting batch sizes and wait time to suit the current load. This mechanism is called <a href="https://docs.bentoml.com/en/latest/guides/adaptive-batching.html">adaptive batching</a> in BentoML. 
In the context of text embedding models, we often see performance improvements up to 3x in latency and 2x in throughput compared to non-batching implementations.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6peqtKQ7gYekmYXX.png" /></figure><p>For more information, see this <a href="https://github.com/bentoml/BentoSentenceTransformers">BentoML example project to deploy an embedding model</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KwlbHgOb-kHtmtVd.png" /></figure><h4>Self-hosting LLMs</h4><p>Many developers may start by pulling a model from Hugging Face and running it with frameworks like PyTorch or Transformers. This is fine for development and exploration, but it performs poorly when serving high-throughput workloads in production.</p><p>There are a variety of open-source tools like vLLM, <a href="https://github.com/bentoml/OpenLLM">OpenLLM</a>, mlc-llm, and TensorRT-LLM available for self-hosting LLMs. Consider the following when choosing such tools:</p><ul><li><strong>Inference best practices</strong>: Does the tool support optimized LLM inference? Techniques like continuous batching, Paged Attention, Flash Attention, and automatic prefix caching need to be implemented for efficient performance.</li><li><strong>Customizations</strong>: LLM behavior often needs custom control, such as advanced stop conditioning (determining when a model should cease generating further content), specific output formats (ensuring the results adhere to a specific structure or standard), or input validation (for example, using a classification model to detect invalid inputs). If you need such customization, consider using <a href="https://github.com/bentoml/BentoVLLM">BentoML + vLLM</a>.</li></ul><p>In addition to the LLM inference server, the infrastructure required for scaling LLM workloads also comes with unique challenges. For example:</p><ul><li><strong>GPU Scaling</strong>: Unlike traditional workloads, GPU utilization metrics can be deceptive for LLMs. 
Even if the metrics suggest full capacity, there might still be room for more requests and more throughput. This is why solutions like BentoCloud offer <a href="https://docs.bentoml.com/en/latest/bentocloud/how-tos/autoscaling.html">concurrency-based autoscaling</a>. Such an approach learns the semantic meanings of different requests, using <a href="https://docs.bentoml.com/en/latest/guides/adaptive-batching.html">dynamic batching</a> and wise resource management strategies to scale effectively.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TKaFF5tp_w7zkjcN.png" /></figure><ul><li><strong>Cold start and fast scaling</strong> <strong>with large container image and model files</strong>: Downloading large images and models from remote storage and loading models into GPU memory is a time-consuming process, <strong>breaking most existing cloud infrastructure&#39;s assumptions about the workload.</strong> Specialized infrastructure, like BentoCloud, helps accelerate this process via <strong>lazy image pulling,</strong> <strong>streaming model loading</strong> and <strong>in-cluster caching</strong>.</li></ul><p>For details, refer to <a href="https://www.bentoml.com/blog/scaling-ai-model-deployment">Scaling AI Model Deployment</a>.</p><h3>Model composition</h3><p><a href="https://docs.bentoml.com/en/latest/guides/model-composition.html">Model composition</a> is a strategy that combines multiple models to solve a complex problem that cannot be easily addressed by a single model. Before we talk about how BentoML can help you compose multiple models for RAG, let&#39;s take a look at two other typical scenarios in RAG systems.</p><h4>Document Processing Pipeline</h4><p>A document processing pipeline consists of multiple AI/ML models, each specializing in a stage of the data conversion process. 
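In code, such a pipeline is essentially function composition. The sketch below uses invented, CPU-only stand-ins for the real models (the stage names and outputs are hypothetical); in a real deployment, each stage could be a separate BentoML Service.</p>

```python
def run_pipeline(document, stages):
    """Run a document through an ordered list of stage functions."""
    for stage in stages:
        document = stage(document)
    return document

# Hypothetical stage stand-ins, for illustration only.
def ocr(doc):
    doc["text"] = "raw text from " + doc["source"]
    return doc

def layout_analysis(doc):
    doc["blocks"] = ["title", "paragraph", "table"]
    return doc

def table_extraction(doc):
    doc["tables"] = [b for b in doc["blocks"] if b == "table"]
    return doc
```

<p>Because each stage only consumes and produces a document dict, stages can be developed, deployed, and scaled independently.</p><p>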
In addition to OCR, which extracts text from images, it can extend to layout analysis, table extraction and image understanding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*uDUsTY5YRoMw4Jne.png" /></figure><p>The models used in this process might have different resource requirements, some requiring GPUs for model inference and others, more lightweight, running efficiently on CPUs. Such a setup naturally fits into a distributed system of micro-services, each service serving a different AI model or function. This architectural choice can drastically improve resource utilization and reduce cost.</p><p>BentoML facilitates this process by allowing users to easily implement a distributed inference graph, where each stage can be a separate BentoML Service wrapping the capability of the corresponding model. In production, they can be deployed and scaled separately (more details can be found below).</p><h4>Using small language models</h4><p>In some cases, “small” models can be an ideal choice for their efficiency, particularly for simpler, more direct tasks like summarization, classification, and translation. Here’s how and why they fit into a multi-model system:</p><ul><li><strong>Rapid response:</strong> For example, when a user query is submitted, a small model like BERT can swiftly determine if the request is inappropriate or toxic. If so, it can reject the query directly, conserving resources by avoiding the follow-up steps.</li><li><strong>Routing</strong>: These nimble models can act as request routers. A fine-tuned BERT-like model needs no more than 10 milliseconds to identify which tools or data sources are needed for a given request. 
By contrast, an LLM may need a few seconds or more to complete.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wKrR-uwarB0zaC0-.png" /></figure><h4>Uniting separate RAG components</h4><p>Running a RAG system with a large number of custom AI models on a single GPU is highly inefficient, if not impossible. Although each model could be deployed and hosted separately, this approach makes it challenging to iterate and enhance the system as a whole.</p><p>BentoML is optimized for building such serving systems, streamlining both the workflow from development to deployment and the serving architecture itself. Developers can encapsulate the entire RAG logic within a single Python application, referencing each component (like OCR, reranker, text embedding, and large language models) as a straightforward Python function call. The framework eliminates the need to build and manage distributed services, optimizing resource efficiency and scalability for each component. BentoML also manages the entire pipeline, packaging the necessary code and models into a single versioned unit (a “Bento”). This consistency across different application lifecycle stages drastically simplifies the deployment and evaluation process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*URbw2S8UasxzDYui.png" /></figure><blockquote>Note: In the next series of blog posts, we will dive into more details on how developers can leverage BentoML for model composition and serving RAG systems at scale. 
Stay tuned!</blockquote><p>To summarize, here is how BentoML can help you build RAG systems:</p><ul><li>Define all the RAG components within one Python file</li><li>Compile them to one versioned unit for evaluation and deployment</li><li>Adopt baked-in model serving and inference best practices like adaptive batching</li><li>Assign each model inference component to a different GPU shape and scale them independently for maximum resource efficiency</li><li>Monitor production performance in BentoCloud, which provides comprehensive observability like tracing and logging</li></ul><p>For more information, refer to <a href="https://github.com/bentoml/rag-tutorials">our RAG tutorials</a>.</p><h3>Conclusion</h3><p>Modern RAG systems often require a large number of open-source and custom fine-tuned AI models to achieve optimal performance. As we improve RAG systems with all these additional AI models, the complexity grows quickly, which not only slows down your development iterations, but also comes with a high cost in deploying and maintaining such a system in production.</p><p>BentoML is designed for building and serving compound AI systems with multiple models and components easily. 
It comes in handy in the orchestration of complex RAG systems, ensuring seamless scaling in the cloud.</p><h3>More on BentoML</h3><p>To learn more about BentoML, check out the following resources:</p><ul><li>[Blog] <a href="https://www.bentoml.com/blog/introducing-bentoml-1-2">Introducing BentoML 1.2</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/deploying-solar-with-bentoml">Deploying Solar with BentoML</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/scaling-ai-model-deployment">Scaling AI Models Like You Mean It</a></li><li>If you are interested in our AI inference platform BentoCloud, <a href="https://l.bentoml.com/sign-up-from-blog-building-rag-with-open-source-and-custom-ai-models">sign up</a> now and get $10 in free credits!</li><li>Join the <a href="https://l.bentoml.com/join-slack-from-blog-building-rag-with-open-source-and-custom-ai-models">BentoML Slack community</a> to get help and <a href="https://l.bentoml.com/contact-us-from-blog-building-rag-with-open-source-and-custom-ai-models">contact us</a> to schedule a call with our expert!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9bae28546053" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/building-rag-with-open-source-and-custom-ai-models-9bae28546053">Building RAG with Open-Source and Custom AI Models</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Guide to Open-Source Image Generation Models]]></title>
            <link>https://medium.com/bentoml/a-guide-to-open-source-image-generation-models-e64db6626bfe?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/e64db6626bfe</guid>
            <category><![CDATA[text-to-image-generation]]></category>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Thu, 28 Mar 2024 04:53:11 GMT</pubDate>
            <atom:updated>2024-05-02T13:18:01.295Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://medium.com/bentoml/navigating-the-world-of-large-language-models-8f2299c86acc">In my previous article</a>, I talked about the world of Large Language Models (LLMs), introducing some of the most advanced open-source text generation models over the past year. However, LLMs are only one of the important players in today’s rapidly evolving AI world. Equally transformative and innovative are the models designed for visual creation, like text-to-image, image-to-image, and image-to-video models. They have opened up new opportunities for creative expression and visual communication, enabling us to generate beautiful visuals, change backgrounds, inpaint missing parts, replicate compositions, and even turn simple scribbles into professional images.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lpWfiw2iyEhU0v5-H3THtA.jpeg" /></figure><p>One of the most mentioned names in this field is Stable Diffusion, which comes with a series of open-source visual generation models, like Stable Diffusion 1.4, 2.0 and XL, mostly developed by Stability AI. However, in the expansive universe of AI-driven image generation, they represent merely a part of it and things can get really complicated as you begin to choose the right model for serving and deployment. A quick search on Hugging Face <a href="https://huggingface.co/models?pipeline_tag=text-to-image">gives over 18,000 text-to-image models alone</a>.</p><p>In this blog post, we will provide a featured list of open-source models that stand out for their ability in generating creative visuals. 
Just like the previous blog post, we will also answer frequently asked questions to help you navigate this exciting yet complex domain, providing insights into using these models in production.</p><h3>Stable Diffusion</h3><p><a href="https://huggingface.co/models?other=stable-diffusion">Stable Diffusion (SD)</a> has quickly become a household name in generative AI since its launch in 2022. It is capable of generating photorealistic images from both text and image prompts. You might often hear the term “diffusion models” mentioned together with Stable Diffusion; diffusion is the base AI technique that powers it. Simply put, diffusion models generate images by starting with a pattern of random noise and gradually shaping it into a coherent image through a learned process that reverses the step-by-step addition of noise. This process is computationally intensive, but Stable Diffusion optimizes it with latent space technology.</p><p>Latent space is like a compact, simplified map of all the possible images that the model can create. Instead of dealing with every tiny detail of an image (which takes a lot of computing power), the model uses this map to find and create new images more efficiently. It’s a bit like sketching out the main ideas of a picture before filling in all the details.</p><p>In addition to static images, Stable Diffusion can also produce videos and animations, making it a comprehensive tool for a variety of creative tasks.</p><p>Why should you use Stable Diffusion:</p><ul><li><strong>Multiple variants</strong>: Stable Diffusion comes with a variety of popular base models, such as Stable Diffusion 1.4, 1.5, 2.0, and 2.1, Stable Diffusion XL, Stable Diffusion XL Turbo, and Stable Video Diffusion. <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0#evaluation">According to this evaluation graph</a>, the SDXL base model performs significantly better than the previous variants. 
Nevertheless, it is hard to say definitively which model generates better images than another, as the results can be impacted by various factors, like the prompt, inference steps, and LoRA weights. Some models also have more LoRAs available, which is an important factor when choosing the right model. For beginners, I recommend you start with SD 1.5 or SDXL 1.0. They’re user-friendly and rich in features, perfect for exploring without getting into the technical details.</li><li><strong>Customization and fine-tuning</strong>: Stable Diffusion base models can be fine-tuned with as few as five images for generating visuals in specific styles or of particular subjects, enhancing the relevance and uniqueness of generated images. One of my favorites is <a href="https://huggingface.co/ByteDance/SDXL-Lightning">SDXL-Lightning</a>, built upon Stable Diffusion XL; it is known for its lightning-fast capability to generate high-quality images in just a few steps (1, 2, 4, or 8).</li><li><strong>Controllable</strong>: Stable Diffusion provides you with extensive control over the image generation process. For example, you can adjust the number of steps the model takes during the diffusion process, set the image size, specify the seed for reproducibility, and tweak the guidance scale to influence the adherence to the input prompt.</li><li><strong>Future potential</strong>: There’s vast potential for integration with animation and video AI systems, promising even more expansive creative possibilities.</li></ul><p>Points to be cautious about:</p><ul><li><strong>Distortion</strong>: Stable Diffusion can sometimes inaccurately render complex details, particularly faces, hands, and legs. These mistakes might not be immediately noticeable. 
To improve the generated images, you can try adding a negative prompt or using specific fine-tuned versions.</li><li><strong>Text generation</strong>: Stable Diffusion has difficulty understanding and rendering text within images, which is not uncommon for image generation models.</li><li><strong>Legal concerns</strong>: Using AI-generated art could pose long-term legal challenges, especially if the training data wasn’t thoroughly vetted for copyright issues. This isn’t specific to Stable Diffusion, and I will talk more about it in an FAQ later.</li><li><strong>Similarity risks</strong>: Given the data Stable Diffusion was trained on, there’s a possibility of generating similar or duplicate results when artists and creators use similar keywords or prompts.</li></ul><p><strong>Note</strong>: <a href="https://stability.ai/news/stable-diffusion-3">Stable Diffusion 3</a> was just released last month, but it is currently only available for early preview.</p><h3>DeepFloyd IF</h3><p><a href="https://github.com/deep-floyd/IF">DeepFloyd IF</a> is a text-to-image generation model developed by Stability AI and the DeepFloyd research lab. It stands out for its ability to produce images with remarkable photorealism and nuanced language understanding.</p><p>DeepFloyd IF’s architecture is particularly noteworthy for its approach to diffusion in pixel space. Specifically, it contains a text encoder and three cascaded pixel diffusion modules. Each module plays a unique role in the process: Stage 1 is responsible for the creation of a base 64x64 px image, which is then progressively upscaled to 1024x1024 px across Stage 2 and Stage 3. This distinguishes it from latent diffusion models like Stable Diffusion. 
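</p><p>To make the cascade concrete, here is a toy sketch in plain Python. It is purely conceptual: the real stages are large neural networks, and the toy_denoise and toy_upscale functions below are made-up stand-ins that only mimic how data flows through the three stages.</p><pre>import random

random.seed(0)

def toy_denoise(image, steps=10):
    # Stand-in for a learned reverse-diffusion loop: each step simply
    # shrinks values, where a real model would predict and remove noise.
    for _ in range(steps):
        image = [[0.8 * v for v in row] for row in image]
    return image

def toy_upscale(image, factor):
    # Nearest-neighbor upscaling as a stand-in for a super-resolution stage.
    tall = []
    for row in image:
        wide = []
        for v in row:
            wide.extend([v] * factor)
        for _ in range(factor):
            tall.append(list(wide))
    return tall

# Stage 1: turn random noise into a base 64x64 image
stage1 = toy_denoise([[random.gauss(0, 1) for _ in range(64)] for _ in range(64)])
# Stages 2 and 3: progressively upscale to 256x256, then 1024x1024
stage2 = toy_upscale(stage1, 4)
stage3 = toy_upscale(stage2, 4)
print(len(stage3), len(stage3[0]))  # 1024 1024</pre><p>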
This pixel-level processing allows DeepFloyd IF to directly manipulate images for generating or enhancing visuals without translating into and out of a compressed latent representation.</p><p>Why should you use DeepFloyd IF:</p><ul><li><strong>Text understanding</strong>: DeepFloyd IF integrates the large language model T5-XXL-1.1 for deep text prompt understanding, enabling it to create images that closely match input descriptions.</li><li><strong>Text rendering</strong>: DeepFloyd IF showcases tangible progress in rendering text with better coherence than previous models in the Stable Diffusion series and other text-to-image models. While it has its flaws, DeepFloyd IF marks a significant step forward in the evolution of image generation models in text rendering.</li><li><strong>High photorealism:</strong> DeepFloyd IF <a href="https://stability.ai/news/deepfloyd-if-text-to-image-model">achieves an impressive zero-shot FID score</a> of 6.66, which means it is able to create high-quality, photorealistic images. The FID score is used to evaluate the quality of images generated by text-to-image models, and lower scores typically mean better quality.</li></ul><p>Points to be cautious about:</p><ul><li><strong>Content sensitivity</strong>: DeepFloyd IF was trained on a subset of the LAION-5B dataset, known for its wide-ranging content, including adult, violent, and sexual themes. Efforts have been made to mitigate the model’s exposure to such content, but you should remain cautious and review its output when necessary.</li><li><strong>Bias and cultural representation</strong>: The model’s training on LAION-2B(en), a dataset with English-centric images and text, introduces a bias towards white and Western cultures, often treating them as defaults. 
This bias affects the diversity and cultural representation in the model’s output.</li><li><strong>Hardware requirements</strong>: You need a GPU with at least 24 GB of VRAM to run all its variants, making it resource-intensive.</li></ul><h3>ControlNet</h3><p><a href="https://arxiv.org/abs/2302.05543">ControlNet</a> can be used to enhance the capabilities of diffusion models like Stable Diffusion, allowing for more precise control over image generation. It operates by dividing neural network blocks into “locked” and “trainable” copies, where the trainable copy learns specific conditions you set, and the locked one preserves the integrity of the original model. This structure allows you to train the model with small datasets without compromising its performance, making it ideal for personal or small-scale device use.</p><p>Why should you use ControlNet:</p><ul><li><strong>Enhanced control over image generation</strong>: ControlNet introduces a higher degree of control by allowing additional conditions, such as edge detection or depth maps, to steer the final image output. This makes ControlNet a good choice when you want to clone image compositions, dictate specific human poses, or produce similar images.</li><li><strong>Efficient and flexible</strong>: The model architecture ensures minimal additional GPU memory requirements, making it suitable even for devices with limited resources.</li></ul><p>Points to be cautious about:</p><ul><li><strong>Dependency on Stable Diffusion</strong>: ControlNet relies on Stable Diffusion to function. This dependency could affect its usage in environments where Stable Diffusion might not be the preferred choice for image generation. In addition, the limitations of Stable Diffusion mentioned above, like distortion and legal concerns, could also impact the generated images.</li></ul><h3>Animagine XL</h3><p>Text-to-image AI models hold significant potential for the animation industry. 
Artists can quickly generate concept art by providing simple descriptions, allowing for rapid exploration of visual styles and themes. In this area, Animagine XL is one of the key players leading the innovation. It represents a series of open-source anime text-to-image generation models. Built upon Stable Diffusion XL, its latest release <a href="https://huggingface.co/cagliostrolab/animagine-xl-3.1">Animagine XL 3.1</a> adopts tag ordering for prompts, which means the sequence of tags in a prompt will significantly impact the output. To ensure the generated results are aligned with your intention, <a href="https://huggingface.co/cagliostrolab/animagine-xl-3.1#tag-ordering">you may need to follow a certain template</a>, as the model was trained this way.</p><p>Why should you use Animagine XL:</p><ul><li><strong>Tailored anime generation</strong>: Designed specifically for anime-style image creation, it offers superior quality in this genre. If you are looking for a model to create this type of image, Animagine XL can be the go-to choice.</li><li><strong>Expanded knowledge base</strong>: Animagine XL integrates a large number of anime characters, enhancing the model’s familiarity across a broader range of anime styles and themes.</li></ul><p>Points to be cautious about:</p><ul><li><strong>Niche focus</strong>: Animagine XL is primarily designed for anime-style images, which might limit its application for broader image generation needs.</li><li><strong>Learning curve</strong>: Mastering tag ordering and prompt interpretation for optimal results may require familiarity with anime genres and styles.</li></ul><h3>Stable Video Diffusion</h3><p>Stable Video Diffusion (SVD) is a video generation model from Stability AI, aiming to provide high-quality videos from still images. 
As mentioned above, this model is a part of Stability AI’s suite of AI tools and represents their first foray into open video model development.</p><p>Stable Video Diffusion is capable of generating videos of 14 or 25 frames at customizable frame rates between 3 and 30 frames per second. <a href="https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt#evaluation">According to this evaluation graph</a>, human evaluators preferred SVD’s video quality over that of <a href="https://research.runwayml.com/gen2">GEN-2</a> and <a href="https://www.pika.art/">PikaLabs</a>.</p><p>Stability AI is still working to improve both its safety and quality. The company emphasized that “<a href="https://stability.ai/news/stable-video-diffusion-open-ai-video-model">this model is not intended for real-world or commercial applications at this stage and it is exclusively for research</a>”. That said, it is one of the few open-source video generation models available in this industry. If you just want to play around with it, pay attention to the following:</p><ul><li><strong>Short video length</strong>: The model can only generate short video sequences, with a maximum length of around 4 seconds, limiting the scope for longer narrative or detailed exploration.</li><li><strong>Motion limitations</strong>: Some generated videos may lack dynamic motion, resulting in static scenes or very slow camera movements that might not meet expectations in certain use cases.</li><li><strong>Distortion</strong>: Stable Video Diffusion may not accurately generate faces and people, often resulting in less detailed or incorrect representations, posing challenges for content focused on human subjects.</li></ul><p>Now let’s answer some of the frequently asked questions about open-source image generation models. 
Questions like “Why should I choose open-source models over commercial ones?” and “What should I consider when deploying models in production?” are already <a href="https://www.bentoml.com/blog/navigating-the-world-of-large-language-models">covered in my previous blog post</a>, so I do not list them here.</p><h3>What is LoRA? What can I do with it and Stable Diffusion?</h3><p>LoRA, or Low-Rank Adaptation, is an advanced technique designed for fine-tuning machine learning models, including generative models like Stable Diffusion. It works by using a small number of trainable parameters to fine-tune these models on specific tasks or to adapt them to new data. As it significantly reduces the number of parameters that need to be trained, it does not require extensive computational resources.</p><p>With LoRA, you can enhance Stable Diffusion models by customizing generated content with specific themes and styles. If you don’t want to create LoRA weights yourself, check out the LoRA resources on <a href="https://civitai.com/search/models?sortBy=models_v8&amp;query=lora">Civitai</a>.</p><h3>How can I create high-quality images?</h3><p>Creating high-quality images with image generation models involves a blend of creativity, precision, and technical understanding. Here are some key strategies to improve your outcomes:</p><ul><li><strong>Be detailed and specific</strong>: Use detailed and specific descriptions in your prompt. The more specific you are about the scene, subject, mood, lighting, and style, the more accurately the model can generate your intended image. For example, instead of saying “a cat”, input something like “a fluffy calico cat lounging in the afternoon sun by a window with sheer curtains”.</li><li><strong>Layered prompts</strong>: Break down complex scenes into layered prompts. First, describe the setting, then the main subjects, followed by details like emotions or specific actions. 
This helps guide the model to better understand your prompt.</li><li><strong>Reference artists or works</strong>: Including the names of artists or specific art pieces can help steer the style of the generated image. However, be mindful of copyright considerations and use this approach for inspiration rather than replication.</li></ul><h3>Should I worry about copyright issues when using image generation models?</h3><p>The short answer is YES.</p><p>Copyright concerns are a significant aspect to consider when using image generation models, not just open-source models but commercial ones as well. There have been lawsuits against companies behind popular image generation models, <a href="https://www.findlaw.com/legalblogs/federal-courts/judge-trims-copyright-lawsuit-against-ai-model-stable-diffusion/">like this one</a>.</p><p>Many models are trained on vast datasets that include copyrighted images. This raises questions about the legality of using these images as part of the training process.</p><p>Determining the copyright ownership of AI-generated images can also be complex. If you’re planning to use these images commercially, it’s important to consider who holds the copyright — the user who inputs the prompt, the creators of the AI model, or neither.</p><p>So, what can you do?</p><p>At this stage, the best suggestion I can give to someone using these models and the images they create is to stay informed. The legal landscape around AI-generated images is still evolving. Keep abreast of ongoing legal discussions and rulings related to AI and copyright law. 
Understanding your rights and the legal status of AI-generated images is crucial for using these tools ethically and legally.</p><h3>What is the difference between deploying LLMs and image generation models in production?</h3><p>Deploying LLMs and image generation models in production requires similar considerations of factors like scalability and observability, but they also have their own unique challenges and requirements.</p><ul><li><strong>Resource requirements</strong>: Image generation models, especially high-resolution video or image models, typically demand more computational power and memory than LLMs due to the need to process and generate complex visual data. LLMs, while also resource-intensive, often have more predictable computational and memory usage patterns.</li><li><strong>Latency and throughput</strong>: Image generation tasks can have higher latency due to the processing involved in creating detailed visuals. Optimizing latency and throughput might require different strategies for image models compared to LLMs, such as adjusting model size or using specialized hardware accelerators (GPUs).</li><li><strong>Data sensitivity and privacy</strong>: Deploying both types of models in production requires careful data handling and privacy measures. However, image generation models may require additional considerations due to the potential for generating images that include copyrighted elements.</li><li><strong>User experience</strong>: For image generation models, I recommend providing users with guidance on creating effective prompts, which can enhance the quality of generated images. You may also need to design the user interface with the model’s response time and output characteristics in mind.</li></ul><h3>Final thoughts</h3><p>Just like LLMs, choosing the right model for image generation requires us to understand their strengths and limitations. Each model brings its unique capabilities to the table, supporting different real-world use cases. 
Currently, I believe the biggest challenge for image generation models is ethical and copyright concerns. As we embrace their potential to augment our creative process, it’s equally important to use these tools responsibly and respect copyright laws, privacy rights, and ethical guidelines.</p><h3>More on image generation models</h3><ul><li>If you are looking for a way to deploy diffusion models in production, feel free to <a href="https://docs.bentoml.com/en/latest/use-cases/diffusion-models/index.html">try these tutorials</a>.</li><li>Try <a href="https://l.bentoml.com/sign-up-from-blog-a-guide-to-open-source-image-generation-models">BentoCloud</a> and get $10 in free credits on signup! Experience a serverless platform tailored to simplify the building and management of your AI applications, ensuring both ease of use and scalability.</li><li>Join our <a href="https://l.bentoml.com/join-slack">Slack community</a> to get help and the latest information on BentoML!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e64db6626bfe" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/a-guide-to-open-source-image-generation-models-e64db6626bfe">A Guide to Open-Source Image Generation Models</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploying A Large Language Model with BentoML and vLLM]]></title>
            <link>https://medium.com/bentoml/deploying-a-large-language-model-with-bentoml-and-vllm-180dcb5c6622?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/180dcb5c6622</guid>
            <category><![CDATA[vllm]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[large-language-models]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Fri, 22 Mar 2024 03:27:30 GMT</pubDate>
            <atom:updated>2025-07-25T06:29:18.446Z</atom:updated>
            <content:encoded><![CDATA[<p>Large language models (LLMs) promise to redefine our interaction with technology across various industries. Yet, the leap from the promise of LLMs to their practical application presents a significant hurdle. The challenge lies not just in developing and training them, but in serving and deploying them efficiently and cost-effectively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MQnXEnlVgiPyI0u7pXC5ZA.jpeg" /></figure><p>In previous blog posts, we delved into using BentoCloud for deploying ML servers, showcasing its ability to offer serverless infrastructure tailored for optimal cost efficiency. Upon this foundation, we can integrate a new tool to enhance our BentoML Service for better LLM inference and serving: <a href="https://github.com/vllm-project/vllm">vLLM</a>.</p><p>In this blog post, let’s see how we can create an LLM server built with vLLM and BentoML, and deploy it in production with BentoCloud. By the end of this tutorial, you will have an interactive AI assistant as below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_AEyCUCRf7D6ePVU.gif" /></figure><h3>What is vLLM?</h3><p>vLLM is a fast and easy-to-use open-source library for LLM inference and serving. Developed by the minds at UC Berkeley and deployed at <a href="https://chat.lmsys.org/">Chatbot Arena and Vicuna Demo</a>, vLLM is equipped with an arsenal of features. 
To name a few:</p><ul><li><strong>Dramatic performance boost</strong>: vLLM leverages PagedAttention to achieve up to 24x higher throughput than Hugging Face Transformers, making LLM serving faster and more efficient.</li><li><strong>Ease of use</strong>: Designed for straightforward integration, vLLM simplifies the deployment of LLMs with an easy-to-use interface.</li><li><strong>Cost-effective</strong>: Optimizes resource use, significantly lowering the computational cost and making LLM deployment accessible even for teams with limited compute resources.</li></ul><p>See this article <a href="https://blog.vllm.ai/2023/06/20/vllm.html">vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention</a> to learn more about vLLM.</p><h3>Setting up the environment</h3><p>As always, I suggest you set up a virtual environment for your project to keep your dependencies organized:</p><pre>python -m venv vllm-bentoml-env<br>source vllm-bentoml-env/bin/activate</pre><p>Then, clone the project’s repo and install all the required dependencies.</p><pre>git clone https://github.com/bentoml/BentoVLLM.git<br>cd BentoVLLM/mistral-7b-instruct<br>pip install -r requirements.txt &amp;&amp; pip install -U &quot;pydantic&gt;=2.0&quot;</pre><p>The stage is now set. Let’s get started!</p><h3>Defining a BentoML Service</h3><p>1. Create a BentoML Service file service.py (already available in the repo you cloned) and open it in your preferred text editor. We&#39;ll start by importing the necessary modules:</p><pre>import uuid<br>from typing import AsyncGenerator<br>import bentoml<br>from annotated_types import Ge, Le<br>from typing_extensions import Annotated<br>from bentovllm_openai.utils import openai_endpoints</pre><p>These imports are for asynchronous operations, type checking, and the integration of BentoML and vLLM-specific functionalities. You will learn more about them in the following sections.</p><p>2. Next, specify the model to use and set some ground rules for it. 
For this project, I will use mistralai/Mistral-7B-Instruct-v0.2, which is reported to have outperformed the Llama 2 13B model in all the benchmark tests. You can choose <a href="https://docs.vllm.ai/en/latest/models/supported_models.html">any other model supported by vLLM</a>.</p><p>Also, set the maximum token limit for the model’s responses and use a template for our prompts. This template is like a script for how we want our model to behave — polite, respectful, and safe:</p><pre>MODEL_ID = &quot;mistralai/Mistral-7B-Instruct-v0.2&quot;<br><br>MAX_TOKENS = 1024<br>PROMPT_TEMPLATE = &quot;&quot;&quot;&lt;s&gt;[INST]<br>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.<br><br>If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don&#39;t know the answer to a question, please don&#39;t share false information.<br><br>{user_prompt} [/INST] &quot;&quot;&quot;</pre><p>3. Now we can begin to design the BentoML Service. Starting from BentoML 1.2, we use the @bentoml.service decorator to mark a Python class as a BentoML Service. <a href="https://docs.bentoml.org/en/latest/guides/configurations.html">Additional configurations</a> like timeout can be set to customize its runtime behavior. The resources field specifies the GPU requirements as we will deploy this Service on BentoCloud later; cloud instances will be provisioned based on it.</p><p>In addition, use the @openai_endpoints decorator from bentovllm_openai.utils (<a href="https://github.com/bentoml/BentoVLLM/tree/main/bentovllm_openai">available here</a>) to set up OpenAI-compatible endpoints. 
This is like giving the Service a universal adapter, allowing it to interact with various clients as if it were an OpenAI service itself.</p><pre>@openai_endpoints(served_model=MODEL_ID)<br>@bentoml.service(<br>    name=&quot;mistral-7b-instruct-service&quot;,<br>    traffic={<br>        &quot;timeout&quot;: 300,<br>    },<br>    resources={<br>        &quot;gpu&quot;: 1,<br>        &quot;gpu_type&quot;: &quot;nvidia-l4&quot;,<br>    },<br>)<br>class VLLM:</pre><p>4. Within the class, set an LLM engine by specifying the model and how many tokens it should generate. Read the <a href="https://docs.vllm.ai/en/latest/dev/engine/async_llm_engine.html">vLLM documentation</a> to learn more about the modules imported here.</p><pre>class VLLM:<br>    def __init__(self) -&gt; None:<br>        from vllm import AsyncEngineArgs, AsyncLLMEngine<br>        ENGINE_ARGS = AsyncEngineArgs(<br>            model=MODEL_ID,<br>            max_model_len=MAX_TOKENS<br>        )<br>        <br>        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)</pre><p>5. To interact with this Service, define an API method using @bentoml.api. 
It serves as the primary interface for processing input prompts and streaming back generated text.</p><pre>    @bentoml.api<br>    async def generate(<br>        self,<br>        # Accept a prompt with a default value; users can override this when calling the API<br>        prompt: str = &quot;Explain superconductors like I&#39;m five years old&quot;,<br>        # Enforce the generated response to be within a specified range using type annotations<br>        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,<br>    ) -&gt; AsyncGenerator[str, None]:<br>        from vllm import SamplingParams<br><br>        # Initialize the parameters for sampling responses from the LLM (maximum tokens in this case)<br>        SAMPLING_PARAM = SamplingParams(max_tokens=max_tokens)<br>        # Format the user&#39;s prompt with the predefined prompt template<br>        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)<br>        # Send the formatted prompt to the LLM engine asynchronously<br>        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)<br> <br>        # Initialize a cursor to track the portion of the text already returned to the user<br>        cursor = 0<br>        async for request_output in stream:<br>            # Extract text from the first output<br>            text = request_output.outputs[0].text<br>            yield text[cursor:]<br>            cursor = len(text)</pre><p>That’s all the code! To run this project with bentoml serve, you need an NVIDIA GPU with at least 16 GB of VRAM.</p><pre>bentoml serve .</pre><p>The server will be active at <a href="http://localhost:3000">http://localhost:3000</a>. 
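</p><p>To call the /generate endpoint from a Python script, you can use nothing but the standard library. The sketch below assumes the Service is running locally on port 3000; the helper names build_request and stream_generate are made up for this example (BentoML also ships its own Python client):</p><pre>import json
from urllib import request

def build_request(prompt, max_tokens=1024):
    # Assemble the same JSON payload that the curl example sends
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return request.Request(
        "http://localhost:3000/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def stream_generate(prompt, max_tokens=1024):
    # Print response chunks as the server streams back generated text
    with request.urlopen(build_request(prompt, max_tokens)) as resp:
        for chunk in resp:
            print(chunk.decode(), end="")

# With the server running:
# stream_generate("Explain superconductors like I'm five years old")</pre><p>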
You can communicate with it by using the curl command:</p><pre>curl -X &#39;POST&#39; \<br>  &#39;http://localhost:3000/generate&#39; \<br>  -H &#39;accept: text/event-stream&#39; \<br>  -H &#39;Content-Type: application/json&#39; \<br>  -d &#39;{<br>  &quot;prompt&quot;: &quot;Explain superconductors like I&#39;\&#39;&#39;m five years old&quot;,<br>  &quot;max_tokens&quot;: 1024<br>}&#39;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1022/0*h0xbP1f-nk7b27DG.gif" /></figure><h3>Deploying the LLM to BentoCloud</h3><p>Deploying LLMs in production often requires significant computational resources, particularly GPUs, which may not be readily available on local machines. Therefore, you can use BentoCloud, a platform designed to simplify the deployment, management, and scaling of machine learning models, including those as resource-intensive as LLMs.</p><p>Before you can deploy this LLM to BentoCloud, you’ll need to:</p><ol><li><strong>Sign up</strong>: If you haven’t already, create an account on BentoCloud for free. Navigate to <a href="https://www.bentoml.com/">the BentoCloud website</a> and follow the sign-up process.</li><li><strong>Log in</strong>: Once your account is set up, <a href="https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html">log in to BentoCloud</a>.</li></ol><p>With your BentoCloud account ready, navigate to your project’s directory where bentofile.yaml is stored (it is already available in the repo you cloned), then run:</p><pre>bentoml deploy .</pre><p>The deployment may take some time. 
When it is complete, you can interact with the LLM server on the BentoCloud console.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QZzS8bvSroVBfqHo.gif" /></figure><h3>More on BentoML and vLLM</h3><p>To learn more about BentoML and vLLM, check out the following resources:</p><ul><li><a href="https://bentoml.com/llm/">LLM Inference Handbook</a></li><li>[Doc] <a href="https://docs.vllm.ai/en/latest/">vLLM documentation</a></li><li>[Doc] <a href="https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html">vLLM inference</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/introducing-bentoml-1-2">Introducing BentoML 1.2</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/deploying-a-text-to-speech-application-with-bentoml">Deploying A Text-To-Speech Application with BentoML</a></li><li>[Blog] <a href="https://dzone.com/articles/deploying-an-image-captioning-server-with-bentoml">Deploying An Image Captioning Server With BentoML</a></li><li>Try <a href="https://www.bentoml.com/">BentoCloud</a> and get $10 in free credits on signup! Experience a serverless platform tailored to simplify the building and management of your AI applications, ensuring both ease of use and scalability.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=180dcb5c6622" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/deploying-a-large-language-model-with-bentoml-and-vllm-180dcb5c6622">Deploying A Large Language Model with BentoML and vLLM</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Navigating the World of Large Language Models]]></title>
            <link>https://medium.com/bentoml/navigating-the-world-of-large-language-models-8f2299c86acc?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/8f2299c86acc</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[bentocloud]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Fri, 22 Mar 2024 03:09:28 GMT</pubDate>
            <atom:updated>2024-07-30T07:57:36.288Z</atom:updated>
            <content:encoded><![CDATA[<p>Over the past year and a half, the AI world has been abuzz with the rapid release of large language models (LLMs), each boasting advancements that push the boundaries of what’s possible with generative AI. The pace at which new models are emerging is breathtaking. Recently, <a href="https://ai.meta.com/blog/meta-llama-3-1/">Meta AI introduced Llama 3.1</a>, with its 405B variant featuring better flexibility, control, and cutting-edge capabilities that can rival the best closed-source models. The very next day, <a href="https://mistral.ai/news/mistral-large-2407/">Mistral launched Mistral Large 2</a>, which competes on par with leading models like GPT-4o, Claude 3 Opus, and Llama 3 405B.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I0N_bbnBiBf_kOPstnZ-0w.jpeg" /></figure><p>These models, powered by an ever-increasing number of parameters and trained on colossal datasets, have made us far more efficient at generating text and writing complex code. However, the sheer number of options available can feel both exciting and daunting. Making informed decisions about which to use — considering output quality, speed, and cost — becomes a challenge.</p><p>The answer lies not just in specification sheets or benchmark scores but in a holistic understanding of what each model brings to the table. In this blog post, we curate a select list of open-source LLMs making waves over the past year. At the same time, we answer some of the most frequently asked questions about them.</p><h3>Llama 3.1</h3><p>Meta AI continues to push the boundaries of open-source AI with the release of Llama 3.1, available in 8B, 70B, and 405B parameter sizes. It can be used across a broad spectrum of tasks, including chatbots and various natural language generation applications. 
Llama 3.1 is the latest addition to the Llama family, which boasts 300 million total downloads across all Llama versions to date.</p><p>Why should you use Llama 3.1:</p><ul><li><strong>Performance</strong>: Based on <a href="https://ai.meta.com/blog/meta-llama-3-1/">Meta AI’s benchmarks</a>, Llama 3.1 8B and 70B demonstrate superior comprehension, reasoning, and general intelligence capabilities compared to other open-source models like Gemma 2 9B IT and Mistral 7B &amp; 8x22B Instruct. Its largest version, 405B, is competitive across a range of tasks with leading foundation models, including GPT-4, GPT-4o, and Claude 3.5 Sonnet.</li><li><strong>Fine-tuning</strong>: With three different sizes, Llama 3.1 is an ideal foundation for a wide range of specialized applications. Users can fine-tune these models to meet the unique needs of specific tasks or industries. This also extends to previous versions of the Llama model family like Llama 2 and Llama 3 (<a href="https://huggingface.co/models?sort=trending&amp;search=Llama">over 45,000 search results for “Llama” in the Hugging Face Model Hub</a>). These fine-tuned models not only save developers significant time and resources but also highlight Llama 3.1’s capacity for customization and improvement.</li><li><strong>Context window</strong>: Llama 3.1 significantly improves upon its predecessors with a large context window of 128k tokens. This enhancement makes it useful for enterprise use cases such as handling long chatbot conversations and processing large documents.</li><li><strong>Safety</strong>: Meta has implemented extensive safety measures for Llama 3.1, including Red Teaming exercises to identify potential risks. <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">According to Meta’s research paper</a>, Llama 3 generally performs better at refusing inappropriate requests, with lower false refusal and violation rates. 
However, they acknowledge that these numbers are not reproducible externally, since the safety benchmarks are internal to Meta, which is why they chose to anonymize the competitors in the tests.</li></ul><p>Challenge with Llama 3.1:</p><ul><li><strong>Resource requirements</strong>: Given its large size, the 405B model requires substantial computational resources to run. Even with 4-bit quantization, the model remains around 200GB and may need multiple A100 GPUs to run effectively, which can be prohibitive for smaller organizations or individuals.</li></ul><p>As it was released only recently, more investigation is needed to fully understand the potential limitations of Llama 3.1.</p><p>Click the following links to deploy Llama 3.1 with BentoML:</p><ul><li><a href="https://github.com/bentoml/BentoVLLM/tree/main/llama3.1-8b-instruct">llama3.1-8b-instruct</a></li><li><a href="https://github.com/bentoml/BentoVLLM/tree/main/llama3.1-70b-instruct-awq">llama3.1-70b-instruct-awq</a></li><li><a href="https://github.com/bentoml/BentoVLLM/tree/main/llama3.1-405b-instruct-awq">llama3.1-405b-instruct-awq</a></li></ul><h3>Mixtral 8x7B</h3><p><a href="https://huggingface.co/mistralai/Mixtral-8x7B-v0.1">Mixtral 8x7B</a>, released by Mistral AI in December 2023, uses a sparse mixture-of-experts architecture. Simply put, it uses many small networks, each specialized in different things. 
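The routing idea can be sketched in miniature. The toy example below is illustrative only — real MoE layers like Mixtral’s use learned gating networks to route each token inside a transformer — but it shows the core mechanic: score all experts, run only the top-k, and mix their outputs by normalized gate weight:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical "experts": each is a small function specialized for one thing.
experts = [
    lambda x: x * 2.0,   # expert 0
    lambda x: x + 10.0,  # expert 1
    lambda x: x ** 2,    # expert 2
    lambda x: -x,        # expert 3
]

def moe_forward(x, gate_scores, top_k=2):
    """Run only the top_k highest-scoring experts; mix outputs by gate weight."""
    weights = softmax(gate_scores)
    top = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in top)  # renormalize over selected experts
    return sum(weights[i] / norm * experts[i](x) for i in top)

# Only 2 of the 4 experts actually run for this input:
y = moe_forward(3.0, gate_scores=[0.1, 2.0, 1.5, -1.0], top_k=2)
```

Because only the selected experts execute, compute per input stays close to that of a much smaller model even though total parameter count is large — the property the surrounding text describes.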
Only a few of these “experts” work on each task, making the process efficient without using the full model’s power every time and thus controlling cost and latency.</p><p>Licensed under the Apache 2.0 license for commercial use, Mixtral 8x7B demonstrates exceptional versatility across various text generation tasks, including code generation, and features a fine-tuned variant, Mixtral 8x7B Instruct, optimized for chat applications.</p><p>Why should you use Mixtral 8x7B:</p><ul><li><strong>State-of-the-art performance:</strong> Mixtral 8x7B outperforms leading models like Llama 2 70B and GPT-3.5 across many benchmarks.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0Hod3qcaNxvJQ70e.png" /><figcaption>Source: <a href="https://mistral.ai/news/mixtral-of-experts/">https://mistral.ai/news/mixtral-of-experts/</a></figcaption></figure><ul><li><strong>Long context window</strong>: Mixtral 8x7B’s 32k-token context window significantly enhances its ability to handle lengthy conversations and complex documents. 
This enables the model to handle a variety of tasks, from detailed content creation to sophisticated retrieval-augmented generation, making it highly versatile for both research and commercial applications.</li><li><strong>Optimized for efficiency</strong>: Despite its large parameter count, it offers cost-effective inference, comparable to much smaller models.</li><li><strong>Versatile language support</strong>: Mixtral 8x7B handles multiple languages (French, German, Spanish, Italian, and English), making it ideal for global applications.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Lf-8oOJ2wULM799s.png" /><figcaption>Source: <a href="https://mistral.ai/news/mixtral-of-experts/">https://mistral.ai/news/mixtral-of-experts/</a></figcaption></figure><p>Challenges with Mixtral 8x7B:</p><ul><li><strong>Lack of built-in moderation mechanisms</strong>: Without native moderation, there may be a risk of generating inappropriate or harmful content, especially when the model is prompted with sensitive or controversial inputs. Businesses aiming to deploy this model in environments where content control and safety are important should be careful about this.</li><li><strong>Hardware requirements</strong>: The entire parameter set requires substantial RAM for operation, which could limit its use on lower-end systems.</li></ul><p>Quickly serve a Mixtral 8x7B server <a href="https://github.com/bentoml/OpenLLM">with OpenLLM</a> or <a href="https://github.com/bentoml/BentoVLLM/tree/main/mixtral-8x7b-instruct">self-host it with BentoML</a>.</p><h3>Zephyr 7B</h3><p>Zephyr 7B, built on the base of Mistral 7B, has been fine-tuned to achieve better alignment with human intent, outperforming its counterparts in specific tasks and benchmarks. 
At the time of its release, Zephyr-7B-β was the <a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-beta#performance">highest-ranked 7B chat model</a> on the <a href="https://huggingface.co/spaces/lmsys/mt-bench">MT-Bench</a> and <a href="https://tatsu-lab.github.io/alpaca_eval/">AlpacaEval</a> benchmarks.</p><p>Zephyr 7B’s training involves refining its abilities through exposure to a vast array of language patterns and contexts. This process allows it to comprehend complex queries and generate coherent, contextually relevant text, making it a versatile tool for content creation, customer support, and more.</p><p>Why should you use Zephyr 7B:</p><ul><li><strong>Efficiency and performance</strong>: Despite its smaller size relative to giants like GPT-3.5 or Llama 2 70B, Zephyr 7B delivers <a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-beta#performance">comparable or superior performance</a>, especially in tasks requiring a deep understanding of human intent.</li><li><strong>Multilingual capabilities</strong>: Trained on a diverse dataset, Zephyr 7B supports text generation and understanding across multiple languages, including but not limited to English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, and Korean.</li><li><strong>Task flexibility</strong>: Zephyr 7B excels in performing a broad spectrum of language-related tasks, from text generation and summarization to translation and sentiment analysis. 
This positions it as a highly adaptable tool across numerous applications.</li></ul><p>Challenges with Zephyr 7B:</p><ul><li><strong>Intent alignment</strong>: While Zephyr 7B has made some progress in aligning with human intent, continuous evaluation and adjustment may be necessary to ensure its outputs meet specific user needs or ethical guidelines.</li><li><strong>Adaptation for specialized tasks</strong>: Depending on the application, additional fine-tuning may be required to optimize Zephyr 7B’s performance for specialized tasks, like reasoning, math, and coding.</li></ul><h3>SOLAR 10.7B</h3><p>SOLAR 10.7B is a large language model with 10.7 billion parameters, using an upscaling technique known as depth up-scaling (DUS). This simplifies the scaling process without complex training or inference adjustments.</p><p>SOLAR 10.7B undergoes two fine-tuning stages: <a href="https://ar5iv.labs.arxiv.org/html/2312.15166">instruction tuning and alignment tuning</a>. Instruction tuning enhances its ability to follow instructions in a QA format. 
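As a rough illustration of what “QA format” means here, instruction-tuning datasets pair a user request with a desired answer, rendered into a single training string by a prompt template. The template below is hypothetical — actual SOLAR training templates differ — but shows the general shape:

```python
# Hypothetical QA-style template for one instruction-tuning record.
# Real instruction-tuning datasets and prompt templates vary by project.
def format_example(instruction: str, response: str) -> str:
    """Render one (instruction, response) pair as a single training string."""
    return f"### Question:\n{instruction}\n\n### Answer:\n{response}"

sample = format_example(
    "Explain depth up-scaling in one sentence.",
    "Depth up-scaling grows a model by duplicating and stacking layers "
    "of a smaller base model, avoiding complex architectural changes.",
)
```

Training on many such pairs is what teaches the model to treat free-form user text as an instruction to follow rather than text to merely continue.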
Alignment tuning further refines the model to align more closely with human preferences or strong AI outputs, utilizing both open-source datasets and a synthesized math-focused alignment dataset.</p><p>Why should you use SOLAR 10.7B:</p><ul><li><strong>Versatility</strong>: Fine-tuned variants like SOLAR 10.7B-Instruct offer enhanced instruction-following capabilities, making the model suitable for a broad range of applications.</li><li><strong>Superior NLP performance</strong>: SOLAR 10.7B demonstrates exceptional performance in NLP tasks, <a href="https://huggingface.co/upstage/SOLAR-10.7B-v1.0#evaluation-results">outperforming other pre-trained models like Llama 2 and Mistral 7B</a>.</li><li><strong>Fine-tuning:</strong> With solid baseline capabilities, SOLAR 10.7B is an ideal model for fine-tuning.</li></ul><p>Challenges with SOLAR 10.7B:</p><ul><li><strong>Resource requirements</strong>: The model might require substantial computational resources for training and fine-tuning.</li><li><strong>Bias concerns</strong>: The model’s outputs may not always align with ethical or fair use principles.</li></ul><h3>Code Llama</h3><p>Built on Llama 2, Code Llama is an advanced LLM specifically fine-tuned for coding tasks. 
It’s engineered to understand and generate code across several popular programming languages, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, and Bash, making it an ideal tool for developers.</p><p>The model is available in four sizes (7B, 13B, 34B, and 70B parameters) to accommodate various use cases, from low-latency applications like real-time code completion with the 7B and 13B models to more comprehensive code assistance provided by the 34B and 70B models.</p><p>Why should you use Code Llama:</p><ul><li><strong>Large input contexts</strong>: Code Llama can handle inputs with up to 100,000 tokens, allowing for better understanding and manipulation of large codebases.</li><li><strong>Diverse applications</strong>: It’s designed for a range of applications such as code generation, code completion, debugging, and even discussing code, catering to different needs within the software development lifecycle.</li><li><strong>Performance</strong>: With models trained on extensive datasets (up to 1 trillion tokens for the 70B model), Code Llama can provide more accurate and contextually relevant code suggestions. 
The Code Llama — Instruct 70B model even <a href="https://ai.meta.com/blog/code-llama-large-language-model-coding/">scores 67.8 on the HumanEval test, higher than GPT-4 (67.0)</a>.</li></ul><p>Challenges with Code Llama:</p><ul><li><strong>Hardware requirements</strong>: Larger models (34B and 70B) may require significant computational resources for optimal performance, potentially limiting access for individuals or organizations with limited hardware.</li><li><strong>Potential for misalignment</strong>: While it has been fine-tuned for improved safety and alignment with human intent, there’s always a risk of generating inappropriate or malicious code if not properly supervised.</li><li><strong>Not for general natural language tasks</strong>: Optimized for coding tasks, Code Llama is not recommended for broader natural language processing applications. Note that only Code Llama Instruct is specifically fine-tuned to better respond to natural language prompts.</li></ul><h3>Why should I choose open-source models over commercial ones?</h3><p>All the language models listed in this blog post are open-source, so I believe this is the very first question to answer. In fact, the choice between open-source and commercial models often depends on specific needs and considerations, but the former may be a better option in the following aspects:</p><ul><li><strong>High controllability</strong>: Open-source models offer a high degree of control, as users can access and fine-tune the model as needed. This allows for customization and adaptability to specific tasks or requirements that might not be possible with commercial models.</li><li><strong>Data security:</strong> Open-source models can be run locally or within a private cloud infrastructure, giving users more control over data security. 
With commercial models, there may be concerns about data privacy since the data often needs to be sent to the provider’s servers for processing.</li><li><strong>Cost-effectiveness:</strong> Utilizing open-source models can be more cost-effective, particularly when considering the cost of API calls or tokens required for commercial offerings. Open-source models can be deployed without these recurring costs, though there may be investments needed for infrastructure and maintenance.</li><li><strong>Community and collaboration:</strong> Open-source models benefit from the collective expertise of the community, leading to rapid improvements, bug fixes, and new features driven by collaborative development.</li><li><strong>No vendor lock-in:</strong> Relying on open-source models eliminates dependence on a specific vendor’s roadmap, pricing changes, or service availability.</li></ul><h3>How do specialized LLMs compare to general-purpose models?</h3><p>Specialized LLMs like Code Llama offer a focused performance boost in their areas of specialization. They are designed to excel at specific tasks, providing outputs that are more accurate, relevant, and useful for those particular applications.</p><p>In contrast, general-purpose models like Llama 2 are built to handle a wide range of tasks. While they may not match the task-specific accuracy of specialized models, their broad knowledge base and adaptability make them helpful tools for a variety of applications.</p><p>The choice between specialized and general-purpose LLMs depends on the specific requirements of the task. Specialized models are preferable for high-stakes or niche tasks where precision is more important, while general-purpose models offer better flexibility and broad utility.</p><h3>What are the ethical considerations in deploying LLMs at scale?</h3><p>The ethical deployment of LLMs requires a careful examination of issues such as bias, transparency, accountability, and the potential for misuse. 
Ensuring that LLMs do not perpetuate existing biases present in their training data is a significant challenge, requiring ongoing vigilance and refinement of training methodologies. Transparency about how LLMs make decisions and the data they are trained on is crucial for building trust and accountability, particularly in high-stakes applications.</p><h3>What should I consider when deploying LLMs in production?</h3><p>Deploying LLMs in production can be a nuanced process. Here are some strategies to consider:</p><ol><li><strong>Choose the right model size</strong>: Balancing the model size with your application’s latency and throughput requirements is essential. Smaller models can offer faster responses and reduced computational costs, while larger models may provide more accurate and nuanced outputs.</li><li><strong>Infrastructure considerations</strong>: Ensure that your infrastructure can handle the computational load. Using cloud services with GPU support or optimizing models with quantization and pruning techniques can help manage resource demands. A serverless platform with autoscaling capabilities can be a good choice for teams without infrastructure expertise.</li><li><strong>Plan for scalability</strong>: Your deployment strategy should allow for scaling up or down based on demand. Containerization with technologies like Docker and orchestration with Kubernetes can support scalable deployments.</li><li><strong>Build robust logging and observability</strong>: Implementing comprehensive logging and observability tools will help in monitoring the system’s health and quickly diagnosing issues as they arise.</li><li><strong>Use APIs for modularity</strong>: APIs can abstract the complexity of model hosting, scaling, and management. 
They can also facilitate integration with existing systems and allow for easier updates and maintenance.</li><li><strong>Consider model serving frameworks</strong>: Frameworks like BentoML, TensorFlow Serving, TorchServe, or ONNX Runtime can simplify deployment, provide version control, and handle request batching for efficiency.</li></ol><h3>Final thoughts</h3><p>As we navigate the expanding universe of large language models, it’s clear that their potential is only just beginning to be tapped. The rapid innovation in this field signifies a future where AI can contribute even more profoundly to our work and creative endeavors.</p><p>Moving forward, I believe it’s vital to continue promoting AI models in open-source communities, pushing for advances that benefit all and ensuring responsible usage of these powerful tools. As we do so, hopefully, we’ll find the right balance that maximizes the benefits of LLMs for society while mitigating their risks.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f2299c86acc" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/navigating-the-world-of-large-language-models-8f2299c86acc">Navigating the World of Large Language Models</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploying Stable Diffusion XL with Latent Consistency Model LoRAs on BentoCloud]]></title>
            <link>https://medium.com/bentoml/deploying-stable-diffusion-xl-with-latent-consistency-model-loras-on-bentocloud-f89a84408f19?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/f89a84408f19</guid>
            <category><![CDATA[latent-consistency-model]]></category>
            <category><![CDATA[lora]]></category>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Thu, 29 Feb 2024 00:42:39 GMT</pubDate>
            <atom:updated>2024-05-02T13:19:00.519Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://huggingface.co/papers/2310.04378">Latent Consistency Models (LCM)</a> can be used to streamline the image generation process, particularly for models like Stable Diffusion (SD) and SDXL. In this <a href="https://huggingface.co/blog/lcm_lora">Hugging Face blog post</a>, the authors introduced a new way of integrating <a href="https://arxiv.org/abs/2311.05556">LCM LoRAs</a> into <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">SDXL</a>, which allows the model to achieve high-quality inference in just 2 to 8 steps, a significant reduction from its original requirement. This adaptation, or LCM LoRA, represents a universal acceleration module for SD models, making the inference process faster and more accessible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dZJHPVJyhdz3QQjmZytVfQ.jpeg" /></figure><p>In this blog post, I will talk about how to wrap SDXL with LCM LoRAs into a BentoML Service and deploy it on BentoCloud. This allows you to better run and manage an image generation application in production.</p><h3>Before you begin</h3><p>The source code of this image generation application powered by SDXL with LCM LoRAs is stored in the <a href="https://github.com/bentoml/BentoLCM">BentoLCM repo</a>. Clone it and install all the required packages.</p><pre>git clone https://github.com/bentoml/BentoLCM.git<br>cd BentoLCM<br>pip install -r requirements.txt</pre><p>Note that this project uses BentoML 1.2.</p><h3>Defining a BentoML Service</h3><p>The key to wrapping the SDXL model and LCM LoRAs with BentoML is to create a BentoML Service. By convention, it is defined in a service.py file.</p><p>First, import the necessary libraries. Add these lines at the top of the service.py file:</p><pre>import bentoml<br>from PIL.Image import Image # For handling images generated by SDXL</pre><p>Specify the models to use. 
As mentioned above, we are using an <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">SDXL model</a> and an <a href="https://huggingface.co/latent-consistency/lcm-lora-sdxl">LCM LoRA</a>. In addition, set a sample prompt so that we can test the application later:</p><pre>model_id = &quot;stabilityai/stable-diffusion-xl-base-1.0&quot;<br>lcm_lora_id = &quot;latent-consistency/lcm-lora-sdxl&quot;<br><br>sample_prompt = &quot;close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux&quot;</pre><p>Now, let’s define a BentoML Service called LatentConsistency. It loads the models and defines an API endpoint for generating images based on text prompts.</p><p>Begin by defining a Python class and its constructor. Starting from BentoML 1.2, we use the @bentoml.service decorator to mark a Python class as a BentoML Service. As the application will be deployed on BentoCloud later, set <a href="https://docs.bentoml.com/en/latest/guides/configurations.html">configurations</a> like resources to specify the GPU to use on BentoCloud.</p><pre># Annotate the class as a BentoML Service with the decorator<br>@bentoml.service(<br>    traffic={&quot;timeout&quot;: 300},<br>    workers=1,<br>    resources={<br>        &quot;gpu&quot;: 1,<br>        &quot;gpu_type&quot;: &quot;nvidia-l4&quot;,<br>    },<br>)<br>class LatentConsistency:<br>    def __init__(self) -&gt; None:<br>        from diffusers import DiffusionPipeline, LCMScheduler<br>        import torch<br>        <br>        # Load the text-to-image model and the LCM LoRA weights<br>        self.lcm_txt2img = DiffusionPipeline.from_pretrained(<br>            model_id,<br>            torch_dtype=torch.float16,<br>            variant=&quot;fp16&quot;,<br>        )<br>        self.lcm_txt2img.load_lora_weights(lcm_lora_id)<br>        # Change the scheduler to the LCMScheduler<br>        self.lcm_txt2img.scheduler = 
LCMScheduler.from_config(self.lcm_txt2img.scheduler.config)<br>        # Move the model to the GPU for faster inference<br>        self.lcm_txt2img.to(device=&quot;cuda&quot;, dtype=torch.float16)</pre><p>Next, define an API endpoint in your class that takes a text prompt and generates an image:</p><pre>    @bentoml.api<br>    def txt2img(<br>            self,<br>            prompt: str = sample_prompt,<br>            num_inference_steps: int = 4,<br>            guidance_scale: float = 1.0,<br>    ) -&gt; Image:<br>        image = self.lcm_txt2img(<br>            prompt=prompt,<br>            num_inference_steps=num_inference_steps,<br>            guidance_scale=guidance_scale,<br>        ).images[0]<br>        return image</pre><p>This method defines an endpoint txt2img that accepts a prompt, the number of inference steps, and a guidance scale. It uses Python type annotations to specify the types of parameters it expects and the type of value it returns. The values specified in the code are the defaults provided to users. While the endpoint itself returns a PIL Image object, a BentoML client receives the result as a Path object pointing to the saved image file.</p><p>Note that this example uses 4 inference steps to generate images. This parameter impacts the quality and generation time of the resulting image. See <a href="https://huggingface.co/blog/lcm_lora">this Hugging Face blog post</a> to learn more.</p><p>That’s all! 
Here is the complete Service code in service.py for your reference (available in the repo cloned):</p><pre>import bentoml<br>from PIL.Image import Image<br><br>model_id = &quot;stabilityai/stable-diffusion-xl-base-1.0&quot;<br>lcm_lora_id = &quot;latent-consistency/lcm-lora-sdxl&quot;<br>sample_prompt = &quot;close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux&quot;<br>@bentoml.service(<br>    traffic={&quot;timeout&quot;: 300},<br>    workers=1,<br>    resources={<br>        &quot;gpu&quot;: 1,<br>        &quot;gpu_type&quot;: &quot;nvidia-l4&quot;,<br>    },<br>)<br>class LatentConsistency:<br>    def __init__(self) -&gt; None:<br>        from diffusers import DiffusionPipeline, LCMScheduler<br>        import torch<br>        self.lcm_txt2img = DiffusionPipeline.from_pretrained(<br>            model_id,<br>            torch_dtype=torch.float16,<br>            variant=&quot;fp16&quot;,<br>        )<br>        self.lcm_txt2img.load_lora_weights(lcm_lora_id)<br>        self.lcm_txt2img.scheduler = LCMScheduler.from_config(self.lcm_txt2img.scheduler.config)<br>        self.lcm_txt2img.to(device=&quot;cuda&quot;, dtype=torch.float16)<br>    @bentoml.api<br>    def txt2img(<br>            self,<br>            prompt: str = sample_prompt,<br>            num_inference_steps: int = 4,<br>            guidance_scale: float = 1.0,<br>    ) -&gt; Image:<br>        image = self.lcm_txt2img(<br>            prompt=prompt,<br>            num_inference_steps=num_inference_steps,<br>            guidance_scale=guidance_scale,<br>        ).images[0]<br>        return image</pre><p>Before deploying, let’s test this Service locally using the BentoML CLI.</p><pre>bentoml serve service:LatentConsistency</pre><p>The command starts the Service at <a href="http://localhost:3000">http://localhost:3000</a>. You can interact with it using the Swagger UI. Alternatively, create a BentoML client as below. 
As the response is a Path object, you can specify a custom directory to save the image.</p><pre>import bentoml<br>from pathlib import Path<br><br>with bentoml.SyncHTTPClient(&quot;http://localhost:3000&quot;) as client:<br>    result_path = client.txt2img(<br>        guidance_scale=1,<br>        num_inference_steps=4,<br>        prompt=&quot;close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux&quot;,<br>    )<br>    destination_path = Path(&quot;/path/to/save/image.png&quot;)<br>    result_path.rename(destination_path)</pre><p>An example image returned:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Cc-0tZ9zwB_bJfqz0ipbtQ.png" /></figure><h3>Deploying to BentoCloud</h3><p><a href="https://www.bentoml.com/cloud">BentoCloud</a> provides the underlying infrastructure optimized for running and managing AI applications on the cloud. To deploy this project to BentoCloud, make sure you <a href="https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html">have logged in</a>, then run bentoml deploy in the cloned repo. I added the --scaling-min and --scaling-max flags here to tell BentoCloud the scaling limits of this Deployment, which means it will be scaled within this range according to the traffic received.</p><pre>bentoml deploy . --scaling-min 1 --scaling-max 3</pre><p>After the Deployment is ready, visit its details page on the BentoCloud console and interact with it on the <strong>Playground</strong> tab. 
This time I changed the prompt as below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gBDgxaJKR3NdI5klMQJ91w.png" /></figure><p>The generated image can be previewed or downloaded.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1a7lAGRguoVNTgTuC0yvYg.png" /></figure><h3>Conclusion</h3><p>By wrapping an SDXL model enhanced with LCM LoRA in a BentoML Service, you can rapidly deploy efficient, high-quality image generation AI applications. While the LCM LoRA improves computational efficiency, BentoCloud helps streamline deployment and scaling, ensuring that your application remains responsive regardless of demand.</p><p>In future blog posts, we will see more production-ready AI projects deployed with BentoML and BentoCloud. Happy coding ⌨️!</p><h3>More on BentoML</h3><p>To learn more about BentoML, check out the following resources:</p><ul><li>[Blog] <a href="https://www.bentoml.com/blog/introducing-bentoml-1-2">Introducing BentoML 1.2</a></li><li>[Blog] <a href="https://medium.com/bentoml/deploying-a-text-to-speech-application-with-bentoml-c1e6cbda63e6">Deploying A Text-To-Speech Application with BentoML</a></li><li>[Blog] <a href="https://dzone.com/articles/deploying-an-image-captioning-server-with-bentoml">Deploying An Image Captioning Server With BentoML</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/byoc-to-bentocloud-privacy-flexibility-and-cost-efficiency-in-one-package">BYOC to BentoCloud: Privacy, Flexibility, and Cost Efficiency in One Package</a></li><li>Try <a href="https://www.bentoml.com">BentoCloud</a> and get $10 in free credits on signup! 
Experience a serverless platform tailored to simplify the building and management of your AI applications, ensuring both ease of use and scalability.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f89a84408f19" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/deploying-stable-diffusion-xl-with-latent-consistency-model-loras-on-bentocloud-f89a84408f19">Deploying Stable Diffusion XL with Latent Consistency Model LoRAs on BentoCloud</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploying A Text-To-Speech Application with BentoML]]></title>
            <link>https://medium.com/bentoml/deploying-a-text-to-speech-application-with-bentoml-c1e6cbda63e6?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/c1e6cbda63e6</guid>
            <category><![CDATA[bentocloud]]></category>
            <category><![CDATA[text-to-speech]]></category>
            <category><![CDATA[bentoml]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Thu, 15 Feb 2024 02:52:41 GMT</pubDate>
            <atom:updated>2024-05-02T13:19:31.424Z</atom:updated>
            <content:encoded><![CDATA[<p>Text-to-speech (TTS) technology bridges the gap between written language and its spoken form. By converting text into lifelike speech, TTS enhances user experience across various applications, from aiding visually impaired individuals to providing voice responses in virtual assistants. As developers seek to integrate TTS models into their projects, the process of deploying and managing them efficiently becomes crucial.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s-phqAYC1wzS6RWdh43eJQ.jpeg" /></figure><p>In this blog post, I will guide you through the steps of deploying a text-to-speech application using BentoML and BentoCloud, powered by the <a href="https://huggingface.co/coqui/XTTS-v2">XTTS model</a>.</p><h3>Before you begin</h3><p>I recommend you create a virtual environment first for dependency isolation.</p><pre>python -m venv bentoxtts<br>source bentoxtts/bin/activate</pre><p>You can find all the code of this project in the <a href="https://github.com/bentoml/BentoXTTS">BentoXTTS</a> repo. Clone the repo and install its dependencies. Note that this project uses BentoML 1.2.</p><pre>git clone https://github.com/bentoml/BentoXTTS.git<br>cd BentoXTTS<br>pip install -r requirements.txt</pre><h3>v1: Creating a basic text-to-speech script</h3><p>Let’s first see what the code looks like without BentoML. Initially, you might start with a simple script that leverages a text-to-speech model to convert text into audio. This basic version directly interacts with the TTS API without considering deployment or service architecture. 
This is the example code I found in the <a href="https://huggingface.co/coqui/XTTS-v2">XTTS</a> Hugging Face repo:</p><pre>from TTS.api import TTS<br><br># Initialize the TTS model with GPU support<br>tts = TTS(&quot;tts_models/multilingual/multi-dataset/xtts_v2&quot;, gpu=True)<br><br># Generate speech from text and save it to a file<br>tts.tts_to_file(text=&quot;It took me quite a long time to develop a voice, and now that I have it I&#39;m not going to be silent.&quot;,<br>                file_path=&quot;output.wav&quot;,<br>                speaker_wav=&quot;/path/to/target/speaker.wav&quot;,<br>                language=&quot;en&quot;)</pre><p>The code should work well but you can’t expose it directly to your users. You need to think about how they can easily interact with it.</p><h3>v2: Integrating BentoML for serving</h3><p>To turn this script into a deployable service, you need to encapsulate the functionality somewhere and preferably expose it as an API endpoint. This is where BentoML comes in.</p><p>The first thing to do is to create a BentoML <a href="https://docs.bentoml.com/en/latest/guides/services.html">Service</a>. Starting from BentoML 1.2, you use the @bentoml.service decorator to annotate a Python class as a BentoML Service. You can add <a href="https://docs.bentoml.com/en/latest/guides/configurations.html">configurations</a> for it to customize the runtime behavior of your Service.</p><p>Let’s call this class XTTS. You can add the tts = TTS(&quot;tts_models/multilingual/multi-dataset/xtts_v2&quot;, gpu=True) part in the v1 code for initialization in this class. 
Now, you may have something like this:</p><pre>import bentoml<br>import torch<br>from TTS.api import TTS<br><br>MODEL_ID = &quot;tts_models/multilingual/multi-dataset/xtts_v2&quot;<br><br># Use the decorator to mark a class as a BentoML Service<br>@bentoml.service(<br>    traffic={&quot;timeout&quot;: 300} # The maximum duration (in seconds) that the Service will wait for a response before timing out.<br>)<br>class XTTS:<br>    def __init__(self) -&gt; None:<br>        # Initialize the TTS model with GPU support based on system availability<br>        self.tts = TTS(MODEL_ID, gpu=torch.cuda.is_available())</pre><p>Next, let’s continue to define an API endpoint. This involves specifying a method (for example, synthesize) within the Service class that will handle requests. In BentoML, you use the @bentoml.api decorator to expose this method as a web endpoint. You can specify the types of input/output that the method will support using type annotations and add sample values as needed. This ensures that the data received by the Service is correctly typed and that users understand what data to provide.</p><p>For this example, you can let the model accept inputs of prompt text and language code and add samples like this:</p><pre>sample_input_data = {<br>    &#39;text&#39;: &#39;It took me quite a long time to develop a voice and now that I have it I am not going to be silent.&#39;,<br>    &#39;language&#39;: &#39;en&#39;,<br>}<br><br>@bentoml.service(<br>    traffic={&quot;timeout&quot;: 300}<br>)<br>class XTTS:<br>    def __init__(self) -&gt; None:<br>        self.tts = TTS(MODEL_ID, gpu=torch.cuda.is_available())<br><br>    @bentoml.api<br>    def synthesize(<br>            self,<br>            text: str = sample_input_data[&quot;text&quot;],<br>            lang: str = sample_input_data[&quot;language&quot;],<br>    ):</pre><p>With input logic in place, you can proceed to define the output logic of the synthesize method. 
This involves determining where the synthesized audio file will be stored and handling a sample path.</p><pre>    ...<br>    @bentoml.api<br>    def synthesize(<br>            self,<br>            context: bentoml.Context,<br>            text: str = sample_input_data[&quot;text&quot;],<br>            lang: str = sample_input_data[&quot;language&quot;],<br>    ) -&gt; t.Annotated[Path, bentoml.validators.ContentType(&#39;audio/*&#39;)]:<br>        output_path = os.path.join(context.temp_dir, &quot;output.wav&quot;)<br>        sample_path = &quot;./female.wav&quot;<br>        if not os.path.exists(sample_path):<br>            sample_path = &quot;./src/female.wav&quot;</pre><p>The output logic here tells BentoML that the method returns a path to a file (Path) and that the file is of an audio type (ContentType(&#39;audio/*&#39;)). This guides BentoML in handling the file appropriately when sending it over the network, ensuring that clients understand the format of the data they receive.</p><p>In addition, the Service uses context.temp_dir to create a temporary directory for the output file output.wav and a path to store a sample speaker file (it already exists in the project you cloned). 
If the sample is not found in the default location, it attempts to locate it under a secondary path.</p><p>Finally, you can integrate the TTS model’s logic to synthesize the audio file based on the input text and language, storing the result in the specified output path.</p><pre>sample_input_data = {<br>    &#39;text&#39;: &#39;It took me quite a long time to develop a voice and now that I have it I am not going to be silent.&#39;,<br>    &#39;language&#39;: &#39;en&#39;,<br>}<br><br>    @bentoml.api<br>    def synthesize(<br>            self,<br>            context: bentoml.Context,<br>            text: str = sample_input_data[&quot;text&quot;],<br>            lang: str = sample_input_data[&quot;language&quot;],<br>    ) -&gt; t.Annotated[Path, bentoml.validators.ContentType(&#39;audio/*&#39;)]:<br>        output_path = os.path.join(context.temp_dir, &quot;output.wav&quot;)<br>        sample_path = &quot;./female.wav&quot;<br>        if not os.path.exists(sample_path):<br>            sample_path = &quot;./src/female.wav&quot;<br><br>        self.tts.tts_to_file(<br>            text,<br>            file_path=output_path,<br>            speaker_wav=sample_path,<br>            language=lang,<br>            split_sentences=True,<br>        )<br>        return Path(output_path)</pre><p>This completes the definition of the synthesize method, which now fully integrates the TTS functionality within a BentoML Service, exposing it as an API endpoint.</p><p>Combining all the steps, you have the complete Service definition in service.py as follows (also available <a href="https://github.com/bentoml/BentoXTTS/blob/main/service.py">here on GitHub</a>):</p><pre>from __future__ import annotations<br><br>import os<br>import typing as t<br>from pathlib import Path<br><br>import bentoml<br><br>MODEL_ID = &quot;tts_models/multilingual/multi-dataset/xtts_v2&quot;<br><br>sample_input_data = {<br>    &#39;text&#39;: &#39;It took me quite a long time to develop a voice and now that I 
have it I am not going to be silent.&#39;,<br>    &#39;language&#39;: &#39;en&#39;,<br>}<br><br>@bentoml.service(<br>    traffic={&quot;timeout&quot;: 300}<br>)<br>class XTTS:<br>    def __init__(self) -&gt; None:<br>        import torch<br>        from TTS.api import TTS<br><br>        self.tts = TTS(MODEL_ID, gpu=torch.cuda.is_available())<br>    <br>    @bentoml.api<br>    def synthesize(<br>            self,<br>            context: bentoml.Context,<br>            text: str = sample_input_data[&quot;text&quot;],<br>            lang: str = sample_input_data[&quot;language&quot;],<br>    ) -&gt; t.Annotated[Path, bentoml.validators.ContentType(&#39;audio/*&#39;)]:<br>        output_path = os.path.join(context.temp_dir, &quot;output.wav&quot;)<br>        sample_path = &quot;./female.wav&quot;<br>        if not os.path.exists(sample_path):<br>            sample_path = &quot;./src/female.wav&quot;<br><br>        self.tts.tts_to_file(<br>            text,<br>            file_path=output_path,<br>            speaker_wav=sample_path,<br>            language=lang,<br>            split_sentences=True,<br>        )<br>        return Path(output_path)</pre><p>Compared with the v1 code, the v2 code mainly does the following two things:</p><ul><li>Create a BentoML Service to wrap the model.</li><li>Create an API endpoint for the Service with custom input and output logic.</li></ul><p>For the v1 code, you simply copy and paste it in this service.py file.</p><p>To start this BentoML Service locally, run the following command. 
You may need to set the environment variable COQUI_TOS_AGREED=1 to agree to the terms of Coqui TTS.</p><pre>$ COQUI_TOS_AGREED=1 bentoml serve service:XTTS<br><br>2024-02-08T06:04:10+0000 [WARNING] [cli] Converting &#39;XTTS&#39; to lowercase: &#39;xtts&#39;.<br>2024-02-08T06:04:10+0000 [INFO] [cli] Starting production HTTP BentoServer from &quot;service:XTTS&quot; listening on http://localhost:3000 (Press CTRL+C to quit)</pre><p>You can now interact with the Service at <a href="http://localhost:3000">http://localhost:3000</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WeUUgmHQZQ_9D8QPWTfFig.png" /></figure><p>The expected output is a synthesized audio file of the prompt text based on <a href="https://github.com/bentoml/BentoXTTS/blob/main/female.wav">this sample</a>. The resulting speech mimics the characteristics of the sample voice.</p><h3>Deploying the project in production</h3><p>To run the TTS model in production, I recommend you deploy it to BentoCloud, as the serverless platform can manage the underlying infrastructure for you. You only need to focus on your application development.</p><p>First, add resource requirements in the @bentoml.service decorator using the resources field. This allows BentoCloud to automatically schedule the most appropriate instances for deployment.</p><pre>@bentoml.service(<br>    resources={<br>        &quot;gpu&quot;: 1,<br>        &quot;memory&quot;: &quot;8Gi&quot;,<br>    },<br>    traffic={&quot;timeout&quot;: 300},<br>)<br>class XTTS:<br>    def __init__(self) -&gt; None:</pre><p>Next, you need a bentofile.yaml file to define the build configurations for packaging this project into a Bento. The bentofile.yaml for this project is already in <a href="https://github.com/bentoml/BentoXTTS/blob/main/bentofile.yaml">the project directory</a>.</p><p>Lastly, there is no need to manually build a Bento; simply deploy the project by running bentoml deploy. 
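Once the Service is running, either locally or on BentoCloud, clients can call the synthesize endpoint over plain HTTP. Below is a minimal client sketch using only the Python standard library; the endpoint path and JSON field names follow the Service defined above, and the default base URL assumes the local server started earlier (adjust it for your own Deployment as needed):

```python
import json
from urllib import request

# JSON payload matching the synthesize endpoint's parameters
payload = {
    "text": "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "lang": "en",
}

def synthesize(base_url: str = "http://localhost:3000") -> bytes:
    """POST the payload to /synthesize and return the raw audio bytes."""
    req = request.Request(
        f"{base_url}/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=300) as resp:
        return resp.read()  # audio/* response body

```

With the server running, calling synthesize() returns the WAV bytes, which you can write to a file such as output.wav.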
The Bento will be built automatically, then pushed and deployed to BentoCloud. You can also set additional configs like scaling and authorization. See <a href="https://docs.bentoml.com/en/latest/bentocloud/how-tos/create-deployments.html">Create Deployments</a> to learn more.</p><pre>bentoml deploy .</pre><p><strong><em>Note</em></strong><em>: You need to </em><a href="https://www.bentoml.com/cloud"><em>gain access to BentoCloud</em></a><em> and </em><a href="https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html"><em>log in</em></a><em> first.</em></p><p>Once the Deployment is up and running, you can interact with it on the BentoCloud console. I used the <strong>Form</strong> tab to submit a request and the result was displayed on the right.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F2cv8fhl37ZOhsum3A0HUQ.png" /><figcaption>BentoCloud playground</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9skuWsNPZZdUs5OCCs6qNw.png" /><figcaption>Deployment info</figcaption></figure><h3>Conclusion</h3><p>In this blog post, we’ve explored how BentoML and BentoCloud simplify the deployment of machine learning models, specifically focusing on creating a text-to-speech Service. I encourage you to experiment with the new concepts of BentoML discussed in this post, explore the capabilities of BentoCloud, and consider creating other innovative AI projects by combining both tools. 
Happy coding ⌨️!</p><h3>More on BentoML</h3><p>To learn more about BentoML, check out the following resources:</p><ul><li>[Blog] <a href="https://www.bentoml.com/blog/introducing-bentoml-1-2">Introducing BentoML 1.2</a></li><li>[Blog] <a href="https://medium.com/bentoml/deploying-stable-diffusion-xl-with-latent-consistency-model-loras-on-bentocloud-f89a84408f19">Deploying Stable Diffusion XL with Latent Consistency Model LoRAs on BentoCloud</a></li><li>[Blog] <a href="https://dzone.com/articles/deploying-an-image-captioning-server-with-bentoml">Deploying An Image Captioning Server With BentoML</a></li><li>[Blog] <a href="https://www.bentoml.com/blog/byoc-to-bentocloud-privacy-flexibility-and-cost-efficiency-in-one-package">BYOC to BentoCloud: Privacy, Flexibility, and Cost Efficiency in One Package</a></li><li>Try <a href="https://www.bentoml.com">BentoCloud</a> and get $10 in free credits on signup! Experience a serverless platform tailored to simplify the building and management of your AI applications, ensuring both ease of use and scalability.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c1e6cbda63e6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/deploying-a-text-to-speech-application-with-bentoml-c1e6cbda63e6">Deploying A Text-To-Speech Application with BentoML</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Retrieval-Augmented Generation: Part 2]]></title>
            <link>https://medium.com/bentoml/understanding-retrieval-augmented-generation-part-2-b9a40663d145?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/b9a40663d145</guid>
            <category><![CDATA[openllm]]></category>
            <category><![CDATA[rags]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[bentoml]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Thu, 01 Feb 2024 13:36:58 GMT</pubDate>
            <atom:updated>2024-02-01T14:51:14.589Z</atom:updated>
            <content:encoded><![CDATA[<p>This is the second installment of our blog series on Retrieval-Augmented Generation (RAG). In the <a href="https://medium.com/bentoml/understanding-retrieval-augmented-generation-part-1-1f1cea67050b">first article</a>, we explained the fundamentals of RAG: its mechanics and how it combines data retrieval with language generation. We also touched upon the challenges and potential this technology holds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qATckc5qdd-QIOhYfCsP2A.jpeg" /></figure><p>In this article, we will focus on the following three topics, which will hopefully provide you with some insights as you prepare for the <a href="https://rag-a-thon.devpost.com/">LlamaIndex RAG Hackathon</a>.</p><ul><li>Practical applications of RAG.</li><li>Building RAG systems.</li><li>The future of RAG.</li></ul><h3>Real-world applications of RAG</h3><p>RAG has practical applications across various industries, impacting how businesses and organizations operate. When designing a RAG system, you may want to consider the needs of real-world scenarios.</p><h4>Research and academia</h4><p>In academia, RAG may open up new approaches to literature reviews and research. Imagine a system that assists a historian researching the French Revolution. The RAG system scans through hundreds of new academic papers, books, and historical records, summarizing key findings and even identifying lesser-known but relevant sources, thereby enriching the research process and uncovering new insights.</p><h4>Healthcare</h4><p>In healthcare, RAG’s potential is particularly noteworthy. Medical professionals can use RAG-based systems to stay updated with the latest medical research, treatment protocols, and drug information. 
This technology can assist in diagnosing diseases or offering treatment recommendations, ensuring that patient care is supported by the most current medical knowledge.</p><h4>Customer service and chatbots</h4><p>In the world of customer service, particularly in chat applications, RAG is redefining the capabilities of chatbots by integrating real-time data for more dynamic interactions. For instance, a chatbot for a basketball league, enhanced with RAG, could offer real-time updates on game scores, player injuries, or post-match analyses. This is particularly valuable for fans following live events or seeking the latest statistics and player performance data. The chatbot, instead of being limited to pre-existing knowledge, becomes a dynamic source of current sports information.</p><h4>Finance and market analysis</h4><p>In finance, a RAG system can be used for real-time market analysis. For example, an investment firm might use RAG to analyze the impact of a sudden political event on market trends. The system can pull in the latest news reports, historical market data, and recent financial analyses, helping analysts quickly understand the event’s implications and make informed investment decisions.</p><p>These examples are only part of what RAG can do for different industries. By leveraging the latest information and contextual data, RAG is not only enhancing existing processes but also creating new possibilities for innovation and efficiency.</p><h3>Building a RAG system</h3><p>There are tons of different ways to build a RAG system. Here are some general points that may help you in its design:</p><ul><li>Use two models in the system — one for text embedding and another as the primary LLM model. This allows for more specialized handling of data retrieval and response generation.</li><li>Integrate OpenLLM and BentoML. You can start your system using any LLM, easily expose API endpoints for interaction, and deploy this system anywhere after containerization. 
If you are a participant team of the LlamaIndex RAG Hackathon, you will have $100 BentoCloud credits. After you push your project to BentoCloud, you can better manage, monitor and scale it in production.</li><li>Use <a href="https://github.com/vllm-project/vllm">vLLM</a> as the inference backend. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. For models with vLLM support, OpenLLM uses vLLM by default.</li><li>The system can start with a small set of initial test data, while it provides users the flexibility to upload their own data to the vector database.</li><li>Consider designing different endpoints for different purposes for interaction with users.</li><li>Consider adding features like automatic data categorization or tagging in the file upload process, enhancing the relevance and accuracy of retrieved information.</li><li>Optimize the vector database for efficient indexing and querying. Fine-tune parameters like index size, search algorithm, and memory usage to balance between speed and accuracy.</li><li>For data-intensive operations like file uploads and embedding generation, implement asynchronous processing to improve system responsiveness and user experience.</li></ul><p>Some resources for your reference:</p><ul><li><a href="https://www.bentoml.com/blog/building-an-intelligent-query-response-system-with-llamaindex-and-openllm">Building An Intelligent Query-Response System with LlamaIndex and OpenLLM</a></li><li><a href="https://docs.bentoml.com/en/v1.1.11/quickstarts/deploy-a-large-language-model-with-openllm-and-bentoml.html">Deploy a large language model with OpenLLM and BentoML</a></li><li><a href="https://github.com/bentoml/OpenLLM">OpenLLM readme</a></li></ul><h3>The future of RAG</h3><p>As RAG technology continues to evolve, its applications are set to become even more sophisticated and impactful. 
Here are some examples of how RAG might change different aspects of our lives in the coming years.</p><h4>Personalized health and wellness coaching</h4><p>In the near future, a RAG-enhanced personal health assistant could offer comprehensive wellness suggestions. For example, after a user inputs their dietary preferences, fitness goals, and current health metrics, the RAG system could analyze a vast array of up-to-date nutritional data, fitness regimes, and medical studies. It might then create a personalized health plan, suggesting specific diets and exercises, and even reminding them to take medications or schedule medical check-ups, all tailored to their unique health profile and goals.</p><h4>Tailored educational experiences</h4><p>In education, a RAG-powered tutoring system could provide students with highly personalized learning experiences. Based on a student’s learning style, progress, and interests, the system could source and integrate educational materials from various platforms, adapt the difficulty level in real time, and offer insights into topics that align with the student’s career goals or personal interests. This helps create a deeply engaging and effective learning environment.</p><h4>Smart home automation</h4><p>In a smart home setting, RAG could take automation to the next level. Imagine a system that not only controls home devices but also anticipates needs based on contextual data. For example, the RAG system could analyze weather forecasts, the homeowner’s schedule, and energy usage patterns to optimize heating and cooling, suggest grocery orders based on consumption trends, and even offer entertainment recommendations based on the user’s mood, interactions, and preferences.</p><p>These scenarios demonstrate the remarkable potential of RAG to provide personalized, context-aware solutions. 
It promises not only to answer our queries but to anticipate our needs and offer solutions that are closely aligned with our personal preferences and life situations.</p><h3>Conclusion</h3><p>The journey ahead for RAG is filled with exciting possibilities. This technology is set to become more intuitive, more adaptive, and even more aligned with individual user needs. Thank you for joining us in this exploration of RAG and good luck to everyone who will be competing in <a href="https://rag-a-thon.devpost.com/">the LlamaIndex RAG Hackathon</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b9a40663d145" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/understanding-retrieval-augmented-generation-part-2-b9a40663d145">Understanding Retrieval-Augmented Generation: Part 2</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Understanding Retrieval-Augmented Generation: Part 1]]></title>
            <link>https://medium.com/bentoml/understanding-retrieval-augmented-generation-part-1-1f1cea67050b?source=rss-45268e99b1------2</link>
            <guid isPermaLink="false">https://medium.com/p/1f1cea67050b</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[openllm]]></category>
            <category><![CDATA[llamaindex]]></category>
            <category><![CDATA[bentoml]]></category>
            <category><![CDATA[rags]]></category>
            <dc:creator><![CDATA[Sherlock Xu]]></dc:creator>
            <pubDate>Thu, 25 Jan 2024 11:54:06 GMT</pubDate>
            <atom:updated>2024-01-25T11:54:18.338Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*b3dyUROiLf3Hc-wXwE-5KQ.png" /></figure><p>Imagine you are a contestant on a competitive cooking show (like <em>Hell’s Kitchen</em>), required to create a dish that’s not only delicious but also tells a unique story. You already have some cooking skills thanks to your past training experience, but what if you could freely access a global library of recipes, regional cooking techniques, and even flavor combinations? That’s where your sous-chef, equipped with a vast culinary database, steps in. This sous-chef doesn’t just bring you ingredients; she also brings specialized knowledge and inspiration, helping you transform your cooking into a masterpiece that tells a unique, flavorful story.</p><p>This is the essence of Retrieval-Augmented Generation, or RAG, in the AI world. Like the sous-chef who elevates your cooking with a wealth of custom resources, RAG enhances the capabilities of large language models (LLMs). It’s not just about responding to queries based on pre-existing knowledge; RAG allows the model to dynamically access and incorporate a vast range of external information, just like tapping into a global culinary database for that unique recipe.</p><p>As a partner of the <a href="https://rag-a-thon.devpost.com/">LlamaIndex RAG Hackathon</a>, we will release a two-article blog series about RAG to help the BentoML community gain a better understanding of its concepts and usage. In this first post, we will explore the mechanics of this technology, its benefits, as well as the challenges it faces, offering a comprehensive taste of how RAG is redefining the boundaries of AI interactions.</p><h3>RAG 101</h3><p>Patrick Lewis and his colleagues at Meta first proposed the concept of RAG in the <a href="https://arxiv.org/abs/2005.11401v4">2020 paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a>. 
At its core, RAG has two important components: the retrieval system and the language model.</p><ul><li><strong>The retrieval system</strong>: This is like the data corpus of RAG. The retrieval system scans through extensive databases of information to find the most relevant and useful data that can enhance the response to a query. This process is similar to selecting the perfect ingredients for a recipe, ensuring that each one contributes to the final flavor profile.</li><li><strong>The language model</strong>: This is the chef who knows how to combine ingredients into a dish. The language model takes the information sourced by the retrieval system and integrates it into contextually relevant responses.</li></ul><p>Traditional language models are like chefs working with a fixed set of ingredients. They can create impressive dishes (responses) based on what they have (their training data), but they are limited to those ingredients. RAG, on the other hand, has the ability to constantly source new ingredients (information), making dishes (responses) far more diverse, accurate, and rich.</p><h3>Embeddings and vector databases in RAG</h3><p>In the world of RAG, when a user poses a question, answering it involves a complex computational process in which embeddings and vector databases play important roles.</p><h3>Embeddings</h3><p>The first step in RAG’s retrieval process involves translating the user’s query into a format that the AI model can understand. This is often done through embeddings or vectors. An embedding is essentially a numeric representation of the query, capturing not just the words, but their context and semantic meaning. 
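To make the retrieval idea concrete, here is a toy sketch: the hand-made three-dimensional vectors below stand in for the output of a real embedding model, and a query vector is matched against a tiny corpus by cosine similarity (the documents and numbers are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" standing in for a real embedding model's output
corpus = {
    "braising techniques for beef": [0.9, 0.1, 0.0],
    "chocolate dessert recipes": [0.1, 0.9, 0.2],
    "knife sharpening basics": [0.0, 0.2, 0.9],
}

# An embedded query, e.g. "how do I slow-cook a pot roast?"
query_embedding = [0.8, 0.2, 0.1]

# Retrieval: pick the document whose embedding is closest to the query
best_doc = max(corpus, key=lambda doc: cosine_similarity(corpus[doc], query_embedding))
print(best_doc)  # the braising document is the nearest neighbor
```

A production system does the same thing at scale: a vector database indexes millions of such embeddings and returns approximate nearest neighbors in milliseconds.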
Think of it as translating a recipe request into a list of necessary flavor profiles and cooking skills.</p><p><strong><em>Note</em></strong><em>: Previously, we published two blog posts on creating </em><a href="https://www.bentoml.com/blog/deploying-a-sentence-embedding-service-with-bentoml"><em>sentence embedding</em></a><em> and </em><a href="https://www.bentoml.com/blog/building-and-deploying-an-image-embedding-application-with-clip-api-service"><em>image embedding</em></a><em> applications with BentoML respectively. Read them for more details.</em></p><p>Embeddings allow the AI model to process and compare the query against a vast array of stored data efficiently. This process is similar to a chef understanding the essence of a dish and then knowing exactly what ingredients and techniques to use.</p><h3>Vector databases</h3><p>After you have the embeddings, the next crucial component is the vector database.</p><p>Vector databases in RAG store a massive amount of pre-processed information, each piece also represented as embeddings. When the AI model receives a query, it uses these embeddings to search through the database, looking for matches or closely related information.</p><p>The use of vector databases allows RAG to search through and retrieve relevant information with decent speed and precision. It’s like having an instant global connection to different flavors and ingredients, each cataloged not just by name, but by their taste profiles and culinary uses.</p><p>Ultimately, the embeddings, the vector database, and the language model work together to make sure the final response is a well-thought-out answer that blends the retrieved information with the AI’s pre-trained data.</p><h3>The benefits of RAG</h3><p>RAG comes with a number of benefits. To name a few:</p><ul><li><strong>Enhanced accuracy.</strong> By leveraging up-to-date external information, RAG ensures that the answers are not only contextually relevant but also enriched with the latest data. 
This is particularly important in fields like medicine, technology, and finance.</li><li><strong>Dynamically updated information</strong>. Unlike traditional models that rely solely on their training data, RAG models can access and incorporate dynamically updated information. Because this information is supplied at retrieval time, it does not modify the underlying language model itself and incurs no additional training costs.</li><li><strong>Source attribution</strong>: Since the retrieval system knows which documents or text snippets it has pulled from the database, it can provide this information along with the generated response. This provides an extra layer of transparency and trust to the responses generated.</li><li><strong>Personalized interactions</strong>. RAG has the potential for more personalized AI interactions. As the system understands and incorporates specific details from users’ queries, it can provide responses that are more aligned with individual needs and preferences.</li></ul><p>The implications of RAG’s benefits extend far beyond just improved answers. They represent an important shift in how we interact with AI, transforming it into a tool capable of providing informed, accurate, and contextually rich interactions. This opens up new possibilities in education, customer service, research, and any other fields where access to updated, relevant information is important.</p><h3>Challenges and limitations</h3><p>Key challenges of RAG include:</p><ul><li><strong>Data retrieval complexity.</strong> One of the primary challenges in RAG is ensuring the accuracy and relevance of the data retrieved. While RAG is able to pull in vast amounts of information, filtering this data to find the most pertinent pieces can be complex. Ensuring the retrieval system can understand the nuances of different queries is important but not always easy.</li><li><strong>Balancing relevance with reliability</strong>. 
The retrieval system may have access to a wide range of data, but not all sources are equally trustworthy. Therefore, it is important to balance the relevance of the latest information with the reliability and credibility of sources. This may require developing mechanisms to evaluate and prioritize reliable sources.</li><li><strong>Computational resources and costs</strong>. RAG systems, particularly those handling large datasets and complex queries, require substantial computational resources. The process of retrieving, processing, and integrating external information in real time can be computationally intensive, leading to higher operational costs and potential efficiency concerns.</li><li><strong>Future-proofing</strong>. Ensuring that RAG systems remain effective and up-to-date over time is another challenge. As information sources and user expectations evolve, RAG systems must also adapt and scale accordingly, which requires ongoing development and maintenance.</li></ul><h3>Conclusion</h3><p>Despite these challenges, RAG holds great potential for transforming AI interactions. Its role in enhancing AI’s capabilities is undeniable, and the journey to refine this technology further is both challenging and exciting.</p><p>In the next article, we will explore the real-world applications of RAG and its future outlook.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1f1cea67050b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/bentoml/understanding-retrieval-augmented-generation-part-1-1f1cea67050b">Understanding Retrieval-Augmented Generation: Part 1</a> was originally published in <a href="https://medium.com/bentoml">BentoML</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>