This repository contains a complete, production-ready AI chatbot application built with Meta's Llama 2 7B language model, designed for seamless deployment on Convox infrastructure.
This application provides a scalable, containerized AI assistant powered by the Llama 2 7B chat model. It features:
- A lightweight React frontend for user interaction
- A FastAPI service for handling chat session management
- A GPU-accelerated inference server running the Llama 2 model
- Fully containerized deployment supporting NVIDIA GPUs
- Ready for production deployment via Convox
The application is structured into three main services (a minimal sketch of the request flow between the API service and the model server follows the list):

- Frontend:
  - Simple, intuitive chat interface
  - Persistent chat sessions
  - Responsive design with real-time feedback
- API Service:
  - Handles chat session management
  - Manages message history
  - Communicates with the model server
  - Provides a RESTful API interface
- Model Server:
  - Runs the Llama 2 7B chat model
  - GPU-accelerated inference using vLLM
  - Optimized for low-latency responses
  - Configurable generation parameters
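The snippet below is a minimal sketch of the hop from the API service to the model server, assuming the model server exposes vLLM's OpenAI-compatible endpoint and using illustrative names (the `/chat` route, `MODEL_SERVER_URL`) rather than paths taken from this repository:

```python
# api_sketch.py - illustrative only; route names and environment variables
# are assumptions, not copied from this codebase.
import os

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Internal address of the model server (assumed service name and port).
MODEL_SERVER_URL = os.environ.get("MODEL_SERVER_URL", "http://model:8000")


class ChatRequest(BaseModel):
    session_id: str
    message: str


@app.post("/chat")
async def chat(req: ChatRequest) -> dict:
    # Forward the user's message to the model server. vLLM's OpenAI-compatible
    # server accepts a standard /v1/chat/completions payload, so generation
    # parameters (temperature, max_tokens, etc.) are configured per request.
    payload = {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": req.message}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(f"{MODEL_SERVER_URL}/v1/chat/completions", json=payload)
        resp.raise_for_status()
    data = resp.json()
    return {"session_id": req.session_id, "reply": data["choices"][0]["message"]["content"]}
```

The actual API service additionally manages chat sessions and message history, which are omitted from this sketch.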
This project is pre-configured for easy deployment on Convox. The convox.yml file contains all necessary configuration to deploy the application with appropriate resource allocations.
To deploy, you will need:

- A Convox account and Rack
- GPU-enabled nodes in your Convox Rack (for the model server)
- A Hugging Face access token with permission to download Llama 2 models
1. Configure Node Groups

   Ensure your Convox Rack has a node group with GPU support. You can add a GPU-enabled node group with:

       convox rack params set additional_node_groups_config='[{"type":"g4dn.xlarge","min_size":1,"max_size":1,"label":"alternate-test-tag-2"}]' -r <your-rack>
2. Set Environment Variables

   Set your Hugging Face token:

       convox env set HF_TOKEN=<your-hugging-face-token> -a <your-app>
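The token is consumed inside the model server when it pulls the gated Llama 2 weights from Hugging Face. Below is a minimal sketch of that download step, assuming `huggingface_hub` is used and that `/models` is the persistent volume's mount point (both are illustrative assumptions, not confirmed by this repository):

```python
# download_weights_sketch.py - illustrative; the local_dir path and the use of
# snapshot_download are assumptions about how the model server fetches weights.
import os

from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    token=os.environ["HF_TOKEN"],          # injected via `convox env set HF_TOKEN=...`
    local_dir="/models/llama-2-7b-chat",   # assumed mount point of the persistent volume
)
print(f"Llama 2 weights cached at {model_dir}")
```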
3. Deploy the Application

       convox deploy
4. Access Your Application

   After deployment, you can access your chatbot at the URL provided by Convox. List your services and their URLs with:

       convox services
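Once the app is up, you can also exercise it from a script rather than the browser. The example below posts to the hypothetical `/chat` route sketched earlier; substitute the hostname reported by `convox services` and whatever route the API actually exposes:

```python
# chat_smoke_test.py - replace the hostname and route with your own values.
import requests

BASE_URL = "https://<your-app-hostname>"  # from `convox services`

resp = requests.post(
    f"{BASE_URL}/chat",
    json={"session_id": "demo", "message": "Hello, Llama!"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```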
The application is configured with the following resource allocations:
- Model Server:
  - 1 vCPU
  - 8GB RAM
  - 1 GPU
  - Persistent volume for model storage
- API Service:
  - 0.25 vCPU
  - 512MB RAM
  - 2 replicas for high availability
- Frontend:
  - 0.25 vCPU
  - 256MB RAM
  - 2 replicas for high availability
These values can be adjusted in the convox.yml file based on your specific needs and traffic expectations.
For local development, you will need:

- Docker and Docker Compose
- Node.js 16+ for frontend development
- Python 3.9+ for backend development
- An NVIDIA GPU with CUDA support (for local model server testing)
1. Clone the repository:

       git clone https://github.com/your-username/llama2-convox-chatbot.git

2. Start the services using Docker Compose:

       docker-compose up

3. Access the application at http://localhost:3000
This project is released under the MIT License.
Important notes:

- You need access to Meta's Llama 2 model on Hugging Face. Request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
- GPU resources are required for the model server
- The model server will download approximately 14GB of model weights
Contributions are welcome! Please feel free to submit a Pull Request.