m0nads

Reasoning without External Rewards

20 Jul 2025 af

From self-certainty to Reinforcement Learning from Internal Feedback

Article link.

LLMs: a 5 mins trip

16 Jul 2025 af

What Large Language Models actually do in mathematical terms

Article link.

Minimal RAG model

27 Jan 2025 af

Using Cohere and SerpAPI

Article link.

Byte Latent Transformer

14 Jan 202514 Jan 2025 af

The Efficiency of Dynamic Byte Patching

Article link.

Florence-2 is a novel vision foundation model developed by Microsoft that is redefining the capabilities of artificial intelligence in the field of computer vision. This powerful model serves as a versatile backbone for a wide range of vision tasks, from image captioning and object detection to visual grounding (locating and identifying specific objects or regions within an image based on a natural language query) and referring expression comprehension (localizing a specific object within an image based on a natural language expression). Key features of Florence-2 include:

Unified sequence-to-sequence architecture: Florence-2 employs a consistent underlying structure that can handle diverse vision tasks without the need for task-specific modifications, aligning with the principles of versatile model development.
Unprecedented zero-shot and fine-tuning capabilities: The model achieves new state-of-the-art results in various tasks, even outperforming larger specialist models after fine-tuning with public human-annotated data.
Multitask learning with extensive visual annotations: Florence-2 is trained on a comprehensive dataset, enabling it to learn intricate visual patterns and representations that can be applied to numerous domains.
Prompt-based representation: The model uses a unified prompt-based approach to accommodate a wide range of vision tasks, addressing the challenges of limited training data and the absence of a unified architecture.
Efficient solution for various tasks: Florence-2 can handle tasks from image-level understanding to fine-grained visual-semantic alignment without extensive fine-tuning, making it an ideal choice for practical applications.

By combining the power of multi-task learning and the seamless integration of textual and visual information, Florence-2 sets a new standard for comprehensive representation learning in computer vision. Its achievements pave the way for the future of vision foundation models, inspiring researchers to delve deeper into the realms of cross-modal understanding and adaptable AI systems.

Introduction

I will never tire of highlighting that MS researchers have always had a pioneering attitude towards how to train/create models by reducing their size while simultaneously increasing their capabilities; papers on models like Orca 2, Phi-3, and Florence-2 serve as a sort of manual on data preparation for training lean and efficient models. These efforts are based not so much on the discovery of new architectures (note from the author: at the architectural level, despite significant diversifications over the years, we are still in 2017, the advent of the Transformer) but on careful strategies that emphasize how to prepare training data and in what formats to input them into the models.

Achieving universal representation in computer vision poses distinct challenges due to the complexity of visual data. Key aspects include:

Spatial Hierarchy: models must effectively interpret spatial details at different scales, balancing between high-level concepts and fine-grained pixel information.
Semantic Granularity: successful universal representation should encompass a range of semantic details, from broad captions to intricate descriptions, facilitating diverse applications.

(All images from now on, except those in the coding section, are from the original Florence-2 article)

To build a unified vision foundation model suitable for various applications, the authors propose an innovative pre-training strategy that overcomes single-task limitations and integrates both textual and visual semantics. Their approach incorporates a diverse set of annotations, effectively capturing visual understanding nuances and bridging the gap between vision and language understanding. This strategy aims to capture multiple levels of granularity, from global semantics to local details, and comprehend spatial relationships between objects and entities in their semantic context.

Model and Data Engine

Florence-2 consists of an image encoder and standard multi-modality encoder-decoder. Florence-2 employs a sequence-to-sequence learning paradigm, integrating several tasks under a common language modeling objective. The model takes images coupled with task-prompt as task instructions, and generates the desirable results in text forms. Note how input image data is enriched with additional prompt info, and then encoded (see picture below).

Florence-2 data engine consists of three essential phases:

initial annotation employing specialist models;
data filtering to correct errors and remove irrelevant annotations;
an iterative process for data refinement.

The final dataset FLD-5B consists of over 5B annotations containing 126M images, 500M text annotations, 1.3B region-text annotations, and 3.6B text-phrase-region annotations.

Image Collection

Data is constructed by gathering a diverse collection of images from various sources. There are three key tasks that act as primary sources for image corpus: image classification, object detection, and image captioning. Five distinct datasets originating from the aforementioned tasks: ImageNet-22k, Object 365, Open Images, Conceptual Captions, and LAION. This combination results in a dataset of 126 million images in total.

Data Annotation Process

The primary goal of the data annotation process is to create comprehensive annotations that facilitate effective multitask learning. This process is divided into three main categories: text, region-text pairs, and text-phrase-region triplets.

The annotation workflow consists of three essential phases:

Initial Annotation: This phase employs specialist models to generate initial annotations. These models are a mix of offline models trained on diverse datasets and online services tailored for specific annotation types. In cases where datasets already contain partial annotations, such as the Object 365 dataset, these are merged with synthetic labels to enhance coverage.
Data Filtering and Enhancement: Initial annotations, while comprehensive, can contain noise and inaccuracies. A multifaceted filtering process is implemented to refine these annotations:
- Text Annotations: A parsing tool based on SpaCy is used to extract relevant objects, attributes, and actions. Texts with excessive objects are filtered out to reduce noise, and only those with a minimum complexity are retained.
- Region Annotations: Noisy bounding boxes are removed based on a confidence score, and non-maximum suppression is applied to eliminate redundant boxes.
Iterative Data Refinement: Using the filtered annotations, a multitask model is trained, leading to improved predictions, especially in cases of inaccuracies. The updated annotations are integrated with the original ones for further training. For tasks initially omitted due to insufficient data, the iteratively trained model is used for pre-training, followed by fine-tuning with the sparse dataset, resulting in better performance than models trained from scratch.

This comprehensive approach ensures high-quality annotations across an extensive dataset of 126 million images, facilitating robust multitask learning and enhancing the overall annotation process. The data annotation process is depicted below.

Annotation-Specific Variations

This section elaborates on the variations in the annotation procedures for different types of data within the general workflow introduced earlier.

Text annotations categorize images into three granularities: brief, detailed, and more detailed.

Brief Text: Consists of a single sentence highlighting the most salient objects and activities, similar to COCO captions. A Florence-2 model is trained to generate these annotations, with iterative refinement to reduce noise.
Detailed and More Detailed Text: Comprises multiple sentences that provide richer descriptions. These are generated using prompts from existing annotations fed into large language models (LLMs) or large multimodal models (LMMs). Due to the high cost, only a limited number of these texts are created to fine-tune the caption specialist.

Region-text pairs offer descriptive annotations for semantic regions in images, including both visual objects and text regions. Each region is represented by a bounding box and can be annotated with varying granularities.

Text Regions: Labeled using Azure AI Services’ OCR API.
Visual Object Regions: Initially annotated with a DINO object detector trained on public datasets. Data filtering techniques like confidence thresholding and non-maximum suppression are applied to remove noisy annotations. Each region receives three textual annotations: a phrase from the object category, brief text, and noun phrase chunks.

Text-Phrase-Region Triplets. These triplets consist of descriptive text, noun phrases related to image objects, and their corresponding region annotations.

The descriptive text includes brief, detailed, and more detailed annotations. The Grounding DINO model identifies noun phrases and generates bounding boxes, while the SAM model creates segmentation masks for precise localization. Data filtering applies a confidence score threshold to ensure relevance, and a blacklist is used to exclude irrelevant noun phrases.

This structured approach enhances the richness and accuracy of image annotations across various categories.

In the picture above, an illustrative example of an image and its corresponding annotations in FLD-5B dataset. Each image in FLD-5B is annotated with text, region-text pairs, and text-phrase-region triplets by Florence data engine, which covers multiple spatial hierarchies, brief-to-detailed progressive granularity, and a wide semantics spectrum, enabling more comprehensive visual understanding from diverse perspectives.

Coding example

To run the following code, you need a GPU supporting CUDA. Alternatively, you can run it on Google Colab or a similar hosted service.


# Install basic libraries
%pip install transformers pillow timm flash_attn


# Importing modules
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import copy
%matplotlib inline


# Setting the model 
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(
    model_id, 
    trust_remote_code=True)


# Function generating text from prompt
def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(
        text=prompt, 
        images=image, 
        return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        early_stopping=False,
        do_sample=False,
        num_beams=3,
        )
    generated_text = processor.batch_decode(
        generated_ids, 
        skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer


url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)


image


task_prompt = ''
run_example(task_prompt)

{'<CAPTION>': 'A green car parked in front of a yellow building.'}


task_prompt = ''
run_example(task_prompt)

{'<DETAILED_CAPTION>': 'The image shows a green Volkswagen Beetle parked in front of a yellow building with two brown doors. The sky is a mix of blue and white, and there are a few green trees in the background.'}


task_prompt = ''
run_example(task_prompt)

{'<MORE_DETAILED_CAPTION>': 'The image shows a vintage Volkswagen Beetle car parked on a cobblestone street in front of a yellow building with two wooden doors. The car is painted in a bright turquoise color and has a white stripe running along the side. It has two doors on either side of the car, one on top of the other, and a small window on the front. The building appears to be old and dilapidated, with peeling paint and crumbling walls. The sky is blue and there are trees in the background.'}

The output for object detection is textual, indicating various boxes and the respective labels.


# Object detection
task_prompt = ''
results = run_example(task_prompt)
print(results)

{'<OD>': {'bboxes': [[34.23999786376953, 160.0800018310547, 597.4400024414062, 371.7599792480469], [456.0, 97.68000030517578, 580.1599731445312, 261.8399963378906], [450.8800048828125, 276.7200012207031, 554.5599975585938, 370.79998779296875], [95.68000030517578, 280.55999755859375, 198.72000122070312, 371.2799987792969]], 'labels': ['car', 'door', 'wheel', 'wheel']}}

You can get a visual output showing boxes and labels using the following function.


# Function drawing boxes and labels on image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
def plot_bbox(image, data):
   # Create a figure and axes
    fig, ax = plt.subplots()

    # Display the image
    ax.imshow(image)

    # Plot each bounding box
    for bbox, label in zip(data['bboxes'], data['labels']):
        # Unpack the bounding box coordinates
        x1, y1, x2, y2 = bbox
        # Create a Rectangle patch
        rect = patches.Rectangle(
            (x1, y1), 
            x2-x1, 
            y2-y1, 
            linewidth=1, 
            edgecolor='r', 
            facecolor='none')
        # Add the rectangle to the Axes
        ax.add_patch(rect)
        # Annotate the label
        plt.text(x1, y1, label, color='white', 
            fontsize=8, 
            bbox=dict(facecolor='red', alpha=0.5))

    # Remove the axis ticks and labels
    ax.axis('off')

    # Show the plot
    plt.show()


plot_bbox(image, results[''])

There really is much more to Florence-2’s capabilities; however, we will stop here for brevity.

Useful links

Original Florence-2 article

Hugging Face “Florence-2-large” page

Phi-3 models

29 May 202429 May 2024 af

Tiny but mighty

Microsoft Phi-3 is a family of small language and multi-modal models whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5. Language models are available in short- and long-context lengths (4K and 128K).

Introduction

Microsoft researchers have always paid laudable attention to training techniques that take into account the careful selection of high-quality and more informative data combined with synthetic data produced with the aim of “instructing” or providing hacks to improve the performance of small models.

Phi-3 models are Transformer-based models trained using a selection of “textbook quality” data from the web together with synthetically generated textbooks and exercises with GPT-3.5.

Training strategy

Training data of consists of heavily filtered publicly available web data with high educational level from various open internet sources, as well as synthetic LLM-generated data. For example, suppose to compare coding files from training data: a high educational level code is code that is well formatted, easy to read and containing descriptions of what the code is for.

Pre-training is performed in two disjoint and sequential phases. The first phase comprises mostly of web sources aimed at teaching the model general knowledge and language understanding. The second phase merges even more heavily filtered web data (a subset used in the first phase) with some synthetic data whose purpose is to teach the model logical reasoning and various niche skills.

Researchers have attempted to calibrate the training data to bring it closer to the data optimal regime for small models, trying to retain those web pages that could potentially improve the “reasoning ability” of the model. For example, the result of a game in premier league in a particular day might be good training data for frontier models, but it is rather superfluous when focusing on reasoning capabilities of small sized models.

Phi-3 models range from 3.8 billion parameters (phi-3-mini, small enough to be deployed on a phone!) to 7 and 14 billion parameters (phi-3-small and phi-3-medium respectively). It is observed that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B. This could likely indicate that data mixture needs further work to be in the “data optimal regime” for 14B parameters model. Later we will introduce a coding hands-on example using phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

The post-training process of phi-3-mini included two main stages: supervised finetuning (SFT) and direct preference optimization (DPO).

SFT: Utilized highly curated, high-quality data from various domains such as math, coding, reasoning, conversation, model identity, and safety. It began with English-only examples.
DPO: Focused on chat format data, reasoning, and responsible AI (RAI) efforts to guide the model away from undesirable behavior by marking those outputs as “rejected” responses.

Overall, this post-training process enhanced the model’s capabilities in math, coding, reasoning, robustness, and safety, transforming it into an efficient and safe AI assistant for user interactions.

Technical specifications

The phi-3-mini model is a Transformer decoder architecture with default context length 4K – there is also a long context version that extends the context length to 128K, called phi-3-mini-128K – and it is built upon a similar block structure as Llama-2. It uses the same tokenizer with vocabulary size of 320641, meaning that all packages developed for Llama-2 family of models can be directly adapted to phi-3-mini. The model uses 3072 hidden dimension, 32 heads and 32 layers. The model was trained using bfloat16 for a total of 3.3T tokens.

To leverage the better multilingual tokenization, phi-3-small model (7B parameters) uses the tiktoken tokenizer with a vocabulary size of 1003522 and default context length 8192. The decoder architecture meets the standards of a 7B model class, having 32 heads, 32 layers and a hidden size of 4096. The model

uses GEGLU activations and Maximal Update Parametrization – in practice, hyperparameters are tuned on a small proxy model and then transferred/extended to the larger target 7B model;
leverages a grouped-query attention with 4 queries sharing 1 key;
uses a novel block sparse attention which, for each attention head, enforces different sparsity patterns over Key-Values cache (you can click here for a good resource on block sparse attention);
alternates dense attention layers and block sparse attention layers to optimize KV cache savings while maintaining long context retrieval performance.

Benchmarks

Below, a list of different Phi-3 models benchmarks:

Phi-3-Small benchmarks (link);
Phi-3-Medium benchmarks (link);
Phi-3-Vision benchmarks (link).

Phi-3-Vision – quick testing

Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model (4.2B parameters) designed to process an image and a textual prompt as inputs, and subsequently generate textual outputs. The model is composed of an image encoder (CLIP ViT-L/14) and a transformer decoder (phi-3-mini-128K-instruct). The visual tokens, once extracted by the image encoder, are then combined with text tokens in an interleaved way (no particular order for image and text tokens).

In this section we will explore Phi-3-Vision model using a Jupyter notebook and Hugging Face Transformers library. The notebook needs Cuda Toolkit support because it uses Flash Attention – if you don’t have such a GPU, you could try phi-3-mini-128K model instead.

Let’s begin with installing and importing libraries.


%pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers

%pip install accelerate
%pip install flash-attn


from PIL import Image 
import requests 
from transformers import AutoModelForCausalLM 
from transformers import AutoProcessor

Set the model.


model_id = "microsoft/Phi-3-vision-128k-instruct" 

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True, 
    torch_dtype="auto")

processor = AutoProcessor.from_pretrained(
    model_id, 
    trust_remote_code=True)

Provide messages, including the request “Provide insightful questions to spark discussion”.


messages = [ 
    {"role": "user", "content": "\nWhat is shown in this image?"}, 
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."}, 
    {"role": "user", "content": "Provide insightful questions to spark discussion."} 
]

Select the input image.


url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
 
image = Image.open(requests.get(url, stream=True).raw)

Prepare the prompt and preprocess the input data (text and image) converting it into a format that can be fed into the model for further processing or inference.


prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True)

inputs = processor(
    prompt,
    [image], 
    return_tensors="pt").to("cuda:0")

The following cell sets up the configuration parameters for a text generation process by creating a dictionary with specific settings for the maximum number of tokens to generate, the temperature (randomness) of the output, and whether to use sampling or not (in this case the model will not sample from the probability distribution but will instead choose the token with the highest probability at each step) .


generation_args = { 
    "max_new_tokens": 500, 
    "temperature": 0.0, 
    "do_sample": False, 
}

The generate method generates the new text based on the input and the learned patterns from the training data. The output it produces is generate_ids, which is a list or tensor containing the token IDs representing the generated text.


generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args)

In the folowing cell, the purpose of the first line is to remove the input token IDs from the generated output. The second line decodes the remaining generated token IDs into human-readable text.


generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False)[0] 

print(response)

Here is the response.


1. What are the most significant barriers to meeting preparedness according to the respondents?
2. How does the level of agreement with each statement correlate with the respondents' overall satisfaction with their meetings?
3. Are there any notable differences in agreement levels between different demographics or job roles?
4. What strategies have been most effective in improving meeting preparedness based on the respondents' feedback?
5. How does the perceived importance of each statement vary across different industries or company sizes?

Useful links

Phi-3 article (link)

Phi-3 available on MS Azure (link)

Understanding BigBird’s Block Sparse Attention (link)

Jupyter notebook on Phi-3-Vision (link)

Local RAG using LLaMA3

21 May 202414 Jan 2025 af

A simple Retrieval-Augmented Generation (RAG) model

Article link.

Moondream

21 May 202412 Jun 2024 af

“A tiny open-source computer-vision model that runs everywhere and kicks ass” (cit.)

Moondream is a lightweight and fast transformer-based model that can be used for a variety of computer vision tasks. “moondream2” is a 1.86B parameter model. In essence, the model is a Vision Transformer (ViT) together with a language model that is capable of generating human-like text based on the input visual information. This allows the model to perform tasks such as image captioning, visual question answering, and multimodal reasoning. Our aim is to explore the Moondream model in the simplest way.

Although possible, there is no real need to clone the Moondream GitHub repository, instead there are two quick choices:

run the notebook using Google Colab (avoiding installing locally) or
open a Jupyter notebook from an environment with pip and PyTorch and run the following cells.

There is a third option, that is, testing directly the model from this demo page without following the content of this notebook.

We can leverage the Hugging Face transformers library to build powerful natural language processing models. The transformers library provides a wide range of pre-trained models for various NLP tasks, such as text classification, named entity recognition, and language generation. einops library is useful to transform tensors into the desired shape for the model.


!pip install transformers einops

The model is updated regularly, so it is recommended pinning the model version to a specific release.

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-05-08"

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)

Enabling Flash Attention by appropriately setting the parameters: this is shown in the cell below.


# Uncomment the following to enable Flash Attention on the text model

# model = AutoModelForCausalLM.from_pretrained(
#     model_id, trust_remote_code=True, revision=revision,
#     torch_dtype=torch.float16, attn_implementation="flash_attention_2"
# ).to("cuda")

Loading the tokenizer for the specified model and revision.


tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

The following cell is to enable URL image retrieval. However, we can also use paths to files as described later. Click on the link to see the image fed to the model.


# This is for opening images from URLs
import requests
from io import BytesIO

response = requests.get("https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-2.jpg")

The following cell creates a PIL Image object from an image file or an image URL. Uncomment to make a choice.


# image = Image.open('') # image from file
image = Image.open(BytesIO(response.content)) # image from URL

Now we encode the input image using the model’s image encoding function, then print the answer to the question.


enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

Batch inference is also supported.


# Uncomment below to use

# answers = moondream.batch_answer(  
#     images=[Image.open(''), Image.open('')],
#     prompts=["Describe this image.", "Are there people in this image?"],
#     tokenizer=tokenizer,
# )

Limitations

The following are taken verbatim from the repo page.

The model may generate inaccurate statements, and struggle to understand intricate or nuanced instructions.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.

Useful links

Same article on Substack (link)

Jupyter Notebook here

Moondream GitHub repository

Moondream Live Demo

Hugging Face page

Vision Transformer (ViT)

Support this blog

JetMoE

22 Apr 202429 Apr 2024 af

LLM training can be much cheaper than people generally thought

JetMoE is a recent Large Language Model (LLM) that supposedly outperforms LLaMA2-7B from Meta AI and was trained for 2 weeks using 96×H100 GPU cluster, spending only ~$80,000…

But how much does it cost to train a LLM?

Training costs

A first oddity is that the JetMoE article does not explicitly mention any training costs (except its own) for comparison with other models. Also, according to this page, Llama2-7B model requires less than $85,000 to train – so, if that were the case, what would be the big economic benefit of JetMoE? For example, where did the JetMoE staff get the amount of training costs for Llama2-7B and why didn’t they publish this data for direct comparison?

Anyway, the JetMoE article reports training costs as GPU hours (exactly, Nvidia H100 GPU hours). JetMoE training costs 30,000 H100 GPU Hours. A Microsoft “optimized version of the Llama 2 model” shows the table below expressed in A100 GPU Hours (the overall performance of the H100 is better than the previous generation A100)…

Meta’s largest LLaMA model, as of march 2023, used 2,048 Nvidia A100 GPUs to train on 1.4 trillion tokens (750 words is about 1,000 tokens), taking about 21 days: the cost was over $2.4 million. Analysts and technologists estimate that the critical process of training a large language model such as OpenAI’s GPT-3 could cost more than $4 million. You can find these numbers here.

GPT-4 training approximately costs over $100 million (here).

Key Messages

This is taken directly from the JetMoE’s page.

JetMoE-8B is trained with less than $ 0.1 million cost but outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.
JetMoE-8B is very open and academia-friendly because:
1. It only uses public datasets for training, and the code is open-sourced. No proprietary resource is needed.
2. It can be finetuned with very limited compute budget (e.g., consumer-grade GPU) that most labs can afford.
JetMoE-8B only has 2.2B active parameters during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves constantly better performance.

How JetMoE works

The JetMoE architecture is illustrated in the following figure.

JetMoE architecture takes advantage of sparse activation on both the attention and feed-forward layers, significantly reducing training and inference costs.

Let x be the input vector, consider a learnable matrix W_r that controls the routing. Let s be the routing output:

s = W_r x .

The Sparse Mixture of Experts (SMoE) output y is represented by a relation of the type

y = g₁ · f₁(x) + g₂ · f₂(x) + · · · + g_n · f_n(x) .

It’s just a weighted combination of n experts (these are normally 2-layer MLPs or, in case of Mixture of Attention, constructs of the type illustrated below) represented by the functions f_i with i = 1, 2, . . . , n with the various “weights” g_i as functions that select the top k logits (taking their softmax) from s, setting the rest to 0.

In essence, s is a vector whose larger components have a greater influence on the above combination defining output y. The usefulness of this approach lies in the fact that if g_i = 0 for several indices i, then all the corresponding f_i(x) will not be evaluated, thus reducing computation cost during training and inference. The mechanism of a single attention expert is illustrated in the following figure.

Matrices W_k and W_v are shared across experts to improve the training and inference efficiency, instead matrices W_q and W_o in orange vary from one expert to the other. a_e is obtained applying standard multi-head attention with RoPE to k, v and q_e .

A little coding

A very concise and quick PyTorch test Jupyter notebook for JetMoE can be found here (warning: you’ll need a lot of GPU memory). Alternatively, you can test the model directly using the Online Demo on Lepton AI (link).

Useful links

JetMoE article

H100 vs A100 performance comparison

Microsoft Llama-2-Onnx model details (link)

Article on training costs here

GPT-4 over $100 million training (here)

JetMoE’s page

Video – Rotary Positional Embedding (RoPE)

JetMoE Jupyter notebook

Online Demo on Lepton AI (link)

Also available on Substack

Support this blog

Reasoning without External Rewards

From self-certainty to Reinforcement Learning from Internal Feedback

LLMs: a 5 mins trip

Minimal RAG model

Byte Latent Transformer

Exploring Florence-2

Multimodal AI with Unified Vision-Language Capabilities

Introduction

Model and Data Engine

Image Collection

Data Annotation Process

Annotation-Specific Variations

Coding example

Useful links

Phi-3 models

Tiny but mighty

Introduction

Training strategy

Technical specifications

Benchmarks

Phi-3-Vision – quick testing

Useful links

Local RAG using LLaMA3

Moondream

“A tiny open-source computer-vision model that runs everywhere and kicks ass” (cit.)

Limitations

Useful links

JetMoE

LLM training can be much cheaper than people generally thought

Training costs

Key Messages

How JetMoE works

A little coding

Useful links

From self-certainty to Reinforcement Learning from Internal Feedback

An efficient and effective reinforcement learning algorithm

Multimodal AI with Unified Vision-Language Capabilities

Introduction

Model and Data Engine

Image Collection

Data Annotation Process

Annotation-Specific Variations

Coding example

Useful links

Tiny but mighty

Introduction

Training strategy

Technical specifications

Benchmarks

Phi-3-Vision – quick testing

Useful links

“A tiny open-source computer-vision model that runs everywhere and kicks ass” (cit.)

Limitations

Useful links

LLM training can be much cheaper than people generally thought

Training costs

Key Messages

How JetMoE works

A little coding

Useful links