Blog

Using olmOCR LLM to extract text from images

This blog post describes how to use the OpenAI SDK to convert images into clean plain text by sending Document Anchoring prompts to a locally deployed olmOCR-7B-0225-preview LLM.

🌟 Intro

In this blog I want to explain how to run a locally deployed LLM using LM Studio and how you can use the OpenAI SDK to interact with this vision-enabled LLM named olmOCR.

 

💻 LM Studio

LM Studio is a local desktop application that allows you to run and interact with large language models (LLMs) directly on your computer without relying on cloud-based services. It has a user-friendly interface for loading, managing, and using open-source models such as those from Hugging Face and other repositories. By enabling offline access to powerful AI models, LM Studio enhances privacy, reduces dependency on internet connectivity, and allows developers and researchers to experiment with LLMs in a controlled environment. It is particularly useful when you need AI capabilities without exposing sensitive data to external servers.

Installation

Installing LM Studio is easy: just go to the download page and select your OS, architecture and version.

Models

You can run any compatible Large Language Model (LLM) from Hugging Face, both in GGUF (llama.cpp) format, as well as in the MLX format (Mac only).
You can also run GGUF text embedding models.

GGUF / llama.cpp

GGUF (Generic GPT Unified Format) is a highly optimized file format used in llama.cpp, a fast and lightweight inference framework for running large language models (LLMs) such as Meta’s LLaMA models on local hardware. GGUF was introduced as an improvement over previous formats like GGML and GGJT, providing better efficiency, flexibility, and compatibility.
For more details, see also this blog post.

Load Model

Before you can use a model, you need to search for it and download it in LM Studio. See the following screenshots on how to download “olmOCR-7B-0225-preview”:

Discover:
LM Studio Discover

Search and Download the model:
LM Studio Search and Download

💡 Note
Notice that this specific LLM has “Vision Mode Enabled”, which means that this model can process and analyze image inputs.

Use Model

Now load this model in LM Studio:
LM Studio Load Model

Headless mode

Whilst not necessary for running a VLM, LM Studio can also run in Headless mode (since version 0.3.5), which allows you to run models without a GUI. This is great for background processing tasks.
To enable headless mode, go to the settings menu and check the “Enable Local LLM Service” option:
LM Studio Headless

CLI

In addition to Headless mode, you can also use the CLI to interact with LM Studio. Check out the documentation for more information on how to set this up.
Let’s see what the lms command shows:

PowerShell 7.5.0  
PS C:\Users\StefHeyenrath> lms  
   __   __  ___  ______          ___        _______   ____  
  / /  /  |/  / / __/ /___ _____/ (_)__    / ___/ /  /  _/  
 / /__/ /|_/ / _\ \/ __/ // / _  / / _ \  / /__/ /___/ /  
/____/_/  /_/ /___/\__/\_,_/\_,_/_/\___/  \___/____/___/  

lms - LM Studio CLI - v0.0.39  
GitHub: https://github.com/lmstudio-ai/lmstudio-cli  

Usage  
lms <subcommand>  

where <subcommand> can be one of:  

- status - Prints the status of LM Studio  
- server - Commands for managing the local server  
- ls - List all downloaded models  
- ps - List all loaded models  
- get - Searching and downloading a model from online.  
- load - Load a model  
- unload - Unload a model  
- create - Create a new project with scaffolding  
- log - Log operations. Currently only supports streaming logs from LM Studio via `lms log stream`  
- import - Import a model file into LM Studio  
- bootstrap - Bootstrap the CLI  
- version - Prints the version of the CLI  

For more help, try running `lms <subcommand> --help`

 

Local endpoints

Once a model is loaded, it is available for use within LM Studio, but it is also accessible via an OpenAI-compatible HTTP interface at http://127.0.0.1:1234.

The following calls are supported:

  • GET /v1/models: List the currently loaded models.
  • POST /v1/chat/completions: Chat completions. Send a chat history to the model to predict the next assistant response.
  • POST /v1/completions: Text Completions mode. Predict the next token(s) given a prompt. Note: OpenAI considers this endpoint deprecated.
  • POST /v1/embeddings: Text Embedding. Generate text embeddings for a given text input. Takes a string or array of strings.
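To quickly check what the server exposes, you can also call the models endpoint directly. Below is a minimal C# sketch using HttpClient, assuming LM Studio runs on its default port; no API key is required:

using System.Net.Http;

using var httpClient = new HttpClient();

// Query the OpenAI-compatible models endpoint of the local LM Studio server.
var json = await httpClient.GetStringAsync("http://127.0.0.1:1234/v1/models");

// The response is a JSON object describing the loaded models.
Console.WriteLine(json);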

 

💻 OpenAI SDK

The OpenAI SDK is normally intended for accessing the OpenAI API. However, by adjusting the endpoint URL, you can configure the OpenAI client to connect to your local LM Studio endpoint instead of the internet-facing OpenAI endpoint.

The LM Studio endpoint is accessible at http://127.0.0.1:1234/v1, so you can use the following C# code to create an OpenAIClient which connects to the locally running LM Studio. Note that you don’t need any credentials.

using System.ClientModel;
using OpenAI;

ApiKeyCredential credential = new ApiKeyCredential("not needed for LM Studio");
OpenAIClientOptions options = new OpenAIClientOptions
{
    Endpoint = new Uri("http://127.0.0.1:1234/v1")
};

OpenAIClient client = new OpenAIClient(credential, options);

 

Chat

When you want to interact with an LLM using the SDK, the following steps are needed:

  1. Define a system message to give the LLM context
  2. Define a user message with the question or command the LLM should execute
  3. Create a ChatClient
  4. Provide the system and user message to the ChatClient to initiate the chat

System Message

A system message gives the LLM more context. If you want to instruct the LLM about its OCR task, you can use this prompt:

You are an advanced Optical Character Recognition (OCR) system designed to extract text from images with the highest accuracy.
 Your goal is to recognize and extract all readable text from an image while preserving its structure and formatting as much as possible. 
 Follow these guidelines:  

 ### Text Extraction Guidelines:
 1. Maximize Accuracy: Identify and extract all visible text, including printed, handwritten, and stylized fonts.  
 2. Preserve Formatting: Maintain line breaks, spacing, and paragraph structures where applicable.  
 3. Handle Noise & Distortions: Recognize text even if it is slightly blurred, tilted, or obstructed by minor elements.  
 4. Support Multiple Languages: Detect and extract text in different languages when present.  
 5. Ignore Non-Text Elements: Avoid extracting background patterns, watermarks, or irrelevant visual artifacts.  
 6. Extract Special Characters: Capture symbols, numbers, punctuation marks, and mathematical notations correctly.  
 7. Process Tables & Lists: Retain the structured format of tables, bullet points, and numbered lists where applicable.  

 ### Output Format:  
 - If plain text is required, return the extracted text in a simple, structured format.  
 - If formatting is crucial, return the text in Markdown, JSON, or a structured document format as per user request.  
 - If text is unclear or incomplete, indicate the uncertainty using placeholders (e.g., `[illegible]`).  

 You are optimized for precision and clarity. Extract the text exactly as it appears in the image, ensuring a high-quality output.

Creating a system message using the C# SDK can be done with this code:

SystemChatMessage systemChatMessage = new SystemChatMessage(Prompts.OcrSystemPrompt);

User Message

Creating a user message with a question using the C# SDK can be done with this code:

UserChatMessage userChatMessage = new UserChatMessage("Tell a joke about OCR.");

ChatClient

Creating a ChatClient, sending the system and user messages to the LLM, and getting the response text can be done with this code:

var systemChatMessage = new SystemChatMessage(Prompts.OcrSystemPrompt);
var userChatMessage = new UserChatMessage("Tell a joke about OCR.");

var chatClient = client.GetChatClient("[Model]"); // Provide the model name here...

var response = await chatClient.CompleteChatAsync(systemChatMessage, userChatMessage);
var text = string.Concat(response.Value.Content.Select(c => c.Text)).Trim();

 

πŸ‘οΈ Vision enabled Chat

The examples above only send a textual user message; for the OCR scenario, we want to upload an image and ask a question.
To do this, we need to read the image and use the CreateTextPart and CreateImagePart factory methods on ChatMessageContentPart to create a user message which contains two parts:

var imageBytes = await File.ReadAllBytesAsync("[Image Path]");

var userChatMessage = new UserChatMessage
(
    ChatMessageContentPart.CreateTextPart("Extract all text."),
    ChatMessageContentPart.CreateImagePart(BinaryData.FromBytes(imageBytes), "[Mime Type]")
);

This code will POST the following JSON body to the /v1/chat/completions endpoint:

{
  "messages": [
    {
      "role": "system",
      "content": "You are an advanced Optical Character Recognition ...  ...ears in the image, ensuring a high-quality output."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all text."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAtQA...  ...YAgAQCNQAAJBCoAQAgwf8HAJGK1t/DmKMAAAAASUVORK5CYII="
          }
        }
      ]
    }
  ],
  "model": "olmocr-7b-0225-preview"
}

As you can see, the user message contains two parts: the first part is the text message, the second part is a Base64-encoded image.
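Sending this vision message works exactly like the text-only chat shown earlier. A minimal sketch, reusing the chatClient and the systemChatMessage with the OCR system prompt from the previous sections:

var response = await chatClient.CompleteChatAsync(systemChatMessage, userChatMessage);
var text = string.Concat(response.Value.Content.Select(c => c.Text)).Trim();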

 

📑 olmOCR

Markdown

olmOCR is capable of processing a diversity of document types, covering different domains as well as visual layouts.
It uses Markdown to represent structured content, such as sections, lists, equations and tables.

The output from olmOCR is a JSON message which conforms to this JSON Schema:

{
  "name": "page_response",
  "schema": {
    "type": "object",
    "properties": {
      "primary_language": {
        "type": ["string", "null"],
        "description": "..."
      },
      "is_rotation_valid": {
        "type": "boolean",
        "description": "..."
      },
      "rotation_correction": {
        "type": "integer",
        "description": "...",
        "enum": [0, 90, 180, 270],
        "default": 0
      },
      "is_table": {
        "type": "boolean",
        "description": "..."
      },
      "is_diagram": {
        "type": "boolean",
        "description": "..."
      },
      "natural_text": {
        "type": ["string", "null"],
        "description": "..."
      }
    },
    "additionalProperties": false,
    "required": [
      "primary_language",
      "is_rotation_valid",
      "rotation_correction",
      "is_table",
      "is_diagram",
      "natural_text"
    ]
  },
  "strict": true
}

For a full description, see olmOCR – Paper.
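To work with this response in C#, you can map the schema to a small record and deserialize the model output. The following is a minimal sketch; the PageResponse type and its property names are my own and may differ from the POC example project:

using System.Text.Json;
using System.Text.Json.Serialization;

public record PageResponse(
    [property: JsonPropertyName("primary_language")] string? PrimaryLanguage,
    [property: JsonPropertyName("is_rotation_valid")] bool IsRotationValid,
    [property: JsonPropertyName("rotation_correction")] int RotationCorrection,
    [property: JsonPropertyName("is_table")] bool IsTable,
    [property: JsonPropertyName("is_diagram")] bool IsDiagram,
    [property: JsonPropertyName("natural_text")] string? NaturalText);

// 'text' is the response text returned by the chat completion call.
PageResponse? page = JsonSerializer.Deserialize<PageResponse>(text);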

Document-Anchoring

Many end-to-end OCR models exclusively rely on rasterized pages to convert documents and images to plain text; that is, they process images of the document pages as input to autoregressively decode text tokens. This approach, while offering great compatibility with image-only digitization pipelines, misses the fact that most PDFs are born-digital documents and thus already contain either digitized text or other metadata that would help in correctly linearizing the content.

In contrast, the olmOCR pipeline leverages document text and metadata. This approach is called document-anchoring.

OlmOCR - Document Anchoring
The above figure provides an overview of the methods used by olmOCR; document-anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted from the PDF binary file. Crucially, the anchored text is provided as input to any VLM alongside a rasterized image of the page.

Using Document-Anchoring for images

The goal is to use olmOCR to extract text from the provided image, which means that we do not have a PDF as the source. This does not prevent us from using olmOCR, because this LLM also supports images.

To fully utilize the capabilities of olmOCR, we need to provide some Document-Anchoring data in order to get better results.
This is done with the following steps; a C# sketch of steps 2 to 4 follows the list:

  1. Read the source image on which the olmOCR LLM needs to do OCR.
  2. Resize the image to A4 format.
  3. Make sure that the height does not exceed the recommended maximum (e.g. 1024 pixels).
  4. Use the resized image, the new page dimensions (in pixels) and the location of the image on that page (in pixels) to build a Document Anchoring prompt.
  5. Provide that Document Anchoring prompt to olmOCR.
  6. Deserialize the JSON response and extract the natural_text as Markdown.
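To illustrate steps 2 to 4, here is a minimal C# sketch that computes the page dimensions and image location and builds the anchoring prompt used in the example below. The method name, the A4 ratio, the 1024-pixel maximum and the vertical centering rule are my own assumptions for illustration; the POC example project may implement this differently:

// Builds a Document Anchoring prompt for an image of the given size (in pixels).
// The page uses the A4 aspect ratio with a maximum height of 1024 pixels; the image
// is scaled to the full page width and centered vertically on the page.
private static string BuildDocumentAnchoringPrompt(int imageWidth, int imageHeight)
{
    const double A4Ratio = 210.0 / 297.0; // A4 width divided by A4 height
    const double PageHeight = 1024.0;     // recommended maximum height in pixels

    double pageWidth = Math.Round(PageHeight * A4Ratio);
    double scaledImageHeight = imageHeight * (pageWidth / imageWidth);
    double top = Math.Round((PageHeight - scaledImageHeight) / 2);
    double bottom = Math.Round(top + scaledImageHeight);

    string rawText = FormattableString.Invariant(
        $"Page dimensions: {pageWidth:0.0}x{PageHeight:0.0}\n[Image 0x{top:0} to {pageWidth:0}x{bottom:0}]\n");

    return
        "Below is the image of one page of a document, as well as some raw textual content " +
        "that was previously extracted for it. Just return the plain text representation of " +
        "this document as if you were reading it naturally.\n" +
        "Do not hallucinate.\n" +
        "RAW_TEXT_START\n" +
        rawText +
        "\nRAW_TEXT_END";
}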

Example

Example image with text:
Example Text

For the above image with text, we need to resize it to the A4 aspect ratio and get the new page dimensions (in pixels) and the location of the image on that page (in pixels) to build a Document Anchoring prompt.
The image will be resized to:
Resized

The Document Anchoring prompt for this image looks like:

Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally.
Do not hallucinate.
RAW_TEXT_START
Page dimensions: 724.0x1024.0
[Image 0x414 to 724x610]

RAW_TEXT_END

💡 Note
In the Document Anchoring prompt above, there is the instruction “Do not hallucinate.”.
Adding this does not 100% guarantee the elimination of hallucinations (i.e., confident but false or misleading responses), but it might help a bit:

  • LLMs are sensitive to instructions, so explicitly stating “Do not hallucinate” could encourage the model to be more cautious.
  • It might shift the model toward generating more grounded and fact-checked responses.
  • Certain models trained with alignment and reinforcement learning could interpret this as a signal to avoid making up information.

When using the olmOCR-7B-0225-preview model, the default temperature value is set to 0.1. Temperature defines how much randomness is used: a value of 0 will yield the same result every time, while higher values will increase creativity and variance. A value of 0.1 means that there will be very little randomness in the response (a sketch showing how to set this value from the SDK follows the list below).

  • The model’s output will be mostly consistent across repeated queries.
  • However, because the value is not exactly 0, there will be a small amount of variability, allowing for occasional minor differences in phrasing or word choice.
  • The response will still prioritize predictability and coherence over creativity.
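If you want to pin the temperature explicitly from the C# SDK, you can pass a ChatCompletionOptions instance to the chat call. A minimal sketch; whether LM Studio applies this per-request value may depend on your server configuration:

using OpenAI.Chat;

var options = new ChatCompletionOptions
{
    Temperature = 0f // 0 = (almost) deterministic output; higher values add randomness
};

var response = await chatClient.CompleteChatAsync(
    new ChatMessage[] { systemChatMessage, userChatMessage },
    options);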

When using this prompt and the attached image on the olmOCR-7B-0225-preview model, the JSON result looks like:

{
  "primary_language": "en",
  "is_rotation_valid": true,
  "rotation_correction": 0,
  "is_table": false,
  "is_diagram": false,
  "natural_text": "1 Introduction\n\nAccess to clean, coherent textual data is a crucial component in the life cycle of modern language models (LMs). During model development, LMs require training on trillions of tokens derived from billions of documents (Schuh et al., 2024; Pesando et al., 2024; Li et al., 2014); errors from noisy or low fidelity content extraction and representation can result in training instabilities or even worse downstream performance (Pesando et al., 2023; Li et al., 2024; OhMo et al., 2024). During inference, LMs are often prompted with plain text representations of relevant document content to ground user prompts; for example, consider information extraction (Kim et al., 2021) or AI reading assistance (Li et al., 2024) over a user-provided document and cascading downstream errors due to low quality representation of the source document."
}

💡 Note
The source text does not contain tables, which is why the natural_text does not contain any Markdown syntax, just plain text.
Another thing to keep in mind is that newline characters are returned as \n, so in order to format this natural_text and write it to an output file, the \n sequences have to be replaced by real newlines.
If needed, some other characters can be sanitized as well.

Example C# method:

private static string Sanitize(string text)
{
    return text
        .Replace("\\n", "\r\n") // replace literal \n sequences with real newlines
        .Replace("’", "'")      // normalize the right single quotation mark
        .Replace("“", "\"")     // normalize the left double quotation mark
        .Replace("”", "\"");    // normalize the right double quotation mark
}
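Putting it together, here is a minimal sketch that deserializes the JSON response into the hypothetical PageResponse record shown earlier, sanitizes the natural_text and writes it to a Markdown file. Note that System.Text.Json already converts the \n escape sequences into real newline characters during deserialization, so whether the first Replace in Sanitize is needed depends on how you obtain the text:

PageResponse? page = JsonSerializer.Deserialize<PageResponse>(text);

if (page?.NaturalText is { } naturalText)
{
    await File.WriteAllTextAsync("output.md", Sanitize(naturalText));
}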

For a POC example project, see olmOcr Example.

 

🎁 Bonus

Because the olmOCR-7B-0225-preview model is fine-tuned from a Qwen2-VL-7B-Instruct LLM, we can also use the olmOCR LLM to answer questions about the text content.

So for the above text, we can ask the olmOCR-7B-0225-preview model a question like: “Explain what can cause bad downstream performance.”.
The answer will be like: “Bad downstream performance can be caused by errors from noisy or low fidelity content extraction and representation during model development, which can result in training instabilities. During inference, LMs are often prompted with plain text representations of relevant document content to ground user prompts, and cascading downstream errors due to low quality representation of the source document can also lead to bad performance.”.
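A minimal sketch of such a follow-up question, sending the previously extracted naturalText as context together with the question (the prompt wording is my own):

var question = "Explain what can cause bad downstream performance.";

var questionMessage = new UserChatMessage(
    $"Answer the question based on the following text.\n\nText:\n{naturalText}\n\nQuestion: {question}");

var answer = await chatClient.CompleteChatAsync(questionMessage);
var answerText = string.Concat(answer.Value.Content.Select(c => c.Text)).Trim();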

For a POC example project, see olmOcr Question & Answer Example.

 

📌 Conclusion

In this blog post, I’ve described how to use the olmOCR-7B-0225-preview LLM to extract text from images by leveraging LM Studio for local deployment and the OpenAI SDK for interaction. I’ve covered the essential steps, including installing and configuring LM Studio, loading the model, and enabling vision-based OCR through structured Document Anchoring prompts.

With the use of olmOCR-7B-0225-preview, you can achieve high-quality text extraction while maintaining privacy and full control over the process. Additionally, the model’s ability to understand structured layouts, such as tables and equations, ensures a more accurate and organized output. This approach enhances OCR accuracy beyond traditional methods by combining rasterized image processing with document-anchored text extraction.

If you’re looking to experiment with local OCR models or develop custom document parsing solutions, olmOCR-7B-0225-preview offers a powerful and flexible toolset which can be used in C#.

 


 

📝 Notes

Some content in this blog was created with the help of an AI. I reviewed and revised the content where needed.


Stef Heyenrath

Besides being a highly experienced developer, Stef Heyenrath is also the author of several successful NuGet packages, including WireMock.Net.

Written by: Stef Heyenrath

Stef started writing software for the Microsoft .NET framework in 2007. Over the years, he has developed into a Microsoft specialist with experience in backend technologies such as .NET, .NET Standard, ASP.NET, Ethereum, Azure, and other cloud providers. In addition, he has worked with several frontend technologies such as Blazor, React, Angular, and Vue.js.

He is the author of WireMock.Net.

Mission: Writing quality and structured software with passion in a scrum team for technically challenging projects.

Want to know more about our experts? Contact us!
