Blog
Using olmOCR LLM to extract text from images
This blog post describes how to use the OpenAI API to convert images into clean plain text by sending Document Anchoring prompts to a locally deployed olmOCR-7B-0225-preview LLM.
Intro
In this blog I want to explain how to run an LLM deployed locally using LM Studio, and how you can use the OpenAI SDK to interact with this vision-enabled LLM named olmOCR.
💻 LM Studio
LM Studio is a local desktop application that allows you to run and interact with large language models (LLMs) directly on your computer without relying on cloud-based services. It has a user-friendly interface for loading, managing, and using open-source models such as those from Hugging Face and other repositories. By enabling offline access to powerful AI models, LM Studio enhances privacy, reduces dependency on internet connectivity, and allows developers and researchers to experiment with LLMs in a controlled environment. It is particularly useful when you need AI capabilities without exposing sensitive data to external servers.
Installation
Installing LM Studio is easy: just go to the download page and select your OS, architecture, and version.
Models
You can run any compatible Large Language Model (LLM) from Hugging Face, in the GGUF (llama.cpp) format as well as in the MLX format (Mac only).
You can also run GGUF text embedding models.
GGUF / llama.cpp
GGUF (Generic GPT Unified Format) is a highly optimized file format used in llama.cpp, a fast and lightweight inference framework for running large language models (LLMs) such as Meta’s LLaMA models on local hardware. GGUF was introduced as an improvement over previous formats like GGML and GGJT, providing better efficiency, flexibility, and compatibility.
For more details, see also this blog post.
Load Model
Before you can use a model, you need to search for it and download it in LM Studio. The next screenshots show how to download “olmOCR-7B-0225-preview”:
Discover:

Search and Download the model:

💡 Note
Notice that this specific LLM has “Vision Mode Enabled”, which means that this model can process and analyze image inputs.
Use Model
Now load this model in LM Studio:

Headless mode
Whilst not necessary for running a VLM, LM Studio can also run in Headless mode (since version 0.3.5), which allows you to run models without a GUI. This is great for background processing tasks.
To enable headless mode, go to the settings menu and check the “Enable Local LLM Service” option:

CLI
In addition to Headless mode, you can also use the CLI to interact with LM Studio. Check out the documentation for more information on how to set this up.
Let’s see what the lms command shows:
PowerShell 7.5.0
PS C:\Users\StefHeyenrath> lms

lms - LM Studio CLI - v0.0.39
GitHub: https://github.com/lmstudio-ai/lmstudio-cli

Usage
lms <subcommand>

where <subcommand> can be one of:

- status - Prints the status of LM Studio
- server - Commands for managing the local server
- ls - List all downloaded models
- ps - List all loaded models
- get - Searching and downloading a model from online.
- load - Load a model
- unload - Unload a model
- create - Create a new project with scaffolding
- log - Log operations. Currently only supports streaming logs from LM Studio via `lms log stream`
- import - Import a model file into LM Studio
- bootstrap - Bootstrap the CLI
- version - Prints the version of the CLI

For more help, try running `lms <subcommand> --help`
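For example, based on the subcommand list above, you can start the local server and inspect your downloaded and loaded models from the command line (check `lms server --help` for the exact options on your version):

lms server start
lms ls
lms ps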
Local endpoints
Once a model is loaded, it is available for use within LM Studio, but it is also accessible via an OpenAI-compatible HTTP interface at http://127.0.0.1:1234.
The following calls are supported:
- GET /v1/models: List the currently loaded models.
- POST /v1/chat/completions: Chat Completions. Send a chat history to the model to predict the next assistant response.
- POST /v1/completions: Text Completions mode. Predict the next token(s) given a prompt. Note: OpenAI considers this endpoint deprecated.
- POST /v1/embeddings: Text Embedding. Generate text embeddings for a given text input. Takes a string or array of strings.
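A quick way to verify that the local server is reachable is to call the /v1/models endpoint. A minimal C# sketch (assuming the default port 1234):

using System.Net.Http;

// List the models currently loaded in LM Studio via the OpenAI-compatible endpoint.
using var httpClient = new HttpClient();
var modelsJson = await httpClient.GetStringAsync("http://127.0.0.1:1234/v1/models");
Console.WriteLine(modelsJson);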
💻 OpenAI SDK
The OpenAI SDK is normally used to access the OpenAI API; however, by adjusting the endpoint URL, you can configure the OpenAI client to connect to your local LM Studio endpoint instead of the internet-facing OpenAI endpoint.
The LM Studio endpoint is accessible at http://127.0.0.1:1234/v1, so you can use the following C# code to create an OpenAIClient which connects to the locally running LM Studio. Note that you don’t need any credentials.
using System.ClientModel;
using OpenAI;
ApiKeyCredential credential = new ApiKeyCredential("not needed for LM Studio");
OpenAIClientOptions options = new OpenAIClientOptions
{
Endpoint = new Uri("http://127.0.0.1:1234/v1")
};
OpenAIClient client = new OpenAIClient(credential, options);
Chat
When you want to interact with an LLM using the SDK, the following steps are needed:
- Define a system message to give the LLM context
- Define a user message with the question or command the LLM should execute
- Create a ChatClient
- Provide the system and user message to the ChatClient to initiate the chat
System Message
A system message gives the LLM more context. If you want to instruct the LLM to act as an OCR system, you can use this prompt:
You are an advanced Optical Character Recognition (OCR) system designed to extract text from images with the highest accuracy. Your goal is to recognize and extract all readable text from an image while preserving its structure and formatting as much as possible. Follow these guidelines:

### Text Extraction Guidelines:
1. Maximize Accuracy: Identify and extract all visible text, including printed, handwritten, and stylized fonts.
2. Preserve Formatting: Maintain line breaks, spacing, and paragraph structures where applicable.
3. Handle Noise & Distortions: Recognize text even if it is slightly blurred, tilted, or obstructed by minor elements.
4. Support Multiple Languages: Detect and extract text in different languages when present.
5. Ignore Non-Text Elements: Avoid extracting background patterns, watermarks, or irrelevant visual artifacts.
6. Extract Special Characters: Capture symbols, numbers, punctuation marks, and mathematical notations correctly.
7. Process Tables & Lists: Retain the structured format of tables, bullet points, and numbered lists where applicable.

### Output Format:
- If plain text is required, return the extracted text in a simple, structured format.
- If formatting is crucial, return the text in Markdown, JSON, or a structured document format as per user request.
- If text is unclear or incomplete, indicate the uncertainty using placeholders (e.g., `[illegible]`).

You are optimized for precision and clarity. Extract the text exactly as it appears in the image, ensuring a high-quality output.
Creating a system message using the C# SDK can be done with this code:
SystemChatMessage systemChatMessage = new SystemChatMessage(Prompts.OcrSystemPrompt);
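The Prompts.OcrSystemPrompt used here is not part of the SDK; it is simply a constant that holds the system prompt shown above. A minimal sketch (the class and property names are my own):

// Holds the prompt texts used in this post.
public static class Prompts
{
    // The full OCR system prompt shown above; shortened here for readability.
    public const string OcrSystemPrompt =
        "You are an advanced Optical Character Recognition (OCR) system designed to extract text from images with the highest accuracy. ...";
}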
User Message
Creating a user message with a question using the C# SDK can be done with this code:
UserChatMessage userChatMessage = new UserChatMessage("Tell a joke about OCR.");
ChatClient
Creating a ChatClient, sending the system and user messages to the LLM, and getting the response text can be done with this code:
var systemChatMessage = new SystemChatMessage(Prompts.OcrSystemPrompt);
var userChatMessage = new UserChatMessage("Tell a joke about OCR.");
var chatClient = client.GetChatClient("[Model]"); // Provide the model name here...
var response = await chatClient.CompleteChatAsync(systemChatMessage, userChatMessage);
var text = string.Concat(response.Value.Content.Select(c => c.Text)).Trim();
👁️ Vision enabled Chat
The examples above only send a text user message. For the OCR scenario, we want to upload an image and ask a question about it.
To do this, we need to read the image and use the CreateTextPart and CreateImagePart factory methods on ChatMessageContentPart to create a user message that contains two parts:
var imageBytes = await File.ReadAllBytesAsync("[Image Path]");
var userChatMessage = new UserChatMessage
(
ChatMessageContentPart.CreateTextPart("Extract all text."),
ChatMessageContentPart.CreateImagePart(BinaryData.FromBytes(imageBytes), "[Mime Type]")
);
This code will POST the following JSON body to the /v1/chat/completions endpoint:
{
"messages": [
{
"role": "system",
"content": "You are an advanced Optical Character Recognition ... ...ears in the image, ensuring a high-quality output."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract all text."
},
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAtQA... ...YAgAQCNQAAJBCoAQAgwf8HAJGK1t/DmKMAAAAASUVORK5CYII="
}
}
]
}
],
"model": "olmocr-7b-0225-preview"
}
As you can see, the user message contains two parts: the first is the text message, and the second is a Base64-encoded image.
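Putting it all together, the vision-enabled request can be sent with the same ChatClient as in the text-only example. A sketch (the image path is a placeholder; the model name and MIME type match the JSON body above):

var imageBytes = await File.ReadAllBytesAsync("[Image Path]");

var systemChatMessage = new SystemChatMessage(Prompts.OcrSystemPrompt);
var userChatMessage = new UserChatMessage
(
    ChatMessageContentPart.CreateTextPart("Extract all text."),
    ChatMessageContentPart.CreateImagePart(BinaryData.FromBytes(imageBytes), "image/png")
);

var chatClient = client.GetChatClient("olmocr-7b-0225-preview");
var response = await chatClient.CompleteChatAsync(systemChatMessage, userChatMessage);
var text = string.Concat(response.Value.Content.Select(c => c.Text)).Trim();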
olmOCR
Markdown
olmOCR is capable of processing a diversity of document types, covering different domains as well as visual layouts.
It uses Markdown to represent structured content, such as sections, lists, equations and tables.
The output from olmOCR is a JSON message which conforms to this JSON Schema:
{
"name": "page_response",
"schema": {
"type": "object",
"properties": {
"primary_language": {
"type": ["string", "null"],
"description": "..."
},
"is_rotation_valid": {
"type": "boolean",
"description": "..."
},
"rotation_correction": {
"type": "integer",
"description": "...",
"enum": [0, 90, 180, 270],
"default": 0
},
"is_table": {
"type": "boolean",
"description": "..."
},
"is_diagram": {
"type": "boolean",
"description": "..."
},
"natural_text": {
"type": ["string", "null"],
"description": "..."
}
},
"additionalProperties": false,
"required": [
"primary_language",
"is_rotation_valid",
"rotation_correction",
"is_table",
"is_diagram",
"natural_text"
]
},
"strict": true
}
For a full description, see olmOCR – Paper.
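If you want to work with this response in C#, the schema maps naturally to a small record that System.Text.Json can deserialize into. A sketch (the type and property names are my own; the JSON names follow the schema above):

using System.Text.Json.Serialization;

// Maps the olmOCR page_response JSON schema to a C# type.
public record PageResponse(
    [property: JsonPropertyName("primary_language")] string? PrimaryLanguage,
    [property: JsonPropertyName("is_rotation_valid")] bool IsRotationValid,
    [property: JsonPropertyName("rotation_correction")] int RotationCorrection,
    [property: JsonPropertyName("is_table")] bool IsTable,
    [property: JsonPropertyName("is_diagram")] bool IsDiagram,
    [property: JsonPropertyName("natural_text")] string? NaturalText
);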
Document-Anchoring
Many end-to-end OCR models exclusively rely on rasterized pages to convert documents and images to plain text; that is, they process images of the document pages as input to autoregressively decode text tokens. This approach, while offering great compatibility with image-only digitization pipelines, misses the fact that most PDFs are born-digital documents and thus already contain either digitized text or other metadata that would help in correctly linearizing the content.
In contrast, the olmOCR pipeline leverages document text and metadata. This approach is called document-anchoring.

The above figure provides an overview of the methods used by olmOCR; document-anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted from the PDF binary file. Crucially, the anchored text is provided as input to the VLM alongside a rasterized image of the page.
Using Document-Anchoring for images
The goal is to use olmOCR to extract text from the provided image, which means that we do not have a PDF as source. This should not limit us, because olmOCR also supports images.
To fully utilize the capabilities of olmOCR, we need to provide some Document-Anchoring data in order to get better results.
This is done with these steps:
- Read the source image on which the olmOCR LLM needs to perform OCR.
- Resize the image to A4 format.
- Make sure that the maximum height does not exceed the recommended maximum height (e.g. 1024 pixels).
- Use the resized image, the new page dimensions (in pixels) and the location of the image on that page (in pixels) to build a Document Anchoring prompt (a sketch follows after this list).
- Provide that Document Anchoring prompt to olmOCR.
- Deserialize the JSON response and extract the natural_text as Markdown.
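Building the Document Anchoring prompt itself is plain string formatting. A sketch, assuming you already know the resized page dimensions and the image position in pixels (the method name and parameters are my own; the template follows the example prompt shown below):

// Builds a Document Anchoring prompt for an image placed on a resized (A4 aspect-ratio) page.
static string BuildDocumentAnchoringPrompt(double pageWidth, double pageHeight, int x1, int y1, int x2, int y2)
{
    // Use the invariant culture so the dimensions always use a '.' as decimal separator.
    var pageDimensions = FormattableString.Invariant($"{pageWidth:0.0}x{pageHeight:0.0}");

    return
        "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. " +
        "Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate.\n" +
        "RAW_TEXT_START\n" +
        $"Page dimensions: {pageDimensions}\n" +
        $"[Image {x1}x{y1} to {x2}x{y2}]\n" +
        "RAW_TEXT_END";
}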
Example
Example image with text:

For the above image with text, we need to resize it to the A4 aspect-ratio and get the new page dimensions (in pixels) and the location of the image on that page (in pixels) to build a Document Anchoring prompt.
The image will be resized to:

The Document Anchoring prompt for this image looks like:
Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate.
RAW_TEXT_START
Page dimensions: 724.0x1024.0
[Image 0x414 to 724x610]
RAW_TEXT_END
💡 Note
The Document Anchoring prompt above contains the text “Do not hallucinate”.
Adding this does not 100% guarantee the elimination of hallucinations (i.e., confident but false or misleading responses), but it might help a bit:
- LLMs are sensitive to instructions, so explicitly stating “Do not hallucinate” could encourage the model to be more cautious.
- It might shift the model toward generating more grounded and fact-checked responses.
- Certain models trained with alignment and reinforcement learning could interpret this as a signal to avoid making up information.
When using the olmOCR-7B-0225-preview model, the default temperature value (which defines how much randomness is used; a value of 0 will yield the same result every time, while higher values increase creativity and variance) is set to 0.1, which means that there will be very little randomness in the response (a sketch of setting the temperature explicitly follows the list below):
- The model’s output will be mostly consistent across repeated queries.
- However, because the value is not exactly 0, there will be a small amount of variability, allowing for occasional minor differences in phrasing or word choice.
- The response will still prioritize predictability and coherence over creativity.
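If you want to control this yourself, the OpenAI SDK allows you to pass a ChatCompletionOptions instance to CompleteChatAsync. A sketch that sets the temperature to 0 for fully deterministic output (whether the server honors it depends on your LM Studio configuration):

var chatOptions = new ChatCompletionOptions
{
    Temperature = 0f // 0 = deterministic, higher values add randomness
};

var response = await chatClient.CompleteChatAsync(
    new ChatMessage[] { systemChatMessage, userChatMessage },
    chatOptions);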
When using this prompt and the attached image on the olmOCR-7B-0225-preview model, the JSON result looks like:
{
"primary_language": "en",
"is_rotation_valid": true,
"rotation_correction": 0,
"is_table": false,
"is_diagram": false,
"natural_text": "1 Introduction\n\nAccess to clean, coherent textual data is a crucial component in the life cycle of modern language models (LMs). During model development, LMs require training on trillions of tokens derived from billions of documents (Schuh et al., 2024; Pesando et al., 2024; Li et al., 2014); errors from noisy or low fidelity content extraction and representation can result in training instabilities or even worse downstream performance (Pesando et al., 2023; Li et al., 2024; OhMo et al., 2024). During inference, LMs are often prompted with plain text representations of relevant document content to ground user prompts; for example, consider information extraction (Kim et al., 2021) or AI reading assistance (Li et al., 2024) over a user-provided document and cascading downstream errors due to low quality representation of the source document."
}
💡 Note
The source text does not contain tables, which is why the natural_text does not contain any Markdown syntax, just plain text.
Another thing to keep in mind is that newline characters are returned as \n, so in order to format this natural_text and write it to an output file, some replacements have to be made to turn \n into a real newline.
If needed, some other characters can also be sanitized.
Example C# method:
private static string Sanitize(string text)
{
    return text
        .Replace("\\n", "\r\n")   // replace literal "\n" sequences with real newlines
        .Replace("\u2019", "'")   // curly apostrophe (’) to straight apostrophe
        .Replace("\u201C", "\"")  // left curly quote (“) to straight quote
        .Replace("\u201D", "\""); // right curly quote (”) to straight quote
}
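Combined with the PageResponse record from earlier, deserializing the model output and writing the sanitized text to disk could look like this (a sketch; text is the response text from the chat call and the output path is a placeholder):

using System.Text.Json;

// Deserialize the olmOCR JSON response and write the sanitized natural_text to a file.
var pageResponse = JsonSerializer.Deserialize<PageResponse>(text);
if (pageResponse?.NaturalText is { } naturalText)
{
    await File.WriteAllTextAsync("[Output Path]", Sanitize(naturalText));
}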
For a POC example project, see olmOcr Example.
Bonus
Because the olmOCR-7B-0225-preview model is fine-tuned from a Qwen2-VL-7B-Instruct LLM, we can also use the olmOCR LLM to answer questions about the text content.
So for the above text, we can ask the olmOCR-7B-0225-preview model a question like: “Explain what can cause bad downstream performance.”.
The answer will be like: “Bad downstream performance can be caused by errors from noisy or low fidelity content extraction and representation during model development, which can result in training instabilities. During inference, LMs are often prompted with plain text representations of relevant document content to ground user prompts, and cascading downstream errors due to low quality representation of the source document can also lead to bad performance.”.
For a POC example project, see olmOcr Question & Answer Example.
Conclusion
In this blog post, I’ve described how to use the olmOCR-7B-0225-preview LLM to extract text from images by leveraging LM Studio for local deployment and the OpenAI SDK for interaction. I’ve covered the essential steps, including installing and configuring LM Studio, loading the model, and enabling vision-based OCR through structured Document Anchoring prompts.
With the use of olmOCR-7B-0225-preview, you can achieve high-quality text extraction while maintaining privacy and full control over the process. Additionally, the model’s ability to understand structured layouts, such as tables and equations, ensures a more accurate and organized output. This approach enhances OCR accuracy beyond traditional methods by combining rasterized image processing with document-anchored text extraction.
If you’re looking to experiment with local OCR models or develop custom document parsing solutions, olmOCR-7B-0225-preview offers a powerful and flexible toolset which can be used in C#.
Links
Notes
Some content in this blog is created with the help of an AI. I did review and revise the content where needed.
Written by: Stef Heyenrath
Stef started writing software for the Microsoft .NET framework in 2007. Over the years, he has developed into a Microsoft specialist with experience in backend technologies such as .NET, NETStandard, ASP.NET, Ethereum, Azure, and other cloud providers. In addition, he has worked with several frontend technologies such as Blazor, React, Angular, and Vue.js.
He is the author of WireMock.Net.
Mission: Writing quality and structured software with passion in a scrum team for technically challenging projects.
Want to know more about our experts? Contact us!
