What is structured data extraction from images?

It means looking at a picture, like a photo of a receipt, and pulling out the useful fields in a tidy format. Instead of one long blob of text, you get neat values such as shop name, date, and total amount, ready to save in a database. A vision model reads the picture, and a schema tells it exactly which fields you want back.

Is Ollama free and does it run on my own computer?

Yes. Ollama is free and open source. It downloads AI models and runs them on your own machine. Your images never leave your computer, so there are no API bills and your data stays private. You just need enough memory, ideally 8 GB or more, to run a small vision model smoothly.

Which Ollama vision model is best for reading documents?

For everyday image description, LLaVA 7B is a solid default. For documents, receipts, and charts where you need accurate fields, Qwen2.5-VL 7B usually reads structure better. Llama 3.2 Vision is another strong choice. Start small, test on your own images, and move to a bigger model only if you need more accuracy.


How to Extract Structured Data From Images Using Ollama in .NET

A beginner-friendly guide to reading text and fields from images using a local Ollama vision model in .NET, and getting back clean, typed data instead of free text.

13 min read · Updated October 25, 2025

Think about a kind shopkeeper near your home. You hand him a crumpled paper receipt and ask, "How much did I spend, and on what day?" He glances at it for a second and says, "Two hundred and forty rupees, last Tuesday, from the grocery store." He looked at a messy piece of paper and gave you back clean, useful answers.

That is exactly what we want our program to do. We will give it a photo (a receipt, an ID card, a form) and we want back tidy fields: shop name, date, total. Not one big paragraph. Neat values we can save in a database.

To do this we will use Ollama, a free tool that runs AI models on your own computer, and .NET. The best part is that the picture never leaves your machine. No cloud bill. No data sent away. We go slow, use simple words, and build it step by step.

What we are building

We are building a small .NET app that:

Loads an image from disk.
Sends it to a local vision model running in Ollama.
Asks for the answer in a fixed shape (a schema).
Gets back a clean C# object we can use right away.

The big picture: a photo goes in, tidy fields come out.

Let us understand the words first, because they sound scary but the ideas are simple.

Three small ideas

A vision model is an AI model that can "see" pictures, not just read text. You give it an image and a question, and it answers in words. Normal language models only read text. Vision models read images too. They are also called multimodal models, because they handle more than one kind of input.

Ollama is a free program that downloads these models and runs them on your computer. Think of it like a music app, but instead of songs it stores AI models, and instead of playing music it answers your questions. It listens on your machine at the address http://localhost:11434.

Structured output means we force the answer into a fixed shape. Without it, the model might reply, "This looks like a grocery bill for about 240 rupees." That sentence is hard for code to use. With structured output, we say, "Give me a JSON object with shopName, date, and total." Now the answer is predictable, and our program can read it safely.
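To see why a fixed shape is easier for code, here is a minimal sketch that uses only the built-in System.Text.Json. The JSON text is a made-up reply, and ReceiptFields is a hypothetical type used only for this illustration.

```csharp
using System;
using System.Text.Json;

// A made-up structured reply with the three fields named above.
string reply = """{"shopName":"Fresh Mart","date":"2025-10-21","total":240}""";

// Web defaults match camelCase JSON names to PascalCase C# names.
var options = new JsonSerializerOptions(JsonSerializerDefaults.Web);
ReceiptFields? fields = JsonSerializer.Deserialize<ReceiptFields>(reply, options);

Console.WriteLine(fields?.ShopName); // Fresh Mart
Console.WriteLine(fields?.Total);    // 240

// Hypothetical type for this sketch only.
public record ReceiptFields(string ShopName, string Date, decimal Total);
```

There is no guessing here. Each value lands in a named property with a real type, which is exactly what a sentence cannot give us.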

From messy photo to clean fields

Photo: a real receipt image.
Model reads: the vision model sees the text.
Schema applied: we demand fixed fields.
Typed object: clean C# values to use.

The journey of one receipt image.

Step 1: Install Ollama and a vision model

First, install Ollama from the official website. It works on Windows, macOS, and Linux. Once it is installed, it runs quietly in the background.

Next, pull a vision model. We will start with a small, popular one. Open your terminal and run this:

ollama pull llava

This downloads the LLaVA model. It is a friendly starting point for reading images. If you later need better accuracy on documents and receipts, you can try a stronger model:

ollama pull qwen2.5vl

You can check that the model is ready with ollama list. Here is a quick guide to help you choose.

Model	Size (approx)	Good for	Memory needed
`llava`	7B	General images, simple text	8 GB+
`qwen2.5vl`	7B	Documents, receipts, charts	8 GB+
`llama3.2-vision`	11B	Detailed reading, harder layouts	12 GB+

Start small. A 7B model runs on most laptops. Only move to a bigger one if your test images come out wrong. Bigger models are slower and need more memory, so do not jump to the largest one without a reason.

Step 2: Set up the .NET project

Now we create a console app. We will use .NET 10, which is the current long-term support version. Run these commands:

dotnet new console -n ReceiptReader
cd ReceiptReader
dotnet add package OllamaSharp
dotnet add package Microsoft.Extensions.AI

Two packages do the heavy lifting:

OllamaSharp is a .NET library that talks to Ollama for us. It is the easiest way to use Ollama from C#.
Microsoft.Extensions.AI gives us a shared interface called IChatClient. This is a common "shape" that many AI providers follow. Because OllamaSharp follows it too, our code looks the same as it would for other providers. If you ever switch providers, you change very little.

How the .NET pieces fit together.

The reason this layering matters is simple: your code only ever speaks to IChatClient. It does not care which library or model is behind it. That keeps your app tidy and easy to change later.
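As a small sketch of that idea, the helper below accepts any IChatClient. The name AskAsync is hypothetical and exists only for this illustration. The connection line uses the same OllamaApiClient constructor that Step 3 explains, and it needs Ollama running to actually print a reply.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using OllamaSharp;

// This helper knows nothing about Ollama. Any IChatClient will do.
static async Task<string> AskAsync(IChatClient chat, string question)
{
    ChatResponse response = await chat.GetResponseAsync(question);
    return response.Text;
}

// Today the engine behind the interface is a local Ollama model.
// Swapping providers would change only this one line.
IChatClient client = new OllamaApiClient(
    new Uri("http://localhost:11434/"),
    "llava");

Console.WriteLine(await AskAsync(client, "Say hello in one short sentence."));
```

The design choice is that only the line that creates the client mentions Ollama. Everything else depends on the interface.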

Step 3: Connect to Ollama

Let us write the connection code. We point OllamaSharp at the local Ollama address and tell it which model to use.

using Microsoft.Extensions.AI;
using OllamaSharp;
 
// Point at the local Ollama server and pick the vision model.
IChatClient client = new OllamaApiClient(
    new Uri("http://localhost:11434/"),
    "llava");
 
// A quick text-only test to confirm the connection works.
var hello = await client.GetResponseAsync("Say hello in one short sentence.");
Console.WriteLine(hello.Text);

Notice the type is IChatClient, not OllamaApiClient. We store it in the interface. This is good practice. Our later code does not need to know the brand of the engine, only that it can chat.

If this prints a friendly hello, your setup works. If it fails, make sure Ollama is running and that you pulled the model. Most early errors are just a model that was not downloaded yet, or Ollama not started.
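If you want the program to tell you which of those two problems you have, a rough sketch like this can help. It calls Ollama's /api/tags endpoint, which lists the models you have pulled. The helper name HasModel and the messages are assumptions made for this example.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;

// Returns true when the JSON from GET /api/tags lists the given model.
// Ollama reports names with a tag, such as "llava:latest".
static bool HasModel(string tagsJson, string model)
{
    using JsonDocument doc = JsonDocument.Parse(tagsJson);
    if (!doc.RootElement.TryGetProperty("models", out JsonElement models) ||
        models.ValueKind != JsonValueKind.Array)
    {
        return false;
    }

    foreach (JsonElement entry in models.EnumerateArray())
    {
        if (!entry.TryGetProperty("name", out JsonElement nameElement))
            continue;

        string name = nameElement.GetString() ?? "";
        if (name == model || name.StartsWith(model + ":", StringComparison.Ordinal))
            return true;
    }
    return false;
}

const string wantedModel = "llava";
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434/") };

try
{
    string json = await http.GetStringAsync("api/tags");
    Console.WriteLine(HasModel(json, wantedModel)
        ? $"Ollama is running and '{wantedModel}' is installed."
        : $"Ollama is running, but '{wantedModel}' is missing. Run: ollama pull {wantedModel}");
}
catch (HttpRequestException)
{
    Console.WriteLine("Could not reach Ollama. Is it running on port 11434?");
}
```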

Step 4: Send an image to the model

Now the fun part. We load an image from disk and send it along with a question. In Microsoft.Extensions.AI, an image is wrapped in a DataContent. We put it inside a chat message together with our text.

// Read the image bytes from disk.
byte[] imageBytes = await File.ReadAllBytesAsync("receipt.jpg");
 
// Build a message that holds BOTH text and the image.
var message = new ChatMessage(ChatRole.User,
[
    new TextContent("Read this receipt and list the shop, date, and total."),
    new DataContent(imageBytes, "image/jpeg")
]);
 
var response = await client.GetResponseAsync(message);
Console.WriteLine(response.Text);

This already works. The model will look at the picture and describe it in words. But there is a problem. The reply is free text. One time it might say "Total: 240". Another time it might say "The amount comes to about 240 rupees". Our code cannot rely on a moving target. We need fixed fields. That is the next step.
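One practical detail before moving on: the snippet above hard-codes "image/jpeg". If your files can also be PNG or WebP, the media type should match the file. Here is a minimal sketch of a helper for that. The name GetImageMediaType and the list of supported extensions are assumptions for this example.

```csharp
using System;
using System.IO;

// Maps a file extension to the media type passed to DataContent.
// The list is deliberately short; extend it if you need more formats.
static string GetImageMediaType(string path) =>
    Path.GetExtension(path).ToLowerInvariant() switch
    {
        ".jpg" or ".jpeg" => "image/jpeg",
        ".png" => "image/png",
        ".webp" => "image/webp",
        ".gif" => "image/gif",
        var other => throw new NotSupportedException($"Unsupported image type: '{other}'")
    };

Console.WriteLine(GetImageMediaType("receipt.jpg")); // image/jpeg
Console.WriteLine(GetImageMediaType("scan.PNG"));    // image/png
```

You would then write new DataContent(imageBytes, GetImageMediaType(path)) instead of typing the media type by hand.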

Why free text is risky

Run 1: 'Total is 240'
Run 2: 'About Rs 240'
Run 3: 'Amount: 240.00'

The same photo can give different sentences each run.
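To feel the pain in code, here is a small sketch of a hand-written parser that expects one exact wording. The ParseTotal helper is hypothetical, and the three strings are the sample replies from above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// A naive parser that only understands the wording "Total is <number>".
static decimal? ParseTotal(string reply)
{
    Match match = Regex.Match(reply, @"Total is (\d+(\.\d+)?)");
    return match.Success
        ? decimal.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture)
        : null;
}

string[] samples = ["Total is 240", "About Rs 240", "Amount: 240.00"];
foreach (string sample in samples)
{
    decimal? total = ParseTotal(sample);
    string shown = total is null ? "not found" : total.Value.ToString(CultureInfo.InvariantCulture);
    Console.WriteLine($"{sample} -> {shown}");
}
// Total is 240 -> 240
// About Rs 240 -> not found
// Amount: 240.00 -> not found
```

Two of the three replies slip past the parser, even though a person reads all three as the same amount. You could keep adding patterns, but the model can always invent a new sentence.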

Step 5: Ask for structured output

Here is the magic step. We define a C# class that describes the shape we want. Then we ask for that shape directly. OllamaSharp and Microsoft.Extensions.AI work together to send a schema to the model and to parse the answer back into our class.

First, the class. Keep it small and clear.

// This class IS our schema. Each property is a field we want back.
public class Receipt
{
    public string ShopName { get; set; } = "";
    public string Date { get; set; } = "";
    public decimal Total { get; set; }
    public string Currency { get; set; } = "";
}

Now we call the typed version, GetResponseAsync<Receipt>. The <Receipt> part tells the library, "I want the answer shaped like this class."

byte[] imageBytes = await File.ReadAllBytesAsync("receipt.jpg");
 
var message = new ChatMessage(ChatRole.User,
[
    new TextContent(
        "Extract the shop name, date, total amount, and currency " +
        "from this receipt."),
    new DataContent(imageBytes, "image/jpeg")
]);
 
// Ask for a strongly typed Receipt, not free text.
ChatResponse<Receipt> response =
    await client.GetResponseAsync<Receipt>(message);
 
if (response.TryGetResult(out Receipt? receipt))
{
    Console.WriteLine($"Shop:     {receipt.ShopName}");
    Console.WriteLine($"Date:     {receipt.Date}");
    Console.WriteLine($"Total:    {receipt.Total} {receipt.Currency}");
}
else
{
    Console.WriteLine("The model did not return a valid Receipt.");
}

Look at what changed. We no longer read a sentence and try to guess the numbers. We get a real Receipt object with a Total of type decimal. We can save it, add it up, or put it in a report straight away.

The library does three jobs for us behind the scenes. It builds a JSON schema from our class. It tells Ollama to follow that schema. Then it reads the JSON reply and turns it into a Receipt. We just write normal C#.
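If you are curious what a schema for our class looks like, the sketch below prints one. It uses the JsonSchemaExporter that ships with System.Text.Json in .NET 9 and later. This is only an illustration of the first job. The schema that Microsoft.Extensions.AI actually sends may differ in its details.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Nodes;
using System.Text.Json.Schema;

// Copy the default options and switch to camelCase property names.
var options = new JsonSerializerOptions(JsonSerializerOptions.Default)
{
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase
};

// Build a JSON schema that describes the Receipt class.
JsonNode schema = options.GetJsonSchemaAsNode(typeof(Receipt));

// Print it in a readable, indented form.
Console.WriteLine(schema.ToJsonString(new JsonSerializerOptions { WriteIndented = true }));

public class Receipt
{
    public string ShopName { get; set; } = "";
    public string Date { get; set; } = "";
    public decimal Total { get; set; }
    public string Currency { get; set; } = "";
}
```

The output is a JSON object with a "properties" section that names each field and its type. That description is what keeps the model's reply in shape.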

What the typed call does for you under the hood.

Always check the result

There is one honest truth about AI models: they are not perfect. Most of the time they follow the schema. But sometimes a model returns something that does not fit, especially with a blurry image or a tiny model.

That is why we used TryGetResult. It returns true only when the reply was valid and could be turned into a Receipt. If it returns false, we handle it kindly instead of crashing. This small habit makes your app safe in the real world.

Here is a simple comparison of the two ways to read a reply.

Approach	What you get	Safety
`response.Text`	A sentence of free text	You parse it yourself, easy to break
`GetResponseAsync<Receipt>`	A typed `Receipt` object	Schema enforced, checked with `TryGetResult`

For real apps, always prefer the typed call with a check. It is the difference between a toy and a tool you can trust.
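A valid shape does not promise sensible values. The model can return a well-formed Receipt with an empty shop name or a total of zero. A minimal sketch of an extra check could look like this. The Validate helper, its rules, and the accepted date formats are all assumptions for this example, so adjust them to your own receipts.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Date layouts we accept. This list is an assumption for the sketch.
string[] dateFormats = ["yyyy-MM-dd", "dd/MM/yyyy", "dd-MM-yyyy"];

// Returns a list of problems. An empty list means the values look sensible.
List<string> Validate(Receipt receipt)
{
    var problems = new List<string>();

    if (string.IsNullOrWhiteSpace(receipt.ShopName))
        problems.Add("Shop name is empty.");

    if (receipt.Total <= 0)
        problems.Add("Total must be greater than zero.");

    // The date arrives as text, so check that it matches a known layout.
    if (!DateOnly.TryParseExact(receipt.Date, dateFormats,
            CultureInfo.InvariantCulture, DateTimeStyles.None, out _))
        problems.Add($"Date '{receipt.Date}' is not in a known format.");

    return problems;
}

var good = new Receipt { ShopName = "Fresh Mart", Date = "21/10/2025", Total = 240m, Currency = "INR" };
var bad = new Receipt { ShopName = "", Date = "last Tuesday", Total = 0m };

Console.WriteLine(Validate(good).Count); // 0
Console.WriteLine(Validate(bad).Count);  // 3

public class Receipt
{
    public string ShopName { get; set; } = "";
    public string Date { get; set; } = "";
    public decimal Total { get; set; }
    public string Currency { get; set; } = "";
}
```

Run this kind of check right after TryGetResult succeeds, and treat a non-empty list the same way you treat a failed parse.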

A complete tiny example

Let us put the pieces together into one short program you can run. It reads a receipt and prints the fields, with a safe fallback.

using Microsoft.Extensions.AI;
using OllamaSharp;
 
IChatClient client = new OllamaApiClient(
    new Uri("http://localhost:11434/"),
    "qwen2.5vl"); // a model that reads documents well
 
byte[] imageBytes = await File.ReadAllBytesAsync("receipt.jpg");
 
var message = new ChatMessage(ChatRole.User,
[
    new TextContent(
        "You are reading an Indian grocery receipt. Extract the shop " +
        "name, the date as text, the total amount as a number, and the " +
        "currency code. If a field is missing, leave it empty."),
    new DataContent(imageBytes, "image/jpeg")
]);
 
ChatResponse<Receipt> response =
    await client.GetResponseAsync<Receipt>(message);
 
if (response.TryGetResult(out Receipt? receipt))
{
    Console.WriteLine($"Shop:  {receipt.ShopName}");
    Console.WriteLine($"Date:  {receipt.Date}");
    Console.WriteLine($"Total: {receipt.Total} {receipt.Currency}");
}
else
{
    Console.WriteLine("Could not read the receipt. Try a clearer photo.");
}
 
public class Receipt
{
    public string ShopName { get; set; } = "";
    public string Date { get; set; } = "";
    public decimal Total { get; set; }
    public string Currency { get; set; } = "";
}

Notice the prompt gives gentle hints: it mentions the kind of receipt, and it tells the model what to do when a field is missing. Clear prompts give better answers. You are talking to the model like a helpful teacher giving instructions to a student.

Tips that make a real difference

Small habits raise your accuracy a lot. Here are the ones that matter most for beginners.

Use clear images. A sharp, well-lit photo helps the model far more than a fancy model on a blurry photo. Crop out the background if you can.

Keep your schema small. Ask for the fields you truly need. A class with four clear properties works better than one with twenty. Fewer fields means fewer mistakes.

Write plain prompts. Say exactly what each field means. "Total amount as a number, with no currency symbol" is better than just "total".

Pick the right model for the job. For documents and receipts, a model like Qwen2.5-VL reads structure better than a general one. Test on your own images and trust the results you see.

Always handle failure. Use TryGetResult and show a friendly message. Never assume the reply is perfect.
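The plain-prompt tip is easy to turn into a habit with a tiny helper. This is only a sketch: BuildPrompt is a hypothetical name, and the field meanings are examples you should rewrite for your own documents.

```csharp
using System;
using System.Text;

// Builds one clear prompt from a list of field names and their meanings.
static string BuildPrompt(string documentKind, (string Name, string Meaning)[] fields)
{
    var prompt = new StringBuilder();
    prompt.AppendLine($"You are reading {documentKind}. Extract these fields:");
    foreach ((string name, string meaning) in fields)
        prompt.AppendLine($"- {name}: {meaning}");
    prompt.Append("If a field is missing, leave it empty.");
    return prompt.ToString();
}

string text = BuildPrompt("an Indian grocery receipt",
[
    ("shopName", "the name of the shop as printed"),
    ("date", "the purchase date as text"),
    ("total", "the total amount as a number, with no currency symbol"),
    ("currency", "the three-letter currency code, such as INR")
]);

Console.WriteLine(text);
```

Keeping the field meanings in one list makes it easy to improve a single description when the model keeps getting that field wrong.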

A safe loop for production use.

The diagram above shows a sensible loop. Load the image, send it, check the result. If it is valid, save it. If not, you might retry once with a clearer prompt before giving up. This keeps your app calm even when one image is hard to read.
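That loop can be sketched in a few lines. To keep the example runnable without a model, the model call is replaced by a fake function that fails once and then succeeds. ExtractWithRetryAsync and FakeExtract are hypothetical names for this illustration.

```csharp
using System;
using System.Threading.Tasks;

// Tries each prompt in order and returns the first valid result, or null.
// "extract" stands in for the real model call.
static async Task<Receipt?> ExtractWithRetryAsync(
    Func<string, Task<Receipt?>> extract,
    params string[] prompts)
{
    foreach (string prompt in prompts)
    {
        Receipt? found = await extract(prompt);
        if (found is not null)
            return found;   // valid: stop here so the caller can save it
    }
    return null;            // every attempt failed: give up calmly
}

// A fake extractor that fails on the first call and succeeds on the second.
int calls = 0;
Task<Receipt?> FakeExtract(string prompt)
{
    calls++;
    Receipt? result = calls == 1
        ? null
        : new Receipt { ShopName = "Fresh Mart", Total = 240m };
    return Task.FromResult(result);
}

Receipt? saved = await ExtractWithRetryAsync(
    FakeExtract,
    "Extract the shop name, date, total, and currency.",
    "Extract the shop name, the date as text, the total as a number with no symbol, and the currency code.");

Console.WriteLine(saved is null
    ? "Gave up after the retry."
    : $"Saved {saved.ShopName} after {calls} attempts.");
// Saved Fresh Mart after 2 attempts.

public class Receipt
{
    public string ShopName { get; set; } = "";
    public string Date { get; set; } = "";
    public decimal Total { get; set; }
    public string Currency { get; set; } = "";
}
```

In a real app, the function you pass in would call GetResponseAsync<Receipt> and return the receipt when TryGetResult is true, or null when it is false.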

Where you can use this

Once you can read fields from an image, many real tasks open up:

Receipts for expense tracking, like our example.
ID cards to read a name and number into a form.
Invoices to pull line items and totals for accounting.
Forms that people fill by hand and scan.
Screenshots where you want the text turned into data.

All of it runs on your own machine. For a small business in India counting daily bills, or a student building a project, that means zero cloud cost and full privacy. The photos stay with you.

Common real uses

Receipts: shop, date, total.
ID cards: name, number.
Invoices: items, amounts.
Forms: handwritten fields.

The same pattern, many jobs.
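For example, an invoice needs a list of line items, not just single values. Here is a sketch of what that schema class might look like, tested against a made-up JSON reply. Invoice and LineItem are hypothetical names. How well a small model fills a nested list is something to test on your own images.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// A made-up reply shaped like the Invoice class below.
string reply = """
{
  "invoiceNumber": "INV-1042",
  "items": [
    { "description": "Rice 5 kg", "amount": 450.00 },
    { "description": "Cooking oil 1 L", "amount": 180.50 }
  ],
  "total": 630.50
}
""";

var options = new JsonSerializerOptions(JsonSerializerDefaults.Web);
Invoice? invoice = JsonSerializer.Deserialize<Invoice>(reply, options);

Console.WriteLine($"{invoice?.InvoiceNumber}: {invoice?.Items.Count} items");
// INV-1042: 2 items

// A useful sanity check: the line items should add up to the stated total.
Console.WriteLine(invoice?.Items.Sum(item => item.Amount) == invoice?.Total);
// True

public class Invoice
{
    public string InvoiceNumber { get; set; } = "";
    public List<LineItem> Items { get; set; } = [];
    public decimal Total { get; set; }
}

public class LineItem
{
    public string Description { get; set; } = "";
    public decimal Amount { get; set; }
}
```

With the typed call, you would ask for GetResponseAsync<Invoice> in place of GetResponseAsync<Receipt>. The rest of the code stays the same.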

A note on privacy and cost

When you use a cloud AI service, your image travels over the internet to someone else's computer, and you often pay per request. With Ollama, the model runs on your machine. The image is read locally and never uploaded. There is no per-request bill. You pay only for the electricity and the hardware you already own.

This is a big deal for sensitive documents like ID cards or medical forms. Keeping data on your own device is often the safest and simplest choice. It also means your app keeps working with a weak or missing internet connection once the model has been downloaded, because nothing needs to be sent away.
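The privacy promise only holds while the client points at your own machine. If someone later changes the address to a remote Ollama server, the images will travel over the network. A small guard like the sketch below can catch that. IsLocalEndpoint is a hypothetical helper, and the second address is a placeholder.

```csharp
using System;

// True when the address points at this machine (localhost, 127.0.0.1, or ::1).
static bool IsLocalEndpoint(string endpoint) => new Uri(endpoint).IsLoopback;

Console.WriteLine(IsLocalEndpoint("http://localhost:11434/"));     // True
Console.WriteLine(IsLocalEndpoint("https://ollama.example.com/")); // False
```

You could call this once at startup and refuse to send sensitive images when it returns false.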

Quick recap

Here is everything in short, easy points:

A vision model is an AI that can read pictures, not just text.
Ollama runs these models for free on your own computer at http://localhost:11434.
In .NET, OllamaSharp plus Microsoft.Extensions.AI give you a simple IChatClient.
Send an image with DataContent inside a ChatMessage.
Use GetResponseAsync<T> with a small class to get structured, typed output instead of messy text.
Always check the reply with TryGetResult so your app stays safe.
Use clear photos, small schemas, plain prompts, and the right model.
Everything runs locally, so your data stays private and there are no API bills.

You started with a crumpled photo and ended with clean fields, just like the friendly shopkeeper. That is the whole idea, and now you can build it yourself.
