Building Resilient Cloud Applications With .NET

Learn to build resilient cloud apps in .NET with retries, timeouts, and circuit breakers using Polly and Microsoft.Extensions.Resilience.

13 min read · Updated December 18, 2025

Imagine you call your friend on the phone, but the line is busy. You do not throw your phone away. You wait a few seconds and try again. If it is still busy after three tries, you decide to send a message instead. And if you notice the number has been busy for the whole day, you stop calling for a while to save your time.

That simple, patient behaviour is exactly what we want our cloud applications to do. This is called resilience.

A resilient app is not an app where nothing ever breaks. Things break in the cloud all the time. Networks hiccup. A database gets busy. Another service is restarting. A resilient app is one that expects these small problems and handles them calmly, so the user barely notices.

In this article we will learn, step by step, how to build resilient cloud applications with .NET. We will use simple words and real examples. By the end you will understand retries, timeouts, circuit breakers, fallbacks, and health checks, and you will know which .NET tools give you each one.

Why do cloud apps need resilience?

In the old days, one big program ran on one computer. If that computer was fine, the app was fine.

Modern cloud apps are different. They are made of many small services that talk to each other over the network. Your app might call a payment service, a user service, and a database, all for one button click.

Every network call is a small risk. Any of those calls can be slow or can fail. And here is the scary part: one slow service can drag down everything else, like one stuck car causing a traffic jam behind it.

One slow service can cause a chain of failures across the whole system.

Picture a Payment Service that calls a Bank API. When the Bank API is slow, the Payment Service waits. While it waits, it holds a thread. If many users click at once, all the threads get stuck waiting. Soon the whole app freezes. This is called a cascading failure, and resilience patterns are how we stop it.

The main resilience patterns

There are a handful of well-known patterns. Each one solves a different problem. The good news: you do not have to invent them. .NET and the Polly library give them to you ready-made.

Pattern | What it does | Everyday example
Retry | Tries again after a short wait | Redialling a busy phone number
Timeout | Gives up if something takes too long | Hanging up after waiting 10 seconds
Circuit breaker | Stops calling a service that keeps failing | Not calling a number that is busy all day
Fallback | Uses a backup plan when all else fails | Sending a text instead of calling
Hedging | Sends a second request to be faster | Asking two friends the same question

Let us look at how these fit together in a request.

A request protected by resilience patterns

  1. Timeout: stop waiting after a set time.
  2. Retry: try again a few times with a pause.
  3. Circuit Breaker: stop calling a failing service.
  4. Fallback: return a safe default value.

Each layer adds protection so a single failure does not break the user experience.

Meet Polly and Microsoft.Extensions.Resilience

In .NET, the most loved resilience library is Polly. It lets you describe rules like "retry three times" or "give up after 5 seconds" in clean, readable code.

Microsoft liked Polly so much that they built an official package on top of it called Microsoft.Extensions.Http.Resilience. With one line of setup, it gives your HTTP calls a smart default pipeline: a rate limiter, an overall request timeout, retries with exponential backoff and jitter, a circuit breaker, and a timeout for each attempt. You get a lot of safety for very little code.

Here is the simplest possible example. We add resilience to an HTTP client.

// Program.cs
// Add the NuGet package: Microsoft.Extensions.Http.Resilience
 
var builder = WebApplication.CreateBuilder(args);
 
builder.Services
    .AddHttpClient("catalog", client =>
    {
        client.BaseAddress = new Uri("https://catalog.example.com");
    })
    // One line gives retries, a circuit breaker, and a timeout.
    .AddStandardResilienceHandler();
 
var app = builder.Build();
app.Run();

That single call to AddStandardResilienceHandler() is doing a lot of work behind the scenes. Now any request made with this client is automatically retried, timed, and protected by a circuit breaker.

Retry: give it a second chance

A retry simply tries the action again after a short pause. Many cloud problems are transient, which is a fancy word for "they go away on their own in a moment." A retry is perfect for these.

But there is a trap. If you retry instantly, and a thousand users do the same, you flood the struggling service with even more traffic. That makes things worse.

The fix is exponential backoff with jitter. Backoff means you wait longer after each failed try: 1 second, then 2, then 4. Jitter means you add a small random amount so all the clients do not retry at the exact same instant.
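To make the arithmetic concrete, here is a minimal sketch of how such a delay could be computed by hand. The helper name BackoffDelay, the 25% jitter range, and the fixed random seed are assumptions made for illustration. Polly uses its own jitter formula internally, so treat this as a picture of the idea rather than the library's algorithm.

```csharp
using System;

// Hypothetical helper: the wait before retry number `attempt` (0-based).
// Exponential part: baseDelay * 2^attempt. Jitter: up to 25% extra, picked at random.
static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, Random rng)
{
    double exponentialMs = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
    double jitterMs = exponentialMs * 0.25 * rng.NextDouble();
    return TimeSpan.FromMilliseconds(exponentialMs + jitterMs);
}

var rng = new Random(42); // fixed seed so repeated runs print the same numbers

for (int attempt = 0; attempt < 3; attempt++)
{
    TimeSpan delay = BackoffDelay(attempt, TimeSpan.FromSeconds(1), rng);
    Console.WriteLine($"Retry {attempt + 1}: wait about {delay.TotalSeconds:F2} s");
}
```

Each printed wait lands a little above 1, 2, and 4 seconds. Two clients with different random seeds would get slightly different waits, which is exactly what spreads their retries apart.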

Each retry waits longer than the last, with a little randomness added.

Here is how you write a custom retry with Polly directly. This is useful when you are protecting something that is not an HTTP call, like a database query.

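using Polly;
using Polly.Retry;
 
// Build a small pipeline with one retry strategy.
var pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        // Up to 3 retries after the first attempt (4 attempts in total).
        MaxRetryAttempts = 3,
        // Base delay. With exponential backoff the waits grow to about 1 s, 2 s, 4 s.
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        // Add randomness so clients do not all retry at the same instant.
        UseJitter = true
    })
    .Build();
 
// Run our work through the pipeline.
// SaveOrderToDatabaseAsync stands in for your own method; it must be safe to repeat.
await pipeline.ExecuteAsync(async token =>
{
    await SaveOrderToDatabaseAsync(token);
});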

One important rule: only retry actions that are safe to repeat. Reading data is safe. Sending the same payment twice is not. We call safe-to-repeat actions idempotent. Always ask "what happens if this runs twice?" before adding a retry.
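One common way to make a risky action safe to repeat is an idempotency key: the caller sends the same unique key with every attempt, and the receiver ignores repeats. The sketch below is an in-memory toy to show the idea. PaymentLedger, Charge, and the key format are hypothetical names, and it is not thread-safe. A real system would keep the keys in durable storage.

```csharp
using System;
using System.Collections.Generic;

var ledger = new PaymentLedger();
string key = "order-1001-payment"; // the same key is reused on every retry of this payment

bool first = ledger.Charge(key, 25m);
bool second = ledger.Charge(key, 25m); // a retry of the same request

Console.WriteLine($"First attempt applied: {first}");   // True
Console.WriteLine($"Second attempt applied: {second}"); // False
Console.WriteLine($"Charged once, not twice: {ledger.TotalCharged == 25m}"); // True

// Hypothetical in-memory ledger that applies each idempotency key at most once.
class PaymentLedger
{
    private readonly HashSet<string> _seenKeys = new();

    public decimal TotalCharged { get; private set; }

    // Returns true if the charge was applied, false if the key was already seen.
    public bool Charge(string idempotencyKey, decimal amount)
    {
        if (!_seenKeys.Add(idempotencyKey))
        {
            return false; // duplicate attempt: change nothing
        }

        TotalCharged += amount;
        return true;
    }
}
```

With a key like this in place, the answer to "what happens if this runs twice?" becomes "nothing extra," and a retry is safe.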

Timeout: do not wait forever

A timeout is the simplest pattern, and one of the most important. It says: "If this has not finished in X seconds, stop waiting and move on."

Without a timeout, a single slow call can hold a thread open for minutes. As we saw earlier, that is how a whole app freezes. A timeout puts a hard limit on patience.

using Polly;
using Polly.Timeout;
 
var pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(5)) // give up after 5 seconds
    .Build();
 
try
{
    await pipeline.ExecuteAsync(async token =>
    {
        // Pass the token on. Polly cancels it when the time is up,
        // and the call has to honor it for the timeout to take effect.
        // CallSlowServiceAsync stands in for your own method.
        await CallSlowServiceAsync(token);
    });
}
catch (TimeoutRejectedException)
{
    // The call took too long. We handle it gracefully here.
    Console.WriteLine("The service was too slow, using a backup.");
}

A good habit is to combine a timeout with a retry. The timeout cuts off each slow attempt, and the retry tries again. Together they keep things both fast and forgiving.
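Here is a minimal sketch of that combination with Polly. The delay values are illustrative assumptions, and the Task.Delay call is a stand-in for a real service call. The order matters: a strategy added earlier wraps the ones added later, so placing the retry first means each attempt gets its own fresh time limit.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    // Outer layer: retry. By default it also retries when the timeout below fires.
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(500),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    // Inner layer: every single attempt may take at most 2 seconds.
    .AddTimeout(TimeSpan.FromSeconds(2))
    .Build();

await pipeline.ExecuteAsync(async token =>
{
    // Stand-in for a real call. Passing the token lets the timeout cancel it.
    await Task.Delay(TimeSpan.FromMilliseconds(100), token);
    Console.WriteLine("Call finished inside the time limit.");
});
```

If you swapped the two calls, the timeout would wrap the whole retry loop instead, giving one time limit for all attempts together. Both layouts are valid; they just answer different questions.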

Circuit breaker: stop knocking on a closed door

Retrying is great when a problem is short-lived. But what if a service is properly broken and will stay down for minutes? Retrying then just wastes time and energy, and keeps your users waiting.

A circuit breaker watches how many calls fail. If too many fail within a short window, it "opens" and stops all calls for a while. During this cool-down, your app fails fast instead of waiting. After the wait, it lets a single test call through. If that works, it "closes" and normal traffic resumes. If the test call fails, it opens again for another cool-down.

It works just like the electrical fuse in your home. When something is wrong, the fuse trips and cuts the power to keep the rest of the house safe.

A circuit breaker moves between three states based on success and failure.

The three states are easy to remember:

State | Meaning | What happens to calls
Closed | Everything is healthy | Calls go through normally
Open | Service looks broken | Calls fail fast, none are sent
Half-Open | Testing the waters | One test call is allowed through

using Polly;
using Polly.CircuitBreaker;
 
var pipeline = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        // Open the circuit if 50% of calls fail...
        FailureRatio = 0.5,
        // ...within a 10 second sampling window...
        SamplingDuration = TimeSpan.FromSeconds(10),
        // ...with at least 8 calls to judge from.
        MinimumThroughput = 8,
        // Stay open for 30 seconds before testing again.
        BreakDuration = TimeSpan.FromSeconds(30)
    })
    .Build();
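To see the three states move, here is a tiny hand-rolled state machine. It is a teaching toy, not how Polly works inside: ToyCircuitBreaker and its members are hypothetical names, and it trips after a fixed number of consecutive failures, which is simpler than the failure-ratio settings in the Polly options above. The clock is passed in so the demo can skip ahead in time without waiting.

```csharp
using System;

var now = DateTime.UtcNow;
var breaker = new ToyCircuitBreaker(
    failureThreshold: 3,
    breakDuration: TimeSpan.FromSeconds(30),
    clock: () => now); // fake clock so the demo does not have to wait

for (int i = 0; i < 3; i++) breaker.RecordFailure();
Console.WriteLine(breaker.State);          // Open
Console.WriteLine(breaker.AllowRequest()); // False: failing fast

now = now.AddSeconds(31);                  // pretend the cool-down has passed
Console.WriteLine(breaker.AllowRequest()); // True: one test call is allowed
Console.WriteLine(breaker.State);          // HalfOpen

breaker.RecordSuccess();
Console.WriteLine(breaker.State);          // Closed

enum CircuitState { Closed, Open, HalfOpen }

class ToyCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTime> _clock;
    private int _failures;
    private DateTime _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public ToyCircuitBreaker(int failureThreshold, TimeSpan breakDuration, Func<DateTime> clock)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
        _clock = clock;
    }

    // Should the caller attempt the call right now?
    public bool AllowRequest()
    {
        switch (State)
        {
            case CircuitState.Closed:
                return true;
            case CircuitState.Open when _clock() - _openedAt >= _breakDuration:
                State = CircuitState.HalfOpen; // cool-down over: allow one test call
                return true;
            default:
                return false; // still cooling down, or a test call is already out
        }
    }

    public void RecordSuccess()
    {
        _failures = 0;
        State = CircuitState.Closed;
    }

    public void RecordFailure()
    {
        _failures++;
        if (State == CircuitState.HalfOpen || _failures >= _failureThreshold)
        {
            State = CircuitState.Open; // trip, and restart the cool-down
            _openedAt = _clock();
            _failures = 0;
        }
    }
}
```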

When the circuit is open, Polly throws a BrokenCircuitException immediately. That is your signal to use a fallback, which is the next pattern.

Fallback: always have a backup plan

A fallback is your "plan B." When everything else fails, instead of showing the user an ugly error, you return something useful and safe.

Think of a weather widget. If the live weather service is down, you could show yesterday's weather with a small note saying "data may be old." The user still sees something helpful. That is a fallback.

Fallbacks are not always possible. You cannot fake a bank balance. But for many features, a cached value, an empty list, or a friendly message is far better than a crash.

How a fallback rescues a failed call

  1. Call Service: try the real service first.
  2. Failed?: catch the error or the open circuit.
  3. Use Cached Value: return the last known good data.
  4. Show User: the user still sees a result.

When the main path fails, the app quietly switches to a safe backup.
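The weather widget idea can be sketched with Polly's fallback strategy. The cached text and the GetLiveWeatherAsync method are hypothetical stand-ins, and the method always fails here so that the fallback is visible. Handling every exception type is a simplification for the demo; real code would usually name the specific failures it expects.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Fallback;

// Hypothetical cached value from the last successful call.
string lastKnownWeather = "Sunny yesterday (data may be old)";

ResiliencePipeline<string> pipeline = new ResiliencePipelineBuilder<string>()
    .AddFallback(new FallbackStrategyOptions<string>
    {
        // Treat any exception, including an open circuit, as a reason to fall back.
        ShouldHandle = new PredicateBuilder<string>().Handle<Exception>(),
        // Plan B: hand back the cached value instead of the error.
        FallbackAction = args => Outcome.FromResultAsValueTask(lastKnownWeather)
    })
    .Build();

string weather = await pipeline.ExecuteAsync(token => GetLiveWeatherAsync(token));
Console.WriteLine(weather); // prints the cached value, because the live call failed

// Stand-in for a live weather service that is currently down.
static async ValueTask<string> GetLiveWeatherAsync(CancellationToken token)
{
    await Task.Delay(50, token);
    throw new HttpRequestException("Weather service is down");
}
```

The caller never sees the exception. It simply receives the older value, along with the note that the data may be stale.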

Health checks: let the cloud know you are okay

So far we have protected calls going out of our app. Health checks are about the signal we send about our app.

A health check is a small endpoint, usually at /health, that answers one question: "Are you okay?" The cloud platform, like Kubernetes or .NET Aspire, calls this endpoint over and over. If your app stops answering, the platform restarts it or stops sending it traffic. This is how apps heal themselves.

There are usually two kinds:

  • Liveness (/alive): Is the app running at all? If not, restart it.
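  • Readiness (/health): Is the app ready to accept traffic? Maybe it is still warming up or its database is down, so hold off sending users to it.

using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
 
var builder = WebApplication.CreateBuilder(args);
 
// Register health checks. The "self" check is tagged "live" so the
// liveness endpoint can run it alone, without touching the database.
// AddSqlServer comes from the AspNetCore.HealthChecks.SqlServer NuGet package.
builder.Services
    .AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), new[] { "live" })
    .AddSqlServer(builder.Configuration.GetConnectionString("Db")!);
 
var app = builder.Build();
 
// Readiness: every check, including the database, must pass.
app.MapHealthChecks("/health");
 
// Liveness: only the checks tagged "live" must pass.
app.MapHealthChecks("/alive", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("live")
});
 
app.Run();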
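You can also write your own check for the "still warming up" case. The sketch below is one possible shape: WarmUpHealthCheck and MarkReady are hypothetical names, and it assumes some startup code in your app calls MarkReady once caches are loaded or migrations have finished.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical readiness check: reports Unhealthy until warm-up has finished.
public sealed class WarmUpHealthCheck : IHealthCheck
{
    private volatile bool _ready;

    // Call this from your startup work once the app can serve traffic.
    public void MarkReady() => _ready = true;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        HealthCheckResult result = _ready
            ? HealthCheckResult.Healthy("Warm-up complete")
            : HealthCheckResult.Unhealthy("Still warming up");

        return Task.FromResult(result);
    }
}

// Registration, placed next to the other checks in Program.cs:
//   builder.Services.AddSingleton<WarmUpHealthCheck>();
//   builder.Services.AddHealthChecks().AddCheck<WarmUpHealthCheck>("warmup");
```

Registering the class as a singleton matters here. The startup code and the health endpoint must share the same instance, otherwise MarkReady would flip a flag that nobody reads.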

If you use .NET Aspire, a lot of this is handled for you. Aspire is Microsoft's stack for cloud-native .NET apps. Every app you build with it automatically gets structured logs, traces, metrics, and health checks, all pre-wired. Aspire integrations even add their own checks. For example, the PostgreSQL integration adds a check that opens a real connection and runs a query, so a broken database shows up in your health status right away.

Putting it all together

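These patterns are stronger as a team than alone. A common, sensible order to layer them, from the outside in, is: fallback, overall timeout, retry, circuit breaker, and finally a timeout on each attempt. The request flows through each guard before it ever touches the network.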

A full resilience pipeline layers several patterns around one call.

With AddStandardResilienceHandler(), .NET arranges these layers for you in a tested, sensible way. You can still tune the numbers to fit your app. A fast internal service might use short timeouts. A slow third-party API might need more patience.
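As a sketch of what tuning can look like, the handler accepts a callback that adjusts its options. The numbers below are illustrative assumptions for a fast internal service, not recommendations, and the client name and address are reused from the earlier example.

```csharp
// Program.cs
// Assumes the NuGet package Microsoft.Extensions.Http.Resilience is installed.

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddHttpClient("catalog", client =>
    {
        client.BaseAddress = new Uri("https://catalog.example.com");
    })
    .AddStandardResilienceHandler(options =>
    {
        // Limit for each single attempt.
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(2);

        // Limit for the whole request, across all attempts and waits.
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);

        // Fewer retries than the default, to fail sooner.
        options.Retry.MaxRetryAttempts = 2;
    });

var app = builder.Build();
app.Run();
```

The options are validated when the client is built, so keep the total timeout longer than the per-attempt timeout. For a slow third-party API you would move the same numbers in the other direction.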

A few honest tips

Resilience is powerful, but it is not magic. Keep these in mind:

  • Do not retry everything. Retrying a non-idempotent action can create duplicate orders or double charges. Be careful.
  • Always set a timeout. A retry without a timeout can wait forever. The two belong together.
  • Tune your numbers. The right retry count and timeout depend on your real traffic. Start with the defaults, then watch and adjust.
  • Log when patterns trigger. If your circuit breaker keeps opening, that is telling you something real is wrong. Do not hide it.
  • Test failures on purpose. Turn off a dependency in a test environment and watch how your app behaves. This is called chaos testing.
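The logging tip can be sketched with the callbacks that Polly's options expose. Console output stands in for a real logger here, and the failing operation is a simulated glitch that succeeds on its third call. The delay and attempt counts are arbitrary demo values.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 2,
        Delay = TimeSpan.FromMilliseconds(100),
        OnRetry = args =>
        {
            // AttemptNumber is zero-based: 0 means the first attempt just failed.
            Console.WriteLine($"Retrying after attempt {args.AttemptNumber}: {args.Outcome.Exception?.Message}");
            return default; // an already completed ValueTask
        }
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        OnOpened = args =>
        {
            // Not expected to fire in this short demo; shown for completeness.
            Console.WriteLine($"Circuit opened for {args.BreakDuration.TotalSeconds} s");
            return default;
        }
    })
    .Build();

int calls = 0;

await pipeline.ExecuteAsync(async token =>
{
    await Task.Yield();
    calls++;
    if (calls < 3)
    {
        throw new TimeoutException("simulated glitch"); // fail the first two calls
    }
});

Console.WriteLine($"Succeeded after {calls} calls");
```

In a real app you would write these messages to your logging system and set an alert on them. A retry now and then is normal. A circuit that opens every few minutes is a problem worth a closer look.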

A note on libraries and licensing

Two popular .NET libraries, MediatR and MassTransit, are now commercially licensed for many uses. They are not resilience libraries themselves, but they often appear in the same cloud projects, so it is worth knowing. Polly, by contrast, is open source and free, and it is the foundation of the official Microsoft resilience packages. For most resilience needs, Polly and Microsoft.Extensions.Http.Resilience are all you need, and they are free to use.

It is also a good time to be on a modern runtime. .NET 10 is the current LTS release, C# 14 has shipped, and C# 15 with union types is in .NET 11 preview. The resilience APIs shown here work great on these recent versions.

Quick recap

  • A resilient app expects small failures and handles them calmly instead of crashing.
  • Retry gives a failing action another chance; use exponential backoff with jitter, and only on safe (idempotent) actions.
  • Timeout stops you waiting forever and protects your threads. Always pair it with retries.
  • Circuit breaker stops calling a broken service and lets it recover, like a home fuse.
  • Fallback returns a safe backup value so users still see something useful.
  • Health checks let the cloud platform restart or reroute around an unhealthy app, so it can heal itself.
  • In .NET, Polly and Microsoft.Extensions.Http.Resilience give you all of this with very little code, and .NET Aspire wires up health checks and telemetry for you.
