Building Resilient Cloud Applications With .NET
Learn to build resilient cloud apps in .NET with retries, timeouts, and circuit breakers using Polly and Microsoft.Extensions.Resilience.
Imagine you call your friend on the phone, but the line is busy. You do not throw your phone away. You wait a few seconds and try again. If it is still busy after three tries, you decide to send a message instead. And if you notice the number has been busy for the whole day, you stop calling for a while to save your time.
That simple, patient behaviour is exactly what we want our cloud applications to do. This is called resilience.
A resilient app is not an app where nothing ever breaks. Things break in the cloud all the time. Networks hiccup. A database gets busy. Another service is restarting. A resilient app is one that expects these small problems and handles them calmly, so the user barely notices.
In this article we will learn, step by step, how to build resilient cloud applications with .NET. We will use simple words and real examples. By the end you will understand retries, timeouts, circuit breakers, fallbacks, and health checks, and you will know which .NET tools give you each one.
Why do cloud apps need resilience?
In the old days, one big program ran on one computer. If that computer was fine, the app was fine.
Modern cloud apps are different. They are made of many small services that talk to each other over the network. Your app might call a payment service, a user service, and a database, all for one button click.
Every network call is a small risk. Any of those calls can be slow or can fail. And here is the scary part: one slow service can drag down everything else, like one stuck car causing a traffic jam behind it.
Picture a Payment Service that calls a Bank API. When the Bank API is slow, the Payment Service waits. While it waits, it holds a thread. If many users click at once, all the threads get stuck waiting. Soon the whole app freezes. This is called a cascading failure, and resilience patterns are how we stop it.
The main resilience patterns
There are a handful of well-known patterns. Each one solves a different problem. The good news: you do not have to invent them. .NET and the Polly library give them to you ready-made.
| Pattern | What it does | Everyday example |
|---|---|---|
| Retry | Tries again after a short wait | Redialling a busy phone number |
| Timeout | Gives up if something takes too long | Hanging up after waiting 10 seconds |
| Circuit breaker | Stops calling a service that keeps failing | Not calling a number that is busy all day |
| Fallback | Uses a backup plan when all else fails | Sending a text instead of calling |
| Hedging | Sends a second request to be faster | Asking two friends the same question |
Let us look at how these fit together in a request.
A request protected by resilience patterns passes through these steps:
- Timeout: stop waiting after a set time.
- Retry: try again a few times with a pause.
- Circuit breaker: stop calling a failing service.
- Fallback: return a safe default value.
Meet Polly and Microsoft.Extensions.Resilience
In .NET, the most loved resilience library is Polly. It lets you describe rules like "retry three times" or "give up after 5 seconds" in clean, readable code.
Microsoft liked Polly so much that they built an official package on top of it called Microsoft.Extensions.Http.Resilience. With one line of setup, it gives your HTTP calls a smart default pipeline: a rate limiter, an overall request timeout, retries with exponential backoff and jitter, a circuit breaker, and a per-attempt timeout. You get a lot of safety for very little code.
Here is the simplest possible example. We add resilience to an HTTP client.
```csharp
// Program.cs
// Add the NuGet package: Microsoft.Extensions.Http.Resilience
var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddHttpClient("catalog", client =>
    {
        client.BaseAddress = new Uri("https://catalog.example.com");
    })
    // One line gives retries, a circuit breaker, and a timeout.
    .AddStandardResilienceHandler();

var app = builder.Build();
app.Run();
```
That single call to AddStandardResilienceHandler() is doing a lot of work behind the scenes. Now any request made with this client is automatically retried, timed, and protected by a circuit breaker.
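If the defaults do not suit a particular service, the handler also accepts an options callback. The sketch below shows the idea. The numbers and the /products endpoint are illustrative assumptions, not recommendations.

```csharp
// Program.cs
// Assumes the NuGet package Microsoft.Extensions.Http.Resilience is installed.
var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddHttpClient("catalog", client =>
    {
        client.BaseAddress = new Uri("https://catalog.example.com");
    })
    .AddStandardResilienceHandler(options =>
    {
        // Illustrative values only; tune them against your real traffic.
        options.Retry.MaxRetryAttempts = 2;
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(3);
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(15);
    });

var app = builder.Build();

// Hypothetical endpoint: resolving the named client through IHttpClientFactory
// is what routes the request through the resilience pipeline.
app.MapGet("/products", async (IHttpClientFactory factory, CancellationToken ct) =>
{
    var client = factory.CreateClient("catalog");
    return await client.GetStringAsync("/products", ct);
});

app.Run();
```

The options are validated at startup, so keep the total request timeout longer than the per-attempt timeout.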
Retry: give it a second chance
A retry simply tries the action again after a short pause. Many cloud problems are transient, which is a fancy word for "they go away on their own in a moment." A retry is perfect for these.
But there is a trap. If you retry instantly, and a thousand users do the same, you flood the struggling service with even more traffic. That makes things worse.
The fix is exponential backoff with jitter. Backoff means you wait longer after each failed try: 1 second, then 2, then 4. Jitter means you add a small random amount so all the clients do not retry at the exact same instant.
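To make the arithmetic concrete, here is a small sketch that computes the waits by hand. The helper names and the 25% jitter range are assumptions chosen for illustration; Polly uses its own jitter formula internally.

```csharp
using System;

// Exponential backoff: baseDelay * 2^attempt, for attempt = 0, 1, 2, ...
static TimeSpan Backoff(TimeSpan baseDelay, int attempt) =>
    TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * Math.Pow(2, attempt));

// Jitter: add a random extra of up to 25% of the delay (the 25% is an assumption).
static TimeSpan WithJitter(TimeSpan delay, Random random) =>
    delay + TimeSpan.FromMilliseconds(delay.TotalMilliseconds * 0.25 * random.NextDouble());

var random = new Random();
var baseDelay = TimeSpan.FromSeconds(1);

for (var attempt = 0; attempt < 3; attempt++)
{
    var delay = Backoff(baseDelay, attempt);   // 1 s, 2 s, 4 s
    var jittered = WithJitter(delay, random);  // a little different on every client
    Console.WriteLine($"Retry {attempt + 1}: wait about {jittered.TotalMilliseconds:F0} ms");
}
```

Because every client draws its own random extra, a thousand clients that failed at the same moment spread their retries out instead of arriving together.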
Here is how you write a custom retry with Polly directly. This is useful when you are protecting something that is not an HTTP call, like a database query.
```csharp
using Polly;
using Polly.Retry;

// Build a small pipeline with one retry strategy.
var pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    .Build();

// Run our work through the pipeline.
// SaveOrderToDatabaseAsync stands for your own method; pass the token along so it can be cancelled.
await pipeline.ExecuteAsync(async token =>
{
    await SaveOrderToDatabaseAsync(token);
});
```
One important rule: only retry actions that are safe to repeat. Reading data is safe. Sending the same payment twice is not. We call safe-to-repeat actions idempotent. Always ask "what happens if this runs twice?" before adding a retry.
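One common way to make an unsafe action safe to repeat is an idempotency key: the caller sends the same key with every retry, and the receiver remembers which keys it has already handled. The sketch below is a minimal in-memory illustration. The names ChargeOnce and processed are hypothetical, and a real system would keep the keys in durable, thread-safe storage.

```csharp
using System;
using System.Collections.Generic;

// Payments already processed, keyed by a client-supplied idempotency key.
var processed = new Dictionary<string, string>();
var chargeCount = 0;

// Returns the same receipt when the same key is seen again,
// so a retried request does not charge the customer twice.
string ChargeOnce(string idempotencyKey, decimal amount)
{
    if (processed.TryGetValue(idempotencyKey, out var existingReceipt))
    {
        return existingReceipt; // duplicate request: no new charge
    }

    chargeCount++; // stands in for the real call to a payment provider
    var receipt = $"receipt-{chargeCount}-{amount}";
    processed[idempotencyKey] = receipt;
    return receipt;
}

var first = ChargeOnce("order-42", 19.99m);
var retry = ChargeOnce("order-42", 19.99m); // for example, replayed by a retry

Console.WriteLine(first == retry); // True
Console.WriteLine(chargeCount);    // 1
```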
Timeout: do not wait forever
A timeout is the simplest pattern, and one of the most important. It says: "If this has not finished in X seconds, stop waiting and move on."
Without a timeout, a single slow call can hold a thread open for minutes. As we saw earlier, that is how a whole app freezes. A timeout puts a hard limit on patience.
```csharp
using Polly;
using Polly.Timeout;

var pipeline = new ResiliencePipelineBuilder()
    .AddTimeout(TimeSpan.FromSeconds(5)) // give up after 5 seconds
    .Build();

try
{
    await pipeline.ExecuteAsync(async token =>
    {
        // Polly cancels this token when time is up, so the called method must honour it.
        await CallSlowServiceAsync(token);
    });
}
catch (TimeoutRejectedException)
{
    // The call took too long. We handle it gracefully here.
    Console.WriteLine("The service was too slow, using a backup.");
}
```
A good habit is to combine a timeout with a retry. The timeout cuts off each slow attempt, and the retry tries again. Together they keep things both fast and forgiving.
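Here is a minimal sketch of that combination. In Polly, the strategy you add first becomes the outer layer, so adding the retry before the timeout gives every attempt its own time limit. The delays are illustrative assumptions, and CallSlowServiceAsync is a stand-in defined only so the sample runs on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

// Hypothetical stand-in for the slow call.
static Task CallSlowServiceAsync(CancellationToken token) =>
    Task.Delay(TimeSpan.FromMilliseconds(100), token);

var pipeline = new ResiliencePipelineBuilder()
    // Added first, so it is the outer layer: it sees each failed or timed-out attempt and retries it.
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 2,
        Delay = TimeSpan.FromMilliseconds(500),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    // Added last, so it is the inner layer: each individual attempt gets 2 seconds.
    .AddTimeout(TimeSpan.FromSeconds(2))
    .Build();

await pipeline.ExecuteAsync(async token =>
{
    await CallSlowServiceAsync(token);
});
```

If the two were added the other way round, the 2 seconds would cover all attempts together instead of each one separately.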
Circuit breaker: stop knocking on a closed door
Retrying is great when a problem is short-lived. But what if a service is properly broken and will stay down for minutes? Retrying then just wastes time and energy, and keeps your users waiting.
A circuit breaker watches how many calls fail. If too many fail within a short window, it "opens" and stops all calls for a while. During this cool-down, your app fails fast instead of waiting. After the wait, it lets a single test call through. If that works, it "closes" and normal traffic resumes.
It works just like the circuit breaker in your home's electrical panel. When something is wrong, the breaker trips and cuts the power to keep the rest of the house safe.
The three states are easy to remember:
| State | Meaning | What happens to calls |
|---|---|---|
| Closed | Everything is healthy | Calls go through normally |
| Open | Service looks broken | Calls fail fast, none are sent |
| Half-Open | Testing the waters | One test call is allowed through |
```csharp
using Polly;
using Polly.CircuitBreaker;

var pipeline = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        // Open the circuit if 50% of calls fail...
        FailureRatio = 0.5,
        // ...within a 10 second sampling window...
        SamplingDuration = TimeSpan.FromSeconds(10),
        // ...with at least 8 calls to judge from.
        MinimumThroughput = 8,
        // Stay open for 30 seconds before testing again.
        BreakDuration = TimeSpan.FromSeconds(30)
    })
    .Build();
```
When the circuit is open, Polly throws a BrokenCircuitException immediately. That is your signal to use a fallback, which is the next pattern.
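To watch the three states change, the same options object exposes callbacks for each transition. The sketch below uses deliberately tiny numbers (an assumption, so the demo trips quickly) and a simulated outage.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

var openedCount = 0;

var pipeline = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        // Tiny values for the demo only; real services need larger samples.
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(10),
        MinimumThroughput = 2,
        BreakDuration = TimeSpan.FromSeconds(1),
        OnOpened = args =>
        {
            openedCount++;
            Console.WriteLine($"Open: failing fast for {args.BreakDuration.TotalSeconds} s");
            return default;
        },
        OnHalfOpened = _ =>
        {
            Console.WriteLine("Half-open: letting a test call through");
            return default;
        },
        OnClosed = _ =>
        {
            Console.WriteLine("Closed: back to normal");
            return default;
        }
    })
    .Build();

// Drive a few failing calls through the pipeline to trip the breaker.
for (var i = 0; i < 3; i++)
{
    try
    {
        await pipeline.ExecuteAsync(async token =>
        {
            await Task.Yield();
            throw new InvalidOperationException("simulated outage");
        });
    }
    catch (BrokenCircuitException)
    {
        Console.WriteLine("Rejected without calling the service");
    }
    catch (InvalidOperationException)
    {
        Console.WriteLine("Call failed");
    }
}
```

In a real app these callbacks are a natural place to write a log entry or raise a metric, so an opening circuit never goes unnoticed.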
Fallback: always have a backup plan
A fallback is your "plan B." When everything else fails, instead of showing the user an ugly error, you return something useful and safe.
Think of a weather widget. If the live weather service is down, you could show yesterday's weather with a small note saying "data may be old." The user still sees something helpful. That is a fallback.
Fallbacks are not always possible. You cannot fake a bank balance. But for many features, a cached value, an empty list, or a friendly message is far better than a crash.
How a fallback rescues a failed call:
- Call service: try the real service first.
- Failed? Catch the error or the open circuit.
- Use cached value: return the last known good data.
- Show user: the user still sees a result.
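With Polly, these steps can be written as a fallback strategy on a typed pipeline. This is a sketch under assumptions: the cached forecast and GetLiveForecastAsync are hypothetical stand-ins, and the live call is made to fail on purpose.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Fallback;

// Hypothetical last known good value, for example read from a cache.
var cachedForecast = "Sunny, 21 degrees (data may be old)";

// Hypothetical live call that is failing right now.
static async Task<string> GetLiveForecastAsync(CancellationToken token)
{
    await Task.Delay(10, token);
    throw new HttpRequestException("weather service unavailable");
}

var pipeline = new ResiliencePipelineBuilder<string>()
    .AddFallback(new FallbackStrategyOptions<string>
    {
        // Fall back on a failed request or an open circuit.
        ShouldHandle = new PredicateBuilder<string>()
            .Handle<HttpRequestException>()
            .Handle<BrokenCircuitException>(),
        // Return the cached value instead of surfacing the error.
        FallbackAction = _ => Outcome.FromResultAsValueTask(cachedForecast)
    })
    .Build();

var forecast = await pipeline.ExecuteAsync(async token => await GetLiveForecastAsync(token));
Console.WriteLine(forecast);
```

The fallback only handles the exception types listed, so unexpected bugs still surface as errors instead of being silently replaced by old data.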
Health checks: let the cloud know you are okay
So far we have protected calls going out of our app. Health checks are about the signal we send about our app.
A health check is a small endpoint, usually at /health, that answers one question: "Are you okay?" The cloud platform, like Kubernetes or .NET Aspire, calls this endpoint over and over. If your app stops answering, the platform restarts it or stops sending it traffic. This is how apps heal themselves.
There are usually two kinds:
- Liveness (/alive): Is the app running at all? If not, restart it.
- Readiness (/health): Is the app ready to accept traffic? Maybe it is still warming up or its database is down, so hold off sending users to it.
```csharp
// AddSqlServer comes from the NuGet package AspNetCore.HealthChecks.SqlServer.
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);
// Register health checks, including one for the database.
builder.Services
    .AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), new[] { "live" })
    .AddSqlServer(builder.Configuration.GetConnectionString("Db")!);

var app = builder.Build();
// Expose the endpoints the platform will call.
app.MapHealthChecks("/health"); // ready for traffic? Runs every check.
app.MapHealthChecks("/alive", new HealthCheckOptions
{
    // alive at all? Runs only the checks tagged "live".
    Predicate = check => check.Tags.Contains("live")
});
app.Run();
```
If you use .NET Aspire, a lot of this is handled for you. Aspire is Microsoft's stack for cloud-native .NET apps. Every app you build with it automatically gets structured logs, traces, metrics, and health checks, all pre-wired. Aspire integrations even add their own checks. For example, the PostgreSQL integration adds a check that opens a real connection and runs a query, so a broken database shows up in your health status right away.
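When no ready-made check exists for a dependency, you can write your own by implementing IHealthCheck. The sketch below is one possible shape. The CatalogHealthCheck class and the /ping path are assumptions made for illustration.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient("catalog", client =>
{
    client.BaseAddress = new Uri("https://catalog.example.com");
});

builder.Services
    .AddHealthChecks()
    .AddCheck<CatalogHealthCheck>("catalog");

var app = builder.Build();
app.MapHealthChecks("/health");
app.Run();

// Hypothetical check: reports Unhealthy when the catalog service cannot be reached.
public sealed class CatalogHealthCheck(IHttpClientFactory factory) : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            var client = factory.CreateClient("catalog");
            // "/ping" is an assumed endpoint; use whatever cheap call the dependency offers.
            using var response = await client.GetAsync("/ping", cancellationToken);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Unhealthy($"Catalog returned {(int)response.StatusCode}");
        }
        catch (HttpRequestException ex)
        {
            return HealthCheckResult.Unhealthy("Catalog is unreachable", ex);
        }
    }
}
```

Keep custom checks cheap and quick, because the platform calls them over and over.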
Putting it all together
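These patterns are stronger as a team than alone. A common, sensible order to layer them, from the outside in, is: an overall timeout, then retry, then a circuit breaker, then a per-attempt timeout, with a fallback wrapped around everything if you have one. The request flows through each guard before it ever touches the network.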
With AddStandardResilienceHandler(), .NET arranges these layers for you in a tested, sensible way. You can still tune the numbers to fit your app. A fast internal service might use short timeouts. A slow third-party API might need more patience.
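For work that is not an HTTP call, you can layer the patterns yourself with Polly. The sketch below is one reasonable arrangement, not the only one. GetPriceAsync, the fallback text, and all the numbers are assumptions made for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Fallback;
using Polly.Retry;
using Polly.Timeout;

// Hypothetical stand-in for a real remote call.
static async Task<string> GetPriceAsync(CancellationToken token)
{
    await Task.Delay(50, token);
    return "42.00";
}

var pipeline = new ResiliencePipelineBuilder<string>()
    // 1. Fallback (outermost): the last line of defence when everything inside has failed.
    .AddFallback(new FallbackStrategyOptions<string>
    {
        ShouldHandle = new PredicateBuilder<string>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .Handle<BrokenCircuitException>(),
        FallbackAction = _ => Outcome.FromResultAsValueTask("price unavailable")
    })
    // 2. Retry: repeats attempts that failed or timed out, but not open-circuit rejections.
    .AddRetry(new RetryStrategyOptions<string>
    {
        ShouldHandle = new PredicateBuilder<string>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>(),
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    // 3. Circuit breaker: stops sending attempts once the failure ratio is too high.
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<string>
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(10),
        MinimumThroughput = 8,
        BreakDuration = TimeSpan.FromSeconds(30)
    })
    // 4. Timeout (innermost): limits each individual attempt.
    .AddTimeout(TimeSpan.FromSeconds(5))
    .Build();

var price = await pipeline.ExecuteAsync(async token => await GetPriceAsync(token));
Console.WriteLine(price);
```

Strategies run in the order they are added, so the first one added wraps all the others.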
A few honest tips
Resilience is powerful, but it is not magic. Keep these in mind:
- Do not retry everything. Retrying a non-idempotent action can create duplicate orders or double charges. Be careful.
- Always set a timeout. A retry without a timeout can wait forever. The two belong together.
- Tune your numbers. The right retry count and timeout depend on your real traffic. Start with the defaults, then watch and adjust.
- Log when patterns trigger. If your circuit breaker keeps opening, that is telling you something real is wrong. Do not hide it.
- Test failures on purpose. Turn off a dependency in a test environment and watch how your app behaves. This is called chaos testing.
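A lightweight way to test failures on purpose, short of switching a dependency off, is to wrap a call so that it fails some of the time. The sketch below shows the idea in plain C#. WithInjectedFaults and GetStockAsync are hypothetical names, and the 30% failure rate is an arbitrary choice for a test environment.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Wraps a call so that a chosen fraction of invocations fail before reaching the real dependency.
static Func<CancellationToken, Task<string>> WithInjectedFaults(
    Func<CancellationToken, Task<string>> inner, double failureRate, Random random) =>
    async token =>
    {
        if (random.NextDouble() < failureRate)
        {
            throw new HttpRequestException("Injected fault (test only)");
        }
        return await inner(token);
    };

// Hypothetical dependency call.
static Task<string> GetStockAsync(CancellationToken token) => Task.FromResult("in stock");

var flaky = WithInjectedFaults(GetStockAsync, 0.3, new Random());
var failures = 0;

for (var i = 0; i < 10; i++)
{
    try
    {
        await flaky(CancellationToken.None);
    }
    catch (HttpRequestException)
    {
        failures++;
    }
}

Console.WriteLine($"{failures} of 10 calls failed on purpose");
```

Run your resilience pipeline around the flaky wrapper and check that users still get a sensible result.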
A note on libraries and licensing
Two popular .NET libraries, MediatR and MassTransit, are now commercially licensed for many uses. They are not resilience libraries themselves, but they often appear in the same cloud projects, so it is worth knowing. Polly, by contrast, is open source and free, and it is the foundation of the official Microsoft resilience packages. For most resilience needs, Polly and Microsoft.Extensions.Http.Resilience are all you need, and they are free to use.
It is also a good time to be on a modern runtime. .NET 10 is the current LTS release, C# 14 has shipped, and C# 15 with union types is in .NET 11 preview. The resilience APIs shown here work great on these recent versions.
Quick recap
- A resilient app expects small failures and handles them calmly instead of crashing.
- Retry gives a failing action another chance; use exponential backoff with jitter, and only on safe (idempotent) actions.
- Timeout stops you waiting forever and protects your threads. Always pair it with retries.
- Circuit breaker stops calling a broken service and lets it recover, like the breaker in a home electrical panel.
- Fallback returns a safe backup value so users still see something useful.
- Health checks let the cloud platform restart or reroute around an unhealthy app, so it can heal itself.
- In .NET, Polly and Microsoft.Extensions.Http.Resilience give you all of this with very little code, and .NET Aspire wires up health checks and telemetry for you.