Kubernetes Probes: The Operational Safety Net Your Cluster Needs

Your cluster is safely handling production traffic today. Fine.

But what about tomorrow?

By design, Kubernetes relies on certain operational principles. Although it can manage itself to a large extent, as your applications and infrastructure grow, there are areas where Kubernetes expects explicit signals and control from you.

In this article, we'll examine one of the most critical ones:

Probe Definitions

Probe configuration affects every pod that receives production traffic.

Without probes, two common problems usually occur:

  • Even if a container freezes internally, Kubernetes won't notice it. The pod will continue to appear as Running.

  • During rolling updates, the Service starts sending traffic as soon as the container is marked as "started" — even if the application is not actually ready.

This is one of the most common causes of short-lived 5xx spikes after deployments.

So how should liveness, readiness, and startup probes be configured?

Understanding Kubernetes Probes

Kubernetes provides three different probe types, and each answers a different operational question.

1. Liveness Probe — "Is the Container Still Alive?"

If the probe fails:

→ Kubernetes kills the container and restarts it, subject to the pod's restartPolicy.

Purpose

Applications can become:

  • Frozen internally

  • Deadlocked

  • Stuck in an infinite loop

In these situations, the process may still exist, but the application is no longer responsive.

Kubernetes attempts recovery by restarting the container.

Example

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080

2. Readiness Probe — "Can I Receive Traffic Right Now?"

If the probe fails:

→ Kubernetes does NOT restart the container.

Instead, it removes the pod from the Service endpoint list.

Purpose

Common scenarios include:

  • Application warm-up still in progress

  • Cache loading incomplete

  • Database connectivity issues

  • Temporary overload conditions

In other words:

"I'm alive, but I shouldn't receive traffic right now."

Example

readinessProbe:
  httpGet:
    path: /ready
    port: 8080

3. Startup Probe — "Has the Application Finished Starting?"

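Until the probe succeeds:

→ Liveness and readiness probes remain disabled.

They only become active after the startup probe succeeds.

If the probe keeps failing past its failureThreshold:

→ Kubernetes kills the container and restarts it.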

Purpose

Startup probes are particularly valuable for applications with long initialization times:

  • Java / Spring Boot

  • .NET

  • Large Python applications

These workloads may require 60–120 seconds before becoming operational.

Without a startup probe, Kubernetes may interpret slow startup as failure and restart the container repeatedly.

Result

CrashLoopBackOff

Example

startupProbe:
  httpGet:
    path: /healthz
    port: 8080

Why Use All Three Together?

A healthy pod lifecycle typically looks like this:

[Container Start]
        │
        ├── Startup Phase
        │       startupProbe runs
        │       liveness/readiness inactive
        │
        ├── Startup Success
        │       liveness + readiness enabled
        │
        ├── Runtime Phase
        │       liveness:  Am I alive?
        │       readiness: Can I receive traffic?
        │
        └── Pod Termination

Each probe serves a unique purpose.

Using only one or two of them leaves operational gaps.
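The division of labor between the three probes can be summarized as a small decision function. This is a simplified model for illustration only, not how the kubelet is implemented: it ignores thresholds and timing, and the function name and action strings are hypothetical.

```python
from typing import Optional


def probe_action(startup_ok: Optional[bool], live_ok: bool, ready_ok: bool) -> str:
    """Return the action a simplified cluster would take for one container.

    startup_ok: None  -> no startup probe configured
                False -> startup probe has not succeeded yet
                True  -> startup probe succeeded
    """
    # While the startup probe is pending, liveness and readiness are not evaluated.
    if startup_ok is False:
        return "wait"
    # A failed liveness check leads to a container restart.
    if not live_ok:
        return "restart"
    # A failed readiness check only removes the pod from Service endpoints.
    if not ready_ok:
        return "remove_from_endpoints"
    return "serve_traffic"


if __name__ == "__main__":
    print(probe_action(False, live_ok=False, ready_ok=False))
    print(probe_action(True, live_ok=True, ready_ok=False))
    print(probe_action(True, live_ok=False, ready_ok=True))
```

Dropping any one input from this model shows the gap it leaves: without the startup input a slow boot looks like a liveness failure, and without the readiness input a live but unprepared pod goes straight to serving traffic.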

Probe Mechanisms

Kubernetes supports several probe mechanisms. The three most common are covered below (a gRPC mechanism also exists).

HTTP GET (Recommended)

Expose a health endpoint and return HTTP 200 when healthy. Kubernetes counts any status code from 200 to 399 as success.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080

Advantages

  • Easy to implement

  • Lightweight

  • Human-readable

  • Preferred for most web applications
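To make the success rule concrete, here is a rough Python approximation of a single HTTP probe attempt. It is a sketch for local experimentation, not the kubelet's implementation; the names `is_success` and `http_probe` are hypothetical, and details such as redirect handling may differ from what Kubernetes does.

```python
import urllib.error
import urllib.request


def is_success(status_code: int) -> bool:
    """Any status from 200 up to (not including) 400 counts as a passing probe."""
    return 200 <= status_code < 400


def http_probe(url: str, timeout_seconds: float = 1.0) -> bool:
    """Send one GET request and report whether it counts as a passing probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return is_success(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return is_success(err.code)
    except OSError:
        # Connection refused, DNS failure, or timeout all count as failure.
        return False


if __name__ == "__main__":
    # Assumes something is listening locally on port 8080; adjust as needed.
    print(http_probe("http://127.0.0.1:8080/healthz"))
```

Running this against a service on your own machine is a quick way to check that a health endpoint behaves as expected before wiring it into a manifest.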

TCP Socket

Useful when no HTTP endpoint exists.

Kubernetes simply verifies that it can establish a TCP connection.

livenessProbe:
  tcpSocket:
    port: 5672

Common for:

  • RabbitMQ

  • Databases

  • Message brokers
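A TCP check amounts to attempting a connection and closing it again. The sketch below approximates that in Python; `tcp_probe` is a hypothetical helper, not part of any Kubernetes client library.

```python
import socket


def tcp_probe(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established in time."""
    try:
        # create_connection performs the TCP handshake; we close immediately.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Refused connections, unreachable hosts, and timeouts all land here.
        return False


if __name__ == "__main__":
    # 5672 is the port used in the example above; assumes a local broker.
    print(tcp_probe("127.0.0.1", 5672))
```

Note what this does not verify: an open port only shows that something accepts connections, not that the application behind it can do useful work.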

Exec Probe

Executes a command inside the container.

Exit code 0 indicates success.

livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]  # illustrative command; substitute your own check

Important Note

Exec probes launch a new process during every probe interval.

This can become expensive from a CPU perspective.

Whenever possible:

  • Use HTTP probes

  • Use TCP probes if HTTP isn't available

  • Reserve exec probes for special cases
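The exit-code contract, and the fact that every check spawns a process, can be seen in a small Python approximation. `exec_probe` is a hypothetical helper for illustration and runs the command on the local machine, not inside a container.

```python
import subprocess
import sys
from typing import Sequence


def exec_probe(command: Sequence[str], timeout_seconds: float = 1.0) -> bool:
    """Run a command and treat exit code 0 as success, anything else as failure."""
    try:
        # Each call starts a brand-new process, which is where the cost comes from.
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung command or a missing binary counts as a failed probe.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Uses the current Python interpreter as a stand-in for a health command.
    print(exec_probe([sys.executable, "-c", "raise SystemExit(0)"], timeout_seconds=10.0))
```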

Kubernetes Probe Best Practices

1. Use Different Endpoints for Liveness and Readiness

One of the most common mistakes is using the same endpoint for both probes.

Liveness Endpoint

/healthz

Question:

Is the process alive?

This should only verify internal application health.

Avoid checking:

  • Databases

  • Redis

  • External APIs

Readiness Endpoint

/ready

Question:

Can I handle traffic right now?

This should validate required dependencies.

Examples:

  • Database connectivity

  • Cache availability

  • Queue access

Why It Matters

If the database becomes slow and liveness checks it:

→ Every pod starts restarting.

You create a second outage while already dealing with the first one.

2. Keep Liveness Checks Simple

Liveness is a last-resort recovery mechanism.

Avoid business logic.

A simple response is often sufficient:

@app.get("/healthz")
def healthz():
    return {"status": "ok"}, 200

3. Make Readiness Reflect Reality

Readiness should answer:

Can this application safely receive production traffic?

Example:

@app.get("/ready")
def ready():
    if not db.is_connected():
        return {"status": "db_down"}, 503

    if not cache.ping():
        return {"status": "cache_down"}, 503

    return {"status": "ready"}, 200

Checks may include:

  • Database connectivity

  • Cache availability

  • Configuration loading

  • Queue connections
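As the list of checks grows, it can help to separate the checks from the endpoint. One possible shape, independent of any web framework, is sketched below; `evaluate_readiness` and the check names are hypothetical, and the lambdas stand in for real connectivity checks.

```python
from typing import Callable, Dict, Tuple


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[dict, int]:
    """Run each dependency check and return a (body, status_code) pair."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated the same as a check that fails.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return {"status": "not_ready", "failed": failed}, 503
    return {"status": "ready"}, 200


if __name__ == "__main__":
    # Stand-in checks: the database is reachable, the cache is not.
    body, status = evaluate_readiness({"db": lambda: True, "cache": lambda: False})
    print(status, body)
```

A readiness handler could return this pair directly, and listing every failed check in the body makes it easier to see why a pod dropped out of rotation.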

4. Use Startup Probes for Slow-Starting Applications

Ideal candidates:

  • Spring Boot

  • .NET Core

  • Large Python services

Example configuration:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

# Allows up to 300 seconds startup time.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2

With startup probes in place, initialDelaySeconds often becomes unnecessary.
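The 300-second figure in the example is simply failureThreshold multiplied by periodSeconds. The sketch below shows that arithmetic and one way to size the threshold for a known startup time. The helper names and the 1.5 safety factor are assumptions for illustration, and the calculation ignores initialDelaySeconds and per-probe timeouts.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate upper bound on startup time before the container is restarted."""
    return failure_threshold * period_seconds


def failure_threshold_for(
    expected_startup_seconds: float,
    period_seconds: int,
    safety_factor: float = 1.5,  # assumed headroom, tune for your workload
) -> int:
    """Smallest failureThreshold that covers the expected startup time with headroom."""
    return math.ceil(expected_startup_seconds * safety_factor / period_seconds)


if __name__ == "__main__":
    print(startup_budget_seconds(failure_threshold=30, period_seconds=10))
    print(failure_threshold_for(expected_startup_seconds=120, period_seconds=10))
```

For an application that needs up to 120 seconds, this suggests a threshold of 18 at a 10-second period, so the example value of 30 leaves generous headroom.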

5. Use Reasonable Probe Parameters

A good starting point:

periodSeconds: 10
timeoutSeconds: 2      # anywhere from 1 to 3 seconds is reasonable
failureThreshold: 3

Recommendation

Make liveness more tolerant than readiness.

Reason:

  • Readiness failure only stops traffic.

  • Liveness failure restarts the container.

Container restarts should require stronger evidence.
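One way to check this rule against a manifest is to compare how long each probe must keep failing before Kubernetes acts. The sketch below does that with a hypothetical `ProbeSettings` type; the window is an approximation that ignores timeouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    period_seconds: int
    failure_threshold: int

    def failure_window(self) -> int:
        """Approximate seconds of consecutive failures before action is taken."""
        return self.period_seconds * self.failure_threshold


def liveness_is_more_tolerant(liveness: ProbeSettings, readiness: ProbeSettings) -> bool:
    """True when a restart needs a longer run of failures than a traffic cut-off."""
    return liveness.failure_window() > readiness.failure_window()


if __name__ == "__main__":
    # Values taken from the startup probe example earlier in this article.
    liveness = ProbeSettings(period_seconds=10, failure_threshold=3)
    readiness = ProbeSettings(period_seconds=5, failure_threshold=2)
    print(liveness.failure_window(), readiness.failure_window())
    print(liveness_is_more_tolerant(liveness, readiness))
```

With those values the pod leaves rotation after roughly 10 seconds of failures, while a restart needs roughly 30.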

6. Keep Health Endpoints Authentication-Free

Health endpoints should not require authentication.

Bad:

/healthz → Requires JWT

Good:

/healthz → Internal cluster access only

Typically, these endpoints are exposed only within the cluster network.

7. Avoid Chaining Other Services' Health Checks

A service should report only its own state.

Bad pattern:

Service A checks Service B
Service B checks Service C

If Service C slows down:

→ B becomes unready

→ A becomes unready

→ Cascading failure spreads through the system

Instead, verify only the dependencies required for your service to function correctly.
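The cascade can be reproduced with a toy model in which every service's readiness check calls the readiness checks of the services it depends on. The function below is a hypothetical simulation of that bad pattern, not a tool for real clusters.

```python
from typing import Dict, List, Set


def unready_services(depends_on: Dict[str, List[str]], degraded: Set[str]) -> Set[str]:
    """Services reporting unready when each check chains into its dependencies' checks."""
    unready = set(degraded)
    changed = True
    # Keep propagating until no further service is pulled down.
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in unready and any(dep in unready for dep in deps):
                unready.add(service)
                changed = True
    return unready


if __name__ == "__main__":
    chain = {"A": ["B"], "B": ["C"], "C": []}
    print(sorted(unready_services(chain, degraded={"C"})))
```

With the chain from the example, one degraded service takes all three out of rotation. If each service reported only its own state, the unready set would stay at the degraded service alone.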

The Real Value of Probes Appears During Failures

The absence of probes rarely causes problems when everything is healthy.

Their true value becomes visible during unexpected incidents.

Imagine a pod deadlocks at 3 AM.

With Probes

  • Kubernetes detects the issue

  • The container is restarted automatically

  • Users may never notice

Without Probes

  • The pod still shows as Running

  • Requests start returning 5xx errors

  • The issue remains hidden until someone investigates

Conclusion

Kubernetes probes are not just health checks.

They are operational safeguards that:

  • Detect application failures

  • Prevent premature traffic routing

  • Improve deployment reliability

  • Enable automatic recovery

  • Reduce user-facing outages

Think of probes as an insurance policy for your workloads.

You don't remove the fuse box simply because there hasn't been a fire yet.

Join our 250+ customers

Whether you need expert consulting, custom software, or full-scale data solutions, BiSoft is here to help. Let’s talk about how we can support your goals.
