Java App Scalability on Kubernetes: Patterns and Practices

Introduction

Scaling Java applications in Kubernetes isn’t just about adding replicas. Java’s unique memory management and runtime behavior introduce challenges that require deliberate tuning and monitoring. This post dives into the fundamentals of scaling Java apps in Kubernetes, best practices to follow, and common mistakes to avoid.

Whether you’re deploying a Spring Boot service or a high-throughput reactive API, understanding Kubernetes scaling strategies and how Java interacts with container environments is key to achieving reliable performance.


Understanding Scaling in Kubernetes

In real-world deployments, the most widely adopted strategy for scaling Java apps is using Horizontal Pod Autoscaler (HPA) in combination with well-configured Resource Requests and Limits. Here’s a breakdown of the most important components:

Horizontal Pod Autoscaler (HPA)

  • Automatically adjusts the number of pod replicas based on observed CPU utilization or custom metrics (e.g., from Prometheus).
  • Well-suited for stateless Java services such as REST APIs built with Spring Boot.
  • Supports custom and external metrics via the Prometheus Adapter (see the example below).

How it works: HPA uses a control loop to continuously monitor specified metrics. For CPU-based scaling, it follows this algorithm:

desiredReplicas = ceil[currentReplicas * (averageUtilization / targetUtilization)]

For example, if one pod is at 30% CPU and another at 80%, the average is (30% + 80%) / 2 = 55%. With a target of 40% and 2 pods running:

desiredReplicas = ceil[2 * (55 / 40)] = ceil[2.75] = 3 pods

Kubernetes will scale to 3 pods to bring average utilization back toward the target, avoiding both under- and over-provisioning. Combine this with stabilization windows to prevent flapping.
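
HPA can also act on application-level metrics exposed through the Prometheus Adapter. A minimal sketch, assuming the adapter publishes a hypothetical http_requests_per_second metric for these pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: java-app-rps-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # Hypothetical metric served by Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "100"             # Scale until each pod handles ~100 req/s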

đź“– Docs: Horizontal Pod Autoscaler

Resource Requests and Limits

  • Kubernetes uses these values to schedule pods and enforce maximum resource usage.
  • Proper configuration ensures application stability and reliability, helping prevent resource contention and OOM (Out of Memory) errors.
  • Critical for autoscaling accuracy—HPA uses these values as part of its decision-making.
  • đź“– Docs: Managing Resources for Containers

Vertical Pod Autoscaler (VPA)

  • Dynamically adjusts CPU and memory requests/limits for pods.
  • May cause pod restarts, which could impact latency-sensitive applications.
  • Not widely used for scaling Java applications—especially not for latency-sensitive services like REST APIs or microservices.

Most teams opt for:

  • Horizontal Pod Autoscaler (HPA) for scaling replicas based on CPU/custom metrics.
  • Carefully set Resource Requests and Limits to guide scheduling and avoid overcommitment.

VPA is better suited for batch jobs, ML workloads, or non-critical background processes—cases where restarts are acceptable and resource usage is unpredictable.
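
If VPA does fit such a workload, the manifest is small. A minimal sketch, assuming the VPA controller is installed in the cluster and batch-worker is a hypothetical Deployment:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker    # Hypothetical batch workload where restarts are acceptable
  updatePolicy:
    updateMode: "Auto"    # VPA may evict pods to apply updated requests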


JVM Behavior in Kubernetes

Memory Handling

JVMs older than Java 10 (or Java 8 before update 191, where container support was backported) do not respect Linux cgroup limits by default: they size the heap from the host's memory rather than the container's, so the JVM can allocate more memory than the container actually has. The result is "OOMKilled" crashes in Kubernetes.

Modern JVMs (Java 10+) include support for detecting container memory constraints using the following flags:

-XX:+UseContainerSupport          # Cgroup-limit awareness (enabled by default since Java 10)
-XX:MaxRAMPercentage=75.0         # Sizes the heap as 75% of container memory

For example, if a pod is allocated 4Gi of memory, -XX:MaxRAMPercentage=75.0 will allow the heap to grow up to 3Gi, leaving 1Gi for non-heap usage like metaspace, thread stacks, and native buffers.

These flags are crucial for preventing the JVM from over-allocating memory in containers.
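
To confirm what the JVM actually detected, unified logging can trace container detection at startup. A hedged sketch of diagnostic flags (Java 10+):

-Xlog:os+container=trace           # Logs the cgroup CPU/memory limits the JVM detected
-XX:+PrintFlagsFinal               # Dumps final flag values, including the computed MaxHeapSize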


GC Overhead

The garbage collector (GC) reclaims memory from objects that are no longer reachable. It runs automatically, but its overhead matters for autoscaling:

  • GC pauses can temporarily halt application threads.
  • These pauses may mislead metrics-based autoscaling.
  • Intensive GC can create misleading CPU spikes or make the app appear idle or overloaded.

Use modern collectors:

  • G1GC – Balanced default for most applications.
  • ZGC, ShenandoahGC – Designed for low-pause requirements (Java 11+).

Monitor GC time and frequency using Prometheus, JMX, or Micrometer.
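
As an example, a low-pause setup on a recent JDK might combine a collector choice with unified GC logging. A hedged sketch (pick exactly one collector):

-XX:+UseZGC                                  # Low-pause collector (production-ready since Java 15)
-Xlog:gc*:file=/var/log/gc.log:time,uptime   # Unified GC logging (Java 9+), ready for log shipping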


Heap vs. Container Memory

Heap is the memory area used for object allocation. However, the JVM also uses other areas:

  • Metaspace – Stores class metadata
  • Thread stacks – Each thread consumes native memory (~1MB/thread)
  • Native memory – Used by JNI, NIO buffers, etc.

Total JVM memory usage ≈ Heap + Metaspace + Thread stacks + Native

  • Heap does not represent total memory use.
  • Always leave headroom to avoid hitting container memory limits and triggering OOMKilled.
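
These non-heap areas can be bounded explicitly so the total stays under the container limit. A hedged sketch for a 4Gi container (values are illustrative, not recommendations):

-XX:MaxRAMPercentage=75.0       # Heap: up to 3Gi of a 4Gi container
-XX:MaxMetaspaceSize=256m       # Bound class-metadata growth
-Xss512k                        # Shrink per-thread stacks from the ~1MB default
-XX:MaxDirectMemorySize=256m    # Cap off-heap NIO direct buffers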

Use metrics like jvm_memory_used_* and jvm_memory_max_* to monitor both heap and non-heap usage.


Best Practices for Scaling Java Apps

1. Set Resource Requests and Limits Wisely

Kubernetes uses requests to schedule pods and limits to enforce upper bounds. For Java apps:

  • Set CPU and memory requests high enough to cover steady-state usage; CPU limits set too low cause throttling.
  • Avoid over-provisioning, which wastes resources and can leave pods unschedulable.
  • Keep requests and limits equal (the Guaranteed QoS class) for predictable autoscaling behavior.

2. Align JVM Heap with Container Limits

Modern JVMs support container memory limits. Prefer:

-XX:+UseContainerSupport                      # Java 10+
-XX:MaxRAMPercentage=75.0                     # Dedicate 75% of container memory to heap

This makes memory management dynamic and prevents OOMs caused by heap over-allocation.

3. Handle JVM Startup Bursts Gracefully

Java apps often consume high CPU at startup due to class loading and JIT compilation. Avoid routing traffic too soon.

Use readiness probes to delay traffic until the pod is fully initialized:

readinessProbe:
  httpGet:
    path: /health            # Path to check for readiness
    port: 8080               # Same port your app serves traffic on
  initialDelaySeconds: 30    # Wait 30s after container starts before checking
  periodSeconds: 10          # Check every 10 seconds
  failureThreshold: 3        # Mark pod as not ready after 3 failed checks
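
On Kubernetes 1.18+, a startupProbe is often a better fit for slow JVM starts than a long initialDelaySeconds, because it holds the other probes off until the app has started. A minimal sketch:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5           # Check every 5 seconds during startup
  failureThreshold: 30       # Allows up to 30 * 5s = 150s of JVM startup time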

📌 Example Deployment implementing strategies 1–3:

# Deployment YAML in K8s
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
        - name: java-app-container
          image: your-java-app-image:latest
          resources:
            requests:
              memory: "32Gi"  # 32Gi memory and 8 vCPUs reserved
              cpu: "8000m"
            limits:
              memory: "32Gi"
              cpu: "8000m"
          env:
            - name: JAVA_TOOL_OPTIONS
              # 75% of container memory to heap, G1 collector with string
              # deduplication to save memory on repeated strings, and a
              # metaspace cap. (Inline "#" text inside the folded value would
              # be passed to the JVM, so the comments live here instead.)
              value: >-
                -XX:MaxRAMPercentage=75.0
                -XX:+UseG1GC
                -XX:+UseStringDeduplication
                -XX:MaxMetaspaceSize=512m
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health        # Endpoint checked for readiness
              port: 8080           # Port where app listens
            initialDelaySeconds: 30  # Delay before probe starts (JVM warm-up time)
            periodSeconds: 10        # Probe interval
            failureThreshold: 3      # Mark pod unready after 3 failed checks

4. Use CPU-Based HPA with Stabilization Windows

Memory-based scaling is unreliable for Java because the JVM rarely returns heap to the OS, so memory usage looks high even when the app is idle. Prefer CPU or custom application-level metrics, and prevent frequent rescaling by applying stabilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: java-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-app  # Must match the name of the Deployment above
  minReplicas: 2  # Never scale below 2 replicas
  maxReplicas: 10 # Scale up to 10 replicas if needed
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 40  # Target 40% average CPU usage
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Wait at least 60s before scaling up
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait at least 5 minutes before scaling down

5. Add Pod Disruption Budgets (PDBs) for Availability Guarantees

To ensure your service stays available during voluntary disruptions (e.g., node maintenance), use PodDisruptionBudget.

Stateless Java apps often benefit more from using maxUnavailable to allow faster updates while maintaining availability:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: java-app-pdb
spec:
  maxUnavailable: 1  # Allow at most 1 pod to be disrupted at a time
  selector:
    matchLabels:
      app: java-app

Guidance:

  • maxUnavailable is preferred for stateless services needing faster rollouts.
  • Use minAvailable if you need to guarantee a fixed number of pods are always running.
  • Always match your replica count and availability goals.

đź“– Kubernetes PDB Docs


Common Pitfalls to Avoid

🚫 Scaling on Memory: Java apps may hold onto heap even when idle, misleading memory-based autoscalers.

🚫 Ignoring Warm-Up Time: JVMs need time to load classes and JIT-compile hot paths, so cold pods underperform.

🚫 Overreactive Autoscaling: Short scale-down windows cause flapping. Use stabilizationWindowSeconds, as in the HPA example above.

🚫 Not Considering GC Pauses: Avoid aggressive autoscaling on spiky CPU loads, and analyze GC logs if autoscaling seems jittery.


Monitoring and Observability

Metric                   Tool                 Why It Matters
CPU usage                Prometheus           Trigger for HPA
Heap/non-heap memory     JMX Exporter         JVM health
GC frequency/duration    Micrometer/JMX       Spot GC thrashing
Thread count             JMX Exporter         Detect thread leaks or spikes
Pod readiness/liveness   Kubernetes probes    Determines scaling and traffic routing
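
If the cluster runs the Prometheus Operator, a ServiceMonitor can wire these metrics into Prometheus. A sketch assuming the app's Service exposes a port named metrics and Micrometer serves Spring Boot's Prometheus endpoint:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: java-app-metrics
spec:
  selector:
    matchLabels:
      app: java-app                 # Matches the Service fronting the Deployment
  endpoints:
    - port: metrics                 # Named port on that Service (assumed)
      path: /actuator/prometheus    # Micrometer's Prometheus endpoint in Spring Boot
      interval: 15s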

Summary

Scaling Java apps in Kubernetes requires an understanding of both Kubernetes primitives and JVM internals. By carefully choosing metrics, setting correct resource requests, and avoiding common pitfalls, you can ensure that your application scales reliably and cost-effectively.

🚀 With the right observability and autoscaling setup, your Java services will be well-prepared to handle real-world workloads in the cloud.
