Introduction
Scaling Java applications in Kubernetes isn’t just about adding replicas. Java’s unique memory management and runtime behavior introduce challenges that require deliberate tuning and monitoring. This post dives into the fundamentals of scaling Java apps in Kubernetes, best practices to follow, and common mistakes to avoid.
Whether you’re deploying a Spring Boot service or a high-throughput reactive API, understanding Kubernetes scaling strategies and how Java interacts with container environments is key to achieving reliable performance.
Understanding Scaling in Kubernetes
In real-world deployments, the most widely adopted strategy for scaling Java apps is using Horizontal Pod Autoscaler (HPA) in combination with well-configured Resource Requests and Limits. Here’s a breakdown of the most important components:
Horizontal Pod Autoscaler (HPA)
- Automatically adjusts the number of pod replicas based on observed CPU utilization or custom metrics (e.g., from Prometheus).
- Well-suited for stateless Java services such as REST APIs built with Spring Boot.
- Supports external metrics via Prometheus Adapter.
How it works: HPA uses a control loop to continuously monitor specified metrics. For CPU-based scaling, it follows this algorithm:
desiredReplicas = ceil[currentReplicas * (averageUtilization / targetUtilization)]
For example, if one pod is at 30% CPU and another at 80%, the average is (30% + 80%) / 2 = 55%. With a target of 40% and 2 pods running:
desiredReplicas = ceil[2 * (55 / 40)] = ceil[2.75] = 3 pods
Kubernetes will scale to 3 pods to reduce the average utilization closer to the target. This algorithm promotes metric-based stability and helps avoid under- or over-provisioning. Combine with stabilization windows to prevent flapping.
đź“– Docs: Horizontal Pod Autoscaler
Resource Requests and Limits
- Kubernetes uses these values to schedule pods and enforce maximum resource usage.
- Proper configuration ensures application stability and reliability, helping prevent resource contention and OOM (Out of Memory) errors.
- Critical for autoscaling accuracy—HPA uses these values as part of its decision-making.
- đź“– Docs: Managing Resources for Containers
Vertical Pod Autoscaler (VPA)
- Dynamically adjusts CPU and memory requests/limits for pods.
- May cause pod restarts, which could impact latency-sensitive applications.
- Not widely used for scaling Java applications—especially not for latency-sensitive services like REST APIs or microservices.
Most teams opt for:
- Horizontal Pod Autoscaler (HPA) for scaling replicas based on CPU/custom metrics.
- Carefully set Resource Requests and Limits to guide scheduling and avoid overcommitment.
VPA is better suited for batch jobs, ML workloads, or non-critical background processes—cases where restarts are acceptable and resource usage is unpredictable.
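For those workloads, a recommendation-only VPA is a low-risk way to start. A minimal sketch, assuming the VPA components are installed in the cluster; the target name is illustrative, and updateMode: "Off" only surfaces recommendations without evicting pods:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker        # illustrative non-latency-sensitive workload
  updatePolicy:
    updateMode: "Off"         # recommendation-only; no automatic pod restarts
```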
JVM Behavior in Kubernetes
Memory Handling
Java (especially versions prior to Java 10) may not respect Linux cgroup limits by default, which can lead to the JVM allocating more memory than the container actually has. This results in “OOMKilled” crashes in Kubernetes.
Modern JVMs (Java 10+) include support for detecting container memory constraints using the following flags:
-XX:+UseContainerSupport # Enables awareness of cgroup limits (Java 10+)
-XX:MaxRAMPercentage=75.0 # Allocates 75% of container memory to the JVM heap
For example, if a pod is allocated 4Gi of memory, -XX:MaxRAMPercentage=75.0 will allow the heap to grow up to 3Gi, leaving 1Gi for non-heap usage like metaspace, thread stacks, and native buffers.
These flags are crucial for preventing the JVM from over-allocating memory in containers.
GC Overhead
The Garbage Collector (GC) is responsible for reclaiming memory used by objects that are no longer needed. It runs automatically and helps prevent memory leaks.
- GC pauses can temporarily halt application threads.
- These pauses may mislead metrics-based autoscaling.
- Intensive GC can create misleading CPU spikes or make the app appear idle or overloaded.
Use modern collectors:
- G1GC – Balanced default for most applications.
- ZGC, ShenandoahGC – Designed for low-pause requirements (Java 11+).
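Switching collectors is usually just a flag change. A minimal sketch of enabling ZGC via the container environment, assuming Java 15+ where ZGC is production-ready (on Java 11–14 it additionally needs -XX:+UnlockExperimentalVMOptions):
```yaml
# fragment of a container spec
env:
- name: JAVA_TOOL_OPTIONS
  value: >-
    -XX:+UseZGC
    -XX:MaxRAMPercentage=75.0
```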
Monitor GC time and frequency using Prometheus, JMX, or Micrometer.
Heap vs. Container Memory
Heap is the memory area used for object allocation. However, the JVM also uses other areas:
- Metaspace – Stores class metadata
- Thread stacks – Each thread consumes native memory (~1MB/thread)
- Native memory – Used by JNI, NIO buffers, etc.
Total JVM memory usage = Heap + Metaspace + Threads + Native
- Heap does not represent total memory use.
- Always leave headroom to avoid hitting container memory limits and triggering OOMKilled.
Use metrics like jvm_memory_used_* and jvm_memory_max_* to monitor both heap and non-heap usage.
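To catch shrinking headroom before the kernel OOM-kills the container, an alert comparing working-set memory to the pod’s memory limit helps. A hedged sketch, assuming the Prometheus Operator and kube-state-metrics are running; the 90% threshold and container name are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: java-app-memory-headroom
spec:
  groups:
  - name: java-app.rules
    rules:
    - alert: ContainerMemoryNearLimit
      expr: |
        container_memory_working_set_bytes{container="java-app-container"}
          / on(namespace, pod, container)
        kube_pod_container_resource_limits{resource="memory", container="java-app-container"}
          > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Container is using more than 90% of its memory limit"
```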
Best Practices for Scaling Java Apps
1. Set Resource Requests and Limits Wisely
Kubernetes uses requests to schedule pods and limits to enforce upper bounds. For Java apps:
- Set CPU and memory requests high enough for normal load; CPU throttling is enforced against the limit, so don’t set limits too tight either.
- Avoid over-provisioning, which leads to resource waste and scheduling failures.
- Keep requests and limits equal for predictable autoscaling behavior.
2. Align JVM Heap with Container Limits
Modern JVMs support container memory limits. Prefer:
-XX:+UseContainerSupport # Java 10+
-XX:MaxRAMPercentage=75.0 # Dedicate 75% of container memory to heap
This makes memory management dynamic and prevents OOMs caused by heap over-allocation.
3. Handle JVM Startup Bursts Gracefully
Java apps often consume high CPU at startup due to class loading and JIT compilation. Avoid routing traffic too soon.
Use readiness probes to delay traffic until the pod is fully initialized:
```yaml
readinessProbe:
  httpGet:
    path: /health              # Path to check for readiness
    port: 8080                 # Same port your app serves traffic on
  initialDelaySeconds: 30      # Wait 30s after container starts before checking
  periodSeconds: 10            # Check every 10 seconds
  failureThreshold: 3          # Mark pod as not ready after 3 failed checks
```
📌 Example Deployment implementing strategies 1–3:
```yaml
# Deployment YAML in K8s
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
      - name: java-app-container
        image: your-java-app-image:latest
        resources:
          requests:
            memory: "32Gi"     # 32Gi memory and 8 vCPUs reserved
            cpu: "8000m"
          limits:
            memory: "32Gi"
            cpu: "8000m"
        env:
        - name: JAVA_TOOL_OPTIONS
          # -XX:MaxRAMPercentage=75.0   -> 75% of container memory to the JVM heap
          # -XX:+UseG1GC                -> G1 garbage collector
          # -XX:+UseStringDeduplication -> save memory on repeated strings
          # -XX:MaxMetaspaceSize=512m   -> control metaspace usage
          # ('#' inside the folded value below would be passed to the JVM, so the flags are described here)
          value: >-
            -XX:MaxRAMPercentage=75.0
            -XX:+UseG1GC
            -XX:+UseStringDeduplication
            -XX:MaxMetaspaceSize=512m
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health             # Endpoint checked for readiness
            port: 8080                # Port where app listens
          initialDelaySeconds: 30     # Delay before probe starts (JVM warm-up time)
          periodSeconds: 10           # Probe interval
          failureThreshold: 3         # Mark pod unready after 3 failed checks
```
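For particularly slow-starting JVMs, a startupProbe (Kubernetes 1.18+) is worth considering alongside the readiness probe: liveness and readiness checks are held off until the startup probe succeeds, so a pod that is still warming up isn’t restarted prematurely. A minimal sketch; the thresholds are illustrative:
```yaml
# fragment of the container spec above
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10         # probe every 10s while the app starts
  failureThreshold: 30      # allow up to ~300s of startup before failing the pod
```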
4. Use CPU-Based HPA with Stabilization Windows
Memory-based scaling is unreliable for Java because the JVM may hold memory even when idle. Prefer CPU or custom application-level metrics, and prevent frequent rescaling by applying stabilization:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: java-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-app                       # Must match the name of the Deployment above
  minReplicas: 2                         # Never scale below 2 replicas
  maxReplicas: 10                        # Scale up to 10 replicas if needed
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 40           # Target 40% average CPU usage
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60     # Wait at least 60s before scaling up
    scaleDown:
      stabilizationWindowSeconds: 300    # Wait at least 5 minutes before scaling down
```
5. Add Pod Disruption Budgets (PDBs) for Availability Guarantees
To ensure your service stays available during voluntary disruptions (e.g., node maintenance), use a PodDisruptionBudget. Stateless Java apps often benefit more from using maxUnavailable to allow faster updates while maintaining availability:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: java-app-pdb
spec:
  maxUnavailable: 1      # Allow at most 1 pod to be disrupted at a time
  selector:
    matchLabels:
      app: java-app
```
Guidance:
- maxUnavailable is preferred for stateless services needing faster rollouts.
- Use minAvailable if you need to guarantee a fixed number of pods are always running.
- Always match your replica count and availability goals.
đź“– Kubernetes PDB Docs
Common Pitfalls to Avoid
- 🚫 Scaling on memory: Java apps may hold onto heap even when idle, misleading memory-based autoscalers.
- 🚫 Ignoring warm-up time: JVMs take time to JIT-optimize and load classes, so cold pods may underperform.
- 🚫 Overreactive autoscaling: short scale-in windows cause flapping; use stabilizationWindowSeconds as in the HPA example above.
- 🚫 Not considering GC pauses: avoid aggressive autoscaling on spiky CPU loads, and analyze GC logs (see the sketch below) if autoscaling seems jittery.
Monitoring and Observability
| Metric | Tool | Why It Matters |
|---|---|---|
| CPU usage | Prometheus | Trigger for HPA |
| Heap/non-heap memory | JMX Exporter | JVM health |
| GC frequency/duration | Micrometer/JMX | Spot GC thrashing |
| Thread count | JMX Exporter | Detect thread leaks or spikes |
| Pod readiness/liveness | Kubernetes probes | Determines scaling & traffic routing |
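To get the JVM metrics above into Prometheus, a common setup is Micrometer’s Prometheus registry in the app plus a ServiceMonitor pointing at the metrics endpoint. A hedged sketch, assuming the Prometheus Operator is installed and the app’s Service exposes a port named http serving /actuator/prometheus (the label selector, port name, and interval are illustrative):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: java-app-metrics
spec:
  selector:
    matchLabels:
      app: java-app               # must match the labels on the app's Service
  endpoints:
  - port: http                    # named port on the Service
    path: /actuator/prometheus    # Spring Boot Actuator + Micrometer endpoint
    interval: 30s
```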
Summary
Scaling Java apps in Kubernetes requires an understanding of both Kubernetes primitives and JVM internals. By carefully choosing metrics, setting correct resource requests, and avoiding common pitfalls, you can ensure that your application scales reliably and cost-effectively.
🚀 With the right observability and autoscaling setup, your Java services will be well-prepared to handle real-world workloads in the cloud.