Monitoring Elasticsearch: An SRE Perspective

Introduction

Elasticsearch is a powerful distributed search and analytics engine. But with great power comes great operational complexity. For Site Reliability Engineers (SREs), observability into these systems is non-negotiable. Poor visibility can lead to cascading failures, lost data, and frustrated users.

Note: OpenSearch is a community-driven fork of Elasticsearch 7.10 maintained by AWS. All monitoring concepts discussed here apply equally to OpenSearch unless explicitly stated.

This post explores how to monitor Elasticsearch from an SRE standpoint: focusing on performance, availability, and alerting while reducing toil and improving reliability.

Monitoring Priorities & Actions Cheat Sheet

| Priority | Area | What to Monitor | Why It Matters | If Breached, Do This |
|---|---|---|---|---|
| 🟥 Critical | Cluster Health | status != green, unassigned_shards, delayed_shards | Red/yellow clusters = major service risk | Investigate shard allocation, check logs, rebalance manually if needed |
| 🟥 Critical | Node Resources | JVM heap > 75%, disk usage > 85%, CPU > 80% | Prevent OOM, GC issues, disk-full crashes | Tune heap size, clean disk, scale nodes, check indexing pressure |
| 🟧 High | Thread Pools | Rejected tasks in write, bulk, search pools | Signals overload or bottlenecks | Slow down ingestion, increase queue size, add nodes |
| 🟧 High | Indexing & Search | High query latency, throttle_time_in_millis | Impacts end-user experience | Identify hot shards, enable slow logs, optimize queries and mappings |
| 🟨 Medium | Logs | GC logs, slow logs, audit logs | Helps troubleshoot performance/security | Correlate with incidents, enable long-term log retention, adjust GC tuning |
| 🟩 Low | Node Count | Unexpected node up/down | May indicate cluster instability or scale issues | Check autoscaling events or node health, audit for configuration drift |

✅ Tip: Integrate these actions into your incident response playbooks or runbooks.
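
As a concrete example of the "investigate shard allocation" action in the critical row above, here is a minimal diagnostic sketch (assuming the cluster is reachable on localhost:9200):

```bash
# List shards with their state and, for unassigned ones, the reason
# Elasticsearch recorded (e.g. INDEX_CREATED, NODE_LEFT, ALLOCATION_FAILED).
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' \
  | grep -E 'state|UNASSIGNED'

# Ask the cluster why the first unassigned shard it finds cannot be allocated.
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
```

The allocation explanation usually tells you whether the fix is freeing disk space, bringing a node back, or adjusting allocation settings.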

Why Monitoring Matters for Search Systems

Search engines are often mission-critical: they power internal dashboards, customer-facing search experiences, log aggregation, and more. SREs need robust observability to:

  • Detect node or cluster failures early
  • Prevent performance degradation under load
  • Optimize query latency and throughput
  • Ensure shard distribution and replica health
  • Manage storage and JVM memory usage

Key Monitoring Dimensions

1. Cluster Health

| Metric | Description | Endpoint |
|---|---|---|
| status | Cluster health status: green, yellow, red | _cluster/health |
| number_of_nodes | Total nodes in the cluster | _cluster/health |
| active_shards | Number of active shards | _cluster/health |
| unassigned_shards | Number of shards not allocated | _cluster/health |
| delayed_unassigned_shards | Delayed shard assignments (could indicate slow recovery) | _cluster/health |

```bash
curl -X GET 'localhost:9200/_cluster/health?pretty'
```
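
The same endpoint is easy to turn into an automated check. A minimal sketch, assuming jq is installed and the cluster listens on localhost:9200:

```bash
#!/usr/bin/env bash
# Exit non-zero whenever the cluster is not green, so a cron job or
# monitoring wrapper can alert on the exit code.
status=$(curl -s 'localhost:9200/_cluster/health' | jq -r '.status')
if [ "$status" != "green" ]; then
  echo "cluster status is ${status}" >&2
  exit 1
fi
echo "cluster status is green"
```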

2. Node Metrics

| Metric | Description | Endpoint |
|---|---|---|
| jvm.mem.heap_used_percent | JVM heap memory usage percentage | _nodes/stats/jvm |
| fs.total.available_in_bytes | Available disk space | _nodes/stats/fs |
| process.cpu.percent | Node-level CPU utilization | _nodes/stats/process |
| thread_pool.*.rejected | Rejected tasks per thread pool | _nodes/stats/thread_pool |

```bash
curl -X GET 'localhost:9200/_nodes/stats?pretty'
```
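
The _cat APIs expose the same data in a human-friendly form, which is handy during an incident. A quick sketch (column names belong to the _cat API and can vary slightly between versions):

```bash
# Per-node heap, CPU, and load at a glance.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'

# Per-node disk usage, which is what the allocation watermarks act on.
curl -s 'localhost:9200/_cat/allocation?v'
```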

3. Indexing & Query Performance

| Metric | Description | Endpoint |
|---|---|---|
| indexing.index_total | Number of indexing operations | _stats/indexing |
| search.query_total | Total number of queries | _stats/search |
| search.query_time_in_millis | Total time spent on queries | _stats/search |
| indexing.throttle_time_in_millis | Time spent throttling indexing | _stats/indexing |

```bash
curl -X GET 'localhost:9200/_stats/indexing,search'
```
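
These counters are cumulative since node start, so derive rates or averages from them rather than alerting on raw values. A rough sketch, assuming jq and that at least one query has run (a real pipeline would compute a rate between two scrapes):

```bash
# Average query latency in ms = total query time / total query count.
curl -s 'localhost:9200/_stats/search' \
  | jq '._all.total.search | .query_time_in_millis / .query_total'
```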

4. Alerts and Thresholds

Set alerts on:

| Condition | Threshold |
|---|---|
| Cluster status | Not equal to green |
| JVM heap usage | > 75% |
| Disk usage | > 85% |
| Thread pool rejections | Sustained increase over baseline |

Use alert grouping to avoid alert fatigue. Correlate with deployments or ingestion surges.
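
For the thread pool condition, the _cat/thread_pool API exposes the rejection counters directly; a sketch of what an alerting check might poll:

```bash
# Rejected task counts per node for the write and search pools. Rejections are
# cumulative, so alert on growth over time, not on the absolute number.
curl -s 'localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
```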

5. Logs and Audit Trails

Collect logs via Filebeat, Fluentd, or Data Prepper:

| Log Type | Contents |
|---|---|
| Application Logs | Indexing/search slow logs, deprecation warnings |
| GC Logs | JVM garbage collection activity |
| Audit Logs | Access control and security events |

Ship them to a centralized location for querying and retention.
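
Slow logs are disabled by default and configured per index. A minimal sketch of enabling them (my-index is a placeholder, and the thresholds should track your own latency targets):

```bash
# Dynamically enable search slow logs on an example index.
curl -X PUT 'localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{
        "index.search.slowlog.threshold.query.warn": "1s",
        "index.search.slowlog.threshold.query.info": "500ms",
        "index.search.slowlog.threshold.fetch.warn": "500ms"
      }'
```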

SRE Best Practices

Use Service-Level Objectives (SLOs)

Define SLOs for:

  • Query latency (e.g. 95th percentile < 300ms)
  • Error rate thresholds
  • Uptime and data freshness

Use SLIs (Service-Level Indicators) to track and trigger automated responses.
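
Server-side counters only give averages, so percentile SLIs such as p95 latency are usually measured from the client side. One lightweight option is a periodic canary query timed with curl (my-index and the query are placeholders); the same probe doubles as the canary query mentioned at the end of this post:

```bash
# Time a representative query end to end, in seconds, and feed the value
# into whatever system records your SLIs.
latency=$(curl -o /dev/null -s -w '%{time_total}' \
  'localhost:9200/my-index/_search?q=*&size=1')
echo "canary query latency: ${latency}s"
```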

Reduce Toil with Dashboards and Automation

Set up Grafana dashboards via Prometheus exporters (e.g., elasticsearch_exporter). Automate recurring tasks such as:

  • Disk space checks
  • Snapshot verification (see the sketch after this list)
  • Shard rebalancing (via curator or ILM policies)
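
For snapshot verification, a minimal sketch, assuming a snapshot repository is already registered under the placeholder name my_backup_repo:

```bash
# List snapshots in the repository with their status (SUCCESS, PARTIAL, FAILED).
curl -s 'localhost:9200/_cat/snapshots/my_backup_repo?v&h=id,status,end_epoch,duration'

# Drill into one snapshot in detail (the snapshot name is a placeholder).
curl -s 'localhost:9200/_snapshot/my_backup_repo/nightly-snapshot-1?pretty'
```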

Integrate with Incident Management

Tie alerts to your incident platform (PagerDuty, Opsgenie, etc.). Include runbooks for resolving common issues like:

  • Red clusters
  • Shard allocation failures
  • Node restarts due to OOM

Common Gotchas

  • Missing metrics: Ensure all nodes are exporting metrics and logging is consistent.
  • Over-alerting: Calibrate thresholds to avoid noise.
  • Cluster sprawl: Too many small indices degrade performance; use rollover and ILM policies to keep index counts in check.
  • Heap sizing: Don't exceed ~31 GB of JVM heap, or you lose compressed object pointers (compressed oops) and waste memory.

Tools to Know

  • Prometheus elasticsearch_exporter and Grafana for metrics and dashboards
  • Filebeat, Fluentd, or Data Prepper for log shipping
  • Curator and ILM policies for index lifecycle management
  • PagerDuty or Opsgenie for alert routing and incident response

Final Thoughts

Effective Elasticsearch monitoring is essential to delivering reliable services. SREs should aim for proactive visibility, actionable alerts, and minimal manual intervention. Monitoring should not be a reactive afterthought but a design goal.

By investing in strong observability foundations, you empower your team to move faster without sacrificing reliability.

Want to go further? Consider implementing canary queries and chaos testing for resilience under pressure.

