Elasticsearch Cluster Operations

This guide covers common Elasticsearch cluster maintenance tasks for Hyperion providers running multi-node ES clusters.

Cluster Health Check

Check the overall cluster health:

curl -sk "https://localhost:9200/_cluster/health?pretty" -u <user>:<password>

Healthy Cluster

A healthy cluster shows "status" : "green", a "number_of_nodes" value equal to the number of nodes you expect, and "unassigned_shards" : 0.
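For scripted checks, the two conditions above can be tested directly on the health response. The following is a minimal sketch, not part of the original procedure: the function name `is_cluster_healthy` and the `ES_AUTH` variable are hypothetical, and it uses plain `grep` rather than a JSON parser so it has no extra dependencies.

```shell
# Reads a _cluster/health JSON body on stdin.
# Exit status is 0 only when status is green and unassigned_shards is 0.
is_cluster_healthy() {
  body=$(cat)
  printf '%s\n' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"green"' &&
    printf '%s\n' "$body" | grep -Eq '"unassigned_shards"[[:space:]]*:[[:space:]]*0([^0-9]|$)'
}

# Example wiring (ES_AUTH is a placeholder for <user>:<password>):
# curl -sk "https://localhost:9200/_cluster/health" -u "$ES_AUTH" | is_cluster_healthy \
#   && echo "healthy" || echo "needs attention"
```

It does not check the node count, since the expected number of nodes differs per deployment.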

Check individual node status:

curl -sk "https://localhost:9200/_cat/nodes?v&h=name,ip,version,heap.percent,ram.percent,cpu,load_1m,master" \
  -u <user>:<password>

Check for unassigned shards:

curl -sk "https://localhost:9200/_cat/shards?v" -u <user>:<password> | grep UNASSIGNED
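When many shards are unassigned, a per-index summary with the reason is easier to read than raw `grep` output. A possible sketch, assuming the `_cat/shards` columns are requested explicitly with `h=index,shard,prirep,state,unassigned.reason`; the function name `summarize_unassigned` is hypothetical.

```shell
# Reads `_cat/shards?h=index,shard,prirep,state,unassigned.reason` on stdin
# and prints "<count> <index> <reason>" for every index with unassigned shards.
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" { n[$1 " " $5]++ }
       END { for (k in n) print n[k], k }' | sort -k2
}

# Example wiring (ES_AUTH is a placeholder for <user>:<password>):
# curl -sk "https://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
#   -u "$ES_AUTH" | summarize_unassigned
```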

Rolling Upgrade (Zero Downtime)

To upgrade Elasticsearch across a multi-node cluster without downtime, upgrade one node at a time.

Prerequisites

Cluster must be green before starting
All nodes must have the target version available via apt or as a .deb package
Upgrade non-master nodes first, master node last
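The upgrade order can be derived from the `_cat/nodes` output, where the `master` column shows `*` for the currently elected master. This is a small sketch under that assumption; the function name `upgrade_order` is hypothetical.

```shell
# Reads `_cat/nodes?h=name,master` on stdin ("*" marks the elected master)
# and prints node names in upgrade order: non-master nodes first, master last.
upgrade_order() {
  awk '$2 == "*" { master = $1; next }
       NF        { print $1 }
       END       { if (master != "") print master }'
}

# Example wiring (ES_AUTH is a placeholder for <user>:<password>):
# curl -sk "https://localhost:9200/_cat/nodes?h=name,master" -u "$ES_AUTH" | upgrade_order
```

The elected master can change while nodes restart, so it is worth re-running this before each node rather than once at the start.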

Step 1: Add the Elastic APT Repository (if not present)

# Writing under /usr/share/keyrings and /etc/apt requires root
curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
  sudo gpg --batch --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] \
  https://artifacts.elastic.co/packages/9.x/apt stable main" | \
  sudo tee /etc/apt/sources.list.d/elastic-9.x.list

sudo apt-get update

Verify the target version is available:

apt-cache policy elasticsearch
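To check for one specific version in a script, `apt-cache madison elasticsearch` prints one `package | version | source` line per available version, which is easy to filter. A minimal sketch; the function name `version_available` and the `TARGET_VERSION` variable are hypothetical.

```shell
# Reads `apt-cache madison elasticsearch` output on stdin
# ("package | version | source") and succeeds if version $1 is listed.
version_available() {
  awk -F'|' -v want="$1" '
    { v = $2; gsub(/[[:space:]]/, "", v); if (v == want) found = 1 }
    END { exit !found }'
}

# Example wiring (TARGET_VERSION is a placeholder for <VERSION>):
# apt-cache madison elasticsearch | version_available "$TARGET_VERSION" \
#   || echo "target version not found in the configured repositories"
```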

Step 2: Disable Shard Allocation

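Restricting allocation to primaries stops the cluster from rebuilding replica shards on the remaining nodes while a node is down, which avoids unnecessary shard copying during each restart: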

curl -sk -X PUT "https://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -u <user>:<password> \
  -d '{ "transient": { "cluster.routing.allocation.enable": "primaries" } }'
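Before stopping any node, it can be worth confirming that the setting was accepted. The sketch below extracts the value from a `GET _cluster/settings?flat_settings=true` response; the function name `allocation_mode` is hypothetical. If the key appears under both `persistent` and `transient`, it prints the last occurrence, which in the usual response layout is the transient one.

```shell
# Reads a `_cluster/settings?flat_settings=true` JSON body on stdin and prints
# the value of cluster.routing.allocation.enable (empty output if it is unset).
allocation_mode() {
  sed -n 's/.*"cluster\.routing\.allocation\.enable"[[:space:]]*:[[:space:]]*"\([a-z_]*\)".*/\1/p' |
    tail -n 1
}

# Example wiring (ES_AUTH is a placeholder for <user>:<password>):
# curl -sk "https://localhost:9200/_cluster/settings?flat_settings=true" -u "$ES_AUTH" \
#   | allocation_mode
```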

Step 3: Upgrade Each Node

For each node (non-master first, master last):

# Stop ES
sudo systemctl stop elasticsearch

# Upgrade the package
sudo apt-get install -y elasticsearch=<VERSION>

# Reload systemd so it picks up any changes to the unit file
sudo systemctl daemon-reload

# Start ES
sudo systemctl start elasticsearch

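Wait for the node to rejoin the cluster before proceeding to the next node. Because allocation is limited to primaries, any replica copies that lived on the restarted node stay unassigned until Step 4, so the cluster is expected to report yellow rather than green at this point: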

curl -sk "https://localhost:9200/_cat/nodes?v&h=name,version,master" -u <user>:<password>
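The wait can be automated with a small polling loop. This is a sketch under stated assumptions, not the guide's own tooling: the helper names `node_at_version` and `retry`, the node name `es-2`, and the `ES_AUTH` and `TARGET_VERSION` variables are all hypothetical.

```shell
# Succeeds if stdin (`_cat/nodes?h=name,version`) lists node $1 at version $2.
node_at_version() {
  awk -v name="$1" -v ver="$2" \
    '($1 "") == (name "") && ($2 "") == (ver "") { ok = 1 } END { exit !ok }'
}

# retry <attempts> <delay_seconds> <command...>: re-run a command until it succeeds.
retry() {
  retry_left=$1; retry_delay=$2; shift 2
  while [ "$retry_left" -gt 0 ]; do
    "$@" && return 0
    retry_left=$((retry_left - 1))
    [ "$retry_left" -gt 0 ] && sleep "$retry_delay"
  done
  return 1
}

# Example wiring: poll every 5 seconds, up to 60 times.
# node_is_back() {
#   curl -sk "https://localhost:9200/_cat/nodes?h=name,version" -u "$ES_AUTH" \
#     | node_at_version es-2 "$TARGET_VERSION"
# }
# retry 60 5 node_is_back || echo "node did not rejoin in time"
```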

Step 4: Re-enable Shard Allocation

After all nodes are upgraded:

curl -sk -X PUT "https://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -u <user>:<password> \
  -d '{ "transient": { "cluster.routing.allocation.enable": "all" } }'

Step 5: Verify

curl -sk "https://localhost:9200/_cluster/health?pretty" -u <user>:<password>

The cluster should return to green once all shards are reallocated.
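Beyond cluster colour, it is worth confirming that no node was skipped. A minimal sketch that lists the distinct versions reported by `_cat/nodes`; the function name `distinct_versions` is hypothetical.

```shell
# Reads `_cat/nodes?h=name,version` on stdin and prints each distinct version once.
# More than one output line means the cluster is still running mixed versions.
distinct_versions() {
  awk 'NF >= 2 { print $2 }' | sort -u
}

# Example wiring (ES_AUTH is a placeholder for <user>:<password>):
# curl -sk "https://localhost:9200/_cat/nodes?h=name,version" -u "$ES_AUTH" | distinct_versions
```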

Kibana Upgrade

Kibana should be upgraded to match the ES version:

sudo apt-get install -y kibana=<VERSION>

Troubleshooting

HTTP Port 9200 Not Listening

If a node's transport port (9300) is active but HTTP (9200) is not responding:

# Check if the port is listening
ss -tlnp | grep -E "9200|9300"

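# Check ES thread pools for saturation. If localhost:9200 does not answer,
# send this request to another node in the cluster; the output covers all nodes.
curl -sk "https://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc" \
  -u <user>:<password>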

Thread Pool Saturation

If the system_read thread pool shows maximum active threads and a full queue, the HTTP layer may be unresponsive. Restart Elasticsearch to recover.
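On a cluster with many nodes and pools, filtering the output down to pools with a backlog makes saturation easier to spot. A possible sketch, assuming the columns are requested as `h=node_name,name,active,queue,rejected`; the function name `busy_pools` is hypothetical, and a non-zero queue alone only marks a pool as worth a closer look, not as saturated.

```shell
# Reads `_cat/thread_pool?h=node_name,name,active,queue,rejected` on stdin
# and prints the pools that currently have queued or rejected tasks.
busy_pools() {
  awk '($4 + 0) > 0 || ($5 + 0) > 0 { print $1, $2, "queue=" $4, "rejected=" $5 }'
}

# Example wiring (ES_AUTH is a placeholder for <user>:<password>):
# curl -sk "https://localhost:9200/_cat/thread_pool?h=node_name,name,active,queue,rejected" \
#   -u "$ES_AUTH" | busy_pools
```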

Node Won't Restart (node.lock Error)

If you see LockObtainFailedException: Lock held by another program: /data/ES/data/node.lock:

# Check what process holds the lock
fuser /path/to/es/data/node.lock

# Check if the process is a zombie or stuck in D-state
grep State /proc/<PID>/status

Zombie (Z): The process has exited but has not been reaped by its parent. Try systemctl reset-failed elasticsearch.
D-state (uninterruptible sleep): Threads are stuck in kernel I/O. A reboot is required, because no signal can kill a thread in D-state.
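The two cases can be told apart in a script by reading the `State:` line. This is a sketch only; the function name `classify_proc_state` is hypothetical. It reads status text on stdin so it works equally on `/proc/<PID>/status` and on the per-thread files under `/proc/<PID>/task/`.

```shell
# Reads the content of a /proc/<PID>/status file on stdin and prints a
# one-word classification based on the State: line.
classify_proc_state() {
  awk '/^State:/ { s = $2 }
       END {
         if (s == "Z")      print "zombie"
         else if (s == "D") print "uninterruptible"
         else if (s == "")  print "unknown"
         else               print "other"
       }'
}

# Example wiring: count thread states for one process (<PID> is a placeholder).
# for f in /proc/<PID>/task/*/status; do classify_proc_state < "$f"; done | sort | uniq -c
```

Checking the per-thread files matters for a JVM, since a single stuck thread is enough to keep the lock held while the main thread looks idle.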

Preventing Automatic ES Restarts

On Ubuntu/Debian systems, unattended-upgrades with needrestart can automatically restart Elasticsearch after security patches. This is dangerous for production clusters because:

Mass service restarts can cause ES shutdown to race with disk I/O
Stuck shutdowns can leave threads in D-state, holding the node.lock
The node becomes unrecoverable without a reboot

To prevent this, create a needrestart exclusion:

sudo mkdir -p /etc/needrestart/conf.d
echo '$nrconf{override_rc}{qr(^elasticsearch)} = 0;' | \
  sudo tee /etc/needrestart/conf.d/no-restart-elasticsearch.conf

Apply to All Nodes

This exclusion should be applied to every ES node in your cluster. Elasticsearch restarts should only be performed manually using the rolling upgrade procedure above.
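A quick way to audit nodes is to check each one for the exclusion file. The sketch below only confirms that an `override_rc` rule for elasticsearch is present in the directory; it does not validate the rule's value. The function name `has_es_exclusion` is hypothetical.

```shell
# Succeeds if any *.conf file in the given directory (default:
# /etc/needrestart/conf.d) contains an override_rc rule for elasticsearch.
has_es_exclusion() {
  conf_dir=${1:-/etc/needrestart/conf.d}
  grep -qsF 'override_rc}{qr(^elasticsearch)' "$conf_dir"/*.conf
}

# Example wiring, run on each ES node:
# has_es_exclusion || echo "needrestart exclusion missing on $(hostname)"
```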

Cluster RED After Node Restart

This is expected and temporary. When a node leaves the cluster, its shards become unassigned. The cluster will:

Immediately go red (if primaries are missing) or yellow (if only replicas are missing)
Wait for the node to rejoin
Reallocate shards automatically
Return to green

The recovery time depends on the number of shards and data size. Monitor progress with:

curl -sk "https://localhost:9200/_cluster/health?pretty" -u <user>:<password>
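The health response includes an `active_shards_percent_as_number` field, which gives a single number to watch during recovery. A minimal sketch for extracting it; the function name `active_shards_percent` is hypothetical.

```shell
# Reads a _cluster/health JSON body on stdin and prints the
# active_shards_percent_as_number value.
active_shards_percent() {
  sed -n 's/.*"active_shards_percent_as_number"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p' |
    head -n 1
}

# Example wiring: print progress every 10 seconds (ES_AUTH is a placeholder).
# while sleep 10; do
#   curl -sk "https://localhost:9200/_cluster/health" -u "$ES_AUTH" | active_shards_percent
# done
```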