Resolve etcd NOSPACE alarm in vCluster

The etcd NOSPACE alarm signals that the etcd database inside the vCluster has exhausted its available disk space. When this occurs, etcd fails its health checks, which causes the control plane to become unresponsive. As a result, all cluster operations—such as deploying workloads, updating resources, or managing cluster components—are blocked, and the vCluster is unusable until the issue is resolved.

Error message

You might find the following error in the logs of your etcd pods if etcd has run out of storage space:

etcdhttp/metrics.go:86 /health error ALARM NOSPACE status-code 503

Identifying an etcd NOSPACE alarm in vCluster

When interacting with the affected vCluster using kubectl, API requests fail with timeout errors:

Error from server: etcdserver: request timed out

Additionally, the etcd health metrics endpoint returns a 503 status code and the following error:

etcdhttp/metrics.go:86 /health error ALARM NOSPACE status-code 503

To verify the NOSPACE alarm, run the following command against the etcd instance:

etcdctl alarm list --endpoints=https://$ETCD_SRVNAME:2379 [...]

The output displays the triggered alarm:

memberID:XXXXX alarm:NOSPACE
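
For a complete invocation, include the same certificate flags used in the solution steps below. A minimal sketch for deployed etcd, run from inside the etcd pod (for embedded etcd, substitute the paths from the certificate and path reference at the end of this page):

etcdctl alarm list \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt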

Causes

The NOSPACE alarm occurs due to two common conditions:

  • Excessive etcd data growth: A large number of objects—such as Deployments, ConfigMaps, and Secrets—can fill etcd’s storage if regular compaction is not performed.

  • Synchronization conflicts: Conflicting objects between the vCluster and host cluster can trigger continuous sync loops. For example, a Custom Resource Definition (CRD) modified by the host cluster might sync back to the vCluster repeatedly. This behavior quickly fills etcd’s backend storage.
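
To spot such a loop, counting repeated syncer log lines per object can help. A rough heuristic, assuming the vCluster syncer logs the affected object name (replace the placeholders with your namespace and vCluster pod):

kubectl -n <namespace> logs <vcluster-pod> --tail=5000 | grep -i sync | sort | uniq -c | sort -rn | head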

Solution

To resolve the issue, compact and defragment the etcd database to free up space. Then, reconfigure etcd with automatic compaction and increase its storage quota to prevent recurrence.

Backing store differences

The compaction process differs between deployed and embedded etcd. The steps below use the deployed etcd endpoint and certificate paths; for embedded etcd, substitute the paths listed in the certificate and path reference at the end of this page.
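
If you are not sure which backing store your vCluster uses, one quick check is whether a dedicated etcd pod runs next to the vCluster pod. A sketch, with <namespace> standing in for your vCluster namespace:

# A dedicated etcd pod or StatefulSet indicates deployed etcd;
# no match suggests embedded etcd (or another backing store).
kubectl -n <namespace> get pods | grep -i etcd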

  1. Identify if there's a syncing conflict.

    Check for objects that might be caught in a sync loop:

    kubectl -n <namespace> logs <vcluster-pod> | grep -i "sync" | grep -i "error"

    If you find a problematic object, pause syncing for it in your vCluster config.

  2. Compact and defragment etcd.

    • Connect to each etcd pod. Open a shell in the etcd pod using the following command:

      kubectl -n <namespace> exec -it <etcd-pod-name> -- sh
    • Set environment variables. Export the etcd pod name as an environment variable; it is used as the endpoint hostname in the commands that follow:

      export ETCD_SRVNAME=<etcd-pod-name>
    • Get the current revision number. Retrieve it with the following command (see the sketch after this procedure for one way to extract the revision field from the JSON output):

      etcdctl endpoint status --write-out json \
      --endpoints=https://$ETCD_SRVNAME:2379 \
      --cacert=/run/config/pki/etcd-ca.crt \
      --key=/run/config/pki/etcd-peer.key \
      --cert=/run/config/pki/etcd-peer.crt
    • Compact the etcd database. Compact etcd to remove old data and free up disk space:

      etcdctl --command-timeout=600s compact <revision-number> \
      --endpoints=https://$ETCD_SRVNAME:2379 \
      --cacert=/run/config/pki/etcd-ca.crt \
      --key=/run/config/pki/etcd-peer.key \
      --cert=/run/config/pki/etcd-peer.crt

      Replace <revision-number> with the value retrieved from the previous command.

    • Defragment etcd. Defragment etcd to optimize disk usage and improve performance:

      etcdctl --command-timeout=600s defrag \
      --endpoints=https://$ETCD_SRVNAME:2379 \
      --cacert=/run/config/pki/etcd-ca.crt \
      --key=/run/config/pki/etcd-peer.key \
      --cert=/run/config/pki/etcd-peer.crt
    • Repeat for all etcd pods in your cluster.

  3. Verify disk usage reduction.

    Check that the operation freed up space:

    etcdctl endpoint status -w table \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
  4. Disarm the NOSPACE alarm.

    Remove the alarm to restore normal operation:

    etcdctl alarm disarm \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
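
The revision value used in the compaction step is nested in the JSON that etcdctl endpoint status prints. A minimal sketch for extracting it with grep, assuming the etcd image provides a POSIX shell and grep (as when exec'ing into the pod above):

# Pull the numeric "revision" field out of the JSON status output
REVISION=$(etcdctl endpoint status --write-out json \
--endpoints=https://$ETCD_SRVNAME:2379 \
--cacert=/run/config/pki/etcd-ca.crt \
--key=/run/config/pki/etcd-peer.key \
--cert=/run/config/pki/etcd-peer.crt \
| grep -o '"revision":[0-9]*' | head -1 | grep -o '[0-9]*$')
echo "Current revision: $REVISION"

You can then pass $REVISION to the compact command in place of <revision-number>.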

Prevention

Update your vCluster configuration to prevent future occurrences. Use the following recommended settings to enable automatic maintenance of your etcd database:

vcluster.yaml
controlPlane:
  backingStore:
    etcd:
      deploy:
        enabled: true
        statefulSet:
          enabled: true
          extraArgs:
            - '--auto-compaction-mode=periodic'
            - '--auto-compaction-retention=30m'
            - '--quota-backend-bytes=8589934592' # 8 GB

This configuration enables periodic compaction every 30 minutes and sets the etcd storage quota to 8 GB. You can adjust these values based on your needs.
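
How you apply the updated configuration depends on how the vCluster was deployed. A hedged example using the Helm chart, where the release name, namespace, and values file location are placeholders to adapt to your setup:

# Assumes the vCluster was installed from the loft-sh Helm chart
helm upgrade <vcluster-name> vcluster \
--repo https://charts.loft.sh \
--namespace <namespace> \
--values vcluster.yaml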

Verification

After completing the solution steps:

  1. Check that etcd pods are healthy:

    kubectl -n <namespace> get pods | grep etcd
  2. Verify that vCluster is functioning properly:

    kubectl -n <namespace> get pods
    kubectl -n <namespace> logs <vcluster-pod> | grep -i "alarm"
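
Optionally, a quick write through the vCluster API confirms that etcd accepts data again. A small smoke test, assuming your current kubectl context points at the virtual cluster (the ConfigMap name is arbitrary):

kubectl create configmap nospace-smoke-test --from-literal=status=ok
kubectl delete configmap nospace-smoke-test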

Best practices

To ensure optimal etcd performance in vCluster:

  • Monitor etcd disk usage: Use metrics tools to track disk usage and set up alerts for high usage levels (see the example after this list).
  • Enable automated compaction: Configure compaction with --auto-compaction-mode=periodic and --auto-compaction-retention=30m to clean up old data.
  • Size etcd storage appropriately: Set --quota-backend-bytes based on usage, with a buffer for growth.
  • Defragment etcd regularly: Optimize disk usage by defragmenting etcd periodically.
  • Resolve syncing conflicts: Identify and fix syncing issues to prevent unnecessary data growth.
  • Consider backing store choice: Embedded etcd simplifies management but deployed etcd provides more control for complex scenarios.
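
As a starting point for monitoring, etcd exposes its current database size and configured quota on its metrics endpoint. A sketch run from inside a deployed etcd pod, assuming curl is available in the image; the metric names are standard etcd metrics, and the certificate paths should be swapped for embedded etcd:

curl -s https://$ETCD_SRVNAME:2379/metrics \
--cacert /run/config/pki/etcd-ca.crt \
--key /run/config/pki/etcd-peer.key \
--cert /run/config/pki/etcd-peer.crt \
| grep -E 'etcd_mvcc_db_total_size_in_bytes|etcd_server_quota_backend_bytes'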

Certificate and path reference

Configuration        Deployed etcd                    Embedded etcd
Endpoint             https://<etcd-pod>:2379          https://127.0.0.1:2379
CA certificate       /run/config/pki/etcd-ca.crt      /data/pki/etcd/ca.crt
Client key           /run/config/pki/etcd-peer.key    /data/pki/etcd/tls.key
Client certificate   /run/config/pki/etcd-peer.crt    /data/pki/etcd/tls.crt
Data directory       Separate PVC                     /data in vCluster PVC
Access method        Via etcd pod                     Via vCluster pod