Understanding etcd Storage Performance Requirements

Why p99 Latency Matters

etcd is highly sensitive to storage latency spikes, which can degrade the performance of the entire cluster. We follow the upstream recommendation of keeping p99 storage latency at or below 10ms: 99% of all storage operations must complete within 10ms. The target is set on the tail rather than the average because a small number of slow operations is enough to cause trouble, and an average hides them. Meeting it helps ensure reliable cluster operation and prevents cascading issues that could affect your application’s availability.
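
To make the p99 target concrete, here is a minimal sketch in Python that checks a set of latency samples against the 10ms limit. The function names and the sample data are hypothetical and only illustrate the arithmetic; the percentile uses the simple nearest-rank method, which is one of several common definitions.

```python
import math

P99_LIMIT_MS = 10.0  # the 10ms p99 target described above


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_p99_target(latencies_ms, limit_ms=P99_LIMIT_MS):
    """True if the 99th percentile latency is within the limit."""
    return percentile(latencies_ms, 99) <= limit_ms


# Hypothetical samples: 98 fast operations and 2 slow ones.
samples = [2.0] * 98 + [25.0] * 2

mean_ms = sum(samples) / len(samples)
print(f"mean = {mean_ms:.2f} ms")
print(f"p99  = {percentile(samples, 99):.2f} ms")
print("meets target:", meets_p99_target(samples))
```

With this data the mean stays well under 10ms, yet the check fails because more than 1% of the operations are slow. That is the situation the p99 target is meant to catch.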

In practice, these issues generally arise during install, upgrade, backup, or recovery. Backup and recovery in particular can be I/O-intensive for etcd.

Storage Configuration Best Practices

Local vs. Replicated Storage

etcd replicates its data across cluster members on its own, so replication at the storage layer is unnecessary. In fact, replicated storage can introduce additional latency that hurts performance. Local storage, particularly local NVMe drives, typically provides better performance characteristics for etcd:

  • Local Storage Benefits:

    • Lower latency
    • More consistent performance
    • Better suited for etcd’s workload pattern
  • Replicated Storage Drawbacks:

    • Additional network overhead
    • Increased latency variability
    • Potential for compounded issues during network events

Cloud Provider Specific Recommendations

Azure

  • L-series VMs: These instances come with local NVMe storage and are ideal for etcd workloads.
  • Ultra SSD: If local storage isn’t an option, Ultra SSD is the only Azure managed disk type specifically optimized for consistent low tail latency.
  • Premium SSD: Although commonly used, it may not consistently meet the 10ms p99 requirement.

AWS

GCP

  • Local SSD: Available as an additional disk on many instance types
  • Persistent SSD: Can be used, but should be thoroughly tested to ensure consistent performance. Performance scales with disk size; a pd-ssd of 200GB or larger has demonstrated sufficient performance in the past.
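
Whichever provider and disk type you choose, it helps to test a candidate disk before installation. The following Python sketch is a rough illustration of such a test, not a replacement for a dedicated benchmarking tool: it writes small blocks to a scratch file, forces each one to disk, and reports the p99 of the sync times. The function names, the write count, and the block size are illustrative assumptions and do not come from etcd itself.

```python
import math
import os
import tempfile
import time


def measure_sync_latency_ms(directory, writes=200, block_size=2300):
    """Append small blocks to a scratch file in `directory`, syncing
    after each write, and return the per-sync durations in ms."""
    # fdatasync is not available on every platform; fall back to fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    payload = b"\0" * block_size
    durations = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)  # block until the data reaches stable storage
            durations.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)  # always clean up the scratch file
    return durations


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(samples)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]
```

To use it, call `measure_sync_latency_ms` with a directory on the disk you intend to use for etcd data, then compare `p99(...)` of the result with the 10ms target. A short run like this only gives a first impression; a longer test under realistic load is needed before relying on the result.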

Planning for Success

When planning a new installation, it’s crucial to communicate these storage requirements early in the process. Many organizations have standard VM templates or storage policies that may not meet etcd’s performance needs. By documenting and discussing these requirements during the initial planning phase, you can avoid deployment delays and ensure optimal cluster performance from day one.