Data Corruption
etcd has built-in automated data corruption detection to prevent member state from diverging.
Enabling data corruption detection
Data corruption detection can be done using:
- Initial check, enabled with the `--experimental-initial-corrupt-check` flag.
- Periodic check of:
  - Compacted revision hash, enabled with the `--experimental-compact-hash-check-enabled` flag.
  - Latest revision hash, enabled with the `--experimental-corrupt-check-time` flag.
The initial check is executed during bootstrap of an etcd member. The member compares its persistent state with that of other members and exits if there is a mismatch.
Both periodic checks are executed by the cluster leader in a cluster that is already running. The leader compares its persistent state with that of other members and raises a CORRUPT alarm if there is a mismatch. Both checks serve the same purpose, but they are worth enabling together to balance performance and time to detection.
- Compacted revision hash check - requires regular compaction, minimal performance cost, handles slow followers.
- Latest revision hash check - high performance cost, doesn’t handle slow followers or frequent compactions.
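For reference, a minimal sketch of enabling all three checks on a member; the flag names come from this page, while the `1m` and `4h` periods are illustrative values rather than recommendations:

```bash
# Corruption-check flags, added on top of the member's usual startup flags
# (name, data dir, cluster configuration, and so on).
etcd \
  --experimental-initial-corrupt-check=true \
  --experimental-compact-hash-check-enabled=true \
  --experimental-compact-hash-check-time=1m \
  --experimental-corrupt-check-time=4h
```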
Compacted revision hash check
When enabled with the `--experimental-compact-hash-check-enabled` flag, the check is executed once every minute.
This can be adjusted with the `--experimental-compact-hash-check-time` flag, which takes a duration such as `1m` (every minute) or `1h` (every hour).
This check extends compaction to also calculate a checksum that can be compared between cluster members.
It does not require an additional database scan, which makes it very cheap, but it does require regular compaction in the cluster.
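Because the checksum is only refreshed when compaction runs, the cluster needs some form of regular compaction for this check to be useful. A minimal sketch, assuming periodic auto-compaction fits the workload (the `1h` retention is an example value):

```bash
# Periodic auto-compaction keeps the compacted-revision hash fresh;
# the retention window below is an example value.
etcd \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=1h \
  --experimental-compact-hash-check-enabled=true
```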
Latest revision hash check
This check is enabled with the `--experimental-corrupt-check-time` flag, which takes the execution period as a duration such as `1m` (every minute) or `1h` (every hour).
The recommended period is a couple of hours due to the high performance cost: running the check requires computing a checksum by scanning the entire etcd content at a given revision.
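A similar hash can be computed manually with `etcdctl endpoint hashkv`, which is one way to get a feel for how expensive this scan is on a given data set; the endpoint address below is an example:

```bash
# Compute the KV hash of an endpoint; comparing this value across members
# is roughly what the periodic check automates.
etcdctl --endpoints=127.0.0.1:2379 endpoint hashkv
```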
Restoring a corrupted member
There are three ways to restore a corrupted member:
- Purge member persistent state
- Replace member
- Restore whole cluster
After the corrupted member is restored, the CORRUPT alarm can be removed.
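A sketch of inspecting and clearing alarms with etcdctl; note that `etcdctl alarm disarm` clears all active alarms, and the endpoint shown is an example value:

```bash
# List active alarms; a detected mismatch shows up as a CORRUPT alarm.
etcdctl --endpoints=127.0.0.1:2379 alarm list

# Clear alarms once the corrupted member has been restored.
etcdctl --endpoints=127.0.0.1:2379 alarm disarm
```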
Purge member persistent state
A member's state can be purged by:
- Stopping the etcd instance.
- Backing up the etcd data directory.
- Moving the `snap` subdirectory out of the etcd data directory.
- Starting `etcd` with `--initial-cluster-state=existing` and the cluster members listed in `--initial-cluster`.
The etcd member is expected to download an up-to-date snapshot from the leader.
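A minimal sketch of the procedure, assuming the default data directory layout (where `snap` lives under `member/`) and example paths and peer URLs:

```bash
# Example data directory and cluster configuration; adjust for the deployment.
DATA_DIR=/var/lib/etcd

# 1. Stop the etcd process on the corrupted member.

# 2. Back up the data directory.
cp -a "${DATA_DIR}" "${DATA_DIR}.bak"

# 3. Move the snap subdirectory out of the data directory.
mv "${DATA_DIR}/member/snap" /tmp/etcd-snap-removed

# 4. Start etcd as part of the existing cluster; it should fetch an
#    up-to-date snapshot from the leader.
etcd --data-dir="${DATA_DIR}" \
  --initial-cluster-state=existing \
  --initial-cluster="infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380"
```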
Replace member
A member can be replaced by:
- Stopping the etcd instance.
- Backing up the etcd data directory.
- Removing the data directory.
- Removing the member from the cluster by running `etcdctl member remove`.
- Adding it back by running `etcdctl member add`.
- Starting `etcd` with `--initial-cluster-state=existing` and the cluster members listed in `--initial-cluster`.
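A sketch of the same steps with example member names, IDs, and URLs (look up the real member ID with `etcdctl member list`):

```bash
# Example values; adjust endpoints, member name, and peer URLs.
DATA_DIR=/var/lib/etcd
ENDPOINTS=10.0.0.1:2379

# 1. Stop the etcd process on the corrupted member.

# 2. Back up and then remove the data directory.
cp -a "${DATA_DIR}" "${DATA_DIR}.bak"
rm -rf "${DATA_DIR}"

# 3. Remove the member from the cluster (the ID below is an example).
etcdctl --endpoints="${ENDPOINTS}" member remove 8211f1d0f64f3269

# 4. Add it back as a new member.
etcdctl --endpoints="${ENDPOINTS}" member add infra1 --peer-urls=http://10.0.0.2:2380

# 5. Start etcd as part of the existing cluster.
etcd --name=infra1 --data-dir="${DATA_DIR}" \
  --initial-cluster-state=existing \
  --initial-cluster="infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380"
```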
Restore whole cluster
The cluster can be restored by saving a snapshot from the current leader and restoring it on all members.
Run `etcdctl snapshot save` against the leader and follow the restoring a cluster procedure.
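A sketch of that flow with example endpoints and paths; on recent releases the restore step is handled by `etcdutl`, while older releases use `etcdctl snapshot restore`:

```bash
# Save a snapshot from the current leader (example endpoint and path).
etcdctl --endpoints=10.0.0.1:2379 snapshot save /tmp/cluster.db

# On each member, rebuild a fresh data directory from the snapshot,
# then start etcd pointing at the restored directory.
etcdutl snapshot restore /tmp/cluster.db \
  --name=infra0 \
  --data-dir=/var/lib/etcd-restored \
  --initial-cluster="infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380" \
  --initial-advertise-peer-urls=http://10.0.0.1:2380
```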