etcd persistent storage files

Reference of the persistent storage format and files

This document explains the etcd persistent storage format: naming, content and tools that allow developers to inspect them. Going forward the document should be extended with changes to the storage model. This document is targeted at etcd developers to help with their data recovery needs.

Prerequisites

The following articles provide helpful background information for this document:

Overview

Long leaving files

File nameHigh level purpose
./member/snap/db
bbolt b+tree that stores all the applied data, membership authorization information & metadata. It’s aware of what's the last applied WAL log index ("consistent_index").
./member/snap/0000000000000002-0000000000049425.snap
./member/snap/0000000000000002-0000000000061ace.snap

Periodic snapshots of legacy v2 store, containing:

  • basic membership information
  • etcd-version

As of etcd v3, the content is redundant to the content of /snap/db files.

Periodically (30s) these files are purged, and the last --max-snapshots=5 are preserved.

/member/snap/000000000007a178.snap.db

A complete bbolt snapshot downloaded from the etcd leader if the replica was lagging too much.

Has the same type of content as (./member/snap/db) file.

The file is used in 2 scenarios:

  • In response to the leader's request to recover from the snapshot.
  • During the server startup, when the last snapshot (.snap.db file) is found and detected to be having a newer index than the consistent_index in the current snap.db file.
Note: Periodic snapshots generated on each replica are only emitted in the form of *.snap file (not snap.db file). So there is no guarantee the most recent snapshot (in WAL log) has the *.snap.db file. But in such a case the backend (snap/db) is expected to be newer than the snapshot.

The file is not being deleted when the recovery is over (so whole content is populated to ./member/snap/db file). Periodically (30s) the files are purged. Here also --max-snapshots=5 are preserved. As these files can be O(GBs) this might create a risk of disk space exhaustion.

./member/wal/000000000000000f-00000000000b38c7.wal
./member/wal/000000000000000e-00000000000a7fe3.wal
./member/wal/000000000000000d-000000000009c70c.wal

Raft’s Write Ahead Logs, containing recent transactions accepted by Raft, periodic snapshots or CRC records.

Recent --max-wals=5 files are being preserved. Each of these files is ~64*10^6 bytes. The file is cut when it exceeds this hardcoded size, so the files might slightly exceed that size (so the preallocated 0.tmp does not offer full disk-exceeded protection).

If the snapshots are too infrequent, there can be more than --max-wals=5, as file-system level locks are protecting the files preventing them from being deleted too early.

./member/wal/0.tmp (or .../1.tmp)
Preallocated space for the next write ahead log file. Used to avoid Raft being stuck by a lack of WAL logs capacity without the possibility to raise an alarm.

Temporary files

During etcd internal processing, it is possible that several short living files might be encountered:

FileHigh level purpose
./member/snap/0000000000000002-000000000007a178.snap.broken

Snapshot files are renamed as ‘broken’ when they cannot be loaded.

The attempt to load the newest file happens when etcd is being started.

Or during backup/migrate commands of etcdctl.

./member/snap/tmp071677638 (random suffix)

Temporary (bbolt) file created on replicas in response to the msgSnap leaders request, so to the demand from the leader to recover storage from the given snapshot.

After successful (complete) retrieval of content the file is renamed to: /member/snap/[SNAPSHOT-INDEX].snap.db. In case of a server dying / being killed in the middle of the files download, the files remain on disk and are never automatically cleaned.They can be substantial in size (GBs).

See etcd/issues/12837. Fixed in etcd 3.5.

/member/snap/db.tmp.071677638 (random suffix)

A temporary file that contains a copy of the backend content (/member/snap/db), during the process of defragmentation. After the successful process the file is renamed to /member/snap/db, replacing the original backend.

On etcd server startup these files get pruned.

bbolt b+tree: member/snap/db

This file contains the main etcd content, applied to a specific point of the Raft log (see consistent_index).

Physical organization

The better bolt storage is physically organized as a b+tree. The physical pages of b-tree are never modified in-place1. Instead, the content is copied to a new page (reclaimed from the freepages list) and the old page is added to the free-pages list as soon as there is no open transaction that might access it. Thanks to this process, an open RO transaction sees a consistent historical state of the storage. The RW transaction is exclusive and blocking all other RW transactions.
Big values are stored on multiple continuous pages. The process of page reclamation combined with a need to allocate contiguous areas of pages of different sizes might lead to growing fragmentation of the bbolt storage.

The bbolt file never shrinks on its own. Only in the defragmentation process, the file can be rewritten to a new one that has some buffer of free pages on its end and has truncated size.

Logical organization

The bbolt storage is divided into buckets. In each bucket there are stored keys (byte[]->value byte[] pairs), in lexicographical order. The list below represents buckets used by etcd (as of version 3.5) and the keys in use.

BucketKeyExemplar valueDescription
alarmrpcpb.Alarm: {MemberID, Alarm: NONE|NOSPACE|CORRUPT}nilIndicates problems have been diagnosed in one of the members.
auth"authRevision""" (empty) or BigEndian.PutUint64

Any change of Roles or Users increments this field on transaction commit.

The value is used only for optimistic locking during the authorization process.

authRoles[roleName] as stringauthpb.Role marshalled
authUsers[userName] as stringauthpb.User marshalled
cluster"clusterVersion""3.5.0" (string)minor version of consensus-agreed common storage version.
"downgrade"JSON:
{
  "target-version": "3.4.0"
  "enabled": true/false
}

Persists intent configured by the most recent: Downgrade RPC request.

Since v3.5

key

[revisionId] encoded using bytesToRev{main,sub}

The key-value deletes are marshalled with 't' at the end (as a "Thumbstone")

mvccpb.KeyValue marshalled proto (key, create_rev, mod_rev, version, value, lease id)
leaseleasepb.Lease marshalled proto (ID, TTL, RemainingTTL)

Note: LeaseCheckpoint is extending only RemainingTTL. Just TTL is from the original Grant.

Note2: We persist TTLs in seconds (from the undefined 'now'). Crash-looping server does not release leases!!!

members[memberId] in hex as string: "8e9e05c52164694d"JSON as string serialized Member structure:
{
  "id":10276657743932975437,
  "peerURLs":[
  "http://localhost:2380"],
  "name":"default",
  "clientURLs": ["http://localhost:2379"]
}
Agreed cluster membership information.
members_removed[memberId] in hex as string: "8e9e05c52164694d"[]byte("removed")

Ids of all removed members. Used to validate that a removed member is never added again under the same id.

The field is currently (3.4) read from store V2 and never from V3. See https://github.com/etcd-io/etcd/pull/12820

meta"consistent_index"uint64 bytes (BigEndian)Represents the offset of the last applied WAL entry to the bolt DB storage.
"scheduledCompactRev"bytesToRev{main,sub} encoded. (16 bytes)Used to reinitialize compaction if a crash happened after a compaction request.
"finishedCompactRev"bytesToRev{main,sub} encoded. (16 bytes)Revision at which store was recently successfully compacted (https://github.com/etcd-io/etcd/blob/ae7862e8bc8007eb396099db4e0e04ac026c8df5/server/mvcc/kvstore_compaction.go#L54)
"confState"Since etcd 3.5
"term"Since etcd 3.5
"storage-version"

Tools

bbolt

bbolt has a command line tool that enables inspecting the file content.

Examples of use:

List all buckets in given bbolt file:
% go run go.etcd.io/bbolt/cmd/bbolt buckets ./default.etcd/member/snap/db
Read a particular key/value pair:
% go run go.etcd.io/bbolt/cmd/bbolt get ./default.etcd/member/snap/db cluster clusterVersion

etcd-dump-db

etcd-dump-db can be used to list content of v3 etcd backend (bbolt).

% go run go.etcd.io/etcd/v3/tools/etcd-dump-db  list-bucket default.etcd
alarm
auth
...

See more examples in: https://github.com/etcd-io/etcd/tree/master/tools/etcd-dump-db

WAL: Write ahead log

Write ahead log is a Raft persistent storage that is used to store proposals. First the leader stores the proposal in its log and then (concurrently) replicates it using Raft protocol to followers. Each follower persists the proposal in its WAL before confirming back replication to the leader.

The WAL log used in etcd differs from canonical Raft model 2-fold:

  • It does persist not only indexed entries, but also Raft snapshots (lightweight) & hard-state. So the entire Raft state of the member can be recovered from the WAL log alone.
  • It is append-only. Entries are not overridden in place, but an entry appended later in the file (with the same index) is superseding the previous one.

File names

The WAL log files are named using following pattern:

"%016x-%016x.wal", seq, index

Example: ./member/wal/0000000000000010-00000000000bf1e6.wal

So the file names contains hex-encoded:

  • Sequential number of the WAL log file
  • Index of the first entry or snapshot in the file. In particular the first file “0000000000000000-0000000000000000.wal” has the initial snapshot record with index=0.

Physical content

The WAL log file contains a sequence of “Frames”. Each frame contains:

  1. LittleEndian2 encoded uint64 that contains the length of the marshalled walpb.Record (3).
  2. Padding: Some number of 0 bytes, such that whole frame has aligned (mod 8) size
  3. Marshalled walpb.Record data:
    1. type - int encoded enum driving interpretation of the data-field below
    2. data - depending on type, usually marshalled proto
    3. crc - RC-32 checksum of all “data” fields combined (no type) in all the log records on this particular replica since WAL log creation. Please note that CRC takes in consideration ALL records (even if they didn’t get committedcomitted by Raft).

The files are “cut” (new file is started) when the current file is exceeding 64*10^6 bytes.

Logical content

Write ahead log files in the logical layer contains:

  • Raftpb.Entry: recent proposals replicated by Raft leader. Some of these proposals are considered ‘committed’ and the others are subject to be logically overridden.
  • Raftpb.HardState(term,commit,vote): periodic (very frequent) information about the index of a log entry that is ‘committed’ (replicated to the majority of servers), so guaranteed to be not changed/overridden and that can be applied to the backends (v2, v3). It also contains a “term” (indicator whether there were any election related changes) and a vote - a member the current replica voted for in the current term.
  • walpb.Snapshot(term, index): periodic snapshots of Raft state (no DB content, just snapshot log index and Raft term)
    • V2 store content is stored in a separate *.store files.
    • V3 store content is maintained in the bbolt file, and it’s becoming an implicit snapshot as soon as entries are applied there.
  • crc32 checksum record (at the beginning of each file), used to resume CRC checking for the remainder of the file.
  • etcdserverpb.Metadata(node_id, cluster_id) - identifying the cluster & replica the log represents.

Each WAL-log file is build from (in order):

  1. CRC-32 frame (running crc from all previous files, 0 for the first file).

  2. Metadata frame (cluster & replica IDs)

  3. For the initial WAL file only:

    • Empty Snapshot frame (Index:0, Term: 0). The purpose of this frame is to hold an invariant that all entries are ‘preceded’ by a snapshot.

    For not initial (2nd+) WAL file:

    • HardState frame.
  4. Mix of entry, hard-state & snapshot records

The WAL log can contain multiple entries for the same index. Such a situation can happen in cases described in figure 7. of the Raft paper. The etcd WAL log is appended only, so the entries are getting overridden, by appending a new entry with the same index.

In particular during the WAL reading, the logic is overriding old entries with newer entries. Thus only the last version of entries with entry.index <= HardState.commit can be considered as final. Entries with index > HardState.commit are subject to change.

The “terms” in the WAL log are expected to be monotonic.

The “indexes” in the WAL log are expected to:

  1. start from some snapshot
  2. be sequentially growing after that snapshot as long as they stay in the same ‘term’
  3. if the term changes, the index can decrease, but to a new value that is higher than the latest HardState.commit.
  4. a new snapshot might happen with any index >= HardState.commit, that opens a new sequence for indexes.

Tools

etcd-dump-logs

etcd WAL logs can be read using etcd-dump-logs tool:

% go install go.etcd.io/etcd/v3/tools/etcd-dump-logs@latest

% go run go.etcd.io/etcd/v3/tools/etcd-dump-logs --start-index=0 aname.etcd

Be aware that:

  • The tool shows only Entries, and not all the WAL records (Snapshots, HardStates) that are in the WAL log files.
  • The tool automatically applies ‘overrides’ on the entries. If an entry got overridden (by a fresher entry under the same index), the tool will print only the final value.
  • The tool also prints uncommitted entries (from the tail of the LOG), without information about HardState.commitIndex, so it’s not known whether entries are final or not.

Snapshots of (Store V2): member/snap/{term}-{index}.snap

File names:

member/snap/{term}-{index}.snap

The filenames are generated here ("%016x-%016x.snap") and are using 2 hex-encoded compounds:

  • term -> Raft term (period between elections) at the time snapshot is emitted
  • index -> of last applied proposal at the time snapshot is emitted

Creation

The *.snap files are created by Snapshotter.SaveSnap method.

There are 2 triggers controlling creation of these files:

  • A new file is created every (approximately) --snapshotCount=(by default 100'000) applied proposals. It’s an approximation as we might receive proposals in batches and we consider snapshotting only at the end of batch, finally the snapshotting process is asynchronously scheduled. The flag name (--snapshotCount) is pretty misleading as it drives differences in index value between last snapshot index and last applied proposal index.
  • Raft requests the replica to restore from the snapshot. As a replica is receiving the snapshot over wire (msgSnap) message, it also checkpoints (lightweight) it into WAL log. This guarantees that in the WAL logs tail there is always a valid snapshot followed by entries. So it suppresses potential lack of continuity in the WAL logs.

Currently the files are roughly3 associated 1-1 with WAL logs Snapshot entries. With store v2 decommissioning we expect the files to stop being written at all (opt-in: 3.5.x, mandatory 3.6.x).

Content

The file contains marshalled snapdb.snapshot proto (uint32 crc, bytes data),

that in the ‘data’ field holds Raftpb.Snapshot:

(bytes data, SnapshotMetadata{index, term, conf} metadata),

Finally the nested data holds a JSON serialized store v2 content.

In particular there is:

  • Term
  • Index
  • Membership data:
    • /0/members/8e9e05c52164694d/attributes -> {"name":"default","clientURLs":["[http://localhost:2379](http://localhost:2379)"]}
    • /0/members/8e9e05c52164694d/RaftAttributes -> "{"peerURLs":["http://localhost:2380"]}"
  • Storage version: /0/version-> 3.5.0

Tools

protoc

Following command allows you to see the file content when executed from etcd root directory:

cat default.etcd/member/snap/0000000000000002-0000000000049425.snap |
  protoc --decode=snappb.snapshot \
    server/etcdserver/api/snap/snappb/snap.proto \
    -I $(go list -f '{{.Dir}}' github.com/gogo/protobuf/proto)/.. \
    -I . \
    -I $(go list -m -f '{{.Dir}}' github.com/gogo/protobuf)/protobuf

Analogously you can extract ‘data’ field and decode as ‘Raftpb.Snapshot'

Exemplar JSON serialized store v2 content in etcd 3.4 *.snap files:

{
  "Root":{
    "Path":"/",
    "CreatedIndex":0,
    "ModifiedIndex":0,
    "ExpireTime":"0001-01-01T00:00:00Z",
    "Value":"",
    "Children":{
      "0":{
        "Path":"/0",
        "CreatedIndex":0,
        "ModifiedIndex":0,
        "ExpireTime":"0001-01-01T00:00:00Z",
        "Value":"",
        "Children":{
          "members":{
            "Path":"/0/members",
            "CreatedIndex":1,
            "ModifiedIndex":1,
            "ExpireTime":"0001-01-01T00:00:00Z",
            "Value":"",
            "Children":{
              "8e9e05c52164694d":{
                "Path":"/0/members/8e9e05c52164694d",
                "CreatedIndex":1,
                "ModifiedIndex":1,
                "ExpireTime":"0001-01-01T00:00:00Z",
                "Value":"",
                "Children":{
                  "attributes":{
                    "Path":"/0/members/8e9e05c52164694d/attributes",
                    "CreatedIndex":2,
                    "ModifiedIndex":2,
                    "ExpireTime":"0001-01-01T00:00:00Z",
                    "Value":"{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}",
                    "Children":null
                  },
                  "RaftAttributes":{
                    "Path":"/0/members/8e9e05c52164694d/RaftAttributes",
                    "CreatedIndex":1,
                    "ModifiedIndex":1,
                    "ExpireTime":"0001-01-01T00:00:00Z",
                    "Value":"{\"peerURLs\":[\"http://localhost:2380\"]}",
                    "Children":null
                  }
                }
              }
            }
          },
          "version":{
            "Path":"/0/version",
            "CreatedIndex":3,
            "ModifiedIndex":3,
            "ExpireTime":"0001-01-01T00:00:00Z",
            "Value":"3.5.0",
            "Children":null
          }
        }
      },
      "1":{
        "Path":"/1",
        "CreatedIndex":0,
        "ModifiedIndex":0,
        "ExpireTime":"0001-01-01T00:00:00Z",
        "Value":"",
        "Children":{


        }
      }
    }
  },
  "WatcherHub":{
    "EventHistory":{
      "Queue":{
        "Events":[
          {
            "action":"create",
            "node":{
              "key":"/0/members/8e9e05c52164694d/RaftAttributes",
              "value":"{\"peerURLs\":[\"http://localhost:2380\"]}",
              "modifiedIndex":1,
              "createdIndex":1
            }
          },
          {
            "action":"set",
            "node":{
              "key":"/0/members/8e9e05c52164694d/attributes",
              "value":"{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}",
              "modifiedIndex":2,
              "createdIndex":2
            }
          },
          {
            "action":"set",
            "node":{
              "key":"/0/version",
              "value":"3.5.0",
              "modifiedIndex":3,
              "createdIndex":3
            }
          }
        ]
      }
    }
  }
}

Changes

This section is reserved to describe changes to the file formats introduces between different etcd versions.


  1. The metadata pages at the beginning of the bbolt file are modified in-place. ↩︎

  2. Inconsistent, as majority of uint’s are written bigendian ↩︎

  3. The initial (index:0) snapshot at the beginning of WAL log is not associated with *.snap file. Also the old *.snap files (or WAL logs) might get purged. ↩︎