Implementation of etcd based on Raft protocol

Storage design

etcd's storage involves three modules: the log entries held in memory by the Raft state machine, the log entries persisted to files, and the backend KV storage.

Raft state machine storage

Recall the overall architecture of etcd from Chapter 1: the raft module is only responsible for implementing the algorithm, so all log entries it receives are stored in memory. The data structure is as follows:

(Figure: EtcdServer storage architecture)

In the figure above, all log entries are stored in a raftLog structure:

type raftLog struct {
    // Log entries persisted since the last snapshot
    storage Storage
    // Log entries that have not been persisted yet
    unstable unstable
    // Highest log index known to be committed in the cluster
    committed uint64
    // Highest log index this node has applied to its state machine
    applied uint64
    ...
    ...
}
  1. raftLog uses two fields to store logs. storage holds the logs that have already been persisted to disk together with the latest snapshot metadata, i.e. the data written to the WAL in the figure above; this part survives a node restart, after which etcd reads it back from the WAL and rewrites it into raft memory. Why only the logs since the last snapshot? Because a raft node's memory is finite, etcd snapshots the KV data periodically; once a snapshot completes, storage only needs to keep the snapshot information plus the logs received after it, which is exactly the log compaction defined in the Raft protocol.
  2. The unstable structure holds log entries and snapshots that have not yet been persisted. Once a log entry is persisted, it moves from unstable into storage (see the sketch after this list).
  3. The committed and applied indexes from the Raft protocol also live in raftLog; committed is persisted as part of the HardState, while applied records how far the local state machine has caught up.
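
To make the split between storage and unstable concrete, here is a minimal standalone sketch (simplified stand-in types, not the actual etcd code) of how a raftLog-style lastIndex() lookup consults unstable first and falls back to the persisted storage:

// Sketch: resolving the last log index by checking unstable entries
// first and falling back to what Storage has persisted.
package main

import "fmt"

type entry struct {
    Term  uint64
    Index uint64
}

type simpleLog struct {
    stableLast uint64  // last index held by Storage (persisted)
    unstable   []entry // entries not yet persisted
    offset     uint64  // raft index of unstable[0]
}

// lastIndex prefers unstable, since it always holds the newest entries.
func (l *simpleLog) lastIndex() uint64 {
    if n := len(l.unstable); n > 0 {
        return l.offset + uint64(n) - 1
    }
    return l.stableLast
}

func main() {
    l := &simpleLog{stableLast: 5, unstable: []entry{{Term: 2, Index: 6}}, offset: 6}
    fmt.Println(l.lastIndex()) // 6
}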

Storage

The raft state machine Storage interface is defined as follows:

type Storage interface {
    // HardState and ConfState information that has been persisted
    InitialState() (pb.HardState, pb.ConfState, error)
    // Returns log entries in the range [lo, hi), up to maxSize bytes
    Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
    // Returns the term of the log entry at index i
    Term(i uint64) (uint64, error)
    // Index of the last log entry
    LastIndex() (uint64, error)
    // Index of the first log entry
    FirstIndex() (uint64, error)
    // Returns the most recent Snapshot
    Snapshot() (pb.Snapshot, error)
}

The Storage interface defines access to all state the Raft protocol requires to be persisted, such as the HardState (term, vote, and commit index) and the log entries.

The default implementation of this interface in etcd is MemoryStorage. As the name suggests, it keeps its data in memory, which at first looks inconsistent with Raft's persistence requirement. It is safe because everything held in MemoryStorage has already been written to the WAL; on restart, etcd recovers the data from the WAL and writes it back into Storage. MemoryStorage is defined as follows:

type MemoryStorage struct {
    // Protects all fields below
    sync.Mutex
    // term, vote and commit index, encapsulated in HardState
    hardState pb.HardState
    // Latest snapshot
    snapshot pb.Snapshot
    // Log entries after the snapshot; the index of the first entry is snapshot.Metadata.Index
    ents []pb.Entry
}
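
As a usage sketch, the raft library exposes a constructor for this default implementation. The import paths below assume etcd v3.4 (go.etcd.io/etcd/raft); v3.5 moved the module to go.etcd.io/etcd/raft/v3:

// Sketch: basic MemoryStorage usage. In etcd itself, entries are only
// appended here after they have been saved to the WAL.
package main

import (
    "fmt"

    "go.etcd.io/etcd/raft"
    pb "go.etcd.io/etcd/raft/raftpb"
)

func main() {
    storage := raft.NewMemoryStorage()
    if err := storage.Append([]pb.Entry{{Term: 1, Index: 1}, {Term: 1, Index: 2}}); err != nil {
        panic(err)
    }
    first, _ := storage.FirstIndex()
    last, _ := storage.LastIndex()
    fmt.Println(first, last) // 1 2
}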

unstable

Log entries that the raft module has received but not yet persisted to the WAL live in unstable:

type unstable struct {
    // Snapshot received from the leader, not yet persisted
    snapshot *pb.Snapshot
    // Newly received log entries that have not been persisted
    entries []pb.Entry
    // Log index of the first entry in entries
    offset uint64
}
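
The offset field ties a raft log index to a slice position: the entry with index i lives at entries[i-offset]. A simplified sketch of such a lookup (stand-in types; the real etcd code also checks against the pending snapshot):

// Sketch: mapping a raft log index into the unstable entries slice.
package main

import "fmt"

type entry struct {
    Term  uint64
    Index uint64
}

type unstableLog struct {
    entries []entry
    offset  uint64 // raft index of entries[0]
}

// maybeTerm returns the term of the entry at index i, if unstable holds it.
func (u *unstableLog) maybeTerm(i uint64) (uint64, bool) {
    if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
        return 0, false
    }
    return u.entries[i-u.offset].Term, true
}

func main() {
    u := &unstableLog{entries: []entry{{Term: 3, Index: 10}, {Term: 3, Index: 11}}, offset: 10}
    fmt.Println(u.maybeTerm(11)) // 3 true
    fmt.Println(u.maybeTerm(9))  // 0 false
}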

Log persistence

The raft module's log persistence is implemented through the WAL (write-ahead log), which gains performance by only ever appending to files. etcd appends a record to the WAL in the following cases:

  • When a node starts, its node and cluster information is recorded; the record type is metadataType;
  • When a new log entry is received, the record type is entryType;
  • When the state changes, e.g. a new term or a commitIndex change, the record type is stateType;
  • When a data snapshot is taken, the record type is snapshotType;
  • When a new WAL file is created (etcd rolls over once the current file reaches a size limit), the first record of the new file holds the CRC of the previous file for data verification; the record type is crcType.

type WAL struct {
    lg *zap.Logger
    // Directory where the WAL files are stored
    dir string
    dirFile *os.File
    // Metadata recorded at the head of each WAL file
    metadata []byte
    // HardState recorded at the head of each WAL file
    state raftpb.HardState
    // Snapshot the WAL starts from: reading begins after this snapshot's record
    start walpb.Snapshot
    // Deserializer for WAL records
    decoder *decoder
    ...
    // The locked underlying data files
    locks []*fileutil.LockedFile
}

The bottom layer of the WAL corresponds to a series of files on disk. Log entries that need to be persisted are appended to the end of the current file; when the file reaches a certain size, the WAL creates a new disk file so that no single WAL file grows too large.

etcd snapshots its data periodically and appends a record to the WAL for each snapshot. When an etcd node restarts, it locates the record of the last snapshot in the WAL and replays the log entries after that snapshot into the raft module to rebuild its in-memory state.
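
For reference, here is a minimal sketch of creating a WAL and appending records, assuming etcd v3.4's go.etcd.io/etcd/wal package and signatures (v3.5 moved the package under the server module):

// Sketch: create a WAL and append a HardState plus log entries.
// wal.Create fails if the directory already exists.
package main

import (
    "log"

    "go.etcd.io/etcd/raft/raftpb"
    "go.etcd.io/etcd/wal"
    "go.uber.org/zap"
)

func main() {
    lg := zap.NewExample()
    // Create writes the initial metadataType record as the first record
    // of the new WAL file.
    w, err := wal.Create(lg, "/tmp/demo-wal", []byte("node-1"))
    if err != nil {
        log.Fatal(err)
    }
    defer w.Close()

    // Save appends a stateType record for the HardState and one
    // entryType record per log entry, then syncs if required.
    st := raftpb.HardState{Term: 1, Vote: 1, Commit: 1}
    ents := []raftpb.Entry{{Term: 1, Index: 1, Data: []byte("put hello world")}}
    if err := w.Save(st, ents); err != nil {
        log.Fatal(err)
    }
}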

KV database storage

etcd's final, effective data is stored in a KV database, abstracted behind a Backend interface for the storage layer. A Backend implementation must support transactions and multi-version management. The Backend interface is defined as follows:

type Backend interface {
    // Open a read transaction
    ReadTx() ReadTx
    // Open a batch write transaction
    BatchTx() BatchTx
    // Open a read transaction that does not block concurrent readers
    ConcurrentReadTx() ReadTx
    // Snapshot the db
    Snapshot() Snapshot
    Hash(ignores map[IgnoreKey]struct{}) (uint32, error)
    // Physical disk size occupied by the DB; space may be preallocated, so this is not the actual data size
    Size() int64
    // Disk space actually in use
    SizeInUse() int64
    // Number of currently open read transactions
    OpenReadTxN() int64
    // Defragment the data file, reclaiming space held by deleted keys and by old versions of updated keys
    Defrag() error
    ForceCommit()
    Close() error
}

The default implementation of this interface is as follows:

type backend struct {
    // Disk size occupied
    size int64
    // Disk size actually in use
    sizeInUse int64
    // Number of committed transactions
    commits int64
    // Number of currently open read transactions
    openReadTxN int64
    // Read-write lock
    mu sync.RWMutex
    // The underlying storage is boltDB
    db *bolt.DB
    // Interval between batch write commits
    batchInterval time.Duration
    // Maximum number of operations buffered in one batch write transaction
    batchLimit int
    // Buffered batch write transaction
    batchTx *batchTxBuffered
    // Read transaction
    readTx *readTx
    stopc chan struct{}
    donec chan struct{}
    lg *zap.Logger
}

As the implementation shows, etcd's default underlying storage is boltDB. To improve read and write efficiency, etcd maintains a buffer for write transactions: the buffered data is written to disk once the buffer reaches a certain size or once enough time has passed since the last commit.
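
That batching behavior can be pictured as a loop like the one below: a simplified sketch in the spirit of the backend's commit goroutine, not the actual implementation (here the batchLimit early-commit path, driven by the writers themselves, is only noted in a comment):

// Sketch: periodic batch commit. pending is bumped by write transactions;
// commit() flushes the buffered writes to boltDB.
package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

type fakeBackend struct {
    pending       int64
    batchInterval time.Duration
    batchLimit    int64 // writers trigger an early commit once pending reaches this
    stopc         chan struct{}
}

func (b *fakeBackend) commit() {
    if n := atomic.SwapInt64(&b.pending, 0); n > 0 {
        fmt.Printf("committing %d buffered writes\n", n)
    }
}

// run flushes the write buffer every batchInterval until stopped.
func (b *fakeBackend) run() {
    t := time.NewTicker(b.batchInterval)
    defer t.Stop()
    for {
        select {
        case <-t.C:
            b.commit()
        case <-b.stopc:
            b.commit()
            return
        }
    }
}

func main() {
    b := &fakeBackend{batchInterval: 100 * time.Millisecond, batchLimit: 10000, stopc: make(chan struct{})}
    go b.run()
    atomic.AddInt64(&b.pending, 3)
    time.Sleep(250 * time.Millisecond)
    close(b.stopc)
    time.Sleep(50 * time.Millisecond)
}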

Storage summary

After data is submitted by a client, it passes through three storage locations in etcd. It first enters the raft algorithm module, which keeps the log in memory and then notifies etcd to persist it; for efficiency, etcd writes the data to the WAL, and because the underlying WAL files are append-only, with no updates or deletes, the data can no longer be lost once this step completes. The etcd leader then distributes the log to the cluster, and once more than half of the nodes have responded, the data is committed and stored in the backend KV store.

Log synchronization

With etcd's storage design understood, the complete flow of a data change request becomes much easier to follow. Let's walk through the source code.

Request processing

When a client submits a data change request, for example the write request put hello world, the v3 API calls EtcdServer's Put() method, which eventually calls processInternalRaftRequestOnce().

func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*applyResult, error) {
    // Reject if the gap between committed and applied entries exceeds the limit
    ai := s.getAppliedIndex()
    ci := s.getCommittedIndex()
    if ci > ai+maxGapBetweenApplyAndCommitIndex {
        return nil, ErrTooManyRequests
    }
    //Generate a requestID
    r.Header = &pb.RequestHeader{
        ID: s.reqIDGen.Next(),
    }
    authInfo, err := s.AuthInfoFromCtx(ctx)
    if err != nil {
        return nil, err
    }
    if authInfo != nil {
        r.Header.Username = authInfo.Username
        r.Header.AuthRevision = authInfo.Revision
    }
    // Serialize the request data
    data, err := r.Marshal()
    if err != nil {
        return nil, err
    }
    if len(data) > int(s.Cfg.MaxRequestBytes) {
        return nil, ErrRequestTooLarge
    }
    id := r.ID
    if id == 0 {
        id = r.Header.ID
    }
    //Register a channel and wait for the processing to complete
    ch := s.w.Register(id)
    //Set request timeout
    cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
    defer cancel()
    start := time.Now()
    // Submit the request to the raft module via Propose()
    err = s.r.Propose(cctx, data)
    if err != nil {
        proposalsFailed.Inc()
        s.w.Trigger(id, nil) // GC wait
        return nil, err
    }
    proposalsPending.Inc()
    defer proposalsPending.Dec()
    select {
    // Wait for the application result to be returned to the client
    case x := <-ch:
        return x.(*applyResult), nil
    case <-cctx.Done():
        proposalsFailed.Inc()
        s.w.Trigger(id, nil) // GC wait
        return nil, s.parseProposeCtxErr(cctx.Err(), start)
    case <-s.done:
        return nil, ErrStopped
    }
}

In the method above, etcd performs basic validation on the request and then submits it to raft by calling the Propose() method, after which it waits for feedback. In etcd's implementation the result is not returned to the client until the data has been applied to the state machine. Inside Propose(), raft wraps the request in an MsgProp message and calls the Step function.

func (rn *RawNode) Propose(data []byte) error {
    return rn.raft.Step(pb.Message{
        Type: pb.MsgProp,
        From: rn.raft.id,
        Entries: []pb.Entry{
            {Data: data},
        }})
}

etcd only allows the leader to process data change requests, so if a follower receives a command from a client it forwards the command to the leader, waits for the leader's feedback, and then returns the result to the client. We therefore only need to look at the leader's processing logic. The Step() function above eventually calls the raft module's stepLeader(r *raft, m pb.Message) function.

Why execution ends up in stepLeader was covered in the previous article; look back there if you need a refresher.

func stepLeader(r *raft, m pb.Message) error {
    // These message types do not require any progress for m.From.
    switch m.Type {
    case pb.MsgBeat:
        ...
    case pb.MsgCheckQuorum:
        ...
    case pb.MsgProp:
        if len(m.Entries) == 0 {
            r.logger.Panicf("%x stepped empty MsgProp", r.id)
        }
        if r.prs.Progress[r.id] == nil {
            // Judge whether the current node has been removed from the cluster
            return ErrProposalDropped
        }
        if r.leadTransferee != None {
            // If a leader switch is in progress, write is rejected
            return ErrProposalDropped
        }
        for i := range m.Entries {
            //Determine whether there is a log of configuration changes, and if so, do some special processing
        }
        //Append logs to the raft state machine
        if !r.appendEntry(m.Entries...) {
            return ErrProposalDropped
        }
        // Send logs to other nodes in the cluster
        r.bcastAppend()
        return nil
    case pb.MsgReadIndex:
        ...
        return nil
    }
    ...
    ...
    return nil
}

Raft is a protocol based on log replication, so a client's data change request is wrapped in a log entry. In the logic above, some basic checks run first; once they pass, the log entries carried in the Message are appended to raft's log list, and after a successful append the log is broadcast to all followers.

Appending to the raft log

As mentioned in the storage section, the raft algorithm module only keeps logs in memory, so the logic of appendEntry is very simple.

func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
     //1. Get the index of the last log entry of the raft node
    li := r.raftLog.lastIndex()
     //2. Set term and index for new log entries
    for i := range es {
        es[i].Term = r.Term
        es[i].Index = li + 1 + uint64(i)
    }
    // 3. Judge whether the uncommitted log entries exceed the limit. If yes, reject and return failure
    if !r.increaseUncommittedSize(es) {
        return false
    }
    // 4. Append log entries to raftLog
    li = r.raftLog.append(es...)
    // 5. Check and update the log progress
    r.prs.Progress[r.id].MaybeUpdate(li)
    // 6. Judge whether to make a commit
    r.maybeCommit()
    return true
}
  1. Gets the index of the last log entry currently in the raft log.
  2. Log entry indexes in raft increase monotonically, so the new entries receive li+1, li+2, and so on.
  3. etcd caps the number of uncommitted entries on the leader to keep entries from piling up indefinitely when the network between the leader and its followers misbehaves.
  4. Appends the entries to the raftLog in-memory queue and returns the new last index; when the leader appends, the li returned here equals the li obtained in step 1 plus the number of appended entries.
  5. The raft leader tracks the log replication progress of every node, including itself.
  6. The maybeCommit() result is ignored and true is returned so that log broadcasting can begin; a sketch of the quorum calculation behind maybeCommit() follows after this list.
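
For an all-voter cluster, the quorum calculation behind maybeCommit() reduces to taking the median of the sorted Match values. A standalone sketch of that computation (not the actual tracker code, which also handles joint configurations):

// Sketch: computing the quorum-committed index from per-node Match values,
// in the spirit of r.prs.Committed().
package main

import (
    "fmt"
    "sort"
)

func committed(match []uint64) uint64 {
    sorted := append([]uint64(nil), match...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    // sorted[(n-1)/2] is matched by the n-(n-1)/2 nodes at or above it,
    // which is always a majority.
    return sorted[(len(sorted)-1)/2]
}

func main() {
    // 5-node cluster: leader at index 9, followers at 9, 7, 4 and 2.
    fmt.Println(committed([]uint64{9, 9, 7, 4, 2})) // 7
}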

Sync to Follower

After the Leader node stores the log entries in raftLog's memory, the bcastAppend() method triggers a broadcast that synchronizes the log to the followers.

func (r *raft) bcastAppend() {
    // Visit every node and send a log Append message to each one except ourselves
    r.prs.Visit(func(id uint64, _ *tracker.Progress) {
        if id == r.id {
            return
        }
        r.sendAppend(id)
    })
}
func (r *raft) sendAppend(to uint64) {
    r.maybeSendAppend(to, true)
}
func (r *raft) maybeSendAppend(to uint64, sendIfEmpty bool) bool {
    //1. Obtain the current synchronization progress of the peer node
    pr := r.prs.Progress[to]
    if pr.IsPaused() {
        return false
    }
    m := pb.Message{}
    m.To = to
    //2. Note: term here is the term of the entry at pr.Next-1, i.e. the entry immediately preceding this batch, used for the log-matching check
    term, errt := r.raftLog.term(pr.Next - 1)
    ents, erre := r.raftLog.entries(pr.Next, r.maxMsgSize)
    if len(ents) == 0 && !sendIfEmpty {
        return false
    }
    if errt != nil || erre != nil { 
        //3. Failing to get the term or entries means the follower lags too far behind: the logs were already compacted out of raftLog memory after a snapshot
        if !pr.RecentActive {
            r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
            return false
        }
        //4. Send a Snapshot message instead
        m.Type = pb.MsgSnap
        snapshot, err := r.raftLog.snapshot()
        if err != nil {
            if err == ErrSnapshotTemporarilyUnavailable {
                r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
                return false
            }
            panic(err) // TODO(bdarnell)
        }
        if IsEmptySnap(snapshot) {
            panic("need non-empty snapshot")
        }
        m.Snapshot = snapshot
        sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
        pr.BecomeSnapshot(sindex)
    } else {
        //5. Send Append message
        m.Type = pb.MsgApp
        m.Index = pr.Next - 1
        m.LogTerm = term
        m.Entries = ents
        //6. Every log or heartbeat message carries the latest commitIndex
        m.Commit = r.raftLog.committed
        if n := len(m.Entries); n != 0 {
            ...
            ...
        }
    }
    //7. Send message
    r.send(m)
    return true
}

In the logic above, after receiving a new log entry the leader iterates over every follower in the cluster and triggers a log synchronization.

  1. Per the Raft protocol, the leader caches the log replication progress of every follower (see the Progress sketch after this list).
  2. If, when fetching log entries according to that progress, the follower turns out to lag too far behind, which usually happens when a node has just joined or its network connection failed, the needed entries may already have been compacted out of raftLog memory; in that case the leader sends its latest snapshot to the follower instead, which is more efficient.
  3. Under normal circumstances a new MsgApp message carrying the log entries is built for the follower, and finally r.send(m) submits the message.
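
The progress bookkeeping from item 1 boils down to two indexes per follower. The sketch below mirrors tracker.Progress and its MaybeUpdate logic in simplified form: Match is the highest index known to be replicated on the follower, and Next is the next index to send, so pr.Next-1 is the prevLogIndex carried in MsgApp.

// Sketch: per-follower replication progress, simplified from tracker.Progress.
package main

import "fmt"

type progress struct {
    Match, Next uint64
}

// maybeUpdate advances the progress after a successful MsgAppResp
// acknowledging index n.
func (pr *progress) maybeUpdate(n uint64) bool {
    updated := false
    if pr.Match < n {
        pr.Match = n
        updated = true
    }
    if pr.Next < n+1 {
        pr.Next = n + 1
    }
    return updated
}

func main() {
    pr := &progress{Match: 4, Next: 5}
    fmt.Println(pr.maybeUpdate(7), *pr) // true {7 8}
}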

Writing the log to the WAL

As mentioned in the previous article on heartbeat messages, EtcdServer runs a goroutine that watches the raft channel for new Ready data; when it arrives, the msgs inside are sent to their receivers. The MsgApp message above is submitted the same way, so the details are not repeated here.

While the log is being sent to the followers, the leader also flushes it to disk, i.e. writes it to the WAL, by calling the WAL.Save() method.

func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
    //Get the write lock of wal
    w.mu.Lock()
    defer w.mu.Unlock()
    // Nothing to write if the HardState is empty and there are no new entries
    if raft.IsEmptyHardState(st) && len(ents) == 0 {
        return nil
    }
    mustSync := raft.MustSync(st, w.state, len(ents))
    // Write log entries
    for i := range ents {
        if err := w.saveEntry(&ents[i]); err != nil {
            return err
        }
    }
    // Write state change
    if err := w.saveState(&st); err != nil {
        return err
    }
    // Determine whether the file size exceeds the maximum
    curOff, err := w.tail().Seek(0, io.SeekCurrent)
    if err != nil {
        return err
    }
    if curOff < SegmentSizeBytes {
        if mustSync {
            return w.sync()
        }
        return nil
    }
    // Cut over to a new WAL file
    return w.cut()
}

The WAL file structure was described above; for each new log entry, a record of type entryType is appended to the WAL.

Follower log processing

On the leader, sending the log and persisting it happen asynchronously after the command is processed, but that alone is not enough to reply to the client: the Raft protocol requires that an entry be committed before success is returned, so the leader must wait until more than half of the nodes have processed the log and responded. Let's look at the follower's log handling first.

After a log message arrives at a follower, it is likewise handled by the EtcdServer.Process() method and eventually enters the raft module's stepFollower() function.

func stepFollower(r *raft, m pb.Message) error {
    switch m.Type {
    ...
    case pb.MsgApp:
        // Reset the election timeout counter
        r.electionElapsed = 0
        // Record the current leader
        r.lead = m.From
        // Processing log entries
        r.handleAppendEntries(m)
    ...
    }
    ...
}

As with a heartbeat message, the follower first resets its election timeout counter and records the leader, and then processes the log entries.

func (r *raft) handleAppendEntries(m pb.Message) {
    // Determine whether it is an outdated message
    if m.Index < r.raftLog.committed {
        r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: r.raftLog.committed})
        return
    }
    if mlastIndex, ok := r.raftLog.maybeAppend(m.Index, m.LogTerm, m.Commit, m.Entries...); ok {
        // Processing succeeded. Send MsgAppResp to the Leader
        r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
    } else {
        // The index of the log does not match the lastIndex of the Follower, and a reject message is returned
        r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: m.Index, Reject: true, RejectHint: r.raftLog.lastIndex()})
    }
}

The follower asks raftLog to store the log and returns the result to the leader. A follower can fail here in two situations: the term of the entry at the given index does not match the follower's local term at that index, or the smallest index in the batch is greater than the follower's last log index.

The maybeAppend() method above only stores the log in the in-memory queue maintained by raftLog; persistence is asynchronous and essentially follows the same WAL logic as on the leader. One difference: the follower sends its MsgAppResp message only after the WAL save succeeds, whereas the leader sends its messages first and then saves to the WAL.
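
The consistency check inside maybeAppend() is Raft's log-matching rule: the follower accepts a batch only if its own log contains an entry at m.Index with term m.LogTerm. A condensed standalone sketch (stand-in types, not the etcd code):

// Sketch: the log-matching check a follower performs before appending.
package main

import "fmt"

type follower struct {
    terms map[uint64]uint64 // log index -> term of the local entry
}

// matchTerm reports whether the local log has an entry at index with logTerm.
func (f *follower) matchTerm(index, logTerm uint64) bool {
    t, ok := f.terms[index]
    return ok && t == logTerm
}

func main() {
    f := &follower{terms: map[uint64]uint64{1: 1, 2: 1, 3: 2}}
    // prevLogIndex=3, prevLogTerm=2: the batch is accepted.
    fmt.Println(f.matchTerm(3, 2)) // true
    // prevLogIndex=3, prevLogTerm=3: term conflict, the batch is rejected.
    fmt.Println(f.matchTerm(3, 3)) // false
    // prevLogIndex=5: beyond the local log, the batch is rejected.
    fmt.Println(f.matchTerm(5, 2)) // false
}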

Commit

After broadcasting the log to the followers, the leader waits for their MsgAppResp messages; when one arrives, processing again enters the stepLeader function.

func stepLeader(r *raft, m pb.Message) error {
    ...
    ...
    pr := r.prs.Progress[m.From]
    switch m.Type {
    case pb.MsgAppResp:
        pr.RecentActive = true
        if m.Reject {
            //On a reject, back up and resend the log according to the index hinted by the follower
            if pr.MaybeDecrTo(m.Index, m.RejectHint) {
                if pr.State == tracker.StateReplicate {
                    pr.BecomeProbe()
                }
                r.sendAppend(m.From)
            }
        } else {
            oldPaused := pr.IsPaused()
            //Update cached log synchronization progress
            if pr.MaybeUpdate(m.Index) {
                switch {
                case pr.State == tracker.StateProbe:
                    pr.BecomeReplicate()
                case pr.State == tracker.StateSnapshot && pr.Match >= pr.PendingSnapshot:
                    pr.BecomeProbe()
                    pr.BecomeReplicate()
                case pr.State == tracker.StateReplicate:
                    pr.Inflights.FreeLE(m.Index)
                }
                //If the progress is updated, judge and update the commitIndex
                if r.maybeCommit() {
                    //If the commitIndex changes, the log will be sent immediately
                    r.bcastAppend()
                } else if oldPaused {
                    r.sendAppend(m.From)
                }
                // Keep sending any remaining logs to the follower
                for r.maybeSendAppend(m.From, false) {
                }
                // Is leader transfer in progress
                if m.From == r.leadTransferee && pr.Match == r.raftLog.lastIndex() {
                    r.logger.Infof("%x sent MsgTimeoutNow to %x after received MsgAppResp", r.id, m.From)
                    r.sendTimeoutNow(m.From)
                }
            }
        }
    ...
    ...
    return nil
}
func (r *raft) maybeCommit() bool {
    //Get the largest index with more than half of the confirmations
    mci := r.prs.Committed()
    //Update commitIndex
    return r.raftLog.maybeCommit(mci, r.Term)
}

After receiving a follower's reply: if it is a reject, the leader resends the log according to the returned index. If it is a success, the leader updates the cached replication progress and checks whether the index confirmed by more than half of the nodes has advanced; if it has, raftLog is told to update the commitIndex. At this point the client's data update command is officially committed. Finally, let's look at how the data is written to the DB.

Data update (Apply)

As mentioned earlier, EtcdServer starts a goroutine that watches for Ready messages from the raft module. When the commitIndex changes in the previous step, the HardState in the Ready carries a value; etcd takes the CommittedEntries from the Ready structure and hands them to the apply module, which applies them to the backend storage.

func (r *raftNode) start(rh *raftReadyHandler) {
    internalTimeout := time.Second
    go func() {
        defer r.onStop()
        islead := false
        for {
            ...
            case rd := <-r.Ready():
                if rd.SoftState != nil {
                    ...
                    ...
                }
                if len(rd.ReadStates) != 0 {
                    ...
                    ...
                }
                // Generate apply Request
                notifyc := make(chan struct{}, 1)
                ap := apply{
                    entries:  rd.CommittedEntries,
                    snapshot: rd.Snapshot,
                    notifyc:  notifyc,
                }
                // Update the commitIndex of etcdServer cache to the latest value
                updateCommittedIndex(&ap, rh)
                // Apply committed logs to the state machine
                select {
                case r.applyc <- ap:
                case <-r.stopped:
                    return
                }
                if islead {
                    // The leader sends raft messages to peers in parallel with persisting them locally
                    r.transport.Send(r.processMessages(rd.Messages))
                }
                // If there is a snapshot
                if !raft.IsEmptySnap(rd.Snapshot) {
                    ...
                    ...
                }
                //Save hardState and log entries to WAL
                if err := r.storage.Save(rd.HardState, rd.Entries); err != nil {
                    ...
                    ...
                }
                if !raft.IsEmptyHardState(rd.HardState) {
                    proposalsCommitted.Set(float64(rd.HardState.Commit))
                }
                if !raft.IsEmptySnap(rd.Snapshot) {
                    ...
                    ...
                }
                r.raftStorage.Append(rd.Entries)
                if !islead {
                    ...
                    ...
                } else {
                    notifyc <- struct{}{}
                }
                // Signal raft that this Ready has been processed; applied advances and entries move from unstable to stable
                r.Advance()
            case <-r.stopped:
                return
            }
        }
    }()
}

Note that applying the committed log entries to the state machine is done asynchronously. Once the apply completes, the result is written to the channel that was registered when the client call came in, and with that a complete write operation is finished.
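
The register/trigger handshake that connects this asynchronous apply back to the waiting processInternalRaftRequestOnce() call is provided by etcd's pkg/wait; a minimal version of the idea:

// Sketch: a minimal register/trigger map in the spirit of etcd's pkg/wait,
// keyed by the request ID generated for each proposal.
package main

import (
    "fmt"
    "sync"
)

type wait struct {
    mu sync.Mutex
    m  map[uint64]chan interface{}
}

// Register returns a channel the caller blocks on until Trigger fires.
func (w *wait) Register(id uint64) <-chan interface{} {
    w.mu.Lock()
    defer w.mu.Unlock()
    ch := make(chan interface{}, 1)
    w.m[id] = ch
    return ch
}

// Trigger delivers the apply result to the registered waiter.
func (w *wait) Trigger(id uint64, x interface{}) {
    w.mu.Lock()
    ch := w.m[id]
    delete(w.m, id)
    w.mu.Unlock()
    if ch != nil {
        ch <- x
        close(ch)
    }
}

func main() {
    w := &wait{m: make(map[uint64]chan interface{})}
    ch := w.Register(42)
    go w.Trigger(42, "applied") // the apply goroutine finishes
    fmt.Println(<-ch)           // applied
}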
