Implementation of etcd based on Raft protocol

Storage design

etcd's storage involves three modules: the log entries held in memory by the Raft state machine, the log entries persisted to files, and the backend KV storage.

Raft state machine storage

Recall the overall architecture of etcd from Chapter 1: the raft module is only responsible for implementing the algorithm, so all log entries it receives are stored in memory. The data structure is as follows:

(Figure: EtcdServer storage architecture)

In the figure above, all log entries are stored in a raftLog structure:

type raftLog struct {
    // Log entries persisted since the last snapshot
    storage Storage
    // Log entries that have not been persisted yet
    unstable unstable
    // Highest log index known to be committed in the cluster
    committed uint64
    // Highest log index this node has applied to its state machine
    applied uint64
    ...
    ...
}
  1. raftLog uses two fields to store logs. storage holds the logs that have already been persisted to disk together with the latest snapshot metadata, i.e. the data written to the WAL in the figure above; this part survives a node restart, after which etcd reads it back from the WAL and rewrites it into raft memory. Why only the logs since the last snapshot? Because a raft node's memory is finite, etcd snapshots the KV data periodically; once a snapshot completes, storage only needs to keep the snapshot information plus the logs received after it, which is exactly the log compaction defined in the Raft protocol.
  2. The unstable structure holds log entries and snapshots that have not yet been persisted. Once a log entry is persisted, it moves from unstable into storage (see the sketch after this list).
  3. The committed and applied indexes from the Raft protocol also live in raftLog; committed is persisted as part of the HardState, while applied records how far the local state machine has caught up.
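
To make the split between storage and unstable concrete, here is a minimal standalone sketch (simplified stand-in types, not the actual etcd code) of how a raftLog-style lastIndex() lookup consults unstable first and falls back to the persisted storage:

// Sketch: resolving the last log index by checking unstable entries
// first and falling back to what Storage has persisted.
package main

import "fmt"

type entry struct {
    Term  uint64
    Index uint64
}

type simpleLog struct {
    stableLast uint64  // last index held by Storage (persisted)
    unstable   []entry // entries not yet persisted
    offset     uint64  // raft index of unstable[0]
}

// lastIndex prefers unstable, since it always holds the newest entries.
func (l *simpleLog) lastIndex() uint64 {
    if n := len(l.unstable); n > 0 {
        return l.offset + uint64(n) - 1
    }
    return l.stableLast
}

func main() {
    l := &simpleLog{stableLast: 5, unstable: []entry{{Term: 2, Index: 6}}, offset: 6}
    fmt.Println(l.lastIndex()) // 6
}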

Storage

The raft state machine Storage interface is defined as follows:

type Storage interface {
    // HardState and ConfState information that has been persisted
    InitialState() (pb.HardState, pb.ConfState, error)
    // Returns log entries in the range [lo, hi), up to maxSize bytes
    Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
    // Returns the term of the log entry at index i
    Term(i uint64) (uint64, error)
    // Index of the last log entry
    LastIndex() (uint64, error)
    // Index of the first log entry
    FirstIndex() (uint64, error)
    // Returns the most recent Snapshot
    Snapshot() (pb.Snapshot, error)
}

The Storage interface defines access to all state the Raft protocol requires to be persisted, such as the HardState (term, vote, and commit index) and the log entries.

The default implementation of this interface in etcd is MemoryStorage. As the name suggests, it keeps its data in memory, which at first looks inconsistent with Raft's persistence requirement. It is safe because everything held in MemoryStorage has already been written to the WAL; on restart, etcd recovers the data from the WAL and writes it back into Storage. MemoryStorage is defined as follows:

type MemoryStorage struct {
    // Protects all fields below
    sync.Mutex
    // term, vote and commit index, encapsulated in HardState
    hardState pb.HardState
    // Latest snapshot
    snapshot pb.Snapshot
    // Log entries after the snapshot; the index of the first entry is snapshot.Metadata.Index
    ents []pb.Entry
}
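
As a usage sketch, the raft library exposes a constructor for this default implementation. The import paths below assume etcd v3.4 (go.etcd.io/etcd/raft); v3.5 moved the module to go.etcd.io/etcd/raft/v3:

// Sketch: basic MemoryStorage usage. In etcd itself, entries are only
// appended here after they have been saved to the WAL.
package main

import (
    "fmt"

    "go.etcd.io/etcd/raft"
    pb "go.etcd.io/etcd/raft/raftpb"
)

func main() {
    storage := raft.NewMemoryStorage()
    if err := storage.Append([]pb.Entry{{Term: 1, Index: 1}, {Term: 1, Index: 2}}); err != nil {
        panic(err)
    }
    first, _ := storage.FirstIndex()
    last, _ := storage.LastIndex()
    fmt.Println(first, last) // 1 2
}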

unstable

Log entries that the raft module has received but not yet persisted to the WAL live in unstable:

type unstable struct {
    // Snapshot received from the leader, not yet persisted
    snapshot *pb.Snapshot
    // Newly received log entries that have not been persisted
    entries []pb.Entry
    // Log index of the first entry in entries
    offset uint64
}
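
The offset field ties a raft log index to a slice position: the entry with index i lives at entries[i-offset]. A simplified sketch of such a lookup (stand-in types; the real etcd code also checks against the pending snapshot):

// Sketch: mapping a raft log index into the unstable entries slice.
package main

import "fmt"

type entry struct {
    Term  uint64
    Index uint64
}

type unstableLog struct {
    entries []entry
    offset  uint64 // raft index of entries[0]
}

// maybeTerm returns the term of the entry at index i, if unstable holds it.
func (u *unstableLog) maybeTerm(i uint64) (uint64, bool) {
    if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
        return 0, false
    }
    return u.entries[i-u.offset].Term, true
}

func main() {
    u := &unstableLog{entries: []entry{{Term: 3, Index: 10}, {Term: 3, Index: 11}}, offset: 10}
    fmt.Println(u.maybeTerm(11)) // 3 true
    fmt.Println(u.maybeTerm(9))  // 0 false
}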

Log persistence

The raft module's log persistence is implemented through the WAL (write-ahead log), which gains performance by only ever appending to files. etcd appends a record to the WAL in the following cases:

  • When a node starts, its node and cluster information is recorded; the record type is metadataType;
  • When a new log entry is received, the record type is entryType;
  • When the state changes, e.g. a new term or a commitIndex change, the record type is stateType;
  • When a data snapshot is taken, the record type is snapshotType;
  • When a new WAL file is created (etcd rolls over once the current file reaches a size limit), the first record of the new file holds the CRC of the previous file for data verification; the record type is crcType.

type WAL struct {
    lg *zap.Logger
    // Directory where the WAL files are stored
    dir string
    dirFile *os.File
    // Metadata recorded at the head of each WAL file
    metadata []byte
    // HardState recorded at the head of each WAL file
    state raftpb.HardState
    // Snapshot the WAL starts from: reading begins after this snapshot's record
    start walpb.Snapshot
    // Deserializer for WAL records
    decoder *decoder
    ...
    // The locked underlying data files
    locks []*fileutil.LockedFile
}

The bottom layer of the WAL corresponds to a series of files on disk. Log entries that need to be persisted are appended to the end of the current file; when the file reaches a certain size, the WAL creates a new disk file so that no single WAL file grows too large.

etcd snapshots its data periodically and appends a record to the WAL for each snapshot. When an etcd node restarts, it locates the record of the last snapshot in the WAL and replays the log entries after that snapshot into the raft module to rebuild its in-memory state.
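
For reference, here is a minimal sketch of creating a WAL and appending records, assuming etcd v3.4's go.etcd.io/etcd/wal package and signatures (v3.5 moved the package under the server module):

// Sketch: create a WAL and append a HardState plus log entries.
// wal.Create fails if the directory already exists.
package main

import (
    "log"

    "go.etcd.io/etcd/raft/raftpb"
    "go.etcd.io/etcd/wal"
    "go.uber.org/zap"
)

func main() {
    lg := zap.NewExample()
    // Create writes the initial metadataType record as the first record
    // of the new WAL file.
    w, err := wal.Create(lg, "/tmp/demo-wal", []byte("node-1"))
    if err != nil {
        log.Fatal(err)
    }
    defer w.Close()

    // Save appends a stateType record for the HardState and one
    // entryType record per log entry, then syncs if required.
    st := raftpb.HardState{Term: 1, Vote: 1, Commit: 1}
    ents := []raftpb.Entry{{Term: 1, Index: 1, Data: []byte("put hello world")}}
    if err := w.Save(st, ents); err != nil {
        log.Fatal(err)
    }
}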

KV database storage

etcd's final, effective data is stored in a KV database, abstracted behind a Backend interface for the storage layer. A Backend implementation must support transactions and multi-version management. The Backend interface is defined as follows:

type Backend interface {
    // Open a read transaction
    ReadTx() ReadTx
    // Open a batch write transaction
    BatchTx() BatchTx
    // Open a read transaction that does not block concurrent readers
    ConcurrentReadTx() ReadTx
    // Snapshot the db
    Snapshot() Snapshot
    Hash(ignores map[IgnoreKey]struct{}) (uint32, error)
    // Physical disk size occupied by the DB; space may be preallocated, so this is not the actual data size
    Size() int64
    // Disk space actually in use
    SizeInUse() int64
    // Number of currently open read transactions
    OpenReadTxN() int64
    // Defragment the data file, reclaiming space held by deleted keys and by old versions of updated keys
    Defrag() error
    ForceCommit()
    Close() error
}

The default implementation of this interface is as follows:

type backend struct {
    // Disk size occupied
    size int64
    // Disk size actually in use
    sizeInUse int64
    // Number of committed transactions
    commits int64
    // Number of currently open read transactions
    openReadTxN int64
    // Read-write lock
    mu sync.RWMutex
    // The underlying storage is boltDB
    db *bolt.DB
    // Interval between batch write commits
    batchInterval time.Duration
    // Maximum number of operations buffered in one batch write transaction
    batchLimit int
    // Buffered batch write transaction
    batchTx *batchTxBuffered
    // Read transaction
    readTx *readTx
    stopc chan struct{}
    donec chan struct{}
    lg *zap.Logger
}

As the implementation shows, etcd's default underlying storage is boltDB. To improve read and write efficiency, etcd maintains a buffer for write transactions: the buffered data is written to disk once the buffer reaches a certain size or once enough time has passed since the last commit.
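
That batching behavior can be pictured as a loop like the one below: a simplified sketch in the spirit of the backend's commit goroutine, not the actual implementation (here the batchLimit early-commit path, driven by the writers themselves, is only noted in a comment):

// Sketch: periodic batch commit. pending is bumped by write transactions;
// commit() flushes the buffered writes to boltDB.
package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

type fakeBackend struct {
    pending       int64
    batchInterval time.Duration
    batchLimit    int64 // writers trigger an early commit once pending reaches this
    stopc         chan struct{}
}

func (b *fakeBackend) commit() {
    if n := atomic.SwapInt64(&b.pending, 0); n > 0 {
        fmt.Printf("committing %d buffered writes\n", n)
    }
}

// run flushes the write buffer every batchInterval until stopped.
func (b *fakeBackend) run() {
    t := time.NewTicker(b.batchInterval)
    defer t.Stop()
    for {
        select {
        case <-t.C:
            b.commit()
        case <-b.stopc:
            b.commit()
            return
        }
    }
}

func main() {
    b := &fakeBackend{batchInterval: 100 * time.Millisecond, batchLimit: 10000, stopc: make(chan struct{})}
    go b.run()
    atomic.AddInt64(&b.pending, 3)
    time.Sleep(250 * time.Millisecond)
    close(b.stopc)
    time.Sleep(50 * time.Millisecond)
}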

Storage summary

After data is submitted by a client, it passes through three storage locations in etcd. It first enters the raft algorithm module, which keeps the log in memory and then notifies etcd to persist it; for efficiency, etcd writes the data to the WAL, and because the underlying WAL files are append-only, with no updates or deletes, the data can no longer be lost once this step completes. The etcd leader then distributes the log to the cluster, and once more than half of the nodes have responded, the data is committed and stored in the backend KV store.

Log synchronization

With etcd's storage design understood, the complete flow of a data change request becomes much easier to follow. Let's walk through the source code.

Request processing

When a client submits a data change request, for example the write request put hello world, the v3 API calls EtcdServer's Put() method, which eventually calls processInternalRaftRequestOnce().

func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*applyResult, error) {
    // Reject if the gap between committed and applied entries exceeds the limit
    ai := s.getAppliedIndex()
    ci := s.getCommittedIndex()
    if ci > ai+maxGapBetweenApplyAndCommitIndex {
        return nil, ErrTooManyRequests
    }
    //Generate a requestID
    r.Header = &pb.RequestHeader{
        ID: s.reqIDGen.Next(),
    }
    authInfo, err := s.AuthInfoFromCtx(ctx)
    if err != nil {
        return nil, err
    }
    if authInfo != nil {
        r.Header.Username = authInfo.Username
        r.Header.AuthRevision = authInfo.Revision
    }
    // Serialize the request data
    data, err := r.Marshal()
    if err != nil {
        return nil, err
    }
    if len(data) > int(s.Cfg.MaxRequestBytes) {
        return nil, ErrRequestTooLarge
    }
    id := r.ID
    if id == 0 {
        id = r.Header.ID
    }
    //Register a channel and wait for the processing to complete
    ch := s.w.Register(id)
    //Set request timeout
    cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
    defer cancel()
    start := time.Now()
    // Submit the request to the raft module via Propose()
    err = s.r.Propose(cctx, data)
    if err != nil {
        proposalsFailed.Inc()
        s.w.Trigger(id, nil) // GC wait
        return nil, err
    }
    proposalsPending.Inc()
    defer proposalsPending.Dec()
    select {
    // Wait for the application result to be returned to the client
    case x := <-ch:
        return x.(*applyResult), nil
    case <-cctx.Done():
        proposalsFailed.Inc()
        s.w.Trigger(id, nil) // GC wait
        return nil, s.parseProposeCtxErr(cctx.Err(), start)
    case <-s.done:
        return nil, ErrStopped
    }
}

In the method above, etcd performs basic validation on the request and then submits it to raft by calling the Propose() method, after which it waits for feedback. In etcd's implementation the result is not returned to the client until the data has been applied to the state machine. Inside Propose(), raft wraps the request in an MsgProp message and calls the Step function.

func (rn *RawNode) Propose(data []byte) error {
    return rn.raft.Step(pb.Message{
        Type: pb.MsgProp,
        From: rn.raft.id,
        Entries: []pb.Entry{
            {Data: data},
        }})
}

etcd only allows the leader to process data change requests, so if a follower receives a command from a client it forwards the command to the leader, waits for the leader's feedback, and then returns the result to the client. We therefore only need to look at the leader's processing logic. The Step() function above eventually calls the raft module's stepLeader(r *raft, m pb.Message) function.

Why execution ends up in stepLeader was covered in the previous article; look back there if you need a refresher.

func stepLeader(r *raft, m pb.Message) error {
    // These message types do not require any progress for m.From.
    switch m.Type {
    case pb.MsgBeat:
        ...
    case pb.MsgCheckQuorum:
        ...
    case pb.MsgProp:
        if len(m.Entries) == 0 {
            r.logger.Panicf("%x stepped empty MsgProp", r.id)
        }
        if r.prs.Progress[r.id] == nil {
            // Judge whether the current node has been removed from the cluster
            return ErrProposalDropped
        }
        if r.leadTransferee != None {
            // If a leader switch is in progress, write is rejected
            return ErrProposalDropped
        }
        for i := range m.Entries {
            //Determine whether there is a log of configuration changes, and if so, do some special processing
        }
        //Append logs to the raft state machine
        if !r.appendEntry(m.Entries...) {
            return ErrProposalDropped
        }
        // Send logs to other nodes in the cluster
        r.bcastAppend()
        return nil
    case pb.MsgReadIndex:
        ...
        return nil
    }
    ...
    ...
    return nil
}

Raft is a protocol based on log replication, so a client's data change request is wrapped in a log entry. In the logic above, some basic checks run first; once they pass, the log entries carried in the Message are appended to raft's log list, and after a successful append the log is broadcast to all followers.

Appending to the raft log

As mentioned in the storage section, the raft algorithm module only keeps logs in memory, so the logic of appendEntry is very simple.

func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
     //1. Get the index of the last log entry of the raft node
    li := r.raftLog.lastIndex()
     //2. Set term and index for new log entries
    for i := range es {
        es[i].Term = r.Term
        es[i].Index = li + 1 + uint64(i)
    }
    // 3. Judge whether the uncommitted log entries exceed the limit. If yes, reject and return failure
    if !r.increaseUncommittedSize(es) {
        return false
    }
    // 4. Append log entries to raftLog
    li = r.raftLog.append(es...)
    // 5. Check and update the log progress
    r.prs.Progress[r.id].MaybeUpdate(li)
    // 6. Judge whether to make a commit
    r.maybeCommit()
    return true
}
  1. Gets the index of the last log entry currently in the raft log.
  2. Log entry indexes in raft increase monotonically, so the new entries receive li+1, li+2, and so on.
  3. etcd caps the number of uncommitted entries on the leader to keep entries from piling up indefinitely when the network between the leader and its followers misbehaves.
  4. Appends the entries to the raftLog in-memory queue and returns the new last index; when the leader appends, the li returned here equals the li obtained in step 1 plus the number of appended entries.
  5. The raft leader tracks the log replication progress of every node, including itself.
  6. The maybeCommit() result is ignored and true is returned so that log broadcasting can begin; a sketch of the quorum calculation behind maybeCommit() follows after this list.
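
For an all-voter cluster, the quorum calculation behind maybeCommit() reduces to taking the median of the sorted Match values. A standalone sketch of that computation (not the actual tracker code, which also handles joint configurations):

// Sketch: computing the quorum-committed index from per-node Match values,
// in the spirit of r.prs.Committed().
package main

import (
    "fmt"
    "sort"
)

func committed(match []uint64) uint64 {
    sorted := append([]uint64(nil), match...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    // sorted[(n-1)/2] is matched by the n-(n-1)/2 nodes at or above it,
    // which is always a majority.
    return sorted[(len(sorted)-1)/2]
}

func main() {
    // 5-node cluster: leader at index 9, followers at 9, 7, 4 and 2.
    fmt.Println(committed([]uint64{9, 9, 7, 4, 2})) // 7
}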

Sync to Follower

After the Leader node stores the log entries in raftLog's memory, the bcastAppend() method triggers a broadcast that synchronizes the log to the followers.

func (r *raft) bcastAppend() {
    // Visit every node and send a log Append message to each one except ourselves
    r.prs.Visit(func(id uint64, _ *tracker.Progress) {
        if id == r.id {
            return
        }
        r.sendAppend(id)
    })
}
func (r *raft) sendAppend(to uint64) {
    r.maybeSendAppend(to, true)
}
func (r *raft) maybeSendAppend(to uint64, sendIfEmpty bool) bool {
    //1. Obtain the current synchronization progress of the peer node
    pr := r.prs.Progress[to]
    if pr.IsPaused() {
        return false
    }
    m := pb.Message{}
    m.To = to
    //2. Note: term here is the term of the entry at pr.Next-1, i.e. the entry immediately preceding this batch, used for the log-matching check
    term, errt := r.raftLog.term(pr.Next - 1)
    ents, erre := r.raftLog.entries(pr.Next, r.maxMsgSize)
    if len(ents) == 0 && !sendIfEmpty {
        return false
    }
    if errt != nil || erre != nil { 
        //3. Failing to get the term or entries means the follower lags too far behind: the logs were already compacted out of raftLog memory after a snapshot
        if !pr.RecentActive {
            r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
            return false
        }
        //4. Send a Snapshot message instead
        m.Type = pb.MsgSnap
        snapshot, err := r.raftLog.snapshot()
        if err != nil {
            if err == ErrSnapshotTemporarilyUnavailable {
                r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
                return false
            }
            panic(err) // TODO(bdarnell)
        }
        if IsEmptySnap(snapshot) {
            panic("need non-empty snapshot")
        }
        m.Snapshot = snapshot
        sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
        pr.BecomeSnapshot(sindex)
    } else {
        //5. Send Append message
        m.Type = pb.MsgApp
        m.Index = pr.Next - 1
        m.LogTerm = term
        m.Entries = ents
        //6. Every log or heartbeat message carries the latest commitIndex
        m.Commit = r.raftLog.committed
        if n := len(m.Entries); n != 0 {
            ...
            ...
        }
    }
    //7. Send message
    r.send(m)
    return true
}

In the logic above, after receiving a new log entry the leader iterates over every follower in the cluster and triggers a log synchronization.

  1. Per the Raft protocol, the leader caches the log replication progress of every follower (see the Progress sketch after this list).
  2. If, when fetching log entries according to that progress, the follower turns out to lag too far behind, which usually happens when a node has just joined or its network connection failed, the needed entries may already have been compacted out of raftLog memory; in that case the leader sends its latest snapshot to the follower instead, which is more efficient.
  3. Under normal circumstances a new MsgApp message carrying the log entries is built for the follower, and finally r.send(m) submits the message.
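
The progress bookkeeping from item 1 boils down to two indexes per follower. The sketch below mirrors tracker.Progress and its MaybeUpdate logic in simplified form: Match is the highest index known to be replicated on the follower, and Next is the next index to send, so pr.Next-1 is the prevLogIndex carried in MsgApp.

// Sketch: per-follower replication progress, simplified from tracker.Progress.
package main

import "fmt"

type progress struct {
    Match, Next uint64
}

// maybeUpdate advances the progress after a successful MsgAppResp
// acknowledging index n.
func (pr *progress) maybeUpdate(n uint64) bool {
    updated := false
    if pr.Match < n {
        pr.Match = n
        updated = true
    }
    if pr.Next < n+1 {
        pr.Next = n + 1
    }
    return updated
}

func main() {
    pr := &progress{Match: 4, Next: 5}
    fmt.Println(pr.maybeUpdate(7), *pr) // true {7 8}
}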

Writing the log to the WAL

As mentioned in the previous article on heartbeat messages, EtcdServer runs a goroutine that watches the raft channel for new Ready data; when it arrives, the msgs inside are sent to their receivers. The MsgApp message above is submitted the same way, so the details are not repeated here.

While the log is being sent to the followers, the leader also flushes it to disk, i.e. writes it to the WAL, by calling the WAL.Save() method.

func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
    //Get the write lock of wal
    w.mu.Lock()
    defer w.mu.Unlock()
    // Nothing to write if the HardState is empty and there are no new entries
    if raft.IsEmptyHardState(st) && len(ents) == 0 {
        return nil
    }
    mustSync := raft.MustSync(st, w.state, len(ents))
    // Write log entries
    for i := range ents {
        if err := w.saveEntry(&ents[i]); err != nil {
            return err
        }
    }
    // Write state change
    if err := w.saveState(&st); err != nil {
        return err
    }
    // Determine whether the file size exceeds the maximum
    curOff, err := w.tail().Seek(0, io.SeekCurrent)
    if err != nil {
        return err
    }
    if curOff < SegmentSizeBytes {
        if mustSync {
            return w.sync()
        }
        return nil
    }
    // Cut over to a new WAL file
    return w.cut()
}

The WAL file structure was described above; for each new log entry, a record of type entryType is appended to the WAL.

Follower log processing

On the leader, sending the log and persisting it happen asynchronously after the command is processed, but that alone is not enough to reply to the client: the Raft protocol requires that an entry be committed before success is returned, so the leader must wait until more than half of the nodes have processed the log and responded. Let's look at the follower's log handling first.

After a log message arrives at a follower, it is likewise handled by the EtcdServer.Process() method and eventually enters the raft module's stepFollower() function.

func stepFollower(r *raft, m pb.Message) error {
    switch m.Type {
    ...
    case pb.MsgApp:
        // Reset the election timeout counter
        r.electionElapsed = 0
        // Record the current leader
        r.lead = m.From
        // Processing log entries
        r.handleAppendEntries(m)
    ...
    }
    ...
}

As with a heartbeat message, the follower first resets its election timeout counter and records the leader, and then processes the log entries.

func (r *raft) handleAppendEntries(m pb.Message) {
    // Determine whether it is an outdated message
    if m.Index < r.raftLog.committed {
        r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: r.raftLog.committed})
        return
    }
    if mlastIndex, ok := r.raftLog.maybeAppend(m.Index, m.LogTerm, m.Commit, m.Entries...); ok {
        // Processing succeeded. Send MsgAppResp to the Leader
        r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
    } else {
        // The index of the log does not match the lastIndex of the Follower, and a reject message is returned
        r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: m.Index, Reject: true, RejectHint: r.raftLog.lastIndex()})
    }
}

The follower asks raftLog to store the log and returns the result to the leader. A follower can fail here in two situations: the term of the entry at the given index does not match the follower's local term at that index, or the smallest index in the batch is greater than the follower's last log index.

The maybeAppend() method above only stores the log in the in-memory queue maintained by raftLog; persistence is asynchronous and essentially follows the same WAL logic as on the leader. One difference: the follower sends its MsgAppResp message only after the WAL save succeeds, whereas the leader sends its messages first and then saves to the WAL.
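
The consistency check inside maybeAppend() is Raft's log-matching rule: the follower accepts a batch only if its own log contains an entry at m.Index with term m.LogTerm. A condensed standalone sketch (stand-in types, not the etcd code):

// Sketch: the log-matching check a follower performs before appending.
package main

import "fmt"

type follower struct {
    terms map[uint64]uint64 // log index -> term of the local entry
}

// matchTerm reports whether the local log has an entry at index with logTerm.
func (f *follower) matchTerm(index, logTerm uint64) bool {
    t, ok := f.terms[index]
    return ok && t == logTerm
}

func main() {
    f := &follower{terms: map[uint64]uint64{1: 1, 2: 1, 3: 2}}
    // prevLogIndex=3, prevLogTerm=2: the batch is accepted.
    fmt.Println(f.matchTerm(3, 2)) // true
    // prevLogIndex=3, prevLogTerm=3: term conflict, the batch is rejected.
    fmt.Println(f.matchTerm(3, 3)) // false
    // prevLogIndex=5: beyond the local log, the batch is rejected.
    fmt.Println(f.matchTerm(5, 2)) // false
}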

Commit

After broadcasting the log to the followers, the leader waits for their MsgAppResp messages; when one arrives, processing again enters the stepLeader function.

func stepLeader(r *raft, m pb.Message) error {
    ...
    ...
    pr := r.prs.Progress[m.From]
    switch m.Type {
    case pb.MsgAppResp:
        pr.RecentActive = true
        if m.Reject {
            //On a reject, back up and resend the log according to the index hinted by the follower
            if pr.MaybeDecrTo(m.Index, m.RejectHint) {
                if pr.State == tracker.StateReplicate {
                    pr.BecomeProbe()
                }
                r.sendAppend(m.From)
            }
        } else {
            oldPaused := pr.IsPaused()
            //Update cached log synchronization progress
            if pr.MaybeUpdate(m.Index) {
                switch {
                case pr.State == tracker.StateProbe:
                    pr.BecomeReplicate()
                case pr.State == tracker.StateSnapshot && pr.Match >= pr.PendingSnapshot:
                    pr.BecomeProbe()
                    pr.BecomeReplicate()
                case pr.State == tracker.StateReplicate:
                    pr.Inflights.FreeLE(m.Index)
                }
                //If the progress is updated, judge and update the commitIndex
                if r.maybeCommit() {
                    //If the commitIndex changes, the log will be sent immediately
                    r.bcastAppend()
                } else if oldPaused {
                    r.sendAppend(m.From)
                }
                // Keep sending any remaining logs to the follower
                for r.maybeSendAppend(m.From, false) {
                }
                // Is leader transfer in progress
                if m.From == r.leadTransferee && pr.Match == r.raftLog.lastIndex() {
                    r.logger.Infof("%x sent MsgTimeoutNow to %x after received MsgAppResp", r.id, m.From)
                    r.sendTimeoutNow(m.From)
                }
            }
        }
    ...
    ...
    return nil
}
func (r *raft) maybeCommit() bool {
    //Get the largest index with more than half of the confirmations
    mci := r.prs.Committed()
    //Update commitIndex
    return r.raftLog.maybeCommit(mci, r.Term)
}

After receiving a follower's reply: if it is a reject, the leader resends the log according to the returned index. If it is a success, the leader updates the cached replication progress and checks whether the index confirmed by more than half of the nodes has advanced; if it has, raftLog is told to update the commitIndex. At this point the client's data update command is officially committed. Finally, let's look at how the data is written to the DB.

Data update (Apply)

As mentioned earlier, EtcdServer starts a goroutine that watches for Ready messages from the raft module. When the commitIndex changes in the previous step, the HardState in the Ready carries a value; etcd takes the CommittedEntries from the Ready structure and hands them to the apply module, which applies them to the backend storage.

func (r *raftNode) start(rh *raftReadyHandler) {
    internalTimeout := time.Second
    go func() {
        defer r.onStop()
        islead := false
        for {
            ...
            case rd := <-r.Ready():
                if rd.SoftState != nil {
                    ...
                    ...
                }
                if len(rd.ReadStates) != 0 {
                    ...
                    ...
                }
                // Generate apply Request
                notifyc := make(chan struct{}, 1)
                ap := apply{
                    entries:  rd.CommittedEntries,
                    snapshot: rd.Snapshot,
                    notifyc:  notifyc,
                }
                // Update the commitIndex of etcdServer cache to the latest value
                updateCommittedIndex(&ap, rh)
                // Apply committed logs to the state machine
                select {
                case r.applyc <- ap:
                case <-r.stopped:
                    return
                }
                if islead {
                    // The leader sends raft messages to peers in parallel with persisting them locally
                    r.transport.Send(r.processMessages(rd.Messages))
                }
                // If there is a snapshot
                if !raft.IsEmptySnap(rd.Snapshot) {
                    ...
                    ...
                }
                //Save hardState and log entries to WAL
                if err := r.storage.Save(rd.HardState, rd.Entries); err != nil {
                    ...
                    ...
                }
                if !raft.IsEmptyHardState(rd.HardState) {
                    proposalsCommitted.Set(float64(rd.HardState.Commit))
                }
                if !raft.IsEmptySnap(rd.Snapshot) {
                    ...
                    ...
                }
                r.raftStorage.Append(rd.Entries)
                if !islead {
                    ...
                    ...
                } else {
                    notifyc <- struct{}{}
                }
                // Signal raft that this Ready has been processed; applied advances and entries move from unstable to stable
                r.Advance()
            case <-r.stopped:
                return
            }
        }
    }()
}

Note that applying the committed log entries to the state machine is done asynchronously. Once the apply completes, the result is written to the channel that was registered when the client call came in, and with that a complete write operation is finished.
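
The register/trigger handshake that connects this asynchronous apply back to the waiting processInternalRaftRequestOnce() call is provided by etcd's pkg/wait; a minimal version of the idea:

// Sketch: a minimal register/trigger map in the spirit of etcd's pkg/wait,
// keyed by the request ID generated for each proposal.
package main

import (
    "fmt"
    "sync"
)

type wait struct {
    mu sync.Mutex
    m  map[uint64]chan interface{}
}

// Register returns a channel the caller blocks on until Trigger fires.
func (w *wait) Register(id uint64) <-chan interface{} {
    w.mu.Lock()
    defer w.mu.Unlock()
    ch := make(chan interface{}, 1)
    w.m[id] = ch
    return ch
}

// Trigger delivers the apply result to the registered waiter.
func (w *wait) Trigger(id uint64, x interface{}) {
    w.mu.Lock()
    ch := w.m[id]
    delete(w.m, id)
    w.mu.Unlock()
    if ch != nil {
        ch <- x
        close(ch)
    }
}

func main() {
    w := &wait{m: make(map[uint64]chan interface{})}
    ch := w.Register(42)
    go w.Trigger(42, "applied") // the apply goroutine finishes
    fmt.Println(<-ch)           // applied
}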
