Principle of distributed algorithm

Keywords: Java Algorithm Interview Distribution

Common generation strategies of distributed ID

The common distributed ID generation strategies are as follows:

  • Database autoincrement ID.
  • UUID generation.
  • Redis's atomic self increasing mode.
  • Split the database horizontally, and set the initial value and the same self increasing step size.
  • Self increment ID for batch application.
  • Snowflake algorithm.
  • Baidu UidGenerator algorithm (custom timestamp based on snowflake algorithm).
  • Meituan Leaf algorithm (database dependent, ZK).

The core idea is to use a 64 bit long number as the globally unique ID. It is widely used in distributed systems, and ID introduces timestamp to maintain self increment and non repetition.

  • 1 bit is meaningless:
    Because if the first bit in the binary is 1, it is all negative, but the IDS we generate are all positive, so the first bit is 0.

  • 41 bit: indicates the timestamp in milliseconds.
    41 bit can represent up to 2 ^ 41 - 1, that is, it can identify 2 ^ 41 - 1 milliseconds. Conversion to adulthood means 69 years.

  • 10 bit: record the working machine id, which means that the service can be deployed on 2 ^ 10 machines at most, that is, 1024 machines.
    However, in 10 bit s, 5 bits represent the machine room id and 5 bits represent the machine id. It means that it can represent 2 ^ 5 machine rooms (32 machine rooms) at most. Each machine room can represent 2 ^ 5 machines (32 machines). It can be split at will. For example, take out 4 digits to identify the business number and the other 6 digits as the machine number. Can be combined at will.

  • 12 bit: This is used to record different IDs generated in the same millisecond.
    The maximum positive integer that 12 bits can represent is 2 ^ 12 - 1 = 4096, that is, 4096 different IDS in the same millisecond can be distinguished by the number represented by 12 bits. That is, the maximum number of IDs generated by the same machine in the same millisecond is 4096

Simply put, if a service of yours is supposed to generate a globally unique id, it can send a request to the system deployed with the SnowFlake algorithm, and the SnowFlake algorithm system will generate the unique id. The SnowFlake algorithm system must first know its machine number (let's say that all 10 bits are used as the working machine id). Then, after receiving the request, the SnowFlake algorithm system will first generate a 64 bit long id by binary bit operation. The first bit of the 64 bits is meaningless. Then use the current timestamp (in milliseconds) to occupy 41 bits, and then set the machine id with 10 bits. Finally, judge the number of requests on the machine in the current machine room in one millisecond. Add a sequence number to the request for id generation as the last 12 bits.

If there are 5000 requests, the request will arrive at the next millisecond

public class SnowflakeIdWorker {

    // ==============================Fields===========================================
    /** Start timestamp (2015-01-01) */
    private final long twepoch = 1420041600000L;

    /** Number of digits occupied by machine id */
    private final long workerIdBits = 5L;

    /** Number of bits occupied by data id */
    private final long datacenterIdBits = 5L;

    /** The number of bits the sequence occupies in the id */
    private final long sequenceBits = 12L;

    /** The maximum machine id supported is 31 (this shift algorithm can quickly calculate the maximum decimal number represented by several binary numbers) */
    private final long maxWorkerId = -1L ^ (-1L << workerIdBits);

    /** The maximum supported data id is 31 */
    private final long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);

    /** The machine ID shifts 12 bits to the left */
    private final long workerIdShift = sequenceBits;

    /** The data id shifts 17 bits to the left (12 + 5) */
    private final long datacenterIdShift = sequenceBits + workerIdBits;

    /** Shift the timestamp 22 bits to the left (5 + 5 + 12) */
    private final long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;

    /** The mask of the generated sequence is 4095 (0b111111 = 0xfff = 4095) */
    private final long sequenceMask = -1L ^ (-1L << sequenceBits);

    /** Work machine ID(0~31) */
    private long workerId;

    /** Data center ID(0~31) */
    private long datacenterId;

    /** Sequence in milliseconds (0 ~ 4095) */
    private long sequence = 0L;

    /** Timestamp of last generated ID */
    private long lastTimestamp = -1L;

    //==============================Constructors=====================================
    /**
     * Constructor
     * @param workerId Job ID (0~31)
     * @param datacenterId Data center ID (0~31)
     */
    public SnowflakeIdWorker(long workerId, long datacenterId) {
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
        }
        this.workerId = workerId;
        this.datacenterId = datacenterId;
    }

    // ==============================Methods==========================================
    /**
     * Get the next ID (the method is thread safe)
     * @return SnowflakeId
     */
    public synchronized long nextId() {
        long timestamp = timeGen();

        //If the current time is less than the timestamp generated by the last ID, it indicates that the system clock has fallback, and an exception should be thrown at this time
        if (timestamp < lastTimestamp) {
            throw new RuntimeException(
                    String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }

        //If it is generated at the same time, the sequence within milliseconds is performed
        if (lastTimestamp == timestamp) {
            sequence = (sequence + 1) & sequenceMask;
            //Sequence overflow in milliseconds
            if (sequence == 0) {
                //Block to the next millisecond and get a new timestamp
                timestamp = tilNextMillis(lastTimestamp);
            }
        }
        //Timestamp change, sequence reset in milliseconds
        else {
            sequence = 0L;
        }

        //Timestamp of last generated ID
        lastTimestamp = timestamp;

        //Shift and put together by or operation to form a 64 bit ID
        return ((timestamp - twepoch) << timestampLeftShift) //
                | (datacenterId << datacenterIdShift) //
                | (workerId << workerIdShift) //
                | sequence;
    }

    /**
     * Block to the next millisecond until a new timestamp is obtained
     * @param lastTimestamp Timestamp of last generated ID
     * @return Current timestamp
     */
    protected long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    /**
     * Returns the current time in milliseconds
     * @return Current time (MS)
     */
    protected long timeGen() {
        return System.currentTimeMillis();
    }

    //==============================Test=============================================
    /** test */
    public static void main(String[] args) {
        SnowflakeIdWorker idWorker = new SnowflakeIdWorker(0, 0);
        for (int i = 0; i < 1000; i++) {
            long id = idWorker.nextId();
            System.out.println(Long.toBinaryString(id));
            System.out.println(id);
        }
    }
}

The advantage of SnowFlake is that it is sorted by time, and there is no ID collision in the whole distributed system (distinguished by data center ID and machine ID), and the efficiency is high. After testing, SnowFlake can generate about 260000 IDS per second. However, depending on the consistency with the system time, if the system time is recalled or changed, it may cause ID conflict or duplication. In fact, there are not so many computer rooms. We can improve the algorithm to optimize the 10bit machine ID into a business table or business related to our system.

In fact, for the generation strategy of distributed ID. No matter which one we mentioned above. Nothing more than the following two characteristics. Self increasing and non repeating. For non repetitive and self increasing, it is easy to think of time, and the snowflake algorithm is based on timestamp. However, it is obviously unreasonable to directly use millisecond concurrency. Then we will do some articles on this timestamp. As for how to keep this thing unique and self increasing. I'm about to open my brain hole. You can see that the snowflake algorithm is implemented based on synchronized locks.

Posted by formlesstree4 on Sun, 05 Dec 2021 21:39:01 -0800