Distributed sequence number generator: snowflake algorithm

Keywords: cloud computing

A globally unique ID gives every element in a distributed system its own unique identifier.

1. UUID

UUID overview

UUID stands for universally unique identifier. A UUID is calculated from the current time, a counter and a hardware identifier (usually the MAC address of a network card).

Format & version

UUID consists of a combination of the following parts:

  1. Current date and time. The first part of the UUID is derived from the time, so if you generate another UUID a few seconds later on the same machine, the first part differs while the rest stays the same.
  2. Clock sequence.
  3. The globally unique IEEE machine identification number. If the machine has a network card, it is taken from the card's MAC address; otherwise it is obtained in some other way.
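
For a time-based UUID, these three parts can be read back with java.util.UUID. The following is a minimal sketch: the class name UuidV1Fields is only illustrative, and the sample value is the DNS namespace UUID from RFC 4122, chosen because it is a version 1 UUID.

```java
import java.util.UUID;

// Illustrative helper (hypothetical name): prints the three parts of a version 1 UUID.
final class UuidV1Fields {
    // Sample input: the DNS namespace UUID defined in RFC 4122, which is a version 1 UUID.
    static final UUID SAMPLE = UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8");

    public static void main(String[] args) {
        // timestamp(), clockSequence() and node() throw UnsupportedOperationException
        // when the UUID is not version 1.
        System.out.println("version        = " + SAMPLE.version());
        // The timestamp is counted in 100-nanosecond units since 1582-10-15 UTC.
        System.out.println("timestamp      = " + SAMPLE.timestamp());
        System.out.println("clock sequence = " + SAMPLE.clockSequence());
        System.out.println("node (hex)     = " + Long.toHexString(SAMPLE.node()));
    }
}
```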
 

A UUID is composed of 32 hexadecimal digits displayed in five groups separated by hyphens. The form is 8-4-4-4-12, for a total of 36 characters (32 hexadecimal characters and four hyphens). For example:

aefbbd3a-9cc5-4655-8363-a2a43e6e6c80
xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx

The digit M indicates the UUID version. RFC 4122 defines five versions, so the possible values of M are 1, 2, 3, 4 and 5.

The most significant bits of the digit N indicate the UUID variant. For the variant described here the top two bits are fixed as 10 (10xx), so N can only be 8, 9, a or b.
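
As a small illustration, the M and N digits of the example above can be inspected directly, and java.util.UUID exposes the same information through version() and variant(). The class and method names in this sketch are illustrative only.

```java
import java.util.UUID;

// Illustrative helper (hypothetical name) for looking at the M and N digits.
final class UuidVersionVariant {
    // Digit M sits at index 14 of the canonical 36-character form.
    static char versionDigit(UUID u) {
        return u.toString().charAt(14);
    }

    // Digit N sits at index 19 of the canonical 36-character form.
    static char variantDigit(UUID u) {
        return u.toString().charAt(19);
    }

    public static void main(String[] args) {
        UUID u = UUID.fromString("aefbbd3a-9cc5-4655-8363-a2a43e6e6c80");
        System.out.println("M = " + versionDigit(u) + ", version() = " + u.version());
        // variant() returns 2 for the 10xx variant, whatever the remaining two bits of N are.
        System.out.println("N = " + variantDigit(u) + ", variant() = " + u.variant());
    }
}
```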

The five versions use different algorithms and different input information to generate a UUID. Each version has its own advantages and suits different scenarios:

  • version 1, date-time & MAC address

    The time-based UUID is computed from the current timestamp, a random number and a node ID (the machine's MAC address). Because the MAC address is part of the algorithm, this version can ensure global uniqueness. However, exposing the MAC address raises security concerns, which is the main criticism of this version. Version 1 also does not consider two processes running on the same machine, nor concurrent requests within the same timestamp, so strict Version 1 implementations are rare. Variants of Version 1 include Hibernate's CustomVersionOneStrategy.java, MongoDB's ObjectId.java, Twitter's snowflake, etc.

  • version 2, date-time & group/user id

    The DCE (Distributed Computing Environment) security UUID uses the same algorithm as the time-based UUID, except that the first four bytes of the timestamp are replaced by the POSIX UID or GID. This version is rarely used in practice.

  • version 3, MD5 hash & namespace

    The name-based UUID is obtained by calculating the MD5 hash of the namespace and the name. This version ensures that different names in the same namespace produce different UUIDs, that the same name in different namespaces produces different UUIDs, and that the same name in the same namespace always produces the same UUID.

  • version 4, pseudo-random number

    The UUID is generated from random or pseudo-random numbers.

  • version 5, SHA-1 hash & namespace

    It is similar to the version 3 algorithm, except that SHA-1 (Secure Hash Algorithm 1) is used to calculate the hash value.

Version 1 and version 4 are the most widely used. Version 1 uses the current timestamp and the MAC address; version 4 uses (pseudo) random numbers. In a version 4 UUID, 4 of the 128 bits are fixed by the version and 2 by the variant, and the remaining 122 bits are (pseudo) random. If you want to always generate the same UUID for a given string, use version 3 or version 5.
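
java.util.UUID offers a factory for version 3 but none for version 5. The following is a minimal sketch of how a version 5 UUID could be built by hand, assuming the usual scheme of hashing the 16 namespace bytes followed by the name with SHA-1 and keeping the first 16 bytes of the digest. The class and method names are hypothetical, and this is an illustration rather than a vetted implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical helper: name-based UUID using SHA-1 (version 5).
final class UuidV5Sketch {
    static UUID nameUuidV5(UUID namespace, String name) {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        // Hash the 16 namespace bytes followed by the UTF-8 bytes of the name.
        sha1.update(ByteBuffer.allocate(16)
                .putLong(namespace.getMostSignificantBits())
                .putLong(namespace.getLeastSignificantBits())
                .array());
        byte[] hash = sha1.digest(name.getBytes(StandardCharsets.UTF_8));

        hash[6] &= 0x0f;  // clear version
        hash[6] |= 0x50;  // set to version 5
        hash[8] &= 0x3f;  // clear variant
        hash[8] |= 0x80;  // set to IETF variant

        // Only the first 16 of the 20 digest bytes are used.
        ByteBuffer first16 = ByteBuffer.wrap(hash, 0, 16);
        return new UUID(first16.getLong(), first16.getLong());
    }

    public static void main(String[] args) {
        // The namespace here is the DNS namespace UUID from RFC 4122.
        UUID dns = UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8");
        System.out.println(nameUuidV5(dns, "example.org"));
        System.out.println(nameUuidV5(dns, "example.org")); // same name, same UUID
    }
}
```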

Repetition probability

In Java, java.util.UUID.randomUUID() generates version 4 UUIDs, so of the 128 bits of such a UUID, 122 are randomly generated, 4 identify the version and 2 identify the variant. Using the birthday paradox, the probability that at least two of n UUIDs have the same value is approximately
$$p(n) \approx 1 - e^{-n^{2}/(2x)}$$

Where x is the value range of UUIDs and n is the number of UUIDs.

The following table uses x = 2^122 and gives the collision probability after generating n UUIDs:

n                              probability
68,719,476,736 = 2^36          0.0000000000000004 (4 × 10^-16)
2,199,023,255,552 = 2^41       0.0000000000004 (4 × 10^-13)
70,368,744,177,664 = 2^46      0.0000000004 (4 × 10^-10)
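
The values in the table can be reproduced with a few lines of Java. This is a minimal sketch of the birthday approximation with x = 2^122; the class and method names are illustrative, and Math.expm1 is used so that the very small results are not lost to rounding.

```java
// Illustrative helper (hypothetical name) for the birthday approximation.
final class UuidCollision {
    // p(n) = 1 - e^(-n^2 / (2x)), written with expm1 to stay accurate for tiny values.
    static double collisionProbability(double n, double x) {
        return -Math.expm1(-(n * n) / (2.0 * x));
    }

    public static void main(String[] args) {
        double x = Math.pow(2, 122); // number of possible version 4 UUIDs
        for (int exponent : new int[] {36, 41, 46}) {
            double n = Math.pow(2, exponent);
            System.out.printf("n = 2^%d  p = %.1e%n", exponent, collisionProbability(n, x));
        }
    }
}
```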

The probability of a duplicate UUID causing an error is so low that in practice the problem does not need to be considered.

The probability also depends on the quality of the random number generator. To avoid increasing the chance of repetition, a cryptographically strong pseudo-random number generator must be used to generate the values.

A UUID consists of 32 hexadecimal digits, so the theoretical total number of UUIDs is 16^32 = 2^128, approximately 3.4 × 10^38. In other words, if 1 trillion UUIDs were generated every nanosecond, it would take about 10 billion years to run out of all UUIDs.
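
Figures of this size are easy to check with BigInteger. The sketch below assumes a 365.25-day year and takes the generation rate as a parameter; the class and method names are illustrative.

```java
import java.math.BigInteger;

// Illustrative helper (hypothetical name) for sanity-checking the size of the UUID space.
final class UuidExhaustion {
    // 16^32 = 2^128 possible values.
    static final BigInteger TOTAL_UUIDS = BigInteger.ONE.shiftLeft(128);

    // 365.25 days * 86,400 seconds * 10^9 nanoseconds (assumed year length).
    static final BigInteger NANOS_PER_YEAR =
            BigInteger.valueOf(31_557_600L).multiply(BigInteger.TEN.pow(9));

    // Whole years needed to use up every UUID at the given rate per nanosecond.
    static BigInteger yearsToExhaust(BigInteger uuidsPerNanosecond) {
        return TOTAL_UUIDS.divide(uuidsPerNanosecond.multiply(NANOS_PER_YEAR));
    }

    public static void main(String[] args) {
        System.out.println("total UUIDs = " + TOTAL_UUIDS);
        System.out.println("years at 10^12 per ns = " + yearsToExhaust(BigInteger.TEN.pow(12)));
    }
}
```

At a rate of 10^12 UUIDs per nanosecond the result is a little over 10^10 years.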

Java implementation
/**
 * Static factory to retrieve a type 4 (pseudo randomly generated) UUID.
 *
 * The {@code UUID} is generated using a cryptographically strong pseudo
 * random number generator.
 *
 * @return  A randomly generated {@code UUID}
 */
public static UUID randomUUID() {
    SecureRandom ng = Holder.numberGenerator;

    byte[] randomBytes = new byte[16];
    ng.nextBytes(randomBytes);
    randomBytes[6]  &= 0x0f;  /* clear version        */
    randomBytes[6]  |= 0x40;  /* set to version 4     */
    randomBytes[8]  &= 0x3f;  /* clear variant        */
    randomBytes[8]  |= 0x80;  /* set to IETF variant  */
    return new UUID(randomBytes);
}

/**
 * Static factory to retrieve a type 3 (name based) {@code UUID} based on
 * the specified byte array.
 * Note: this version 3 factory always generates the same UUID for the same name.
 *
 * @param  name
 *         A byte array to be used to construct a {@code UUID}
 *
 * @return  A {@code UUID} generated from the specified array
 */
public static UUID nameUUIDFromBytes(byte[] name) {
    MessageDigest md;
    try {
        md = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException nsae) {
        throw new InternalError("MD5 not supported", nsae);
    }
    byte[] md5Bytes = md.digest(name);
    md5Bytes[6]  &= 0x0f;  /* clear version        */
    md5Bytes[6]  |= 0x30;  /* set to version 3     */
    md5Bytes[8]  &= 0x3f;  /* clear variant        */
    md5Bytes[8]  |= 0x80;  /* set to IETF variant  */
    return new UUID(md5Bytes);
}
Generate UUID
// Java language implementation
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UUIDProvider {
    public static void main(String[] args) {
        // Version 4 UUID generated from pseudo-random numbers; the variant digit N is 8, 9, a or b
        System.out.println(UUID.randomUUID());

        // Version 3 UUID: the same name always produces the same UUID.
        // For the name "xxx" the result is always f561aaf6-ef0b-314d-8208-bb46a4ccb3ad
        System.out.println(UUID.nameUUIDFromBytes("xxx".getBytes(StandardCharsets.UTF_8)));
    }
}
Advantages
  • The code is simple and convenient.
  • ID generation performs very well and is essentially never a bottleneck. IDs are generated locally, with no network cost.
  • IDs are globally unique, so data migration, system data consolidation and database changes can be handled without ID conflicts.
Disadvantages
  • The IDs are meaningless strings with no ordering, so they cannot be guaranteed to trend upward.
  • UUIDs are usually stored as strings, so query efficiency is low when the amount of data is large.
  • The storage footprint is relatively large. For a massive database, storage capacity needs to be considered.
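
One common way to reduce the storage cost, which is an assumption of this note rather than something covered above, is to store the 128 bits as 16 raw bytes instead of the 36-character string. A minimal sketch of the conversion, with hypothetical class and method names:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper: converts between a UUID and its 16-byte binary form.
final class UuidBytes {
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        byte[] raw = toBytes(uuid);
        System.out.println("string form: " + uuid.toString().length() + " characters");
        System.out.println("binary form: " + raw.length + " bytes");
        System.out.println("round trip ok: " + uuid.equals(fromBytes(raw)));
    }
}
```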

2. Snowflake algorithm (twitter/snowflake)

Overview of snowflake algorithm

The Snowflake algorithm is an open source distributed ID generation algorithm from Twitter. Its core idea is to use a 64-bit long number as the globally unique ID. It is widely used in distributed systems, and because the ID includes a timestamp, it is basically self-increasing. The original version is written in Scala, and ports to many other languages, such as Java and C++, have followed.

Format

  • 1 bit - unused sign bit

  • 41 bits - timestamp (in milliseconds)

    • 41 bits can represent 2^41 - 1 values;
    • 2^41 - 1 milliseconds is about 69 years.
  • 10 bits - working machine ID

    • 5 bits - datacenterId (machine room ID)
    • 5 bits - workerId (machine ID)
  • 12 bits - serial number

    The serial number distinguishes the different IDs generated within the same millisecond on the same machine in the same data center.
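
The layout above can be sketched as shifts and masks. The following is a minimal illustration that packs the four fields into a long and reads them back; the class and method names are hypothetical, and the timestamp argument is taken to be the number of milliseconds since whatever epoch the generator uses.

```java
// Illustrative helper (hypothetical name) for the 1 + 41 + 5 + 5 + 12 bit layout.
final class SnowflakeLayout {
    static final int SEQUENCE_BITS = 12;
    static final int WORKER_BITS = 5;
    static final int DATACENTER_BITS = 5;

    static final int WORKER_SHIFT = SEQUENCE_BITS;                          // 12
    static final int DATACENTER_SHIFT = WORKER_SHIFT + WORKER_BITS;         // 17
    static final int TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS;  // 22

    // Largest value that fits in the given number of bits.
    static long mask(int bits) {
        return (1L << bits) - 1;
    }

    // Packs the fields; each argument is assumed to already fit in its bit range.
    static long compose(long millisSinceEpoch, long datacenterId, long workerId, long sequence) {
        return (millisSinceEpoch << TIMESTAMP_SHIFT)
                | (datacenterId << DATACENTER_SHIFT)
                | (workerId << WORKER_SHIFT)
                | sequence;
    }

    static long timestampOf(long id)  { return id >>> TIMESTAMP_SHIFT; }
    static long datacenterOf(long id) { return (id >>> DATACENTER_SHIFT) & mask(DATACENTER_BITS); }
    static long workerOf(long id)     { return (id >>> WORKER_SHIFT) & mask(WORKER_BITS); }
    static long sequenceOf(long id)   { return id & mask(SEQUENCE_BITS); }

    public static void main(String[] args) {
        long id = compose(123_456_789L, 3, 17, 42);
        System.out.println(id + " -> " + timestampOf(id) + " / " + datacenterOf(id)
                + " / " + workerOf(id) + " / " + sequenceOf(id));
    }
}
```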

Features (self increasing, orderly, suitable for distributed scenarios)
  • Time bits: IDs can be sorted by time, which helps to improve query speed.
  • Machine ID bits: they identify each node in a multi-node distributed environment. The 10 machine bits can be divided according to the number of nodes and the deployment; for example, 5 of them can represent the process.
  • Serial number bits: a self-incrementing counter that lets the same node generate multiple IDs within the same millisecond. The 12-bit serial number supports 4096 IDs per millisecond per node.
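
The capacity figures implied by these bit widths can be worked out with simple arithmetic. This is a minimal sketch assuming a 365.25-day year; the class and method names are illustrative, and changing the bit widths shows how a different allocation would trade lifetime against node count and throughput.

```java
// Illustrative helper (hypothetical name) for capacity arithmetic on the bit widths.
final class SnowflakeCapacity {
    // Number of distinct values that fit in the given number of bits.
    static long valueCount(int bits) {
        return 1L << bits;
    }

    // Years covered by a millisecond timestamp of the given width.
    static double yearsCovered(int timestampBits) {
        double millisPerYear = 1000.0 * 60 * 60 * 24 * 365.25;
        return (valueCount(timestampBits) - 1) / millisPerYear;
    }

    public static void main(String[] args) {
        System.out.printf("41 timestamp bits: about %.1f years%n", yearsCovered(41));
        System.out.println("10 machine bits:   " + valueCount(10) + " nodes");
        System.out.println("12 sequence bits:  " + valueCount(12) + " IDs per millisecond per node");
        System.out.println("per node, per second: " + valueCount(12) * 1000 + " IDs");
    }
}
```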

The snowflake algorithm can be modified according to the project situation and its own needs.

Twitter algorithm implementation (Scala)

Java algorithm implementation
public class IdWorker{

    // 10-bit working machine ID
    private long workerId;       // 5 bits
    private long datacenterId;   // 5 bits

    private long sequence; // 12-bit serial number

    public IdWorker(long workerId, long datacenterId, long sequence){
        // sanity check for workerId
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0",maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(String.format("datacenter Id can't be greater than %d or less than 0",maxDatacenterId));
        }
        System.out.printf("worker starting. timestamp left shift %d, datacenter id bits %d, worker id bits %d, sequence bits %d, workerid %d%n",
                timestampLeftShift, datacenterIdBits, workerIdBits, sequenceBits, workerId);

        this.workerId = workerId;
        this.datacenterId = datacenterId;
        this.sequence = sequence;
    }

    // Initial timestamp (custom epoch)
    private long twepoch = 1288834974657L;

    // Each machine ID field is 5 bits long
    private long workerIdBits = 5L;
    private long datacenterIdBits = 5L;
    // Shift -1 left by 5 bits, then XOR the result with -1: this yields the largest value that fits in 5 bits.
    private long maxWorkerId = -1L ^ (-1L << workerIdBits); // 31
    private long maxDatacenterId = -1L ^ (-1L << datacenterIdBits); // 31
    // Serial number length in bits
    private long sequenceBits = 12L;
    // Maximum serial number, also used as a bit mask
    private long sequenceMask = -1L ^ (-1L << sequenceBits); // 4095

    // workerId is shifted left by 12 bits
    private long workerIdShift = sequenceBits; // 12
    // datacenterId is shifted left by 17 bits
    private long datacenterIdShift = sequenceBits + workerIdBits; // 12+5=17
    // The timestamp is shifted left by 22 bits
    private long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits; // 12+5+5=22

    // Last timestamp; the initial value is negative
    private long lastTimestamp = -1L;

    public long getWorkerId(){
        return workerId;
    }

    public long getDatacenterId(){
        return datacenterId;
    }

    public long getTimestamp(){
        return System.currentTimeMillis();
    }

    // Generates the next ID
    public synchronized long nextId() {
        long timestamp = timeGen();

        // If the current timestamp is less than the last one, the clock has moved backwards
        if (timestamp < lastTimestamp) {
            System.err.printf("clock is moving backwards.  Rejecting requests until %d.%n", lastTimestamp);
            throw new RuntimeException(String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds",
                    lastTimestamp - timestamp));
        }

        // Same millisecond as the last ID: add one to the serial number. Otherwise restart the serial number from 0.
        if (lastTimestamp == timestamp) {
            // The bitwise AND keeps the result in the range 0-4095
            sequence = (sequence + 1) & sequenceMask;
            if (sequence == 0) {
                // The serial numbers for this millisecond are used up: wait for the next millisecond
                timestamp = tilNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0;
        }

        // Refresh the last timestamp
        lastTimestamp = timestamp;

        /*
         * Return value:
         * ((timestamp - twepoch) << timestampLeftShift) subtracts the initial timestamp from the current one, then shifts the result left into the timestamp bits
         * (datacenterId << datacenterIdShift) shifts the datacenter ID left into its bits
         * (workerId << workerIdShift) shifts the worker ID left into its bits
         * | is the bitwise OR operator: each bit of x | y is 0 only when both input bits are 0, and 1 otherwise.
         * Each part is meaningful only in its own bit range and 0 elsewhere, so OR-ing the parts together yields the final ID
         */
        return ((timestamp - twepoch) << timestampLeftShift) |
                (datacenterId << datacenterIdShift) |
                (workerId << workerIdShift) |
                sequence;
    }

    // Busy-waits until the clock has passed lastTimestamp, then returns the new timestamp
    private long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    //Get system timestamp
    private long timeGen(){
        return System.currentTimeMillis();
    }

    //---------------Testing---------------
    public static void main(String[] args) {
        IdWorker worker = new IdWorker(1,1,1);
        for (int i = 0; i < 30; i++) {
            System.out.println(worker.nextId());
        }
    }

}
Advantages
  • The millisecond timestamp is in the high-order bits and the self-incrementing serial number is in the low-order bits, so the whole ID trends upward.
  • It does not rely on third-party systems such as databases. It is deployed as a service, which gives higher stability and very high ID generation performance.
  • The bits can be allocated according to the characteristics of your own business, which is very flexible.
Disadvantages
  • Snowflake IDs increase on a single machine, but with multiple nodes in a distributed system the clocks of all nodes cannot be guaranteed to be perfectly synchronized, so the IDs may not increase globally. If the system clock is rolled back or changed, IDs may conflict or be duplicated.
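
One possible mitigation for small clock rollbacks, which is an assumption of this note and not part of IdWorker above, is to wait for the clock to catch up when it has only moved back a few milliseconds and to reject requests only for larger jumps. The sketch below uses a hypothetical maxBackwardMs setting, an injectable clock so the behavior can be exercised without touching the system time, and a single 10-bit worker field for brevity.

```java
import java.util.function.LongSupplier;

// Hypothetical variant of a snowflake worker that tolerates small clock rollbacks.
final class RollbackTolerantIdWorker {
    private static final long EPOCH = 1288834974657L;   // same twepoch as IdWorker
    private static final long SEQUENCE_BITS = 12L;
    private static final long WORKER_BITS = 10L;        // datacenter + worker bits combined
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private final long maxBackwardMs;   // hypothetical tuning knob
    private final LongSupplier clock;   // e.g. System::currentTimeMillis
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    RollbackTolerantIdWorker(long workerId, long maxBackwardMs, LongSupplier clock) {
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId out of range: " + workerId);
        }
        this.workerId = workerId;
        this.maxBackwardMs = maxBackwardMs;
        this.clock = clock;
    }

    synchronized long nextId() {
        long now = clock.getAsLong();
        if (now < lastTimestamp) {
            long drift = lastTimestamp - now;
            if (drift > maxBackwardMs) {
                throw new IllegalStateException("Clock moved backwards by " + drift + " ms");
            }
            // Small rollback: wait until the clock reaches the last timestamp again.
            now = waitUntil(lastTimestamp);
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // Serial numbers for this millisecond are used up: move to the next one.
                now = waitUntil(lastTimestamp + 1);
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (SEQUENCE_BITS + WORKER_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    // Busy-waits until the clock returns at least the target value.
    private long waitUntil(long target) {
        long now = clock.getAsLong();
        while (now < target) {
            now = clock.getAsLong();
        }
        return now;
    }

    public static void main(String[] args) {
        RollbackTolerantIdWorker worker =
                new RollbackTolerantIdWorker(1, 5, System::currentTimeMillis);
        for (int i = 0; i < 5; i++) {
            System.out.println(worker.nextId());
        }
    }
}
```

Waiting keeps IDs unique at the cost of blocking callers for up to maxBackwardMs, so the threshold should stay small.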

3. Use the auto_increment property of the database

Taking MySQL as an example, set auto_increment_increment and auto_increment_offset for the field to ensure that the ID increases automatically. Each business then reads and writes MySQL with SQL to get an ID number.
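
How the two settings keep several database nodes from issuing the same ID can be illustrated without a database. The following is an in-memory simulation only, with hypothetical class and method names: each node starts at its own offset and advances by the shared increment, mirroring the offset + N × increment pattern, and no real MySQL connection is involved.

```java
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in (hypothetical name) for one database node's auto-increment counter.
final class SteppedSequence {
    private final long increment;   // plays the role of auto_increment_increment
    private final AtomicLong next;  // starts at the node's auto_increment_offset

    SteppedSequence(long offset, long increment) {
        if (increment < 1 || offset < 1 || offset > increment) {
            throw new IllegalArgumentException("need 1 <= offset <= increment");
        }
        this.increment = increment;
        this.next = new AtomicLong(offset);
    }

    // Returns the current value and advances by the step.
    long nextId() {
        return next.getAndAdd(increment);
    }

    public static void main(String[] args) {
        // Two nodes sharing increment 2, with offsets 1 and 2.
        SteppedSequence nodeA = new SteppedSequence(1, 2);
        SteppedSequence nodeB = new SteppedSequence(2, 2);
        for (int i = 0; i < 3; i++) {
            System.out.println("A: " + nodeA.nextId() + "  B: " + nodeB.nextId());
        }
    }
}
```

With two nodes, one issues odd IDs and the other even IDs, so their ranges never overlap.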

Advantages
  • It is very simple: it uses the functions of the existing database system, has low cost and is maintained by professional DBAs.
  • The ID numbers increase monotonically, which suits services with special requirements for IDs.
Disadvantages
  • Strong dependence on the DB. When the DB is abnormal, the whole system is unavailable, which is a fatal problem. Configuring master-slave replication can improve availability, but data consistency is hard to guarantee in special circumstances: inconsistency during a master-slave switch may lead to duplicate IDs being issued.
  • ID issuing performance is bottlenecked by the read-write performance of a single MySQL instance.
  • Splitting tables and databases, and migrating and merging data, are troublesome.

Posted by oughost on Wed, 10 Nov 2021 17:13:23 -0800