Redis principles: data structures

Introduction

Redis is an in-memory NoSQL database, often used to implement caching, distributed sessions, leaderboards, message queues, and so on. As an in-memory database, how does Redis make full use of memory? How does it achieve high performance? How does it support so many features? You may find the answers in Redis's data structure design.

String sds

Redis's usage scenarios mean that strings are read and modified frequently, so the string type is designed primarily for performance, with safety and functionality as secondary goals. The C string does not fit these scenarios, so Redis built a custom type: the simple dynamic string, SDS.

SDS is defined as follows:

struct sdshdr {
    // bytes used
    int len;
    // bytes unused
    int free;
    // byte array
    char buf[];
};

Let's see how SDS achieves high performance, safety, and functionality, looking at both its definition and its API.

1. High performance

C obtains a string's length by traversing the character array, which is O(n). SDS reads len directly, which is O(1). On this point, SDS simply crushes the C string.

When growing or shrinking a string, a C string reallocates or frees memory on every operation, sometimes involving system calls. For rarely modified strings the performance gap is not obvious, but Redis workloads modify strings frequently, so Redis also optimizes the memory allocation. When reallocating, extra memory is reserved: below 1 MB, the space is doubled; at 1 MB or more, an extra 1 MB is allocated. When shrinking, memory is not released immediately; free is updated instead, so the surplus can be reused by the next modification (lazy free).
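
To make the rule concrete, here is a minimal sketch of the growth policy, loosely modeled on Redis's sdsMakeRoomFor; the function name and layout are illustrative, not the real source:

#include <stddef.h>

#define SDS_MAX_PREALLOC (1024 * 1024)  /* 1 MB */

/* Illustrative sketch: how much space to allocate when an SDS must
 * grow to hold newlen bytes in total. */
static size_t sds_grow_size(size_t newlen) {
    if (newlen < SDS_MAX_PREALLOC)
        return newlen * 2;              /* under 1 MB: double it */
    return newlen + SDS_MAX_PREALLOC;   /* 1 MB or more: add 1 MB */
}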

2. Security

C strings require memory to be managed by hand. If enough space is not allocated before a write, the data following the string gets overwritten, causing a buffer overflow. SDS applies for memory automatically, which is safe and reliable.

3. Functionality

A C string is a sequence of characters terminated by a null byte ('\0'), which is restrictive: it cannot contain '\0' itself, so it can only hold text. SDS can store arbitrary binary data, which is far more flexible. SDS also stays compatible with some C string functions, since the buffer still ends with '\0'.

Linked list

Since C has no built-in linked list, Redis implements its own doubly linked list with head and tail pointers, supporting sequential access and range access. The structure is similar to Java's LinkedList, so it is not detailed here.

Dictionary dict

Redis is a KV (key-value) database, so a KV data structure is essential. Redis's dictionary is the concrete implementation of KV storage.

dict is defined as follows:

typedef struct dict {
    // type-specific functions (not discussed here)
    dictType *type;
    // private data (not discussed here)
    void *privdata;
    // two hash tables, ht[0] and ht[1]
    dictht ht[2];
    // progressive rehash index; -1 when no rehash is in progress
    int rehashidx;
} dict;

typedef struct dictht {
    // bucket array
    dictEntry **table;
    // table size
    unsigned long size;
    // size mask, always size - 1, used to compute the bucket index
    unsigned long sizemask;
    // number of nodes in use
    unsigned long used;
} dictht;

typedef struct dictEntry {
    // key
    void *key;
    // value: a pointer, an unsigned integer, or a signed integer
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
    } v;
    // next node in the same bucket (chaining)
    struct dictEntry *next;
} dictEntry;
Overall, it resembles Java's HashMap: an array of buckets plus chained linked lists. The real difference lies in rehashing. As is well known, rehashing a dictionary that holds a large amount of data in one go is very time-consuming. Redis is single-threaded, so a long rehash would inevitably block everything, and Redis cannot afford that. So Redis invented progressive rehashing.

Progressive rehash means the rehash is not completed in one pass but in small batches. The two hash tables and the rehashidx field in the dict exist precisely to implement it. In the normal state, all reads and writes go to ht[0] and ht[1] is untouched. When a progressive rehash starts, ht[1] is first allocated at double the size, and then on every read or write one bucket is migrated from ht[0] to ht[1]. Migrating one bucket at a time avoids blocking. rehashidx records how far the migration has progressed; when every bucket has been moved, ht[1] replaces ht[0] and rehashidx is reset to -1. During the rehash, lookups check ht[0] first and then ht[1], while new insertions go directly into ht[1].
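
The following is a simplified sketch of a single migration step, loosely modeled on Redis's dictRehash; the hash function is assumed, and bounds checks and edge cases are omitted:

/* Assumed: the hash function registered in dictType. */
unsigned long hash(const void *key);

/* Illustrative sketch: migrate one bucket of ht[0] into ht[1],
 * then advance rehashidx. Called once per read or write. */
void dictRehashStep(dict *d) {
    if (d->rehashidx == -1) return;               /* no rehash running */
    while (d->ht[0].table[d->rehashidx] == NULL)  /* skip empty buckets */
        d->rehashidx++;
    dictEntry *entry = d->ht[0].table[d->rehashidx];
    while (entry != NULL) {
        dictEntry *next = entry->next;
        unsigned long idx = hash(entry->key) & d->ht[1].sizemask;
        entry->next = d->ht[1].table[idx];        /* head insertion */
        d->ht[1].table[idx] = entry;
        d->ht[0].used--;
        d->ht[1].used++;
        entry = next;
    }
    d->ht[0].table[d->rehashidx] = NULL;
    d->rehashidx++;
}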

To sum up, the characteristics are as follows:

1. Array + linked-list structure

2. Head insertion into bucket chains

3. Progressive rehash

Skip list

A skip list is an ordered data structure supporting forward search, reverse search, and range search, with O(log n) average and O(n) worst-case complexity. Its query speed is comparable to balanced trees, yet it is much simpler to implement. It is one of the underlying implementations of Redis's sorted set.

Each skip list node is a doubly-linked-list node extended with a forward array. forward[level] records the next node at that level, so a search can jump ahead quickly through forward[level]. Level forward[0] is the complete linked list of all data; combined with the backward pointer, this supports range search, forward traversal, and reverse traversal, which is exactly what backs the zrange and zrevrange commands. A Java implementation of the skip list is at the end of this article for interested readers.

To sum up, the characteristics are as follows:

1. Fast query speed

2. Extra memory overhead: each node carries several pointers to other nodes.

3. Traversal in reverse order

4. Range traversal

Compressed list ziplist

To compress memory further, Redis developed the ziplist. A ziplist can contain any number of entries, and each entry can hold a byte array or an integer.

Each entry uses one byte (or five bytes, once the previous entry reaches 254 bytes) to record the length of the previous entry, so the previous entry can be located quickly via (current position - previous entry's length). This works like a "previous pointer" without any memory fragmentation, achieving the compression effect, and it is what makes tail-to-head traversal possible. The structure also has a defect: suppose the previous entry grows from under 254 bytes to 254 or more, while the current entry records the previous length in one byte. That field must grow to five bytes, lengthening the current entry; and if the current entry thereby crosses the threshold too, the next entry must grow as well. This produces a "cascade update" reaction that hurts performance.
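
A sketch of the previous-length bookkeeping described above (simplified; the real encoding lives in ziplist.c):

#include <stddef.h>

/* Illustrative sketch: a previous length under 254 fits in 1 byte;
 * otherwise a 0xFE marker plus a 4-byte length is used, 5 bytes total. */
static size_t prevlen_field_size(size_t prevlen) {
    return prevlen < 254 ? 1 : 5;
}

/* Tail-to-head traversal: the previous entry starts at
 * (current position - previous entry's length). */
static unsigned char *prev_entry(unsigned char *p, size_t prevlen) {
    return p - prevlen;
}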

To sum up, the characteristics are as follows:

1. Compact memory: no wasted fragments, though updates require memory to be allocated or reclaimed.

2. Reverse traversal: traversal runs from the tail node; insertion happens at the tail.

3. Cascade updates: updating one entry may force several subsequent entries to be updated.

Quick list quicklist

In earlier versions, the list type had two underlying implementations:

1. If the list was short and its elements small, ziplist was used.

2. If the list was long or its elements large, linkedlist was used.

But each has advantages and disadvantages:

1. ziplist

Advantages: compact memory

Disadvantages: slow updates

2. linkedlist

Advantages: fast updates

Disadvantages: heavy pointer overhead and memory fragmentation when there are many nodes

To combine the strengths of both, later versions of Redis implement the quicklist: a linkedlist whose nodes are ziplists.

Although this structure inherits both sets of advantages, it still struggles in some situations. Of course, the Redis designers were considerate and provided their own remedies.

Scenario 1:

If a single ziplist node holds too many elements, updates to that ziplist become slow.

If each ziplist node holds too few, the structure degenerates into a plain linkedlist with many memory fragments.

Solution: the configuration file provides the list-max-ziplist-size parameter. A positive value sets the maximum number of entries per ziplist node; a negative value (only -1 through -5 is allowed) selects a byte-size level per node. Users can tune this parameter to their scenario.

Scenario 2:

When a list is very long, the data at both ends is accessed most frequently and the middle far less; a leaderboard, for example, usually reads the highest or lowest elements. The middle nodes can therefore be compressed further. The configuration file provides the list-compress-depth parameter, which specifies how many nodes at each end are left uncompressed. 0 means no compression at all and is the default.
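
For example, assuming a version that still uses the ziplist naming (Redis 7.0+ renames the first parameter to list-max-listpack-size), the two knobs can be set at runtime:

127.0.0.1:6379> config set list-max-ziplist-size -2
OK
127.0.0.1:6379> config set list-compress-depth 1
OK

Here -2 caps each quicklist node at 8 KB (the default level), and a depth of 1 keeps one uncompressed node at each end of the list.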

Integer set intset

The integer set is one of the underlying implementations of the set object: when every element of a set object is an integer, intset is used to store them.

typedef struct intset {
    // element encoding
    uint32_t encoding;
    // number of elements
    uint32_t length;
    // element array, sorted ascending
    int8_t contents[];
} intset;

The actual data of an intset lives in the contents array, sorted from small to large. Although contents is declared as int8_t[], the real element type is decided by the encoding attribute: int16_t, int32_t, or int64_t.

Upgrade of intset

If all elements in contents are int16_t and a new integer requires int32_t, an upgrade is triggered. The steps: first, reallocate space for contents; second, convert the existing integers to int32_t and place them, still sorted, into the new array; finally, insert the new integer. ps: because the new number needs a wider type, it is either greater than all existing elements or less than all of them, so it lands at the head or the tail of the array, which is O(1). There is no downgrade; once upgraded, an intset stays upgraded for life.
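
Reading an element back also goes through encoding. In real Redis the encoding constants are defined literally as sizeof(int16_t), sizeof(int32_t), and sizeof(int64_t); the helper below is an illustrative sketch rather than the actual source:

#include <stdint.h>

/* Illustrative sketch: contents is declared int8_t[] but actually holds
 * int16_t, int32_t, or int64_t values, selected by encoding. */
static int64_t intset_get(const intset *is, int pos) {
    if (is->encoding == sizeof(int64_t))
        return ((const int64_t *)is->contents)[pos];
    if (is->encoding == sizeof(int32_t))
        return ((const int32_t *)is->contents)[pos];
    return ((const int16_t *)is->contents)[pos];
}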


To sum up, the characteristics of intset are as follows:

1. Compatible with multiple integer types

2. It will be upgraded, but not downgraded

Object redisObject

Although the data structures above are already rich and complete, Redis does not use them directly. Instead, it wraps them in redisObject.

typedef struct redisObject {
    // object type (string, list, hash, set, zset)
    unsigned type;
    // encoding: which underlying data structure is used
    unsigned encoding;
    // last access time (for LRU eviction)
    unsigned lru;
    // reference count
    int refcount;
    // pointer to the underlying data structure
    void *ptr;
} redisObject;

Redis decides which underlying data structure to use through type and encoding. refcount implements reference counting for memory reclamation and object sharing. lru records the last access time, used to evict keys when memory runs low. There are five main object types: string, list, hash, set, and sorted set.

String object

String objects come in three encodings: int, raw (sds), and embstr. embstr packs the redisObject and the sds into a single contiguous allocation and is used for short strings (the cutoff is version-dependent: 39 bytes up to Redis 3.0, 44 bytes from 3.2 on). raw needs two allocations, one for the redisObject and one for the sds, and likewise two frees; embstr needs only one of each. However, embstr is read-only: modify it once (with the append command, say) and it becomes raw even if the result is still short. The change can be observed with the object command.

127.0.0.1:6379> set str value1
OK
127.0.0.1:6379> object encoding str 
"embstr"
127.0.0.1:6379> set str nlakjsdnflaksjdhflaskdjfhlaksdfhaisduhflaisudfhlaisduf\
OK
127.0.0.1:6379> object encoding str
"raw"

127.0.0.1:6379> set str2 a
OK
127.0.0.1:6379> object encoding str2
"embstr"
127.0.0.1:6379> append str2 b
(integer) 2
127.0.0.1:6379> object encoding str2
"raw"

String objects are also one of the basic building blocks of many other objects.

List object

List objects are generally encoded as ziplist or linkedlist; higher versions use quicklist.

Encoding conversion:

ziplist can be used only when (1) every list element is shorter than 64 bytes and (2) the list holds fewer than 512 elements. If either condition fails, the encoding converts to linkedlist. The two thresholds are configurable via list-max-ziplist-value and list-max-ziplist-entries.
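On Redis 3.0, the version the reference book describes, the conversion is easy to observe with object encoding (newer versions report quicklist or listpack instead):

127.0.0.1:6379> rpush mylist a b c
(integer) 3
127.0.0.1:6379> object encoding mylist
"ziplist"
127.0.0.1:6379> rpush mylist 0123456789012345678901234567890123456789012345678901234567890123456789
(integer) 4
127.0.0.1:6379> object encoding mylist
"linkedlist"

The 70-byte element exceeds the 64-byte threshold, so the whole list converts to linkedlist.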

Hash object

Hash objects are generally encoded as ziplist or hashtable. How does a ziplist implement a hash? It saves each key and value as two entries and inserts them next to each other in the ziplist. ziplist is used only when two conditions hold: (1) every key and value is shorter than 64 bytes, and (2) the number of key-value pairs is small (fewer than 512 in the version the reference book describes). If either fails, the encoding converts to hashtable. The two thresholds are configurable via hash-max-ziplist-value and hash-max-ziplist-entries.

Set object

Set objects are generally encoded as intset or hashtable. intset is self-explanatory, but how can a hashtable be a set? Very simple: each element is the key (a string object) and the value is NULL. intset is used only when two conditions hold: (1) all elements are integers, and (2) there are fewer than 512 elements. If either fails, the encoding converts to hashtable. The first condition cannot be changed, but the second threshold is configurable via set-max-intset-entries.
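
The conversion can be watched the same way (on versions before 7.2, which adds a listpack encoding for small non-integer sets):

127.0.0.1:6379> sadd nums 1 2 3
(integer) 3
127.0.0.1:6379> object encoding nums
"intset"
127.0.0.1:6379> sadd nums hello
(integer) 1
127.0.0.1:6379> object encoding nums
"hashtable"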

Sorted set object

Sorted set objects are generally encoded as ziplist or skiplist. How does a ziplist implement a sorted set? It uses two adjacent entries per element, one for the member and one for the score, and keeps the ziplist sorted by score on insertion: small scores near the head, large scores near the tail. Unlike other objects, the skiplist-encoded sorted set also maintains a dict mapping each member to its score, so a member's score can be looked up at O(1) speed. ziplist is used only when two conditions hold: (1) every element is shorter than 64 bytes, and (2) there are fewer than 128 elements. If either fails, the encoding converts to skiplist. The two thresholds are configurable via zset-max-ziplist-value and zset-max-ziplist-entries.

Reference count

refcount records the total number of references to an object, enabling memory reclamation and object reuse. But like any pure reference-counting scheme (the classic defect also discussed in JVM literature), objects that reference each other in a cycle can never be reclaimed.

Shared object

When Redis starts, it pre-creates a batch of objects for sharing, such as the integers 0 to 9999.
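
Sharing can be observed with object refcount; recent versions report INT_MAX (2147483647) for shared objects rather than a real count, while older versions show the actual number of references:

127.0.0.1:6379> set n 100
OK
127.0.0.1:6379> object refcount n
(integer) 2147483647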

Summary

The biggest feature of Redis's data structures is memory efficiency: pure in-memory operation is already extremely fast, so saving memory yields the largest performance benefit. Performance itself comes second, and functionality is also taken into account, which is how Redis supports features such as leaderboards and message queues.
Memory savings:
1. embstr stores redisObject and sds contiguously
2. Extreme compression in ziplist
3. The upgrade mechanism of intset
4. Object reference counting and reuse
5. Compression of quicklist's middle nodes
High performance:
1. SDS: preallocation, lazy free, O(1) len
2. Efficient queries in skiplist
3. O(1) lookups in hashtable
4. embstr allocates memory only once
5. The auxiliary hashtable inside sorted sets
Functionality:
1. SDS stores binary data, supporting images and the like
2. Forward, reverse, and range queries in skiplist
3. Multi-type compatibility of intset
4. Progressive rehash in hashtable to avoid blocking

References

1. Redis Design and Implementation, 2nd Edition (《Redis设计与实现》)

2. https://www.html.cn/softprog/database/176951.html

Reference code

import com.alibaba.fastjson.JSON; // fastjson, used by main() below for printing
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class SkipList<T extends Comparable<? super T>> {
    private static final int LEVEL_MAX = 32;
    private Node<T> head = new Node<>(LEVEL_MAX, null);
    private Node<T> tail;
    private int levelCount = 0;
    private static final double SKIP_P = 0.5;

    private static class Node<T> {
        Node<T> backward;
        Node<T>[] forward;
        int levelMax;
        T data;

        Node(int levelMax) {
            this.levelMax = levelMax;
            forward = initForward(this.levelMax);
        }

        Node(int levelMax, T t) {
            this.levelMax = levelMax;
            forward = initForward(this.levelMax);
            data = t;
        }

        public T getData() {
            return data;
        }

        public void setData(T data) {
            this.data = data;
        }

    }

    public Node<T> get(T t) {

        Node<T> p = head;
        for (int i = head.levelMax - 1; i >= 0; i--) {
            while (p.forward[i] != null && p.forward[i].getData().compareTo(t) < 0) {
                p = p.forward[i];
            }
        }
        if (p.forward[0] != null && p.forward[0].getData().equals(t)) {
            return p.forward[0];
        }

        return null;
    }


    public void put(T t) {
        int level = randomLevel();
        Node<T> newNode = new Node<>(level, t);
        Node<T>[] trace = initForward(level);

        levelCount = Math.max(level, levelCount);

        Node<T> p = head;
        for (int i = level - 1; i >= 0; i--) {
            while (p.forward[i] != null && p.forward[i].getData().compareTo(t) < 0) {
                p = p.forward[i];
            }
            trace[i] = p;
        }
        for (int i = level - 1; i >= 0; i--) {
            newNode.forward[i] = trace[i].forward[i];
            trace[i].forward[i] = newNode;
        }
        // Maintain the new node's backward pointer
        if (trace[0] == head) {
            newNode.backward = null;
        } else {
            newNode.backward = trace[0];
        }
        // Maintain the successor's backward pointer; the new node becomes tail if it is last
        if (newNode.forward[0] != null) {
            newNode.forward[0].backward = newNode;
        } else {
            tail = newNode;
        }


    }

    public void delete(T t) {
        Node<T>[] trace = new Node[levelCount];
        Node<T> p = head;
        for (int i = levelCount - 1; i >= 0; i--) {
            while (p.forward[i] != null && p.forward[i].getData().compareTo(t) < 0) {
                p = p.forward[i];
            }
            trace[i] = p;
        }
        for (int i = 0; i < levelCount; i++) {
            //Remove duplicate values
            while (trace[i] != null && trace[i].forward[i] != null && trace[i].forward[i].data.equals(t)) {
                Node<T> targetNode = trace[i].forward[i];
                if (targetNode.equals(tail)){
                    tail = trace[i];
                }
                trace[i].forward[i] = targetNode.forward[i];
                if (targetNode.forward[i] != null) {
                    targetNode.forward[i].backward = trace[i];
                }
                targetNode.forward[i] = null;
                targetNode.backward = null;
            }
        }


    }


    static int randomLevel() {
        int level = 1;
        // Promotion probability SKIP_P trades insert cost against query speed
        while (Math.random() < SKIP_P && level < LEVEL_MAX) {
            level++;
        }
        return level;
    }

    private List<T> range(int start, int end) throws Exception {
        if (start < 0) {
            throw new Exception("start is err");
        }
        List<T> result = new ArrayList<>();
        Node<T> p = head.forward[0];
        int i = 0;
        while (p != null) {
            if (start <= i && i <= end) {
                result.add(p.getData());
            }
            if (i > end) break;
            p = p.forward[0];
            i++;
        }
        return result;
    }

    private List<T> resRange(int start, int end) throws Exception {
        if (start < 0) {
            throw new Exception("start is err");
        }
        List<T> result = new ArrayList<>();
        Node<T> p = tail;
        int i = 0;
        while (p != null) {
            if (start <= i && i <= end) {
                result.add(p.getData());
            }
            if (i > end) break;
            p = p.backward;
            i++;
        }
        return result;
    }

    private Node[] initForward(int level) {
        Node[] forward = new Node[level];
        return forward;
    }

    private void printAll() {

        for (int i = LEVEL_MAX - 1; i >= 0; i--) {
            Node<T> p = head.forward[i];
            StringBuilder str = new StringBuilder("Level ").append(i).append(": ");
            while (p != null) {
                str.append(p.data.toString()).append("->");
                p = p.forward[i];
            }
            System.out.println(str.append("null").toString());
        }
    }

    public static void main(String[] args) throws Exception {
        SkipList<Integer> skipList = new SkipList<>();
        for (int i = 0; i < 5; i++) {
            skipList.put(i);
        }
        skipList.printAll();
        skipList.delete(0);
        skipList.printAll();
        List<Integer> range = skipList.range(0, 1);
        System.out.println(JSON.toJSONString(range));

        List<Integer> resRange = skipList.resRange(0, 1);
        System.out.println(JSON.toJSONString(resRange));



//        timeTest();
    }

    public static void timeTest() {
        SkipList<Integer> skipList = new SkipList<>();
        int count = 100000;
        long time1 = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            skipList.put(i);
        }
        long time2 = System.currentTimeMillis();
        System.out.println(" insert " + (time2 - time1));
        for (int i = 0; i < count; i++) {
            skipList.get(i);
        }
        long time3 = System.currentTimeMillis();
        System.out.println("get " + (time3 - time2));

        HashMap<String, Integer> hashMap = new HashMap<>();
        long htime1 = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            hashMap.put(i + "", i);
        }
        long htime2 = System.currentTimeMillis();
        System.out.println("hashMap insert " + (htime2 - htime1));
        for (int i = 0; i < count; i++) {
            hashMap.get(i + "");
        }
        long htime3 = System.currentTimeMillis();
        System.out.println("hashMap  get " + (htime3 - time2));

    }

} 
