Introduction to Redis HyperLogLog usage

Keywords: Jedis Redis Database Java

(1) Introduction to HyperLogLog

HyperLogLog was added in Redis 2.8.9. It is an algorithm for cardinality statistics, that is, for counting the distinct elements in a collection. Each HyperLogLog key takes at most 12 KB of memory, yet it can estimate the cardinality of nearly 2^64 different elements. This small, bounded footprint makes HyperLogLog relatively cheap and well suited to large data volumes.

In business scenarios, HyperLogLog is often used for statistics over large amounts of data, such as page views or user visits.

For example, counting the number of visits to a page (PV) is straightforward: you can use a Redis counter directly or store the count in a database. Counting the number of users who visit a page (UV) is harder, because a user who visits more than once a day must be counted only once. In this case you might think of using a SET, because a SET deduplicates its members: the key identifies the page and the members are the corresponding userIds. This method is feasible, but with tens of millions of visits it becomes a hassle, because large SET objects have to be created and maintained just to produce one count.
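
As a rough illustration of the SET approach, here is a minimal sketch in plain Java, where one HashSet per page stands in for a Redis SET (in Redis this would correspond to SADD and SCARD). The class and method names are hypothetical and exist only for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: exact UV counting with one set of user IDs per page.
class ExactUvCounter {
    private final Map<String, Set<String>> visitorsByPage = new HashMap<>();

    // Record a visit; the set ignores a userId it already contains.
    public void recordVisit(String page, String userId) {
        visitorsByPage.computeIfAbsent(page, p -> new HashSet<>()).add(userId);
    }

    // Exact number of distinct users recorded for the page.
    public int uniqueVisitors(String page) {
        Set<String> visitors = visitorsByPage.get(page);
        return visitors == null ? 0 : visitors.size();
    }

    public static void main(String[] args) {
        ExactUvCounter counter = new ExactUvCounter();
        counter.recordVisit("home", "user1");
        counter.recordVisit("home", "user1"); // same user again, not counted twice
        counter.recordVisit("home", "user2");
        System.out.println(counter.uniqueVisitors("home")); // prints 2
    }
}
```

Every distinct userId is stored in full, so memory grows with the number of distinct visitors. That is the cost HyperLogLog avoids.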

Is there any other way? For the large number of visits described above, Redis implements the HyperLogLog algorithm, which was invented by Professor Philippe Flajolet.

The main HyperLogLog commands in Redis are pfadd and pfcount. As the names imply, one adds data and the other counts it. The commands are easy to master, even though the algorithm behind them is more complex. So why the pf prefix? Because the inventor of the HyperLogLog data structure is Professor Philippe Flajolet, and the commands use his initials, which also makes them easy to remember.

Here are some simple examples. Start the Redis client first:

127.0.0.1:6379> flushall
OK
127.0.0.1:6379> pfadd uv user1
(integer) 1
127.0.0.1:6379> pfcount uv
(integer) 1
127.0.0.1:6379> pfadd uv user2
(integer) 1
127.0.0.1:6379> pfcount uv
(integer) 2
127.0.0.1:6379> pfadd uv user3
(integer) 1
127.0.0.1:6379> pfcount uv
(integer) 3
127.0.0.1:6379> pfadd uv user4
(integer) 1
127.0.0.1:6379> pfcount uv
(integer) 4
127.0.0.1:6379> pfadd uv user5 user6 user7 user8 user9 user10
(integer) 1
127.0.0.1:6379> pfcount uv
(integer) 10
127.0.0.1:6379>

Next, implement the same thing in Java with the Jedis library.

Add the Maven dependency:

<dependencies>
    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.9.0</version>
    </dependency>
</dependencies>

Start the Redis server first, then write a test class:

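package com.test.redis;

import redis.clients.jedis.Jedis;

public class RedisPFCountTest {

    public static void main(String[] args) {
        // try-with-resources closes the connection even if a command fails
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            // Add 1,000 distinct elements to the HyperLogLog stored at key "pv"
            for (int i = 0; i < 1000; i++) {
                jedis.pfadd("pv", String.valueOf(i));
            }
            // PFCOUNT returns the approximate number of distinct elements
            long total = jedis.pfcount("pv");
            System.out.printf("%d%n", total);
        }
    }

}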

In addition, when the amount of data is increased to 100,000 elements, the reported count shows a slight error.
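
The slight error comes from the fact that HyperLogLog estimates the count from hash patterns instead of remembering each element. The following is a simplified sketch of that idea in plain Java. It assumes 16384 registers and uses a hash function picked just for this example; it is not Redis's actual implementation, which is more elaborate.

```java
import java.nio.charset.StandardCharsets;

// Simplified HyperLogLog for illustration only; not Redis's implementation.
class TinyHyperLogLog {
    private static final int INDEX_BITS = 14;      // assumption: 2^14 registers
    private static final int M = 1 << INDEX_BITS;  // 16384
    private final byte[] registers = new byte[M];

    // Hypothetical 64-bit hash: FNV-1a followed by a MurmurHash3-style finalizer.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // The top 14 bits pick a register; the register keeps the largest
    // "leading zeros + 1" value seen in the remaining 50 bits.
    public void add(String element) {
        long h = hash64(element);
        int index = (int) (h >>> (64 - INDEX_BITS));
        long rest = h << INDEX_BITS;
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - INDEX_BITS) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    // Harmonic-mean estimate, with linear counting for small cardinalities.
    public long count() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double estimate = alpha * M * M / sum;
        if (estimate <= 2.5 * M && zeros > 0) {
            estimate = M * Math.log((double) M / zeros);
        }
        return Math.round(estimate);
    }

    public static void main(String[] args) {
        TinyHyperLogLog hll = new TinyHyperLogLog();
        for (int i = 0; i < 100_000; i++) {
            hll.add(String.valueOf(i));
        }
        // Prints an estimate of the 100000 distinct values added
        System.out.println(hll.count());
    }
}
```

Adding an element that was already seen never changes a register, so duplicates do not affect the estimate. The price is that the result is an estimate and not an exact count.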


Of course, the HyperLogLog algorithm was designed for counting at large scale, so it fits cases where the data volume is large and a small error is acceptable. HyperLogLog provides approximate deduplicated counting: the result is not exact, but it is close, with a standard error of 0.81%. This has no real impact on page UV statistics, where the numbers may be very large but do not need to be absolutely accurate. The accuracy requirements are modest while the performance and storage requirements are high, and HyperLogLog meets exactly this need: it takes up little storage space and performs well.
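
The 12 KB and 0.81% figures can be checked with a little arithmetic. The sketch below assumes that the dense representation uses 16384 registers of 6 bits each and applies the usual HyperLogLog error formula 1.04 / sqrt(m). The class and method names are made up for the example.

```java
// Hypothetical helper that reproduces the figures quoted in this article.
class HllFigures {
    static final int REGISTERS = 1 << 14;    // assumption: 16384 registers
    static final int BITS_PER_REGISTER = 6;  // assumption: 6 bits per register

    // Memory for all registers, in bytes.
    static int denseSizeBytes() {
        return REGISTERS * BITS_PER_REGISTER / 8;
    }

    // Standard error of a HyperLogLog with m registers: 1.04 / sqrt(m).
    static double standardError() {
        return 1.04 / Math.sqrt(REGISTERS);
    }

    public static void main(String[] args) {
        System.out.println(denseSizeBytes());         // 12288 bytes, i.e. 12 KB
        System.out.println(standardError() * 100.0);  // about 0.81 (percent)
    }
}
```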

(2) PFMERGE usage

pfadd and pfcount are the commands most commonly used for counting. Now suppose two pages are very similar and you want to count the users who visited either of them. You can merge the statistics with pfmerge, for example:

127.0.0.1:6379> PFADD test1 "apple" "banana" "cherry"
(integer) 1
127.0.0.1:6379> PFCOUNT test1
(integer) 3
127.0.0.1:6379> PFADD test2 "apple" "cherry" "durian" "mongo"
(integer) 1
127.0.0.1:6379> PFCOUNT test2
(integer) 4
127.0.0.1:6379> PFMERGE test1&test2 test1 test2
OK
127.0.0.1:6379> PFCOUNT test1&test2
(integer) 5
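
The merged count is 5, not 3 + 4 = 7, because PFMERGE produces a HyperLogLog for the union of its sources, so "apple" and "cherry" are counted only once. As a cross-check, the exact union can be computed in plain Java with a HashSet. The class and method names here are hypothetical. With Jedis, the merge itself would presumably be a call to jedis.pfmerge(destKey, sourceKeys...), which needs a running server and is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical cross-check: exact size of the union of two element lists.
class UnionCheck {
    static int unionSize(String[] a, String[] b) {
        Set<String> union = new HashSet<>(Arrays.asList(a));
        union.addAll(Arrays.asList(b)); // elements already present are ignored
        return union.size();
    }

    public static void main(String[] args) {
        String[] test1 = {"apple", "banana", "cherry"};
        String[] test2 = {"apple", "cherry", "durian", "mongo"};
        System.out.println(unionSize(test1, test2)); // prints 5
    }
}
```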

Posted by jonathanellis on Wed, 17 Jul 2019 09:26:54 -0700