Problem
Given a large file (1 TB? 10 TB?) in which each line stores a user ID, and a machine with only 2 GB of memory, find the ten IDs that occur most frequently.
Introduction
In recent years, the TopK problem has been one of the most frequently asked interview questions.
In fact, the approach is relatively simple. The constraints are:
- A single machine
- Limited memory
- A file far too large to fit in memory
Under these constraints, the following approach solves it:
- Read the large file line by line
- Hash-partition the lines into smaller files
- Read each small file
- Count occurrences with a HashMap
- Maintain a min-heap of size k
Create use cases
First, generate some test data.
This creates 100,000 strings; even with this modest amount of data, the generator takes a little while to run.
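The full generator is listed in the appendix; a minimal sketch of its core (the output path and line format follow the appendix code):

import java.io.*;

// Sketch: write 100000 lines of the form "user<random 0-9999>who" to d:\bigdata.txt
class GenerateSketch {
    public static void main(String[] args) throws FileNotFoundException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb.append("user").append((int) (Math.random() * 10000))
              .append("who").append(System.lineSeparator());
        }
        try (PrintWriter out = new PrintWriter(new File("d:\\bigdata.txt"))) {
            out.print(sb); // PrintWriter creates the file if it does not exist
        }
    }
}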
The size of the generated file and a sample of the generated data are shown in the figures.
Read and split files by line
The code for this step should be straightforward.
The logic is:
- Read line by line
- Print progress every 1000 lines
- Bucket each line by its hashCode value
- Write each line to its bucket's file in append mode
In this way, in theory, we end up with 1000 small files, where every line in a given file has the same hashCode modulus.
That is, all occurrences of the same string land in the same file.
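A sketch of the split loop (full version in the appendix; note the bucket index uses hashCode() % divNum, wrapped in Math.abs() because hashCode() can be negative):

import java.io.*;

// Sketch: split d:\bigdata.txt into up to 1000 bucket files under D:\div
class SplitSketch {
    public static void main(String[] args) throws IOException {
        final int divNum = 1000;
        new File("D:\\div").mkdirs();
        try (BufferedReader reader = new BufferedReader(new FileReader("d:\\bigdata.txt"))) {
            String line;
            int times = 0;
            while ((line = reader.readLine()) != null) {
                if (++times % 1000 == 0) System.out.println(times + " lines: " + line); // progress log
                int order = Math.abs(line.hashCode() % divNum); // bucket index
                // The second FileWriter argument must be true for append mode;
                // opening a writer per line keeps the sketch simple, caching writers would be faster
                try (BufferedWriter out = new BufferedWriter(
                        new FileWriter("D:\\div\\d" + order + "file.txt", true))) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}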
Count
Each small file easily fits in memory, so you just count the strings in it, storing each string and its count in a HashMap.
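A minimal sketch of the counting step for one small file (merge() replaces the try/catch used in the appendix version):

import java.io.*;
import java.util.*;

class CountSketch {
    // Returns a map from each distinct line in the file to its number of occurrences
    static Map<String, Integer> count(File file) throws IOException {
        Map<String, Integer> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                map.merge(line, 1, Integer::sum); // insert 1, or add 1 to the existing count
            }
        }
        return map;
    }
}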
Maintain the min-heap
After one pass over a file's counts, push the entries into a min-heap capped at k elements; maintaining that heap of size k is all the sorting needed to keep the current top k.
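A sketch of the heap maintenance, assuming the per-file counts from the previous step (k = 10 here):

import java.util.*;

class HeapSketch {
    // Min-heap ordered by count; the root is the smallest count among the current top 10
    static final Queue<Map.Entry<String, Integer>> queue =
            new PriorityQueue<>(10, Map.Entry.comparingByValue());

    static void offer(Map<String, Integer> counts) {
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (queue.size() < 10) {
                queue.add(entry);          // heap not full yet, just insert
            } else if (entry.getValue() > queue.peek().getValue()) {
                queue.poll();              // evict the current minimum
                queue.add(entry);          // replace it with the larger count
            }
        }
    }
}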
Output
Finally, output the contents of the min-heap.
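One caveat: PriorityQueue's iterator does not return elements in sorted order, so it is safer to poll the heap. Continuing the sketch above, this prints the ten results in ascending order of count:

// Polling drains the heap smallest-first
while (!HeapSketch.queue.isEmpty()) {
    Map.Entry<String, Integer> entry = HeapSketch.queue.poll();
    System.out.println(entry.getKey() + " " + entry.getValue());
}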
Better solution
If multiple machines running Hadoop are allowed, you can split the file across machines, count in the map phase, aggregate the counts in the reduce phase, and then use a min-heap to find the top k.
But the interviewer usually won't let you get away with that XD
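For illustration only, a minimal sketch of that counting stage using the standard Hadoop MapReduce API (the class names here are made up for this sketch); the min-heap top-k step would then run over the reducer's output:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

class IdCount {
    // Map phase: each input line is one user ID; emit (id, 1)
    static class IdMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    // Reduce phase: sum the 1s for each ID to get its total count
    static class IdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}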
Appendix
TopK use case generation
import java.io.*;

class TopKUseCase {
    public static void main(String[] args) {
        final int fileSize = 100000;
        File outputFile = new File("d:\\bigdata.txt");
        StringBuilder stringBuilder = new StringBuilder();
        // Build 100000 lines of the form "user<random 0-9999>who"
        for (int i = 0; i < fileSize; i++) {
            stringBuilder.append("user")
                         .append((int) (Math.random() * 10000))
                         .append("who")
                         .append(System.lineSeparator());
        }
        // PrintWriter creates the file if it does not exist
        try (PrintWriter output = new PrintWriter(outputFile)) {
            output.print(stringBuilder);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
TopK code
import java.io.*;
import java.nio.file.Files;
import java.util.*;

class TopK {
    public static void main(String[] args) throws IOException {
        final int divNum = 1000; // Number of buckets, chosen so each small file fits in memory

        // Read the large file line by line
        File inputFile = new File("d:\\bigdata.txt");
        BufferedReader bufferedReader = new BufferedReader(new FileReader(inputFile));
        String str;

        int times = 0; // Counter used only to print progress
        System.out.println("Splitting started");

        // Create the directory for the small files
        File menu = new File("D:\\div");
        if (!menu.exists()) menu.mkdirs();

        // Loop over the large file
        while ((str = bufferedReader.readLine()) != null) {
            times++;
            if (times % 1000 == 0) {
                System.out.println(times + " lines: " + str);
            }
            // Choose a bucket by hashCode; abs() because hashCode() may be negative
            int order = Math.abs(str.hashCode() % divNum);
            File outputfile = new File("D:\\div\\d" + order + "file.txt");
            // The second FileWriter parameter must be true for append mode
            BufferedWriter output = new BufferedWriter(new FileWriter(outputfile, true));
            output.write(str);
            output.newLine();
            output.close();
        }
        bufferedReader.close();
        System.out.println("Splitting completed");

        // Min-heap of size 10, ordered by count; the root is the smallest of the current top 10
        Queue<Map.Entry<String, Integer>> queue =
                new PriorityQueue<>(10, Map.Entry.comparingByValue());

        // Traverse each small file that exists
        for (int i = 0; i < divNum; i++) {
            System.out.println("File " + i);
            inputFile = new File("D:\\div\\d" + i + "file.txt");
            if (!inputFile.exists()) continue;

            // Read the whole file at once; switch to line-by-line reading if memory is tight
            String[] strArray = readToString(inputFile).split(System.lineSeparator());

            // Count with a hash table
            Map<String, Integer> map = new HashMap<>();
            for (String string : strArray) {
                map.merge(string, 1, Integer::sum);
            }

            // Maintain the min-heap
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                if (queue.size() < 10) {
                    queue.add(entry);
                } else if (entry.getValue() > queue.peek().getValue()) {
                    queue.poll(); // Evict the current minimum
                    queue.add(entry);
                }
            }
        }

        // Output the min-heap; polling yields ascending order of count
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> entry = queue.poll();
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }

    // Read an entire file into a string using the platform default encoding
    public static String readToString(File file) throws IOException {
        return new String(Files.readAllBytes(file.toPath()));
    }
}