Author: ksfzhaohui https://my.oschina.net/OutOfMemory/blog/3117737
Recently there was a requirement to parse an order file: the file can contain tens of millions of records, each record has about 20 fields separated by commas, and the data needs to be stored in the database within roughly half an hour.
Thinking
1. Estimate file size
Since there are tens of millions of records and each record has about 20 fields, we can roughly estimate the size of the entire order file. The method is simple: use FileWriter to write tens of millions of records into a file and check the resulting size. After testing it comes to about 1.5 GB (each line written by the test program below is roughly 140 bytes, and 10,000,000 × ~140 bytes ≈ 1.4 GB, which matches);
2. How to batch insert
From the above it is clear that the file is fairly large, so reading it all into memory at once is not an option. The approach is to take a chunk of the order file each time and insert it in batches. For the batch insert, the **insert(...) values(...),(...)** form is used; after testing, this form turns out to be quite efficient. (There is a separate article on inserting data in the shortest possible time, which you can refer to.)
3. Data integrity
When taking chunks of data, we need to ensure data integrity. Each record ends with a line break, so we use that marker to make sure every chunk contains only whole records and never ends in the middle of one;
4. Does the database support batch inserts
Because the data is inserted in batches, we need to make sure the database accepts large packets. For MySQL, which is used here, this is done by setting max_allowed_packet so that the packet submitted by each batch is not rejected;
5. Errors halfway through
Because a large file is being parsed, an error may occur partway through, for example the database connection failing just after 9 million records have been inserted. Starting the insert over from scratch is not acceptable, so the position reached must be recorded after every batch, and it must be saved in the same transaction as the batch itself, so that after recovery parsing can resume from the recorded position.
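As a rough illustration of the resume step (the names loadPosition and parseAndInsertFrom below are hypothetical placeholders, not the article's actual code, whose full version is on Github):

```java
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ResumeSketch {
    // Hypothetical sketch: resume parsing from the position recorded in file_analysis.
    public void resume(String filePath) throws Exception {
        long startPosition = loadPosition(filePath); // read file_analysis.position; 0 if no record yet
        try (FileChannel inputChannel = FileChannel.open(Paths.get(filePath), StandardOpenOption.READ)) {
            // keep reading chunks from startPosition; each batch insert and the new
            // position are committed in the same transaction, as shown later
            parseAndInsertFrom(inputChannel, startPosition);
        }
    }

    private long loadPosition(String filePath) { /* query file_analysis by file_path */ return 0L; }

    private void parseAndInsertFrom(FileChannel channel, long startPosition) { /* chunked read + batch insert */ }
}
```

The key point is only that the start offset comes from file_analysis.position rather than from the beginning of the file.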
Implementation
1. Prepare data sheet
Two tables need to be prepared here: a file parsing status/position table and an order table;
```sql
CREATE TABLE `file_analysis` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `file_type` varchar(255) NOT NULL COMMENT 'File type 01: type 1, 02: type 2',
  `file_name` varchar(255) NOT NULL COMMENT 'File name',
  `file_path` varchar(255) NOT NULL COMMENT 'File path',
  `status` varchar(255) NOT NULL COMMENT 'File state: 0 initialized; 1 succeeded; 2 failed; 3 in progress',
  `position` bigint(20) NOT NULL COMMENT 'Position where the last processing finished',
  `crt_time` datetime NOT NULL COMMENT 'Creation time',
  `upd_time` datetime NOT NULL COMMENT 'Update time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8;

CREATE TABLE `file_order` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `file_id` bigint(20) DEFAULT NULL,
  `field1` varchar(255) DEFAULT NULL,
  `field2` varchar(255) DEFAULT NULL,
  `field3` varchar(255) DEFAULT NULL,
  `field4` varchar(255) DEFAULT NULL,
  `field5` varchar(255) DEFAULT NULL,
  `field6` varchar(255) DEFAULT NULL,
  `field7` varchar(255) DEFAULT NULL,
  `field8` varchar(255) DEFAULT NULL,
  `field9` varchar(255) DEFAULT NULL,
  `field10` varchar(255) DEFAULT NULL,
  `field11` varchar(255) DEFAULT NULL,
  `field12` varchar(255) DEFAULT NULL,
  `field13` varchar(255) DEFAULT NULL,
  `field14` varchar(255) DEFAULT NULL,
  `field15` varchar(255) DEFAULT NULL,
  `field16` varchar(255) DEFAULT NULL,
  `field17` varchar(255) DEFAULT NULL,
  `field18` varchar(255) DEFAULT NULL,
  `crt_time` datetime NOT NULL COMMENT 'Creation time',
  `upd_time` datetime NOT NULL COMMENT 'Update time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10000024 DEFAULT CHARSET=utf8;
```
2. Configure the database packet size
```
mysql> show VARIABLES like '%max_allowed_packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 1048576    |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set

mysql> set global max_allowed_packet = 1024*1024*10;
Query OK, 0 rows affected
```
Setting max_allowed_packet ensures that the database can receive the packet size submitted by the batch insert; otherwise the following error occurs:
```
Caused by: com.mysql.jdbc.PacketTooBigException: Packet for query is too large (4980577 > 1048576).
You can change this value on the server by setting the max_allowed_packet' variable.
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3915)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2598)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2834)
```
3. Prepare test data
```java
public static void main(String[] args) throws IOException {
    FileWriter out = new FileWriter(new File("D://xxxxxxx//orders.txt"));
    for (int i = 0; i < 10000000; i++) {
        out.write("vaule1,vaule2,vaule3,vaule4,vaule5,vaule6,vaule7,vaule8,vaule9,vaule10,"
                + "vaule11,vaule12,vaule13,vaule14,vaule15,vaule16,vaule17,vaule18");
        out.write(System.getProperty("line.separator"));
    }
    out.close();
}
```
Use FileWriter to write 10 million records into a file in a loop, which is still very fast. Don't forget to write the line separator (\n or \r\n) after each record;
4. Integrity of intercepted data
Besides setting the size of each chunk read from the file, we also need a parameter for a small buffer that is read repeatedly and scanned for a line break (\n or \r). If no line break is found in the current buffer, we keep accumulating and reading further until one is found; this buffer should be roughly the size of a single record. Part of the implementation is as follows:
```java
ByteBuffer byteBuffer = ByteBuffer.allocate(buffSize); // allocate a buffer
long endPosition = batchFileSize + startPosition - buffSize; // end position of the chunk
long startTime, endTime;
for (int i = 0; i < count; i++) {
    startTime = System.currentTimeMillis();
    if (i + 1 != count) {
        int read = inputChannel.read(byteBuffer, endPosition); // read data
        readW:
        while (read != -1) {
            byteBuffer.flip(); // switch to read mode
            byte[] array = byteBuffer.array();
            for (int j = 0; j < array.length; j++) {
                byte b = array[j];
                if (b == 10 || b == 13) { // check for \n or \r
                    endPosition += j;
                    break readW;
                }
            }
            endPosition += buffSize;
            byteBuffer.clear(); // reset the buffer
            read = inputChannel.read(byteBuffer, endPosition);
        }
    } else {
        endPosition = fileSize; // the last chunk points directly to the end of the file
    }
    // ... omitted, see the full code on Github ...
}
```
As shown in the above code, a buffer is allocated, about 200 bytes based on the size of each line. The buffer is scanned for a line break (\n or \r), and the offset found is added to the previous end position; this guarantees that each chunk ends exactly on a record boundary, keeping the data intact;
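The omitted part then reads the bytes between the previous position and the line break that was found, and splits them into whole records. A simplified sketch of that step is shown below (field mapping shortened; it assumes the usual java.nio/java.util imports and the FileOrder setters, so treat it as an illustration rather than the article's exact code):

```java
// Hypothetical sketch: read one chunk [startPosition, endPosition] and split it into records.
ByteBuffer chunk = ByteBuffer.allocate((int) (endPosition - startPosition + 1));
inputChannel.read(chunk, startPosition);
chunk.flip();
String text = new String(chunk.array(), 0, chunk.limit(), StandardCharsets.UTF_8);

List<FileOrder> orderList = new ArrayList<>();
for (String line : text.split("\r\n|\n")) {   // the chunk ends on a line break, so lines are whole
    if (line.isEmpty()) {
        continue;
    }
    String[] fields = line.split(",");        // 18 comma-separated fields per record
    FileOrder order = new FileOrder();
    order.setField1(fields[0]);
    order.setField2(fields[1]);
    // ... remaining fields mapped the same way ...
    orderList.add(order);
}
// orderList is then handed to the batch insert shown in the next step
```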
5. Batch insert data
Insert the data in batches using **insert(...) values(...),(...)**. Part of the code is as follows:
```java
// Save the orders and the parsing position in one transaction
SqlSession session = sqlSessionFactory.openSession();
try {
    long startTime = System.currentTimeMillis();
    FielAnalysisMapper fielAnalysisMapper = session.getMapper(FielAnalysisMapper.class);
    FileOrderMapper fileOrderMapper = session.getMapper(FileOrderMapper.class);
    fileOrderMapper.batchInsert(orderList);

    // Update the last parsed position and the update time
    fileAnalysis.setPosition(endPosition + 1);
    fileAnalysis.setStatus("3");
    fileAnalysis.setUpdTime(new Date());
    fielAnalysisMapper.updateFileAnalysis(fileAnalysis);
    session.commit();

    long endTime = System.currentTimeMillis();
    System.out.println("===Insert data cost:" + (endTime - startTime) + "ms===");
} catch (Exception e) {
    session.rollback();
} finally {
    session.close();
}
// ... omitted, see the full code on Github ...
```
The above code saves a batch of order data and the file parsing position in a single transaction. batchInsert uses the `<foreach>` tag to traverse the order list and generate the values clause;
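The mapper itself lives in the Github code as an XML file; purely as a sketch of the same `<foreach>` idea, a MyBatis script-annotation version with a shortened field list could look like this (hypothetical, not the author's actual mapper):

```java
import java.util.List;
import org.apache.ibatis.annotations.Insert;

public interface FileOrderMapper {
    // Multi-row insert built with <foreach>; the real mapper covers all 18 fields.
    @Insert("<script>"
            + "insert into file_order (file_id, field1, field2, crt_time, upd_time) values "
            + "<foreach collection='list' item='o' separator=','>"
            + "(#{o.fileId}, #{o.field1}, #{o.field2}, now(), now())"
            + "</foreach>"
            + "</script>")
    int batchInsert(List<FileOrder> orderList);
}
```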
Summary
Part of the code is shown above; the complete code is in the batchInsert module at the Github address. Locally, the size of each chunk read from the file is set to 2 MB.
In testing, inserting 10 million records (about 1.5 GB) into the MySQL database took roughly 20 minutes. Of course, changing the chunk size changes the time spent accordingly.