There are six tables in the online library with duplicate data, two of which are relatively large, one 960000 + and one 300000 +. Because the same problem has been handled before, we directly brought the last Python de duplication script. The script is very simple, that is, connect to the database, find out the duplicate data and delete it circularly.
The efficiency is too low. One piece per second, about 20000 + duplicate data, and the estimated time is about 8 hours... It's a problem to blindly rely on the things of predecessors without thinking for yourself!
I always think about how I could do it before and how I can't do it now. This is also a problem! I found that the state is not quite right recently. I have lost the desire to explore and seek knowledge. Today, I am alert and feel like I have lost my way.
To get back to business, let's introduce the steps of de duplication in detail.
CREATE TABLE `animal` ( `id` int(11) NOT NULL AUTO_INCREMENT, `name` varchar(20) DEFAULT NULL, `age` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin; INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('1', 'cat', '12'); INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('2', 'dog', '13'); INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('3', 'camel', '25'); INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('4', 'cat', '32'); INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('5', 'dog', '42');
Objective: we want to remove the data with the same name.
Let's see what data is repeated
SELECT name,count( 1 ) FROM student GROUP BY NAME HAVING count( 1 ) > 1;
"name count(1) cat 2 dog 2
The data with name cat and dog are duplicated, and there are two pieces of duplicate data for each;
Select * From surface Where Duplicate field In (Select Duplicate field From surface Group By Duplicate field Having Count(1)>1)
Delete all duplicate data and leave none
If you delete it directly, an error will be reported
DELETE FROM student WHERE NAME IN ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1)
"1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s
The reason is that when the table is updated, the table is queried. When the table is queried, the table is updated. It can be understood as deadlock. mysql does not support this operation of updating and querying the same table
Solution: query several columns of data to be updated as a third-party table, and then filter the updates.
DELETE FROM student WHERE NAME IN ( SELECT t.NAME FROM ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1 ) t)
Delete the duplicate data in the delete table and keep only one
Before deleting, we can check what kind of duplicate data we want to delete
SELECT * FROM student WHERE id NOT IN ( SELECT t.id FROM ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t )
What do you mean? First, find out the data with the smallest id through the name grouping. These data are the kindling we want to leave, and then find out that the id is not here, which is the duplicate data we want to delete.
Start deleting duplicate data, leaving only one
It's very simple. Just replace the select with delete
DELETE FROM student WHERE id NOT IN ( SELECT t.id FROM ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t )
The 900000 + meter performs super fast.