★★ | MySQL | how to filter out duplicate data for mass insertion?

Keywords: Python Java SQL

        There are six tables in the online library with duplicate data, two of which are relatively large, one 960000 + and one 300000 +. Because the same problem has been handled before, we directly brought the last Python de duplication script. The script is very simple, that is, connect to the database, find out the duplicate data and delete it circularly.

         The efficiency is too low. One piece per second, about 20000 + duplicate data, and the estimated time is about 8 hours... It's a problem to blindly rely on the things of predecessors without thinking for yourself!

         I always think about how I could do it before and how I can't do it now. This is also a problem! I found that the state is not quite right recently. I have lost the desire to explore and seek knowledge. Today, I am alert and feel like I have lost my way.

To get back to business, let's introduce the steps of de duplication in detail.

CREATE TABLE `animal` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('1', 'cat', '12');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('2', 'dog', '13');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('3', 'camel', '25');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('4', 'cat', '32');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('5', 'dog', '42');

Objective: we want to remove the data with the same name.

Let's see what data is repeated

SELECT name,count( 1 ) 
FROM
 student 
GROUP BY
NAME 
HAVING
 count( 1 ) > 1;

Output:

"name count(1) cat 2 dog 2

The data with name cat and dog are duplicated, and there are two pieces of duplicate data for each;

Select * From surface Where Duplicate field In (Select Duplicate field From surface Group By Duplicate field Having Count(1)>1)

Delete all duplicate data and leave none

If you delete it directly, an error will be reported

DELETE 
FROM
 student 
WHERE
 NAME IN (
 SELECT NAME 
 FROM
  student 
 GROUP BY
 NAME 
HAVING
 count( 1 ) > 1)

report errors:

"1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s

The reason is that when the table is updated, the table is queried. When the table is queried, the table is updated. It can be understood as deadlock. mysql does not support this operation of updating and querying the same table

Solution: query several columns of data to be updated as a third-party table, and then filter the updates.

DELETE 
FROM
 student 
WHERE
 NAME IN (
 SELECT
  t.NAME 
FROM
 ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1 ) t)

Delete the duplicate data in the delete table and keep only one

Before deleting, we can check what kind of duplicate data we want to delete

SELECT
 * 
FROM
 student 
WHERE
 id NOT IN (
 SELECT
  t.id 
 FROM
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t 
 )

What do you mean? First, find out the data with the smallest id through the name grouping. These data are the kindling we want to leave, and then find out that the id is not here, which is the duplicate data we want to delete.

Start deleting duplicate data, leaving only one

It's very simple. Just replace the select with delete

DELETE 
FROM
 student 
WHERE
 id NOT IN (
 SELECT
  t.id 
 FROM
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t 
 )

The 900000 + meter performs super fast.

Source: telami.cn/2019/mysql-remove-duplicate-data-and-keeping-only-one/

Posted by durahman on Sun, 28 Nov 2021 13:09:09 -0800