Batch-deleting tens of millions of rows of MySQL data with Python

Keywords: Python SQL MySQL Database

Scenario

A production MySQL database contains a table that stores daily statistical results — more than 10 million rows per day, far more than we expected. Operations noticed the table was occupying 200 GB of disk, and after checking with the business side the decision was that only the last three days of data needed to be retained; everything older could be deleted. But how to delete it?
Because this is a production database with many other tables in it, deleting the table's data in one big statement would not work: a long-running delete can hold locks and affect the other tables. Even deleting a single day's data in one statement caused noticeable stalls. So there was no way around it — write a Python script and delete in batches.
The approach is as follows:

  1. Delete data for only one day at a time.
  2. Within a day, delete 50,000 rows per statement.
  3. When one day's data is gone, move on to the next day.
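The three steps above can be sketched as a plain nested loop. This is a minimal simulation in pure Python — `table` is a hypothetical in-memory stand-in for the MySQL table, and a batch size of 5 stands in for the real 50,000:

```python
def delete_in_batches(table, days, batch):
    """Delete all rows for each day, at most `batch` rows per pass.

    `table` is a list of dicts, a hypothetical stand-in for the real
    MySQL table; returns the number of delete passes issued.
    """
    passes = 0
    for day in days:
        # keep deleting while any row for this day remains
        while any(row['date'] == day for row in table):
            # one "DELETE ... LIMIT batch" pass
            victims = [row for row in table if row['date'] == day][:batch]
            for row in victims:
                table.remove(row)
            passes += 1
    return passes

# Example: 6 rows per day across two days, batch size 5
rows = [{'date': d, 'n': i}
        for d in ('2019-08-17', '2019-08-18') for i in range(6)]
passes = delete_in_batches(rows, ['2019-08-17', '2019-08-18'], batch=5)
# each day needs two passes (5 rows, then 1 row), so 4 passes in total
```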

Python code

# -*-coding:utf-8 -*-

import sys

# This is our Python Module encapsulated internally.
sys.path.append('/var/lib/hadoop-hdfs/scripts/python_module2')
import keguang.commons as commons
import keguang.timedef as timedef
import keguang.sql.mysqlclient as mysql


def run(starttime, endtime, regx):
    tb_name = 'statistic_ad_image_final_count'
    days = timedef.getDays(starttime, endtime, regx)
    # Traverse the days and delete each day's data in turn
    for day in days:
        print('%s data deletion begins' % day)
        mclient = getConn()
        # Probe query: does any row remain for this day?
        sql = '''
        select 1 from %s where date = '%s' limit 1
        ''' % (tb_name, day)
        print(sql)
        result = mclient.query(sql)
        # While the probe still finds a row, delete another batch of 50,000
        while result:
            sql = 'delete from %s where date = "%s" limit 50000' % (tb_name, day)
            print(sql)
            mclient.execute(sql)
            sql = '''
            select 1 from %s where date = '%s' limit 1
            ''' % (tb_name, day)
            print(sql)
            result = mclient.query(sql)
        print('%s data deletion completed' % day)
        mclient.close()


# Return a MySQL connection
def getConn():
    return mysql.MysqlClient(host = '0.0.0.0', user = 'test', passwd = 'test', db= 'statistic')

if __name__ == '__main__':
    regx = '%Y-%m-%d'
    # Date range to purge (inclusive); hard-coded here.
    # Yesterday's date would be available via timedef.getYes(regx, -1).
    starttime = '2019-08-17'
    endtime = '2019-08-30'
    run(starttime, endtime, regx)

The loop keeps probing: as long as the query still finds data for the current day, it deletes another 50,000 rows; once the day is empty, it moves on to the next day. Deleting everything took about half an hour.
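The same probe-then-delete-a-batch pattern can be reproduced locally with the standard-library sqlite3 module. This is a sketch of the pattern, not the author's internal mysqlclient wrapper; and since stock SQLite builds lack `DELETE ... LIMIT`, the batch is selected by rowid in a subquery instead:

```python
import sqlite3

def purge_day(conn, table, day, batch=50000):
    """Delete all rows for one day, at most `batch` rows per DELETE."""
    cur = conn.cursor()
    while True:
        # probe: does any row remain for this day?
        cur.execute('SELECT 1 FROM %s WHERE date = ? LIMIT 1' % table, (day,))
        if cur.fetchone() is None:
            break
        # delete one batch; stock SQLite has no DELETE ... LIMIT,
        # so the victim rowids are selected in a subquery
        cur.execute(
            'DELETE FROM %s WHERE rowid IN '
            '(SELECT rowid FROM %s WHERE date = ? LIMIT ?)' % (table, table),
            (day, batch))
        conn.commit()  # commit per batch so locks are held only briefly

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stats (date TEXT, value INTEGER)')
conn.executemany('INSERT INTO stats VALUES (?, ?)',
                 [('2019-08-17', i) for i in range(12)])
purge_day(conn, 'stats', '2019-08-17', batch=5)
```

Committing after each batch is what keeps the production delete from stalling other queries: each statement touches at most 50,000 rows, so no single transaction holds locks for long.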

Welcome to follow my WeChat public account.

Posted by jskywalker on Wed, 09 Oct 2019 12:22:22 -0700