How to batch insert data into the database with Python

Background

Recently I handled a request to batch insert data into a database. The requirement was roughly as follows:

The original program used SQL stored procedures to update and modify the data. Because of the large data volume, this put too much pressure on the database, so the program had to be refactored: the files are read and the calculations are done in Python, and only the final results are pushed back to the database by calling an AWS Lambda service. This greatly reduces the load on the database and also lowers the cost. The database work involved is mainly INSERT and UPDATE operations.

Libraries and versions

Running on a Linux system. Third-party libraries: pandas 1.0.5, pymysql 0.9.3. Python version: 3.7. Standard library: os.

Sorting out the logic

In fact, by the last step the data to be written to the database is already held in memory: the calculations done after reading the files all happen in memory, so the results do not need to be written to a local file and read back before being written to the database, which would hurt the program's efficiency. The logic is as follows:

1. Read the files, splice them together and run the calculations to generate a new df
2. Initialize the connection to the database
3. Convert the required data in the df into tuples (the exact format depends on how the third-party database library's interface supports batch operations)
4. Write the data into the database
5. Check the contents of the database

Step by step implementation and analysis

Read the files

Reading a file only needs a path, but a few points are worth emphasizing:

Absolute path: this is the simplest option, you just pass the path string directly, but as soon as the folder structure changes you have to keep editing it.

Relative path: I generally prefer to locate the current script first and then find the files relative to it. As long as the directory structure inside the package does not change, nothing needs to be edited, even after deployment, because everything is resolved from the package's own location. Very convenient.

pandas parses numeric-looking columns as numbers by default, so fields that look like numbers but actually need to be treated as strings (codes with leading zeros, for example) should have their type set explicitly.

import os
import pandas as pd
import numpy as np
 
# Location of the current script 
current_folder_path = os.path.dirname(__file__) 
 
# Location of your files 
your_file_path1 = os.path.join(current_folder_path, "File name 1") 
your_file_path2 = os.path.join(current_folder_path, "File name 2") 
 
# Here, I take reading csv files as an example, and delimiter is the delimiter between columns agreed internally 
df1 = pd.read_csv(your_file_path1, dtype={"column1": str, "column2": str}, delimiter="\t") 
df2 = pd.read_csv(your_file_path2, dtype={"column1": str, "column2": str}, delimiter="\t") 

Splicing and calculating on the files

Splicing the files is mainly a matter of using the merge and concat syntax. A few points worth emphasizing:

merge mainly corresponds to the inner join, outer join, left join and right join of SQL. concat is mainly used to simply stack DataFrames with the same structure (that is, the total number of rows grows); a small concat sketch follows the merge example below.

# Here, take the left join as an example, assuming only two files are being spliced 
ret_df = pd.merge(df1, df2, left_on=["column_name"], right_on=["column_name"], how="left") 
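
For completeness, here is a minimal concat sketch, assuming df1 and df2 share exactly the same columns:

# Stack the two DataFrames on top of each other and renumber the index
stacked_df = pd.concat([df1, df2], ignore_index=True)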

Initialize connection

Import the third-party library pymysql and initialize the connection

import pymysql

# Get a connection through the pymysql interface 
def mysql_conn(host, user, password, db, port=3306, charset="utf8"): 
  # Parameters are passed in explicitly 
  conn = pymysql.connect(host=host, user=user, password=password, database=db, port=port, charset=charset) 
  return conn 
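
A minimal usage sketch (the credentials below are placeholders); closing the connection in a finally block guarantees it is released even if something fails:

conn = mysql_conn("your db host", "your username", "your password", "db name")
try:
  with conn.cursor() as cursor:
    cursor.execute("select 1")  # simple connectivity check
    print(cursor.fetchone())
finally:
  conn.close()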

Converting the data to match the interface

Two points to keep in mind:

1. The insert should be wrapped in a transaction, so that a failure leaves the database unaffected.
2. The data has to be shaped into the format the interface expects. After looking it up, pymysql's cursor offers two methods for executing statements:

execute (single insert statement): executes one statement at a time, along the lines of: insert into table_name (column) values (value);

executemany (batch insert statement): executes the same statement once per row of parameters, along the lines of: insert into table_name (column1, column2, column3) values (value1, value2, value3);
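
A minimal sketch of the difference, assuming a cursor obtained as in the code below and a hypothetical table your_table(a, b):

# Single insert: one statement, one row of parameters
cursor.execute("insert into your_table (a, b) values (%s, %s)", ("x", 1))

# Batch insert: the same statement is run for every tuple in the sequence
cursor.executemany(
  "insert into your_table (a, b) values (%s, %s)",
  [("x", 1), ("y", 2), ("z", 3)],
)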

The specific implementation is as follows

# First create cursor to operate conn interface 
conn = mysql_conn("your db host", "your username", "your password", "db name") 
cursor = conn.cursor() 
# Open transaction 
conn.begin() 
 
#############      The process of constructing batch data            ############# 
 
# First construct the columns that are needed or that match the database 
columns = list(ret_df.columns) 
# You can delete unwanted columns or column names that do not exist in the database 
columns.remove("Listing") 
# Rebuild the df from the columns above; at this point every remaining column should be ready to be written to the database 
new_df = ret_df[columns].copy() 
 
# Construct columns that conform to the sql statement, because the sql statement is separated by commas (this corresponds to the above sql statement (column1, column2, column3)) 
columns = ','.join(list(new_df.columns)) 
 
# Construct the data corresponding to each column, corresponding to the above ((value1, value2, value3)) 
data_list = [tuple(i) for i in new_df.values] # Each tuple is one row of data; one tuple is generated per row of the df 
 
# Build the string of %s placeholders, one for each value in a row 
s_count = len(data_list[0]) * "%s," 
 
# Construct sql statement 
insert_sql = "insert into " + "Database table name" + " (" + columns + ") values (" + s_count[:-1] + ")" 
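
As a concrete illustration (the table and column names here are hypothetical), a df with columns a, b, c and a target table your_table would produce:

# columns    == "a,b,c"
# s_count    == "%s,%s,%s,"
# insert_sql == "insert into your_table (a,b,c) values (%s,%s,%s)"

executemany then substitutes one tuple from data_list into the placeholders for each row.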

Write data to database

This part is simple, so here is the code directly:

cursor.executemany(insert_sql, data_list) 
conn.commit() 
cursor.close() 
conn.close() 

Check whether the data was inserted successfully

If everything looks fine, you can read, process and calculate multiple files at the same time, and finally use multithreading to write the data to the database concurrently, which is very efficient! A small verification sketch is shown below, and a multithreading sketch follows the complete code.
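
A minimal verification sketch, reusing the placeholder table name from above; it simply checks that the row count grew by the number of rows sent:

conn = mysql_conn("your db host", "your username", "your password", "db name")
cursor = conn.cursor()
cursor.execute("select count(*) from " + "Database table name")
row_count = cursor.fetchone()[0]  # compare against the count before the insert plus len(data_list)
cursor.close()
conn.close()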

Complete code

import os
import pandas as pd
import numpy as np
import pymysql

# pymysql interface 
def mysql_conn(host, user, password, db, port=3306, charset="utf8"): 
  conn = pymysql.connect(host=host, user=user, password=password, database=db, port=port, charset=charset) 
  return conn 
 
 
# Location of the current script 
current_folder_path = os.path.dirname(__file__) 
 
# Location of your files 
your_file_path1 = os.path.join(current_folder_path, "File name 1") 
your_file_path2 = os.path.join(current_folder_path, "File name 2") 
 
# Here, I take reading csv files as an example, and delimiter is the delimiter between columns agreed internally 
df1 = pd.read_csv(your_file_path1, dtype={"column1": str, "column2": str}, delimiter="\t") 
df2 = pd.read_csv(your_file_path2, dtype={"column1": str, "column2": str}, delimiter="\t") 
# merge 
ret_df = pd.merge(df1, df2, left_on=["column_name"], right_on=["column_name"], how="left") 
 
# First create cursor to operate conn interface 
conn = mysql_conn("your db host", "your username", "your password", "db name") 
cursor = conn.cursor() 
# Open transaction 
conn.begin() 
 
# First construct the columns that are needed or that match the database 
columns = list(ret_df.columns) 
# You can delete unwanted columns or column names that do not exist in the database 
columns.remove("Listing") 
# Rebuild the df from the columns above; at this point every remaining column should be ready to be written to the database 
new_df = ret_df[columns].copy() 
 
# Construct columns that conform to the sql statement, because the sql statement is separated by commas (this corresponds to the above sql statement (column1, column2, column3)) 
columns = ','.join(list(new_df.columns)) 
 
# Construct the data corresponding to each column, corresponding to the above ((value1, value2, value3)) 
data_list = [tuple(i) for i in new_df.values] # Each tuple is one row of data; one tuple is generated per row of the df 
 
# Build the string of %s placeholders, one for each value in a row 
s_count = len(data_list[0]) * "%s," 
 
# Construct sql statement 
insert_sql = "insert into " + "Database table name" + " (" + columns + ") values (" + s_count[:-1] + ")" 
try: 
  cursor.executemany(insert_sql, data_list) 
  conn.commit() 
except Exception as e: 
  # If anything fails, roll back so the database is left untouched 
  conn.rollback() 
finally: 
  cursor.close() 
  conn.close() 
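
A rough sketch of the multithreaded variant mentioned earlier, assuming a hypothetical process_and_insert helper that runs the whole read-calculate-insert pipeline above for one file; each thread should open its own connection, since a pymysql connection should not be shared across threads:

from concurrent.futures import ThreadPoolExecutor

def process_and_insert(file_path):
  # Hypothetical helper: read the file, do the calculations,
  # open a fresh connection and run the insert logic shown above
  ...

file_paths = [your_file_path1, your_file_path2]
with ThreadPoolExecutor(max_workers=4) as pool:
  pool.map(process_and_insert, file_paths)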
