Background
Recently I handled a request for batch inserting data into a database. The requirement was roughly as follows.
The original program updated and modified the data through SQL stored procedures. Because of the large data volume, this put too much pressure on the database, so the program was refactored: the files are read and the data is calculated and processed in Python, and only the final results are written back to the database through an AWS Lambda call. This greatly reduces the load on the database and also cuts cost. The database operations involved are mainly INSERT and UPDATE.
Environment and library versions
Operating system: Linux
Python version: 3.7
Third-party libraries: pandas 1.0.5, pymysql 0.9.3
Standard library: os
Overall logic
In the final step, the data to be written to the database already lives in memory: the calculations performed after reading the files are all done in memory, so there is no need to write the results to a local file and read them back before writing to the database, which would only hurt performance. The overall flow is:
1. Read the files, combine them and run the calculations, producing a new DataFrame.
2. Initialize the connection to the database.
3. Convert the required DataFrame data into tuples (the exact format depends on how the third-party database driver supports batch operations).
4. Write the data to the database.
5. Check the contents of the database.
Step-by-step implementation and analysis
Reading the files
Reading a file only requires its path, but a few points deserve attention:
Absolute path: the simplest option; just pass the path string directly. The drawback is that you have to change it whenever the folder structure changes.
Relative path: I generally prefer to first locate the current script and then resolve the files relative to it. As long as the directory structure inside the package stays the same, nothing needs to change, even after deployment, because everything is resolved relative to the package location. This is very convenient.
Data types: pandas reads numeric-looking columns as float by default, so fields that look like numbers but actually need to be treated as strings should be converted explicitly (for example via the dtype argument).
import os

import pandas as pd
import numpy as np

# Location of the current script
current_folder_path = os.path.dirname(__file__)
# Location of your files
your_file_path1 = os.path.join(current_folder_path, "File name 1")
your_file_path2 = os.path.join(current_folder_path, "File name 2")
# Reading csv files as an example; delimiter is the column separator agreed on internally
df1 = pd.read_csv(your_file_path1, dtype={"column1": str, "column2": str}, delimiter="\t")
df2 = pd.read_csv(your_file_path2, dtype={"column1": str, "column2": str}, delimiter="\t")
Combining and calculating on the files
Combining the files mainly comes down to the merge and concat APIs. A few points worth emphasizing:
merge corresponds to the inner, outer, left and right joins of SQL; concat is mainly used to simply stack DataFrames that share the same structure (i.e. the total number of rows grows). A small concat sketch follows the merge example below.
# Left join as an example, assuming only two files are being combined
ret_df = pd.merge(df1, df2, left_on=["column_name"], right_on=["column_name"], how="left")
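Since only merge is shown above, here is a minimal concat sketch for comparison; it simply reuses the two DataFrames from the reading step and assumes they have the same columns:

# Stack two DataFrames with the same structure; the result contains the rows of both
combined_df = pd.concat([df1, df2], ignore_index=True)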
Initialize connection
Import the third-party library pymysql and initialize the connection:
import pymysql

# Get a connection through the pymysql interface
def mysql_conn(host, user, password, db, port=3306, charset="utf8"):
    # Parameter-passing version
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=db, port=port, charset=charset)
    return conn
Converting the data for the driver interface
1. The insert should be wrapped in a transaction, so that a failure leaves the database unaffected.
2. The data must be built in a format that matches the driver interface. Looking into it, pymysql provides two interfaces for executing statements:
1. execute (single insert): executes one statement at a time, similar to: insert into table_name (column) values (value);
2. executemany (batch insert): executes the same statement against a sequence of parameter rows, similar to: insert into table_name (column1, column2, column3) values (value1, value2, value3);
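As a minimal sketch of the difference between the two interfaces (the table and column names here are placeholders, and cursor is assumed to come from an already opened connection):

# execute: insert a single row
cursor.execute("insert into table_name (column1, column2) values (%s, %s)", ("v1", "v2"))
# executemany: apply the same statement to a sequence of parameter rows
cursor.executemany(
    "insert into table_name (column1, column2) values (%s, %s)",
    [("v1", "v2"), ("v3", "v4")],
)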
The specific implementation is as follows
# First create a cursor to operate on the connection
conn = mysql_conn("your db host", "your username", "your password", "db name")
cursor = conn.cursor()
# Open a transaction
conn.begin()

############# Building the batch data #############
# First take the columns that are needed / that match the database table
columns = list(ret_df.columns)
# Remove columns that are not wanted or that do not exist in the database table
columns.remove("Listing")
# Rebuild the df with the columns above; every remaining column will be written to the database
new_df = ret_df[columns].copy()
# Build the column part of the sql statement, comma separated (corresponds to (column1, column2, column3) above)
columns = ','.join(list(new_df.columns))
# Build the data for each row, corresponding to (value1, value2, value3) above
# Each tuple is one row; the number of tuples equals the number of rows in the df
data_list = [tuple(i) for i in new_df.values]
# Build the placeholder string: one %s per value in a row
s_count = len(data_list[0]) * "%s,"
# Build the sql statement
insert_sql = "insert into " + "Database table name" + " (" + columns + ") values (" + s_count[:-1] + ")"
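To make the string construction concrete: assuming, purely as an example, that new_df ends up with three columns a, b and c and the table is called t, insert_sql would be the string insert into t (a,b,c) values (%s,%s,%s), and executemany then binds every tuple in data_list to those three placeholders.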
Write data to database
This part is simple; the code is below.
cursor.executemany(insert_sql, data_list)
conn.commit()
cursor.close()
conn.close()
Check whether the data was inserted successfully
If everything checks out, you can read, calculate and write multiple files at the same time, and finally use multithreading to write to the database concurrently, which is very efficient!
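As a quick sanity check, a minimal sketch of verifying the row count after the insert; the connection parameters and table name are placeholders, and mysql_conn is the helper defined above:

check_conn = mysql_conn("your db host", "your username", "your password", "db name")
check_cursor = check_conn.cursor()
# Count the rows now present in the target table
check_cursor.execute("select count(*) from " + "Database table name")
print(check_cursor.fetchone())
check_cursor.close()
check_conn.close()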
Complete code
import os

import pandas as pd
import numpy as np
import pymysql

# pymysql interface
def mysql_conn(host, user, password, db, port=3306, charset="utf8"):
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=db, port=port, charset=charset)
    return conn

# Location of the current script
current_folder_path = os.path.dirname(__file__)
# Location of your files
your_file_path1 = os.path.join(current_folder_path, "File name 1")
your_file_path2 = os.path.join(current_folder_path, "File name 2")
# Reading csv files as an example; delimiter is the column separator agreed on internally
df1 = pd.read_csv(your_file_path1, dtype={"column1": str, "column2": str}, delimiter="\t")
df2 = pd.read_csv(your_file_path2, dtype={"column1": str, "column2": str}, delimiter="\t")

# merge
ret_df = pd.merge(df1, df2, left_on=["column_name"], right_on=["column_name"], how="left")

# First create a cursor to operate on the connection
conn = mysql_conn("your db host", "your username", "your password", "db name")
cursor = conn.cursor()
# Open a transaction
conn.begin()

# First take the columns that are needed / that match the database table
columns = list(ret_df.columns)
# Remove columns that are not wanted or that do not exist in the database table
columns.remove("Listing")
# Rebuild the df with the columns above; every remaining column will be written to the database
new_df = ret_df[columns].copy()
# Build the column part of the sql statement, comma separated (corresponds to (column1, column2, column3) above)
columns = ','.join(list(new_df.columns))
# Build the data for each row, corresponding to (value1, value2, value3) above
# Each tuple is one row; the number of tuples equals the number of rows in the df
data_list = [tuple(i) for i in new_df.values]
# Build the placeholder string: one %s per value in a row
s_count = len(data_list[0]) * "%s,"
# Build the sql statement
insert_sql = "insert into " + "Database table name" + " (" + columns + ") values (" + s_count[:-1] + ")"

try:
    cursor.executemany(insert_sql, data_list)
    conn.commit()
    cursor.close()
    conn.close()
except Exception as e:
    # Roll back on failure so the database is left unaffected
    conn.rollback()
    cursor.close()
    conn.close()