Simply put, Numpy is used for matrix computation, while Pandas is built on top of Numpy and both enriches and simplifies its operations.
Numpy
Array creation and attributes
import numpy as np

array = np.array([[1, 2, 3], [2, 3, 4]], dtype=np.int64)  # dtype is the element data type (np.int was removed from newer NumPy releases)
a = np.zeros((3, 4))                       # zero matrix with 3 rows and 4 columns
vec = np.arange(10, 20)                    # vector of the integers from 10 to 19
mat_range = np.arange(12).reshape((3, 4))  # 3 x 4 matrix of the integers from 0 to 11
vec_line = np.linspace(1, 10, 3)           # vector of 3 evenly spaced elements from 1 to 10
print(array.ndim)    # number of dimensions
print(array.shape)   # (2, 3)
print(array.size)    # number of elements
print(array.dtype)
b = a.copy()         # deep copy
Basic operations (1)
a = np.array([10, 20, 30, 40]).reshape((2, 2))
b = np.arange(4).reshape((2, 2))
print(a.T)             # matrix transpose
c = 10 * np.sin(a)     # take sin of every element of a, then multiply by 10
print(b < 3)           # element-wise comparison: [[ True  True] [ True False]]
c = a * b              # element-wise multiplication
c = np.dot(a, b)       # matrix multiplication
rand = np.random.random((2, 4))  # random matrix
print(np.sum(a, axis=0))  # sum along axis 0; axis is the position in shape that gets collapsed
np.min(rand)
np.max(rand)
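To make the axis parameter concrete, here is a minimal sketch (the array values are chosen only for illustration): axis names the position in shape that the reduction collapses.

import numpy as np

a = np.array([[10, 20],
              [30, 40]])
print(a.shape)            # (2, 2)
print(np.sum(a, axis=0))  # collapse axis 0: [40 60], one sum per column
print(np.sum(a, axis=1))  # collapse axis 1: [30 70], one sum per row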
Basic operations (2)
a = np.arange(2, 14).reshape((3, 4))
b = np.random.random((3, 4))
print(a)
print(np.argmax(a))      # index of the maximum element (in the flattened array)
print(np.argmin(a))
print(np.average(a))     # average value
print(np.cumsum(a))      # prefix sums
print(np.diff(a))        # first-order differences
print(np.nonzero(a))     # indices of the non-zero elements
print(np.clip(a, 5, 9))  # clamp every element into the range [5, 9]
Indexing and iteration
a = np.arange(3, 15).reshape((3, 4))
print(a[2, 1])       # element at row index 2, column index 1
print(a[:, 1])       # the column with index 1
for row in a:        # iterate over rows
    print(row)
for col in a.T:      # iterate over columns (rows of the transpose)
    print(col)
for item in a.flat:  # iterate over every element
    print(item)
Merge and split
a = np.array([1, 1, 1])
b = np.array([2, 2, 2])
print(a[:, np.newaxis])   # turn a 1-D row vector into a 2-D column vector
print(np.vstack((a, b)))  # stack vertically (top to bottom)
print(np.hstack((a, b)))  # stack horizontally (left to right)
a = np.arange(12).reshape((3, 4))
print(np.split(a, 2, axis=1))        # split the columns into two equal parts
print(np.array_split(a, 3, axis=1))  # split the columns into three (possibly unequal) parts
Pandas
If numpy is equivalent to a list, Pandas is equivalent to a dict.
Simply put, once the data gets large, accessing elements purely by numeric index, as numpy does, becomes rather meaningless. We want to give a row, a column, or even a single cell a name, so that operations are simpler and every line of code is more semantic and readable. That is what Pandas provides.
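As a minimal illustration of the difference (the names Alice, Bob, math, and english are invented for the example): the same value is reached by a bare position in numpy, but by a meaningful label in Pandas.

import numpy as np
import pandas as pd

scores = np.array([[90, 85], [70, 95]])
print(scores[0, 1])  # 85 -- but what do row 0 and column 1 mean? The reader has to remember.

df = pd.DataFrame(scores, index=['Alice', 'Bob'], columns=['math', 'english'])
print(df.loc['Alice', 'english'])  # 85 -- the same value, but the code explains itself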
Basic operations
Before learning the basic operations, one thing must be made clear:
Generally speaking, when data is organized as a matrix, the row is the basic unit. What does that mean? Multiple records that share the same structure are usually represented as multiple rows of the matrix. Looked at column by column, each column therefore holds the same kind of data, one value per record. That is also why labels in Pandas are called indexes: they are not essentially different from ordinary numeric indexes; both are just identifiers used to locate data.
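For instance, consider a hypothetical table of users (the names and fields are invented for illustration): each row is one record, each column holds one kind of value, and the row index is just a label attached to each record.

import pandas as pd

users = pd.DataFrame(
    {'name': ['Tom', 'Jerry'],   # each column holds one kind of data across all records
     'age': [20, 22]},
    index=['u001', 'u002'])      # the row labels play the same role as plain numeric indexes
print(users.loc['u002', 'age'])  # 22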
The two most commonly used data types in Pandas are Series and DataFrame.
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 6, 7, 44, np.nan, 3.4], index=[7] * 7)  # a labeled one-dimensional ndarray (here every element gets the label 7)
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=np.arange(3),
                    columns=['a', 'b', 'c', 'd'])
# Equivalent to a matrix with row labels and column labels:
# index holds the row labels, columns holds the column labels.
print(data.index)       # row labels
print(data.columns)     # column labels
print(data.values)
print(data.describe())  # summary statistics, computed per column
data = data.sort_index(axis=0, ascending=False)  # sort the rows by row label, descending
data = data.sort_values(by='a', axis=0)          # sort the rows by the values in column 'a'
Data selection and modification
dates = pd.date_range('20190329', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['a', 'b', 'c', 'd'])
print(df['a'], df.a, sep='\n')   # both print the column labeled 'a'
print(df[0:3], df['20190329':'20190331'], sep='\n')  # both print the first three rows
print(df.loc['20190329'])
print(df.loc[:, ['a', 'b']])     # loc selects by label
print(df.iloc[3:5, 1:3])         # iloc selects by absolute row/column position (integer index)
print(df[df.a > 8])              # keep only the rows where column a is greater than 8
# change data
df.iloc[2, 2] = 111
df.loc['20190329', 'b'] = 333
df.a[df.a > 4] = 0               # set the values of column a that are greater than 4 to 0
df['f'] = np.nan                 # dynamically add a new column
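To make the loc/iloc distinction concrete, here is a small sketch that continues with the df defined above: both lines reach the same cell, once by label and once by position.

print(df.loc['20190330', 'b'])  # select by labels: the row labeled '20190330', the column labeled 'b'
print(df.iloc[1, 1])            # select by positions: second row, second column -- the same cell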
Dealing with missing data
dates = pd.date_range('20190329', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df.fillna(value=0))            # fill nan values with 0
print(df.dropna(axis=0, how='any'))  # drop any row that contains a nan
print(df.isnull())                   # True where the value is nan, False elsewhere
print(np.any(df.isnull()) == True)   # check whether any nan exists in the whole frame
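As a small aside (a sketch, still using the df above, not part of the original snippet), the same whole-frame check can also be written with pandas' own methods:

print(df.isnull().values.any())  # True if any nan exists anywhere in the frame
print(df.isnull().sum())         # number of nan values in each column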
Simple IO
data = pd.read_csv('data.csv', sep=',')
pd.to_pickle(data, 'data.pickle')
data = pd.read_pickle('data.pickle')
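For the write direction, the counterpart of read_csv is to_csv. A minimal sketch, continuing with the data frame read above ('out.csv' is just a placeholder filename):

data.to_csv('out.csv', index=False)  # write the frame back to csv, without the row index column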
Merge
import pandas as pd
import numpy as np

# concat
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
print(df1, df2, df3, sep='\n\n')
# Merge row-wise (i.e. vertically) and recompute the index.
# join='inner' keeps only the columns shared by all frames; join='outer' keeps every column and fills the gaps with nan.
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True, join='inner')
# Append a single Series as a new row.
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job.)
new_row = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
res = pd.concat([res, new_row.to_frame().T], ignore_index=True)
# Key-based merging is done with the merge function, which closely resembles inner and outer joins in SQL.
# There is no need to go through every parameter here; consult the documentation when you need it.
# pd.merge()
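A minimal sketch of key-based merging (the two frames and the 'key' column are invented for the example), mirroring SQL inner and outer joins:

left = pd.DataFrame({'key': ['K0', 'K1'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['K1', 'K2'], 'y': [3, 4]})
print(pd.merge(left, right, on='key', how='inner'))  # only K1 survives, like an SQL inner join
print(pd.merge(left, right, on='key', how='outer'))  # K0, K1, K2 all kept; missing values become nan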