NumPy & Pandas

Simply put, NumPy is used for array and matrix computation, while Pandas is built on top of NumPy and enriches and simplifies its operations.

Numpy

Basic operation

import numpy as np

array = np.array([[1, 2, 3],
                  [2, 3, 4]], dtype=np.int64)  # dtype sets the element data type (np.int was removed in NumPy 1.24)

a = np.zeros((3, 4))  # Zero matrix of 3 rows and 4 columns
vec = np.arange(10, 20)  # Vector of 10, 11, ..., 19 (the stop value is exclusive)
mat_range = np.arange(12).reshape((3, 4))  # 3x4 matrix filled with 0 to 11
vec_line = np.linspace(1, 10, 3)  # 3 evenly spaced values from 1 to 10, so adjacent elements differ by the same amount

print(array.ndim)  # dimension
print(array.shape)  # (2, 3)
print(array.size)  # Element number
print(array.dtype)

b = a.copy() # deep copy
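The distinction matters because plain slicing does not copy: a slice is a view that shares memory with the original array, while copy() allocates new memory. A minimal sketch (variable names are illustrative):

```python
import numpy as np

a = np.zeros((3, 4))
view = a[0]        # slicing returns a view that shares memory with a
deep = a.copy()    # copy() allocates new, independent memory

a[0, 0] = 7
print(view[0])     # the view sees the change: 7.0
print(deep[0, 0])  # the deep copy does not: 0.0
```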

Basic Operations (1)

a = np.array([10, 20, 30, 40]).reshape((2, 2))
b = np.arange(4).reshape((2, 2))
print(a.T) # Matrix transposition

c = 10 * np.sin(a)  # Take the sin of each element of a, then multiply by 10

print(b < 3)  # Element-wise comparison: [[ True  True] [ True False]]

c = a*b  # Multiplication of Corresponding Elements
c = np.dot(a, b)  # Matrix multiplication

rand = np.random.random((2, 4))  # 2x4 matrix of random values in [0, 1)

print(np.sum(a, axis=0))  # Sum along axis 0 (collapsing the rows), giving one sum per column; axis indexes into shape
np.min(rand)
np.max(rand)
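Since axis is the index into shape, axis=0 collapses the row axis and axis=1 collapses the column axis. A small illustration:

```python
import numpy as np

a = np.array([[10, 20],
              [30, 40]])
# axis=0 collapses the rows: one sum per column
print(np.sum(a, axis=0))  # [40 60]
# axis=1 collapses the columns: one sum per row
print(np.sum(a, axis=1))  # [30 70]
```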

Basic Operations (2)

a = np.arange(2, 14).reshape((3, 4))
b = np.random.random((3, 4))

print(a)
print(np.argmax(a))  # Index of the maximum element in the flattened array
print(np.argmin(a))
print(np.average(a))  # average value

print(np.cumsum(a))  # Cumulative (prefix) sum
print(np.diff(a))  # First order difference
print(np.nonzero(a))  # Indices of the non-zero elements, one array per axis

print(np.clip(a, 5, 9))  # Limit elements to [5,9]
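Because argmax returns a flat index, np.unravel_index can convert it back into row/column coordinates. A short sketch:

```python
import numpy as np

a = np.arange(2, 14).reshape((3, 4))
flat = np.argmax(a)                       # index into the flattened array
coords = np.unravel_index(flat, a.shape)  # converted to (row, col)
print(flat, coords)  # 11 (2, 3)
```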

Index and iteration

a = np.arange(3, 15).reshape((3, 4))

print(a[2, 1])  # Element at row 2, column 1 (0-indexed)
print(a[:, 1])  # All rows of column 1, i.e. the second column

for row in a:
    print(row)

for col in a.T:
    print(col)

for item in a.flat:  # Traversing through each element
    print(item)

Merge and Split

a = np.array([1, 1, 1])
b = np.array([2, 2, 2])

print(a[:, np.newaxis])  # Turn a 1-dimensional row vector into a 2-dimensional column vector

print(np.vstack((a, b)))  # Stack vertically (top to bottom)
print(np.hstack((a, b)))  # Stack horizontally (left to right)

a = np.arange(12).reshape((3, 4))


print(np.split(a, 2, axis=1))  # Split the columns into two equal parts
print(np.array_split(a, 3, axis=1))  # array_split allows unequal splits: here the columns are split into three parts
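NumPy also offers vsplit and hsplit as shortcuts for splitting along axis 0 and axis 1 respectively; a quick sketch:

```python
import numpy as np

a = np.arange(12).reshape((3, 4))

top, mid, bottom = np.vsplit(a, 3)  # shortcut for np.split(a, 3, axis=0)
left, right = np.hsplit(a, 2)       # shortcut for np.split(a, 2, axis=1)
print(left)  # the first two columns of a
```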



Pandas

If NumPy is like a list, then Pandas is like a dict.

Simply put, when the data grows large, accessing elements purely through NumPy's numeric indexes carries little meaning. We want to give a row, a column, or even a single cell a name, which simplifies operations and makes each line of code more semantic and readable. That is what Pandas provides.

Basic operation

Before learning the basic operation, we must make it clear that:

Generally speaking, matrices organize data row by row. What does that mean? Multiple records with the same structure are usually represented as multiple rows of a matrix, so viewed by column, each column holds the same field across all records. That is why labels in Pandas are called indexes: they are not essentially different from indexes in the ordinary sense; both are serial numbers that identify data.
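For instance, three records with the same structure become three rows, and each column holds one field across all records (the record and field names here are made up for illustration):

```python
import pandas as pd

records = pd.DataFrame(
    [[25, 170], [30, 165], [28, 180]],  # one row per record
    index=['alice', 'bob', 'carol'],    # row labels name the records
    columns=['age', 'height'])          # column labels name the fields

print(records['age'])  # one column = the same field for every record
```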

The two most commonly used data structures in Pandas are Series and DataFrame.

import pandas as pd
import numpy as np

s = pd.Series([1, 3, 6, 7, 44, np.nan, 3.4], index=[7]*7)  # A labeled one-dimensional array; here every element shares the label 7

data = pd.DataFrame(np.arange(12).reshape(
    (3, 4)), index=np.arange(3), columns=['a', 'b', 'c', 'd'])
# Equivalent to a matrix with row labels and column labels, index represents row labels, columns represents column labels

print(data.index)  # Row labels
print(data.columns)  # Column labels
print(data.values)

print(data.describe())  # Output statistics by column

data = data.sort_index(axis=0, ascending=False)  # Sort rows labels from large to small

data = data.sort_values(by='a', axis=0)  # Sort the rows by the values in column 'a'
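To make the contrast concrete, a small sketch of sorting by label versus sorting by value:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=np.arange(3), columns=['a', 'b', 'c', 'd'])

by_index = data.sort_index(axis=0, ascending=False)  # row labels 2, 1, 0
by_col = data.sort_values(by='a', ascending=False)   # largest 'a' first

print(by_index.index.tolist())  # [2, 1, 0]
```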

Data Selection and Change

dates = pd.date_range('20190329', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])

print(df['a'], df.a, sep='\n')  # Two ways to select the column labeled 'a'
print(df[0:3], df['20190329':'20190331'], sep='\n')  # Both slices output the first three rows (label slices are inclusive)


print(df.loc['20190329'])
print(df.loc[:, ['a', 'b']])   # loc selects by label name


print(df.iloc[3:5, 1:3])  # iloc is based on the absolute number of rows and columns (index)

print(df[df.a > 8])  # Select the rows where column a is greater than 8
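Boolean conditions can also be combined with & and |; each side needs its own parentheses. A sketch using the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190329', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])

# rows where both conditions hold (use & / |, not `and` / `or`)
sel = df[(df.a > 4) & (df.b < 21)]
print(sel)
```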

# change data
df.iloc[2, 2] = 111
df.loc['20190329', 'b'] = 333
df.loc[df.a > 4, 'a'] = 0  # Set the entries of column a greater than 4 to 0 (loc avoids chained-assignment issues)
df['f'] = np.nan  # Dynamic addition of new columns

Dealing with missing data

dates = pd.date_range('20190329', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])


df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

print(df.fillna(value=0))  # Fill nan entries with 0
print(df.dropna(axis=0, how='any'))  # Drop rows that contain any nan


print(df.isnull())  # True where nan, False elsewhere
print(df.isnull().values.any())  # Whether any nan exists in the whole frame
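fillna also accepts a dict, filling each column with a different value; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  columns=['a', 'b', 'c', 'd'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

filled = df.fillna(value={'b': 0, 'c': -1})  # per-column fill values
print(filled.iloc[0, 1], filled.iloc[1, 2])  # 0.0 -1.0
```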

Simple IO

data = pd.read_csv('data.csv', sep=',')

pd.to_pickle(data, 'data.pickle')

data = pd.read_pickle('data.pickle')
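The same pattern works for CSV output; a round trip through to_csv and read_csv (the filename is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('demo.csv', index=False)  # write without the row index
back = pd.read_csv('demo.csv')      # read it back
print(back.equals(df))              # True
```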

Merge

import pandas as pd
import numpy as np


# concat

df1 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4))*1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4))*2, columns=['a', 'b', 'c', 'd'])

print(df1, df2, df3, sep='\n\n')

res = pd.concat([df1, df2, df3], axis=0,
                ignore_index=True, join='inner')  # Merge by row (i.e. vertically) and renumber the index
# join='inner' keeps only the columns common to all frames; 'outer' keeps all columns and fills the gaps with nan


res = pd.concat([res, pd.Series([0, 1, 2, 3], index=[
    'a', 'b', 'c', 'd']).to_frame().T], ignore_index=True)  # Append a Series as a new row (DataFrame.append was removed in pandas 2.0)


# Key-based merging uses the merge function, which closely resembles inner and outer joins in SQL.
# There is no need to go through every parameter here; consult the documentation when you need it.
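A minimal sketch of merge, joining two frames on a shared key column (the column names are made up):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'y': [4, 5, 6]})

inner = pd.merge(left, right, on='key', how='inner')  # like SQL INNER JOIN: keys in both
outer = pd.merge(left, right, on='key', how='outer')  # like SQL FULL OUTER JOIN: all keys, gaps become nan
print(inner)
```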

Posted by woody79 on Tue, 07 May 2019 05:50:40 -0700