Data science pandas

numpy has been able to help us process data and solve our data analysis problems in combination with matplotlib. What is the purpose of pandas learning?

numpy can help us deal with numerical data, but that's not enough

Many times, in addition to numerical values, our data also includes strings and time series

For example, we get the data stored in the database through the crawler

For example, in the previous youtube example, in addition to values, there are also country information, video tag information, title information, etc

Therefore, numpy can help us deal with numerical values, but pandas can help us deal with other types of data in addition to dealing with numerical values (based on numpy)

What is pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Basic operation of pandas

Common data types of pandas

1.Series one-dimensional, tagged array

2.DataFrame, 2D, Series container

Creation of pandas Series

In [2]: import string

In [3]: import numpy as np

In [4]: import pandas as pd

In [5]: t=pd.Series(np.arange(10),index=list(string.ascii_uppercase[:10]))

In [6]: t
Out[6]:
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int32

In [7]: type(t)
Out[7]: pandas.core.series.Series

Pay attention to the following questions:

What can PD. Series do and what types of data can be passed in to make it a series structure

What is index and where is it? What is index for our common database data or ndarray

How to assign an index to a set of data

In [9]: #Dictionary derivation creates a dictionary and a series through the dictionary. Note that the index is the key of the dictionary

In [10]: a={string.ascii_uppercase[i]:i for i in range(10)}

In [11]: a
Out[11]:
{'A': 0,
 'B': 1,
 'C': 2,
 'D': 3,
 'E': 4,
 'F': 5,
 'G': 6,
 'H': 7,
 'I': 8,
 'J': 9}

After making other indexes for them again, if they can correspond to each other, their values will be taken. If not, Nan will be returned
Why is the type float? pandas will automatically change the dtype type of series according to the data type

In [14]: pd.Series(a,index=list(string.ascii_uppercase[5:15]))
Out[14]:
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64

#After making other indexes for them again, if they can correspond to each other, their values will be taken. If not, Nan will be returned 
#Why is the type float? pandas will automatically change the dtype type of series according to the data type

Series Slicing and indexing of pandas

In [15]: t
Out[15]:
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int32

In [16]: t[2:10:2]
Out[16]:
C    2
E    4
G    6
I    8
dtype: int32

In [17]: t[2:10:2]  #  Index 2-10 in steps of 2
Out[17]:
C    2
E    4
G    6
I    8
dtype: int32

In [18]: t[1] #Value with index subscript 1
Out[18]: 1

In [20]: t[[2,3,6]] #Value with index subscript 2 3 6
Out[20]:
C    2
D    3
G    6
dtype: int32

In [21]: t[t>4]
Out[21]:
F    5
G    6
H    7
I    8
J    9
dtype: int32

In [22]: t["F"]  #Specifies that the index value is the value corresponding to F
Out[22]: 5

In [23]: t[["A","F","g"]]   #The specified index value is AFg. There are no corresponding values, so nan is returned
E:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py:851: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]
Out[23]:
A    0.0
F    5.0
g    NaN
dtype: float64

Slice: pass in start directly end or step length

Index: the serial number or index is directly passed in when a When there are multiple, the serial number or index list is passed in

For an unfamiliar series type, how do we know its index and specific value?

In [24]: t.index
Out[24]: Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')

In [25]: t.values
Out[25]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [26]: type(t.index)
Out[26]: pandas.core.indexes.base.Index

In [27]: type(t.values)
Out[27]: numpy.ndarray

A Series object consists essentially of two numbers

An array constitutes the key (index, index) of the object, and an array constitutes the values of the object Key - > value

Many methods of ndarrary can be applied to series types, such as argmax and clip

series has a where method, but the result is different from ndarray

Reading external data from pandas

Now suppose we have a group of statistics about dog names, so what should we do to observe this group of data?

Data source: https://www.kaggle.com/new-york-city/nyc-dog-names/data

Our set of data is stored in csv, and we use PD. Read directly_ csv

import pandas as pd

#pandas reads files in csv
df = pd.read_csv("./dogNames2.csv")

print(df[(800<df["Count_AnimalName"])|(df["Count_AnimalName"]<1000)])

Output results

It is somewhat different from what we thought. We thought it would be a Series type, but it is a DataFrame. Next, let's understand this data type

However, there is another problem:

How do we use data in databases such as mysql or mongodb?

pd.read_sql(sql_sentence,connection)

So, what about mongodb?

DataFrame of pandas

DataFrame objects have both row and column indexes

Row index indicates different rows. The horizontal index is called index, 0 axis, axis=0

Column index, columns with different table names, vertical index, called columns, 1 axis, axis=1

So here comes the question...:

1. What is the relationship between dataframe and Series?

2. If series can import a dictionary, can DataFrame import a dictionary as data? So can mongodb data be passed in like this?

3. What can we do for a dataframe type with both row index and column index

Underlying properties of DataFrame

df.shape # rows and columns df.dtype # column data type

df.ndim # data dimension df.index # row index

Df.colonies # column index df.values # object value, 2D array of ndarray

In [32]: t
Out[32]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [33]: t.shape
Out[33]: (3, 4)

In [34]: t.dtypes #Column data type
Out[34]:
W    int32
X    int32
Y    int32
Z    int32
dtype: object

In [35]: t.ndim #Data dimension
Out[35]: 2

In [36]: t.index #Row index
Out[36]: Index(['A', 'B', 'C'], dtype='object')

In [37]: t.index #Column index
Out[37]: Index(['A', 'B', 'C'], dtype='object')

In [38]: t.values #Object value, 2D array of ndarray
Out[38]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Query the overall situation of DataFrame

t.head(2) # displays the header lines, with 5 lines by default

t.tail(2) ## Display the last few lines, 5 lines by default

t.info() # Overview of related information: number of rows, number of columns, column index, number of non null values of columns, column type, memory occupation

t.describe() # fast comprehensive statistical results: count, mean, standard deviation, maximum value, quarter number and minimum value

In [39]: t.head(2) #Display the first few lines, 5 lines by default
Out[39]:
   W  X  Y  Z
A  0  1  2  3
B  4  5  6  7

In [40]: t.tail(2)  ##Display the last few lines, 5 lines by default
Out[40]:
   W  X   Y   Z
B  4  5   6   7
C  8  9  10  11

In [41]: t.info()   #Overview of related information: number of rows, number of columns, column index, number of non null values of columns, column type, memory occupation
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, A to C
Data columns (total 4 columns):
W    3 non-null int32
X    3 non-null int32
Y    3 non-null int32
Z    3 non-null int32
dtypes: int32(4)
memory usage: 72.0+ bytes

In [42]: t.describe() #Fast comprehensive statistical results: count, mean, standard deviation, maximum value, quarter number and minimum value
Out[42]:
         W    X     Y     Z
count  3.0  3.0   3.0   3.0
mean   4.0  5.0   6.0   7.0
std    4.0  4.0   4.0   4.0
min    0.0  1.0   2.0   3.0
25%    2.0  3.0   4.0   5.0
50%    4.0  5.0   6.0   7.0
75%    6.0  7.0   8.0   9.0
max    8.0  9.0  10.0  11.0

Many students must want to know what are the top names used most frequently?

df.sort_values(by="Count_AnimalName",ascending=False)

In [43]: t
Out[43]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

#Sort by "Y" in descending order
In [44]: t = t.sort_values(by="Y",ascending=False)

In [45]: t
Out[45]:
   W  X   Y   Z
C  8  9  10  11
B  4  5   6   7
A  0  1   2   3

In [46]: t[:2]
Out[46]:
   W  X   Y   Z
C  8  9  10  11
B  4  5   6   7

In [47]: t["Y"]
Out[47]:
C    10
B     6
A     2
Name: Y, dtype: int32

Row or column of pandas

Just now we know how to sort the data according to a certain row or column. Now we want to study the data with the top 100 usage times separately. What should we do?

df_sorted = df.sort_values(by="Count_AnimalName")

df_sorted[:100]

In [48]: t_sorted = t.sort_values(by="Y")

In [50]: t_sorted
Out[50]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [49]: t_sorted[:2]
Out[49]:
   W  X  Y  Z
A  0  1  2  3
B  4  5  6  7

How do we choose a column? df[" Count_AnimalName "]

What if we choose row and column at the same time? df[:100][" Count_AnimalName "]

In [52]: t[:2]["Y"]  #Intersection of the first 2 rows and the "Y" column
Out[52]:
C    10
B     6
Name: Y, dtype: int32

loc of pandas

There are more selection methods optimized by pandas:

1.df.loc index row data by tag

2.df.iloc get row data by location

In [59]: t
Out[59]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [60]:  t.loc["A","Y"]
Out[60]: 2

In [61]:  t.loc["A",["W","Z"]]
Out[61]:
W    0
Z    3
Name: A, dtype: int32

In [62]: type(t.loc["A",["W","Z"]])
Out[62]: pandas.core.series.Series

Select multiple rows and columns for the interval

In [64]: t.loc[["A","C"],["W","Z"]]
Out[64]:
   W   Z
A  0   3
C  8  11

In [65]: t.loc["A":,["W","Z"]]
Out[65]:
   W   Z
A  0   3
B  4   7
C  8  11

The colon is closed in loc, that is, the data behind the colon will be selected

In [67]: t.loc["A":"C",["W","Z"]]
Out[67]:
   W   Z
A  0   3
B  4   7
C  8  11

iloc of pandas

In [68]: t.iloc[1:3,[0,3]]  #
Out[68]:
   W   Z
B  4   7
C  8  11

In [69]: t.iloc[1:3,0:3]
Out[69]:
   W  X   Y
B  4  5   6
C  8  9  10

The process of assigning and changing data:

In [70]: t
Out[70]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [71]:  t.loc["A","Y"]=100

In [72]: t
Out[72]:
   W  X    Y   Z
A  0  1  100   3
B  4  5    6   7
C  8  9   10  11

In [73]: t.iloc[1:3,0:3] =200

In [74]: t
Out[74]:
     W    X    Y   Z
A    0    1  100   3
B  200  200  200   7
C  200  200  200  11

Boolean index of pandas

Back to the question of dog names before, if we want to find the names of all dogs that have been used more than 800 times, what should we choose

import pandas as pd

#pandas reads files in csv
df = pd.read_csv("./dogNames2.csv")

#print(df[(800<df["Count_AnimalName"])|(df["Count_AnimalName"]<1000)])
print(df[df["Count_AnimalName"]>800])

Output results

C:\Users\MLoong\PycharmProjects\pythonProject6\venv\Scripts\python.exe E:/Data analyst learning/21-Data analysis data-new/Data analysis data/day04/code/page108.py
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
3251        COCO               852
9140         MAX              1153
12368      ROCKY               823

Back to the question of dog names before, if we want to find all dog names that have been used more than 700 times and the length of the name string is greater than 4, how should we choose?

import pandas as pd

#pandas reads files in csv
df = pd.read_csv("./dogNames2.csv")

print(df[(df["Count_AnimalName"]>700)&(df["Row_Labels"].str.len()>4)])

Join query symbol

& And
| or

Note: different conditions need to be enclosed in parentheses

String method of pandas

1,cat() Splice string

In [76]: pd.Series(['a','b','c']).str.cat((['A','B','C']),sep=',')
Out[76]:
0    a,A
1    b,B
2    c,C
dtype: object

2,split() Split string

In [79]: s = pd.Series(['a_b_c','c_d_e',np.nan,'f_g_h'])

In [80]: s.str.split('_')
Out[80]:
0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

3. get() gets the string at the specified location

In [83]: s.str.get(2)
Out[83]:
0      b
1      d
2    NaN
3      g
dtype: object

4,join() Each character is spliced with a string of points, which is not commonly used

5. Does contains() contain an expression

6. replace() replace

7. repeat() repeat

8. Pad (left and right)

9,center() Intermediate supplement

10,ljust() Right complement

11,rjust() Left complement

12. zfill() fill 0 on the left

13. wrap() adds a carriage return symbol at the specified position

14,slice() Cut the string at the start and end of a given point

15,slice_replace() Replaces the character at the specified position with the given string

16. count() counts the number of occurrences of a given word

In [84]: s.str.count("a")
Out[84]:
0    1.0
1    0.0
2    NaN
3    0.0
dtype: float64

17. Startswitch() determines whether to start with the given string

18,endswith() Determines whether to end with the given string

19,findall() Find all characters that match the regular expression and return them as an array

20. match() detects whether all strings or expressions match the points

28,lower() All lowercase
29,upper() All caps

30,find() Starting from the left, find the location of a given string

data processing

Processing of missing data

Look at the following set of data

There are usually two cases of missing data:

One is empty, None, etc. in pandas, it is Nan (the same as np.nan)

The other is that we set it to 0 in the blue box

How do we handle NaN data in numpy?

It's very easy for us to handle in pandas

Judge whether the data is NaN: pd.isnull(df),pd.notnull(df)

In [106]: t=pd.DataFrame(np.arange(12).reshape((3,4)),
     ...: index=list(string.ascii_uppercase[:3]),
     ...: columns=list(string.ascii_uppercase[-4:]))

In [107]: t
Out[107]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [108]: t1=t

In [109]: t.iloc[:2,[2,3]]=np.nan

In [110]: t
Out[110]:
   W  X     Y     Z
A  0  1   NaN   NaN
B  4  5   NaN   NaN
C  8  9  10.0  11.0

In [111]: pd.isnull(t)   #Judged as nan
Out[111]:
       W      X      Y      Z
A  False  False   True   True
B  False  False   True   True
C  False  False  False  False

 
In [113]: pd.notnull(t)  #Judged not to be nan
Out[113]:
      W     X      Y      Z
A  True  True  False  False
B  True  True  False  False
C  True  True   True   True

In [115]: t[pd.notnull(t["Y"])]
Out[115]:
   W  X     Y     Z
C  8  9  10.0  11.0

Processing method 1: delete the row and column dropna of NaN (axis = 0, how ='Any ', inplace = false)

Processing method 2: fill data, t.fillna (t.mean()), t.fillna (t.median()), t.fillna (0)

In [116]: t.dropna(axis=0)
Out[116]:
   W  X     Y     Z
C  8  9  10.0  11.0

In [117]: t.dropna(axis=0,how="all")  # Deletion of nan in any row
Out[117]:
   W  X     Y     Z
A  0  1   NaN   NaN
B  4  5   NaN   NaN
C  8  9  10.0  11.0

In [118]: t.dropna(axis=0,how="any")  #Delete as long as there is nan
Out[118]:
   W  X     Y     Z
C  8  9  10.0  11.0

In [120]: t1.dropna(axis=0,how="any",inplace=True)  #inplaced is modified immediately

In [121]: t1
Out[121]:
   W  X     Y     Z
C  8  9  10.0  11.0

In [131]: t
Out[131]:
   W  X     Y     Z
A  0  1   NaN   NaN
B  4  5   NaN   NaN
C  8  9  10.0  11.0

In [132]: t.fillna(t.mean())  #Mean filling
Out[132]:
   W  X     Y     Z
A  0  1  10.0  11.0
B  4  5  10.0  11.0
C  8  9  10.0  11.0

In [133]: t
Out[133]:
   W  X     Y     Z
A  0  1   NaN   NaN
B  4  5   NaN   NaN
C  8  9  10.0  11.0

In [134]: t.fillna(t.median())  #Median fill
Out[134]:
   W  X     Y     Z
A  0  1  10.0  11.0
B  4  5  10.0  11.0
C  8  9  10.0  11.0

In [135]: t.fillna(0)   #0 fill
Out[135]:
   W  X     Y     Z
A  0  1   0.0   0.0
B  4  5   0.0   0.0
C  8  9  10.0  11.0

Data processed as 0: t[t==0]=np.nan

In [136]: t4=t.fillna(0)

In [137]: t4
Out[137]:
   W  X     Y     Z
A  0  1   0.0   0.0
B  4  5   0.0   0.0
C  8  9  10.0  11.0

In [138]: t4[t4==0]=np.nan  #Change the value of 0 to nan

In [139]: t4
Out[139]:
     W  X     Y     Z
A  NaN  1   NaN   NaN
B  4.0  5   NaN   NaN
C  8.0  9  10.0  11.0

Of course, not every time the data is 0, it needs to be processed

When calculating the average value, nan does not participate in the calculation, but 0 will

Common statistical methods of pandas

Suppose we have a set of data of 1000 most popular films from 2006 to 2016. We want to know the average score of these film data, the number of directors and other information. How should we obtain it?

Data source: https://www.kaggle.com/damianpanek/sunday-eda/data

import pandas as pd
import numpy as np

file_path = "IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)

# print(df.info())
print("-------Print head 1 line of data------")
print(df.head(1))

print("-------Get average score------")
print(df["Rating"].mean())


# print(len(set(df["Director"].tolist())))
print("-------Number of directors required------")
print(len(df["Director"].unique()))

print("-------Get the number of actors------")
temp_actors_list = df["Actors"].str.split(", ").tolist()
actors_list = [i for j in temp_actors_list for i in j]
actors_num = len(set(actors_list))
print(actors_num)

Output results

   Rank                    Title  ... Revenue (Millions) Metascore  #
0     1  Guardians of the Galaxy  ...             333.13      76.0

[1 rows x 12 columns]
-------Get average score------
6.723199999999999
-------Number of directors required------
644
-------Get the number of actors------
2015

For this group of movie data, if we want to rate the distribution of runtime, how should we present the data?

# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
file_path = "./IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())

#rating,runtime distribution
#Select graph, histogram
#Prepare data
runtime_data = df["Runtime (Minutes)"].values

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

#Number of calculation groups
print(max_runtime-min_runtime)
num_bin = (max_runtime-min_runtime)//5
print(num_bin)


#Sets the size of the drawing
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin)

_x = [min_runtime]
i = min_runtime
while i<=max_runtime+5:
    i = i+5
    _x.append(i)

plt.xticks(_x)

plt.show()

Operation results

Posted by Digwood on Sat, 20 Nov 2021 17:18:12 -0800

Programmer Group

Data science pandas

Import

Why learn pandas

What is pandas

Basic operation of pandas

Common data types of pandas

Creation of pandas Series

Series Slicing and indexing of pandas

Reading external data from pandas

DataFrame of pandas

Row or column of pandas

loc of pandas

iloc of pandas

Boolean index of pandas

String method of pandas

data processing

Processing of missing data

Common statistical methods of pandas

Hot Keywords