Chapter 7 Data Regulation: Cleanup, Conversion, Merge, Reshaping (4)
The source of ldata is not given in this chapter and is availableMacrodata.csvTurn out available data.
import pandas as pd
import numpy as np
import datetime ###Time processing later
from pandas import Series,DataFrame
###Read inMacrodata.csvfile
data1 = pd.read_csv('E:\pydata-book-2nd-edition\pydata-book-2nd-edition\examples\macrodata.csv')
#####Select useful columns, where quarter represents the quarter and is used to synthesize dates
data2 = DataFrame(data1,columns=['year','quarter','realgdp','infl','unemp'])
####Convert quarter to date with 1 = 3.31, 2 = 6.30, 3 = 9.30, 4 = 12.31
for i in range(len(data2.index)):
if data2.ix[i,'quarter'] == 1: ####Select the element of the i row'quarter'column in the DataFrame data, note the ix[] square brackets
data2.ix[i,'date'] = datetime.datetime(int(data2.ix[i,'year']),3,31)
#Set the value of the new column'date'in row i, using the time formatDatetime.datetime(year, month, day) settings,
elif data2.ix[i,'quarter'] == 2:
data2.ix[i,'date'] = datetime.datetime(int(data2.ix[i,'year']),6,30)
elif data2.ix[i,'quarter'] == 3:
data2.ix[i,'date'] = datetime.datetime(int(data2.ix[i,'year']),9,30)
elif data2.ix[i,'quarter'] == 4:
data2.ix[i,'date'] = datetime.datetime(int(data2.ix[i,'year']),12,31)
####Delete columns year, quarter, note that axis=1, delete columns, default axis=0, delete rows
data3 = data2.drop(['year','quarter'],axis=1)
####Set the date column as an index, the book format as a separate column of date, and the realgdp/infl/unemp merge into one column, so you need to set the date of the separate column as an index first to avoid merging the date in the next rotation
data4 = data3.set_index('date')
###Columns become rows, forming a hierarchical index and being inmost
data5 = data4.stack()
###Store before read, indexed into columns
data5.to_csv('E:\pycode\data\P202.csv')
###Note that when csv is read by default, the first row is set to the column name, the header = None setting has no column name, and the column name is specified with the name parameter
ldata = pd.read_csv('E:\pycode\data\P202.csv',header = None,names=['date','item','value'])
#In [103]:ldata[:10]
#Out[103]:
# date item value
#0 1959-03-31 realgdp 2710.349
#1 1959-03-31 infl 0.000
#2 1959-03-31 unemp 5.800
#3 1959-06-30 realgdp 2778.801
#4 1959-06-30 infl 2.340
#5 1959-06-30 unemp 5.100
#6 1959-09-30 realgdp 2775.488
#7 1959-09-30 infl 2.740
#8 1959-09-30 unemp 5.300
#9 1959-12-31 realgdp 2785.204
Rotate long format to wide format
Time series data is usually stored in a "long format" or "stacked format" in the database.for example
In [103]:ldata[:10]
date | item | value |
---|---|---|
0 | 1959-03-31 | realgdp |
1 | 1959-03-31 | infl |
2 | 1959-03-31 | unemp |
3 | 1959-06-30 | realgdp |
4 | 1959-06-30 | infl |
5 | 1959-06-30 | unemp |
6 | 1959-09-30 | realgdp |
7 | 1959-09-30 | infl |
8 | 1959-09-30 | unemp |
9 | 1959-12-31 | realgdp |
The long format here means that when the same date has four different attribute values listed in the item, the primary key consists of both the date and the item, and the long format changes to the wide format, that is, expands the four attributes in the item.Different item values form a column, and the values in the date column can be indexed directly.The pivot method in the DataFrame can complete the conversion.
pivoted = ldata.pivot('date','item','value')
####The first parameter is the column name used as the row index in ldata, the second parameter is the column name used as the column index in ldata, and the third parameter is the column name used to populate pivoted data columns in ldata
pivoted.head()###DataFrame.head(n = 5) Returns the first n rows, default n = 5
Out[106]:
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
There is only one column involved in the remodeling above, value, assuming that there are two columns involved in the remodeling, value and value2.
ldata['value2'] = np.random.randn(len(ldata))##Add column value2 and fill it with random numbers
###Ignoring the third parameter yields a hierarchical column containing values of value and value1
pivoted = ldata.pivot('date','item')
pivoted.head()
Out[111]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 0.292680 -0.699746 0.668532
1959-06-30 2.34 2778.801 5.1 -0.457626 -1.274159 -0.197795
1959-09-30 2.74 2775.488 5.3 -2.181216 -0.860393 0.142742
1959-12-31 0.27 2785.204 5.6 1.532613 -1.294194 0.127607
1960-03-31 2.31 2847.699 5.2 -1.383715 0.558345 -0.132973
####If specified, only one of the values will be populated
#######And the third parameter can only accept one-dimensional data, not both value and value2
pivoted = ldata.pivot('date','item','value2')
pivoted.head()
Out[115]:
item infl realgdp unemp
date
1959-03-31 0.292680 -0.699746 0.668532
1959-06-30 -0.457626 -1.274159 -0.197795
1959-09-30 -2.181216 -0.860393 0.142742
1959-12-31 1.532613 -1.294194 0.127607
1960-03-31 -1.383715 0.558345 -0.132973
pivot is just a quick way to get results by using the inverse process of the first part transformation.
temp_ldata = ldata.set_index(['date','item'])####Set the hierarchical index date, item, and rotate the item index to expand
res_ldata = temp_ldata.unstack('item') ###Rotate row item as column
res_ldata.head()
Out[123]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 0.292680 -0.699746 0.668532
1959-06-30 2.34 2778.801 5.1 -0.457626 -1.274159 -0.197795
1959-09-30 2.74 2775.488 5.3 -2.181216 -0.860393 0.142742
1959-12-31 0.27 2785.204 5.6 1.532613 -1.294194 0.127607
1960-03-31 2.31 2847.699 5.2 -1.383715 0.558345 -0.132973