R Language Data Processing Package dplyr, tidyr Notes

Keywords: R Language Attribute

The dplyr package, a new work of Hadley Wickham, is mainly used for data cleaning and collation. It focuses on data frame data format, which greatly improves data processing speed and provides an interface with other databases. The author of the tidyr package is Hadley Wickham, which is used for "tidy" your data. This package is often used in conjunction with dplyr.

In this paper, the following five functions of dplyr package are introduced:

  • Screening: filter()
  • Arrangement: arrange()
  • Selection: select()
  • Deformation: mutate()
  • Summary: summarise()
  • Grouping: group_by()

And the following four functions of the tidyr package are used:

  • gather-wide data to long data;
  • spread-long data to wide data;
  • unit - multiple columns merged into one column;
  • separate - separate a column into multiple columns;

Installation and loading of dplyr and tidyr packages

install.packages("dplyr")
install.packages("tidyr")

library(dplyr)
library(tidyr)

Using the mtcars data set in the datasets package as a demonstration, we first organize the overly long data into friendly tbl_df data:

mtcars_df = tbl_df(mtcars)

Basic operation of dplyr package

1.1 Screening: filter()

Screening the required sub-data set according to the given logical judgement

filter(mtcars_df,mpg==21,hp==110)

#  A tibble: 2 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9 2.620 16.46     0     1     4     4
2    21     6   160   110   3.9 2.875 17.02     0     1     4     4

1.2 Arrangement: arrange()

Sort rows by given column names:

arrange(mtcars_df, disp) #You can reverse the column name plus desc(disp)

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2   30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
3   32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
4   27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
5   30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
6   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
7   21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
8   26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
9   21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2
10  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
# ... with 22 more rows

1.3 Choice: select()

Use column names as parameters to select subdatasets:

select(mtcars_df, disp:wt)

# A tibble: 32 x 4
    disp    hp  drat    wt
*  <dbl> <dbl> <dbl> <dbl>
1  160.0   110  3.90 2.620
2  160.0   110  3.90 2.875
3  108.0    93  3.85 2.320
4  258.0   110  3.08 3.215
5  360.0   175  3.15 3.440
6  225.0   105  2.76 3.460
7  360.0   245  3.21 3.570
8  146.7    62  3.69 3.190
9  140.8    95  3.92 3.150
10 167.6   123  3.92 3.440
# ... with 22 more rows

1.4 Deformation: mutate()

Data operations are performed on existing columns and added as new columns:

mutate(mtcars_df,
  NO = 1:dim(mtcars_df)[1])  

# A tibble: 32 x 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    NO
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4     1
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4     2
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1     3
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1     4
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2     5
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1     6
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4     7
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2     8
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2     9
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4    10
# ... with 22 more rows

1.5 Summary: summarise()

The data box calls other functions for aggregation operations, returning one-dimensional results:

summarise(mtcars_df, 
    mdisp = mean(disp, na.rm = TRUE))
# A tibble: 1 x 1
     mdisp
     <dbl>
1 230.7219

1.6 grouping: group_by()

When grouping information is added to the data set through group_by(), mutate(), arrange() and summarise() functions automatically perform grouping operations on these tbl class data.

cars <- group_by(mtcars_df, cyl)
countcars <- summarise(cars, count = n()) # count = n() is used to calculate the number of times

# A tibble: 3 x 2
    cyl count
  <dbl> <int>
1     4    11
2     6     7
3     8    14

tidyr package basic operation

2.1 Width to Length: gather()

The gather() function is used to realize the conversion of the wide table to the long table. The grammar is as follows:

gather(data, key, value, …, na.rm = FALSE, convert = FALSE)
data: Wide tables that need to be converted
key: Assign all columns in the original data box to a new variable key
value: Assign all values in the original data box to a new variable value
…: Which columns can be specified to be clustered in the same column
na.rm: Whether to delete missing values

widedata <- data.frame(person=c('Alex','Bob','Cathy'),grade=c(2,3,4),score=c(78,89,88))
widedata
  person grade score
1   Alex     2    78
2    Bob     3    89
3  Cathy     4    88
longdata <- gather(widedata, variable, value,-person)
longdata
  person variable value
1   Alex    grade     2
2    Bob    grade     3
3  Cathy    grade     4
4   Alex    score    78
5    Bob    score    89
6  Cathy    score    88

2.2 Length to Width: spread()

Sometimes, in order to meet the requirements of modeling or drawing, it is often necessary to convert a long table into a wide table or a wide table into a long table. How to realize the conversion of these two data table types. The spread() function is used to turn the long table to the wide table. The grammar is as follows:

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
data: Long tables for conversion
key: Variable values need to be expanded to field variables
value: Values that need to be dispersed
fill: For missing values, the fill Value assignment to the missing value after transformation

mtcarsSpread <- mtcarsNew %>% spread(attribute, value)
head(mtcarsSpread)
                 car am carb cyl disp drat gear  hp  mpg  qsec vs    wt
1        AMC Javelin  0    2   8  304 3.15    3 150 15.2 17.30  0 3.435
2 Cadillac Fleetwood  0    4   8  472 2.93    3 205 10.4 17.98  0 5.250
3         Camaro Z28  0    4   8  350 3.73    3 245 13.3 15.41  0 3.840
4  Chrysler Imperial  0    4   8  440 3.23    3 230 14.7 17.42  0 5.345
5         Datsun 710  1    1   4  108 3.85    4  93 22.8 18.61  1 2.320
6   Dodge Challenger  0    2   8  318 2.76    3 150 15.5 16.87  0 3.520

2.3 Merge: unit()

unite's call format is as follows:

unite(data, col, …, sep = "_", remove = TRUE)
data: Data frame
col: New column names to be combined
…: Specify which columns need to be grouped
sep: Connectors between composite columns, default underscored
remove: Whether to delete the combined columns

wideunite<-unite(widedata, information, person, grade, score, sep= "-")
wideunite
  information
1   Alex-2-78
2    Bob-3-89
3  Cathy-4-88

2.4 split: separate()

The separate() function can split a column into multiple columns. It can be used to split log data or date-time data. The grammar is as follows:

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,
convert = FALSE, extra = "warn", fill = "warn", …)
data: Data frame
col: Columns that need to be split
into: New column name, string vector
sep: Separated separators
remove: Whether to delete partitioned columns 

widesep <- separate(wideunite, information,c("person","grade","score"), sep = "-")
widesep
  person grade score
1   Alex     2    78
2    Bob     3    89
3  Cathy     4    88

It can be seen that the functions of separate() function and unite() function are opposite.

Feedback and suggestions

Posted by Riddick on Wed, 20 Mar 2019 20:18:55 -0700