R reshape2 converting data tutorial

reshape2 is a package written by Hadley Wickham, which can easily convert data between wide and long formats.

1. Length width data format

Wide data format
Each variable of wide data format corresponds to a column, for example:

##      ozone      wind     temp
## 1 23.61538 11.622581 65.54839
## 2 29.44444 10.266667 79.10000
## 3 59.11538  8.941935 83.90323
## 4 59.96154  8.793548 83.96774

Long data format
There is no ID (identification) variable, and all measurement variables are in the use column;

##    variable     value
## 1     ozone 23.615385
## 2     ozone 29.444444
## 3     ozone 59.115385
## 4     ozone 59.961538
## 5      wind 11.622581
## 6      wind 10.266667
## 7      wind  8.941935
## 8      wind  8.793548
## 9      temp 65.548387
## 10     temp 79.100000
## 11     temp 83.903226
## 12     temp 83.967742

Long format data has one column as the possible variable type and another column as the value of which variables. The long format does not have to have only two columns. For example, there may be a daily ozone measurement for a certain year. At this time, columns representing days will be added, that is, there are different length levels. What data shapes are needed depends on your business requirements.

It turns out that for some types of data analysis, you need wide format data, while for other types of data, you need long format data. In practical applications, long format data is more widely used than wide format. For example, ggplot2 requires long format data (lm(),glm(),gam()), but wide format is easier for users to read.

2. reshape2 package

reshape2 mainly uses two key functions: melt and cast:

melt converts wide format to long format
cast converts long format to wide format

Imagine metal: if you melt metal, it will drop and grow. If you pour it into a mold, it will become wider.

2.1 convert from wide to long format using melt function

The following example uses the airquality built-in dataset. First change the column name to lowercase:

names(airquality) <- tolower(names(airquality))
head(airquality)

##   ozone solar.r wind temp month day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

First, what data will be generated when we run the melt function with default parameters?

aql <- melt(airquality)
## No id variables; using all as measure variables
head(aql)
##   variable value
## 1    ozone    41
## 2    ozone    36
## 3    ozone    12
## 4    ozone    18
## 5    ozone    NA
## 6    ozone    28

tail(aql)
##     variable value
## 913      day    25
## 914      day    26
## 915      day    27
## 916      day    28
## 917      day    29
## 918      day    30

By default, melt thinks that all numerical columns are variable values, which is usually no problem. But now we need ozone, solar.r, wind, and temp under each month's day. At this time, we specify month and day through id.vars. ID identification variables are used to identify each row of data.

aql <- melt(airquality, id.vars = c("month", "day"))
head(aql)
##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18
## 5     5   5    ozone    NA
## 6     5   6    ozone    28

If you need to modify the long data format column name, you can use other parameters:

aql <- melt(airquality, id.vars = c("month", "day"),
                        variable.name = "climate_variable", 
                        value.name = "climate_value")
head(aql)
##   month day climate_variable climate_value
## 1     5   1            ozone            41
## 2     5   2            ozone            36
## 3     5   3            ozone            12
## 4     5   4            ozone            18
## 5     5   5            ozone            NA
## 6     5   6            ozone            28

2.2 convert from long to wide format using cast function

The wide to long format conversion is quite straightforward, but the long to wide format conversion is a little complex. Except for the simplest cases, it usually involves some attempts and errors. The following is an example.

In reshape 2, there are multiple cast functions. The most frequently encountered data format is the data.frame object. First, let's look at the dcast function. In addition, acast returns vector,matrix or array.

Let's convert airquality into different wide data formats. First, reply to the data format and use the dcast function to convert it, and compare the differences between the two.

dcast uses a formula to describe the data shape. The left side of the parameter is the identification variable and the right side is the measurement variable.
At the beginning, you need to try again and again to find the correct formula and will make mistakes. So if you encounter mistakes, don't be discouraged. There are usually many ways to write this formula.

library(reshape2)
aql <- melt(airquality, id.vars = c("month", "day"))
aqw <- dcast(aql, month + day ~ variable)
head(aqw)
##   month day ozone solar.r wind temp
## 1     5   1    41     190  7.4   67
## 2     5   2    36     118  8.0   72
## 3     5   3    12     149 12.6   74
## 4     5   4    18     313 11.5   62
## 5     5   5    NA      NA 14.3   56
## 6     5   6    28      NA 14.9   66

## View the original data for comparison
head(airquality) # original data
##   ozone solar.r wind temp month day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Here, dcast is told that month and day are identification variables, and variable is used as measurement. Because only one column is reserved, dcast calculates the list of values contained in it. Our page can display the declaration values.var, which must be done in some scenarios.

ok, except for the order, we successfully recovered the data.

If we don't fully understand it, we can see the following figure:

The above figure describes the dcast function. The blue shading indicates the id identification variable as the row identification. The red shading indicates the variable to be converted to the column name, and the gray indicates the data value to be filled into the cell.

The error we usually encounter in the actual operation of converting data sets is that each cell has more than one value. For example, we remove the day variable of ID variable:

dcast(aql, month ~ variable)

## Aggregation function missing: defaulting to length

##   month ozone solar.r wind temp
## 1     5    31      31   31   31
## 2     6    30      30   30   30
## 3     7    31      31   31   31
## 4     8    31      31   31   31
## 5     9    30      30   30   30

Warning during run: missing aggregate function.
If we look at the output and the cell fills in the consolidated data of each month, we can see the days of the month displayed in the result. Therefore, when multiple values of the converted data are put into one cell, we need to know how to aggregate the data, such as the mean, median or sum function.

2.3 aggregate parameters of cast function

Let's look at another example, but this time we calculate the average value and specify the na.rm=TRUE option to delete the NA value:

dcast(aql, month ~ variable, fun.aggregate = mean, 
  na.rm = TRUE)
##   month    ozone  solar.r      wind     temp
## 1     5 23.61538 181.2963 11.622581 65.54839
## 2     6 29.44444 190.1667 10.266667 79.10000
## 3     7 59.11538 216.4839  8.941935 83.90323
## 4     8 59.96154 171.8571  8.793548 83.96774
## 5     9 31.44828 167.4333 10.180000 76.90000

3. Summary

Unlike melt, dcast can also do some other interesting things. Readers can check the help file to learn

Posted by dirkie on Fri, 22 Oct 2021 20:06:39 -0700

Programmer Group