How to convert a factor to an integer/numeric without losing information?

Keywords: Attribute

When I convert a factor to a numeric or an integer, I get the underlying level codes, not the values as numbers.

f <- factor(sample(runif(5), 20, replace = TRUE))
##  [1] 0.0248644019011408 0.0248644019011408 0.179684827337041 
##  [4] 0.0284090070053935 0.363644931698218  0.363644931698218 
##  [7] 0.179684827337041  0.249704354675487  0.249704354675487 
## [10] 0.0248644019011408 0.249704354675487  0.0284090070053935
## [13] 0.179684827337041  0.0248644019011408 0.179684827337041 
## [16] 0.363644931698218  0.249704354675487  0.363644931698218 
## [19] 0.179684827337041  0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218

as.numeric(f)
##  [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2

as.integer(f)
##  [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2

I have to resort to paste to get the real values:

as.numeric(paste(f))
##  [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
##  [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901

Is there a better way to convert a factor to numeric?

#1 building

R has a number of (undocumented) convenience functions for converting factors:

  • as.character.factor
  • as.data.frame.factor
  • as.Date.factor
  • as.list.factor
  • as.vector.factor
  • ...

Annoyingly, though, there is nothing to handle the factor -> numeric conversion. As an extension of Joshua Ulrich's answer, I would suggest overcoming this omission by defining your own idiomatic function:

as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}

You can store it at the beginning of your script, or even better in your .Rprofile file.
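As a minimal sketch of how the helper above might be used, the snippet below defines it and calls it by its full name on a small factor. The sample data is made up for illustration and is not from the original post.

```r
# Helper from the answer above: convert each level once, then index by the codes.
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}

# Illustrative data (an assumption, not from the original post).
f <- factor(c(3.5, 10, 3.5, 0.25))

# Calling the helper by its full name returns the values, not the level codes.
values <- as.numeric.factor(f)
values
# [1]  3.50 10.00  3.50  0.25
```

Calling the function by its full name keeps the example independent of how S3 dispatch treats as.numeric.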

#2 building

This is possible only if the factor labels match the original values. I'll explain it with an example.

Suppose the data is a vector x:

x <- c(20, 10, 30, 20, 10, 40, 10, 40)

Now, I'll create a factor with four labels:

f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))

1) x is of type double and f is of type integer. This is the first inevitable loss of information. Factors are always stored as integers.

> typeof(x)
[1] "double"
> typeof(f)
[1] "integer"

2) It is not possible to revert to the original values (10, 20, 30, 40) with only f available. We can see that f holds only the integer values 1, 2, 3, 4 and two attributes - the list of labels ("A", "B", "C", "D") and the class attribute "factor". Nothing more.

> str(f)
 Factor w/ 4 levels "A","B","C","D": 2 1 3 2 1 4 1 4
> attributes(f)
$levels
[1] "A" "B" "C" "D"

$class
[1] "factor"

To revert to the original values, we have to know the level values that were used to create the factor, in this case c(10, 20, 30, 40). If we know the original levels (in the correct order), we can recover the original values.

> orig_levels <- c(10, 20, 30, 40)
> x1 <- orig_levels[f]
> all.equal(x, x1)
[1] TRUE

This works only if labels were defined for all possible values in the original data.

So if you will need the original values, you have to keep them. Otherwise, there is a good chance that you will not be able to get back to them from the factor alone.
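One possible way to keep the original values, sketched here under the assumption that a small lookup table is stored next to the factor, is a named vector that maps each label back to its value. The names labs and lookup are hypothetical and chosen only for this illustration.

```r
x <- c(20, 10, 30, 20, 10, 40, 10, 40)
orig_levels <- c(10, 20, 30, 40)
labs <- c("A", "B", "C", "D")

f <- factor(x, levels = orig_levels, labels = labs)

# Hypothetical lookup table (label -> original value), kept alongside the factor.
lookup <- setNames(orig_levels, labs)

# Index by label rather than by code, so the result does not depend on
# remembering the level order; unname() drops the labels from the result.
x1 <- unname(lookup[as.character(f)])

identical(x, x1)
# [1] TRUE
```

Indexing by label is a design choice: it stays correct even if the lookup vector is stored in a different order than the factor levels.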

#3 building

The easiest way is to use the unfactor function from the varhandle package:

unfactor(your_factor_variable)

This example is a quick start:

x <- rep(c("a", "b", "c"), 20)
y <- rep(c(1, 1, 0), 20)

class(x)  # -> "character"
class(y)  # -> "numeric"

x <- factor(x)
y <- factor(y)

class(x)  # -> "factor"
class(y)  # -> "factor"

library(varhandle)
x <- unfactor(x)
y <- unfactor(y)

class(x)  # -> "character"
class(y)  # -> "numeric"

#4 building

See the Warning section of ?factor:

In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

The FAQ on R has similar advice.

Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?

as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(x) values rather than on nlevels(x) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
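The following sketch illustrates the point on a long factor with few levels: all three expressions return the same values, and they differ only in how many strings are converted. The data and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
f <- factor(sample(c(0.5, 1.5, 2.5), 1000, replace = TRUE))

length(f)   # number of elements in the vector
nlevels(f)  # number of distinct levels (at most 3 here)

# Converts nlevels(f) strings, then indexes the numeric result by the codes.
via_levels <- as.numeric(levels(f))[f]

# Indexes the levels first, then converts length(f) strings.
via_index <- as.numeric(levels(f)[f])

# Same amount of work as the previous line.
via_char <- as.numeric(as.character(f))

identical(via_levels, via_index) && identical(via_index, via_char)
```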

Some timings

library(microbenchmark)
microbenchmark(
  as.numeric(levels(f))[f],
  as.numeric(levels(f)[f]),
  as.numeric(as.character(f)),
  paste0(x),
  paste(x),
  times = 1e5
)
## Unit: microseconds
##                         expr   min    lq      mean median     uq      max neval
##     as.numeric(levels(f))[f] 3.982 5.120  6.088624  5.405  5.974 1981.418 1e+05
##     as.numeric(levels(f)[f]) 5.973 7.111  8.352032  7.396  8.250 4256.380 1e+05
##  as.numeric(as.character(f)) 6.827 8.249  9.628264  8.534  9.671 1983.694 1e+05
##                    paste0(x) 7.964 9.387 11.026351  9.956 10.810 2911.257 1e+05
##                     paste(x) 7.965 9.387 11.127308  9.956 11.093 2419.458 1e+05

#5 building

Note: this particular answer is not for converting numeric-valued factors to numerics; it is for converting categorical factors to their corresponding level numbers.

None of the answers in this post produced results for me; NAs were being generated.

y2 <- factor(c("A", "B", "C", "D", "A"))
as.numeric(levels(y2))[y2]
# [1] NA NA NA NA NA
# Warning message:
# NAs introduced by coercion

What worked for me:

as.integer(y2)
# [1] 1 2 3 4 1
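To tell the two cases apart before converting, one could check whether every level parses as a number. The helper below is a hypothetical sketch (levels_are_numeric is not a base R function), and the factor y1 is made-up data added for contrast.

```r
# Hypothetical helper: TRUE if every level can be parsed as a number.
levels_are_numeric <- function(f) {
  !anyNA(suppressWarnings(as.numeric(levels(f))))
}

y1 <- factor(c("10", "20", "10"))          # numeric-valued levels (made-up data)
y2 <- factor(c("A", "B", "C", "D", "A"))   # categorical levels, as above

levels_are_numeric(y1)
# [1] TRUE
levels_are_numeric(y2)
# [1] FALSE

# Numeric-valued levels: recover the values.
as.numeric(levels(y1))[y1]
# [1] 10 20 10

# Categorical levels: only the level codes are meaningful.
as.integer(y2)
# [1] 1 2 3 4 1
```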

Posted by jmgrhm on Tue, 07 Jan 2020 23:56:12 -0800