Sqldf Packet Learning in R
SQL has powerful functions, not only to achieve data cleaning, statistics, operations, but also to achieve data storage, control, definition and invocation. More and more companies require data analysts not only to master the theoretical methods and programming capabilities of statistical modeling and data mining, but also to have the ability to use sql. Given the importance of SQL, I wanted to learn sqldf package a long time ago, but I have to wait until now. Below I share my preliminary learning results.
Brief description:
The 1 sqldf package is an R extension package. It needs to be installed before it can be used.
Using sqldf function, we need to master the writing of SQL statements.
3 sqldf packages, literally, query the contents of the dataframe in the form of SQL.
I. Important functions involved in packages
(1) read a file with SQL (read.csv.sql)
Syntax:
read.csv.sql(file = path or url, sql = select * from file, header = TRUE, sep = ",")
Example:
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE) iris2 <- read.csv.sql("iris.csv",sql = "select * from file where Species = 'setosa' ") iris2
(2) sql statement sqldf() of data box
Example:
a1s <- sqldf("select * from warpbreaks limit 6")
2. Next, make a comparative analysis of sql statement and R language.
(1) The first six rows of observations for extracting data sets
a1r <- head(warpbreaks) a1s <- sqldf("select * from warpbreaks limit 6") identical(a1r, a1s) #If the result is TRUE, the result is the same.
(2) Extracting data sets with specified contents
a2r <- subset(CO2, grepl("^Qn", Plant)) a2s <- sqldf("select * from CO2 where Plant like 'Qn%'") a ll.equal(as.data.frame(a2r), a2s) data(farms, package = "MASS") a3r <- subset(farms, Manag %in% c("BF", "HF")) a3s <- sqldf("select * from farms where Manag in ('BF', 'HF')") row.names(a3r) <- NULL identical(a3r, a3s) a4r <- subset(warpbreaks, breaks >= 20 & breaks <= 30) a4s <- sqldf("select * from warpbreaks where breaks between 20 and 30", row.names = TRUE) identical(a4r, a4s) a5r <- subset(farms, Mois == 'M1') a5s <- sqldf("select * from farms where Mois = 'M1'", row.names = TRUE) identical(a5r, a5s) a6r <- subset(farms, Mois == 'M2') a6s <- sqldf("select * from farms where Mois = 'M2'", row.names = TRUE) identical(a6r, a6s)
(3) Data Set Merging
a7r <- rbind(a5r, a6r) a7s <- sqldf("select * from a5s union all select * from a6s") # aggregate a8r <- aggregate(iris[1:2], iris[5], mean)#Processing object iris[1:2]; grouping iris[5]; average a8s <- sqldf('select Species, avg("Sepal.Length") `Sepal.Length`, avg("Sepal.Width") `Sepal.Width` from iris group by Species') a9r <- do.call(rbind, by(iris, iris[5], function(x) with(x, data.frame(Species = Species[1], mean.Sepal.Length = mean(Sepal.Length), mean.Sepal.Width = mean(Sepal.Width), mean.Sepal.ratio = mean(Sepal.Length/Sepal.Width))))) ###do.call() tells the list a function, and then all the elements in the list execute the function. row.names(a9r) <- NULL a9s <- sqldf('select Species, avg("Sepal.Length") `mean.Sepal.Length`, avg("Sepal.Width") `mean.Sepal.Width`, avg("Sepal.Length"/"Sepal.Width") `mean.Sepal.ratio` from iris group by Species') # head - top 3 breaks a10r <- head(warpbreaks[order(warpbreaks$breaks, decreasing = TRUE), ], 3) a10s <- sqldf("select * from warpbreaks order by breaks desc limit 3") row.names(a10r) <- NULL identical(a10r, a10s)