---
title: "Tools for Data Manipulation"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tools for Data Manipulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, results='asis', echo=FALSE}
cat(gsub("\\n ", "", packageDescription("dat", fields = "Description")))
```

## Installation

### From GitHub

```{r eval=FALSE}
remotes::install_github("wahani/dat")
```

### From CRAN

```{r eval=FALSE}
install.packages("dat")
```

## Why should you care?

- You probably have to rewrite all your `dplyr` / `data.table` code once you put
  it inside a package, either by working around non-standard evaluation or by
  finding another way to satisfy `R CMD check`. And you don't like that.
- `dplyr` does not respect the class of the object it operates on; the class
  attribute changes on the fly.
- Neither `dplyr` nor `data.table` plays nicely with S4, but you really, really
  want an S4 *data.table* or *tbl_df*.
- You like currying as in `rlist` and `purrr`.

## Tools for data manipulation

The examples are taken from the introductory vignette of `dplyr`. You still work
with data frames, so you can mix in `dplyr` features whenever you need them.

```{r}
library("nycflights13")
library("dat")
```

### Select rows

We can use `mutar` to select rows. To reference a variable in the data frame,
wrap the expression in a one-sided formula.

```{r results='hide'}
# rows where both conditions hold
mutar(flights, ~ month == 1 & day == 1)
# rows by position
mutar(flights, ~ 1:10)
```

And for sorting:

```{r}
mutar(flights, ~ order(year, month, day))
```

### Select cols

You can use character vectors, logical vectors, regular expressions and
functions to select columns. Regular expressions are indicated by a leading "^".

```{r results='hide'}
flights %>%
  extract(c("year", "month", "day")) %>%
  extract("^day$") %>%
  extract(is.numeric)
```

### Operations on columns

The main difference between `dplyr::mutate` and `mutar` is that you use a `~`
instead of `=`.

```{r results='hide'}
mutar(
  flights,
  gain ~ arr_delay - dep_delay,
  speed ~ distance / air_time * 60
)
```

Grouping data is handled within `mutar`:

```{r results = 'hide'}
mutar(flights, n ~ .N, by = "month")
```

```{r results='hide'}
mutar(flights, delay ~ mean(dep_delay, na.rm = TRUE), by = "month")
```

You can also provide an additional argument to a formula, separated from the
expression by `|`. This is especially helpful when you want to pass arguments
from a function to such expressions. The additional argument can be anything
that you can use to select columns (character, regular expression, function) or
a named list where each element is a character.

```{r}
mutar(
  flights,
  .n ~ mean(.n, na.rm = TRUE) | "^.*delay$",
  .x ~ mean(.x, na.rm = TRUE) | list(.x = "arr_time"),
  by = "month"
)
```

## A link to S4

Using this package you can create S4 classes that contain a data frame (or a
data.table) and use the interface to `dplyr`. Neither `dplyr` nor `data.table`
supports integration with S4. The main function here is `mutar`, which is
generic enough to cover subsetting of rows and columns as well as mutate and
summarise. In the background, `dplyr`'s ability to work on a `data.table` is
used.

```{r eval = FALSE}
library("data.table")

setClass("DataTable", "data.table")

# constructor for the S4 class
DataTable <- function(...) {
  new("DataTable", data.table::data.table(...))
}

# route `[` on DataTable objects through mutar
setMethod("[", "DataTable", mutar)

dtflights <- do.call(DataTable, nycflights13::flights)
dtflights[1:10, c("year", "month", "day")]
dtflights[n ~ .N, by = "month"]
dtflights[n ~ .N, sby = "month"]

dtflights %>%
  filtar(~month > 6) %>%
  mutar(n ~ .N, by = "month") %>%
  sumar(n ~ data.table::first(n), by = "month")
```

## Working with vectors

Inspired by `rlist` and `purrr`, some low-level operations on vectors are
supported. The aim is to provide syntactic sugar for anonymous functions.
Furthermore, the functions should support the use of pipes.

- `map` and `flatmap` as replacements for the apply functions
- `extract` for subsetting
- `replace` for replacing elements in a vector

What we can do with map:

```{r eval=FALSE}
map(1:3, ~ .^2)
flatmap(1:3, ~ .^2)
map(1:3 ~ 11:13, c) # zip
dat <- data.frame(x = 1, y = "")
map(dat, x ~ x + 1, is.numeric)
```

What we can do with extract:

```{r eval=FALSE}
extract(1:10, ~ . %% 2 == 0) %>% sum
extract(1:15, ~ 15 %% . == 0)
l <- list(aList = list(x = 1), aAtomic = "hi")
extract(l, "^aL")
extract(l, is.atomic)
```

What we can do with replace:

```{r eval=FALSE}
replace(c(1, 2, NA), is.na, 0)
replace(c(1, 2, NA), rep(TRUE, 3), 0)
replace(c(1, 2, NA), 3, 0)
replace(list(x = 1, y = 2), "x", 0)
replace(list(x = 1, y = 2), "^x$", 0)
replace(list(x = 1, y = "a"), is.character, NULL)
```
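
The formula shorthand used with `map` and `extract` in this section can be read as an anonymous function in which `.` stands for the current element. The base R sketch below illustrates that reading only. `formula_to_function` is a hypothetical helper written for this illustration; it is not part of `dat` and makes no claim about how `dat` evaluates formulas internally.

```{r}
# Hypothetical helper (not part of dat): turn a one-sided formula such as
# ~ .^2 into a function of a single argument named `.`.
formula_to_function <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 2L)
  fun <- function(.) NULL
  body(fun) <- f[[2L]]               # right-hand side becomes the function body
  environment(fun) <- environment(f) # free variables resolve where the formula was written
  fun
}

# Base R stand-ins: lapply applies the function, Filter keeps matching elements.
square <- formula_to_function(~ .^2)
squares <- lapply(1:3, square)

is_even <- formula_to_function(~ . %% 2 == 0)
evens <- Filter(is_even, 1:10)
```

Here `lapply` and `Filter` only stand in for the general idea of mapping and subsetting with a predicate; the formula is the part that replaces writing `function(x) x^2` by hand.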