---
title: "Tools for Data Manipulation"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tools for Data Manipulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, results='asis', echo=FALSE}
cat(gsub("\\n   ", "", packageDescription("dat", fields = "Description")))
```

## Installation

### From GitHub

```{r eval=FALSE}
remotes::install_github("wahani/dat")
```

### From CRAN

```{r eval=FALSE}
install.packages("dat")
```


## Why should you care?

- You probably have to rewrite all your dplyr / data.table code once you put it 
inside a package. I.e. working around non standard evaluation or find another
way to satisfy `R CMD check`. And you don't like that.
- `dplyr` is not respecting the class of the object it operates on; the class
attribute changes on-the-fly.
- Neither `dplyr` nor `data.table` are playing nice with S4, but you really,
really want a S4 *data.table* or *tbl_df*.
- You like currying as in `rlist` and `purrr`.


## Tools for data manipulation

The examples are from the introductory vignette of `dplyr`. You still work with
data frames: so you can simply mix in dplyr features whenever you need them.

```{r}
library("nycflights13")
library("dat")
```

### Select rows

We can use `mutar` to select rows. When you
reference a variable in the data frame, you can indicate this by using a one
sided formula.

```{r results='hide'}
mutar(flights, ~ month == 1 & day == 1)
mutar(flights, ~ 1:10)
```

And for sorting:

```{r}
mutar(flights, ~ order(year, month, day))
```


### Select cols

You can use characters, logicals, regular expressions and functions to select
columns. Regular expressions are indicated by a leading "^".

```{r results='hide'}
flights %>%
  extract(c("year", "month", "day")) %>%
  extract("^day$") %>%
  extract(is.numeric)
```


### Operations on columns

The main difference between `dplyr::mutate` and `mutar` is that you use a `~`
instead of `=`.
    
```{r results='hide'}
mutar(
  flights,
  gain ~ arr_delay - dep_delay,
  speed ~ distance / air_time * 60
)
```

Grouping data is handled within `mutar`:

```{r results = 'hide'}
mutar(flights, n ~ .N, by = "month")
```

```{r results='hide'}
mutar(flights, delay ~ mean(dep_delay, na.rm = TRUE), by = "month")
```

You can also provide additional arguments to a formula. This is especially
helpful when you want to pass arguments from a function to such expressions. The
additional augmentation can be anything which you can use to select columns
(character, regular expression, function) or a named list where each element is
a character.
    
```{r}
mutar(
  flights,
  .n ~ mean(.n, na.rm = TRUE) | "^.*delay$",
  .x ~ mean(.x, na.rm = TRUE) | list(.x = "arr_time"),
  by = "month"
)
```


## A link to S4

Using this package you can create S4 classes to contain a data frame (or a
data.table) and use the interface to `dplyr`. Both `dplyr` and `data.table` do
not support integration with S4. The main function here is `mutar` which is
generic enough to link to subsetting of rows and cols as well as mutate and
summarise. In the background `dplyr`s ability to work on a `data.table` is being
used.

```{r eval = FALSE}
library("data.table")

setClass("DataTable", "data.table")

DataTable <- function(...) {
  new("DataTable", data.table::data.table(...))
}

setMethod("[", "DataTable", mutar)

dtflights <- do.call(DataTable, nycflights13::flights)

dtflights[1:10, c("year", "month", "day")]
dtflights[n ~ .N, by = "month"]
dtflights[n ~ .N, sby = "month"]

dtflights %>%
  filtar(~month > 6) %>%
  mutar(n ~ .N, by = "month") %>%
  sumar(n ~ data.table::first(n), by = "month")
```


## Working with vectors

Inspired by `rlist` and `purrr` some low level operations on vectors are
supported. The aim here is to integrate syntactic sugar for anonymous functions.
Furthermore the functions should support the use of pipes.

- `map` and `flatmap` as replacements for the apply functions
- `extract` for subsetting
- `replace` for replacing elements in a vector

What we can do with map:

```{r eval=FALSE}
map(1:3, ~ .^2)
flatmap(1:3, ~ .^2)
map(1:3 ~ 11:13, c) # zip
dat <- data.frame(x = 1, y = "")
map(dat, x ~ x + 1, is.numeric)
```

What we can do with extract:

```{r eval=FALSE}
extract(1:10, ~ . %% 2 == 0) %>% sum
extract(1:15, ~ 15 %% . == 0)
l <- list(aList = list(x = 1), aAtomic = "hi")
extract(l, "^aL")
extract(l, is.atomic)
```

What we can do with replace:

```{r eval=FALSE}
replace(c(1, 2, NA), is.na, 0)
replace(c(1, 2, NA), rep(TRUE, 3), 0)
replace(c(1, 2, NA), 3, 0)
replace(list(x = 1, y = 2), "x", 0)
replace(list(x = 1, y = 2), "^x$", 0)
replace(list(x = 1, y = "a"), is.character, NULL)
```