Aggregation (dplyr::summarise) functions and window functions (dplyr::lead, lag, top_n etc.) operate on rows only. Aggregation functions output single row from an input of a group of rows. Window functions output one row per input row.

Table of Contents

library(dplyr, warn.conflicts = F)
library(tidyr, warn.conflicts = F)

Aggregation (dplyr::summarise) functions and window functions (dplyr::lead, lag, top_n etc.) operate on rows only. Aggregation functions output single row from an input of a group of rows. Window functions output one row per input row.

gather, spread, separate, unite are different. They reshape data that is they work not only on rows, but also on columns. Their input and output have some effect on columns in addition to rows. So, they change the shape of the data.

There is a great visual cheatsheet on data wrangling by RStudio

gather: gather columns into key-value pairs

Usage

gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

Input: multiple columns. Output: two columns as key-value columns and multiple rows.

gather example 01

set.seed(1)
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:1,
  X = rnorm(2, 0, 1),
  Y = rnorm(2, 0, 2)
)
stocks
  #>         time          X         Y
  #> 1 2009-01-01 -0.6264538 -1.671257
  #> 2 2009-01-02  0.1836433  3.190562
stocks %>% gather(stock, price, -time)
  #>         time stock      price
  #> 1 2009-01-01     X -0.6264538
  #> 2 2009-01-02     X  0.1836433
  #> 3 2009-01-01     Y -1.6712572
  #> 4 2009-01-02     Y  3.1905616

stock and price are key - value pairs.

Note that, X and Y are actually not variables. stock is the variable.

gather example 02

mini_iris <- iris[c(1, 51), c(1,2,5)]
mini_iris
  #>    Sepal.Length Sepal.Width    Species
  #> 1           5.1         3.5     setosa
  #> 51          7.0         3.2 versicolor

Sepal.Length and Sepal.Width are not variables. They are key values. values for these keys are 5.1, 3.5 etc.

tidyr::gather(mini_iris, key = flower_att, value = measurement,
              Sepal.Length, Sepal.Width)
  #>      Species   flower_att measurement
  #> 1     setosa Sepal.Length         5.1
  #> 2 versicolor Sepal.Length         7.0
  #> 3     setosa  Sepal.Width         3.5
  #> 4 versicolor  Sepal.Width         3.2
gathered_iris = mini_iris %>% 
  tidyr::gather(key = flower_att, value = measurement, -Species)
gathered_iris
  #>      Species   flower_att measurement
  #> 1     setosa Sepal.Length         5.1
  #> 2 versicolor Sepal.Length         7.0
  #> 3     setosa  Sepal.Width         3.5
  #> 4 versicolor  Sepal.Width         3.2

spread: ungather key-value pair across multiple columns

gathered_iris %>%
  tidyr::spread(flower_att, measurement)
  #>      Species Sepal.Length Sepal.Width
  #> 1     setosa          5.1         3.5
  #> 2 versicolor          7.0         3.2

Error: spread fails when there are redundant columns

s1 = tibble::tribble(
  ~id,   ~key, ~value,
  1,   "a",  2,
  1,   "b",  3,
  2,   "a",  2
  ) %>%
  tidyr::spread(key, value)
s1
  #> # A tibble: 2 x 3
  #>      id     a     b
  #> * <dbl> <dbl> <dbl>
  #> 1     1     2     3
  #> 2     2     2    NA
library(dplyr, warn.conflicts = F)
s2 = tibble::tribble(
  ~id,   ~key, ~value, ~amount
  1,   "a",  2, 10,
  1,   "b",  3, 20,
  2,   "a",  2, 15
  ) %>%
  tidyr::spread(key, value)
  #> Error: <text>:4:3: unexpected numeric constant
  #> 3:   ~id,   ~key, ~value, ~amount
  #> 4:   1
  #>      ^

separate: separate one column into several

Usage

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
tb = tibble::tribble(
  ~year,   ~demo, ~n,
  2015 ,   "m04",  2,
  2015 ,   "f04",  3,
  2015 ,   "m05",  1,
  2015 ,   "f05",  0
  )
sep_tb = tb %>%
  tidyr::separate(demo, c("sex", "age"), 1)
sep_tb
  #> # A tibble: 4 x 4
  #>    year   sex   age     n
  #> * <dbl> <chr> <chr> <dbl>
  #> 1  2015     m    04     2
  #> 2  2015     f    04     3
  #> 3  2015     m    05     1
  #> 4  2015     f    05     0

Note: spread is similar to separate because both reshape one column into multiple columns. But spread takes as input two columns: key-value whereas separate takes as input only one column.

unite: unseparate several columns into one

Usage

unite(data, col, ..., sep = "_", remove = TRUE)
sep_tb %>%
  tidyr::unite(demo, sex, age, sep = "")
  #> # A tibble: 4 x 3
  #>    year  demo     n
  #> * <dbl> <chr> <dbl>
  #> 1  2015   m04     2
  #> 2  2015   f04     3
  #> 3  2015   m05     1
  #> 4  2015   f05     0

unnest: unnest a list column

A column can be a list or dataframe. unnest converts multiple values in a row into multiple rows.

Usage

unnest(data, ..., .drop = NA, .id = NULL, .sep = NULL)
df1 = tibble::tribble(
  ~x,    ~y,
   1,   "a",
   2, "d,e"
)
df2 = df1 %>%
  transform(y = strsplit(y, ","))
str(df2)
  #> 'data.frame':    2 obs. of  2 variables:
  #>  $ x: num  1 2
  #>  $ y:List of 2
  #>   ..$ : chr "a"
  #>   ..$ : chr  "d" "e"

Note that, y column contains a list of character vectors.

df2 %>%
  unnest(y)
  #>   x y
  #> 1 1 a
  #> 2 2 d
  #> 3 2 e

Alternatively use unnest directly:

df2 = df1 %>%
  unnest(y = strsplit(y, ","))
df2
  ## # A tibble: 3 x 2
  ##       x     y
  ##   <dbl> <chr>
  ## 1     1     a
  ## 2     2     d
  ## 3     2     e

nest: reverse of unnest

df3 = df2 %>%
  nest(y)
str(df3)
  ## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
  ##  $ x   : num  1 2
  ##  $ data:List of 2
  ##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of  1 variable:
  ##   .. ..$ y: chr "a"
  ##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
  ##   .. ..$ y: chr  "d" "e"

unnest and group_by: take last element in each group

ef1 = data.frame( a = c("ali,veli", "can,cin" ) )
ef1
  ##          a
  ## 1 ali,veli
  ## 2  can,cin

I want to get the last word in a column for each row.

ef2 = ef1 %>%
  dplyr::mutate( b = stringr::str_split(a, ",") ) %>%
  tidyr::unnest(b) %>%
  dplyr::group_by(a) 
ef2
  #> # A tibble: 4 x 2
  #> # Groups:   a [2]
  #>          a     b
  #>     <fctr> <chr>
  #> 1 ali,veli   ali
  #> 2 ali,veli  veli
  #> 3  can,cin   can
  #> 4  can,cin   cin
ef2 %>%
  dplyr::filter(row_number()==n())
  #> # A tibble: 2 x 2
  #> # Groups:   a [2]
  #>          a     b
  #>     <fctr> <chr>
  #> 1 ali,veli  veli
  #> 2  can,cin   cin

What does row_number()==n() mean? Let’s check what row_number() and n() produce by themselves?

ef2 %>%
  dplyr::mutate(row = row_number())
  #> # A tibble: 4 x 3
  #> # Groups:   a [2]
  #>          a     b   row
  #>     <fctr> <chr> <int>
  #> 1 ali,veli   ali     1
  #> 2 ali,veli  veli     2
  #> 3  can,cin   can     1
  #> 4  can,cin   cin     2

ef2 %>%
  dplyr::mutate(row = n())
  #> # A tibble: 4 x 3
  #> # Groups:   a [2]
  #>          a     b   row
  #>     <fctr> <chr> <int>
  #> 1 ali,veli   ali     2
  #> 2 ali,veli  veli     2
  #> 3  can,cin   can     2
  #> 4  can,cin   cin     2