Tidy Data
pivoting
gather() and spread() are being replaced by the pivot functions, pivot_longer() and pivot_wider(). Here is how they work…
pivot_longer()
String data in column names
This feels very much like the old gather() function.
head(relig_income)
## # A tibble: 6 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` $75-1…¹ $100-…² `>150k` Don't…³
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122 109 84 96
## 2 Atheist 12 27 37 52 35 70 73 59 74 76
## 3 Buddhist 27 21 30 34 33 58 62 39 53 54
## 4 Catholic 418 617 732 670 638 1116 949 792 633 1489
## 5 Don’t know/refused 15 14 15 11 10 35 21 17 18 116
## 6 Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529
## # … with abbreviated variable names ¹`$75-100k`, ²`$100-150k`, ³`Don't know/refused`
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "count")
## # A tibble: 180 × 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # … with 170 more rows
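For comparison, here is a sketch (not run above) of the superseded gather() call that does the same reshaping:
# Superseded equivalent: gather() with a key/value pair instead of names_to/values_to.
relig_income %>%
  gather(key = "income", value = "count", -religion)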
Numeric data in column names
Now, here’s where you can start to see the power of the function. We grab the columns with a select helper, name the new variable with names_to, and names_prefix identifies the prefix to strip from those column names. And with values_drop_na = TRUE we can automatically drop NAs.
head(billboard)
## # A tibble: 6 × 79
## artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA
## 3 3 Door… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44
## 4 3 Door… Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66
## 5 504 Bo… Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36
## # … with 62 more variables: wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>,
## # wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>,
## # wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## # wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
## # wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, …
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # … with 5,297 more rows
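The week values come back as character. Assuming a recent tidyr (1.1 or later), names_transform can convert them to integers during the same pivot; a sketch:
# Sketch: also coerce the stripped week number to an integer while pivoting
# (names_transform assumes tidyr >= 1.1).
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE
  )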
Many variables in column names
Even more complex: here we take multiple columns, gather them, and create three new columns of data from the names. The names_pattern argument takes the same kind of specification as extract(), a regular expression with groups used to split out the pieces. To understand it, run the call first with the names_to and names_pattern lines commented out (as below), then again with them.
head(who)
## # A tibble: 6 × 60
## country iso2 iso3 year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷ new_s…⁸ new_s…⁹ new_s…˟
## <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Afghanist… AF AFG 1980 NA NA NA NA NA NA NA NA NA NA
## 2 Afghanist… AF AFG 1981 NA NA NA NA NA NA NA NA NA NA
## 3 Afghanist… AF AFG 1982 NA NA NA NA NA NA NA NA NA NA
## 4 Afghanist… AF AFG 1983 NA NA NA NA NA NA NA NA NA NA
## 5 Afghanist… AF AFG 1984 NA NA NA NA NA NA NA NA NA NA
## 6 Afghanist… AF AFG 1985 NA NA NA NA NA NA NA NA NA NA
## # … with 46 more variables: new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>,
## # new_sn_f3544 <int>, new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>, new_ep_f3544 <int>,
## # new_ep_f4554 <int>, new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>, …
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
# names_to = c("diagnosis", "gender", "age"),
# names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
) %>% sample_n(5)
## # A tibble: 5 × 6
## country iso2 iso3 year name count
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Malawi MW MWI 1996 newrel_m4554 NA
## 2 Azerbaijan AZ AZE 1997 new_sn_m014 NA
## 3 Czech Republic CZ CZE 2007 newrel_m4554 NA
## 4 Gambia GM GMB 1990 newrel_f1524 NA
## 5 Vanuatu VU VUT 2002 new_sp_m2534 2
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
) %>% sample_n(5)
## # A tibble: 5 × 8
## country iso2 iso3 year diagnosis gender age count
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 New Caledonia NC NCL 1982 rel f 2534 NA
## 2 Serbia & Montenegro YU SCG 1989 rel f 4554 NA
## 3 Bosnia and Herzegovina BA BIH 1982 sp f 2534 NA
## 4 Sierra Leone SL SLE 1991 rel m 5564 NA
## 5 Albania AL ALB 2009 sn f 2534 3
To take it even further, we can turn these newly created variables into factors, defining their levels within the same call using the names_ptypes argument.
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_ptypes = list(
gender = factor(levels = c("f", "m")),
age = factor(
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)
## # A tibble: 405,440 × 8
## country iso2 iso3 year diagnosis gender age count
## <chr> <chr> <chr> <int> <chr> <fct> <ord> <int>
## 1 Afghanistan AF AFG 1980 sp m 014 NA
## 2 Afghanistan AF AFG 1980 sp m 1524 NA
## 3 Afghanistan AF AFG 1980 sp m 2534 NA
## 4 Afghanistan AF AFG 1980 sp m 3544 NA
## 5 Afghanistan AF AFG 1980 sp m 4554 NA
## 6 Afghanistan AF AFG 1980 sp m 5564 NA
## 7 Afghanistan AF AFG 1980 sp m 65 NA
## 8 Afghanistan AF AFG 1980 sp f 014 NA
## 9 Afghanistan AF AFG 1980 sp f 1524 NA
## 10 Afghanistan AF AFG 1980 sp f 2534 NA
## # … with 405,430 more rows
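Note that in more recent tidyr releases names_ptypes checks types rather than converting them; there the factor conversion goes through names_transform instead. A sketch of that variant, assuming tidyr >= 1.1:
# Sketch: convert (rather than just check) the new name columns via names_transform.
who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)",
  names_transform = list(
    gender = function(x) factor(x, levels = c("f", "m")),
    age = function(x) factor(
      x,
      levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
      ordered = TRUE
    )
  ),
  values_to = "count"
)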
Multiple observations per row
In the data below, each row holds two children, or sometimes an NA for the second. This gathers all columns except family, and using “.value” in the names_to argument specifies that that part of the column name should become a new variable (here dob and gender).
family <- tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
) %>% mutate_at(vars(starts_with("dob")), parse_date)
family
## # A tibble: 5 × 5
## family dob_child1 dob_child2 gender_child1 gender_child2
## <int> <date> <date> <int> <int>
## 1 1 1998-11-26 2000-01-29 1 2
## 2 2 1996-06-22 NA 2 NA
## 3 3 2002-07-11 2004-04-05 2 2
## 4 4 2004-10-10 2009-08-27 1 1
## 5 5 2000-12-05 2005-02-28 2 1
family %>%
pivot_longer(
-family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
## # A tibble: 9 × 4
## family child dob gender
## <int> <chr> <date> <int>
## 1 1 child1 1998-11-26 1
## 2 1 child2 2000-01-29 2
## 3 2 child1 1996-06-22 2
## 4 3 child1 2002-07-11 2
## 5 3 child2 2004-04-05 2
## 6 4 child1 2004-10-10 1
## 7 4 child2 2009-08-27 1
## 8 5 child1 2000-12-05 2
## 9 5 child2 2005-02-28 1
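As a quick sanity check, pivot_wider() can reverse this; a sketch that rebuilds the dob_*/gender_* columns from the long form (family 2’s missing child comes back as NA):
# Sketch: round-trip the long data back to one row per family.
family %>%
  pivot_longer(
    -family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  ) %>%
  pivot_wider(
    names_from = child,
    values_from = c(dob, gender)
  )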
Here is another example of using “.value”: it captures the new variable names (y and z), while “time” holds the numeric identifier from the end of each column name.
pnl <- tibble(
x = 1:4,
a = c(1, 1, 0, 0),
b = c(0, 1, 1, 1),
y1 = rnorm(4),
y2 = rnorm(4),
z1 = rep(3, 4),
z2 = rep(-2, 4),
)
pnl
## # A tibble: 4 × 7
## x a b y1 y2 z1 z2
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0.450 -1.44 3 -2
## 2 2 1 1 -1.15 1.64 3 -2
## 3 3 0 1 1.90 0.986 3 -2
## 4 4 0 1 1.54 -0.666 3 -2
pnl %>%
pivot_longer(
-c(x, a, b),
names_to = c(".value", "time"),
names_pattern = "(.)(.)"
)
## # A tibble: 8 × 6
## x a b time y z
## <int> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 1 0 1 0.450 3
## 2 1 1 0 2 -1.44 -2
## 3 2 1 1 1 -1.15 3
## 4 2 1 1 2 1.64 -2
## 5 3 0 1 1 1.90 3
## 6 3 0 1 2 0.986 -2
## 7 4 0 1 1 1.54 3
## 8 4 0 1 2 -0.666 -2
pivot_wider()
This data captures each time a fish passes through a monitoring station. A row is only recorded if a fish was seen, so if we widen the data so that each station is a column, the missing sightings show up as NAs. Those NAs can be filled within the function using the values_fill argument.
fish_encounters
## # A tibble: 114 × 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## 7 4842 BCW 1
## 8 4842 BCE2 1
## 9 4842 BCW2 1
## 10 4842 MAE 1
## # … with 104 more rows
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
# values_fill = list(seen = 0)
) %>% head()
## # A tibble: 6 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6 4848 1 1 1 1 NA NA NA NA NA NA NA
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
) %>% head()
## # A tibble: 6 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 0 0 0 0 0 0
## 5 4847 1 1 1 0 0 0 0 0 0 0 0
## 6 4848 1 1 1 1 0 0 0 0 0 0 0
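In tidyr 1.1.0 and later, values_fill also accepts a bare scalar instead of a named list; a sketch:
# Sketch: assuming tidyr >= 1.1, a single unnamed fill value works too.
fish_encounters %>% pivot_wider(
  names_from = station,
  values_from = seen,
  values_fill = 0
) %>% head()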
Pivot Wider with Duplicated values
Assume we have a data set where the values are not uniquely identified by the remaining columns - when we go to use pivot_wider() we’ll get a warning and our columns will come back as nested list-columns.
set.seed(1234)
tibble::tibble(
cats = sample(1:10, 7),
dogs = sample(1:10, 7),
birds = sample(1:10, 7)) |>
tidyr::pivot_longer(
cats:birds,
names_to = "animal",
values_to = "hits") |>
tidyr::pivot_wider(
names_from = animal,
values_from = hits
)
## Warning: Values from `hits` are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = {summary_fun}` to summarise duplicates.
## * Use the following dplyr code to identify duplicates.
## {data} %>%
## dplyr::group_by(animal) %>%
## dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
## dplyr::filter(n > 1L)
## # A tibble: 1 × 3
## cats dogs birds
## <list> <list> <list>
## 1 <int [7]> <int [7]> <int [7]>
We can work around this by adding a row number that makes each observation unique, and then pivoting.
set.seed(1234)
tibble::tibble(
cats = sample(1:10, 7),
dogs = sample(1:10, 7),
birds = sample(1:10, 7)) |>
tidyr::pivot_longer(
cats:birds,
names_to = "animal",
values_to = "hits") |>
dplyr::group_by(animal) |>
dplyr::mutate(row = dplyr::row_number()) |>
tidyr::pivot_wider(
names_from = animal,
values_from = hits
) |>
dplyr::select(-row)
## # A tibble: 7 × 3
## cats dogs birds
## <int> <int> <int>
## 1 10 4 8
## 2 6 2 4
## 3 5 7 9
## 4 4 6 5
## 5 1 9 10
## 6 8 10 3
## 7 2 8 6
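Alternatively, following the warning’s own hint, we can accept the list-columns deliberately with values_fn = list and then unnest them; a sketch assuming a tidyr version where unnest() takes a cols argument (>= 1.0):
# Sketch: keep the list-columns on purpose, then unnest back to one row per draw.
set.seed(1234)
tibble::tibble(
  cats = sample(1:10, 7),
  dogs = sample(1:10, 7),
  birds = sample(1:10, 7)) |>
  tidyr::pivot_longer(
    cats:birds,
    names_to = "animal",
    values_to = "hits") |>
  tidyr::pivot_wider(
    names_from = animal,
    values_from = hits,
    values_fn = list) |>
  tidyr::unnest(cols = c(cats, dogs, birds))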
Aggregation
This is actually sort of bonkers. We can spread data and aggregate it, all at the same time. The warpbreaks dataset captures 9 replicate measurements for each combination of 2 types of wool and 3 levels of tension, so there is no unique way of spreading the data.
warpbreaks <- warpbreaks %>% as_tibble() %>% select(wool, tension, breaks)
warpbreaks
## # A tibble: 54 × 3
## wool tension breaks
## <fct> <fct> <dbl>
## 1 A L 26
## 2 A L 30
## 3 A L 54
## 4 A L 25
## 5 A L 70
## 6 A L 52
## 7 A L 51
## 8 A L 26
## 9 A L 67
## 10 A M 18
## # … with 44 more rows
warpbreaks %>% count(wool, tension)
## # A tibble: 6 × 3
## wool tension n
## <fct> <fct> <int>
## 1 A L 9
## 2 A M 9
## 3 A H 9
## 4 B L 9
## 5 B M 9
## 6 B H 9
warpbreaks %>% pivot_wider(names_from = wool, values_from = breaks)
## Warning: Values from `breaks` are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = {summary_fun}` to summarise duplicates.
## * Use the following dplyr code to identify duplicates.
## {data} %>%
## dplyr::group_by(tension, wool) %>%
## dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
## dplyr::filter(n > 1L)
## # A tibble: 3 × 3
## tension A B
## <fct> <list> <list>
## 1 L <dbl [9]> <dbl [9]>
## 2 M <dbl [9]> <dbl [9]>
## 3 H <dbl [9]> <dbl [9]>
But we can use pivot_wider() to spread the data while aggregating each wool and tension combination to its mean, via values_fn.
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = list(breaks = mean)
)
## # A tibble: 3 × 3
## tension A B
## <fct> <dbl> <dbl>
## 1 L 44.6 28.2
## 2 M 24 28.8
## 3 H 24.6 18.8
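For comparison, the same table can be reached the long way round by aggregating first and then pivoting; a sketch:
# Sketch: summarise to the mean per wool/tension first, so pivot_wider() has
# nothing left to aggregate.
warpbreaks %>%
  group_by(wool, tension) %>%
  summarise(breaks = mean(breaks), .groups = "drop") %>%
  pivot_wider(names_from = wool, values_from = breaks)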
Generate column name from multiple variables
Here we have data with a country and production levels for specific products. What if we wanted a column for each product and country combination?
production <- expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
filter((product == "A" & country == "AI") | product == "B") %>%
mutate(production = rnorm(nrow(.)))
production
## # A tibble: 45 × 4
## product country year production
## <chr> <chr> <int> <dbl>
## 1 A AI 2000 -0.511
## 2 A AI 2001 -0.911
## 3 A AI 2002 -0.837
## 4 A AI 2003 2.42
## 5 A AI 2004 0.134
## 6 A AI 2005 -0.491
## 7 A AI 2006 -0.441
## 8 A AI 2007 0.460
## 9 A AI 2008 -0.694
## 10 A AI 2009 -1.45
## # … with 35 more rows
By specifying multiple names_from columns, pivot_wider() unites them into a single set of column names.
production %>% pivot_wider(
names_from = c(product, country),
values_from = production
)
## # A tibble: 15 × 4
## year A_AI B_AI B_EI
## <int> <dbl> <dbl> <dbl>
## 1 2000 -0.511 -0.476 -1.11
## 2 2001 -0.911 -0.709 -1.25
## 3 2002 -0.837 -0.501 -0.524
## 4 2003 2.42 -1.63 -0.497
## 5 2004 0.134 -1.17 -1.81
## 6 2005 -0.491 -2.18 -0.582
## 7 2006 -0.441 -1.34 -1.11
## 8 2007 0.460 -0.294 -1.01
## 9 2008 -0.694 -0.466 -0.162
## 10 2009 -1.45 1.45 0.563
## 11 2010 0.575 -1.07 1.65
## 12 2011 -1.02 -0.855 -0.773
## 13 2012 -0.0151 -0.281 1.61
## 14 2013 -0.936 -0.994 -1.16
## 15 2014 1.10 -0.969 0.657
Tidy Census data
This looks very much like a standard use case of the old spread(), except that here two value columns (estimate and moe) are widened at once.
us_rent_income
## # A tibble: 104 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # … with 94 more rows
us_rent_income %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe))
## # A tibble: 52 × 6
## GEOID NAME estimate_income estimate_rent moe_income moe_rent
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alabama 24476 747 136 3
## 2 02 Alaska 32940 1200 508 13
## 3 04 Arizona 27517 972 148 4
## 4 05 Arkansas 23789 709 165 5
## 5 06 California 29454 1358 109 3
## 6 08 Colorado 32401 1125 109 5
## 7 09 Connecticut 35326 1123 195 5
## 8 10 Delaware 31560 1076 247 10
## 9 11 District of Columbia 43198 1424 681 17
## 10 12 Florida 25952 1077 70 3
## # … with 42 more rows
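If you would rather have names like income_estimate than estimate_income, the names_glue argument (available in recent tidyr versions; a sketch) lets you spell out the pattern:
# Sketch: control the generated column names with a glue specification.
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_glue = "{variable}_{.value}"
  )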
Contact List
contacts <- tribble(
~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "john@google.com",
"name", "Huxley Ratcliffe"
)
contacts
## # A tibble: 6 × 2
## field value
## <chr> <chr>
## 1 name Jiena McLellan
## 2 company Toyota
## 3 name John Smith
## 4 company google
## 5 email john@google.com
## 6 name Huxley Ratcliffe
There is no unique identifier for each person, so we have to create one. From there we can spread it all out.
contacts <- contacts %>%
mutate(
person_id = cumsum(field == "name")
)
contacts
## # A tibble: 6 × 3
## field value person_id
## <chr> <chr> <int>
## 1 name Jiena McLellan 1
## 2 company Toyota 1
## 3 name John Smith 2
## 4 company google 2
## 5 email john@google.com 2
## 6 name Huxley Ratcliffe 3
contacts %>%
pivot_wider(names_from = field, values_from = value)
## # A tibble: 3 × 4
## person_id name company email
## <int> <chr> <chr> <chr>
## 1 1 Jiena McLellan Toyota <NA>
## 2 2 John Smith google john@google.com
## 3 3 Huxley Ratcliffe <NA> <NA>
Longer, then Wider
World Bank data
The goal is to produce a tidy dataset where each variable is a column.
world_bank_pop
## # A tibble: 1,056 × 20
## country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW SP.URB.TO… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4 4.47e+4 4.44e+4 4.41e+4 4.38e+4
## 2 ABW SP.URB.GR… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1 -6.98e-1 -7.31e-1 -6.24e-1
## 3 ABW SP.POP.TO… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5 1.01e+5 1.01e+5 1.01e+5 1.02e+5
## 4 ABW SP.POP.GR… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1 3.84e-1 1.31e-1 9.86e-2 2.13e-1
## 5 AFG SP.URB.TO… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6 6.15e+6 6.36e+6 6.59e+6 6.84e+6
## 6 AFG SP.URB.GR… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0 3.65e+0 3.40e+0 3.46e+0 3.70e+0
## 7 AFG SP.POP.TO… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7 2.66e+7 2.73e+7 2.80e+7 2.88e+7
## 8 AFG SP.POP.GR… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0 2.76e+0 2.51e+0 2.57e+0 2.81e+0
## 9 AGO SP.URB.TO… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7 1.21e+7 1.27e+7 1.33e+7 1.40e+7
## 10 AGO SP.URB.GR… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0 4.89e+0 4.87e+0 4.85e+0 4.83e+0
## # … with 1,046 more rows, and 7 more variables: `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>, `2017` <dbl>
First, use pivot_longer() to gather up the year columns.
pop2 <- world_bank_pop %>%
pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value")
pop2
## # A tibble: 19,008 × 4
## country indicator year value
## <chr> <chr> <chr> <dbl>
## 1 ABW SP.URB.TOTL 2000 42444
## 2 ABW SP.URB.TOTL 2001 43048
## 3 ABW SP.URB.TOTL 2002 43670
## 4 ABW SP.URB.TOTL 2003 44246
## 5 ABW SP.URB.TOTL 2004 44669
## 6 ABW SP.URB.TOTL 2005 44889
## 7 ABW SP.URB.TOTL 2006 44881
## 8 ABW SP.URB.TOTL 2007 44686
## 9 ABW SP.URB.TOTL 2008 44375
## 10 ABW SP.URB.TOTL 2009 44052
## # … with 18,998 more rows
But the indicator column still encodes multiple variables per country, so now we have to separate it.
pop3 <- pop2 %>%
separate(indicator, c(NA, "area", "variable"))
pop3
## # A tibble: 19,008 × 5
## country area variable year value
## <chr> <chr> <chr> <chr> <dbl>
## 1 ABW URB TOTL 2000 42444
## 2 ABW URB TOTL 2001 43048
## 3 ABW URB TOTL 2002 43670
## 4 ABW URB TOTL 2003 44246
## 5 ABW URB TOTL 2004 44669
## 6 ABW URB TOTL 2005 44889
## 7 ABW URB TOTL 2006 44881
## 8 ABW URB TOTL 2007 44686
## 9 ABW URB TOTL 2008 44375
## 10 ABW URB TOTL 2009 44052
## # … with 18,998 more rows
Now we can widen the data.
pop3 %>%
pivot_wider(names_from = variable, values_from = value)
## # A tibble: 9,504 × 5
## country area year TOTL GROW
## <chr> <chr> <chr> <dbl> <dbl>
## 1 ABW URB 2000 42444 1.18
## 2 ABW URB 2001 43048 1.41
## 3 ABW URB 2002 43670 1.43
## 4 ABW URB 2003 44246 1.31
## 5 ABW URB 2004 44669 0.951
## 6 ABW URB 2005 44889 0.491
## 7 ABW URB 2006 44881 -0.0178
## 8 ABW URB 2007 44686 -0.435
## 9 ABW URB 2008 44375 -0.698
## 10 ABW URB 2009 44052 -0.731
## # … with 9,494 more rows
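For reference, the three steps can be chained into a single pipeline; a sketch:
# Sketch: longer, separate, then wider in one chain.
world_bank_pop %>%
  pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value") %>%
  separate(indicator, c(NA, "area", "variable")) %>%
  pivot_wider(names_from = variable, values_from = value)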
Multiple Choice data
multi <- tribble(
~id, ~choice1, ~choice2, ~choice3,
1, "A", "B", "C",
2, "C", "B", NA,
3, "D", NA, NA,
4, "B", "D", NA
)
multi
## # A tibble: 4 × 4
## id choice1 choice2 choice3
## <dbl> <chr> <chr> <chr>
## 1 1 A B C
## 2 2 C B <NA>
## 3 3 D <NA> <NA>
## 4 4 B D <NA>
First, gather it with pivot_longer(), dropping the NAs for people with fewer than 3 choices.
multi2 <- multi %>%
pivot_longer(-id, values_drop_na = TRUE) %>%
mutate(checked = TRUE)
multi2
## # A tibble: 8 × 4
## id name value checked
## <dbl> <chr> <chr> <lgl>
## 1 1 choice1 A TRUE
## 2 1 choice2 B TRUE
## 3 1 choice3 C TRUE
## 4 2 choice1 C TRUE
## 5 2 choice2 B TRUE
## 6 3 choice1 D TRUE
## 7 4 choice1 B TRUE
## 8 4 choice2 D TRUE
Then widen the data.
multi2 %>%
pivot_wider(
id_cols = id,
names_from = value,
values_from = checked,
values_fill = list(checked = FALSE)
)
## # A tibble: 4 × 5
## id A B C D
## <dbl> <lgl> <lgl> <lgl> <lgl>
## 1 1 TRUE TRUE TRUE FALSE
## 2 2 FALSE TRUE TRUE FALSE
## 3 3 FALSE FALSE FALSE TRUE
## 4 4 FALSE TRUE FALSE TRUE
separate()
separate() works when a dataframe includes compound variables - for instance, age and gender stored together in one column. It works in one of two ways: first, you can pass it a regular expression to split on; if you don’t supply one, it defaults to splitting on any non-alphanumeric character, as below.
df <- data.frame(x = c(NA, "m.14", "f.20", "f.45"))
df %>% separate(x, c("Gender", "Age"))
## Gender Age
## 1 <NA> <NA>
## 2 m 14
## 3 f 20
## 4 f 45
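If the default guess isn’t what you want, the separator can be spelled out explicitly; a sketch passing it as a regular expression (the dot has to be escaped):
# Sketch: the same split with the separator given explicitly.
df %>% separate(x, c("Gender", "Age"), sep = "\\.")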
Second, you can tell it a character position to split at. If we change our example slightly, we can see where this might be useful.
df <- data.frame(x = c(NA, "m14", "f20", "f45"))
df %>% separate(x, c("Gender", "Age") , 1)
## Gender Age
## 1 <NA> <NA>
## 2 m 14
## 3 f 20
## 4 f 45
unite()
The unite() function is helpful in that it pulls disparate columns together into one. To see how it works, we’ll make some fake data, and then use unite() to pull the hour, min, and second into a single variable.
date <- as.Date('2018-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
data
## date hour min second event
## 1 2018-01-01 20 26 2 i
## 2 2018-01-02 17 36 38 p
## 3 2018-01-03 6 21 24 h
## 4 2018-01-04 22 42 30 c
## 5 2018-01-05 13 38 43 y
## 6 2018-01-06 23 6 6 b
## 7 2018-01-07 2 9 57 e
## 8 2018-01-08 21 43 19 v
## 9 2018-01-09 10 19 15 z
## 10 2018-01-10 18 22 22 w
## 11 2018-01-11 5 46 14 x
## 12 2018-01-12 11 40 23 q
## 13 2018-01-13 12 29 53 s
## 14 2018-01-14 4 45 48 o
## 15 2018-01-15 3 16 7 r
data %>% unite(datetime, hour, min, second, sep = ":")
## date datetime event
## 1 2018-01-01 20:26:2 i
## 2 2018-01-02 17:36:38 p
## 3 2018-01-03 6:21:24 h
## 4 2018-01-04 22:42:30 c
## 5 2018-01-05 13:38:43 y
## 6 2018-01-06 23:6:6 b
## 7 2018-01-07 2:9:57 e
## 8 2018-01-08 21:43:19 v
## 9 2018-01-09 10:19:15 z
## 10 2018-01-10 18:22:22 w
## 11 2018-01-11 5:46:14 x
## 12 2018-01-12 11:40:23 q
## 13 2018-01-13 12:29:53 s
## 14 2018-01-14 4:45:48 o
## 15 2018-01-15 3:16:7 r
# the default separator is "_", so if that doesn't work, change it as necessary
The unite() function is also helpful when you need to spread multiple columns in a dataframe.
df <- data.frame(quarter=rep(1:4,2),
section=rep(c("Sec1", "Sec2"), each=4),
QA=c(.9, .7, .6, .8, .6, .9, 1.0, .6),
QB=c(.6, .7, .8, .5, .6, .7, .5, .9))
df
## quarter section QA QB
## 1 1 Sec1 0.9 0.6
## 2 2 Sec1 0.7 0.7
## 3 3 Sec1 0.6 0.8
## 4 4 Sec1 0.8 0.5
## 5 1 Sec2 0.6 0.6
## 6 2 Sec2 0.9 0.7
## 7 3 Sec2 1.0 0.5
## 8 4 Sec2 0.6 0.9
What if we want to spread this so that each column relates to a section and a class average on two quizzes each quarter? We would use a lot of what the tidyr package has to offer - gather, unite, and then spread.
df <- data.frame(quarter=rep(1:4,2),
section=rep(c("Sec1", "Sec2"), each=4),
QA=c(.9, .7, .6, .8, .6, .9, 1.0, .6),
QB=c(.6, .7, .8, .5, .6, .7, .5, .9))
df <- df %>%
gather(variable, value, -(quarter:section)) %>%
unite(tmp, section, variable) %>%
spread(tmp, value)
df
## quarter Sec1_QA Sec1_QB Sec2_QA Sec2_QB
## 1 1 0.9 0.6 0.6 0.6
## 2 2 0.7 0.7 0.9 0.7
## 3 3 0.6 0.8 1.0 0.5
## 4 4 0.8 0.5 0.6 0.9
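Since the rest of this page leans on the pivot functions, here is a sketch of the same reshape with pivot_longer() and pivot_wider(); passing multiple names_from columns takes care of the unite step:
# Sketch: the modern equivalent of gather + unite + spread.
df <- data.frame(quarter=rep(1:4,2),
                 section=rep(c("Sec1", "Sec2"), each=4),
                 QA=c(.9, .7, .6, .8, .6, .9, 1.0, .6),
                 QB=c(.6, .7, .8, .5, .6, .7, .5, .9))
df %>%
  pivot_longer(QA:QB, names_to = "quiz", values_to = "score") %>%
  pivot_wider(names_from = c(section, quiz), values_from = score)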