An introduction to the Tidyverse

https://bit.ly/3LWSBQK

(Un)Tidy data - a horror story (Broman and Woo 2018)

1. Empty cells

2. Inconsistency

3. Complicated layout

4. Multiple id-row

6. Summaries

7. True Horror data

Horror data

A Tidy data set

Tidy data

Tidy data principles. Source: https://r4ds.had.co.nz/tidy-data.html
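As a small illustration of these principles (each variable is a column, each observation is a row, each value is a cell), the sketch below reshapes a wide table into tidy form. The `counts_wide` tibble and its values are invented for the example; they are not from the slides.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: the years are spread across column names
counts_wide <- tribble(
  ~site, ~`2007`, ~`2008`,
  "A",   10,      12,
  "B",    7,       9
)

# pivot_longer() gathers the year columns into a single `year` variable,
# so every row becomes one observation: (site, year, count)
counts_tidy <- counts_wide |>
  pivot_longer(
    cols = c(`2007`, `2008`),
    names_to = "year",
    values_to = "count"
  )

counts_tidy
```

After reshaping, `year` is an ordinary column that can be filtered, grouped, or plotted like any other variable.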

Make friends with tidy data

The Tidyverse

Manipulating data with the Tidyverse

The Palmer Penguins Data Set

Importing libraries:

library(tidyverse)
library(palmerpenguins)

Data wrangling and cleaning

  • filter(): keeps the observations (rows) that satisfy a criterion.
  • mutate(): creates a new variable (column) based on existing variables.
  • select(): selects variables (columns).
  • group_by(): groups observations according to the categories of a variable.
  • drop_na(): drops rows that contain NA values, either in any column or only in the specified variables.
  • summarise(): reduces multiple values to a single value based on an operation (e.g. mean()).
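A minimal sketch of each verb applied on its own, assuming a small made-up tibble (`toy`, with hypothetical columns `group` and `value`) so the effect of every call is easy to check by eye:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, invented for this example
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, NA)
)

# filter(): keep rows where value > 1 (rows where the test is NA are dropped)
kept <- filter(toy, value > 1)

# mutate(): add a column computed from an existing one
with_new <- mutate(toy, doubled = value * 2)

# select(): keep only the `value` column
only_value <- select(toy, value)

# drop_na(): remove rows with NA in `value`
complete <- drop_na(toy, value)

# group_by() + summarise(): one row per group
means <- summarise(group_by(complete, group), mean_value = mean(value))
means
```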

A common sequence of operations is written as a pipeline:

tibble |> 
  function |> 
  function ...
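To make the schematic concrete, here is a minimal sketch using a made-up tibble (`toy` and its columns are hypothetical). The native pipe `|>` passes the left-hand side as the first argument of the next call, so the piped and nested forms should give the same result.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for this example
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

# Piped form: read top to bottom
piped <- toy |>
  filter(group == "b") |>
  mutate(doubled = value * 2)

# Equivalent nested form: read inside out
nested <- mutate(filter(toy, group == "b"), doubled = value * 2)

# Compare the two results
identical(piped, nested)
```

The piped form keeps each step on its own line, which is why the slides build the analysis one verb at a time.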

Let’s try them out

What are the maximum and minimum body masses of Adelie and Gentoo penguins?

penguins

penguins |>
  filter(species == "Adelie" | species == "Gentoo")

penguins |>
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g)

penguins |>
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species)

penguins |>
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) |>
  drop_na(body_mass_g)

penguins |>
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) |>
  drop_na(body_mass_g) |>
  summarise(max_weight_g = max(body_mass_g), min_weight_g = min(body_mass_g))

penguins |>
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) |>
  drop_na(body_mass_g) |>
  summarise(max_weight_g = max(body_mass_g), min_weight_g = min(body_mass_g)) |>
  mutate(max_weight_kg = max_weight_g / 1000)
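An equivalent way to write the same question, offered as a sketch rather than the slides' own method: `%in%` replaces the chained `|` comparisons, and `na.rm = TRUE` ignores missing masses inside `max()` and `min()` instead of calling `drop_na()` first. The name `mass_range` is introduced here only for illustration.

```r
library(dplyr)
library(palmerpenguins)

mass_range <- penguins |>
  # keep the two species of interest
  filter(species %in% c("Adelie", "Gentoo")) |>
  group_by(species) |>
  # na.rm = TRUE skips missing body masses within each group
  summarise(
    max_weight_g = max(body_mass_g, na.rm = TRUE),
    min_weight_g = min(body_mass_g, na.rm = TRUE)
  ) |>
  # convert the maximum from grams to kilograms
  mutate(max_weight_kg = max_weight_g / 1000)

mass_range
```

Dropping the `select()` step is a design choice: `summarise()` already returns only the grouping variable and the summary columns.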

Data visualization with ggplot2

Importing libraries:

library(ggplot2)
library(palmerpenguins)
ggplot(
  data = penguins,
  aes(
    x = flipper_length_mm,
    y = bill_length_mm
  )
) +
  geom_point(
    aes(
      color = species,
      shape = species
    ),
    size = 3,
    alpha = 0.8
  ) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
    x = "Flipper length (mm)",
    y = "Bill length (mm)",
    color = "Penguin species",
    shape = "Penguin species"
  ) +
  theme(
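    # numeric coordinates place the legend inside the panel; ggplot2 >= 3.5.0
    # prefers legend.position = "inside" with legend.position.inside = c(0.85, 0.15)
    legend.position = c(0.85, 0.15),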
    plot.title.position = "plot",
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.caption.position = "plot"
  )

ggplot(
  data = penguins,
  aes(
    x = flipper_length_mm,
    y = bill_length_mm
  )
)

ggplot(
  data = penguins,
  aes(
    x = flipper_length_mm,
    y = bill_length_mm
  )
) +
  geom_point(
    aes(
      color = species,
      shape = species
    )
  )

ggplot(
  data = penguins,
  aes(
    x = flipper_length_mm,
    y = bill_length_mm
  )
) +
  geom_point(
    aes(
      color = species,
      shape = species
    )
  ) +
  scale_color_manual(
    values = c(
      "darkorange",
      "purple",
      "cyan4"
    )
  )

ggplot(
  data = penguins,
  aes(
    x = flipper_length_mm,
    y = bill_length_mm
  )
) +
  geom_point(
    aes(
      color = species,
      shape = species
    )
  ) +
  scale_color_manual(
    values = c(
      "darkorange",
      "purple",
      "cyan4"
    )
  ) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
    x = "Flipper length (mm)",
    y = "Bill length (mm)",
    color = "Penguin species",
    shape = "Penguin species"
  )
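The step-by-step build can also be written by storing the plot in an object and adding to it, which avoids repeating the earlier layers. This is a minimal sketch; the object names `p`, `p_points` and `p_colors` are introduced here for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Base plot: data and aesthetic mappings only, no geometry yet
p <- ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = bill_length_mm)
)

# Each `+` returns a new plot object with one more layer or scale
p_points <- p +
  geom_point(aes(color = species, shape = species))

p_colors <- p_points +
  scale_color_manual(values = c("darkorange", "purple", "cyan4"))

# Printing the object draws the plot
p_colors
```

Because `p` itself is left unchanged, it can be reused to try a different geom or color scale without rebuilding the base plot.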

For more ggplot2 extensions, visit https://exts.ggplot2.tidyverse.org/gallery/

References

Broman, Karl W, and Kara H Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10.