An introduction to the Tidyverse

https://bit.ly/3LWSBQK

(Un)Tidy data - a horror story (Broman and Woo 2018)

1. Empty cells

2. Inconsistency

3. Complicated layout

4. Multiple id-row

6. Summaries

7. True Horror data

Horror data

A Tidy data set

Tidy data

Tidy data principles. Source: https://r4ds.had.co.nz/tidy-data.html
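As a rough illustration of the principles (each variable in its own column, each observation in its own row, each value in its own cell), the sketch below reshapes a small made-up table. The `untidy` data, its `site` column and the `year`/`count` names are assumptions invented for this example, not part of the slides.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: the years are spread across column names,
# so the variable "year" is hidden in the header.
untidy <- tribble(
  ~site, ~`2007`, ~`2008`,
  "A",   12,      15,
  "B",   7,       9
)

# Move the year columns into a single "year" variable,
# giving one row per site-year observation.
tidy <- untidy |>
  pivot_longer(
    cols = c(`2007`, `2008`),
    names_to = "year",
    values_to = "count"
  )

tidy
```

After reshaping, every row is one observation (a site in a given year) and every column is one variable.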

Make friends with tidy data

The Tidyverse

Manipulating data with the Tidyverse

The Palmer Penguins Data Set

Loading packages:

library(tidyverse)
library(palmerpenguins)

Data wrangling and cleaning

  • filter(): keeps the observations (rows) that satisfy a criterion.
  • mutate(): creates a new variable (column) based on existing variables.
  • select(): keeps only the chosen variables (columns).
  • group_by(): groups observations according to the categories of a variable.
  • drop_na(): removes rows with NA values, either in any column or only in the columns you name.
  • summarise(): reduces multiple values to a single value per group using an operation (e.g. mean()).
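To see each verb in isolation, here is a minimal sketch on a made-up tibble. The `toy` data and its `group` and `value` columns are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, not part of the penguins data set
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, NA, 4)
)

kept     <- filter(toy, group == "a")        # rows where group is "a"
doubled  <- mutate(toy, double = value * 2)  # new column from an existing one
only_val <- select(toy, value)               # keep a single column
complete <- drop_na(toy, value)              # drop rows where value is NA

# group_by() followed by summarise() gives one row per group
means <- summarise(group_by(complete, group), mean_value = mean(value))
```

Each call returns a new tibble and leaves `toy` unchanged, which is what makes the verbs easy to chain.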

A common sequence of operations is written as a pipeline:

tibble |> 
  function() |> 
  function() ...
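The native pipe `|>` (available from R 4.1) passes the value on its left as the first argument of the call on its right, so a pipeline is just another way of writing nested calls. A small sketch with an arbitrary numeric vector:

```r
x <- c(4, 9, 16)

# Nested calls are read inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps as a pipeline are read top to bottom
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both forms compute the same value; the pipeline simply lists the steps in the order they happen.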

Let’s try them out

What are the maximum and minimum body masses of Adelie and Gentoo penguins?

penguins

penguins |> 
  filter(species == "Adelie" | species == "Gentoo")

penguins |> 
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g)

penguins |> 
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) 

penguins |> 
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) |> 
  drop_na(body_mass_g)

penguins |> 
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) |> 
  drop_na(body_mass_g) |> 
  summarise(max_weight_g = max(body_mass_g), min_weight_g = min(body_mass_g))

penguins |> 
  filter(species == "Adelie" | species == "Gentoo") |>
  select(species, body_mass_g) |>
  group_by(species) |> 
  drop_na(body_mass_g) |> 
  summarise(max_weight_g = max(body_mass_g), min_weight_g = min(body_mass_g)) |>
  mutate(max_weight_kg = max_weight_g / 1000)
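The same question can be answered a little more compactly. The sketch below is one possible variant, not the only way to write it: it swaps the two `==` comparisons for `%in%` and drops the `select()` step, since `summarise()` only returns the grouping column and the summaries anyway. The name `weights` is an assumption for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

weights <- penguins |>
  # %in% tests membership in a set of values
  filter(species %in% c("Adelie", "Gentoo")) |>
  # remove rows with a missing body mass before summarising
  drop_na(body_mass_g) |>
  group_by(species) |>
  summarise(
    max_weight_g = max(body_mass_g),
    min_weight_g = min(body_mass_g)
  ) |>
  # convert the maximum from grams to kilograms
  mutate(max_weight_kg = max_weight_g / 1000)

weights
```

`%in%` scales better than chained `|` conditions when more species or categories need to be kept.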

Data visualization with ggplot2

Loading packages:

library(ggplot2)
library(palmerpenguins)
ggplot(
    data = penguins,
    aes(
        x = flipper_length_mm,
        y = bill_length_mm
    )
) +
  geom_point(
      aes(
          color = species,
          shape = species
      ),
      size = 3,
      alpha = 0.8
  ) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(
      title = "Flipper and bill length",
      subtitle = "Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
      x = "Flipper length (mm)",
      y = "Bill length (mm)",
      color = "Penguin species",
      shape = "Penguin species"
  ) +
  theme(
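      # ggplot2 >= 3.5.0 deprecates numeric positions here; there, use
      # legend.position = "inside", legend.position.inside = c(0.85, 0.15)
      legend.position = c(0.85, 0.15),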
      plot.title.position = "plot",
      plot.caption = element_text(hjust = 0, face = "italic"),
      plot.caption.position = "plot"
  )

ggplot(data = penguins,
    aes(
        x = flipper_length_mm,
        y = bill_length_mm
    )
)

ggplot(data = penguins,
    aes(
        x = flipper_length_mm,
        y = bill_length_mm
    )
) +
  geom_point(
      aes(
          color = species,
          shape = species
      )
  )

ggplot(data = penguins,
    aes(
        x = flipper_length_mm,
        y = bill_length_mm
    )
) +
  geom_point(
      aes(
          color = species,
          shape = species
      )
  ) +
  scale_color_manual(
    values = c(
      "darkorange", 
      "purple",
      "cyan4"
      )
  )

ggplot(data = penguins,
    aes(
        x = flipper_length_mm,
        y = bill_length_mm
    )
) +
  geom_point(
      aes(
          color = species,
          shape = species
      )
  ) +
  scale_color_manual(
    values = c(
      "darkorange", 
      "purple",
      "cyan4"
      )
  ) +
  labs(
      title = "Flipper and bill length",
      subtitle = "Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
      x = "Flipper length (mm)",
      y = "Bill length (mm)",
      color = "Penguin species",
      shape = "Penguin species"
  )
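Because a ggplot is built by adding layers with `+`, the steps above can also be kept as intermediate objects rather than retyped each time. This is a minimal sketch; the names `base_plot`, `scatter` and `final_plot` are assumptions for illustration, and `na.rm = TRUE` is added only to silence the warning about rows with missing measurements.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: data and aesthetic mappings only (draws an empty panel)
base_plot <- ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = bill_length_mm)
)

# Step 2: add the point layer, mapping species to color and shape
scatter <- base_plot +
  geom_point(aes(color = species, shape = species), na.rm = TRUE)

# Step 3: add the manual color scale and a title
final_plot <- scatter +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(title = "Flipper and bill length")

final_plot
```

Printing any of the three objects shows the plot at that stage, which is handy when exploring how each layer changes the figure.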

Browse the ggplot2 extensions gallery at https://exts.ggplot2.tidyverse.org/gallery/

References

Broman, Karl W, and Kara H Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10.