6 Data and object types

🎯 Learning goals

After working through Tutorial 6, you’ll be able to…

explain and correctly apply types of data in R (e.g. numeric, character, NA)
explain and correctly apply objects in R (scalars, vectors, data frames, etc.)

1. Types of data in R

Data types describe values. In R, we will most often encounter the following types of data:

Data type	Description	Example
numeric	numbers used for calculations	`c(1, 2, 3)`
character	text strings	`c("Green", "Yellow", "Blue")`
factor	categorical variable with specific levels	`education = factor(education, levels = c("low", "medium", "high"))`
logical	logical values	`TRUE`, `FALSE`
NA	missing value	`NA`

You can check a variable’s type using class().

c(1, 2, 3) |>
  class()

[1] "numeric"

c("Green", "Yellow", "Blue") |>
  class()

[1] "character"
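The same check works for the other data types in the table above. The following is a small sketch with made-up values (the object `education` is only an illustration and not part of any dataset):

```r
# Logical values
class(TRUE)
# [1] "logical"

# A factor with explicitly specified levels (toy values)
education <- factor(c("low", "high", "medium"),
                    levels = c("low", "medium", "high"))
class(education)
# [1] "factor"
levels(education)
# [1] "low"    "medium" "high"

# A bare NA is stored as a logical value by default
class(NA)
# [1] "logical"
```

Note that a bare NA counts as logical, but it adapts to its surroundings: inside a numeric vector it is treated as a missing number.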

Why do different data types matter? Many statistical functions require numeric input. Try, for example, to calculate the mean of the following values:

c("1", "2", "3") |>
  mean()

Warning in mean.default(c("1", "2", "3")): argument is not numeric or logical:
returning NA

[1] NA

R returns NA together with a warning. Why? Because the values are stored as character strings, not as numeric values, so no mean can be calculated. Once the values are stored as numbers, the calculation works:

c(1, 2, 3) |>
  mean()

[1] 2
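If numbers arrive as text (which often happens after importing data), one possible fix is to convert them with as.numeric() before calculating. This is a minimal sketch; the object name `numbers_as_text` is made up for illustration:

```r
# Numbers stored as text (character)
numbers_as_text <- c("1", "2", "3")

# Convert to numeric first, then compute the mean
numbers_as_text |>
  as.numeric() |>
  mean()
# [1] 2
```

Be aware that as.numeric() turns any text it cannot interpret as a number into NA and issues a warning, so it is worth checking the result.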

1.1 Making character data explicit

Character data requires quotation marks " to be recognized as such. See the difference here:

c(1, 2, 3) |>
  class()

[1] "numeric"

c("1", "2", "3") |>
  class()

[1] "character"

1.2 Missing values

Missing values are stored as NA. You should always check whether values are missing (and of course, why).

Let’s check an example: We will again use the WoJ dataset. It contains a quantitative survey of N = 1,200 journalists from five countries, collected via the Worlds of Journalism Study. The data is provided via the tidycomm package developed by Unkel et al.

First, we retrieve the data and save it as an object named data_woj. Next, we count how many missing values exist in the variable work_experience. For this, we use three functions:

summarise(): reduces a dataset to a summary by computing statistics (for example counts or averages).
sum(): adds values together. When used with logical values (TRUE / FALSE), it counts how many TRUE values exist - in this case, how many values are missing.
is.na(): checks whether values are missing and returns TRUE for missing values and FALSE otherwise.

library(tidyverse)
library(tidycomm)

data_woj <- tidycomm::WoJ
data_woj |>
  summarise(n_missing = sum(is.na(work_experience)))

# A tibble: 1 × 1
  n_missing
      <int>
1        13
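To see how is.na() and sum() work together, it can help to run both steps on a tiny vector first. The vector `x` below contains made-up values and is only meant as an illustration:

```r
# Toy vector with two missing values (made-up numbers)
x <- c(4, NA, 7, NA)

# Step 1: is.na() returns one logical value per element
is.na(x)
# [1] FALSE  TRUE FALSE  TRUE

# Step 2: sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE values
sum(is.na(x))
# [1] 2
```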

Many summary functions, such as mean() or sum(), let you handle missing values explicitly via the na.rm argument: Set it to TRUE if missing values (na) should be removed (rm) before the calculation.

data_woj |>
  summarise(mean = mean(work_experience, na.rm = TRUE))

# A tibble: 1 × 1
   mean
  <dbl>
1  17.8
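Why is this argument needed at all? By default, a single missing value makes the whole result missing. A small sketch with a made-up vector `x` shows the difference:

```r
# Toy vector with one missing value (made-up numbers)
x <- c(1, NA, 3)

# Default behaviour: the result is NA because one value is unknown
mean(x)
# [1] NA

# With na.rm = TRUE, the missing value is dropped before calculating
mean(x, na.rm = TRUE)
# [1] 2
```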

Now, imagine that you want to set specific values to NA. For example, we may want to set the value Freelancer in employment to missing. For this, we can use mutate(), which we already know from previous tutorials, and na_if():

# Original data
data_woj |>
  head()

# A tibble: 6 × 15
  country    reach employment temp_contract autonomy_selection autonomy_emphasis
  <fct>      <fct> <chr>      <fct>                      <dbl>             <dbl>
1 Germany    Nati… Full-time  Permanent                      5                 4
2 Germany    Nati… Full-time  Permanent                      3                 4
3 Switzerla… Regi… Full-time  Permanent                      4                 4
4 Switzerla… Local Part-time  Permanent                      4                 5
5 Austria    Nati… Part-time  Permanent                      4                 4
6 Switzerla… Local Freelancer <NA>                           4                 4
# ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
#   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
#   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

# Redefine missing values
data_woj |>
  mutate(employment = na_if(employment, "Freelancer")) |>
  head()

# A tibble: 6 × 15
  country    reach employment temp_contract autonomy_selection autonomy_emphasis
  <fct>      <fct> <chr>      <fct>                      <dbl>             <dbl>
1 Germany    Nati… Full-time  Permanent                      5                 4
2 Germany    Nati… Full-time  Permanent                      3                 4
3 Switzerla… Regi… Full-time  Permanent                      4                 4
4 Switzerla… Local Part-time  Permanent                      4                 5
5 Austria    Nati… Part-time  Permanent                      4                 4
6 Switzerla… Local <NA>       <NA>                           4                 4
# ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
#   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
#   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
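na_if() also works on plain vectors, which makes it easy to try out before using it inside mutate(). The sketch below assumes the dplyr package is installed; the vectors `employment_toy` and `codes_toy` are made up for illustration:

```r
library(dplyr)

# Toy character vector (made-up values)
employment_toy <- c("Full-time", "Freelancer", "Part-time")

# Every value equal to "Freelancer" becomes NA
na_if(employment_toy, "Freelancer")
# [1] "Full-time" NA          "Part-time"

# The same idea for a numeric code such as 99
codes_toy <- c(1, 99, 3)
na_if(codes_toy, 99)
# [1]  1 NA  3
```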

2. Types of objects in R

Great, now you know all about types of data. While data types describe values, object types describe how values are stored.

As promised, this tutorial will also teach you about different types of objects:

Object	Description	Example
scalar	single value	`3`
vector	multiple values of same type	`c(1, 2)`
tibble/data frame	collection of vectors	`WoJ`
list	collection of different objects	`list(number = c(1,2), text = "word")`
function	a reusable command that performs a computation	`select()`

2.1 Scalar

The “smallest” type of object you will encounter is the scalar.

Scalars are objects consisting of a single value - for example, a letter, a word, a sentence, a number, etc.

You’ve already seen what a scalar looks like - remember when we started with R? Here, we defined an object word which only consisted of the word “hello”.

word <- "hello"
word

[1] "hello"

Technically, scalars are vectors of length 1.
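You can verify this yourself with a quick check. The sketch reuses the object word from above:

```r
word <- "hello"

# A scalar is a vector with exactly one element
length(word)
# [1] 1
is.vector(word)
# [1] TRUE

# Note: length() counts elements, nchar() counts characters
nchar(word)
# [1] 5
```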

2.2 Vector

The next type of object you should know is the vector: Vectors contain multiple values of the same type. In tidyverse thinking:

Each column in a tibble is a vector.

A vector can only have one underlying type (e.g., numeric or character).

In principle, you can often (but not always) compare vectors with variables in data sets: They contain values for all observations in your data set (with all of these values being of the same data type).

An example would be the numbers from 1 to 20 that we worked with before. Now it becomes apparent what c() stands for: it combines individual values into a vector.

We define the object numbers as a vector containing the values 1 to 20. Instead of typing every value inside c(), we use the shorthand 1:20, which generates the whole sequence.

numbers <- 1:20
numbers

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
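What happens if you try to mix types in one vector anyway? R silently converts all values to a single common type. The following sketch uses a made-up vector `mixed` to illustrate this:

```r
# Mixing a number, a text string, and a logical value
mixed <- c(1, "a", TRUE)

# R converts everything to character so that the vector has one type
mixed
# [1] "1"    "a"    "TRUE"
class(mixed)
# [1] "character"

# Sequences created with : are stored as integers, a special kind of number
class(1:20)
# [1] "integer"
class(c(1, 2, 3))
# [1] "numeric"
```

This silent conversion is a common reason why a variable you expected to be numeric turns out to be character.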

2.3 Tibble/Data frame

We already know tibbles:

A tibble (or a data frame) combines several vectors of equal length.

Our data_woj data set from tidycomm is an example of this:

data_woj

# A tibble: 1,200 × 15
   country   reach employment temp_contract autonomy_selection autonomy_emphasis
   <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
 1 Germany   Nati… Full-time  Permanent                      5                 4
 2 Germany   Nati… Full-time  Permanent                      3                 4
 3 Switzerl… Regi… Full-time  Permanent                      4                 4
 4 Switzerl… Local Part-time  Permanent                      4                 5
 5 Austria   Nati… Part-time  Permanent                      4                 4
 6 Switzerl… Local Freelancer <NA>                           4                 4
 7 Germany   Local Full-time  Permanent                      4                 4
 8 Denmark   Nati… Full-time  Permanent                      3                 3
 9 Switzerl… Local Full-time  Permanent                      5                 5
10 Denmark   Nati… Full-time  Permanent                      2                 4
# ℹ 1,190 more rows
# ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
#   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
#   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

For a nice example for comparing scalars, vectors, and other data types check out this example here.

In R, we can access variables inside tibbles with the dollar sign $:

data_woj$country

We can manually inspect the whole data frame using View():

View(data_woj)
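To make the link between vectors and tibbles concrete, you can build a small tibble yourself and pull a column back out with $. This is a sketch with made-up values, assuming the tibble package is installed; `toy_data` is a hypothetical object, not part of tidycomm:

```r
library(tibble)

# Toy tibble built from two vectors of equal length (made-up values)
toy_data <- tibble(
  country = c("Germany", "Austria", "Denmark"),
  work_experience = c(10, 7, 15)
)

# $ extracts one column as a plain vector
toy_data$work_experience
# [1] 10  7 15

# Each column keeps its own data type
class(toy_data$country)
# [1] "character"
class(toy_data$work_experience)
# [1] "numeric"
```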

2.4 Lists

Finally, lists:

Lists can store different object types of different lengths together.

As discussed, tibbles can include vectors consisting of different types of data (for instance, character and numeric vectors) but always of the same length.

In some cases, lists offer a more flexible way of saving very different objects within one object (i.e., the list).

Let’s check this example for a list:

list(data = data_woj,
     numbers = c(1,2,3),
     text = "example")

As you see, the object includes three elements:

the first element $data is the tibble data_woj
the second element $numbers is the numeric vector numbers
the third element $text is the character scalar text
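To retrieve elements from a list, you can use either $ or double square brackets. The sketch below uses a smaller, made-up list called `my_list` (without the tibble) to keep the output short:

```r
# A list with two elements of different type and length
my_list <- list(numbers = c(1, 2, 3),
                text = "example")

# Access an element by name with $
my_list$numbers
# [1] 1 2 3

# Or with double square brackets
my_list[["text"]]
# [1] "example"

# length() counts the elements of the list, not the values inside them
length(my_list)
# [1] 2
```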

💡 Take Aways

Types of data: describe what values are (numeric, character, factor, logical).
Types of objects: describe how values are organized (vectors, tibbles, lists).

🤓 Smart Hacks

Smart Hack 1: Using across() to repeat operations over columns

When working with tibbles, you often want to apply the same operation to multiple variables at once. Instead of repeating the same command, you can use across().

across() allows you to apply one function (or several functions) to multiple columns simultaneously.

Suppose we want to transform all character variables to uppercase. We can do this like so:

data_woj |>
  mutate(across(where(is.character), toupper)) |>
  head()

# A tibble: 6 × 15
  country    reach employment temp_contract autonomy_selection autonomy_emphasis
  <fct>      <fct> <chr>      <fct>                      <dbl>             <dbl>
1 Germany    Nati… FULL-TIME  Permanent                      5                 4
2 Germany    Nati… FULL-TIME  Permanent                      3                 4
3 Switzerla… Regi… FULL-TIME  Permanent                      4                 4
4 Switzerla… Local PART-TIME  Permanent                      4                 5
5 Austria    Nati… PART-TIME  Permanent                      4                 4
6 Switzerla… Local FREELANCER <NA>                           4                 4
# ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
#   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
#   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>
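across() is not limited to mutate(); it can also be used inside summarise(). The following sketch assumes the dplyr package is installed and uses a made-up tibble `toy_scores` to calculate the mean of every numeric column in one step:

```r
library(dplyr)

# Toy tibble with one character and two numeric columns (made-up values)
toy_scores <- tibble(
  name = c("a", "b", "c"),
  score_1 = c(1, 2, NA),
  score_2 = c(4, 5, 6)
)

# Calculate the mean of all numeric columns, ignoring missing values
toy_means <- toy_scores |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

toy_means
```

The result is a tibble with one row: score_1 has a mean of 1.5 (the missing value is ignored) and score_2 has a mean of 5. The character column name is skipped because it does not match where(is.numeric).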

🎲 Quiz

🎲 Question 1

Which of the following statements are correct?

Statement	Correct answer
Numeric data can be used directly for mathematical calculations.	True
Character values are recognized without quotation marks.	False
TRUE and FALSE belong to the logical data type.	True
NA represents a missing value.	True
`c("1", "2")` is stored as numeric data.	False

🎲 Question 2

Which of the following statements about missing data are correct?

Statement	Correct answer
I do not need to check for missing values - R can handle this automatically.	False
`is.na()` checks whether values are missing.	True
`sum(is.na(x))` counts how many missing values exist.	True
Setting `na.rm = TRUE` tells R to ignore missing values in calculations.	True

🎲 Question 3

You want to set all values in ethics_1 coded as 3 to NA. Fill in the missing function name (marked ___) to complete the command!

data_woj |>
  mutate(ethics_1 = ___(ethics_1, 3))

Solution: na_if

📚 More tutorials on this

You still have questions? The following tutorials & papers can help you with that:

📌 Test your knowledge

Task 1 (Easy🔥)

Use the WoJ data. Save it as data_woj. Check the data type of the following variables:

country
employment

data_woj  |>
  select(country, employment) |>
  head()

# A tibble: 6 × 2
  country     employment
  <fct>       <chr>     
1 Germany     Full-time 
2 Germany     Full-time 
3 Switzerland Full-time 
4 Switzerland Part-time 
5 Austria     Part-time 
6 Switzerland Freelancer

# or.... 

data_woj |> 
  summarise(across(c("country", "employment"), class))

# A tibble: 1 × 2
  country employment
  <chr>   <chr>     
1 factor  character

Task 2 (Medium🔥🔥)

Using data_woj…

Count how many missing values exist in work_experience.
Replace all values greater than 15 in work_experience with NA.
Save the modified dataset as data_woj_cleaned.

# How many missing values?
data_woj |>
  summarise(n_missing = sum(is.na(work_experience)))

# A tibble: 1 × 1
  n_missing
      <int>
1        13

# Replace values above 15 with NA
data_woj_cleaned <- data_woj |>
  mutate(work_experience = replace(work_experience,
                                   work_experience > 15,
                                   NA)) 

# Check result
data_woj_cleaned |>
  select(work_experience)

# A tibble: 1,200 × 1
   work_experience
             <dbl>
 1              10
 2               7
 3               6
 4               7
 5              15
 6              NA
 7              NA
 8              11
 9              NA
10               4
# ℹ 1,190 more rows

Task 3 (Hard🔥🔥🔥)

This task contains several commands you do not yet know! Use data_woj (not the cleaned data set) and …

Group the data by country.
For all numeric variables, calculate the mean and the number of missing values.
Save the result as country_summary.
For the country with the highest mean in trust_parliament, plot the average trust score!

country_summary <- data_woj |>
  group_by(country) |>
  summarise(across(where(is.numeric), list(mean = ~ mean(.x, na.rm = TRUE),
                                           n_miss = ~ sum(is.na(.x)))))

# Check the result
country_summary

# A tibble: 5 × 23
  country   autonomy_selection_m…¹ autonomy_selection_n…² autonomy_emphasis_mean
  <fct>                      <dbl>                  <int>                  <dbl>
1 Austria                     3.92                      0                   4.19
2 Denmark                     3.76                      0                   3.90
3 Germany                     3.97                      1                   4.34
4 Switzerl…                   3.92                      0                   4.07
5 UK                          3.91                      2                   4.08
# ℹ abbreviated names: ¹autonomy_selection_mean, ²autonomy_selection_n_miss
# ℹ 19 more variables: autonomy_emphasis_n_miss <int>, ethics_1_mean <dbl>,
#   ethics_1_n_miss <int>, ethics_2_mean <dbl>, ethics_2_n_miss <int>,
#   ethics_3_mean <dbl>, ethics_3_n_miss <int>, ethics_4_mean <dbl>,
#   ethics_4_n_miss <int>, work_experience_mean <dbl>,
#   work_experience_n_miss <int>, trust_parliament_mean <dbl>,
#   trust_parliament_n_miss <int>, trust_government_mean <dbl>, …

# For the country with the highest mean in `trust_parliament`, plot the average trust score!
country_summary |>
  filter(trust_parliament_mean == max(trust_parliament_mean)) |>
  
  # a very simple plot
  ggplot(aes(x = country, y = trust_parliament_mean)) +
  geom_col()