Introduction to reproducible data analysis with R and Quarto

KLI Seminar 2023

Anne-Kathrin Kleine

Schedule

DAY 1

9.00-9.30: Introduction

Welcome and overview
Reproducibility and open science

9.30-10.30: Getting Started with R and Quarto

Basics of R programming
Best practices for organizing your code and materials

11.00-12.00: Creating data analysis projects in R and Quarto

Creating a data analysis project
Integrating code, text, and output within a Quarto document

12.15-13.00: Working with data

Working with data in R: importing, manipulating, and exporting data
Data visualization and data analysis

DAY 2

09.00-10.00: Welcome and Hands-on Practice PART I

Independent practice: Working on your data analysis project (with support from workshop facilitator)

10.10-11.00: Publishing Reproducible Data Analysis Scripts

Publishing through GitHub and RPubs; integration with OSF

11.30-12.30: Hands-on Practice PART II

Continue working on your data analysis projects
Publishing your projects

12.30-13.00: Closing Remarks

Recap of the workshop
Q&A session

Notes

If you want to say something, please raise your hand. If I don’t see it, please unmute yourself and start talking.

Please participate actively throughout the course with comments, questions (!), and answers

Mute yourself when you are not speaking

Small technical challenges may be addressed directly (e.g., “Where do I have to click to open RStudio?”)

Bigger technical challenges may be addressed either after the workshop today or before the workshop tomorrow (e.g., “My RStudio does not work”)

Workshop Requirements

Are you on the latest version of RStudio?

Open annekathrinkleine.com

Open RStudio

Reproducibility and open science

Reproducibility and replicability

Reproducibility

Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.

Reproducibility emphasizes re-using original data and methods for verification, while replicability focuses on achieving consistent results with repeats of the experiment under similar conditions but using new data.

From The Open Science Training Handbook

Elements of Open Science

Source: The six core principles of Open Science

The benefits of Open Science for researchers

Source: The benefits of Open Science

The role of R and Quarto (RMarkdown) for Open Science

R and Quarto are open-source

ensuring accessibility, collaboration, customizability, longevity

Producing shareable and collaborative high-quality documents that integrate text and R code for analysis and visualization within one document

minimize “false” results in output documents because nothing needs to be copy-pasted
“coding” of research papers (easy integration of later modifications)

Change Your Mental Model

Source

A blank Word document

Output

A blank Word document

Source

---
title: "ggplot2 demo"
author: "Norah Jones"
date: "5/22/2021"
format: 
  html:
    fig-width: 8
    fig-height: 4
    code-fold: true
---

## Air Quality

@fig-airquality further explores the impact of temperature 
  on ozone level.

```{r}
#| label: fig-airquality
#| fig-cap: Temperature and ozone level.
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) + 
  geom_point() + 
  geom_smooth(method = "loess")
```

Output

Getting Started with R

Creating a data analysis project

Create a folder for your R project

Create a Quarto document (report.qmd)
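
The folder skeleton can also be created from the R console. The following is a minimal sketch, assuming the sub-folder names that appear in the file paths later in these slides (data/raw, data/processed, R, output/figs, config). `project_root` is a hypothetical location inside a temporary directory and should be replaced with your own path.

```r
# Hypothetical project root (assumption); replace with your own location
project_root <- file.path(tempdir(), "Exercise")

# Sub-folders that the later slides read from and write to
sub_dirs <- c("data/raw", "data/processed", "R", "output/figs", "config")

# Create each folder, including any missing parent folders
for (d in sub_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Create an empty Quarto document if it does not exist yet
report_file <- file.path(project_root, "report.qmd")
if (!file.exists(report_file)) file.create(report_file)

# Inspect the resulting structure
list.dirs(project_root, recursive = TRUE, full.names = FALSE)
```

Running the sketch twice is harmless: existing folders and an existing report.qmd are left untouched.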

Install relevant packages

```{r}
#| eval: false
pkg_list <- c("tidyverse", "haven", "flextable", "broom", "report", "effectsize", "rempsyc")
install.packages(pkg_list)
```
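
To avoid re-installing packages that are already present, the call above could be limited to the missing ones. A sketch, with `missing_pkgs` as an illustrative name; the `interactive()` guard is a design choice (not part of the original slides) so that rendering a document never triggers an installation.

```r
pkg_list <- c("tidyverse", "haven", "flextable", "broom", "report", "effectsize", "rempsyc")

# Packages from the list that are not yet in the local library
missing_pkgs <- setdiff(pkg_list, rownames(installed.packages()))
missing_pkgs

# Install only what is missing, and only when working interactively
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```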

Create an R project

R projects

When a project is opened within RStudio the following actions are taken:

A new R session is started

The .Rprofile, .RData file, and .Rhistory files in the project’s main directory are loaded

The current working directory is set to the project directory

Previously edited source documents are restored into editor tabs

Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed

A new R session is started: A fresh R session begins to guarantee that any variables, functions, and loaded packages from the previous session are cleared. This helps prevent any conflicts or issues that may have carried over from other projects or work.
The .Rprofile, .RData file, and .Rhistory files in the project’s main directory are loaded: These files preserve the state of your R environment. .Rprofile contains code that is run every time you start R, .RData saves the objects in your environment, and .Rhistory stores your command history. Loading these files ensures that your project’s environment and history are restored.
The current working directory is set to the project directory: This means that when you read or write files, RStudio will automatically look in the project directory, which saves you having to specify the full file path. It makes it much easier to run and manage your project’s files.
Previously edited source documents are restored into editor tabs: The source files (like .R scripts) that were open when you last closed the project will re-open. This restores your working context, allowing you to pick up right where you left off.
Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed: This feature preserves the UI state of your RStudio session. Active tabs (like Console, Environment, etc.) are switched back to their previous state. Also, splitter positions (the size and arrangement of your panels/windows) are restored, so you don’t have to spend time rearranging your workspace every time you re-open your project.
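
The working-directory point can be checked directly in the console. A small sketch, assuming the data file location used later in these slides (data/raw/data.xlsx):

```r
# Where R currently reads and writes files;
# inside an opened R project this is the project folder
getwd()

# A relative path, resolved against the working directory
raw_path <- file.path("data", "raw", "data.xlsx")
raw_path

# Check whether the file is actually there before trying to read it
file.exists(raw_path)
```

If `file.exists()` returns FALSE although the file is on disk, the working directory is most likely not the project folder.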

Working with Quarto

How does Quarto work?

A diagram of how a QMD is turned into output formats via knitr and pandoc

So what is Quarto?

Quarto is a command line interface (CLI) that renders plain text formats (.qmd, .rmd, .md) into static PDF/Word/HTML reports, books, websites, presentations and more

One install, “Batteries included”

Quarto is bundled and comes pre-installed with RStudio v2022.07.1 and beyond!
More information

Rendering

Render button

A screenshot of the render button in RStudio

Rendering

Terminal shell via quarto render

terminal

quarto render document.qmd # defaults to html
quarto render document.qmd --to pdf
quarto render document.qmd --to docx

Quick excursus: The Terminal

Windows PowerShell

Click Start, type PowerShell, right-click Windows PowerShell, and then click Run as administrator

Image source: Wikipedia

Mac Terminal

Click the Launchpad icon in the Dock, type Terminal in the search field, then click Terminal

Image source: Wikipedia

Why you should know how to use the terminal (at least a little)

Install software

Open programs

Run programs directly

Scheduling scripts

You get feedback (error messages)

Version control (Git and GitHub)

Once you know how to use it, it’s more efficient for navigating and for creating/modifying files

Rendering

R console via quarto R package

```{r}
#| eval: false
library(quarto)
quarto_render("document.qmd") # defaults to html
quarto_render("document.qmd", output_format = "pdf")
```
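
Rendering to several formats in one go could look like the sketch below. It only uses `quarto_render()` with the `output_format` argument shown above; the file name "document.qmd" is the placeholder from the slides, and the guards (package installed, file present) are additions so the sketch does nothing when those assumptions do not hold.

```r
# Formats to produce (illustrative selection)
formats <- c("html", "docx")

# Placeholder file name taken from the examples above
qmd_file <- "document.qmd"

# Render only if the quarto package is installed and the file exists
if (requireNamespace("quarto", quietly = TRUE) && file.exists(qmd_file)) {
  for (fmt in formats) {
    quarto::quarto_render(qmd_file, output_format = fmt)
  }
}
```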

Working with a `.qmd`

A `.qmd` is a plain text file

Metadata (YAML)

format: html
engine: knitr

Code

library(dplyr)
mtcars |> 
  group_by(cyl) |> 
  summarize(mean = mean(mpg))

Text

# Heading 1
This is a sentence with some **bold text**, *italic text* and an 
![image](image.png){fig-alt="Alt text for this image"}.

Metadata: YAML

The YAML metadata, or header, influences the final document in different ways. It is placed at the very beginning of the document and is read by each of Pandoc, Quarto and knitr. Along the way, the information that it contains can affect the code, content, and the rendering process.

YAML

title: "My Document"
format: 
  html:
    toc: true
    code-fold: show

See more formats and other YAML metadata options here

Why YAML?

To avoid manually typing out all the options, every time!

terminal

quarto render document.qmd --to html

terminal

quarto render document.qmd --to html -M code-fold:true

terminal

quarto render document.qmd --to html -M code-fold:true -P alpha:0.2 -P ratio:0.3
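
The `-P` flag overrides parameters that the document declares under `params:` in its YAML header; with the knitr engine they are available in R as the list `params`. A sketch of how such parameters might be used, assuming that `alpha` is meant as point transparency and `ratio` as the plot's aspect ratio (the slides do not say what they control):

```r
library(ggplot2)

# Inside a rendered document, `params` is supplied from the YAML header
# (or from -P on the command line). The fallback below exists only so the
# sketch also runs in a plain R session.
if (!exists("params")) {
  params <- list(alpha = 0.2, ratio = 0.3)
}

# Use the parameter values instead of hard-coded numbers
p <- ggplot(airquality, aes(Temp, Ozone)) +
  geom_point(alpha = params$alpha) +      # assumed meaning: transparency
  theme(aspect.ratio = params$ratio)      # assumed meaning: height/width
```

Keeping such values in the header means the same document can be re-rendered with different settings without editing the code.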

Quarto workflow

Clicking the Render button in RStudio calls quarto render in a background job. This prevents Quarto rendering from cluttering up the R console and gives you an easy way to stop it.

Markdown

Quarto uses markdown as its underlying document syntax. Markdown is a plain text format that is designed to be easy to write and, even more importantly, easy to read.

Text Formatting

Markdown Syntax	Output
`*italics* and **bold**`	*italics* and **bold**
`superscript^2^ / subscript~2~`	superscript² / subscript₂
`~~strikethrough~~`	~~strikethrough~~
`` `verbatim code` ``	`verbatim code`

Headings

Markdown Syntax	Output
`# Header 1`	Header 1
`## Header 2`	Header 2
`### Header 3`	Header 3
`#### Header 4`	Header 4

Code

```{r}
#| output-location: column
#| fig-cap: Temperature and ozone level.
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) + 
  geom_point() + 
  geom_smooth(method = "loess")
```

Data importing, manipulating, and exporting

Get data

Download the example dataset
Store data file in “data/raw” folder

Reading data in

```{r}
## Read in data
library(readxl)
library(tidyverse)
# If you created an R project, you can use the direct path "data/raw/data.xlsx"
data <- read_excel("Exercise/data/raw/data.xlsx", col_names = TRUE)
```

Manipulating data

```{r}
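# Name correction: replace "d2priv" with "dapriv2" in the column names
names(data) <- gsub("d2priv", "dapriv2", names(data))

# Own function: reverse-code an item on a 1-5 scale (1 -> 5, 2 -> 4, ..., 5 -> 1)
recode_5 <- function(x) {
  x * (-1) + 6
}

# Apply the function to the selected items
# (across() supersedes mutate_at() in current dplyr)
data_proc <- data %>%
  mutate(across(matches("gattAI1_3|gattAI1_6|gattAI1_8|gattAI1_9|gattAI1_10|gattAI2_5|gattAI2_9|gattAI2_10"), recode_5))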
```
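
To see what the recoding does, it can be tried on a small made-up data frame first. A sketch; `toy` and its values are invented for illustration, only the item-name pattern is borrowed from the code above.

```r
library(dplyr)

# Same function as above
recode_5 <- function(x) {
  x * (-1) + 6
}

# On a 1-5 scale the values are mirrored around 3
recode_5(1:5)

# Made-up data: only gattAI1_3 matches the pattern and gets recoded
toy <- data.frame(gattAI1_1 = c(1, 3, 5),
                  gattAI1_3 = c(1, 2, 5))

toy_proc <- toy %>%
  mutate(across(matches("gattAI1_3"), recode_5))

toy_proc
```

Checking a recode on a handful of known values like this is a cheap way to catch a wrong scale maximum before it affects the whole dataset.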

Exporting data

```{r}
library(haven)
write_sav(data_proc, "Exercise/data/processed/data_proc.sav") # if you created an R project you can use the direct path "data/processed/data_proc.sav"
```
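
A quick round trip can confirm that an export worked as intended. A sketch using a made-up data frame and a temporary file (both assumptions for illustration), with `write_sav()` and `read_sav()` from haven:

```r
library(haven)

# Made-up data for illustration
toy <- data.frame(id = c(1, 2, 3), score = c(2.5, 3, 4))

# Write to a temporary .sav file and read it back in
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
toy_back <- read_sav(tmp)

# The values should survive the round trip
all.equal(toy$score, as.numeric(toy_back$score))
```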

The R folder

Storing and sourcing custom functions from the R folder

You may store your custom functions in a separate file (e.g., “R/custom-functions.R”) that you may source in your report document (“report.qmd”)

In R/custom-functions.R:

```{r}
# Own function
recode_5 <- function(x) {               
  x * (-1)+6
}
```

Reading in your custom functions:

```{r}
source("Exercise/R/custom-functions.R")
```

Tables and visualisations

Reading in data

```{r}
data <- haven::read_sav("Exercise/data/processed/data_proc.sav")
```

Creating a correlation table

Creating composites

```{r}
# Keep numeric variables, then select the relevant scales
data <- data[ , purrr::map_lgl(data, is.numeric)] %>%
  select(matches("gattAI1|soctechblind|trust1|anxty1|SocInf1|Age"))

# Split the columns into a list of data frames, one per shared prefix
# (all characters before the first underscore)
comp_split <- data %>% sjlabelled::remove_all_labels(.) %>%
  split.default(sub("_.*", "", names(data)))

# Row-wise mean of each data frame in `comp_split`;
# `comp` is a list with one numeric vector of row means per scale
comp <- purrr::map(comp_split, ~ rowMeans(.x, na.rm = TRUE))

# Bind all elements of `comp` into a single data frame, `comp_df`
comp_df <- do.call("cbind", comp) %>% as.data.frame(.)
```
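
The split-and-average logic is easier to follow on a tiny example. The sketch below uses made-up values and base R only (`lapply()` in place of `purrr::map()`); the column names merely imitate the naming scheme of the real data.

```r
# Made-up items: two for "trust1", two for "anxty1"
toy <- data.frame(trust1_1 = c(1, 2, 3),
                  trust1_2 = c(3, 4, NA),
                  anxty1_1 = c(5, 5, 1),
                  anxty1_2 = c(3, 1, 1))

# Scale prefix of each column: everything before the first underscore
prefixes <- sub("_.*", "", names(toy))
prefixes

# One data frame per prefix
toy_split <- split.default(toy, prefixes)

# Row means per scale, ignoring missing values
toy_comp <- lapply(toy_split, rowMeans, na.rm = TRUE)

# One column per composite
toy_comp_df <- as.data.frame(do.call("cbind", toy_comp))
toy_comp_df
```

Note that with `na.rm = TRUE` the third row of `trust1` is the mean of the single available item, so composites can rest on different numbers of items per person.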

Creating the correlation table

```{r}
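# corstars() is not part of base R; it must be defined or sourced first
# (e.g., from R/custom-functions.R)
cor_tab <- corstars(comp_df, removeTriangle = "upper")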
cor_tab
```

                   Age   SocInf1    anxty1   gattAI1 soctechblind
Age                                                              
SocInf1       0.12                                               
anxty1       -0.03     -0.11                                     
gattAI1       0.00      0.17*     0.16*                          
soctechblind  0.09     -0.03      0.30***   0.30***              
trust1       -0.06      0.57***  -0.21**    0.12        -0.22**
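
The definition of `corstars()` is not shown in these slides. As a rough stand-in, a lower-triangle correlation table with significance stars can be sketched in base R. `cor_with_stars()` is a hypothetical helper written for this illustration, not the function used above, and the star thresholds (.05, .01, .001) are the common convention, assumed here.

```r
# Hypothetical helper: Pearson correlations with significance stars,
# lower triangle only
cor_with_stars <- function(df, digits = 2) {
  vars <- names(df)
  k <- length(vars)
  out <- matrix("", nrow = k, ncol = k, dimnames = list(vars, vars))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      if (j < i) {
        test <- cor.test(df[[i]], df[[j]])
        p <- test$p.value
        stars <- if (p < .001) "***" else if (p < .01) "**" else if (p < .05) "*" else ""
        out[i, j] <- paste0(formatC(test$estimate, format = "f", digits = digits), stars)
      }
    }
  }
  as.data.frame(out)
}

# Example with a built-in dataset
res <- cor_with_stars(mtcars[, c("mpg", "wt", "hp")])
res
```

Applied to the workshop data, the call would be `cor_with_stars(comp_df)`; the layout will differ somewhat from the table above.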

Creating correlation plot

```{r}
cor_matrix <- cor(comp_df[1:6])

corrplot::corrplot(cor_matrix, method="color", type="upper", order="hclust", 
         addCoef.col = "black", # Add correlation coefficient on the plot
         tl.col="black", # Text label color
         tl.srt=90, # Text label rotation
         title="Correlation matrix", mar=c(0,0,1,0))
```

Scatterplot

```{r}
scatter <- ggplot(comp_df, aes(x=SocInf1, y=trust1, color=Age)) +
  geom_point() +
  labs(x="Social influence", y="Trust", color="Age") +
  theme_minimal() +
  ggtitle("Scatterplot of social influence and trust colored by age")
scatter
```

Saving scatterplot

```{r}
ggsave(filename="Exercise/output/figs/scatter.png", plot=scatter)
```

Referencing, style and the config folder

Export the bibliography

Select a csl style

Zotero Style Repository

---
bibliography: "/config/refs.bib"
csl: "/config/apa.csl"
---

Store a template word document

format: 
  docx:
    reference-doc: "/config/template_apa.docx"

Folder structure overview

Current folder structure

Back to “report.qmd”

The visual editor and inserting text and citations

Questions and outlook

Questions?

Outlook

Tomorrow we will look into data analysis publication for true reproducibility

You will get the chance to work on your own data analysis project or use the material provided

Thank you and see you tomorrow!
