Introduction to reproducible data analysis with R and Quarto

KLI Seminar 2023

Anne-Kathrin Kleine

Schedule

DAY 1

9.00-9.30: Introduction

  • Welcome and overview
  • Reproducibility and open science

9.30-10.30: Getting Started with R and Quarto

  • Basics of R programming
  • Best practices for organizing your code and materials

11.00-12.00: Creating data analysis projects in R and Quarto

  • Creating a data analysis project
  • Integrating code, text, and output within a Quarto document

12.15-13.00: Working with data

  • Working with data in R: importing, manipulating, and exporting data
  • Data visualization and data analysis

DAY 2

9.00-10.00: Welcome and Hands-on Practice PART I

  • Independent practice: Working on your data analysis project (with support from workshop facilitator)

10.10-11.00: Publishing Reproducible Data Analysis Scripts

  • Publishing through GitHub and RPubs; integration with OSF

11.30-12.30: Hands-on Practice PART II

  • Continue working on your data analysis projects
  • Publishing your projects

12.30-13.00: Closing Remarks

  • Recap of the workshop
  • Q&A session

Notes

  • If you want to say something, please raise your hand. If I don’t see it, please unmute yourself and start talking.
  • Please participate actively throughout the course with comments, questions (!), and answers
  • Mute yourself when you are not speaking
  • Small technical challenges may be addressed directly (e.g., “Where do I have to click to open RStudio?”)
  • Bigger technical challenges may be addressed either after the workshop today or before the workshop tomorrow (e.g., “My RStudio does not work”)

Workshop Requirements

  • Open RStudio

Reproducibility and open science

Reproducibility and replicability

Reproducibility

Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.

Reproducibility emphasizes re-using original data and methods for verification, while replicability focuses on achieving consistent results with repeats of the experiment under similar conditions but using new data.

From The Open Science Training Handbook

Elements of Open Science

Source: The six core principles of Open Science

The benefits of Open Science for researchers

Source: The benefits of Open Science

The role of R and Quarto (RMarkdown) for Open Science

R and Quarto are open-source

  • ensuring accessibility, collaboration, customizability, longevity

Producing shareable and collaborative high-quality documents that integrate text and R code for analysis and visualization within one document

  • minimize “false” results in output documents because nothing needs to be copy-pasted
  • “coding” of research papers (easy integration of later modifications)

Change Your Mental Model

Source

A blank Word document

Output

A blank Word document

Source

---
title: "ggplot2 demo"
author: "Norah Jones"
date: "5/22/2021"
format: 
  html:
    fig-width: 8
    fig-height: 4
    code-fold: true
---

## Air Quality

@fig-airquality further explores the impact of temperature 
  on ozone level.

```{r}
#| label: fig-airquality
#| fig-cap: Temperature and ozone level.
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) + 
  geom_point() + 
  geom_smooth(method = "loess")
```

Output

Getting Started with R


Creating a data analysis project

Create a folder for your R project

Create a Quarto document (report.qmd)

Install relevant packages

```{r}
#| eval: false
pkg_list <- c("tidyverse", "haven", "flextable", "broom", "report", "effectsize", "rempsyc")
install.packages(pkg_list)
```
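
Re-running `install.packages()` reinstalls packages that are already present. A minimal sketch, assuming you only want to install what is missing; `installed` and `missing_pkgs` are illustrative names, not part of the workshop material:

```r
# Packages used in this workshop (same list as above)
pkg_list <- c("tidyverse", "haven", "flextable", "broom", "report", "effectsize", "rempsyc")

# Names of all packages currently installed in your R library
installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(pkg_list, installed)
missing_pkgs

# Uncomment to install only the missing packages:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```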

Create an R project

R projects

When a project is opened within RStudio the following actions are taken:

  • A new R session is started
  • The .Rprofile, .RData, and .Rhistory files in the project’s main directory are loaded (if present)
  • The current working directory is set to the project directory
  • Previously edited source documents are restored into editor tabs
  • Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed
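
Because the working directory is set to the project directory, files can be addressed with paths relative to the project root. A small sketch; the folder names follow the paths used in the data import code of this workshop, and `raw_path` is an illustrative name:

```r
# Inside an open RStudio project, this returns the project directory
getwd()

# Build a path relative to the project root; file.path() uses "/" as separator
raw_path <- file.path("data", "raw", "data.xlsx")
raw_path

# Check whether the file exists before trying to read it
file.exists(raw_path)
```

Relative paths like this keep working when the whole project folder is moved or shared, which is one reason to work in a project.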

Working with Quarto

How does Quarto work?

A diagram of how a QMD is turned into output formats via knitr and pandoc

So what is Quarto?

Quarto is a command line interface (CLI) that renders plain text formats (.qmd, .rmd, .md) into static PDF/Word/HTML reports, books, websites, presentations and more

One install, “Batteries included”

Rendering

  1. Render button

A screenshot of the render button in RStudio

Rendering

  2. Terminal shell via quarto render
terminal
quarto render document.qmd # defaults to html
quarto render document.qmd --to pdf
quarto render document.qmd --to docx

Quick excursus: The Terminal

Windows PowerShell

  • Click Start, type PowerShell, right-click Windows PowerShell, and then click Run as administrator

Image source: Wikipedia

Mac Terminal

  • Click the Launchpad icon in the Dock, type Terminal in the search field, then click Terminal

Image source: Wikipedia

Why you should know how to use the terminal (at least a little)

  • Install software
  • Open programs
  • Run programs directly
  • Schedule scripts
  • You get feedback (error messages)
  • Version control (Git and GitHub)
  • Once you know how to use it, it’s more efficient for navigating and for creating/modifying files

Rendering

  3. R console via quarto R package
```{r}
#| eval: false
library(quarto)
quarto_render("document.qmd") # defaults to html
quarto_render("document.qmd", output_format = "pdf")
```

Working with a .qmd

A .qmd is a plain text file

  • Metadata (YAML)
format: html
engine: knitr
  • Code
library(dplyr)
mtcars |> 
  group_by(cyl) |> 
  summarize(mean = mean(mpg))
  • Text
# Heading 1
This is a sentence with some **bold text**, *italic text* and an 
![image](image.png){fig-alt="Alt text for this image"}.
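
Because a .qmd is plain text, you can create and inspect one with base R alone. The sketch below writes a minimal document to a temporary file and separates the YAML header from the body; the file contents and all variable names are made up for illustration:

```r
# Lines of a minimal .qmd: YAML header between the two "---" lines, then text
qmd_lines <- c(
  "---",
  "title: \"My Document\"",
  "format: html",
  "---",
  "",
  "# Heading 1",
  "",
  "This is a sentence with some **bold text**."
)

# Write the document to a temporary file and read it back as plain text
qmd_file <- tempfile(fileext = ".qmd")
writeLines(qmd_lines, qmd_file)
lines_read <- readLines(qmd_file)

# The first two "---" lines delimit the YAML header
delims <- which(lines_read == "---")[1:2]
yaml_header <- lines_read[(delims[1] + 1):(delims[2] - 1)]
body <- lines_read[-seq_len(delims[2])]

yaml_header
body
```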

Metadata: YAML

The YAML metadata, or header, influences the final document in different ways. It is placed at the very beginning of the document and is read by each of Pandoc, Quarto and knitr. Along the way, the information that it contains can affect the code, content, and the rendering process.

YAML

title: "My Document"
format: 
  html:
    toc: true
    code-fold: show

See more formats and other YAML metadata options here

Why YAML?

To avoid manually typing out all the options, every time!

terminal
quarto render document.qmd --to html


terminal
quarto render document.qmd --to html -M code-fold:true


terminal
quarto render document.qmd --to html -M code-fold:true -P alpha:0.2 -P ratio:0.3
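
If you do need to pass parameters on the command line from a script, the arguments can be assembled in R. A minimal sketch, assuming the `quarto` CLI is on your PATH; `build_render_args()` is a hypothetical helper, not part of Quarto or the quarto R package:

```r
# Assemble the arguments for `quarto render` from a named list of parameters
build_render_args <- function(input, to = "html", params = list()) {
  # Each parameter becomes a pair: "-P", "name:value"
  param_args <- unlist(lapply(names(params), function(nm) {
    c("-P", paste0(nm, ":", params[[nm]]))
  }))
  c("render", input, "--to", to, param_args)
}

render_args <- build_render_args("document.qmd", params = list(alpha = 0.2, ratio = 0.3))
render_args

# Uncomment to run the command (requires Quarto and an existing document.qmd):
# system2("quarto", render_args)
```

For options that rarely change, storing them in the YAML header remains the simpler choice.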

Quarto workflow

Clicking the Render button in RStudio calls `quarto render` in a background job. This keeps the rendering output from cluttering up the R console and gives you an easy way to stop it.

Markdown

Quarto uses markdown as its underlying document syntax. Markdown is a plain text format that is designed to be easy to write and, even more importantly, easy to read.

Text Formatting

Markdown Syntax                  Output
*italics* and **bold**           italics and bold
superscript^2^ / subscript~2~    superscript² / subscript₂
~~strikethrough~~                strikethrough (struck through)
`verbatim code`                  verbatim code (monospaced)

Headings

Markdown Syntax    Output
# Header 1         Header 1
## Header 2        Header 2
### Header 3       Header 3
#### Header 4      Header 4

Code

```{r}
#| output-location: column
#| fig-cap: Temperature and ozone level.
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) + 
  geom_point() + 
  geom_smooth(method = "loess")
```

Temperature and ozone level.

Data importing, manipulating, and exporting

Get data

Reading data in

```{r}
## Read in data
library(readxl)
library(tidyverse)
data <- read_excel("Exercise/data/raw/data.xlsx", col_names = TRUE) # if you created an R project you can use the relative path "data/raw/data.xlsx"
```

Manipulating data

```{r}
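# Name correction: replace "d2priv" with "dapriv2" in the column names
names(data) <- gsub("d2priv", "dapriv2", names(data))

# Own function: reverse-code an item scored 1 to 5 (1 becomes 5, 2 becomes 4, ...)
recode_5 <- function(x) {
  6 - x
}

# Apply the function to the reverse-keyed items
# (across() supersedes mutate_at() as of dplyr 1.0.0)
data_proc <- data %>%
  mutate(across(matches("gattAI1_3|gattAI1_6|gattAI1_8|gattAI1_9|gattAI1_10|gattAI2_5|gattAI2_9|gattAI2_10"), recode_5))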
```
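
To see what the recoding does, it helps to apply it to every possible answer of a 5-point item. The sketch below uses a generalized version; `reverse_code()` and its `low`/`high` arguments are illustrative and not part of the workshop material:

```r
# Reverse-code a Likert-type item: the lowest value becomes the highest and vice versa
reverse_code <- function(x, low = 1, high = 5) {
  (low + high) - x
}

# All possible answers on a 5-point scale; a missing answer stays missing
answers <- c(1, 2, 3, 4, 5, NA)
reversed <- reverse_code(answers)
reversed

# With the defaults this equals 6 - x, the same result as x * (-1) + 6
```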

Exporting data

```{r}
library(haven)
write_sav(data_proc, "Exercise/data/processed/data_proc.sav") # if you created an R project you can use the relative path "data/processed/data_proc.sav"
```

The R folder

Storing and sourcing custom functions from the R folder

  • You may store your custom functions in a separate file (e.g., “R/custom-functions.R”) that you may source in your report document (“report.qmd”)

In R/custom-functions.R:

```{r}
# Own function: reverse-code an item scored 1 to 5
recode_5 <- function(x) {
  6 - x
}
```

Reading in your custom functions:

```{r}
source("Exercise/R/custom-functions.R")
```
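
How `source()` makes a function available can be tried out without touching your project. A minimal sketch that first writes the function to a temporary file; the temporary file only stands in for R/custom-functions.R, and `fun_file` is an illustrative name:

```r
# Write a function definition to a temporary .R file
fun_file <- tempfile(fileext = ".R")
writeLines(c(
  "recode_5 <- function(x) {",
  "  6 - x",
  "}"
), fun_file)

# Run the file; afterwards recode_5() can be called like any other function
source(fun_file)
recode_5(c(1, 3, 5))
```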

Tables and visualisations

Reading in data

```{r}
data <- haven::read_sav("Exercise/data/processed/data_proc.sav")
```

Creating a correlation table

Creating composites

```{r}
# Keep the numeric variables, then select the relevant scales
data <- data[ , purrr::map_lgl(data, is.numeric)] %>%
  select(matches("gattAI1|soctechblind|trust1|anxty1|SocInf1|Age"))

# Split the columns into a list of data frames, one per shared prefix
# (all characters before the first underscore)
comp_split <- data %>%
  sjlabelled::remove_all_labels() %>%
  split.default(sub("_.*", "", names(data)))

# Row-wise mean of each data frame: one numeric vector (composite) per scale
comp <- purrr::map(comp_split, ~ rowMeans(.x, na.rm = TRUE))

# Bind the composites column-wise into a single data frame
comp_df <- do.call("cbind", comp) %>% as.data.frame()
```
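
The split-and-average logic is easier to follow on a tiny example. The sketch below repeats the same steps in base R on made-up data; the column names and values in `toy` are invented for illustration:

```r
# Toy data: two items per scale, named <scale>_<item number>
toy <- data.frame(
  trust_1 = c(1, 2, 3),
  trust_2 = c(3, 4, NA),
  anx_1   = c(5, 5, 1),
  anx_2   = c(1, 3, 3)
)

# Scale name = everything before the first underscore
prefix <- sub("_.*", "", names(toy))

# One data frame per scale, holding only that scale's columns
blocks <- split.default(toy, prefix)

# Row-wise mean per scale, ignoring missing items
composites <- as.data.frame(lapply(blocks, rowMeans, na.rm = TRUE))
composites
```

In the third row, `trust_2` is missing, so the trust composite is based on `trust_1` alone. That is what `na.rm = TRUE` does, and whether it is appropriate depends on how you want to treat missing items.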

Creating the correlation table

```{r}
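# corstars() is not provided by the packages loaded above; it is assumed here
# to be a custom helper (e.g., sourced from R/custom-functions.R)
cor_tab <- corstars(comp_df, removeTriangle = "upper")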
cor_tab
```
                   Age   SocInf1    anxty1   gattAI1 soctechblind
Age                                                              
SocInf1       0.12                                               
anxty1       -0.03     -0.11                                     
gattAI1       0.00      0.17*     0.16*                          
soctechblind  0.09     -0.03      0.30***   0.30***              
trust1       -0.06      0.57***  -0.21**    0.12        -0.22**  

Creating correlation plot

```{r}
cor_matrix <- cor(comp_df[1:6])

corrplot::corrplot(cor_matrix, method="color", type="upper", order="hclust", 
         addCoef.col = "black", # Add correlation coefficient on the plot
         tl.col="black", # Text label color
         tl.srt=90, # Text label rotation
         title="Correlation matrix", mar=c(0,0,1,0))
```

Scatterplot

```{r}
scatter <- ggplot(comp_df, aes(x=SocInf1, y=trust1, color=Age)) +
  geom_point() +
  labs(x="Social influence", y="Trust", color="Age") +
  theme_minimal() +
  ggtitle("Scatterplot of social influence and trust colored by age")
scatter
```

Saving scatterplot

```{r}
ggsave(filename="Exercise/output/figs/scatter.png", plot=scatter)
```

Referencing, style and the config folder

Export the bibliography

Select a csl style

Zotero Style Repository

---
bibliography: "/config/refs.bib"
csl: "/config/apa.csl"
---

Store a template word document

format: 
  docx:
    reference-doc: "/config/template_apa.docx"

Folder structure overview

Current folder structure
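
The folders used in this workshop can also be created from R. A sketch that builds the structure inside a temporary directory so that nothing in your own project is changed; the folder names are taken from the paths used in the workshop code, and `project_root` and `folders` are illustrative names:

```r
# Use a temporary directory as a stand-in for the project folder
project_root <- file.path(tempdir(), "Exercise")

# Sub-folders referenced in the workshop code
folders <- c("data/raw", "data/processed", "R", "output/figs", "config")

# Create each folder, including missing parent folders
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the resulting folder structure
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```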

Back to “report.qmd”

The visual editor and inserting text and citations

Questions and outlook

Questions?

Outlook

  • Tomorrow we will look into publishing data analyses for true reproducibility
  • You will get the chance to work on your own data analysis project or use the material provided

Thank you and see you tomorrow!