Introduction to reproducible data analysis with R and Quarto - Day II

KLI Seminar 2023

Anne-Kathrin Kleine

Schedule

DAY 2

09.00-09.30: Recap from yesterday, tidystats, and the groundhog package


09.30-10.30: Publishing Reproducible Data Analysis Scripts

  • Sharing your scripts and data
  • Publishing through GitHub, RPubs; integration with OSF

11.00-12.30: Hands-on Practice (with break)

  • Structure, code, and version control: Work on your data analysis project

12.30-13.00: Closing Remarks

  • Recap of the workshop
  • Q&A session

Recap, dependencies, and the groundhog package

Packages and package dependencies

What happened and how to avoid it

  • The custom_functions.R file was pretty massive and needed many R packages that many of you had not installed
  • One way around that: deleting all functions that are not needed and installing the relevant packages for the remaining ones (what we did yesterday)

Another solution: using the dependencies() function from the renv package

```{r}
#| eval: false
library(renv)

# Scan the R files under Exercise/R for the packages they use
deps <- dependencies(path = "Exercise/R")

# Extract package names
pkgs <- deps$Package

# Remove duplicates, just in case
pkgs <- unique(pkgs)

# Install packages
install.packages(pkgs)
```

… however, that does not work if you are using different versions of packages!

The groundhog package

  • Different versions of a package may produce different results or differ in functionality
  • When sharing scripts/code or when collaborating with others on a project, using the same version of a package becomes crucial
  • The groundhog package in R is used for managing the package dependencies in an R script with specific requirements for package versions
  • The groundhog package allows users to load packages from a specific date in the past
  • Instead of loading the most recent (or installed) version of a package with the library() function, users can load a version of the package that was current as of a specific date

The groundhog package

  • With groundhog, the only thing you need to change to make your R code reproducible is:
Instead of: library(pkg)
Do this: groundhog.library(pkg, date)

The groundhog package

```{r}
#| eval: false
library("groundhog")
groundhog.library(pkgs, "2022-12-01")
```

For your current scripts

```{r}
#| eval: false
library("groundhog")
groundhog.library("
    library(pkgA)
    library(pkgB)
    library(pkgC)   ",    date)
```

A pretty insightful article on the groundhog package

Tidystats

Learn how to use it here

Version control

What is version control?

  • Version control, also known as source control, is the practice of tracking and managing changes to software code
  • Version control systems are software tools that help software teams manage changes to source code over time

The benefits of version control

  • A complete long-term change history of every file
  • Branching and merging
  • Ability to make your work more reproducible
  • Collaboration through platforms hosting versions of your code

What is Git?

  • Git is software that keeps track of versions of a set of files
  • It is local to you; the records are kept on your computer

What is GitHub?

  • A hosting service that can keep the records
  • It is remote to you, like Dropbox
  • GitHub is specifically structured to keep records with Git

Getting started with Git

1. Check that Git is installed

  • In the terminal (in RStudio), type
terminal
which git


  • Check your git version
terminal
git --version
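
The two checks above can be combined into one small helper. This is a minimal sketch; `check_tool` is a hypothetical function name chosen for illustration and is not part of Git or RStudio.

```shell
# Report whether a command-line tool is available on the PATH.
# check_tool is a hypothetical helper name used only for this example.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found; install it before continuing"
  fi
}

check_tool git
```

If the second message appears, Git needs to be installed before the next steps will work.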


2. Generate a personal access token (PAT) on GitHub

3. Set credentials from within RStudio

R console
gitcreds::gitcreds_set()


4. Tell Git who you are

terminal
git config --global user.name "jack.bel" # use your GitHub username instead
git config --global user.email jack.bel@gmail.com # use the email address of your GitHub account instead


Or:

R console
install.packages("usethis")
usethis::use_git_config(user.name="Jane Doe", user.email="jane@example.org")
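
Whichever route you take, you can read the values back to confirm what Git will record in your commits. The sketch below is an illustration under two assumptions: it works in a throwaway repository, and it sets the identity without `--global` so that your real settings stay untouched. The name and email are the placeholders from the slide.

```shell
# Work in a throwaway repository so real settings stay untouched.
demo_repo=$(mktemp -d)
git init -q "$demo_repo"

# Without --global, the values apply to this repository only.
git -C "$demo_repo" config user.name "jack.bel"
git -C "$demo_repo" config user.email "jack.bel@gmail.com"

# Read the values back to confirm what Git will attach to commits.
configured_name=$(git -C "$demo_repo" config --get user.name)
configured_email=$(git -C "$demo_repo" config --get user.email)
echo "Commits will be attributed to: $configured_name <$configured_email>"
```

For your actual setup, `git config --global --get user.name` shows the value stored for all repositories on your machine.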

GitHub Knowledge Base


| Commands | Operations |
|---|---|
| `git init <directory>` | Create an empty Git repo in the specified directory |
| `git clone <repository>` | Clone a repository onto your local machine |
| `git config user.name <username>` | Define the author name to be used for all commits in the current repository |
| `git add <directory>` | Stage all changes in `<directory>` for the next commit |

GitHub Knowledge Base

| Commands | Operations |
|---|---|
| `git commit -m <"message">` | Commit the staged snapshot, but instead of launching a text editor, use `<"message">` as the commit message |
| `git status` | List which files are staged, unstaged, and untracked |
| `git log` | Display the entire commit history using the default format |
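
To see how these commands fit together, here is a minimal local sketch. It assumes a throwaway folder and a placeholder identity (Jane Doe), and it captures the output of `git status --short` before and after staging so the change in state is visible.

```shell
# Throwaway project folder (hypothetical; any folder works).
project=$(mktemp -d)
cd "$project"

git init -q                                  # create an empty repository
git config user.name "Jane Doe"              # placeholder identity, this repo only
git config user.email "jane@example.org"

echo "# Demo project" > README.md
untracked=$(git status --short)              # file exists but is not tracked yet
echo "$untracked"

git add README.md                            # stage the file for the next commit
staged=$(git status --short)                 # file is now staged
echo "$staged"

git commit -q -m "Add README"                # record the staged snapshot
git log --oneline                            # one line per commit
commit_count=$(git rev-list --count HEAD)
```

The short status codes are `??` for untracked files and `A` for newly staged files.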

GitHub Knowledge Base

| Commands | Operations |
|---|---|
| `git pull <remote>` | Fetch the specified remote's copy of the current branch and immediately merge it into the local copy |
| `git push <remote> <branch>` | Upload local repository content to a remote repository |
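
Push and pull are easiest to understand with two collaborators. The following sketch is an offline rehearsal under stated assumptions: a local bare repository stands in for GitHub, and the collaborators (Anna and Ben) and file names are made up for illustration. The branch name is detected rather than hard-coded because it may be `master` or `main` depending on your Git configuration.

```shell
work=$(mktemp -d)
git init -q --bare "$work/remote.git"        # local stand-in for a GitHub repo

# Anna creates a commit and pushes it.
git init -q "$work/anna"
cd "$work/anna"
git config user.name "Anna"
git config user.email "anna@example.org"
branch=$(git symbolic-ref --short HEAD)      # "master" or "main"
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git remote add origin "$work/remote.git"
git push -q origin "$branch"

# Ben clones the repository and gets Anna's first commit.
git clone -q "$work/remote.git" "$work/ben"

# Anna pushes a second change.
echo "second line" >> notes.txt
git commit -q -am "Extend notes"
git push -q origin "$branch"

# Ben pulls to bring his copy up to date.
cd "$work/ben"
git pull -q --ff-only origin "$branch"
cat notes.txt
```

After the pull, Ben's `notes.txt` contains both lines and his history matches Anna's.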

GitHub Knowledge Base

Pull requests
  • Proposed changes to a repository submitted by a user and accepted or rejected by a repository’s collaborators

  • Pull requests each have their own discussion forum

Issues
  • Suggested improvements, tasks or questions related to the repository

  • Can be created by anyone (for public repositories), and are moderated by repository collaborators

  • Each issue contains its own discussion thread

The version control workflow with Git and GitHub

On GitHub

  1. Go to GitHub and create a new repository

  2. Fill in some info, create a public repository

  3. Follow the steps in Option 1: “…create a new repository on the command line”

In the terminal

  1. In the terminal 📱, navigate to your Quarto project folder:
terminal
cd project_folder


  2. Initialize a Git repo on your local machine:
terminal
git init


  3. Create content you can then add in the next step (e.g., a README file):
terminal
touch README.md

  4. Stage all the content in that folder to be added:
terminal
git add .


  5. Commit the staged content:
terminal
git commit -m "add empty readme"

  6. Connect the local repo to the remote repo. Substitute the link with your repo URL!
terminal
git remote add origin https://github.com/AnneOkk/testrepo.git


  7. Push all the content from Git to GitHub (use main instead of master if that is your branch name):
terminal
git push origin master


🎈 The pushed files should appear in your GitHub repository 🎈
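
If you want to rehearse the terminal steps without touching GitHub, the sketch below runs the same sequence against a local bare repository. That stand-in remote, the temporary folders, and the placeholder identity are assumptions made for illustration. The last lines compare the local and remote commit IDs to confirm the push arrived.

```shell
# A local bare repository plays the role of the empty GitHub repository.
remote=$(mktemp -d)/testrepo.git
git init -q --bare "$remote"

project_folder=$(mktemp -d)
cd "$project_folder"                         # 1. navigate to the project folder
git init -q                                  # 2. initialize a local repository
git config user.name "Jane Doe"              #    placeholder identity, this repo only
git config user.email "jane@example.org"
touch README.md                              # 3. create content
git add .                                    # 4. stage it
git commit -q -m "add empty readme"          # 5. commit the staged content
git remote add origin "$remote"              # 6. connect local and remote
branch=$(git symbolic-ref --short HEAD)      #    "master" or "main"
git push -q origin "$branch"                 # 7. push

# Confirm that the remote now holds the same commit as the local repository.
local_head=$(git rev-parse HEAD)
remote_head=$(git ls-remote origin "refs/heads/$branch" | cut -f1)
echo "local:  $local_head"
echo "remote: $remote_head"
```

Detecting the branch name with `git symbolic-ref` avoids a failed push when your default branch is not called `master`.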

Your Turn (45 min)

Poll: What would you like to focus on in the exercise?

1) Focus on R Project structure

2) Focus on code improvement

3) Focus on version control with Git and GitHub

Your Turn (45 min)

  1. [Get Git and GitHub running]

  2. (Re)structure your project based on yesterday’s instructions

  3. Connect your local R project folder with a GitHub repository

  4. Change some of the content in R, save, and then push the changes to GitHub

terminal
git add .
git commit -m "meaningful commit message that describes the change(s)"
git push origin master

Connecting OSF to GitHub


  1. Create your OSF project

  2. Enable GitHub in Add-ons

  3. Import GitHub Account

  4. Select Repo

🎊 Yay, you’re all set to connect your GitHub content to OSF! 🎊

Extra: .gitignore

Create a .gitignore file in your project folder

terminal
touch .gitignore

Inside the .gitignore file

.Rproj.user
.Rhistory
.RData

# Data preparation folder
/Data_prep

# Some folders
/Manuscript_cache
/Tables
/Manuscript_files
/OLD

# Manuscript file
Manuscript.docx
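
To check that your patterns do what you expect, `git check-ignore` reports whether a given path is ignored. The sketch below is an illustration that uses a throwaway repository and a subset of the patterns above. The file names `raw.csv` and `analysis.R` and the helper `ignore_status` are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# A subset of the patterns from the .gitignore shown above.
printf '%s\n' '.Rhistory' '/Data_prep' 'Manuscript.docx' > .gitignore

mkdir Data_prep
touch Data_prep/raw.csv Manuscript.docx analysis.R

# git check-ignore -q exits with status 0 if the path is ignored, 1 if it is not.
# ignore_status is a hypothetical helper name used only for this example.
ignore_status() {
  if git check-ignore -q "$1"; then
    echo "ignored"
  else
    echo "not ignored"
  fi
}

for path in Data_prep/raw.csv Manuscript.docx analysis.R; do
  echo "$path: $(ignore_status "$path")"
done
```

Running `git check-ignore -v <path>` instead also prints which line of which file caused the match.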

…And there is so much more!

Find me at:

@AnKaKleine

@AnneOkk

http://annekathrinkleine.com/

  • You will get the chance to work on your own data analysis project

  • For this, you will have ~ 30 minutes to prepare the folder structure tomorrow

  • You may use either the example material or work on your own projects