This lesson is being piloted (Beta version)

Variant Analysis Workshop: Setup

Before the workshop starts

  1. Get the latest version of R.
  2. Get the latest version of Bioconductor.
  3. Install the specific Bioconductor packages we will be using.
  4. Download the large genomics files that we will use.

Software

This workshop will be conducted in R. You should have the most recent version of R installed from CRAN. I highly recommend also installing the free version of RStudio Desktop to make R easier to work with.

You should also install the latest version of Bioconductor. As of October 2020, you can install Bioconductor with the following commands at the R prompt:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.12")

The version number changes every six months, so substitute the current release number if a newer one is available. If asked about updating packages, I recommend selecting ‘a’ to update all, and declining to compile packages from source. If some packages fail to install, close RStudio, reopen it, and run the last install command again. Because of the many dependencies, it can take a few rounds even if you are doing everything correctly.
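If you are unsure which versions you ended up with, a quick check like the one below can help. It is only a sketch: it prints the running R version and, if BiocManager is already installed, the Bioconductor version that BiocManager reports. The variable name `r_version` is just for illustration.

```r
# Show the version of R that is currently running
r_version <- getRversion()
print(r_version)

# Show the Bioconductor version, but only if BiocManager is already installed
if (requireNamespace("BiocManager", quietly = TRUE)) {
  print(BiocManager::version())
} else {
  message("BiocManager is not installed yet; run the install commands above first.")
}
```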

Next, install the Bioconductor packages that we will use in the workshop:

BiocManager::install(c("VariantAnnotation", "snpStats", "GenomicFeatures",
                       "airway", "scater"))

From CRAN, we’ll install magrittr, which will let us use the %>% (pipe) symbol to make some code more readable, as well as ggplot2 for making some graphs, and dplyr for some data frame manipulation.

install.packages(c("magrittr", "ggplot2", "dplyr"))
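To give a feel for what the pipe does, here is a small illustrative example that is not taken from the workshop material. It computes the same value twice, once with nested function calls and once with %>%. The piped version only runs if magrittr is installed, and the variable names are made up for this sketch.

```r
# The same calculation written two ways: nested calls versus a pipe
values <- c(1, 4, 9)

# Nested form, read from the inside out
nested_result <- sum(sqrt(values))

# Piped form, read left to right (only runs if magrittr is installed)
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  piped_result <- values %>% sqrt() %>% sum()
  print(identical(nested_result, piped_result))
}
```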

To make sure the installation worked, load the packages. Each call should finish without an error:

library(VariantAnnotation)
library(snpStats)
library(GenomicFeatures)
library(scater)
library(dplyr)
library(ggplot2)
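If one of these calls fails, it can be handy to see every missing package at once instead of finding them one error at a time. The following is a minimal sketch using base R only. The package list is copied from the install commands in this lesson, and the variable names (`needed`, `is_installed`, `missing_pkgs`) are made up for this example.

```r
# Packages named in the install commands for this workshop
needed <- c("VariantAnnotation", "snpStats", "GenomicFeatures", "airway",
            "scater", "magrittr", "ggplot2", "dplyr")

# requireNamespace() returns FALSE instead of stopping when a package is absent
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- needed[!is_installed]
if (length(missing_pkgs) == 0) {
  message("All workshop packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Anything listed as missing can be passed back to BiocManager::install(), which installs CRAN packages as well as Bioconductor ones.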

Please complete all of the above installations at least a few hours before the workshop begins, and email the instructor if you encounter any errors.

Make a project in RStudio

In RStudio, go to File -> New Project -> New Directory -> New Project. Name the project variant_analysis_2020 or something that makes sense to you, and put it in a folder of your choosing on your computer. Once that is done, in the lower right pane click “New Folder” and make a folder called “data”.
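If you prefer the R console to the Files pane, the folder can also be created with base R. This sketch assumes that your working directory is the project directory, which is normally the case when the project is open in RStudio.

```r
# Create the "data" folder inside the current working directory if needed
if (!dir.exists("data")) {
  dir.create("data")
}

# Confirm that the folder now exists
data_dir_ok <- dir.exists("data")
print(data_dir_ok)
```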

If you have never made an RStudio project before and are confused, don’t worry: I will give a quick demonstration at the beginning of the workshop. Please still download the data files below, and save them somewhere that you can quickly find them.

Data

We will work with some public data from maize. There are four files that you need to download and save to the data folder in your project directory.

For the December 2020 workshop

In the Box folder for this workshop, download the four files in the data folder. See below for the origins of these files.

Otherwise

Download the following two files to your computer and unzip them (using gunzip on Linux or Mac, or 7-Zip on Windows):

B73 v4 reference genome

B73 v4 gene annotations
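If neither gunzip nor 7-Zip is available, the files can also be decompressed from within R. The helper below, `gunzip_base`, is a hypothetical function written for this sketch and is not part of the lesson. It uses only base R connections and is demonstrated on a tiny temporary file rather than on the real downloads.

```r
# Decompress a .gz file using only base R (hypothetical helper, not part of the lesson)
gunzip_base <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  in_con <- gzfile(gz_path, open = "rb")
  on.exit(close(in_con), add = TRUE)
  out_con <- file(out_path, open = "wb")
  on.exit(close(out_con), add = TRUE)
  # Copy the decompressed bytes in chunks so large files do not fill memory
  repeat {
    chunk <- readBin(in_con, what = "raw", n = 1000000L)
    if (length(chunk) == 0) break
    writeBin(chunk, out_con)
  }
  invisible(out_path)
}

# Small self-contained demonstration on a temporary file
demo_gz <- tempfile(fileext = ".txt.gz")
gz_con <- gzfile(demo_gz, open = "wt")
writeLines(c(">chr_demo", "ACGTACGT"), gz_con)
close(gz_con)

demo_out <- gunzip_base(demo_gz)
demo_lines <- readLines(demo_out)
print(demo_lines)
```

Unlike the gunzip command-line tool, this helper leaves the original .gz file in place, so you may want to delete it afterwards to save disk space.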

You should also download the example VCF that we’ll be working with. It is derived from a panel of 1210 maize lines (Bukowski et al., 2018).

Lastly, there is a small CSV to download.
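Once everything is downloaded, a quick look at the data folder can confirm that nothing is missing. This is a minimal sketch, assuming that the working directory is your project directory and that the folder is named data as described above. It only counts files and does not check their names.

```r
# List whatever is currently in the "data" folder of the working directory
data_files <- list.files("data")
print(data_files)

# The lesson expects four files in this folder
n_files <- length(data_files)
if (n_files < 4) {
  message("Found ", n_files, " file(s); four are expected.")
}
```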

R Code

You will learn the most if you follow along by typing the code yourself. Mistakes are a good thing! You find what you did wrong and remember it for next time, and probably learn something new about R in the process. However, I also don’t want anyone falling 15 minutes behind while tracking down one misplaced parenthesis. Therefore, I recommend downloading the RMarkdown files, which you can use to quickly get caught up.

For the December 2020 workshop

Download the one .md and four .Rmd files from the “Lessons_and_code” folder on Box. Note that exercise solutions have been deleted from these files.

Otherwise

Go to the _episodes_rmd folder on GitHub. From there, you can click on an individual episode and click “Raw”. Save the .Rmd file to your computer. (If you see raw text in your web browser, right-click on it and select “Save Page As”.) Save the files into your project directory.