This lesson is being piloted (Beta version)

Bioconductor basics

Overview

Teaching: 50 min
Exercises: 15 min
Questions
  • What are the Bioconductor classes for the types of data we would find in a VCF?

Objectives
  • Create a GRanges object to indicate regions of the genome that you are interested in.

  • Load DNA sequences from a reference genome.

  • Extract assay metadata from the results of an experiment.

  • Find help pages to learn more about what you can do with this data.

Bioconductor packages are all designed to work together. This is why, when you tried installing five packages, it probably installed dozens of dependencies. The benefit is that the same functions are useful across many different bioinformatics workflows. To explore those building blocks, we’ll load a few of those dependencies.

library(GenomicRanges)
library(Biostrings)
library(Rsamtools)
library(SummarizedExperiment)

We’ll also load magrittr to make some code more readable.

library(magrittr)

Reference genome

In the setup, you downloaded and unzipped a reference genome file.

ref_file <- "data/Zm-B73-REFERENCE-GRAMENE-4.0.fa"

You should build an index for the file. This is another file, ending in .fai, that indicates where in the file each chromosome starts. It enables quick access to specific regions of the genome.

indexFa(ref_file)

Now we can import the reference genome.

mygenome <- FaFile(ref_file)

If we try to print it out:

mygenome
class: FaFile 
path: data/Zm-B73-REFERENCE-GRAMENE-4.0.fa
index: data/Zm-B73-REFERENCE-GRAMENE-4.0.fa.fai
gzindex: data/Zm-B73-REFERENCE-GRAMENE-4.0.fa.gzi
isOpen: FALSE 
yieldSize: NA 

Hmm, we don’t see any DNA sequences, just names of files. We haven’t actually loaded the data into R, just told R where to look when we want the data. This saves RAM since we don’t usually need to analyze the whole genome at once.

We can get some information about the genome:

seqinfo(mygenome)
Seqinfo object with 266 sequences from an unspecified genome:
  seqnames      seqlengths isCircular genome
  Chr10          150982314       <NA>   <NA>
  Chr1           307041717       <NA>   <NA>
  Chr2           244442276       <NA>   <NA>
  Chr3           235667834       <NA>   <NA>
  Chr4           246994605       <NA>   <NA>
  ...                  ...        ...    ...
  B73V4_ctg253       82972       <NA>   <NA>
  B73V4_ctg254       69595       <NA>   <NA>
  B73V4_ctg255       46945       <NA>   <NA>
  B73V4_ctg2729     124116       <NA>   <NA>
  Pt                140447       <NA>   <NA>

There are chromosomes, and some smaller contigs.

Challenge: Chromosome names

Look at the help page for seqinfo. Can you find a way to view all of the sequence names?

Solution

There is a seqnames function. It doesn’t work directly on a FaFile object however. It works on the SeqInfo object returned by seqinfo.

seqnames(seqinfo(mygenome))
  [1] "Chr10"         "Chr1"          "Chr2"          "Chr3"         
  [5] "Chr4"          "Chr5"          "Chr6"          "Chr7"         
  [9] "Chr8"          "Chr9"          "B73V4_ctg1"    "B73V4_ctg2"   
 [13] "B73V4_ctg3"    "B73V4_ctg4"    "B73V4_ctg5"    "B73V4_ctg6"   
 [17] "B73V4_ctg7"    "B73V4_ctg8"    "B73V4_ctg9"    "B73V4_ctg10"  
 [21] "B73V4_ctg11"   "B73V4_ctg12"   "B73V4_ctg13"   "B73V4_ctg14"  
 [25] "B73V4_ctg15"   "B73V4_ctg16"   "B73V4_ctg17"   "B73V4_ctg18"  
 [29] "B73V4_ctg19"   "B73V4_ctg20"   "B73V4_ctg21"   "B73V4_ctg22"  
 [33] "B73V4_ctg23"   "B73V4_ctg24"   "B73V4_ctg25"   "B73V4_ctg26"  
 [37] "B73V4_ctg27"   "B73V4_ctg28"   "B73V4_ctg29"   "B73V4_ctg30"  
 [41] "B73V4_ctg31"   "B73V4_ctg32"   "B73V4_ctg33"   "B73V4_ctg34"  
 [45] "B73V4_ctg35"   "B73V4_ctg36"   "B73V4_ctg37"   "B73V4_ctg38"  
 [49] "B73V4_ctg39"   "B73V4_ctg40"   "B73V4_ctg41"   "B73V4_ctg42"  
 [53] "B73V4_ctg43"   "B73V4_ctg44"   "B73V4_ctg45"   "B73V4_ctg46"  
 [57] "B73V4_ctg47"   "B73V4_ctg48"   "B73V4_ctg49"   "B73V4_ctg50"  
 [61] "B73V4_ctg51"   "B73V4_ctg52"   "B73V4_ctg53"   "B73V4_ctg54"  
 [65] "B73V4_ctg56"   "B73V4_ctg57"   "B73V4_ctg58"   "B73V4_ctg59"  
 [69] "B73V4_ctg60"   "B73V4_ctg61"   "B73V4_ctg62"   "B73V4_ctg63"  
 [73] "B73V4_ctg64"   "B73V4_ctg65"   "B73V4_ctg66"   "B73V4_ctg67"  
 [77] "B73V4_ctg68"   "B73V4_ctg69"   "B73V4_ctg70"   "B73V4_ctg71"  
 [81] "B73V4_ctg72"   "B73V4_ctg73"   "B73V4_ctg74"   "B73V4_ctg75"  
 [85] "B73V4_ctg76"   "B73V4_ctg77"   "B73V4_ctg78"   "B73V4_ctg79"  
 [89] "B73V4_ctg80"   "B73V4_ctg81"   "B73V4_ctg82"   "B73V4_ctg83"  
 [93] "B73V4_ctg84"   "B73V4_ctg85"   "B73V4_ctg86"   "B73V4_ctg87"  
 [97] "B73V4_ctg88"   "B73V4_ctg89"   "B73V4_ctg90"   "B73V4_ctg91"  
[101] "B73V4_ctg92"   "B73V4_ctg93"   "B73V4_ctg94"   "B73V4_ctg95"  
[105] "B73V4_ctg96"   "B73V4_ctg97"   "B73V4_ctg98"   "B73V4_ctg99"  
[109] "B73V4_ctg100"  "B73V4_ctg101"  "B73V4_ctg102"  "B73V4_ctg103" 
[113] "B73V4_ctg104"  "B73V4_ctg105"  "B73V4_ctg106"  "B73V4_ctg107" 
[117] "B73V4_ctg108"  "B73V4_ctg109"  "B73V4_ctg110"  "B73V4_ctg111" 
[121] "B73V4_ctg112"  "B73V4_ctg113"  "B73V4_ctg114"  "B73V4_ctg115" 
[125] "B73V4_ctg116"  "B73V4_ctg117"  "B73V4_ctg118"  "B73V4_ctg119" 
[129] "B73V4_ctg120"  "B73V4_ctg121"  "B73V4_ctg122"  "B73V4_ctg123" 
[133] "B73V4_ctg124"  "B73V4_ctg125"  "B73V4_ctg126"  "B73V4_ctg127" 
[137] "B73V4_ctg128"  "B73V4_ctg129"  "B73V4_ctg130"  "B73V4_ctg131" 
[141] "B73V4_ctg132"  "B73V4_ctg133"  "B73V4_ctg134"  "B73V4_ctg135" 
[145] "B73V4_ctg136"  "B73V4_ctg137"  "B73V4_ctg138"  "B73V4_ctg139" 
[149] "B73V4_ctg140"  "B73V4_ctg141"  "B73V4_ctg142"  "B73V4_ctg143" 
[153] "B73V4_ctg144"  "B73V4_ctg145"  "B73V4_ctg146"  "B73V4_ctg147" 
[157] "B73V4_ctg148"  "B73V4_ctg149"  "B73V4_ctg150"  "B73V4_ctg151" 
[161] "B73V4_ctg152"  "B73V4_ctg153"  "B73V4_ctg154"  "B73V4_ctg155" 
[165] "B73V4_ctg156"  "B73V4_ctg157"  "B73V4_ctg158"  "B73V4_ctg159" 
[169] "B73V4_ctg160"  "B73V4_ctg161"  "B73V4_ctg162"  "B73V4_ctg163" 
[173] "B73V4_ctg164"  "B73V4_ctg165"  "B73V4_ctg166"  "B73V4_ctg167" 
[177] "B73V4_ctg168"  "B73V4_ctg169"  "B73V4_ctg170"  "B73V4_ctg171" 
[181] "B73V4_ctg172"  "B73V4_ctg173"  "B73V4_ctg174"  "B73V4_ctg175" 
[185] "B73V4_ctg176"  "B73V4_ctg177"  "B73V4_ctg178"  "B73V4_ctg179" 
[189] "B73V4_ctg180"  "B73V4_ctg181"  "B73V4_ctg182"  "B73V4_ctg183" 
[193] "B73V4_ctg184"  "B73V4_ctg185"  "B73V4_ctg186"  "B73V4_ctg187" 
[197] "B73V4_ctg188"  "B73V4_ctg189"  "B73V4_ctg190"  "B73V4_ctg191" 
[201] "B73V4_ctg192"  "B73V4_ctg193"  "B73V4_ctg194"  "B73V4_ctg195" 
[205] "B73V4_ctg196"  "B73V4_ctg197"  "B73V4_ctg198"  "B73V4_ctg199" 
[209] "B73V4_ctg200"  "B73V4_ctg201"  "B73V4_ctg202"  "B73V4_ctg203" 
[213] "B73V4_ctg204"  "B73V4_ctg205"  "B73V4_ctg206"  "B73V4_ctg207" 
[217] "B73V4_ctg208"  "B73V4_ctg209"  "B73V4_ctg210"  "B73V4_ctg211" 
[221] "B73V4_ctg212"  "B73V4_ctg213"  "B73V4_ctg214"  "B73V4_ctg215" 
[225] "B73V4_ctg216"  "B73V4_ctg217"  "B73V4_ctg218"  "B73V4_ctg219" 
[229] "B73V4_ctg220"  "B73V4_ctg221"  "B73V4_ctg222"  "B73V4_ctg223" 
[233] "B73V4_ctg224"  "B73V4_ctg225"  "B73V4_ctg226"  "B73V4_ctg227" 
[237] "B73V4_ctg228"  "B73V4_ctg229"  "B73V4_ctg230"  "B73V4_ctg231" 
[241] "B73V4_ctg232"  "B73V4_ctg233"  "B73V4_ctg234"  "B73V4_ctg235" 
[245] "B73V4_ctg236"  "B73V4_ctg237"  "B73V4_ctg238"  "B73V4_ctg239" 
[249] "B73V4_ctg240"  "B73V4_ctg241"  "B73V4_ctg242"  "B73V4_ctg243" 
[253] "B73V4_ctg244"  "B73V4_ctg245"  "B73V4_ctg246"  "B73V4_ctg247" 
[257] "B73V4_ctg248"  "B73V4_ctg249"  "B73V4_ctg250"  "B73V4_ctg251" 
[261] "B73V4_ctg252"  "B73V4_ctg253"  "B73V4_ctg254"  "B73V4_ctg255" 
[265] "B73V4_ctg2729" "Pt"           

Coordinates within a genome

Any time we want to specify coordinates within a genome, we use a GRanges object. This could indicate the locations of anything including SNPs, exons, QTL regions, or entire chromosomes. Somewhat confusingly, every GRanges object contains an IRanges object containing the positions, and then the GRanges object tags on the chromosome name. Let’s build one from scratch.

myqtl <- GRanges(c("Chr2", "Chr2", "Chr8"),
                 IRanges(start = c(134620000, 48023000, 150341000),
                         end   = c(134752000, 48046000, 150372000)))
myqtl
GRanges object with 3 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     Chr2 134620000-134752000      *
  [2]     Chr2   48023000-48046000      *
  [3]     Chr8 150341000-150372000      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

We can add some extra info, like row names and metadata columns.

names(myqtl) <- c("Yld1", "LA1", "LA2")
myqtl$Trait <- c("Yield", "Leaf angle", "Leaf angle")
myqtl
GRanges object with 3 ranges and 1 metadata column:
       seqnames              ranges strand |       Trait
          <Rle>           <IRanges>  <Rle> | <character>
  Yld1     Chr2 134620000-134752000      * |       Yield
   LA1     Chr2   48023000-48046000      * |  Leaf angle
   LA2     Chr8 150341000-150372000      * |  Leaf angle
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Although this appears two-dimensional like a data frame, if we only want certain rows, we index it in a one-dimensional way like a vector.

myqtl[1]
GRanges object with 1 range and 1 metadata column:
       seqnames              ranges strand |       Trait
          <Rle>           <IRanges>  <Rle> | <character>
  Yld1     Chr2 134620000-134752000      * |       Yield
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
myqtl["LA2"]
GRanges object with 1 range and 1 metadata column:
      seqnames              ranges strand |       Trait
         <Rle>           <IRanges>  <Rle> | <character>
  LA2     Chr8 150341000-150372000      * |  Leaf angle
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
myqtl[myqtl$Trait == "Leaf angle"]
GRanges object with 2 ranges and 1 metadata column:
      seqnames              ranges strand |       Trait
         <Rle>           <IRanges>  <Rle> | <character>
  LA1     Chr2   48023000-48046000      * |  Leaf angle
  LA2     Chr8 150341000-150372000      * |  Leaf angle
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Handy utility functions

See ?IRanges::shift for some useful functions for manipulating GRanges objects. The width function is also helpful if you want to know the size of each range. The mcols function retrieves all metadata columns, like our “Trait” column. Check out browseVignettes("GenomicRanges") to learn even more.

We can also import our gene annotations in to a GRanges object. This should be familiar if you took the HPCBio introductory Bioconductor workshop this semester.

gtf0 <- rtracklayer::import("data/Zm-B73-REFERENCE-GRAMENE-4.0_Zm00001d.2.gff3")
gtf0
GRanges object with 2819080 ranges and 20 metadata columns:
            seqnames        ranges strand |   source            type     score
               <Rle>     <IRanges>  <Rle> | <factor>        <factor> <numeric>
        [1]        1   1-307041717      * |  wareLab  chromosome            NA
        [2]        1   44289-49837      + |  gramene  gene                  NA
        [3]        1   44289-49837      + |  gramene  mRNA                  NA
        [4]        1   44289-44350      + |  gramene  five_prime_UTR        NA
        [5]        1   44289-44947      + |  gramene  exon                  NA
        ...      ...           ...    ... .      ...             ...       ...
  [2819076]       Pt 140068-140361      + |  gramene mRNA                   NA
  [2819077]       Pt 140068-140133      + |  gramene five_prime_UTR         NA
  [2819078]       Pt 140068-140361      + |  gramene exon                   NA
  [2819079]       Pt 140068-140361      + |  gramene CDS                    NA
  [2819080]       Pt 140350-140361      + |  gramene three_prime_UTR        NA
                phase                     ID        biotype    description
            <integer>            <character>    <character>    <character>
        [1]      <NA>           chromosome:1           <NA>           <NA>
        [2]      <NA>    gene:Zm00001d027230 protein_coding Zm00001d027230
        [3]      <NA> transcript:Zm00001d0.. protein_coding           <NA>
        [4]      <NA>                   <NA>           <NA>           <NA>
        [5]      <NA>                   <NA>           <NA>           <NA>
        ...       ...                    ...            ...            ...
  [2819076]      <NA> transcript:GRMZM5G85.. protein_coding           <NA>
  [2819077]      <NA>                   <NA>           <NA>           <NA>
  [2819078]      <NA>                   <NA>           <NA>           <NA>
  [2819079]         0  CDS:GRMZM5G855343_P01           <NA>           <NA>
  [2819080]      <NA>                   <NA>           <NA>           <NA>
                   gene_id  logic_name                 Parent
               <character> <character>        <CharacterList>
        [1]           <NA>        <NA>                       
        [2] Zm00001d027230  maker_gene                       
        [3]           <NA>        <NA>    gene:Zm00001d027230
        [4]           <NA>        <NA> transcript:Zm00001d0..
        [5]           <NA>        <NA> transcript:Zm00001d0..
        ...            ...         ...                    ...
  [2819076]           <NA>        <NA>     gene:GRMZM5G855343
  [2819077]           <NA>        <NA> transcript:GRMZM5G85..
  [2819078]           <NA>        <NA> transcript:GRMZM5G85..
  [2819079]           <NA>        <NA> transcript:GRMZM5G85..
  [2819080]           <NA>        <NA> transcript:GRMZM5G85..
                  transcript_id                   Name constitutive
                    <character>            <character>  <character>
        [1]                <NA>                   <NA>         <NA>
        [2]                <NA>                   <NA>         <NA>
        [3] Zm00001d027230_T001                   <NA>         <NA>
        [4]                <NA>                   <NA>         <NA>
        [5]                <NA> Zm00001d027230_T001...            1
        ...                 ...                    ...          ...
  [2819076]   GRMZM5G855343_T01                   <NA>         <NA>
  [2819077]                <NA>                   <NA>         <NA>
  [2819078]                <NA> GRMZM5G855343_T01.ex..            1
  [2819079]                <NA>                   <NA>         <NA>
  [2819080]                <NA>                   <NA>         <NA>
            ensembl_end_phase ensembl_phase                exon_id        rank
                  <character>   <character>            <character> <character>
        [1]              <NA>          <NA>                   <NA>        <NA>
        [2]              <NA>          <NA>                   <NA>        <NA>
        [3]              <NA>          <NA>                   <NA>        <NA>
        [4]              <NA>          <NA>                   <NA>        <NA>
        [5]                 0            -1 Zm00001d027230_T001...           1
        ...               ...           ...                    ...         ...
  [2819076]              <NA>          <NA>                   <NA>        <NA>
  [2819077]              <NA>          <NA>                   <NA>        <NA>
  [2819078]                -1            -1 GRMZM5G855343_T01.ex..           1
  [2819079]              <NA>          <NA>                   <NA>        <NA>
  [2819080]              <NA>          <NA>                   <NA>        <NA>
                   protein_id           Alias Is_circular
                  <character> <CharacterList> <character>
        [1]              <NA>                        <NA>
        [2]              <NA>                        <NA>
        [3]              <NA>                        <NA>
        [4]              <NA>                        <NA>
        [5]              <NA>                        <NA>
        ...               ...             ...         ...
  [2819076]              <NA>                        <NA>
  [2819077]              <NA>                        <NA>
  [2819078]              <NA>                        <NA>
  [2819079] GRMZM5G855343_P01                        <NA>
  [2819080]              <NA>                        <NA>
  -------
  seqinfo: 267 sequences from an unspecified genome; no seqlengths

Unfortunately, the chromosome names have been shortened and don’t match the reference genome. We’ll find this problem with VCFs as well. Here is how to fix it.

newnames <- as.character(seqnames(gtf0))
tofix <- which(!newnames %in% seqnames(seqinfo(mygenome)))
unique(newnames[tofix])
 [1] "1"  "10" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "Mt"
newnames[newnames %in% as.character(1:10)] <- paste0("Chr", newnames[newnames %in% as.character(1:10)])
tofix <- which(!newnames %in% seqnames(seqinfo(mygenome)))
unique(newnames[tofix])
[1] "Mt"

The mitochondrial genome is in the GFF but not the FASTA, but we will ignore that for now. Continuing on with the fix:

gtf0a <- GRanges(newnames, ranges(gtf0))
mcols(gtf0a) <- mcols(gtf0)
gtf0a
GRanges object with 2819080 ranges and 20 metadata columns:
            seqnames        ranges strand |   source            type     score
               <Rle>     <IRanges>  <Rle> | <factor>        <factor> <numeric>
        [1]     Chr1   1-307041717      * |  wareLab  chromosome            NA
        [2]     Chr1   44289-49837      * |  gramene  gene                  NA
        [3]     Chr1   44289-49837      * |  gramene  mRNA                  NA
        [4]     Chr1   44289-44350      * |  gramene  five_prime_UTR        NA
        [5]     Chr1   44289-44947      * |  gramene  exon                  NA
        ...      ...           ...    ... .      ...             ...       ...
  [2819076]       Pt 140068-140361      * |  gramene mRNA                   NA
  [2819077]       Pt 140068-140133      * |  gramene five_prime_UTR         NA
  [2819078]       Pt 140068-140361      * |  gramene exon                   NA
  [2819079]       Pt 140068-140361      * |  gramene CDS                    NA
  [2819080]       Pt 140350-140361      * |  gramene three_prime_UTR        NA
                phase                     ID        biotype    description
            <integer>            <character>    <character>    <character>
        [1]      <NA>           chromosome:1           <NA>           <NA>
        [2]      <NA>    gene:Zm00001d027230 protein_coding Zm00001d027230
        [3]      <NA> transcript:Zm00001d0.. protein_coding           <NA>
        [4]      <NA>                   <NA>           <NA>           <NA>
        [5]      <NA>                   <NA>           <NA>           <NA>
        ...       ...                    ...            ...            ...
  [2819076]      <NA> transcript:GRMZM5G85.. protein_coding           <NA>
  [2819077]      <NA>                   <NA>           <NA>           <NA>
  [2819078]      <NA>                   <NA>           <NA>           <NA>
  [2819079]         0  CDS:GRMZM5G855343_P01           <NA>           <NA>
  [2819080]      <NA>                   <NA>           <NA>           <NA>
                   gene_id  logic_name                 Parent
               <character> <character>        <CharacterList>
        [1]           <NA>        <NA>                       
        [2] Zm00001d027230  maker_gene                       
        [3]           <NA>        <NA>    gene:Zm00001d027230
        [4]           <NA>        <NA> transcript:Zm00001d0..
        [5]           <NA>        <NA> transcript:Zm00001d0..
        ...            ...         ...                    ...
  [2819076]           <NA>        <NA>     gene:GRMZM5G855343
  [2819077]           <NA>        <NA> transcript:GRMZM5G85..
  [2819078]           <NA>        <NA> transcript:GRMZM5G85..
  [2819079]           <NA>        <NA> transcript:GRMZM5G85..
  [2819080]           <NA>        <NA> transcript:GRMZM5G85..
                  transcript_id                   Name constitutive
                    <character>            <character>  <character>
        [1]                <NA>                   <NA>         <NA>
        [2]                <NA>                   <NA>         <NA>
        [3] Zm00001d027230_T001                   <NA>         <NA>
        [4]                <NA>                   <NA>         <NA>
        [5]                <NA> Zm00001d027230_T001...            1
        ...                 ...                    ...          ...
  [2819076]   GRMZM5G855343_T01                   <NA>         <NA>
  [2819077]                <NA>                   <NA>         <NA>
  [2819078]                <NA> GRMZM5G855343_T01.ex..            1
  [2819079]                <NA>                   <NA>         <NA>
  [2819080]                <NA>                   <NA>         <NA>
            ensembl_end_phase ensembl_phase                exon_id        rank
                  <character>   <character>            <character> <character>
        [1]              <NA>          <NA>                   <NA>        <NA>
        [2]              <NA>          <NA>                   <NA>        <NA>
        [3]              <NA>          <NA>                   <NA>        <NA>
        [4]              <NA>          <NA>                   <NA>        <NA>
        [5]                 0            -1 Zm00001d027230_T001...           1
        ...               ...           ...                    ...         ...
  [2819076]              <NA>          <NA>                   <NA>        <NA>
  [2819077]              <NA>          <NA>                   <NA>        <NA>
  [2819078]                -1            -1 GRMZM5G855343_T01.ex..           1
  [2819079]              <NA>          <NA>                   <NA>        <NA>
  [2819080]              <NA>          <NA>                   <NA>        <NA>
                   protein_id           Alias Is_circular
                  <character> <CharacterList> <character>
        [1]              <NA>                        <NA>
        [2]              <NA>                        <NA>
        [3]              <NA>                        <NA>
        [4]              <NA>                        <NA>
        [5]              <NA>                        <NA>
        ...               ...             ...         ...
  [2819076]              <NA>                        <NA>
  [2819077]              <NA>                        <NA>
  [2819078]              <NA>                        <NA>
  [2819079] GRMZM5G855343_P01                        <NA>
  [2819080]              <NA>                        <NA>
  -------
  seqinfo: 267 sequences from an unspecified genome; no seqlengths

We can free up memory by removing gtf0 now that we don’t need it.

rm(gtf0)

Challenge: Subset GFF

Make a GRanges object called gtf1 that only contains gene locations, i.e. it only contains rows where the “type” is “gene”. Be sure to start with our gtf0a object.

Solution

gtf1 <- gtf0a[gtf0a$type == "gene"]
gtf1
GRanges object with 39498 ranges and 20 metadata columns:
          seqnames        ranges strand |   source     type     score     phase
             <Rle>     <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
      [1]     Chr1   44289-49837      * |  gramene     gene        NA      <NA>
      [2]     Chr1   50877-55716      * |  gramene     gene        NA      <NA>
      [3]     Chr1   92299-95134      * |  gramene     gene        NA      <NA>
      [4]     Chr1 111655-118312      * |  gramene     gene        NA      <NA>
      [5]     Chr1 118683-119739      * |  gramene     gene        NA      <NA>
      ...      ...           ...    ... .      ...      ...       ...       ...
  [39494]       Pt 134341-134862      * |  gramene     gene        NA      <NA>
  [39495]       Pt 134923-135222      * |  gramene     gene        NA      <NA>
  [39496]       Pt 138323-139807      * |  gramene     gene        NA      <NA>
  [39497]       Pt 139824-140048      * |  gramene     gene        NA      <NA>
  [39498]       Pt 140068-140361      * |  gramene     gene        NA      <NA>
                           ID        biotype            description
                  <character>    <character>            <character>
      [1] gene:Zm00001d027230 protein_coding         Zm00001d027230
      [2] gene:Zm00001d027231 protein_coding         Zm00001d027231
      [3] gene:Zm00001d027232 protein_coding         Zm00001d027232
      [4] gene:Zm00001d027233 protein_coding         Zm00001d027233
      [5] gene:Zm00001d027234 protein_coding         Zm00001d027234
      ...                 ...            ...                    ...
  [39494]  gene:GRMZM5G885905 protein_coding Uncharacterized prot..
  [39495]  gene:GRMZM5G866761 protein_coding Putative uncharacter..
  [39496]  gene:GRMZM5G818111 protein_coding                   <NA>
  [39497]  gene:GRMZM5G866064 protein_coding                   <NA>
  [39498]  gene:GRMZM5G855343 protein_coding                   <NA>
                 gene_id  logic_name          Parent transcript_id        Name
             <character> <character> <CharacterList>   <character> <character>
      [1] Zm00001d027230  maker_gene                          <NA>        <NA>
      [2] Zm00001d027231  maker_gene                          <NA>        <NA>
      [3] Zm00001d027232  maker_gene                          <NA>        <NA>
      [4] Zm00001d027233  maker_gene                          <NA>        <NA>
      [5] Zm00001d027234  maker_gene                          <NA>        <NA>
      ...            ...         ...             ...           ...         ...
  [39494]  GRMZM5G885905 genebuilder                          <NA>     ycf73-A
  [39495]  GRMZM5G866761 genebuilder                          <NA>     ycf15-A
  [39496]  GRMZM5G818111 genebuilder                          <NA>        <NA>
  [39497]  GRMZM5G866064 genebuilder                          <NA>        <NA>
  [39498]  GRMZM5G855343 genebuilder                          <NA>        <NA>
          constitutive ensembl_end_phase ensembl_phase     exon_id        rank
           <character>       <character>   <character> <character> <character>
      [1]         <NA>              <NA>          <NA>        <NA>        <NA>
      [2]         <NA>              <NA>          <NA>        <NA>        <NA>
      [3]         <NA>              <NA>          <NA>        <NA>        <NA>
      [4]         <NA>              <NA>          <NA>        <NA>        <NA>
      [5]         <NA>              <NA>          <NA>        <NA>        <NA>
      ...          ...               ...           ...         ...         ...
  [39494]         <NA>              <NA>          <NA>        <NA>        <NA>
  [39495]         <NA>              <NA>          <NA>        <NA>        <NA>
  [39496]         <NA>              <NA>          <NA>        <NA>        <NA>
  [39497]         <NA>              <NA>          <NA>        <NA>        <NA>
  [39498]         <NA>              <NA>          <NA>        <NA>        <NA>
           protein_id           Alias Is_circular
          <character> <CharacterList> <character>
      [1]        <NA>                        <NA>
      [2]        <NA>                        <NA>
      [3]        <NA>                        <NA>
      [4]        <NA>                        <NA>
      [5]        <NA>                        <NA>
      ...         ...             ...         ...
  [39494]        <NA>                        <NA>
  [39495]        <NA>                        <NA>
  [39496]        <NA>                        <NA>
  [39497]        <NA>                        <NA>
  [39498]        <NA>                        <NA>
  -------
  seqinfo: 267 sequences from an unspecified genome; no seqlengths

Here’s where things start to get really helpful. What genes are within our QTL ranges?

qtl_genes <- subsetByOverlaps(gtf1, myqtl)
qtl_genes
GRanges object with 5 ranges and 20 metadata columns:
      seqnames              ranges strand |   source     type     score
         <Rle>           <IRanges>  <Rle> | <factor> <factor> <numeric>
  [1]     Chr2 134686452-134689699      * |  gramene     gene        NA
  [2]     Chr2 134712849-134713289      * |  gramene     gene        NA
  [3]     Chr2 134747713-134748868      * |  gramene     gene        NA
  [4]     Chr8 150334356-150342192      * |  gramene     gene        NA
  [5]     Chr8 150346180-150347905      * |  gramene     gene        NA
          phase                  ID        biotype            description
      <integer>         <character>    <character>            <character>
  [1]      <NA> gene:Zm00001d004734 protein_coding         Zm00001d004734
  [2]      <NA> gene:Zm00001d004735 protein_coding         Zm00001d004735
  [3]      <NA> gene:Zm00001d004736 protein_coding         Zm00001d004736
  [4]      <NA> gene:Zm00001d011430 protein_coding         Zm00001d011430
  [5]      <NA> gene:Zm00001d011431 protein_coding GPI-anchored protein..
             gene_id  logic_name          Parent transcript_id        Name
         <character> <character> <CharacterList>   <character> <character>
  [1] Zm00001d004734  maker_gene                          <NA>        <NA>
  [2] Zm00001d004735  maker_gene                          <NA>        <NA>
  [3] Zm00001d004736  maker_gene                          <NA>        <NA>
  [4] Zm00001d011430  maker_gene                          <NA>        <NA>
  [5] Zm00001d011431  maker_gene                          <NA>        <NA>
      constitutive ensembl_end_phase ensembl_phase     exon_id        rank
       <character>       <character>   <character> <character> <character>
  [1]         <NA>              <NA>          <NA>        <NA>        <NA>
  [2]         <NA>              <NA>          <NA>        <NA>        <NA>
  [3]         <NA>              <NA>          <NA>        <NA>        <NA>
  [4]         <NA>              <NA>          <NA>        <NA>        <NA>
  [5]         <NA>              <NA>          <NA>        <NA>        <NA>
       protein_id           Alias Is_circular
      <character> <CharacterList> <character>
  [1]        <NA>                        <NA>
  [2]        <NA>                        <NA>
  [3]        <NA>                        <NA>
  [4]        <NA>                        <NA>
  [5]        <NA>                        <NA>
  -------
  seqinfo: 267 sequences from an unspecified genome; no seqlengths

Later, when we import a VCF, the positions of the variants within the genome will be stored as GRanges.

DNA sequences

Now let’s get the sequences for those genes. You could do this using any GRanges object.

qtl_genes_seq <- scanFa(mygenome, qtl_genes)
qtl_genes_seq
DNAStringSet object of length 5:
    width seq                                               names               
[1]  3248 TTAAAACTTAAAATGTAATACAT...CCGGGAGACTGGACTTAGAAAGC Chr2
[2]   441 TCAAATCCTTGCAAGGTGCCGCC...CCAAGGGGCTCATTTGAGGCCAT Chr2
[3]  1156 CTACATCATCTACCGTGCCTGTT...CTGATAAGTGATAAGTAAACAAG Chr2
[4]  7837 CAGCAGACTGCACCACACGGCAC...AATAGTATATATATTTTGTCTGA Chr8
[5]  1726 CTACTTTACACCTCTTCTTCTTC...AAATATATAATGGCCATTCCACT Chr8

This is a DNAStringSet, which is how you’ll typically find DNA sequences represented. You can do handy things like take the reverse complement, or translate to amino acids:

reverseComplement(qtl_genes_seq)
DNAStringSet object of length 5:
    width seq                                               names               
[1]  3248 GCTTTCTAAGTCCAGTCTCCCGG...ATGTATTACATTTTAAGTTTTAA Chr2
[2]   441 ATGGCCTCAAATGAGCCCCTTGG...GGCGGCACCTTGCAAGGATTTGA Chr2
[3]  1156 CTTGTTTACTTATCACTTATCAG...AACAGGCACGGTAGATGATGTAG Chr2
[4]  7837 TCAGACAAAATATATATACTATT...GTGCCGTGTGGTGCAGTCTGCTG Chr8
[5]  1726 AGTGGAATGGCCATTATATATTT...GAAGAAGAAGAGGTGTAAAGTAG Chr8
translate(qtl_genes_seq)
AAStringSet object of length 5:
    width seq                                               names               
[1]  1082 LKLKM*YIQLIYYFS*YHMGKNS...EAAVVSFPTFLSCLFCPGDWT*K Chr2
[2]   147 SNPCKVPP*PKHTMFRLHLHKAI...WFFRCTSTFSDVRDGAKGLI*GH Chr2
[3]   385 LHHLPCLFF*DS*CHQAGILFNS...LYRFCVYARLEKLALY**VISKQ Chr2
[4]  2612 QQTAPHGTWHCQPRRARPIYPQS...CKTNQILDAWI*IQIKIVYIFCL Chr8
[5]   575 LLYTSSSSSCSSCPLPLGFLCSH...PSSSRCTRCLYIHTCLNI*WPFH Chr8

Of course, this is the full gene sequence containing exons and introns, so we don’t get the correct amino acid sequence like we would if we were using the CDS.

Extracting transcript sequences

If you do want to get the sequences of transcripts or CDS from a genome, see ?GenomicFeatures::extractTranscriptSeqs. You would need to import the GFF with makeTxDbFromGFF rather than rtracklayer::import.

If you need to create a DNAStringSet from scratch, you can do it directly from a character vector.

test_dna <- DNAStringSet(c("AGGG", "TCAGATTTAAC", "TC"))
test_dna
DNAStringSet object of length 3:
    width seq
[1]     4 AGGG
[2]    11 TCAGATTTAAC
[3]     2 TC

Bonus Challenge: The whole thing

How would you import the full sequence for chromosome 3?

Solution

chr3length <- seqlengths(seqinfo(mygenome))["Chr3"]
chr3range <- GRanges("Chr3", IRanges(start = 1, end = chr3length))
chr3seq <- scanFa(mygenome, chr3range)

When we import data from VCF, the reference and alternative alleles are stored in DNAStringSets.

Experimental results

The SummarizedExperiment class contains matrix-like data, with assays in rows and samples in columns. Like much of Bioconductor, it was originally designed for microarray data, where there would be many probes on a microarray, each representing a gene, and the fluorescence of that probe indicated expression levels of the gene in that sample. However, SummarizedExperiment is flexible enough to also contain RNA-seq data, or in our case, genotyping results.

The RangedSummarizedExperiment class is exactly like SummarizedExperiment, except that each assay is associated with a location within the genome. This makes sense for our variant data, where each “assay” is a SNP. It also makes sense for gene expression data, where each gene has a location within the genome. Below we’ll load an example expression dataset from Bioconductor.

data(airway, package="airway")

airway
class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

We can see that there are row names that are gene identities, and column names that are sample identifiers. There are eight samples and 64,102 genes. One handy thing about the printout is that it lists functions we can use to access the data. Here’s some metadata about the samples:

colData(airway)
DataFrame with 8 rows and 9 columns
           SampleName     cell      dex    albut        Run avgLength
             <factor> <factor> <factor> <factor>   <factor> <integer>
SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
           Experiment    Sample    BioSample
             <factor>  <factor>     <factor>
SRR1039508  SRX384345 SRS508568 SAMN02422669
SRR1039509  SRX384346 SRS508567 SAMN02422675
SRR1039512  SRX384349 SRS508571 SAMN02422678
SRR1039513  SRX384350 SRS508572 SAMN02422670
SRR1039516  SRX384353 SRS508575 SAMN02422682
SRR1039517  SRX384354 SRS508576 SAMN02422673
SRR1039520  SRX384357 SRS508579 SAMN02422683
SRR1039521  SRX384358 SRS508580 SAMN02422677

And the read counts for gene expression (I am using the head function to avoid printing them all out):

assays(airway)$counts %>% head(10)
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000005          0          0          0          0          0
ENSG00000000419        467        515        621        365        587
ENSG00000000457        260        211        263        164        245
ENSG00000000460         60         55         40         35         78
ENSG00000000938          0          0          2          0          1
ENSG00000000971       3251       3679       6177       4252       6721
ENSG00000001036       1433       1062       1733        881       1424
ENSG00000001084        519        380        595        493        820
ENSG00000001167        394        236        464        175        658
                SRR1039517 SRR1039520 SRR1039521
ENSG00000000003       1047        770        572
ENSG00000000005          0          0          0
ENSG00000000419        799        417        508
ENSG00000000457        331        233        229
ENSG00000000460         63         76         60
ENSG00000000938          0          0          0
ENSG00000000971      11027       5176       7995
ENSG00000001036       1439       1359       1109
ENSG00000001084        714        696        704
ENSG00000001167        584        360        269

Challenge: Gene and sample names

If you wanted a vector of gene IDs or a vector of sample names, what functions can you use to get those?

Solution

geneids <- rownames(airway)
head(geneids, 10)
 [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
 [5] "ENSG00000000460" "ENSG00000000938" "ENSG00000000971" "ENSG00000001036"
 [9] "ENSG00000001084" "ENSG00000001167"
samples <- colnames(airway)
samples
[1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516"
[6] "SRR1039517" "SRR1039520" "SRR1039521"

It also wouldn’t be wrong, just more complicated, to get them like this:

geneids <- rownames(assays(airway)$counts)
samples <- colnames(assays(airway)$counts)

The rowRanges function gets us the GRanges object associated with each gene.

rowRanges(airway)
GRangesList object of length 64102:
$ENSG00000000003
GRanges object with 17 ranges and 2 metadata columns:
       seqnames            ranges strand |   exon_id       exon_name
          <Rle>         <IRanges>  <Rle> | <integer>     <character>
   [1]        X 99883667-99884983      - |    667145 ENSE00001459322
   [2]        X 99885756-99885863      - |    667146 ENSE00000868868
   [3]        X 99887482-99887565      - |    667147 ENSE00000401072
   [4]        X 99887538-99887565      - |    667148 ENSE00001849132
   [5]        X 99888402-99888536      - |    667149 ENSE00003554016
   ...      ...               ...    ... .       ...             ...
  [13]        X 99890555-99890743      - |    667156 ENSE00003512331
  [14]        X 99891188-99891686      - |    667158 ENSE00001886883
  [15]        X 99891605-99891803      - |    667159 ENSE00001855382
  [16]        X 99891790-99892101      - |    667160 ENSE00001863395
  [17]        X 99894942-99894988      - |    667161 ENSE00001828996
  -------
  seqinfo: 722 sequences (1 circular) from an unspecified genome

...
<64101 more elements>

The first gene is on the X chromosome. All the exons are listed individually. Because of this, the GRanges for each gene are compiled into a GRangesList for the whole dataset. The use of GRanges means we could subset the whole dataset by genomic position. Let’s say we’re just interested in a certain region of chromosome 4:

myregion <- GRanges("4", IRanges(start = 82660000, end = 119280000))

airway4 <- subsetByOverlaps(airway, myregion)

airway4
class: RangedSummarizedExperiment 
dim: 458 8 
metadata(1): ''
assays(1): counts
rownames(458): ENSG00000005059 ENSG00000029559 ... ENSG00000273389
  ENSG00000273447
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
rowRanges(airway4)
GRangesList object of length 458:
$ENSG00000005059
GRanges object with 20 ranges and 2 metadata columns:
       seqnames              ranges strand |   exon_id       exon_name
          <Rle>           <IRanges>  <Rle> | <integer>     <character>
   [1]        4 110481361-110481592      + |    161238 ENSE00002034107
   [2]        4 110481365-110481592      + |    161239 ENSE00002071601
   [3]        4 110481368-110481592      + |    161240 ENSE00002078395
   [4]        4 110482146-110482173      + |    161241 ENSE00002027139
   [5]        4 110569640-110569805      + |    161242 ENSE00001695709
   ...      ...                 ...    ... .       ...             ...
  [16]        4 110605596-110606523      + |    161254 ENSE00001890650
  [17]        4 110605599-110605802      + |    161255 ENSE00000736563
  [18]        4 110606407-110606523      + |    161256 ENSE00000333811
  [19]        4 110608671-110608872      + |    161257 ENSE00001847574
  [20]        4 110608671-110609874      + |    161258 ENSE00001866116
  -------
  seqinfo: 722 sequences (1 circular) from an unspecified genome

...
<457 more elements>

Now we just have 458 genes in that region.

In the next episode, we will import data from a VCF.

Key Points

  • FaFile creates a pointer to a reference genome file on your computer.

  • An index file allows quick access to specific information from large files.

  • GRanges stores positions within a genome for any type of feature (SNP, exon, etc.)

  • DNAStringSet stores DNA sequences.

  • SummarizedExperiment stores the results of a set of assays across a set of samples.