Bioconductor basics
Overview
Teaching: 50 min
Exercises: 15 minQuestions
What are the Bioconductor classes for the types of data we would find in a VCF?
Objectives
Create a GRanges object to indicate regions of the genome that you are interested in.
Load DNA sequences from a reference genome.
Extract assay metadata from the results of an experiment.
Find help pages to learn more about what you can do with this data.
Bioconductor packages are all designed to work together. This is why, when you tried installing five packages, it probably installed dozens of dependencies. The benefit is that the same functions are useful across many different bioinformatics workflows. To explore those building blocks, we’ll load a few of those dependencies.
library(GenomicRanges)
library(Biostrings)
library(Rsamtools)
library(SummarizedExperiment)
We’ll also load magrittr
to make some code more readable.
library(magrittr)
Reference genome
In the setup, you downloaded and unzipped a reference genome file.
ref_file <- "data/Zm-B73-REFERENCE-GRAMENE-4.0.fa"
You should build an index for the file. This is another file, ending in
.fai
, that indicates where in the file each chromosome starts. It enables
quick access to specific regions of the genome.
indexFa(ref_file)
Now we can import the reference genome.
mygenome <- FaFile(ref_file)
If we try to print it out:
mygenome
class: FaFile
path: data/Zm-B73-REFERENCE-GRAMENE-4.0.fa
index: data/Zm-B73-REFERENCE-GRAMENE-4.0.fa.fai
gzindex: data/Zm-B73-REFERENCE-GRAMENE-4.0.fa.gzi
isOpen: FALSE
yieldSize: NA
Hmm, we don’t see any DNA sequences, just names of files. We haven’t actually loaded the data into R, just told R where to look when we want the data. This saves RAM since we don’t usually need to analyze the whole genome at once.
We can get some information about the genome:
seqinfo(mygenome)
Seqinfo object with 266 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
Chr10 150982314 <NA> <NA>
Chr1 307041717 <NA> <NA>
Chr2 244442276 <NA> <NA>
Chr3 235667834 <NA> <NA>
Chr4 246994605 <NA> <NA>
... ... ... ...
B73V4_ctg253 82972 <NA> <NA>
B73V4_ctg254 69595 <NA> <NA>
B73V4_ctg255 46945 <NA> <NA>
B73V4_ctg2729 124116 <NA> <NA>
Pt 140447 <NA> <NA>
There are chromosomes, and some smaller contigs.
Challenge: Chromosome names
Look at the help page for
seqinfo
. Can you find a way to view all of the sequence names?Solution
There is a
seqnames
function. It doesn’t work directly on aFaFile
object however. It works on theSeqInfo
object returned byseqinfo
.seqnames(seqinfo(mygenome))
[1] "Chr10" "Chr1" "Chr2" "Chr3" [5] "Chr4" "Chr5" "Chr6" "Chr7" [9] "Chr8" "Chr9" "B73V4_ctg1" "B73V4_ctg2" [13] "B73V4_ctg3" "B73V4_ctg4" "B73V4_ctg5" "B73V4_ctg6" [17] "B73V4_ctg7" "B73V4_ctg8" "B73V4_ctg9" "B73V4_ctg10" [21] "B73V4_ctg11" "B73V4_ctg12" "B73V4_ctg13" "B73V4_ctg14" [25] "B73V4_ctg15" "B73V4_ctg16" "B73V4_ctg17" "B73V4_ctg18" [29] "B73V4_ctg19" "B73V4_ctg20" "B73V4_ctg21" "B73V4_ctg22" [33] "B73V4_ctg23" "B73V4_ctg24" "B73V4_ctg25" "B73V4_ctg26" [37] "B73V4_ctg27" "B73V4_ctg28" "B73V4_ctg29" "B73V4_ctg30" [41] "B73V4_ctg31" "B73V4_ctg32" "B73V4_ctg33" "B73V4_ctg34" [45] "B73V4_ctg35" "B73V4_ctg36" "B73V4_ctg37" "B73V4_ctg38" [49] "B73V4_ctg39" "B73V4_ctg40" "B73V4_ctg41" "B73V4_ctg42" [53] "B73V4_ctg43" "B73V4_ctg44" "B73V4_ctg45" "B73V4_ctg46" [57] "B73V4_ctg47" "B73V4_ctg48" "B73V4_ctg49" "B73V4_ctg50" [61] "B73V4_ctg51" "B73V4_ctg52" "B73V4_ctg53" "B73V4_ctg54" [65] "B73V4_ctg56" "B73V4_ctg57" "B73V4_ctg58" "B73V4_ctg59" [69] "B73V4_ctg60" "B73V4_ctg61" "B73V4_ctg62" "B73V4_ctg63" [73] "B73V4_ctg64" "B73V4_ctg65" "B73V4_ctg66" "B73V4_ctg67" [77] "B73V4_ctg68" "B73V4_ctg69" "B73V4_ctg70" "B73V4_ctg71" [81] "B73V4_ctg72" "B73V4_ctg73" "B73V4_ctg74" "B73V4_ctg75" [85] "B73V4_ctg76" "B73V4_ctg77" "B73V4_ctg78" "B73V4_ctg79" [89] "B73V4_ctg80" "B73V4_ctg81" "B73V4_ctg82" "B73V4_ctg83" [93] "B73V4_ctg84" "B73V4_ctg85" "B73V4_ctg86" "B73V4_ctg87" [97] "B73V4_ctg88" "B73V4_ctg89" "B73V4_ctg90" "B73V4_ctg91" [101] "B73V4_ctg92" "B73V4_ctg93" "B73V4_ctg94" "B73V4_ctg95" [105] "B73V4_ctg96" "B73V4_ctg97" "B73V4_ctg98" "B73V4_ctg99" [109] "B73V4_ctg100" "B73V4_ctg101" "B73V4_ctg102" "B73V4_ctg103" [113] "B73V4_ctg104" "B73V4_ctg105" "B73V4_ctg106" "B73V4_ctg107" [117] "B73V4_ctg108" "B73V4_ctg109" "B73V4_ctg110" "B73V4_ctg111" [121] "B73V4_ctg112" "B73V4_ctg113" "B73V4_ctg114" "B73V4_ctg115" [125] "B73V4_ctg116" "B73V4_ctg117" "B73V4_ctg118" "B73V4_ctg119" [129] "B73V4_ctg120" "B73V4_ctg121" "B73V4_ctg122" "B73V4_ctg123" [133] "B73V4_ctg124" "B73V4_ctg125" "B73V4_ctg126" "B73V4_ctg127" [137] "B73V4_ctg128" "B73V4_ctg129" "B73V4_ctg130" "B73V4_ctg131" [141] "B73V4_ctg132" "B73V4_ctg133" "B73V4_ctg134" "B73V4_ctg135" [145] "B73V4_ctg136" "B73V4_ctg137" "B73V4_ctg138" "B73V4_ctg139" [149] "B73V4_ctg140" "B73V4_ctg141" "B73V4_ctg142" "B73V4_ctg143" [153] "B73V4_ctg144" "B73V4_ctg145" "B73V4_ctg146" "B73V4_ctg147" [157] "B73V4_ctg148" "B73V4_ctg149" "B73V4_ctg150" "B73V4_ctg151" [161] "B73V4_ctg152" "B73V4_ctg153" "B73V4_ctg154" "B73V4_ctg155" [165] "B73V4_ctg156" "B73V4_ctg157" "B73V4_ctg158" "B73V4_ctg159" [169] "B73V4_ctg160" "B73V4_ctg161" "B73V4_ctg162" "B73V4_ctg163" [173] "B73V4_ctg164" "B73V4_ctg165" "B73V4_ctg166" "B73V4_ctg167" [177] "B73V4_ctg168" "B73V4_ctg169" "B73V4_ctg170" "B73V4_ctg171" [181] "B73V4_ctg172" "B73V4_ctg173" "B73V4_ctg174" "B73V4_ctg175" [185] "B73V4_ctg176" "B73V4_ctg177" "B73V4_ctg178" "B73V4_ctg179" [189] "B73V4_ctg180" "B73V4_ctg181" "B73V4_ctg182" "B73V4_ctg183" [193] "B73V4_ctg184" "B73V4_ctg185" "B73V4_ctg186" "B73V4_ctg187" [197] "B73V4_ctg188" "B73V4_ctg189" "B73V4_ctg190" "B73V4_ctg191" [201] "B73V4_ctg192" "B73V4_ctg193" "B73V4_ctg194" "B73V4_ctg195" [205] "B73V4_ctg196" "B73V4_ctg197" "B73V4_ctg198" "B73V4_ctg199" [209] "B73V4_ctg200" "B73V4_ctg201" "B73V4_ctg202" "B73V4_ctg203" [213] "B73V4_ctg204" "B73V4_ctg205" "B73V4_ctg206" "B73V4_ctg207" [217] "B73V4_ctg208" "B73V4_ctg209" "B73V4_ctg210" "B73V4_ctg211" [221] "B73V4_ctg212" "B73V4_ctg213" "B73V4_ctg214" "B73V4_ctg215" [225] "B73V4_ctg216" "B73V4_ctg217" "B73V4_ctg218" "B73V4_ctg219" [229] "B73V4_ctg220" "B73V4_ctg221" "B73V4_ctg222" "B73V4_ctg223" [233] "B73V4_ctg224" "B73V4_ctg225" "B73V4_ctg226" "B73V4_ctg227" [237] "B73V4_ctg228" "B73V4_ctg229" "B73V4_ctg230" "B73V4_ctg231" [241] "B73V4_ctg232" "B73V4_ctg233" "B73V4_ctg234" "B73V4_ctg235" [245] "B73V4_ctg236" "B73V4_ctg237" "B73V4_ctg238" "B73V4_ctg239" [249] "B73V4_ctg240" "B73V4_ctg241" "B73V4_ctg242" "B73V4_ctg243" [253] "B73V4_ctg244" "B73V4_ctg245" "B73V4_ctg246" "B73V4_ctg247" [257] "B73V4_ctg248" "B73V4_ctg249" "B73V4_ctg250" "B73V4_ctg251" [261] "B73V4_ctg252" "B73V4_ctg253" "B73V4_ctg254" "B73V4_ctg255" [265] "B73V4_ctg2729" "Pt"
Coordinates within a genome
Any time we want to specify coordinates within a genome, we use a GRanges
object. This could indicate the locations of anything including SNPs, exons,
QTL regions, or entire chromosomes. Somewhat confusingly, every GRanges
object contains an IRanges
object containing the positions, and then the
GRanges
object tags on the chromosome name. Let’s build one from scratch.
myqtl <- GRanges(c("Chr2", "Chr2", "Chr8"),
IRanges(start = c(134620000, 48023000, 150341000),
end = c(134752000, 48046000, 150372000)))
myqtl
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] Chr2 134620000-134752000 *
[2] Chr2 48023000-48046000 *
[3] Chr8 150341000-150372000 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
We can add some extra info, like row names and metadata columns.
names(myqtl) <- c("Yld1", "LA1", "LA2")
myqtl$Trait <- c("Yield", "Leaf angle", "Leaf angle")
myqtl
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | Trait
<Rle> <IRanges> <Rle> | <character>
Yld1 Chr2 134620000-134752000 * | Yield
LA1 Chr2 48023000-48046000 * | Leaf angle
LA2 Chr8 150341000-150372000 * | Leaf angle
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Although this appears two-dimensional like a data frame, if we only want certain rows, we index it in a one-dimensional way like a vector.
myqtl[1]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | Trait
<Rle> <IRanges> <Rle> | <character>
Yld1 Chr2 134620000-134752000 * | Yield
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
myqtl["LA2"]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | Trait
<Rle> <IRanges> <Rle> | <character>
LA2 Chr8 150341000-150372000 * | Leaf angle
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
myqtl[myqtl$Trait == "Leaf angle"]
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | Trait
<Rle> <IRanges> <Rle> | <character>
LA1 Chr2 48023000-48046000 * | Leaf angle
LA2 Chr8 150341000-150372000 * | Leaf angle
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Handy utility functions
See
?IRanges::shift
for some useful functions for manipulatingGRanges
objects. Thewidth
function is also helpful if you want to know the size of each range. Themcols
function retrieves all metadata columns, like our “Trait” column. Check outbrowseVignettes("GenomicRanges")
to learn even more.
We can also import our gene annotations in to a GRanges
object. This should
be familiar if you took the HPCBio introductory Bioconductor workshop this
semester.
gtf0 <- rtracklayer::import("data/Zm-B73-REFERENCE-GRAMENE-4.0_Zm00001d.2.gff3")
gtf0
GRanges object with 2819080 ranges and 20 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 1 1-307041717 * | wareLab chromosome NA
[2] 1 44289-49837 + | gramene gene NA
[3] 1 44289-49837 + | gramene mRNA NA
[4] 1 44289-44350 + | gramene five_prime_UTR NA
[5] 1 44289-44947 + | gramene exon NA
... ... ... ... . ... ... ...
[2819076] Pt 140068-140361 + | gramene mRNA NA
[2819077] Pt 140068-140133 + | gramene five_prime_UTR NA
[2819078] Pt 140068-140361 + | gramene exon NA
[2819079] Pt 140068-140361 + | gramene CDS NA
[2819080] Pt 140350-140361 + | gramene three_prime_UTR NA
phase ID biotype description
<integer> <character> <character> <character>
[1] <NA> chromosome:1 <NA> <NA>
[2] <NA> gene:Zm00001d027230 protein_coding Zm00001d027230
[3] <NA> transcript:Zm00001d0.. protein_coding <NA>
[4] <NA> <NA> <NA> <NA>
[5] <NA> <NA> <NA> <NA>
... ... ... ... ...
[2819076] <NA> transcript:GRMZM5G85.. protein_coding <NA>
[2819077] <NA> <NA> <NA> <NA>
[2819078] <NA> <NA> <NA> <NA>
[2819079] 0 CDS:GRMZM5G855343_P01 <NA> <NA>
[2819080] <NA> <NA> <NA> <NA>
gene_id logic_name Parent
<character> <character> <CharacterList>
[1] <NA> <NA>
[2] Zm00001d027230 maker_gene
[3] <NA> <NA> gene:Zm00001d027230
[4] <NA> <NA> transcript:Zm00001d0..
[5] <NA> <NA> transcript:Zm00001d0..
... ... ... ...
[2819076] <NA> <NA> gene:GRMZM5G855343
[2819077] <NA> <NA> transcript:GRMZM5G85..
[2819078] <NA> <NA> transcript:GRMZM5G85..
[2819079] <NA> <NA> transcript:GRMZM5G85..
[2819080] <NA> <NA> transcript:GRMZM5G85..
transcript_id Name constitutive
<character> <character> <character>
[1] <NA> <NA> <NA>
[2] <NA> <NA> <NA>
[3] Zm00001d027230_T001 <NA> <NA>
[4] <NA> <NA> <NA>
[5] <NA> Zm00001d027230_T001... 1
... ... ... ...
[2819076] GRMZM5G855343_T01 <NA> <NA>
[2819077] <NA> <NA> <NA>
[2819078] <NA> GRMZM5G855343_T01.ex.. 1
[2819079] <NA> <NA> <NA>
[2819080] <NA> <NA> <NA>
ensembl_end_phase ensembl_phase exon_id rank
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] <NA> <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA>
[5] 0 -1 Zm00001d027230_T001... 1
... ... ... ... ...
[2819076] <NA> <NA> <NA> <NA>
[2819077] <NA> <NA> <NA> <NA>
[2819078] -1 -1 GRMZM5G855343_T01.ex.. 1
[2819079] <NA> <NA> <NA> <NA>
[2819080] <NA> <NA> <NA> <NA>
protein_id Alias Is_circular
<character> <CharacterList> <character>
[1] <NA> <NA>
[2] <NA> <NA>
[3] <NA> <NA>
[4] <NA> <NA>
[5] <NA> <NA>
... ... ... ...
[2819076] <NA> <NA>
[2819077] <NA> <NA>
[2819078] <NA> <NA>
[2819079] GRMZM5G855343_P01 <NA>
[2819080] <NA> <NA>
-------
seqinfo: 267 sequences from an unspecified genome; no seqlengths
Unfortunately, the chromosome names have been shortened and don’t match the reference genome. We’ll find this problem with VCFs as well. Here is how to fix it.
newnames <- as.character(seqnames(gtf0))
tofix <- which(!newnames %in% seqnames(seqinfo(mygenome)))
unique(newnames[tofix])
[1] "1" "10" "2" "3" "4" "5" "6" "7" "8" "9" "Mt"
newnames[newnames %in% as.character(1:10)] <- paste0("Chr", newnames[newnames %in% as.character(1:10)])
tofix <- which(!newnames %in% seqnames(seqinfo(mygenome)))
unique(newnames[tofix])
[1] "Mt"
The mitochondrial genome is in the GFF but not the FASTA, but we will ignore that for now. Continuing on with the fix:
gtf0a <- GRanges(newnames, ranges(gtf0))
mcols(gtf0a) <- mcols(gtf0)
gtf0a
GRanges object with 2819080 ranges and 20 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] Chr1 1-307041717 * | wareLab chromosome NA
[2] Chr1 44289-49837 * | gramene gene NA
[3] Chr1 44289-49837 * | gramene mRNA NA
[4] Chr1 44289-44350 * | gramene five_prime_UTR NA
[5] Chr1 44289-44947 * | gramene exon NA
... ... ... ... . ... ... ...
[2819076] Pt 140068-140361 * | gramene mRNA NA
[2819077] Pt 140068-140133 * | gramene five_prime_UTR NA
[2819078] Pt 140068-140361 * | gramene exon NA
[2819079] Pt 140068-140361 * | gramene CDS NA
[2819080] Pt 140350-140361 * | gramene three_prime_UTR NA
phase ID biotype description
<integer> <character> <character> <character>
[1] <NA> chromosome:1 <NA> <NA>
[2] <NA> gene:Zm00001d027230 protein_coding Zm00001d027230
[3] <NA> transcript:Zm00001d0.. protein_coding <NA>
[4] <NA> <NA> <NA> <NA>
[5] <NA> <NA> <NA> <NA>
... ... ... ... ...
[2819076] <NA> transcript:GRMZM5G85.. protein_coding <NA>
[2819077] <NA> <NA> <NA> <NA>
[2819078] <NA> <NA> <NA> <NA>
[2819079] 0 CDS:GRMZM5G855343_P01 <NA> <NA>
[2819080] <NA> <NA> <NA> <NA>
gene_id logic_name Parent
<character> <character> <CharacterList>
[1] <NA> <NA>
[2] Zm00001d027230 maker_gene
[3] <NA> <NA> gene:Zm00001d027230
[4] <NA> <NA> transcript:Zm00001d0..
[5] <NA> <NA> transcript:Zm00001d0..
... ... ... ...
[2819076] <NA> <NA> gene:GRMZM5G855343
[2819077] <NA> <NA> transcript:GRMZM5G85..
[2819078] <NA> <NA> transcript:GRMZM5G85..
[2819079] <NA> <NA> transcript:GRMZM5G85..
[2819080] <NA> <NA> transcript:GRMZM5G85..
transcript_id Name constitutive
<character> <character> <character>
[1] <NA> <NA> <NA>
[2] <NA> <NA> <NA>
[3] Zm00001d027230_T001 <NA> <NA>
[4] <NA> <NA> <NA>
[5] <NA> Zm00001d027230_T001... 1
... ... ... ...
[2819076] GRMZM5G855343_T01 <NA> <NA>
[2819077] <NA> <NA> <NA>
[2819078] <NA> GRMZM5G855343_T01.ex.. 1
[2819079] <NA> <NA> <NA>
[2819080] <NA> <NA> <NA>
ensembl_end_phase ensembl_phase exon_id rank
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] <NA> <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA>
[5] 0 -1 Zm00001d027230_T001... 1
... ... ... ... ...
[2819076] <NA> <NA> <NA> <NA>
[2819077] <NA> <NA> <NA> <NA>
[2819078] -1 -1 GRMZM5G855343_T01.ex.. 1
[2819079] <NA> <NA> <NA> <NA>
[2819080] <NA> <NA> <NA> <NA>
protein_id Alias Is_circular
<character> <CharacterList> <character>
[1] <NA> <NA>
[2] <NA> <NA>
[3] <NA> <NA>
[4] <NA> <NA>
[5] <NA> <NA>
... ... ... ...
[2819076] <NA> <NA>
[2819077] <NA> <NA>
[2819078] <NA> <NA>
[2819079] GRMZM5G855343_P01 <NA>
[2819080] <NA> <NA>
-------
seqinfo: 267 sequences from an unspecified genome; no seqlengths
We can free up memory by removing gtf0
now that we don’t need it.
rm(gtf0)
Challenge: Subset GFF
Make a
GRanges
object calledgtf1
that only contains gene locations, i.e. it only contains rows where the “type” is “gene”. Be sure to start with ourgtf0a
object.Solution
gtf1 <- gtf0a[gtf0a$type == "gene"] gtf1
GRanges object with 39498 ranges and 20 metadata columns: seqnames ranges strand | source type score phase <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> [1] Chr1 44289-49837 * | gramene gene NA <NA> [2] Chr1 50877-55716 * | gramene gene NA <NA> [3] Chr1 92299-95134 * | gramene gene NA <NA> [4] Chr1 111655-118312 * | gramene gene NA <NA> [5] Chr1 118683-119739 * | gramene gene NA <NA> ... ... ... ... . ... ... ... ... [39494] Pt 134341-134862 * | gramene gene NA <NA> [39495] Pt 134923-135222 * | gramene gene NA <NA> [39496] Pt 138323-139807 * | gramene gene NA <NA> [39497] Pt 139824-140048 * | gramene gene NA <NA> [39498] Pt 140068-140361 * | gramene gene NA <NA> ID biotype description <character> <character> <character> [1] gene:Zm00001d027230 protein_coding Zm00001d027230 [2] gene:Zm00001d027231 protein_coding Zm00001d027231 [3] gene:Zm00001d027232 protein_coding Zm00001d027232 [4] gene:Zm00001d027233 protein_coding Zm00001d027233 [5] gene:Zm00001d027234 protein_coding Zm00001d027234 ... ... ... ... [39494] gene:GRMZM5G885905 protein_coding Uncharacterized prot.. [39495] gene:GRMZM5G866761 protein_coding Putative uncharacter.. [39496] gene:GRMZM5G818111 protein_coding <NA> [39497] gene:GRMZM5G866064 protein_coding <NA> [39498] gene:GRMZM5G855343 protein_coding <NA> gene_id logic_name Parent transcript_id Name <character> <character> <CharacterList> <character> <character> [1] Zm00001d027230 maker_gene <NA> <NA> [2] Zm00001d027231 maker_gene <NA> <NA> [3] Zm00001d027232 maker_gene <NA> <NA> [4] Zm00001d027233 maker_gene <NA> <NA> [5] Zm00001d027234 maker_gene <NA> <NA> ... ... ... ... ... ... [39494] GRMZM5G885905 genebuilder <NA> ycf73-A [39495] GRMZM5G866761 genebuilder <NA> ycf15-A [39496] GRMZM5G818111 genebuilder <NA> <NA> [39497] GRMZM5G866064 genebuilder <NA> <NA> [39498] GRMZM5G855343 genebuilder <NA> <NA> constitutive ensembl_end_phase ensembl_phase exon_id rank <character> <character> <character> <character> <character> [1] <NA> <NA> <NA> <NA> <NA> [2] <NA> <NA> <NA> <NA> <NA> [3] <NA> <NA> <NA> <NA> <NA> [4] <NA> <NA> <NA> <NA> <NA> [5] <NA> <NA> <NA> <NA> <NA> ... ... ... ... ... ... [39494] <NA> <NA> <NA> <NA> <NA> [39495] <NA> <NA> <NA> <NA> <NA> [39496] <NA> <NA> <NA> <NA> <NA> [39497] <NA> <NA> <NA> <NA> <NA> [39498] <NA> <NA> <NA> <NA> <NA> protein_id Alias Is_circular <character> <CharacterList> <character> [1] <NA> <NA> [2] <NA> <NA> [3] <NA> <NA> [4] <NA> <NA> [5] <NA> <NA> ... ... ... ... [39494] <NA> <NA> [39495] <NA> <NA> [39496] <NA> <NA> [39497] <NA> <NA> [39498] <NA> <NA> ------- seqinfo: 267 sequences from an unspecified genome; no seqlengths
Here’s where things start to get really helpful. What genes are within our QTL ranges?
qtl_genes <- subsetByOverlaps(gtf1, myqtl)
qtl_genes
GRanges object with 5 ranges and 20 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] Chr2 134686452-134689699 * | gramene gene NA
[2] Chr2 134712849-134713289 * | gramene gene NA
[3] Chr2 134747713-134748868 * | gramene gene NA
[4] Chr8 150334356-150342192 * | gramene gene NA
[5] Chr8 150346180-150347905 * | gramene gene NA
phase ID biotype description
<integer> <character> <character> <character>
[1] <NA> gene:Zm00001d004734 protein_coding Zm00001d004734
[2] <NA> gene:Zm00001d004735 protein_coding Zm00001d004735
[3] <NA> gene:Zm00001d004736 protein_coding Zm00001d004736
[4] <NA> gene:Zm00001d011430 protein_coding Zm00001d011430
[5] <NA> gene:Zm00001d011431 protein_coding GPI-anchored protein..
gene_id logic_name Parent transcript_id Name
<character> <character> <CharacterList> <character> <character>
[1] Zm00001d004734 maker_gene <NA> <NA>
[2] Zm00001d004735 maker_gene <NA> <NA>
[3] Zm00001d004736 maker_gene <NA> <NA>
[4] Zm00001d011430 maker_gene <NA> <NA>
[5] Zm00001d011431 maker_gene <NA> <NA>
constitutive ensembl_end_phase ensembl_phase exon_id rank
<character> <character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA>
[2] <NA> <NA> <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA> <NA>
[5] <NA> <NA> <NA> <NA> <NA>
protein_id Alias Is_circular
<character> <CharacterList> <character>
[1] <NA> <NA>
[2] <NA> <NA>
[3] <NA> <NA>
[4] <NA> <NA>
[5] <NA> <NA>
-------
seqinfo: 267 sequences from an unspecified genome; no seqlengths
Later, when we import a VCF, the positions of the variants within the genome
will be stored as GRanges
.
DNA sequences
Now let’s get the sequences for those genes. You could do this using any
GRanges
object.
qtl_genes_seq <- scanFa(mygenome, qtl_genes)
qtl_genes_seq
DNAStringSet object of length 5:
width seq names
[1] 3248 TTAAAACTTAAAATGTAATACAT...CCGGGAGACTGGACTTAGAAAGC Chr2
[2] 441 TCAAATCCTTGCAAGGTGCCGCC...CCAAGGGGCTCATTTGAGGCCAT Chr2
[3] 1156 CTACATCATCTACCGTGCCTGTT...CTGATAAGTGATAAGTAAACAAG Chr2
[4] 7837 CAGCAGACTGCACCACACGGCAC...AATAGTATATATATTTTGTCTGA Chr8
[5] 1726 CTACTTTACACCTCTTCTTCTTC...AAATATATAATGGCCATTCCACT Chr8
This is a DNAStringSet
, which is how you’ll typically find DNA sequences
represented. You can do handy things like take the reverse complement, or
translate to amino acids:
reverseComplement(qtl_genes_seq)
DNAStringSet object of length 5:
width seq names
[1] 3248 GCTTTCTAAGTCCAGTCTCCCGG...ATGTATTACATTTTAAGTTTTAA Chr2
[2] 441 ATGGCCTCAAATGAGCCCCTTGG...GGCGGCACCTTGCAAGGATTTGA Chr2
[3] 1156 CTTGTTTACTTATCACTTATCAG...AACAGGCACGGTAGATGATGTAG Chr2
[4] 7837 TCAGACAAAATATATATACTATT...GTGCCGTGTGGTGCAGTCTGCTG Chr8
[5] 1726 AGTGGAATGGCCATTATATATTT...GAAGAAGAAGAGGTGTAAAGTAG Chr8
translate(qtl_genes_seq)
AAStringSet object of length 5:
width seq names
[1] 1082 LKLKM*YIQLIYYFS*YHMGKNS...EAAVVSFPTFLSCLFCPGDWT*K Chr2
[2] 147 SNPCKVPP*PKHTMFRLHLHKAI...WFFRCTSTFSDVRDGAKGLI*GH Chr2
[3] 385 LHHLPCLFF*DS*CHQAGILFNS...LYRFCVYARLEKLALY**VISKQ Chr2
[4] 2612 QQTAPHGTWHCQPRRARPIYPQS...CKTNQILDAWI*IQIKIVYIFCL Chr8
[5] 575 LLYTSSSSSCSSCPLPLGFLCSH...PSSSRCTRCLYIHTCLNI*WPFH Chr8
Of course, this is the full gene sequence containing exons and introns, so we don’t get the correct amino acid sequence like we would if we were using the CDS.
Extracting transcript sequences
If you do want to get the sequences of transcripts or CDS from a genome, see
?GenomicFeatures::extractTranscriptSeqs
. You would need to import the GFF withmakeTxDbFromGFF
rather thanrtracklayer::import
.
If you need to create a DNAStringSet
from scratch, you can do it directly from
a character vector.
test_dna <- DNAStringSet(c("AGGG", "TCAGATTTAAC", "TC"))
test_dna
DNAStringSet object of length 3:
width seq
[1] 4 AGGG
[2] 11 TCAGATTTAAC
[3] 2 TC
Bonus Challenge: The whole thing
How would you import the full sequence for chromosome 3?
Solution
chr3length <- seqlengths(seqinfo(mygenome))["Chr3"] chr3range <- GRanges("Chr3", IRanges(start = 1, end = chr3length)) chr3seq <- scanFa(mygenome, chr3range)
When we import data from VCF, the reference and alternative alleles are stored
in DNAStringSet
s.
Experimental results
The SummarizedExperiment
class contains matrix-like data, with assays in
rows and samples in columns. Like much of Bioconductor, it was originally
designed for microarray data, where there would be many probes on a microarray,
each representing a gene, and the fluorescence of that probe indicated
expression levels of the gene in that sample. However, SummarizedExperiment
is flexible enough to also contain RNA-seq data, or in our case, genotyping
results.
The RangedSummarizedExperiment
class is exactly like SummarizedExperiment
,
except that each assay is associated with a location within the genome. This
makes sense for our variant data, where each “assay” is a SNP. It also makes
sense for gene expression data, where each gene has a location within the
genome. Below we’ll load an example expression dataset from Bioconductor.
data(airway, package="airway")
airway
class: RangedSummarizedExperiment
dim: 64102 8
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
We can see that there are row names that are gene identities, and column names that are sample identifiers. There are eight samples and 64,102 genes. One handy thing about the printout is that it lists functions we can use to access the data. Here’s some metadata about the samples:
colData(airway)
DataFrame with 8 rows and 9 columns
SampleName cell dex albut Run avgLength
<factor> <factor> <factor> <factor> <factor> <integer>
SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
Experiment Sample BioSample
<factor> <factor> <factor>
SRR1039508 SRX384345 SRS508568 SAMN02422669
SRR1039509 SRX384346 SRS508567 SAMN02422675
SRR1039512 SRX384349 SRS508571 SAMN02422678
SRR1039513 SRX384350 SRS508572 SAMN02422670
SRR1039516 SRX384353 SRS508575 SAMN02422682
SRR1039517 SRX384354 SRS508576 SAMN02422673
SRR1039520 SRX384357 SRS508579 SAMN02422683
SRR1039521 SRX384358 SRS508580 SAMN02422677
And the read counts for gene expression (I am using the head
function to avoid
printing them all out):
assays(airway)$counts %>% head(10)
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003 679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419 467 515 621 365 587
ENSG00000000457 260 211 263 164 245
ENSG00000000460 60 55 40 35 78
ENSG00000000938 0 0 2 0 1
ENSG00000000971 3251 3679 6177 4252 6721
ENSG00000001036 1433 1062 1733 881 1424
ENSG00000001084 519 380 595 493 820
ENSG00000001167 394 236 464 175 658
SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 1047 770 572
ENSG00000000005 0 0 0
ENSG00000000419 799 417 508
ENSG00000000457 331 233 229
ENSG00000000460 63 76 60
ENSG00000000938 0 0 0
ENSG00000000971 11027 5176 7995
ENSG00000001036 1439 1359 1109
ENSG00000001084 714 696 704
ENSG00000001167 584 360 269
Challenge: Gene and sample names
If you wanted a vector of gene IDs or a vector of sample names, what functions can you use to get those?
Solution
geneids <- rownames(airway) head(geneids, 10)
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" [5] "ENSG00000000460" "ENSG00000000938" "ENSG00000000971" "ENSG00000001036" [9] "ENSG00000001084" "ENSG00000001167"
samples <- colnames(airway) samples
[1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516" [6] "SRR1039517" "SRR1039520" "SRR1039521"
It also wouldn’t be wrong, just more complicated, to get them like this:
geneids <- rownames(assays(airway)$counts) samples <- colnames(assays(airway)$counts)
The rowRanges
function gets us the GRanges
object associated with each gene.
rowRanges(airway)
GRangesList object of length 64102:
$ENSG00000000003
GRanges object with 17 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] X 99883667-99884983 - | 667145 ENSE00001459322
[2] X 99885756-99885863 - | 667146 ENSE00000868868
[3] X 99887482-99887565 - | 667147 ENSE00000401072
[4] X 99887538-99887565 - | 667148 ENSE00001849132
[5] X 99888402-99888536 - | 667149 ENSE00003554016
... ... ... ... . ... ...
[13] X 99890555-99890743 - | 667156 ENSE00003512331
[14] X 99891188-99891686 - | 667158 ENSE00001886883
[15] X 99891605-99891803 - | 667159 ENSE00001855382
[16] X 99891790-99892101 - | 667160 ENSE00001863395
[17] X 99894942-99894988 - | 667161 ENSE00001828996
-------
seqinfo: 722 sequences (1 circular) from an unspecified genome
...
<64101 more elements>
The first gene is on the X chromosome. All the exons are listed individually.
Because of this, the GRanges
for each gene are compiled into a GRangesList
for the whole dataset. The use of GRanges
means we could subset the whole
dataset by genomic position. Let’s say we’re just interested in a certain
region of chromosome 4:
myregion <- GRanges("4", IRanges(start = 82660000, end = 119280000))
airway4 <- subsetByOverlaps(airway, myregion)
airway4
class: RangedSummarizedExperiment
dim: 458 8
metadata(1): ''
assays(1): counts
rownames(458): ENSG00000005059 ENSG00000029559 ... ENSG00000273389
ENSG00000273447
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
rowRanges(airway4)
GRangesList object of length 458:
$ENSG00000005059
GRanges object with 20 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 4 110481361-110481592 + | 161238 ENSE00002034107
[2] 4 110481365-110481592 + | 161239 ENSE00002071601
[3] 4 110481368-110481592 + | 161240 ENSE00002078395
[4] 4 110482146-110482173 + | 161241 ENSE00002027139
[5] 4 110569640-110569805 + | 161242 ENSE00001695709
... ... ... ... . ... ...
[16] 4 110605596-110606523 + | 161254 ENSE00001890650
[17] 4 110605599-110605802 + | 161255 ENSE00000736563
[18] 4 110606407-110606523 + | 161256 ENSE00000333811
[19] 4 110608671-110608872 + | 161257 ENSE00001847574
[20] 4 110608671-110609874 + | 161258 ENSE00001866116
-------
seqinfo: 722 sequences (1 circular) from an unspecified genome
...
<457 more elements>
Now we just have 458 genes in that region.
In the next episode, we will import data from a VCF.
Key Points
FaFile creates a pointer to a reference genome file on your computer.
An index file allows quick access to specific information from large files.
GRanges stores positions within a genome for any type of feature (SNP, exon, etc.)
DNAStringSet stores DNA sequences.
SummarizedExperiment stores the results of a set of assays across a set of samples.