This lesson is being piloted (Beta version)

Working with genome annotations

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What genes are near my SNPs of interest?

Objectives
  • Build search windows, sized based on LD in your population, flanking SNPs of interest.

  • Identify genes within your search windows.

  • Determine functional consequences of SNPs.

library(VariantAnnotation)
library(GenomicFeatures)

Here’s code to reload the dataset:

myvcf3 <- VcfFile("data/hmp321_agpv4_chr1_subset_filtered.vcf.bgz")
hdr <- scanVcfHeader(myvcf3)
all_sam <- samples(hdr)
keep_sam <- all_sam[!all_sam %in% c("32", "98F1")]
keep_regions <- GRanges(seqnames = "1",
                        ranges = IRanges(start = c(21.4e6, 22.9e6),
                                         end = c(22.3e6, 23.8e6)))
names(keep_regions) <- c("Region_1", "Region_2")
svp <- ScanVcfParam(info = c("DP", "MAF"), geno = "GT",
                    samples = keep_sam, which = keep_regions,
                    fixed = "ALT")
mydata <- readVcf(myvcf3, param = svp, genome = seqlevels(keep_regions))

Let’s say that after you explored and filtered your dataset, you exported your SNP data to a GWAS software and identified some SNPs that were significantly associated with your trait of interest. How can you find candidate genes? And do any of the variants seem to have functional consequences? Rather than spending an entire week with a genome browser, let’s automate these tasks.

What genes are near my SNP?

In an earlier episode, we loaded the genome annotation from a GFF file.

gtf0 <- rtracklayer::import("data/Zm-B73-REFERENCE-GRAMENE-4.0_Zm00001d.2.gff3")
gtf1 <- gtf0[gtf0$type == "gene"]
gtf1[,c("gene_id", "description")]
GRanges object with 39498 ranges and 2 metadata columns:
          seqnames        ranges strand |        gene_id            description
             <Rle>     <IRanges>  <Rle> |    <character>            <character>
      [1]        1   44289-49837      + | Zm00001d027230         Zm00001d027230
      [2]        1   50877-55716      - | Zm00001d027231         Zm00001d027231
      [3]        1   92299-95134      - | Zm00001d027232         Zm00001d027232
      [4]        1 111655-118312      - | Zm00001d027233         Zm00001d027233
      [5]        1 118683-119739      - | Zm00001d027234         Zm00001d027234
      ...      ...           ...    ... .            ...                    ...
  [39494]       Pt 134341-134862      - |  GRMZM5G885905 Uncharacterized prot..
  [39495]       Pt 134923-135222      - |  GRMZM5G866761 Putative uncharacter..
  [39496]       Pt 138323-139807      + |  GRMZM5G818111                   <NA>
  [39497]       Pt 139824-140048      + |  GRMZM5G866064                   <NA>
  [39498]       Pt 140068-140361      + |  GRMZM5G855343                   <NA>
  -------
  seqinfo: 267 sequences from an unspecified genome; no seqlengths

Let’s say we performed GWAS and have a spreadsheet of our top hits. If you haven’t already, download the example spreadsheet:

download.file("https://raw.githubusercontent.com/HPCBio/variant-analysis-workshop/gh-pages/_episodes_rmd/data/sig_hits.csv",
              destfile = "data/sig_hits.csv")

Now import that data into R.

sig_hits <- read.csv("data/sig_hits.csv", stringsAsFactors = FALSE)
head(sig_hits)
         SNP Chrom      Pos   log10P
1 1-21545431     1 21863251 9.917759
2 1-21470550     1 21776482 9.813656
3 1-23066684     1 23453920 9.633105
4 1-21161473     1 21467486 9.632094
5 1-21540889     1 21858710 9.474378
6 1-21524270     1 21842097 9.468465

To represent the location of these SNPs, we’ll make a GRanges object.

sig_hits_gr <- GRanges(seqnames = as.character(sig_hits$Chrom),
                       ranges = IRanges(start = sig_hits$Pos,
                                        end = sig_hits$Pos),
                       Name = sig_hits$SNP,
                       log10P = sig_hits$log10P)
sig_hits_gr
GRanges object with 37 ranges and 2 metadata columns:
       seqnames    ranges strand |        Name    log10P
          <Rle> <IRanges>  <Rle> | <character> <numeric>
   [1]        1  21863251      * |  1-21545431   9.91776
   [2]        1  21776482      * |  1-21470550   9.81366
   [3]        1  23453920      * |  1-23066684   9.63311
   [4]        1  21467486      * |  1-21161473   9.63209
   [5]        1  21858710      * |  1-21540889   9.47438
   ...      ...       ...    ... .         ...       ...
  [33]        1  22011911      * |  1-21714795   5.36808
  [34]        1  23089222      * |  1-22713889   5.33649
  [35]        1  23658823      * |  1-23273113   5.12980
  [36]        1  23615797      * |  1-23230131   5.04723
  [37]        1  23236989      * |  1-22859536   5.04382
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Based on our LD estimates in the last episode, we’ll want to search within 5 kb of each SNP for candidate genes. We can make another GRanges object to indicate these search regions.

my_ld <- 5000

search_regions <- flank(sig_hits_gr, my_ld, both = TRUE)
search_regions
GRanges object with 37 ranges and 2 metadata columns:
       seqnames            ranges strand |        Name    log10P
          <Rle>         <IRanges>  <Rle> | <character> <numeric>
   [1]        1 21858251-21868250      * |  1-21545431   9.91776
   [2]        1 21771482-21781481      * |  1-21470550   9.81366
   [3]        1 23448920-23458919      * |  1-23066684   9.63311
   [4]        1 21462486-21472485      * |  1-21161473   9.63209
   [5]        1 21853710-21863709      * |  1-21540889   9.47438
   ...      ...               ...    ... .         ...       ...
  [33]        1 22006911-22016910      * |  1-21714795   5.36808
  [34]        1 23084222-23094221      * |  1-22713889   5.33649
  [35]        1 23653823-23663822      * |  1-23273113   5.12980
  [36]        1 23610797-23620796      * |  1-23230131   5.04723
  [37]        1 23231989-23241988      * |  1-22859536   5.04382
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now we’ll use findOverlaps to identify which genes are in each region.

my_hits <- findOverlaps(search_regions, gtf1)
my_hits
Hits object with 26 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1         672
   [2]         1         673
   [3]         2         665
   [4]         2         666
   [5]         2         667
   ...       ...         ...
  [22]        26         672
  [23]        27         671
  [24]        32         681
  [25]        33         682
  [26]        34         710
  -------
  queryLength: 37 / subjectLength: 39498

Let’s translate this to a table of SNP names and gene names.

my_hits2 <- data.frame(SNP = search_regions$Name[queryHits(my_hits)],
                       Gene = gtf1$gene_id[subjectHits(my_hits)],
                       Description = gtf1$description[subjectHits(my_hits)])
my_hits2
          SNP           Gene
1  1-21545431 Zm00001d028062
2  1-21545431 Zm00001d028063
3  1-21470550 Zm00001d028054
4  1-21470550 Zm00001d028055
5  1-21470550 Zm00001d028056
6  1-21470550 Zm00001d028057
7  1-21540889 Zm00001d028062
8  1-21540889 Zm00001d028063
9  1-21524270 Zm00001d028062
10 1-22636910 Zm00001d028102
11 1-22636910 Zm00001d028103
12 1-21603379 Zm00001d028066
13 1-21598747 Zm00001d028066
14 1-22751964 Zm00001d028111
15 1-23224727 Zm00001d028128
16 1-21563089 Zm00001d028064
17 1-22974357 Zm00001d028118
18 1-22583176 Zm00001d028098
19 1-22715750 Zm00001d028108
20 1-22640278 Zm00001d028103
21 1-22640278 Zm00001d028104
22 1-21533453 Zm00001d028062
23 1-21499325 Zm00001d028061
24 1-21694741 Zm00001d028073
25 1-21714795 Zm00001d028074
26 1-22713889 Zm00001d028108
                                                      Description
1  CASP-like protein 5A2 [Source:UniProtKB/Swiss-Prot;Acc:P0DI66]
2                                                  Zm00001d028063
3                                                  Zm00001d028054
4                                                  Zm00001d028055
5                                                  Zm00001d028056
6                                                  Zm00001d028057
7  CASP-like protein 5A2 [Source:UniProtKB/Swiss-Prot;Acc:P0DI66]
8                                                  Zm00001d028063
9  CASP-like protein 5A2 [Source:UniProtKB/Swiss-Prot;Acc:P0DI66]
10                                                 Zm00001d028102
11                                                 Zm00001d028103
12                                                 Zm00001d028066
13                                                 Zm00001d028066
14                                                 Zm00001d028111
15                                                 Zm00001d028128
16                                                 Zm00001d028064
17                                                 Zm00001d028118
18                                                 Zm00001d028098
19                                                 Zm00001d028108
20                                                 Zm00001d028103
21                                                 Zm00001d028104
22 CASP-like protein 5A2 [Source:UniProtKB/Swiss-Prot;Acc:P0DI66]
23                                                 Zm00001d028061
24                                                 Zm00001d028073
25                                                 Zm00001d028074
26                                                 Zm00001d028108

And maybe we want to add a column to our table that lists every nearby gene.

sig_hits$Genes <- sapply(sig_hits$SNP,
                         function(x){
                           genes <- my_hits2$Gene[my_hits2$SNP == x]
                           paste(genes, collapse = ";")
                         })
sig_hits
          SNP Chrom      Pos   log10P
1  1-21545431     1 21863251 9.917759
2  1-21470550     1 21776482 9.813656
3  1-23066684     1 23453920 9.633105
4  1-21161473     1 21467486 9.632094
5  1-21540889     1 21858710 9.474378
6  1-21524270     1 21842097 9.468465
7  1-22534809     1 22917062 9.355975
8  1-22636910     1 23012963 9.046269
9  1-23096680     1 23484423 8.845528
10 1-21603379     1 21904274 8.514568
11 1-21406012     1 21709908 8.384472
12 1-21598747     1 21899696 8.312084
13 1-22751964     1 23127244 8.175816
14 1-23224727     1 23610393 8.165355
15 1-21369527     1 21674878 8.120414
16 1-21563089     1 21879767 7.879734
17 1-21401703     1 21705600 7.846956
18 1-22974357     1 23360219 7.724650
19 1-22583176     1 22959431 7.565953
20 1-21274922     1 21589090 7.103093
21 1-23378850     1 23768488 7.093798
22 1-22991748     1 23377601 7.068908
23 1-22715750     1 23091083 6.958135
24 1-22640278     1 23016331 6.866281
25 1-21152420     1 21458433 6.462329
26 1-21533453     1 21851279 6.395184
27 1-21499325     1 21805256 5.983543
28 1-21436109     1 21741028 5.895989
29 1-21172946     1 21478958 5.894014
30 1-22810129     1 23187584 5.713232
31 1-21403068     1 21706964 5.494226
32 1-21694741     1 21991857 5.450143
33 1-21714795     1 22011911 5.368084
34 1-22713889     1 23089222 5.336487
35 1-23273113     1 23658823 5.129798
36 1-23230131     1 23615797 5.047230
37 1-22859536     1 23236989 5.043819
                                                         Genes
1                                Zm00001d028062;Zm00001d028063
2  Zm00001d028054;Zm00001d028055;Zm00001d028056;Zm00001d028057
3                                                             
4                                                             
5                                Zm00001d028062;Zm00001d028063
6                                               Zm00001d028062
7                                                             
8                                Zm00001d028102;Zm00001d028103
9                                                             
10                                              Zm00001d028066
11                                                            
12                                              Zm00001d028066
13                                              Zm00001d028111
14                                              Zm00001d028128
15                                                            
16                                              Zm00001d028064
17                                                            
18                                              Zm00001d028118
19                                              Zm00001d028098
20                                                            
21                                                            
22                                                            
23                                              Zm00001d028108
24                               Zm00001d028103;Zm00001d028104
25                                                            
26                                              Zm00001d028062
27                                              Zm00001d028061
28                                                            
29                                                            
30                                                            
31                                                            
32                                              Zm00001d028073
33                                              Zm00001d028074
34                                              Zm00001d028108
35                                                            
36                                                            
37                                                            

We can save that output to a file. If you’re worried about having gene names that Excel could change to dates (e.g. MAR7), save to tab-delimited text. Then when you import to Excel, you can tell it to keep particular columns as text.

write.table(sig_hits, file = "sig_hits_genes.tsv", sep = "\t", row.names = FALSE)

Challenge: SNPs within genes

Modify the above code to identify which SNPs are within genes, and which genes they are within.

Solution

my_hits3 <- findOverlaps(sig_hits_gr, gtf1)
my_hits4 <- data.frame(SNP = sig_hits_gr$Name[queryHits(my_hits3)],
                       Gene = gtf1$gene_id[subjectHits(my_hits3)],
                       Description = gtf1$description[subjectHits(my_hits3)])
my_hits4
         SNP           Gene
1 1-21470550 Zm00001d028055
2 1-21540889 Zm00001d028062
3 1-21603379 Zm00001d028066
4 1-22715750 Zm00001d028108
5 1-22640278 Zm00001d028103
6 1-21533453 Zm00001d028062
7 1-21694741 Zm00001d028073
                                                     Description
1                                                 Zm00001d028055
2 CASP-like protein 5A2 [Source:UniProtKB/Swiss-Prot;Acc:P0DI66]
3                                                 Zm00001d028066
4                                                 Zm00001d028108
5                                                 Zm00001d028103
6 CASP-like protein 5A2 [Source:UniProtKB/Swiss-Prot;Acc:P0DI66]
7                                                 Zm00001d028073

Bonus challenge: SNPs in coding regions

Which of the significant SNPs are within coding regions of genes? Hint: find rows of the genome annotation with the type “CDS”.

Solution

gtf2 <- gtf0[gtf0$type == "CDS"]
sig_hits_coding <- subsetByOverlaps(sig_hits_gr, gtf2)
sig_hits_coding
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        Name    log10P
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]        1  21904274      * |  1-21603379   8.51457
  [2]        1  23091083      * |  1-22715750   6.95814
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Identifying functional consequences of SNPs

Let’s get a list of all SNPs that are within our search regions. We’ll look for functional effects of these in order to find potential causative SNPs. We’ll keep these in a CollapsedVCF object.

snps_to_search <- subsetByOverlaps(mydata, search_regions)
snps_to_search
class: CollapsedVCF 
dim: 7535 198 
rowRanges(vcf):
  GRanges with 3 metadata columns: paramRangeID, REF, ALT
info(vcf):
  DataFrame with 2 columns: DP, MAF
info(header(vcf)):
       Number Type    Description           
   DP  1      Integer Total Depth           
   MAF 1      Float   Minor allele frequency
geno(vcf):
  List of length 1: GT
geno(header(vcf)):
      Number Type   Description
   GT 1      String Genotype   

We’ll need our reference genome sequence.

mygenome <- FaFile("data/Zm-B73-REFERENCE-GRAMENE-4.0.fa")

We have our genome annotation imported as a GRanges already, but that won’t do for the function we’re going to use. We need to import it instead as a TxDb object.

my_txdb <- makeTxDbFromGFF("data/Zm-B73-REFERENCE-GRAMENE-4.0_Zm00001d.2.gff3",
                           format = "gff3", organism = "Zea mays",
                           dataSource = "MaizeGDB")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... 
Warning in .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature = "exon", : 3327 exons couldn't be linked to a transcript so were dropped (showing
  only the first 6):
  seqid   start     end strand   ID                      Name
1     1  572435  572663      + <NA> Zm00001d022649_T001.exon1
2     1 1030351 1031090      + <NA> Zm00001d022650_T001.exon1
3     1 1540407 1540586      - <NA> Zm00001d022651_T001.exon1
4     1 1918444 1918644      - <NA> Zm00001d022652_T001.exon1
5     1 2170280 2170492      - <NA> Zm00001d022653_T001.exon2
6     1 2170757 2170983      - <NA> Zm00001d022653_T001.exon1
                          Parent Parent_type
1 transcript:Zm00001d022649_T001        <NA>
2 transcript:Zm00001d022650_T001        <NA>
3 transcript:Zm00001d022651_T001        <NA>
4 transcript:Zm00001d022652_T001        <NA>
5 transcript:Zm00001d022653_T001        <NA>
6 transcript:Zm00001d022653_T001        <NA>
OK

The chromosome names in the FASTA file are in “Chr1” format, whereas the chromosome names in the GFF and VCF files are in “1” format. We have to get everything to match. We can’t modify the reference genome, so we will modify the imported VCF and TxDb data.

seqinfo_vcf <- seqinfo(snps_to_search)
seqinfo_vcf
Seqinfo object with 1 sequence from 1 genome; no seqlengths:
  seqnames seqlengths isCircular genome
  1                NA         NA      1
seqnames(seqinfo_vcf) <- "Chr1"
seqinfo_vcf
Seqinfo object with 1 sequence from 1 genome; no seqlengths:
  seqnames seqlengths isCircular genome
  Chr1             NA         NA      1
seqinfo(snps_to_search, new2old = 1) <- seqinfo_vcf

rowRanges(snps_to_search)
GRanges object with 7535 ranges and 3 metadata columns:
             seqnames    ranges strand | paramRangeID            REF
                <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
  1-21147708     Chr1  21453721      * |     Region_1              G
  1-21147716     Chr1  21453729      * |     Region_1              C
  1-21147728     Chr1  21453741      * |     Region_1              C
  1-21147835     Chr1  21453848      * |     Region_1              G
  1-21147904     Chr1  21453917      * |     Region_1              C
         ...      ...       ...    ... .          ...            ...
  1-23383616     Chr1  23773254      * |     Region_2              C
  1-23383642     Chr1  23773280      * |     Region_2              C
  1-23383691     Chr1  23773329      * |     Region_2              T
  1-23383758     Chr1  23773396      * |     Region_2              C
  1-23383792     Chr1  23773430      * |     Region_2              A
                         ALT
             <CharacterList>
  1-21147708               T
  1-21147716               G
  1-21147728             T,G
  1-21147835               A
  1-21147904               T
         ...             ...
  1-23383616               T
  1-23383642               T
  1-23383691               A
  1-23383758               T
  1-23383792               G
  -------
  seqinfo: 1 sequence from 1 genome; no seqlengths
seqinfo_txdb <- seqinfo(my_txdb)
seqinfo_txdb
Seqinfo object with 108 sequences (2 circular) from an unspecified genome; no seqlengths:
  seqnames    seqlengths isCircular genome
  1                 <NA>       <NA>   <NA>
  2                 <NA>       <NA>   <NA>
  3                 <NA>       <NA>   <NA>
  4                 <NA>       <NA>   <NA>
  5                 <NA>       <NA>   <NA>
  ...                ...        ...    ...
  B73V4_ctg92       <NA>       <NA>   <NA>
  B73V4_ctg95       <NA>       <NA>   <NA>
  B73V4_ctg97       <NA>       <NA>   <NA>
  B73V4_ctg98       <NA>       <NA>   <NA>
  Pt                <NA>       TRUE   <NA>
seqnames(seqinfo_txdb)[1:10] <- paste0("Chr", seqnames(seqinfo_txdb)[1:10])
seqinfo_txdb
Seqinfo object with 108 sequences (2 circular) from an unspecified genome; no seqlengths:
  seqnames    seqlengths isCircular genome
  Chr1              <NA>       <NA>   <NA>
  Chr2              <NA>       <NA>   <NA>
  Chr3              <NA>       <NA>   <NA>
  Chr4              <NA>       <NA>   <NA>
  Chr5              <NA>       <NA>   <NA>
  ...                ...        ...    ...
  B73V4_ctg92       <NA>       <NA>   <NA>
  B73V4_ctg95       <NA>       <NA>   <NA>
  B73V4_ctg97       <NA>       <NA>   <NA>
  B73V4_ctg98       <NA>       <NA>   <NA>
  Pt                <NA>       TRUE   <NA>
seqinfo(my_txdb, new2old = 1:length(seqinfo_txdb)) <- seqinfo_txdb

TxDb objects

If you want to work a lot with genome annotations, see vignette("GenomicFeatures", package = "GenomicFeatures") to learn more. We can get a GRanges of all genes for example with the genes function:

genes(my_txdb)
  11 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 167 ranges and 1 metadata column:
        seqnames              ranges strand |     gene_id
           <Rle>           <IRanges>  <Rle> | <character>
     A2     Chr5   68025894-68027081      + |          A2
   ACK2     Chr2 243948909-243953867      - |        ACK2
   ADF3     Chr1 298705539-298708107      + |        ADF3
   ALS1     Chr5 167860552-167862743      + |        ALS1
   ANT2     Chr4 166706837-166709814      + |        ANT2
    ...      ...                 ...    ... .         ...
  rps16       Pt           3363-5604      - |       rps16
  rps18       Pt         67532-68044      + |       rps18
   rps2       Pt         31858-32568      + |        rps2
   rps3       Pt         81113-82518      - |        rps3
  ycf70       Pt         12720-15097      - |       ycf70
  -------
  seqinfo: 108 sequences (2 circular) from an unspecified genome; no seqlengths

You can also get GRanges for exons sorted into transcripts:

exonsBy(my_txdb, by = "tx")
GRangesList object of length 133940:
$`1`
GRanges object with 9 ranges and 3 metadata columns:
      seqnames      ranges strand |   exon_id              exon_name exon_rank
         <Rle>   <IRanges>  <Rle> | <integer>            <character> <integer>
  [1]     Chr1 44289-44947      + |         1 Zm00001d027230_T001...         1
  [2]     Chr1 45666-45803      + |         2 Zm00001d027230_T001...         2
  [3]     Chr1 45888-46133      + |         3 Zm00001d027230_T001...         3
  [4]     Chr1 46229-46342      + |         4 Zm00001d027230_T001...         4
  [5]     Chr1 46451-46633      + |         5 Zm00001d027230_T001...         5
  [6]     Chr1 47045-47262      + |         6 Zm00001d027230_T001...         6
  [7]     Chr1 47650-48111      + |         7 Zm00001d027230_T001...         7
  [8]     Chr1 48200-49247      + |         8 Zm00001d027230_T001...         8
  [9]     Chr1 49330-49837      + |         9 Zm00001d027230_T001...         9
  -------
  seqinfo: 108 sequences (2 circular) from an unspecified genome; no seqlengths

$`2`
GRanges object with 2 ranges and 3 metadata columns:
      seqnames        ranges strand |   exon_id              exon_name
         <Rle>     <IRanges>  <Rle> | <integer>            <character>
  [1]     Chr1 122120-122427      + |        10 Zm00001d027235_T001...
  [2]     Chr1 122500-122614      + |        11 Zm00001d027235_T001...
      exon_rank
      <integer>
  [1]         1
  [2]         2
  -------
  seqinfo: 108 sequences (2 circular) from an unspecified genome; no seqlengths

...
<133938 more elements>

Now we have our VCF with SNPs of interest, our reference genome, and our annotation all imported and matched up. Finally, we’re able to predict changes in amino acid sequences resulting from these SNPs.

pc <- predictCoding(snps_to_search, my_txdb, mygenome)
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 287 out-of-bound ranges located on sequences
  1218, 1219, 1220, 11099, 1221, 1223, 11105, 1263, 1264, 1265, 1266,
  11193, 11194, 11195, and 1269. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular() to get the
  lengths and circularity flags of the underlying sequences). You can use
  trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more
  information.
Warning in .predictCodingGRangesList(query, cache[["cdsbytx"]], seqSource, :
records with missing 'varAllele' were ignored
pc
GRanges object with 438 ranges and 15 metadata columns:
             seqnames    ranges strand | paramRangeID            REF
                <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
  1-21465774     Chr1  21771706      + |     Region_1              G
  1-21465774     Chr1  21771706      + |     Region_1              G
  1-21465774     Chr1  21771706      + |     Region_1              G
  1-21465774     Chr1  21771706      + |     Region_1              G
  1-21465774     Chr1  21771706      + |     Region_1              G
         ...      ...       ...    ... .          ...            ...
  1-23223510     Chr1  23609176      + |     Region_2              G
  1-23223510     Chr1  23609176      + |     Region_2              G
  1-23223611     Chr1  23609277      + |     Region_2              G
  1-23223611     Chr1  23609277      + |     Region_2              G
  1-23223611     Chr1  23609277      + |     Region_2              G
                         ALT      varAllele    CDSLOC    PROTEINLOC   QUERYID
             <CharacterList> <DNAStringSet> <IRanges> <IntegerList> <integer>
  1-21465774             A,C              A       253            85      1463
  1-21465774             A,C              A       253            85      1463
  1-21465774             A,C              A       253            85      1463
  1-21465774             A,C              C       253            85      1463
  1-21465774             A,C              C       253            85      1463
         ...             ...            ...       ...           ...       ...
  1-23223510               T              T      1810           604      6974
  1-23223510               T              T       634           212      6974
  1-23223611               C              C      1995           665      6975
  1-23223611               C              C      1911           637      6975
  1-23223611               C              C       735           245      6975
                    TXID             CDSID      GENEID   CONSEQUENCE
             <character>     <IntegerList> <character>      <factor>
  1-21465774        1218 12014,12015,12016        <NA> nonsynonymous
  1-21465774        1219 12014,12015,12016        <NA> nonsynonymous
  1-21465774        1220 12014,12015,12016        <NA> nonsynonymous
  1-21465774        1218 12014,12015,12016        <NA> nonsynonymous
  1-21465774        1219 12014,12015,12016        <NA> nonsynonymous
         ...         ...               ...         ...           ...
  1-23223510        1313 12481,12482,12483        <NA> nonsynonymous
  1-23223510        1322 12481,12482,12483        <NA> nonsynonymous
  1-23223611        1312 12481,12482,12483        <NA> synonymous   
  1-23223611        1313 12481,12482,12483        <NA> synonymous   
  1-23223611        1322 12481,12482,12483        <NA> synonymous   
                   REFCODON       VARCODON         REFAA         VARAA
             <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
  1-21465774            GTT            ATT             V             I
  1-21465774            GTT            ATT             V             I
  1-21465774            GTT            ATT             V             I
  1-21465774            GTT            CTT             V             L
  1-21465774            GTT            CTT             V             L
         ...            ...            ...           ...           ...
  1-23223510            GGC            TGC             G             C
  1-23223510            GGC            TGC             G             C
  1-23223611            ACG            ACC             T             T
  1-23223611            ACG            ACC             T             T
  1-23223611            ACG            ACC             T             T
  -------
  seqinfo: 1 sequence from 1 genome; no seqlengths

There are far fewer rows in pc than in snps_to_search, because only SNPs in protein coding regions are returned. We can see the types of mutation and how many of each were found.

table(pc$CONSEQUENCE)

      nonsense  nonsynonymous not translated     synonymous 
             1            242              4            191 

We have transcript numbers in the TXID column, but would like to convert them to transcript names. We can look up that data in the TxDb object.

my_txnames <- select(my_txdb, pc$TXID, c("TXID", "TXNAME"), "TXID")
'select()' returned many:1 mapping between keys and columns
head(my_txnames)
  TXID                         TXNAME
1 1218 transcript:Zm00001d028054_T001
2 1219 transcript:Zm00001d028054_T002
3 1220 transcript:Zm00001d028054_T003
4 1218 transcript:Zm00001d028054_T001
5 1219 transcript:Zm00001d028054_T002
6 1220 transcript:Zm00001d028054_T003
identical(my_txnames$TXID, as.integer(pc$TXID))
[1] TRUE
pc$TXNAME <- my_txnames$TXNAME

Challenge: data export

Look at ?predictCoding to determine what the various columns mean. Construct a data frame with the information that you find to be most relevant, and export it to CSV.

Solution

This may vary based on subjective opinions of what information is important. Note the use of unlist for PROTEINLOC, and the use of names, seqnames, and start to get GRanges-specific information.

df <- data.frame(SNP = names(pc),
                 Chromosome = seqnames(pc),
                 Position = start(pc),
                 Transcript = pc$TXNAME,
                 Consequence = pc$CONSEQUENCE,
                 AA_position = unlist(pc$PROTEINLOC),
                 Ref_AA = pc$REFAA,
                 Var_AA = pc$VARAA)
write.csv(df, file = "protein_coding_variants.csv", row.names = FALSE)

For human data

If you are working with human variants, you also have access to the SIFT and PolyPhen databases. See section 5 of vignette("VariantAnnotation").

Key Points

  • Genome annotations can either be stored as GRanges imported with rtracklayer, or as TxDb imported with GenomicFeatures.

  • Functions that find overlaps between GRanges objects can be used to identify genes near SNPs.

  • The predictCoding function in VariantAnnotation identifies amino acid changes caused by SNPs.