January 14, 2016

Day 2

Topics:

  • Read mapping and alignments
  • Presentation of different tools
  • Hands-on example: how to create a searchable database

Sequence search vs. Mapping

- BLAST-like searches Mapping
- Similarity search Short read aligner
- Originally for FASTA files For FASTQ files
- Reports hits up to a very low identity Few mismatches and gaps
- Local alignment BWA-SW/Bowtie global alignment, BWA-MEM local
- No standard output SAM/BAM format
- For long sequences For short sequences
- For few sequences For millions of reads
- DNA and protein DNA
- No paired end data Paired-end mapping
- Large database Small index

Sequence search - BLAST

  • Well known: BLAST
  • Not very fast and not made for mio of short reads

Sequence search - BLAST

Sequence search - BLAST

To create a BLAST database we download a large fasta file such as Uniprot:

cd ~/data
mkdir blast
cd blast
wget 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/
  knowledgebase/complete/uniprot_sprot.fasta.gz'

This is the reviewed UniProtKB found here: http://www.uniprot.org/downloads

Unzip the file to uniprot.fasta. Does it contain DNA or protein sequences?

Sequence search - BLAST

We use the tool ncbi-blast which can be installed with:

sudo apt-get install ncbi-blast+

And create the BLAST database with:

makeblastdb -in uniprot.fasta -dbtype prot -parse_seqids 
  -out uniprot.db

Which files did the command create?

Sequence search - BLAST

Now we can search with a sequence file (test_query.fasta) in the Uniprot database.
Since we have DNA, we have to use blastx.

blastx -num_threads 4 -evalue 1 -db uniprot.db -query 
  ../test_query.fasta -out blast_results.txt

Did you find a good hit?

Sequence search - RAPSearch2

An alternative to BLAST for short reads is RAPSearch

Let's download and install it:

cd ..
mkdir rapsearch
cd rapsearch
wget 'http://sourceforge.net/projects/rapsearch2/files/
  RAPSearch2.23_64bits.tar.gz'
tar -zxvf RAPSearch2.23_64bits.tar.gz
cd RAPSearch2.23_64bits
./install

Sequence search - RAPSearch2

  • RAPsearch uses a hash-table of 6-mers with a reduced alphabet to find seeds
  • Uses the same extension algorithm as used in BLAST to extend the seeds
  • Can use FASTQ as input, but does not use quality values
  • A lot faster than BLAST


Image source:
Yuzhen Ye, Jeong-Hyeon Choi and Haixu Tang. RAPSearch: a Fast Protein Similarity Search Tool for Short Reads. BMC Bioinformatics 2011, 12:159.



Sequence search - RAPSearch2

Similar to Blast we first have to create a database:

cd ..
prerapsearch -d ../blast/uniprot.fasta -n uniprot.db

Which files did the command create?

Now we can search the database with our FASTQ file. But let's take just a couple of reads to speed things up.

zcat ../A111AU_L002_R1_001.fastq.gz | head -n 4000 > first_part.fastq
rapsearch -q first_part.fastq -d uniprot.db -o rapsearch_results -e 1

Have a look at rapsearch_results.m8 and rapsearch_results.aln

Mapping - Bowtie2/BWA

  • Programs such as BWA and Bowtie are based on backward search with Burrows–Wheeler Transform (BWT)
  • Efficiently align short sequencing reads against a large reference sequence
  • Uses quality values
  • Supports paired-end data

We will use Bowtie here.

cd ..
mkdir bowtie
cd bowtie
wget 'http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/
  bowtie2-2.2.5-linux-x86_64.zip'
unzip bowtie2-2.2.5-linux-x86_64.zip

Mapping - Bowtie2/BWA

  • Since mapping programs are designed to map reads to a genome rather than a database, we will download a genome and create a database to map our FASTQ sample.

Let's see if we can find Chlamydomonas in the lake samples.

Go to http://www.ncbi.nlm.nih.gov/genome/?term=txid3055[Organism:noexp] and download the FASTA format for genome (top of the page). Move the file to the data folder and unpack it in the file clamy.fasta.

Mapping - Bowtie2/BWA

Now we can create the index for mapping:

cd bowtie
bowtie2-2.2.5/bowtie2-build ../clamy.fasta clamy.index

This we can use to search with our sequences.

bowtie2-2.2.5/bowtie2 -x clamy.index -U ../rapsearch/first_part.fastq 
  -S bowtie_results.sam

Mapping - Bowtie2/BWA

The result can be found in the so called SAM format:

head bowtie_results.sam
tail bowtie_results.sam

Most reads are unmapped, we can find a mapping read with 100 matches with:

grep '100M' bowtie_results.sam

Mapping - Bowtie2/BWA

Col Field Type Brief description
1 QNAME String Query template NAME
2 FLAG Int bitwise FLAG
3 RNAME String Reference sequence NAME
4 POS Int 1-based leftmost mapping POSition
5 MAPQ Int MAPping Quality
6 CIGAR String CIGAR string
7 RNEXT String Ref. name of the mate/next read
8 PNEXT Int Position of the mate/next read
9 TLEN Int observed Template LENgth
10 SEQ String segment SEQuence
11 QUAL String ASCII of Phred-scaled base QUALity+33

Questions?