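Next-generation sequencing
2012.9.20

Resources
  Data sources:
    http://www.ncbi.nlm.nih.gov/sra
    http://www.ebi.ac.uk/ena/
  Download methods: Aspera, FTP
  Data analysis discussion sites: http:/ Oasis
  Data quality analysis tool: the Fastx-toolkit package
  Visualization tools for next-generation data: IGV, Savant, samtools, Gbrowser
  Output formats for next-generation results: SAM, BAM
    http:/

Example: a line in a SAM file produced by BWA (columns are tab-separated in the file):
  HWUSI-EAS172:628C8:4:1:1138:2718 16 chr20 46264803 37 76M * 0 0 ACCCAAGTAAAGTAAGCAATCAGGATTCCAAGAGTCCTCTGGGCGTTTATTGCGACCAAAATCCAGTGGGGAGTTC #?:?=:?ABC?5:8(6:*0:42C-=?C:D:D:(:)1=:DDDBD=:?DD-DDD;B;66= XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:1A42T24A6

SAM mandatory fields
  1. QNAME: read name
  2. FLAG: bitwise flag describing how the read mapped to the chromosome
  3. RNAME: chromosome (reference sequence) name
  4. START: first position on the chromosome covered by the alignment
  5. MAPPING QUALITY: quality of the mapping
  6. CIGAR: description of the alignment (H, S, M)
  7. MRNM: name of the mate's reference sequence
  8. MPOS: start position of the mate
  9. ISIZE: distance between the outermost bases of the two reads
  10. SEQ: query sequence, that is, the read reported on the same strand as the reference genome
  11. QUAL: base qualities of the read (ASCII code minus 33)

SAM FLAG
  Hex     Decimal  Meaning
  0x0001  1        the read is paired in sequencing
  0x0002  2        the read is mapped in a proper pair
  0x0004  4        the query sequence itself is unmapped
  0x0008  8        the mate is unmapped
  0x0010  16       strand of the query (set when reverse)
  0x0020  32       strand of the mate (set when reverse)
  0x0040  64       the read is the first read in a pair
  0x0080  128      the read is the second read in a pair
  0x0100  256      the alignment is not primary
  0x0200  512      QC failure
  0x0400  1024     optical or PCR duplicate

SAM optional fields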
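Before the individual tags are listed, the record layout can be made concrete with a small sketch. The Python below is only an illustration, not part of BWA or samtools: it splits one SAM line into the eleven mandatory columns, decodes the FLAG bits from the table above, and reads the optional TAG:TYPE:VALUE fields. The function names and the shortened example record are hypothetical; the column names POS and MAPQ are the SAM specification's names for what the slides call START and MAPPING QUALITY.

```python
# Bit values and meanings taken from the SAM FLAG table above.
FLAG_BITS = {
    0x0001: "paired in sequencing",
    0x0002: "mapped in a proper pair",
    0x0004: "query unmapped",
    0x0008: "mate unmapped",
    0x0010: "query on reverse strand",
    0x0020: "mate on reverse strand",
    0x0040: "first read in pair",
    0x0080: "second read in pair",
    0x0100: "not primary alignment",
    0x0200: "QC failure",
    0x0400: "optical or PCR duplicate",
}

MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_COLUMNS = ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE")


def decode_flag(flag):
    """Return the meanings of all bits that are set in a SAM FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


def parse_sam_line(line):
    """Split one tab-separated alignment line into mandatory columns and tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, cols[:11]))
    for key in INTEGER_COLUMNS:
        record[key] = int(record[key])
    tags = {}
    for field in cols[11:]:
        tag, value_type, value = field.split(":", 2)
        # Type 'i' is an integer; other types are kept as strings here.
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    return record


# Hypothetical, shortened record (not the full example line from the slides).
example = "\t".join([
    "read1", "16", "chr20", "46264803", "37", "8M", "*", "0", "0",
    "ACCCAAGT", "IIIIIIII", "XT:A:U", "NM:i:1", "MD:Z:1A6",
])

rec = parse_sam_line(example)
print(decode_flag(rec["FLAG"]))
print(rec["TAGS"])
```

A FLAG of 16, as in the BWA example line, therefore means only that the read aligned to the reverse strand; composite values such as 99 decode into several bits at once.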
  NM  Edit distance: number of bases that differ from the reference genome
  MD  Mismatching positions/bases
  X0  Number of best hits
  X1  Number of suboptimal hits
  XN  Number of ambiguous bases (N) in the reference genome
  XM  Number of mismatched bases
  XO  Number of gap opens
  XG  Number of gap extensions
  XT  Type: Unique/Repeat/N/Mate-sw
  XA  Other mapping positions

Next-generation data analysis workflow

Decompressing SRA-format data
  fastq-dump [options]
    -A/--accession  give the extracted files a new name
    --split-3       split paired-end sequencing data
  Order 1) fastq-dump --split-3 SRR427121.lite.sra

Read filter: Fastx-Toolkit
  1) $ fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]
  2) $ fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]
  3) $ fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]
  4) $ fastx_nucleotide_distribution_graph.sh [-p] [-i INPUT.TXT] [-o OUTPUT] [-t TITLE]
  5) $ fastx_trimmer [-h] [-f N] [-l N] [-t N] [-m MINLEN] [-z] [-v] [-i INFILE] [-o OUTFILE]
  6) $ fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
  http://hannonlab.cshl.edu/fastx_toolkit/galaxy.html

Reads quality stats
  fastx_quality_stats -i in.fastq -o out.stat

Shotgun reads trim
  Trimmed site:
  fastq_quality_boxplot_graph.sh -i out.stat -o output -t title
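To show what a per-position quality table contains, here is a minimal Python sketch of the idea behind the statistics step. It is not the fastx_quality_stats implementation and does not reproduce its output columns; it assumes four-line FASTQ records with Phred+33 qualities (the encoding selected with -Q33 on the Fastx-Toolkit command line), and the two demo reads are made up.

```python
import io
from statistics import mean, median


def phred33(qual):
    """Convert a Phred+33 quality string into integer quality scores."""
    return [ord(char) - 33 for char in qual]


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def per_position_stats(handle):
    """Collect the quality scores seen at each read position and summarise them."""
    columns = []
    for _name, _seq, qual in read_fastq(handle):
        for index, score in enumerate(phred33(qual)):
            if index == len(columns):
                columns.append([])
            columns[index].append(score)
    return [
        {
            "position": index + 1,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for index, scores in enumerate(columns)
    ]


# Two made-up reads: 'I' encodes quality 40, '5' encodes 20, '#' encodes 2.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5I#\n"

for row in per_position_stats(io.StringIO(demo)):
    print(row)
```

A drop in the mean or median towards the 3' end of the reads in such a table is what the box plot makes visible, and it is the usual basis for choosing a trimming position.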
Fastx-Toolkit in practice
  Order 2)   nohup fastx_quality_stats -i SRR427121_1.fastq -o SRR_1.stat -Q33 &
  Order 3.1) fastq_quality_boxplot_graph.sh -i SRR_1.stat -o SRR_1.png
             [Figure: SRR_1.png]
  Order 3.2) fastx_nucleotide_distribution_graph.sh -i *stat -o SRR_1_nucleotide_distribution
             [Figure: SRR_1_nucleotide_distribution.png]

Fastx-Toolkit result analysis
  Trimmed site:
  fastq_quality_boxplot_graph.sh -i out.stat -o output -t title

Trim fastq
  fastq_quality_trimmer -t 20 -l 15 -i SRR427121_1.fastq -o ecoli_1.fq -Q33
  fastq_quality_trimmer -t 20 -l 15 -i SRR427121_2.fastq -o ecoli_2.fq -Q33

Reference genome mapping: BWA
  1) Build the index
     bwa index [-p prefix] [-a algoType] [-c]
       -p  name of the index to build
       -a  algorithm used to build the index: "is" suits genomes smaller than 2 GB, "bwtsw" suits genomes larger than 10 MB
       -c  build a color-space index, for aligning SOLiD data
     bwa index -p Ecoli -a is NC_000913.fna

  2) aln
     bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual]
     perl get_consesus_read.pl ecoli_1.fq ecoli_2.fq ecoli_1_trim.fq ecoli_2_trim.fq ecoli_trim.fq &
     bwa aln -t 10 -f ecoli_1_trim.sai ./././chromosome/Ecoli ./ecoli_1_trim.fq &
     bwa aln -t 10 -f ecoli_2_trim.sai ./././chromosome/Ecoli ./ecoli_2_trim.fq &
     bwa aln -t 10 -f ecoli_trim.sai ./././chromosome/Ecoli ./ecoli_trim.fq

  3) samse
     bwa samse [-n maxOcc]
     bwa samse -f ecoli_trim.sam ./././chromosome/Ecoli ecoli_trim.sai ./ecoli_trim.fq

  4) sampe
     bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P]
     bwa sampe -f ecoli_trim_paired.sam ./././chromosome/Ecoli ecoli_1_trim.sai ecoli_2_trim.sai ./ecoli_1_trim.fq ./ecoli_2_trim.fq

Reference genome mapping: bowtie
  Build the index
     bowtie-build [options]* <reference_in> <ebwt_base>
       -f          reference input files (FASTA)
       -c          reference sequences given on the command line
       -C/--color  color-space index (for SOLiD)
     bowtie-build -f NC_000913.fna ecoli

  Align
     bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>}

  Bowtie required input arguments
       -1    comma-separated list of files with mate 1 reads
       -2    comma-separated list of files with mate 2 reads
       --12  tab-delimited file containing both mates
       <s>   files with single-end reads
       -q    input files are in FASTQ format
       -f    input files are in FASTA format
       -r    input files contain sequences only, one per line
       -C    align color-space reads
       -Q    quality file, used together with -f and -C
       --Q1/--Q2  quality files for the two mates, used together with -f, -1/-2 and -C
       Quality encodings: --integer-quals, --solexa1.3-quals, --solexa-quals, --phred33-quals, --phred64-quals

  Alignment options
       -v    maximum number of mismatches allowed
       -l    seed length; it affects speed, and the larger l is, the faster the run
       -n    number of mismatches allowed in the seed
       -I    minimum insert size allowed for paired reads
       -X    maximum insert size allowed for paired reads
       --fr  default mate orientation (5'-3' / 3'-5')

  Output options
       -k        maximum number of mapping positions reported per read
       -a        report all mapping positions
       -m        do not report reads with more than m mapping positions
       -M        for reads with more than M mapping positions, report one at random
       --best    make sure bowtie reports the best mapping (single-end reads only)
       --strata  companion to --best: report only alignments in the best stratum (single-end reads only)

  SAM output and multithreading
       -S    write the output in SAM format (output.sam)
       -p    number of threads to use
     bowtie ./././chromosome/ecoli --phred33-quals -p 16 -1 ./ecoli_1_trim.fq -2 ./ecoli_2_trim.fq -S ecoli.sam

SAM is just the beginning
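As one small example of working with the SAM output afterwards, the sketch below interprets the CIGAR column to find which reference positions an alignment covers. It is an illustrative Python fragment, not part of BWA, bowtie or samtools; the function names are hypothetical, and the sets of operations that consume reference or query bases follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; '*' means no CIGAR available."""
    if cigar == "*":
        return []
    ops = CIGAR_OP.findall(cigar)
    # Reject strings that contain anything other than valid length/op pairs.
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return [(int(length), op) for length, op in ops]


def reference_span(start, cigar):
    """Return the 1-based (first, last) reference positions covered, or None."""
    covered = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    if covered == 0:
        return None
    return start, start + covered - 1


def query_length(cigar):
    """Return the number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


# Start position and CIGAR taken from the BWA example line in these slides.
print(reference_span(46264803, "76M"))
# A made-up CIGAR with soft clipping and a deletion.
print(reference_span(100, "5S70M1D1M"), query_length("5S70M1D1M"))
```

For the 76M alignment in the example line, the read covers 76 consecutive reference bases starting at position 46264803; soft-clipped bases (S) count towards the read length but not towards the reference span, and hard-clipped bases (H) count towards neither.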