Genome assembly with SPAdes (presentation)

Slides and text of this presentation

Slide 1
Genome assembly with SPAdes
  Center for Algorithmic Biotechnology, SPbU

Slide 2
Introduction

Slides 3-5
Why assemble?
  Sequencing data
    Billions of short reads
    Sequencing errors
    Contaminants
  Assembly
    Corrects sequencing errors
    Much longer sequences
    Each genomic region is represented only once
    May introduce errors

Slide 6
Assembly basics

Slide 7
Assembly in a perfect world

Slide 8
Assembly in the real world

Slides 9-10
De novo whole genome assembly

Slide 11
Genomic repeats
  TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slides 12-13
Genomic repeats
  Short reads (slide 12 also shows the genome they come from):
  TATTCTTC
      CTTCCACG
          CACGTAGG
                 GGCCTTCC
                    CTTCCACG
                        CACGCTTCG
  TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 14
Genomic repeats
  Two contig sets consistent with the short reads:
  TATTCTTCCACGTAGG    GGCCTTCCACGCTTCG
  TATTCTTCCACGCTTCG   GGCCTTCCACGTAGG

Slides 15-16
Genomic repeats
  Longer reads (slide 15 also shows the genome they come from):
  TATTCTTCCACGTAGG
           ACGTAGGGCCTT
                  GCCTTCCACGCTTCG
  TATTCTTCCACGTAGGGCCTTCCACGCTTCG
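
The ambiguity shown on the repeat slides can be checked mechanically. The sketch below uses only the sequences from the slides: it locates the repeated 8-mer CTTCCACG in the genome and confirms that both contig sets from slide 14 contain every short read, so the short reads alone cannot tell the two apart. The helper names (occurrences, explains) are illustrative and are not part of SPAdes.

```python
# Sequences copied from the "Genomic repeats" slides.
GENOME = "TATTCTTCCACGTAGGGCCTTCCACGCTTCG"
READS = ["TATTCTTC", "CTTCCACG", "CACGTAGG", "GGCCTTCC", "CTTCCACG", "CACGCTTCG"]
LONG_READ = "ACGTAGGGCCTT"  # the read that bridges the repeat copies on slides 15-16


def occurrences(text, pattern):
    """Return every start position at which pattern occurs in text."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def explains(contigs, reads):
    """True if every read is a substring of at least one contig."""
    return all(any(read in contig for contig in contigs) for read in reads)


correct = ["TATTCTTCCACGTAGG", "GGCCTTCCACGCTTCG"]   # matches the genome
swapped = ["TATTCTTCCACGCTTCG", "GGCCTTCCACGTAGG"]   # repeat flanks exchanged

repeat_positions = occurrences(GENOME, "CTTCCACG")
both_fit = explains(correct, READS) and explains(swapped, READS)
long_read_positions = occurrences(GENOME, LONG_READ)
```

The longer read starts inside the first repeat copy and ends inside the second, which is why it can link the unique flanks in the right order.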

Slide 17
SPAdes assembler

Slides 18-20
SPAdes first steps
  spades.py
  spades.py --help
  spades.py --test
  -o <output_dir>

Slide 21
Input data formats
  FASTA: .fasta / .fa
  FASTQ: .fastq / .fq
  Gzipped: .gz

Slide 22
Input data options
  Unpaired reads
    Illumina unpaired
      -s single.fastq
      -s single1.fastq -s single2.fastq ...

Slide 23
Input data options
  Paired-end reads
    Interlaced pairs in one file
      >left_read_id
      ACGTGCAGG…
      >right_read_id
      GCTTCGAGG…
    Separate files
      file1.fastq        file2.fastq
      >left_read_id      >right_read_id
      ACGTGCAGG…         GCTTCGAGG…
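
The two paired-end layouts hold the same records in a different order. As a minimal sketch, assuming each read is already parsed into an (id, sequence) tuple, the hypothetical helpers below convert between them. The sequences are only the visible prefixes from the slide, and no file parsing is attempted.

```python
def interleave(left, right):
    """Merge two equally long record lists into one interlaced list."""
    if len(left) != len(right):
        raise ValueError("left and right must hold the same number of reads")
    merged = []
    for left_record, right_record in zip(left, right):
        merged.extend([left_record, right_record])
    return merged


def deinterleave(records):
    """Split an interlaced record list back into (left, right) lists."""
    if len(records) % 2 != 0:
        raise ValueError("an interlaced file must hold an even number of reads")
    return records[0::2], records[1::2]


# Example records; the sequences are truncated on the slide.
left_reads = [("left_read_id", "ACGTGCAGG")]
right_reads = [("right_read_id", "GCTTCGAGG")]
interlaced = interleave(left_reads, right_reads)
```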

Slides 24-25
Input data options
  Paired-end reads
    Interlaced pairs in one file
      --pe1-12 file.fastq
    Separate files
      --pe1-1 file1.fastq --pe1-2 file2.fastq
      --pe1-s unpaired.fastq

Slide 26
SPAdes performance options
  Number of threads
    -t N
  Maximal available RAM (GB); SPAdes will terminate if it is exceeded
    -m M

Slide 27
Pipeline options
  Run only the assembler (input reads are already corrected or quality-trimmed)
    --only-assembler

Slide 28
Input data options
  Mate-pair reads
    Cannot be used separately
    Interlaced pairs in one file
      --mp1-12 mp.fastq
    Separate files
      --mp1-1 mp1.fastq --mp1-2 mp2.fastq

Slide 29
Hybrid assembly options
  PacBio CLR
    --pacbio pb.fastq
  Oxford Nanopore reads
    --nanopore nanopore_reads.fastq

Slide 30
Restarting SPAdes
  SPAdes / system crashed
    --continue -o your_output_dir
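
The options from the SPAdes slides can be combined into a single command line. The sketch below only builds the argument list and does not run SPAdes. The function and parameter names are hypothetical; the flags are the ones shown on the slides, and the check on mate-pair input mirrors the "Cannot be used separately" note.

```python
def build_spades_command(output_dir, pe_left=None, pe_right=None,
                         pe_interlaced=None, single=(), mp_interlaced=None,
                         pacbio=None, nanopore=None, threads=None,
                         memory_gb=None, only_assembler=False):
    """Return a spades.py argument list built from the slide options."""
    cmd = ["spades.py", "-o", output_dir]
    if pe_left and pe_right:
        cmd += ["--pe1-1", pe_left, "--pe1-2", pe_right]
    if pe_interlaced:
        cmd += ["--pe1-12", pe_interlaced]
    for path in single:
        cmd += ["-s", path]
    has_other_reads = bool((pe_left and pe_right) or pe_interlaced or single)
    if mp_interlaced:
        # Mate-pair libraries need paired-end or unpaired reads alongside them.
        if not has_other_reads:
            raise ValueError("mate-pair reads cannot be used separately")
        cmd += ["--mp1-12", mp_interlaced]
    if pacbio:
        cmd += ["--pacbio", pacbio]
    if nanopore:
        cmd += ["--nanopore", nanopore]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]
    if only_assembler:
        cmd.append("--only-assembler")
    return cmd


def restart_command(output_dir):
    """Return the command that resumes a crashed run in output_dir."""
    return ["spades.py", "--continue", "-o", output_dir]
```

With SPAdes installed, the resulting list could be passed to subprocess.run(cmd, check=True).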

Slide 31
Genome assembly evaluation with QUAST
  Center for Algorithmic Biotechnology, SPbU

Slide 32
In reality

Slide 33
Which assembler to use?
  ABySS
  ALLPATHS-LG
  CLC
  IDBA-UD
  MaSuRCA
  MIRA
  Ray
  SOAPdenovo
  SPAdes
  Velvet
  and many more...

Slide 34
Which assembler to use?
  Different technologies (Illumina, 454, IonTorrent, ...)
  Genome type and size (bacteria, insects, mammals, plants, ...)
  Type of prepared libraries (single reads, paired-end, mate-pairs, combinations)
  Type of data (multicell, metagenomic, single-cell)

Slide 35
There is no best assembler

Slide 36
Which assembler to use?
  Assemblathon 1 & 2
    Simulated and real datasets
    More than 30 teams competing
  Independent studies
    Papers (GAGE, GAGE-B, GABenchToB)
    Websites (nucleotid.es, …)
    Surveys
  Genome assembly evaluation tools
    QUAST
    GAGE

Slide 37
Assembly evaluation
  Basic evaluation
    No extra input
    Very quick
  Reference-based evaluation
    A lot of metrics
    Very accurate
  De novo evaluation
    Advanced analysis of de novo assemblies

Slide 38
Basic statistics

Slides 39-42
Contig sizes
  Number of contigs
  Number of large contigs (i.e. > 1000 bp)
  Largest contig length
  Total assembly length
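
The four contig-size statistics need nothing but the list of contig lengths. A minimal sketch follows, with a hypothetical function name and made-up lengths; the 1000 bp threshold is the one given on the slide.

```python
def contig_size_stats(lengths, large_threshold=1000):
    """Summarise a list of contig lengths (in bp)."""
    return {
        "contigs": len(lengths),
        "large_contigs": sum(1 for n in lengths if n > large_threshold),
        "largest_contig": max(lengths, default=0),
        "total_length": sum(lengths),
    }


# Made-up contig lengths, for illustration only.
stats = contig_size_stats([5000, 3000, 1200, 800, 400])
```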

Slides 43-50
N50
  The maximum length X for which the collection of all contigs of length >= X covers at least 50% of the assembly
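
The N50 definition translates directly into a loop over contigs sorted from longest to shortest: accumulate lengths until half of the total is reached, and the contig that crosses that point gives N50. This is a sketch with made-up lengths, not QUAST's own code.

```python
def n50(lengths):
    """Return N50 for a list of contig lengths, or 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the sum to >= 50% of the total.
        if 2 * running >= total:
            return length
    return 0


# Made-up lengths: total 290, so half is 145; 80 + 70 = 150 reaches it.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```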

Slides 51-52
L50
  The minimum number X such that the X longest contigs cover at least 50% of the assembly

Slides 53-56
N50 variations
  N25, N75
  L25, L50, L75
  Nx, Lx
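
Nx and Lx generalise the 50% threshold to any percentage x: Nx is the length of the contig that brings the running total to x% of the assembly, and Lx is how many contigs that took. A minimal sketch with a hypothetical function name and made-up lengths:

```python
def nx_lx(lengths, x):
    """Return (Nx, Lx) for a percentage x with 0 < x <= 100."""
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 100 * running >= x * total:
            return length, count
    return 0, 0  # empty assembly


contigs = [80, 70, 50, 40, 30, 20]  # made-up lengths, total 290
n25, l25 = nx_lx(contigs, 25)
n50_value, l50 = nx_lx(contigs, 50)
n75, l75 = nx_lx(contigs, 75)
```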

Slides 57-59
Other
  Number of N’s per 100 kbp
  GC %
  Distributions of GC % in small windows
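
These composition statistics can be computed from the contig sequence itself. The sketch below makes two assumptions that the slides do not state: GC % is taken over called bases (A, C, G, T) only, and the windows are non-overlapping with a caller-chosen size. QUAST's exact conventions may differ.

```python
def gc_percent(seq):
    """GC content as a percentage of called bases (assumption: N's excluded)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def ns_per_100kbp(seq):
    """Number of N characters, scaled to a 100 kbp stretch of sequence."""
    if not seq:
        return 0.0
    return seq.upper().count("N") * 100000 / len(seq)


def gc_windows(seq, window):
    """GC % of consecutive non-overlapping windows (trailing partial window dropped)."""
    return [gc_percent(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]
```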

Slide 60
Other

Slide 61
Reference-based metrics

Slide 62
Basic reference statistics
  Reference length
  Reference GC %
  Number of chromosomes

Slides 63-65
Basic reference statistics
  NGx, LGx
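
NGx and LGx follow the same procedure as Nx and Lx, except that the threshold is x% of the reference genome length rather than of the assembly length. In this sketch (hypothetical function name, made-up lengths) the result is None when the assembly is too short to reach the threshold.

```python
def ngx_lgx(lengths, reference_length, x=50):
    """Return (NGx, LGx), or None if the contigs never reach x% of the reference."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 100 * running >= x * reference_length:
            return length, count
    return None


contigs = [80, 70, 50, 40, 30, 20]  # made-up lengths, total 290
ng50_result = ngx_lgx(contigs, reference_length=400)   # threshold is 200 bp
too_short = ngx_lgx(contigs, reference_length=1000)    # threshold is 500 bp
```

Because the denominator is fixed by the reference, NG50 values from different assemblies of the same genome are directly comparable, which is not true of N50.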

Slides 66-74
Alignment statistics
  Genome fraction %
  Duplication ratio
  Number of gaps
  Largest alignment length
  Number of unaligned contigs (full & partial)
  Number of mismatches/indels per 100 kbp
  Number of genes/operons (full & partial)
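
The first three alignment statistics can be illustrated with interval arithmetic on the reference. This is a simplified sketch, not QUAST's implementation: each alignment is assumed to be a half-open (start, end) interval on a single reference sequence, indels are ignored so that aligned assembly bases equal the interval length, and a gap is counted as any uncovered stretch of the reference. All names and coordinates are made up.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def alignment_stats(alignments, reference_length):
    """Genome fraction (%), duplication ratio and number of uncovered stretches."""
    covered = merge_intervals(alignments)
    covered_bases = sum(end - start for start, end in covered)
    aligned_bases = sum(end - start for start, end in alignments)
    gaps, position = 0, 0
    for start, end in covered:
        if start > position:
            gaps += 1
        position = end
    if position < reference_length:
        gaps += 1
    return {
        "genome_fraction": 100.0 * covered_bases / reference_length,
        "duplication_ratio": aligned_bases / covered_bases if covered_bases else 0.0,
        "gaps": gaps,
    }


# Made-up alignments on a 1000 bp reference: two overlap, one region is missed.
stats = alignment_stats([(0, 400), (300, 600), (800, 1000)], reference_length=1000)
```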

Slides 75-76
Misassemblies

Slide 77
There is no best metric

Slides 78-81
NA50

Slide 82
QUality ASsessment Tool for Genome Assemblies

Slide 83
QUAST
  Assembly statistics
    Basic statistics
    Reference-based evaluation
    Simple de novo evaluation
  Available as a web-based and a command-line tool
  quast.sf.net

Slide 84
QUAST: console tool
  quast.py
  quast.py --help

Slide 85
QUAST basics
  quast.py
  quast.py --help
  quast.py contigs.fasta
  quast.py [options] contigs.fasta
  quast.py -o out_dir contigs.fasta

Slide 86
Reference options
  Reference genome
    -R reference.fasta
  Gene annotation
    -G genes.gff
  Operon annotation
    -O operons.gff
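
The QUAST options above combine in the same way as the SPAdes ones. The sketch below assembles an argument list using the flags exactly as spelled on the slides; option spelling may differ between QUAST releases, so checking quast.py --help for the installed version is advisable. The function name is hypothetical and nothing is executed.

```python
def build_quast_command(contigs, output_dir=None, reference=None,
                        genes=None, operons=None):
    """Return a quast.py argument list for one or more contig files."""
    cmd = ["quast.py"]
    if output_dir:
        cmd += ["-o", output_dir]
    if reference:
        cmd += ["-R", reference]   # flag as shown on the slide
    if genes:
        cmd += ["-G", genes]       # flag as shown on the slide
    if operons:
        cmd += ["-O", operons]     # flag as shown on the slide
    cmd += list(contigs)
    return cmd


basic = build_quast_command(["contigs.fasta"], output_dir="out_dir")
with_reference = build_quast_command(
    ["contigs.fasta"], reference="reference.fasta",
    genes="genes.gff", operons="operons.gff")
```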

Slide 87
QUAST output
  Reports in different formats
    Plain text table
    Tab-separated values (Excel, Google Spreadsheets)
    Interactive HTML
  Plots (PDF/PNG/SVG)
    Nx, NGx, NAx
    Genes
    Cumulative length
  Interactive contig viewers (Icarus)
    Contig alignment viewer
    Contig size viewer

Slides 88-89
Contig alignment viewer
  All alignments for each contig
  Misassembly details
  Contig ordering along the genome
  Overlaps / gaps

Slides 90-91
Contig size viewer
  Contigs ordered from longest to shortest
  N50, N75 (NG50, NG75)
  Filtration by contig size
  Gene prediction results
  Available without a reference

Slide 92
De novo evaluation

Slides 93-94
Read-based statistics
  Number of aligned/unaligned reads
  % of assembly covered by reads
  Points with low coverage
  Points with multiple read clipping
  Points with incorrect insert sizes
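
Two of the read-based statistics, the percentage of the assembly covered by reads and the points with low coverage, can be derived from a per-base depth profile. The sketch below assumes reads have already been aligned and are given as half-open (start, end) intervals on a single contig; the names, the intervals and the depth threshold are made up, and this is not how QUAST computes them internally.

```python
def coverage_profile(read_alignments, assembly_length):
    """Per-base read depth, computed with a difference array."""
    diff = [0] * (assembly_length + 1)
    for start, end in read_alignments:
        diff[start] += 1
        diff[end] -= 1
    profile, depth = [], 0
    for delta in diff[:assembly_length]:
        depth += delta
        profile.append(depth)
    return profile


def percent_covered(profile):
    """Percentage of positions covered by at least one read."""
    if not profile:
        return 0.0
    return 100.0 * sum(1 for depth in profile if depth > 0) / len(profile)


def low_coverage_points(profile, min_depth):
    """Positions whose depth falls below min_depth."""
    return [i for i, depth in enumerate(profile) if depth < min_depth]


# Made-up read alignments on a 10 bp contig.
profile = coverage_profile([(0, 4), (2, 6), (8, 10)], assembly_length=10)
```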

Slides 95-97
Annotation-based statistics
  Number of ORFs
  Number of gene/operon-like regions
    GeneMarkS (Borodovsky et al.)
    GlimmerHMM (Majoros et al.)
  Number of conserved genes
    BUSCO (Simão et al.)
    CEGMA (Korf et al., no longer supported)

Slide 98
Thank you!
  Questions?

