genomic-medicine-sweden/nallo: Output
Introduction
This document describes the pipeline output files and the tools used to generate them.
Aligned reads
Minimap2 is used to map the reads to a reference genome. The aligned reads are sorted, merged and indexed using samtools. If the pipeline is run with phasing, the aligned reads will be haplotagged using the active phasing tool.
Path | Description | Alignment | Alignment & phasing |
---|---|---|---|
aligned_reads/minimap2/{sample}/*.bam | Alignment file in BAM format | | |
aligned_reads/minimap2/{sample}/*.bai | Index of the corresponding BAM file | | |
Path | Description | Alignment | Alignment & phasing |
---|---|---|---|
aligned_reads/{sample}/{sample}_haplotagged.bam | BAM file with haplotags | | |
aligned_reads/{sample}/{sample}_haplotagged.bam.bai | Index of the BAM file | | |
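As a convenience, the alignment outputs listed above can be collected programmatically. The following is a minimal sketch, not part of the pipeline: the function name `find_alignments` and the idea of accepting both `*.bam.bai` and `*.bai` index names are assumptions made for illustration, based only on the path patterns in the tables.

```python
from pathlib import Path


def find_alignments(outdir, sample):
    """Return (bam, index) pairs for one sample under aligned_reads/.

    Hypothetical helper: the directory layout is taken from the tables above.
    The index is None when no matching index file is found.
    """
    base = Path(outdir) / "aligned_reads"
    pairs = []
    for bam in sorted(base.rglob("*.bam")):
        # Only keep files that sit inside a directory named after the sample
        if sample not in bam.parts:
            continue
        # Accept either index naming scheme: x.bam.bai or x.bai
        candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
        index = next((c for c in candidates if c.exists()), None)
        pairs.append((bam, index))
    return pairs
```

Calling `find_alignments("results", "sample1")` would then return every BAM file for `sample1`, haplotagged or not, together with its index when one exists.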
Assembly
Hifiasm is used to assemble genomes. The assembled haplotypes are then converted to fasta files using gfastats. A deconstructed version of dipcall is used to map the assembled haplotypes back to the reference genome.
Path | Description |
---|---|
assembly_haplotypes/gfastats/{sample}/*hap1.p_ctg.fasta.gz | Assembled haplotype 1 |
assembly_haplotypes/gfastats/{sample}/*hap2.p_ctg.fasta.gz | Assembled haplotype 2 |
assembly_haplotypes/gfastats/{sample}/*.assembly_summary | Summary statistics |
assembly_variant_calling/dipcall/{sample}/*hap1.bam | Assembled haplotype 1 mapped to the reference genome |
assembly_variant_calling/dipcall/{sample}/*hap1.bai | Index of the corresponding BAM file for haplotype 1 |
assembly_variant_calling/dipcall/{sample}/*hap2.bam | Assembled haplotype 2 mapped to the reference genome |
assembly_variant_calling/dipcall/{sample}/*hap2.bai | Index of the corresponding BAM file for haplotype 2 |
Methylation pileups
Modkit is used to create methylation pileups, producing bedMethyl files for both haplotagged and ungrouped reads. Additionally, methylation information can be viewed in the BAM files, for example in IGV. When phasing is on, modkit outputs pileups per haplotype.
Path | Description | Alignment | Alignment & phasing |
---|---|---|---|
methylation/modkit/pileup/{sample}/*.modkit_pileup_phased_*.bed.gz | bedMethyl file with summary counts from haplotagged reads | | |
methylation/modkit/pileup/{sample}/*.modkit_pileup_phased_ungrouped.bed.gz | bedMethyl file for ungrouped reads | | |
methylation/modkit/pileup/{sample}/*.modkit_pileup.bed.gz | bedMethyl file with summary counts from all reads | | |
methylation/modkit/pileup/{sample}/*.bed.gz.tbi | Index of the corresponding bedMethyl file | | |
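To give a feel for what the bedMethyl files contain, here is a minimal sketch that averages the modification percentage over well-covered sites. It assumes the column layout documented for modkit pileups, where column 10 holds the valid coverage and column 11 the percent modified; the function name and the `min_cov` threshold are hypothetical, so check the layout against the modkit version in use.

```python
import gzip


def mean_fraction_modified(path, min_cov=5):
    """Average the 'percent modified' column over sites with enough coverage.

    Assumption: column 10 is valid coverage and column 11 is percent modified.
    Returns None when no site passes the coverage threshold.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    total, n_sites = 0.0, 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            # Split on any whitespace so tab- and space-separated tails both work
            fields = line.split()
            if len(fields) < 11:
                continue
            if int(fields[9]) >= min_cov:
                total += float(fields[10])
                n_sites += 1
    return total / n_sites if n_sites else None
```

The same function can be pointed at the per-haplotype files to compare haplotypes, since they share the format.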
MultiQC
MultiQC generates an HTML report summarizing all samples' QC results and pipeline statistics.
Path | Description |
---|---|
multiqc/multiqc_report.html | HTML report summarizing QC results |
multiqc/multiqc_data/ | Directory containing parsed statistics |
multiqc/multiqc_plots/ | Directory containing static report images |
Pipeline Information
Nextflow generates reports for troubleshooting, performance, and traceability.
Path | Description |
---|---|
pipeline_info/execution_report.html | Execution report |
pipeline_info/execution_timeline.html | Timeline report |
pipeline_info/execution_trace.txt | Execution trace |
pipeline_info/pipeline_dag.dot | Pipeline DAG in DOT format |
pipeline_info/pipeline_report.html | Pipeline report |
pipeline_info/software_versions.yml | Software versions used in the run |
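For troubleshooting, the execution trace is often the quickest file to inspect. The sketch below counts tasks per status. It assumes the trace is tab-separated with a header row that includes a `status` column, which is the case for Nextflow's default trace fields; the function name is hypothetical.

```python
import csv
from collections import Counter


def task_status_counts(trace_path):
    """Count tasks per status (e.g. COMPLETED, FAILED) in a Nextflow trace file.

    Assumption: the file is tab-separated and has a 'status' column.
    """
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```

A non-zero count for any status other than `COMPLETED` or `CACHED` is usually a good starting point when a run misbehaves.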
Phasing
LongPhase, WhatsHap, or HiPhase is used for phasing.
Path | Description |
---|---|
aligned_reads/{sample}/{sample}_haplotagged.bam | BAM file with haplotags |
aligned_reads/{sample}/{sample}_haplotagged.bam.bai | Index of the BAM file |
phased_variants/{sample}/*.vcf.gz | VCF file with phased variants |
phased_variants/{sample}/*.vcf.gz.tbi | Index of the VCF file |
qc/phasing_stats/{sample}/*.blocks.tsv | Phase block file |
qc/phasing_stats/{sample}/*.stats.tsv | Phasing statistics file |
QC
FastQC, cramino, mosdepth, and somalier are used for read quality control.
FastQC
FastQC provides general quality metrics for sequenced reads, including information on quality score distribution, per-base sequence content (%A/T/G/C), adapter contamination, and overrepresented sequences. For more details, refer to the FastQC help pages.
Path | Description |
---|---|
qc/fastqc/{sample}/*_fastqc.html | FastQC report containing quality metrics |
qc/fastqc/{sample}/*_fastqc.zip | Zip archive with the FastQC report, data files, and plot images |
Mosdepth
Mosdepth is used to report quality control metrics such as coverage and GC content from alignment files.
Path | Description | With --target_regions | Without --target_regions |
---|---|---|---|
qc/mosdepth/{sample}/*.mosdepth.global.dist.txt | Cumulative distribution of bases covered for at least a given coverage value, across chromosomes and the whole genome | | |
qc/mosdepth/{sample}/*.mosdepth.summary.txt | Mosdepth summary file | | |
qc/mosdepth/{sample}/*.mosdepth.region.dist.txt | Cumulative distribution of bases covered for at least a given coverage value, across regions | | |
qc/mosdepth/{sample}/*.regions.bed.gz | Depth per region | | |
qc/mosdepth/{sample}/*.regions.bed.gz.csi | Index of the regions.bed.gz file | | |
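As an illustration of how the summary file can be used, the sketch below reads the mean coverage for one row. It assumes the mosdepth summary layout: a tab-separated file with the header `chrom`, `length`, `bases`, `mean`, `min`, `max` and a final `total` row. The function name is hypothetical.

```python
import csv


def mean_coverage(summary_path, name="total"):
    """Return the mean coverage for one row of a mosdepth summary file.

    Assumption: the file has 'chrom' and 'mean' columns and a 'total' row.
    Returns None when the requested row is missing.
    """
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["chrom"] == name:
                return float(row["mean"])
    return None
```

Passing a chromosome name instead of `total` returns the per-chromosome mean, which is handy for a quick sex-chromosome coverage check.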
Cramino
cramino is used to analyze both phased and unphased reads.
Path | Description |
---|---|
qc/cramino/phased/{sample}/*.arrow | Read length and quality in Apache Arrow format |
qc/cramino/phased/{sample}/*.txt | Summary information in text format |
qc/cramino/unphased/{sample}/*.arrow | Read length and quality in Apache Arrow format |
qc/cramino/unphased/{sample}/*.txt | Summary information in text format |
Somalier
somalier checks relatedness and sex.
Path | Description |
---|---|
pedigree/family/{family}.ped | PED file updated with somalier-inferred sex per family |
qc/somalier/relate/{project}/{project}.html | HTML report |
qc/somalier/relate/{project}/{project}.pairs.tsv | Information about sample pairs |
qc/somalier/relate/{project}/{project}.samples.tsv | Information about individual samples |
DeepVariant
vcf_stats_report.py from DeepVariant is used to generate an HTML report per sample.
Path | Description |
---|---|
qc/deepvariant_vcfstatsreport/{sample}/{sample}.visual_report.html | Visual report of SNV calls from DeepVariant |
Variants
In general, annotated variant calls are output per family while unannotated calls are output per sample.
Paralogous genes
Paraphase is used to call paralogous genes.
Path | Description |
---|---|
paraphase/{sample}/*.bam | BAM file with reads from analysed regions |
paraphase/{sample}/*.bai | Index of the BAM file |
paraphase/{sample}/*.json | Summary of haplotypes and variant calls |
paraphase/{sample}/{sample}_paraphase_vcfs/{sample}_{gene}_vcf.gz | VCF file per gene |
paraphase/{sample}/{sample}_paraphase_vcfs/{sample}_{gene}_vcf.gz.tbi | Index of the VCF file |
Repeats
TRGT is used to call repeats.
Path | Description | Call repeats | Call & annotate repeats |
---|---|---|---|
repeats/family/{family}/{family}_repeat_expansions.vcf.gz | Merged VCF file per family | | |
repeats/family/{family}/{family}_repeat_expansions.vcf.gz.tbi | Index of the VCF file | | |
repeats/sample/{sample}/{sample}_sorted.vcf.gz | VCF file with called repeats for a sample | | |
repeats/sample/{sample}/{sample}_sorted.vcf.gz.tbi | Index of the VCF file | | |
repeats/sample/{sample}/{sample}_spanning_sorted.bam | BAM file with sorted spanning reads | | |
repeats/sample/{sample}/{sample}_spanning_sorted.bai | Index of the BAM file | | |
Stranger is used to annotate repeats.
Path | Description | Call repeats | Call & annotate repeats |
---|---|---|---|
repeat_expansions/family/{family}/{family}_repeat_expansions_annotated.vcf.gz | Merged, annotated VCF file per family | | |
repeat_expansions/family/{family}/{family}_repeat_expansions_annotated.vcf.gz.tbi | Index of the VCF file | | |
SNVs
DeepVariant is used to call variants, while bcftools and GLnexus are used for merging variants.
Path | Description | Call SNVs | Call & annotate SNVs | Call, annotate and rank SNVs |
---|---|---|---|---|
snvs/sample/{sample}/{sample}_snv.vcf.gz | VCF file containing called variants with alternative genotypes for a sample | | | |
snvs/sample/{sample}/{sample}_snv.vcf.gz.tbi | Index of the corresponding VCF file | | | |
snvs/stats/sample/*.stats.txt | Variant statistics | | | |
qc/deepvariant_vcfstatsreport/{sample}/{sample}.visual_report.html | Visual report of SNV calls from DeepVariant | | | |
snvs/family/{family}/{family}_snv.vcf.gz | VCF file containing called variants for all samples | | | |
snvs/family/{family}/{family}_snv.vcf.gz.tbi | Index of the corresponding VCF file | | | |
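A quick sanity check on the sample and family VCFs is to list the samples they contain and count their records. The sketch below uses only the Python standard library and relies on bgzip-compressed files being readable as gzip; the function name is hypothetical, and a dedicated VCF library would be preferable for anything beyond counting.

```python
import gzip


def vcf_summary(path):
    """Return (sample names, number of records) for a gzip/bgzip-compressed VCF."""
    samples, n_records = [], 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                # Meta-information lines carry no records
                continue
            if line.startswith("#CHROM"):
                # Sample names follow the nine fixed VCF columns
                samples = line.rstrip("\n").split("\t")[9:]
            elif line.strip():
                n_records += 1
    return samples, n_records
```

For a family VCF, the returned sample list should match the members of that family, while a per-sample VCF should list a single sample.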
Annotation
Echtvar and VEP are used for annotating SNVs, while CADD is used to annotate INDELs with CADD scores.
Path | Description | Call SNVs | Call & annotate SNVs | Call, annotate and rank SNVs |
---|---|---|---|---|
snvs/sample/{sample}/{sample}_snvs_annotated.vcf.gz | VCF file containing annotated variants with alternative genotypes for a sample | | | |
snvs/sample/{sample}/{sample}_snvs_annotated.vcf.gz.tbi | Index of the annotated VCF file | | | |
snvs/family/{family}/{family}_snvs_annotated.vcf.gz | VCF file containing annotated variants per family | | | |
snvs/family/{family}/{family}_snvs_annotated.vcf.gz.tbi | Index of the annotated VCF file | | | |
Ranking
GENMOD is used to rank the annotated SNVs and INDELs.
Path | Description | Call SNVs | Call & annotate SNVs | Call, annotate and rank SNVs |
---|---|---|---|---|
snvs/sample/{sample}/{sample}_snvs_annotated_ranked.vcf.gz | VCF file with annotated and ranked variants for a sample | | | |
snvs/sample/{sample}/{sample}_snvs_annotated_ranked.vcf.gz.tbi | Index of the ranked VCF file | | | |
snvs/family/{family}/{family}_snvs_annotated_ranked.vcf.gz | VCF file with annotated and ranked variants per family | | | |
snvs/family/{family}/{family}_snvs_annotated_ranked.vcf.gz.tbi | Index of the ranked VCF file | | | |
Filtering
Filter_vep and bcftools can be used to filter variants. Filtered files are output if either --filter_variants_hgnc_id or --filter_snvs_expression has been used. Only family VCFs are filtered.
Path | Description |
---|---|
snvs/{family}/{family}_*_filtered.vcf.gz | VCF file with filtered variants for a family |
snvs/{family}/{family}_*_filtered.vcf.gz.tbi | Index of the filtered VCF file |
Tip
Filtered variants are output alongside unfiltered variants as additional files.
SVs (and CNVs)
Severus or Sniffles is used to call structural variants, while HiFiCNV is used to call CNVs. HiFiCNV also produces copy number, depth, and MAF visualization tracks.
Variant merging strategies
SV and CNV calls are output unmerged per sample. For the family files, SVs and CNVs are first merged between samples, each type separately. The merged SV and CNV files are then merged with each other, with priority given to coordinates from the SV calls.
Path | Description | Call SVs | Call CNVs | Call SVs & CNVs |
---|---|---|---|---|
svs/sample/{sample}/{sample}_svs.vcf.gz | VCF file with SVs per sample | | | |
svs/sample/{sample}/{sample}_svs.vcf.gz.tbi | Index of the SV VCF file | | | |
svs/sample/{sample}/{sample}_cnvs.vcf.gz | VCF file with CNVs per sample | | | |
svs/sample/{sample}/{sample}_cnvs.vcf.gz.tbi | Index of the CNV VCF file | | | |
svs/family/{family_id}/{family_id}_svs_merged.vcf.gz | VCF file with merged SVs per family | | | |
svs/family/{family_id}/{family_id}_svs_merged.vcf.gz.tbi | Index of the merged VCF file | | | |
svs/family/{family_id}/{family_id}_cnvs_svs_merged.vcf.gz | VCF file with merged CNVs and SVs per family | | | |
svs/family/{family_id}/{family_id}_cnvs_svs_merged.vcf.gz.tbi | Index of the merged VCF file | | | |
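The merging order described under "Variant merging strategies" can be pictured with a toy model. This is only a sketch of the idea and not the pipeline's implementation: this document does not state which tool or overlap rule performs the merge, so the dictionary representation of calls, the `min_overlap` threshold, and all function names below are assumptions made for illustration.

```python
from functools import reduce


def overlap_fraction(a, b):
    """Overlap length divided by the longer of the two intervals."""
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
    longest = max(a["end"] - a["start"], b["end"] - b["start"])
    return max(shared, 0) / longest if longest > 0 else 0.0


def same_event(a, b, min_overlap):
    """Toy rule: same chromosome, same type, and sufficient overlap."""
    return (
        a["chrom"] == b["chrom"]
        and a["svtype"] == b["svtype"]
        and overlap_fraction(a, b) >= min_overlap
    )


def merge_calls(primary, secondary, min_overlap=0.6):
    """Merge two call sets, keeping the primary coordinates for shared events."""
    merged = list(primary)
    for call in secondary:
        if not any(same_event(kept, call, min_overlap) for kept in merged):
            merged.append(call)
    return merged


def merge_family(svs_per_sample, cnvs_per_sample):
    """Merge SVs and CNVs across samples separately, then combine them."""
    family_svs = reduce(merge_calls, svs_per_sample, [])
    family_cnvs = reduce(merge_calls, cnvs_per_sample, [])
    # SV calls are passed as primary, so their coordinates take priority
    return merge_calls(family_svs, family_cnvs)
```

In this model, a deletion called by both the SV caller and the CNV caller ends up as a single record carrying the SV caller's coordinates, while a CNV with no matching SV is kept unchanged.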
Annotation
SVDB and VEP are used to annotate structural variants.
Path | Description | Call & annotate SVs | Call & annotate SVs & CNVs |
---|---|---|---|
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated.vcf.gz | VCF file with merged and annotated CNVs and SVs per family | | |
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated.vcf.gz.tbi | Index of the merged VCF file | | |
svs/family/{family_id}/{family_id}_svs_merged_annotated.vcf.gz | VCF file with merged and annotated SVs per family | | |
svs/family/{family_id}/{family_id}_svs_merged_annotated.vcf.gz.tbi | Index of the merged VCF file | | |
Ranking
GENMOD is used to rank the annotated SVs.
Path | Description | Rank SVs | Rank SVs & CNVs |
---|---|---|---|
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated_ranked.vcf.gz | VCF file with merged, annotated and ranked CNVs and SVs per family | | |
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated_ranked.vcf.gz.tbi | Index of the merged VCF file | | |
svs/family/{family_id}/{family_id}_svs_merged_annotated_ranked.vcf.gz | VCF file with merged, annotated and ranked SVs per family | | |
svs/family/{family_id}/{family_id}_svs_merged_annotated_ranked.vcf.gz.tbi | Index of the merged VCF file | | |
Filtering
Filter_vep and bcftools can be used to filter variants. Filtered files are output if either --filter_variants_hgnc_id or --filter_svs_expression has been used. Only family VCFs are filtered.
Path | Description |
---|---|
svs/{family}/{family}_*_filtered.vcf.gz | VCF file with filtered variants for a family |
svs/{family}/{family}_*_filtered.vcf.gz.tbi | Index of the filtered VCF file |
Tip
Filtered variants are output alongside unfiltered variants as additional files.
Visualization Tracks
HiFiCNV is used to call CNVs, but it also produces copy number, depth, and MAF tracks that can be visualized in, for example, IGV.
Path | Description |
---|---|
visualization_tracks/{sample}/*.copynum.bedgraph | Copy number in bedgraph format |
visualization_tracks/{sample}/*.depth.bw | Depth track in BigWig format |
visualization_tracks/{sample}/*.maf.bw | Minor allele frequencies in BigWig format |
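Besides loading the tracks in a genome browser, the copy number bedgraph can be summarized directly. The sketch below computes a length-weighted mean value per chromosome, assuming the standard four-column bedgraph layout (chromosome, start, end, value); the function name is hypothetical.

```python
from collections import defaultdict


def weighted_mean_per_chrom(path):
    """Length-weighted mean of the bedgraph value column, per chromosome."""
    weighted = defaultdict(float)
    length = defaultdict(int)
    with open(path) as handle:
        for line in handle:
            # Skip optional track/browser header lines, comments and blank lines
            if line.startswith(("track", "browser", "#")) or not line.strip():
                continue
            chrom, start, end, value = line.split()[:4]
            span = int(end) - int(start)
            weighted[chrom] += float(value) * span
            length[chrom] += span
    return {c: weighted[c] / length[c] for c in weighted if length[c] > 0}
```

For a diploid sample, autosomes would be expected to average close to 2, so a clear deviation for a whole chromosome is worth a closer look in the browser.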