genomic-medicine-sweden/nallo: Output
Introduction
This document describes the pipeline output files and the tools used to generate them.
Aligned reads
Minimap2 is used to map the reads to a reference genome. The aligned reads are sorted, merged and indexed using samtools. If the pipeline is run with phasing, the aligned reads will be haplotagged using the active phasing tool.
Path | Description | Alignment | Alignment & phasing |
---|---|---|---|
aligned_reads/minimap2/{sample}/{sample}_aligned.{bam,cram} | Alignment file in BAM or CRAM format | ||
aligned_reads/minimap2/{sample}/{sample}_aligned.{bam.bai,cram.crai} | Index of the corresponding alignment file |
Path | Description | Alignment | Alignment & phasing |
---|---|---|---|
aligned_reads/{sample}/{sample}_haplotagged.{bam,cram} | Alignment file with haplotags in BAM or CRAM format | ||
aligned_reads/{sample}/{sample}_haplotagged.{bam.bai,cram.crai} | Index of the alignment file |
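
As a minimal sketch of how the two path templates above expand for a given sample, the helper below returns the alignment file and its index. The function name and its parameters are hypothetical and not part of the pipeline, and it assumes the haplotagged file is the one of interest when phasing is on.

```python
def alignment_outputs(sample: str, fmt: str = "bam", phased: bool = False) -> list[str]:
    """Return the alignment file and its index for one sample.

    Hypothetical helper: it only fills in the path templates listed in the
    tables above and does not check that the files exist.
    """
    index_ext = {"bam": "bam.bai", "cram": "cram.crai"}
    if fmt not in index_ext:
        raise ValueError(f"unsupported format: {fmt!r}")
    if phased:
        # Haplotagged reads are published directly under aligned_reads/{sample}/
        stem = f"aligned_reads/{sample}/{sample}_haplotagged"
    else:
        stem = f"aligned_reads/minimap2/{sample}/{sample}_aligned"
    return [f"{stem}.{fmt}", f"{stem}.{index_ext[fmt]}"]
```

For example, `alignment_outputs("sampleA", "cram", phased=True)` yields the haplotagged CRAM and its `.crai` index.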
Assembly
Hifiasm is used to assemble genomes. The assembled haplotypes are then aligned to the reference genome with minimap2, tagged with HP:1 for the "paternal" haplotype and HP:2 for the "maternal" haplotype, before being merged together into one file with samtools. gfastats is used to convert the assembly to fasta format before alignment, and also outputs summary stats per haplotype.
Path | Description |
---|---|
assembly/sample/{sample}/{sample}_aligned_assembly.{bam,cram} | Both assembled haplotypes mapped to the reference genome, merged and haplotagged (HP:1/HP:2). |
assembly/sample/{sample}/{sample}_aligned_assembly.{bam.bai,cram.crai} | Index of aligned assembly. |
assembly/stats/{sample}/{sample}_haplotype_1.assembly_summary | Summary statistics for haplotype 1/paternal haplotype |
assembly/stats/{sample}/{sample}_haplotype_2.assembly_summary | Summary statistics for haplotype 2/maternal haplotype |
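
The haplotype of an aligned contig can be read back from its HP tag. The sketch below works on a single SAM text record (for instance, one line printed by `samtools view`) and relies only on the SAM convention that optional `TAG:TYPE:VALUE` fields start at column 12. The function name and the label mapping are illustrative assumptions that mirror the "paternal"/"maternal" wording above.

```python
# Labels follow the convention described above: HP:1 "paternal", HP:2 "maternal".
HAPLOTYPE_LABELS = {1: "paternal", 2: "maternal"}


def haplotype_of(sam_line: str):
    """Return the haplotype label of a SAM record, or None if it has no HP tag."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        if tag == "HP":
            return HAPLOTYPE_LABELS.get(int(value))
    return None
```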
Methylation pileups
Modkit is used to create methylation pileups, producing bedMethyl files for both haplotagged and ungrouped reads. Additionally, methylation information can be viewed in the BAM files, for example in IGV. When phasing is on, modkit outputs pileups per haplotype.
Path | Description | Alignment | Alignment & phasing |
---|---|---|---|
methylation/modkit/pileup/{sample}/*.modkit_pileup_1.bed.gz | bedMethyl file with summary counts from haplotagged reads (haplotype 1) | ||
methylation/modkit/pileup/{sample}/*.modkit_pileup_2.bed.gz | bedMethyl file with summary counts from haplotagged reads (haplotype 2) | ||
methylation/modkit/pileup/{sample}/*.modkit_pileup_ungrouped.bed.gz | bedMethyl file for ungrouped reads | ||
methylation/modkit/pileup/{sample}/*.modkit_pileup.bed.gz | bedMethyl file with summary counts from all reads | ||
methylation/modkit/pileup/{sample}/*.bed.gz.tbi | Index of the corresponding bedMethyl files |
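
When collecting pileups for downstream work, the file name suffix tells you which group of reads a bedMethyl file summarizes. A small sketch, assuming only the suffixes listed in the table above (the function name is hypothetical):

```python
def pileup_group(filename: str) -> str:
    """Classify a modkit pileup file by the read group encoded in its name."""
    suffixes = {
        ".modkit_pileup_1.bed.gz": "haplotype 1",
        ".modkit_pileup_2.bed.gz": "haplotype 2",
        ".modkit_pileup_ungrouped.bed.gz": "ungrouped",
        ".modkit_pileup.bed.gz": "all reads",
    }
    for suffix, group in suffixes.items():
        if filename.endswith(suffix):
            return group
    # Index files (*.tbi) and unrelated files end up here.
    raise ValueError(f"not a recognized pileup file: {filename}")
```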
MultiQC
MultiQC generates an HTML report summarizing all samples' QC results and pipeline statistics.
Path | Description |
---|---|
multiqc/multiqc_report.html | HTML report summarizing QC results |
multiqc/multiqc_data/ | Directory containing parsed statistics |
multiqc/multiqc_plots/ | Directory containing static report images |
Pipeline Information
Nextflow generates reports for troubleshooting, performance, and traceability.
Path | Description |
---|---|
pipeline_info/execution_report.html | Execution report |
pipeline_info/execution_timeline.html | Timeline report |
pipeline_info/execution_trace.txt | Execution trace |
pipeline_info/pipeline_dag.dot | Pipeline DAG in DOT format |
pipeline_info/pipeline_report.html | Pipeline report |
pipeline_info/software_versions.yml | Software versions used in the run |
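
For troubleshooting, the execution trace can be scanned for failed tasks. The sketch below assumes the trace is the tab-separated file Nextflow writes by default, with a header row that includes `name` and `status` columns. The function name is hypothetical, and the task names in the usage example are made up.

```python
import csv
import io


def failed_tasks(trace_text: str) -> list[str]:
    """Return the names of tasks whose status is FAILED in a Nextflow trace."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [row["name"] for row in reader if row["status"] == "FAILED"]


# Usage with a made-up two-task trace:
example = "task_id\tname\tstatus\n1\tALIGN (s1)\tCOMPLETED\n2\tPILEUP (s1)\tFAILED\n"
print(failed_tasks(example))
```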
Phasing
LongPhase, WhatsHap, or HiPhase is used for phasing.
Path | Description |
---|---|
aligned_reads/{sample}/{sample}_haplotagged.{bam,cram} | BAM/CRAM file with haplotags |
aligned_reads/{sample}/{sample}_haplotagged.{bam.bai,cram.crai} | Index of the BAM/CRAM file |
phased_variants/{sample}/*.vcf.gz | VCF file with phased variants |
phased_variants/{sample}/*.vcf.gz.tbi | Index of the VCF file |
qc/phasing_stats/{sample}/*.blocks.gtf.gz | Phase block file |
qc/phasing_stats/{sample}/*.blocks.gtf.gz.tbi | Index of block file |
qc/phasing_stats/{sample}/*.stats.tsv | Phasing statistics file |
QC
FastQC, cramino, mosdepth, and somalier are used for read quality control.
FastQC
FastQC provides general quality metrics for sequenced reads, including information on quality score distribution, per-base sequence content (%A/T/G/C), adapter contamination, and overrepresented sequences. For more details, refer to the FastQC help pages.
Path | Description |
---|---|
qc/fastqc/{sample}/*_fastqc.html | FastQC report containing quality metrics |
qc/fastqc/{sample}/*_fastqc.zip | Zip archive with the FastQC report, data files, and plot images |
Mosdepth
Mosdepth is used to report quality control metrics such as coverage and GC content from alignment files.
Path | Description | With --target_regions | Without --target_regions |
---|---|---|---|
qc/mosdepth/{sample}/{sample}.mosdepth.global.dist.txt | Cumulative distribution of bases covered for at least a given coverage value, across chromosomes and the whole genome | ||
qc/mosdepth/{sample}/{sample}.mosdepth.summary.txt | Mosdepth summary file | ||
qc/mosdepth/{sample}/{sample}.mosdepth.region.dist.txt | Cumulative distribution of bases covered for at least a given coverage value, across regions | ||
qc/mosdepth/{sample}/{sample}.per-base.d4 | Per-base depth in d4 format | ||
qc/mosdepth/{sample}/{sample}.regions.bed.gz | Depth per region | ||
qc/mosdepth/{sample}/{sample}.regions.bed.gz.csi | Index of the regions.bed.gz file |
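
The cumulative distribution files can be queried directly. The sketch below assumes the usual mosdepth layout of three tab-separated columns (chromosome or `total`, coverage value, proportion of bases covered at that value or higher) and returns the proportion for a chosen depth. The function name is hypothetical.

```python
def fraction_covered(dist_text: str, min_depth: int, chrom: str = "total") -> float:
    """Return the proportion of bases covered at >= min_depth for one chromosome."""
    seen_chrom = False
    for line in dist_text.splitlines():
        if not line.strip():
            continue
        name, depth, fraction = line.split("\t")
        if name != chrom:
            continue
        seen_chrom = True
        if int(depth) == min_depth:
            return float(fraction)
    if not seen_chrom:
        raise KeyError(f"{chrom} not present in distribution file")
    # Depths above the highest observed coverage are not listed at all.
    return 0.0
```

Passing the contents of `{sample}.mosdepth.global.dist.txt` with `min_depth=20` gives the genome-wide fraction of bases at 20x or more.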
Cramino
cramino is used to analyze both phased and unphased reads.
Path | Description |
---|---|
qc/cramino/phased/{sample}/*.arrow | Read length and quality in Apache Arrow format |
qc/cramino/phased/{sample}/*.txt | Summary information in text format |
qc/cramino/unphased/{sample}/*.arrow | Read length and quality in Apache Arrow format |
qc/cramino/unphased/{sample}/*.txt | Summary information in text format |
Somalier
somalier checks relatedness and sex.
Path | Description |
---|---|
pedigree/family/{family}.ped | PED file updated with somalier-inferred sex per family |
qc/somalier/relate/{project}/{project}.html | HTML report |
qc/somalier/relate/{project}/{project}.pairs.tsv | Information about sample pairs |
qc/somalier/relate/{project}/{project}.samples.tsv | Information about individual samples |
Peddy
peddy checks relatedness and sex.
Path | Description |
---|---|
qc/peddy/{family}/{family}.peddy.ped | PED file updated with peddy-inferred sex per family |
qc/peddy/{family}/{family}.html | HTML report |
qc/peddy/{family}/{family}.vs.html | HTML report of observed vs expected relatedness |
qc/peddy/{family}/{family}.sex_check.csv | Comparison between reported sex (ped file) and that inferred from peddy |
qc/peddy/{family}/{family}.het_check.csv | Het check does general QC including rate of het calls, allele balance at het calls, mean and median depth, and a PCA projection onto 1000 Genomes. Includes an ancestry check |
qc/peddy/{family}/{family}.ped_check.csv | Ped check compares the relatedness of 2 samples as reported in a .ped file to the relatedness inferred from the genotypes and ~25K sites in the genome |
qc/peddy/{family}/{family}.sex_check.png | PNG comparison between reported sex (ped file) and that inferred from peddy |
qc/peddy/{family}/{family}.het_check.png | PNG of heterozygosity check |
qc/peddy/{family}/{family}.ped_check.png | PNG of the ped check comparison |
qc/peddy/{family}/{family}.ped_check.rel-difference.csv | CSV file with the comparison between inferred and given relatedness |
DeepVariant
vcf_stats_report.py from DeepVariant is used to generate an HTML report per sample.
Path | Description |
---|---|
qc/deepvariant_vcfstatsreport/{sample}/{sample}.visual_report.html | Visual report of SNV calls from DeepVariant |
Variants
In general, annotated variant calls are output per family while unannotated calls are output per sample.
Paralogous genes
Paraphase is used to call paralogous genes.
Path | Description |
---|---|
paraphase/sample/{sample}/{sample}.paraphase.{bam,cram} | BAM/CRAM file with reads from analyzed regions |
paraphase/sample/{sample}/{sample}.paraphase.{bam.bai,cram.crai} | Index of the BAM/CRAM file |
paraphase/sample/{sample}/{sample}.paraphase.json | Summary of haplotypes and variant calls |
paraphase/sample/{sample}_paraphase_vcfs/{sample}_{gene}_vcf.gz | VCF file per gene |
paraphase/sample/{sample}_paraphase_vcfs/{sample}_{gene}_vcf.gz.tbi | Index of the VCF file |
paraphase/family/{family_id}/{family_id}_paraphase_merged.vcf.gz | VCF file from paraphase, merged by family |
paraphase/family/{family_id}/{family_id}_paraphase_merged.vcf.gz.tbi | Index of the VCF file merged by family |
paraphase/family/{family_id}/{family_id}_merged.json | Summary of haplotypes and variant calls, merged by family |
Repeats
TRGT or STRdust is used to call repeats. Stranger is used to annotate repeats.
Path | Description | STRdust, Call repeats only | TRGT, Call repeats only | TRGT, Call & annotate repeats |
---|---|---|---|---|
repeats/sample/{sample}/{sample}_{str_caller}.vcf.gz | VCF file with called repeats for a sample | |||
repeats/sample/{sample}/{sample}_{str_caller}.vcf.gz.tbi | Index of the VCF file | |||
repeats/sample/{sample}/{sample}_spanning_trgt.{bam,cram} | BAM/CRAM file with sorted spanning reads | |||
repeats/sample/{sample}/{sample}_spanning_trgt.{bam.bai,cram.crai} | Index of the BAM/CRAM file | |||
repeats/family/{family}/{family}_repeat_expansions.vcf.gz | Merged VCF file per family | |||
repeats/family/{family}/{family}_repeat_expansions.vcf.gz.tbi | Index of the VCF file | |||
repeats/family/{family}/{family}_repeat_expansions_annotated.vcf.gz | Merged, annotated VCF file per family | |||
repeats/family/{family}/{family}_repeat_expansions_annotated.vcf.gz.tbi | Index of the VCF file |
SNVs
DeepVariant is used to call variants, while bcftools and GLnexus are used for merging variants.
Path | Description | Call SNVs | Call & annotate SNVs | Call, annotate and rank SNVs |
---|---|---|---|---|
snvs/sample/{sample}/{sample}_snvs.vcf.gz | VCF file containing called variants with alternative genotypes for a sample | |||
snvs/sample/{sample}/{sample}_snvs.vcf.gz.tbi | Index of the corresponding VCF file | |||
snvs/stats/sample/*.stats.txt | Variant statistics | |||
qc/deepvariant_vcfstatsreport/{sample}/{sample}.visual_report.html | Visual report of SNV calls from DeepVariant | |||
snvs/family/{family}/{family}_snvs.vcf.gz | VCF file containing called variants for all samples | |||
snvs/family/{family}/{family}_snvs.vcf.gz.tbi | Index of the corresponding VCF file |
Annotation
Echtvar and VEP are used for annotating SNVs, while CADD is used to annotate INDELs with CADD scores.
Path | Description | Call SNVs | Call & annotate SNVs | Call, annotate and rank SNVs |
---|---|---|---|---|
snvs/sample/{sample}/{sample}_snvs_annotated.vcf.gz | VCF file containing annotated variants with alternative genotypes for a sample | |||
snvs/sample/{sample}/{sample}_snvs_annotated.vcf.gz.tbi | Index of the annotated VCF file | |||
snvs/family/{family}/{family}_snvs_annotated.vcf.gz | VCF file containing annotated variants per family | |||
snvs/family/{family}/{family}_snvs_annotated.vcf.gz.tbi | Index of the annotated VCF file |
Ranking
GENMOD is used to rank the annotated SNVs and INDELs.
Path | Description | Call SNVs | Call & annotate SNVs | Call, annotate and rank SNVs |
---|---|---|---|---|
snvs/sample/{sample}/{sample}_snvs_annotated_ranked.vcf.gz | VCF file with annotated and ranked variants for a sample | |||
snvs/sample/{sample}/{sample}_snvs_annotated_ranked.vcf.gz.tbi | Index of the ranked VCF file | |||
snvs/family/{family}/{family}_snvs_annotated_ranked.vcf.gz | VCF file with annotated and ranked variants per family | |||
snvs/family/{family}/{family}_snvs_annotated_ranked.vcf.gz.tbi | Index of the ranked VCF file |
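
The SNV file names grow by one suffix per processing stage (called, annotated, ranked). As a minimal sketch of that naming scheme, the hypothetical helper below builds the expected VCF path for a sample or family. It only encodes the templates from the SNV tables and says nothing about which stages were enabled in a given run.

```python
# Suffix added to the sample or family name at each stage, per the tables above.
STAGE_SUFFIX = {
    "called": "_snvs",
    "annotated": "_snvs_annotated",
    "ranked": "_snvs_annotated_ranked",
}


def snv_vcf(level: str, name: str, stage: str = "called") -> str:
    """Build the expected SNV VCF path for a sample or family at a given stage."""
    if level not in ("sample", "family"):
        raise ValueError(f"level must be 'sample' or 'family', got {level!r}")
    if stage not in STAGE_SUFFIX:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"snvs/{level}/{name}/{name}{STAGE_SUFFIX[stage]}.vcf.gz"
```

The matching index is the same path with `.tbi` appended.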
Filtering
Filter_vep and bcftools can be used to filter variants. Filtered files are output if either --filter_variants_hgnc_id or --filter_snvs_expression has been used, and only family VCFs are filtered.
Path | Description |
---|---|
snvs/{family}/{family}_*_filtered.vcf.gz | VCF file with filtered variants for a family |
snvs/{family}/{family}_*_filtered.vcf.gz.tbi | Index of the filtered VCF file |
Tip
Filtered variants are output alongside unfiltered variants as additional files.
SVs (and CNVs)
Severus or Sniffles is used to call structural variants, while HiFiCNV is used to call CNVs. HiFiCNV also produces copy number, depth, and MAF visualization tracks.
Variant merging strategies
SV and CNV calls are output unmerged per sample. For the family files, SVs and CNVs are first merged between samples separately, then the merged SV and CNV files are merged with each other, with priority given to coordinates from the SV calls. SV calls are output for all callers, but only variants from one caller (set by --sv_caller) are merged with CNVs, then annotated, ranked and filtered.
Path | Description | Call SVs | Call CNVs | Call SVs & CNVs | --publish_unannotated_family_svs |
---|---|---|---|---|---|
svs/sample/{sample}/{sample}_{sniffles,severus}_svs_merged.vcf.gz | VCF file with SVs per sample | ||||
svs/sample/{sample}/{sample}_{sniffles,severus}_svs_merged.vcf.gz.tbi | Index of the VCF file with SVs per sample | ||||
svs/sample/{sample}/{sample}_hificnv_cnvs_merged.vcf.gz | VCF file with CNVs per sample | ||||
svs/sample/{sample}/{sample}_hificnv_cnvs_merged.vcf.gz.tbi | Index of the VCF file with CNVs per sample | ||||
svs/family/{family_id}/{family_id}_{hifiasm,sniffles,severus}_{svs,cnvs}_merged.vcf.gz | VCF file with merged SVs per family and caller | ||||
svs/family/{family_id}/{family_id}_{hifiasm,sniffles,severus}_{svs,cnvs}_merged.vcf.gz.tbi | Index of the merged VCF file | ||||
svs/family/{family_id}/{family_id}_cnvs_svs_merged.vcf.gz | VCF file with merged CNVs and SVs per family | ||||
svs/family/{family_id}/{family_id}_cnvs_svs_merged.vcf.gz.tbi | Index of the merged VCF file |
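
To illustrate what "priority given to coordinates from the SV calls" means in the merging strategy above, here is a toy sketch. It is not the pipeline's implementation: the `Call` type, the matching rule (same chromosome, same type, any overlap) and the function names are all assumptions made for illustration, since this document does not specify the merging tool or its criteria.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str
    source: str  # "sv" or "cnv"


def overlaps(a: Call, b: Call) -> bool:
    """Assumed matching rule: same chromosome and type, intervals overlap."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and a.start < b.end
        and b.start < a.end
    )


def merge_calls(svs: list[Call], cnvs: list[Call]) -> list[Call]:
    """Keep every SV call; add a CNV call only if no SV call matches it."""
    merged = list(svs)
    for cnv in cnvs:
        if not any(overlaps(cnv, sv) for sv in svs):
            merged.append(cnv)
    return sorted(merged, key=lambda c: (c.chrom, c.start, c.end))
```

When a CNV overlaps an SV of the same type, only the SV record (and therefore its coordinates) survives; CNVs without a matching SV pass through unchanged.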
Annotation
SVDB and VEP are used to annotate structural variants.
Path | Description | Call & annotate SVs | Call & annotate SVs & CNVs |
---|---|---|---|
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated.vcf.gz | VCF file with merged and annotated CNVs and SVs per family | ||
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated.vcf.gz.tbi | Index of the merged VCF file | ||
svs/family/{family_id}/{family_id}_svs_merged_annotated.vcf.gz | VCF file with merged and annotated SVs per family | ||
svs/family/{family_id}/{family_id}_svs_merged_annotated.vcf.gz.tbi | Index of the merged VCF file |
Ranking
GENMOD is used to rank the annotated SVs.
Path | Description | Rank SVs | Rank SVs & CNVs |
---|---|---|---|
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated_ranked.vcf.gz | VCF file with merged, annotated and ranked CNVs and SVs per family | ||
svs/family/{family_id}/{family_id}_cnvs_svs_merged_annotated_ranked.vcf.gz.tbi | Index of the merged VCF file | ||
svs/family/{family_id}/{family_id}_svs_merged_annotated_ranked.vcf.gz | VCF file with merged, annotated and ranked SVs per family | ||
svs/family/{family_id}/{family_id}_svs_merged_annotated_ranked.vcf.gz.tbi | Index of the merged VCF file |
Filtering
Filter_vep and bcftools can be used to filter variants. Filtered files are output if either --filter_variants_hgnc_id or --filter_svs_expression has been used, and only family VCFs are filtered.
Path | Description |
---|---|
svs/{family}/{family}_*_filtered.vcf.gz | VCF file with filtered variants for a family |
svs/{family}/{family}_*_filtered.vcf.gz.tbi | Index of the filtered VCF file |
Tip
Filtered variants are output alongside unfiltered variants as additional files.
Visualization Tracks
HiFiCNV is used to call CNVs, but it also produces copy number, depth, and MAF tracks that can be visualized in, for example, IGV.
Path | Description |
---|---|
visualization_tracks/{sample}/*.copynum.bedgraph | Copy number in bedgraph format |
visualization_tracks/{sample}/*.depth.bw | Depth track in BigWig format |
visualization_tracks/{sample}/*.maf.bw | Minor allele frequencies in BigWig format |
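
Besides viewing the tracks in IGV, the copy number bedgraph can be summarized with a few lines of code. The sketch below assumes the standard four-column bedGraph layout (chromosome, start, end, value) and computes a length-weighted mean copy number for one chromosome. The function name is hypothetical.

```python
def mean_copy_number(bedgraph_text: str, chrom: str) -> float:
    """Return the length-weighted mean of the value column for one chromosome."""
    total_bases = 0
    weighted_sum = 0.0
    for line in bedgraph_text.splitlines():
        # Skip blank lines and optional bedGraph header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        name, start, end, value = line.split()[:4]
        if name != chrom:
            continue
        length = int(end) - int(start)
        total_bases += length
        weighted_sum += length * float(value)
    if total_bases == 0:
        raise ValueError(f"no intervals found for {chrom}")
    return weighted_sum / total_bases
```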