1 Experimental Design

DNA was extracted from 344 isolates of Ascochyta rabiei (collected in 2024-25) and prepared for whole-genome resequencing (WGRS) at the State Agricultural Biotechnology Centre (SABC) at Murdoch University (led by Prof. Rajeev Varshney). DNA libraries were prepared and sequenced on an MGI DNBSEQ-T7, producing 150 bp paired-end reads.

2 Aims

  • Compare variant detection pipelines that allow accurate identification of Multi-Locus Haplotypes (MLHs)
  • Identify MLHs descending from an original population and measure their frequencies
  • Assess the impact of host genotype on MLH frequencies
  • Identify genomic loci and regions that drive isolate adaptation/evolution under different selection pressures

3 Methods

3.1 Overview of Analysis Pipeline

  1. Data pre-processing:
    1. Quality check
    2. Adaptor trimming
    3. Post-trim quality check
  2. Mapping reads to a reference genome
  3. Read deduplication and read group addition
  4. Variant calling and filtration
  5. Population genetics analysis (MLH clustering)

Sequencing data processing, mapping and variant calling were performed on the QCIF Bunya HPC cluster (using the Slurm scheduler, see documentation) and in Galaxy Australia (GalaxyAU).

3.2 Data pre-processing

Install the required software in a conda environment on the HPC cluster (we will install a Miniforge distribution, which ships with mamba pre-installed - see the mamba docs).

# download miniforge conda
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
# accept defaults and let conda initialise
# initialise conda
source ~/.bashrc
# add channels and set priorities
conda config --add channels conda-forge
conda config --append channels bioconda

# install extra packages to the base environment
mamba install -n base libgcc gnutls libuuid readline cmake git tmux libgfortran parallel mamba gawk pigz rename genozip autoconf sshpass gh
# install snippy (need to fix internet connection to gowonda2 - use patched netcheck in ~/bin)
# source ~/.proxy
CONDA_NAME=genomics
mamba create -n $CONDA_NAME snippy sra-tools bcbio-gff libgd xorg-libxpm \
                libpng libjpeg-turbo jpeg snpsift rename biobambam bwa-mem2 sambamba \
                libtiff genozip parallel qualimap multiqc bbmap fastp freebayes bedops 
# Clean extra space
# conda update -n base conda
conda clean -y --all

# Prepare a general array Slurm script
echo '#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%A.%a.log'"
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

set -Eeo pipefail
source ~/.bashrc
conda activate \$CONDA_NAME
cd \$SLURM_SUBMIT_DIR
gawk -v ARRAY_IND=\$SLURM_ARRAY_TASK_ID 'NR==ARRAY_IND' \$CMDS_FILE | bash" > ~/bin/array.slurm

echo '#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%j.log'"
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

set -Eeo pipefail
source ~/.bashrc
conda activate \$CONDA_NAME
cd \$SLURM_SUBMIT_DIR
bash \$CMDS_FILE" > ~/bin/serial_jobs_run.slurm

# Prepare a parallel jobs Slurm script
echo '
cat $CMDS_FILE | parallel' | cat <(head -n -1  ~/bin/serial_jobs_run.slurm) - > ~/bin/parallel_jobs_run.slurm

Download the raw .fastq.gz and reference genome files from cloud storage (GU Research Space) using rclone.

# Download genome files
REF_DIR=$HOME/data/reference_genomes/ArME14_v2_CCDM # on Bunya HPC
rclone copy -P --include "ArME14.fasta.gz*" --include "ArME14-genes.gff3*" GURS_shared:GRI2007-001RTX/ME14_reference_and_annotations/ArME14_v2_CCDM $REF_DIR
# create a working directory
WORK_DIR=/scratch/project/adna/A_rabiei
mkdir -p $WORK_DIR
cd $WORK_DIR
# download TECAN read files from GU Research Space
rclone copy -P GURS_shared:GRI2007-001RTX/A_rabiei_TECAN/AGRF_CAGRF23101109_22GHFLLT3 ./AGRF_CAGRF23101109_22GHFLLT3

Create a folder for our processing ($RUN_DIR) and prepare the reference genome (create accessory index files).

# start an interactive job on Bunya
alias start_interactive_job='salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=10 --mem=50G --job-name=interactive --time=05:00:00 --partition=general --account=a_agri_genomics srun --export=PATH,TERM,HOME,LANG --pty /bin/bash -l'
WORK_DIR=/scratch/project/adna/A_rabiei/
CONDA_NAME=genomics
conda activate $CONDA_NAME
# BBMAP_REF=$(find $CONDA_PREFIX -wholename "*resources/adapters.fa")
# Prepare the commands
RUN_DIR=$WORK_DIR/A_rabiei_TECAN_2022
mkdir -p $RUN_DIR/ref_genome && cd $RUN_DIR
REF_DIR=$HOME/data/reference_genomes/ArME14_v2_CCDM # New published genome from CCDM on Bunya HPC
# REF_DIR=$HOME/data/reference_genomes # on Awoonga HPC
# GENOME="$REF_DIR/ArME14"
GENOME="$RUN_DIR/ref_genome/ArME14_v2_CCDM"
# ln -s $REF_DIR/Ascochyta_rabiei_ArME14.scaffolds.fa $GENOME.fa 
# ln -s $REF_DIR/ArME14_all_annotations.updated.gff3 $GENOME.gff3 
pigz -cd  $REF_DIR/ArME14.fasta.gz > $GENOME.fa
pigz -cd  $REF_DIR/ArME14-genes.gff3.gz > $GENOME.gff3
gff2bed < $GENOME.gff3 > $GENOME.bed
bwa-mem2 index $GENOME.fa 
samtools faidx $GENOME.fa

3.3 Mapping to the updated reference genome

Reads were mapped to the A. rabiei reference genome assembled and annotated by the CCDM, Curtin University (NCBI accession GCF_004011695.2), using bwa-mem2 v2.2.1 (Vasimuddin et al. 2019). The alignment files were then coordinate-sorted and PCR duplicates were marked using the bamsormadup command from BioBamBam2 v2.0.183 (Tischler and Leonard 2014).
Mapping quality was assessed with Qualimap v.2.2.2-dev (Okonechnikov et al. 2016) and consolidated along with quality-trimming measures into a single, interactive report for each batch using MultiQC v1.21 (Ewels and Magnusson 2016).

# setup workspace
CONDA_NAME="genomics" 
REF_DIR="/scratch/project/adna/A_rabiei/A_rabiei_TECAN_2022/ref_genome"
GENOME="$REF_DIR/ArME14_v2_CCDM"
WORK_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS"
FQ_DIR="$WORK_DIR/fungaldata"
mkdir -p $FQ_DIR && ln -s $WORK_DIR/fungal/*.fq.gz $FQ_DIR/
ISOLATES="Ar0020|Ar0023|Ar0212|AR0210|AR0022|AR0128" # AR0242|AR0052|
find /scratch/project/adna/A_rabiei/AGRF_CAGRF20114478 -name "*.fastq.gz" | egrep -i $ISOLATES | parallel --dry-run --rpl  "{target} s:.+\/(.+?)_H3HGFDSX2_[CATG-]+_L002_R([12]).fastq.gz:\1.R\2.fq.gz:" ln -s {} $FQ_DIR/{target}
# _H3HGFDSX2_CCTCCGGTTG-TGGTGTTATG_L002_R2.fastq.gz

DATE=$(date +%d_%m_%Y)
RUN_DIR=$WORK_DIR/FB_SNP_calling_${DATE}
BAM_DIR="$RUN_DIR/aligned_reads"
mkdir -p $BAM_DIR $FQ_DIR/trimmed_reads/QC && cd $RUN_DIR
NCORES=12
MEM=96
WALLTIME="3:00:00"
RGPM="DNBseq_T7"
RGPL="MGI"
RGPU="E250038400"
RGCN="MU_SABC"


JOBNAME="Murdoch-fastp-bwa"


find $FQ_DIR/*.R1.fq.gz | parallel -k --dry-run --rpl "{file2} s:.R1:.R2:; uq()" --rpl "{sample} s:.+\/(.+?).R1.fq.gz:\1:"  "fastp -i {} -I {file2} --detect_adapter_for_pe -c -l 30 -p -w \$SLURM_CPUS_PER_TASK -z 7 -o $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz -O $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz -j $FQ_DIR/trimmed_reads/QC/{sample}.fastp.json && bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\tLB:{sample}\tPU:$RGPU\tPL:$RGPL\tPM:$RGPM\tCN:$RGCN\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=$BAM_DIR/{sample}.dedup.rg.csorted.bam.bai > $BAM_DIR/{sample}.dedup.rg.csorted.bam ; unset DISPLAY; qualimap bamqc -bam $BAM_DIR/{sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir $BAM_DIR/{sample}_bamqc"  > $RUN_DIR/$JOBNAME.cmds

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

3.4 Calling variants (using Freebayes)

We used Freebayes v1.3.5 (Garrison and Marth 2012) to assign variant probability scores and call variants. Note that we ran it in diploid mode (-p 2), despite A. rabiei being haploid, so we could identify loci wrongly called as heterozygous, which indicates a mapping issue. We also added the --genotype-qualities flag so that the resulting VCF file can be filtered on genotype quality (GQ).

WORK_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS"
RUN_DIR="$WORK_DIR/SNP_calling_04_04_2025"
FQ_DIR="$WORK_DIR/fungaldata"
GENOME="$WORK_DIR/A_rabiei_TECAN_2022/ref_genome/ArME14_v2_CCDM"
CONDA_NAME=genomics

BAM_DIR=$RUN_DIR/aligned_reads

# bring in BAM files (if not prepared in the previous step)
# select isolates
ISOLATES="Ar0020|Ar0023|Ar0212|AR0242|AR0052|AR0210|AR0022|AR0128"
find /scratch/project/adna/A_rabiei/AGRF_gatk_13_03_2025/aligned_reads -maxdepth 1 -name "*.rg.csorted.bam*" | egrep -i $ISOLATES | parallel ln -s {} $BAM_DIR/
ln -s /scratch/project/adna/A_rabiei/Murdoch_WGRS/GATK_Murdoch_WGRS_04_04_2025/aligned_reads/*.rg.csorted.bam* $BAM_DIR/

# Distributed freebayes (each node runs freebayes-parallel on one contig)
# download script
aria2c -c -x5 -d ~/bin https://raw.githubusercontent.com/freebayes/freebayes/master/scripts/split_ref_by_bai_datasize.py 
chmod +x ~/bin/split_ref_by_bai_datasize.py
mamba install -y -n $CONDA_NAME numpy scipy

# start_interactive_job
conda activate $CONDA_NAME
# fix library dependencies
find $CONDA_PREFIX -name "libtabixpp.so*" | parallel ln -s {} {.}.0
# ln -s $CONDA_PREFIX/lib/libtabixpp.so.1 $CONDA_PREFIX/lib/libtabixpp.so.0
# split each contig/chromosome to smaller 1e6 bits
# prepare BAM files
JOBNAME="prep_bams"
NCORES=2
MEM=16
WALLTIME="1:00:00"
# submit it as a Slurm job
echo "~/bin/split_ref_by_bai_datasize.py -s 1e6 -r $GENOME.fa.fai $(ls -1S $BAM_DIR/*.dedup.rg.csorted.bam | tail -n1) > $RUN_DIR/ArME14_target_1e6_regions_chr.tsv" > $JOBNAME.cmds
# submit the job 
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

JOBNAME="Murdoch_FB_diploid"
NCORES=6
MEM=32
WALLTIME="1:00:00"
# RUN_DIR=$WORK_DIR/SNP_calling_24_01_2025
PLOIDY=2
MIN_DP=7
# prepare commands
BAM_FILES=$( find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" | xargs )
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' $RUN_DIR/ArME14_target_1e6_regions_chr.tsv | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$SLURM_CPUS_PER_TASK -f $GENOME.fa --genotype-qualities -g 100000 -C $MIN_DP -p $PLOIDY $BAM_FILES > $RUN_DIR/FB_array_output/{}.combined.vcf" > $RUN_DIR/$JOBNAME.cmds
mkdir -p $RUN_DIR/FB_array_output
# exit interactive job
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# merge variants
ls -1 $RUN_DIR/FB_array_output/ArME14_ctg_*.combined.vcf | parallel --dry-run "printf \"{}\t%s\t%s\t%s\t%s\n\" \$(cat {} | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)) print \$0}' | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' ) & ( QUAL > 20 )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)) print \$0}' | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' ) & ( QUAL > 30 )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)) print \$0}' | grep -c -v '^#') > {}.stats" > ${RUN_DIR}/freebayes-stats.cmds
JOBNAME="freebayes-merge"
echo "cat ${RUN_DIR}/freebayes-stats.cmds | parallel && cat $RUN_DIR/FB_array_output/ArME14_ctg_*.combined.vcf | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq > A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf && bgzip A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf && tabix A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz" > $JOBNAME.cmds
# submit job to cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm
# combine stats
cat <(printf "file\ttotal_snps\tDP${MIN_DP}_filtered_snps\tQUAL20_filtered_snps\tQUAL30_filtered_snps\n") <(cat $RUN_DIR/FB_array_output/ArME14_ctg_*.vcf.stats) > A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid_vcf_stats_$(date +%d_%m_%Y).txt

3.4.1 Create and filter final variant file

The variants that were called individually for each chromosome were merged into a single VCF file, which was bgzip-compressed and indexed. Variants were filtered using a combination of commands from SnpSift v5.1d (Ruden et al. 2012), BCFtools v1.17 (Li 2011; Danecek et al. 2021) and VCFtools v0.1.16 (Danecek et al. 2011), based on their total locus depth and quality, keeping only bi-allelic polymorphic SNP loci covered by at least 10 and at most 100,000 reads and with a minimum quality of 30 (10<DP<100000 & QUAL>30, based on EDA). In addition, each isolate’s genotype call was reset (recoded as missing, or ./.) if it had low read depth (DP<10) or was called as a heterozygote. Variant statistics were generated with BCFtools before and after filtering.

CONDA_NAME="genomics"
conda activate $CONDA_NAME
# Recode genotypes as missing if below a certain threshold, such as genotyping quality or depth (GQ:DP)  
# keep only polymorphic bi-allelic SNPs, using QUAL>=30 and DP<100,000 (thresholds set below)

# filter Freebayes variants with SnpSift and vcftools (recode any heterozygote genotype or genotype with DP<10 as missing with bcftools)
QUAL=30 # 30
# MQ=30
MAX_DP=100000
MIN_DP=10
IND_DP=10
JOBNAME="Murdoch-wgrs-fb-filter"

echo "bcftools filter -S . -e \"GT=='het' | FMT/DP<$IND_DP\" $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz -O v | SnpSift filter \"( QUAL>=$QUAL ) & (DP<$MAX_DP) & ( countRef()>=1 & countVariant()>=1 )\" | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz 
vcftools --gzvcf $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz --recode --recode-INFO-all --minQ $QUAL --remove-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.snps.Q$QUAL.noHet.poly.recode.vcf.gz 
vcftools --gzvcf $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz --recode --recode-INFO-all --minQ $QUAL --keep-only-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.indels.Q$QUAL.noHet.poly.recode.vcf.gz "> $RUN_DIR/$JOBNAME.cmds

NCORES=6
MEM=32
WALLTIME="1:00:00"
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

#bcftools filter -S . -e "GT=='het' | FMT/DP<$MIN_DP" A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz -O v | SnpSift filter "( QUAL>=$QUAL ) & ( DP<=$MAX_DP ) & ( DP>=$MIN_DP ) & ( countRef()>=1 & countVariant()>=1 )" | vcftools --vcf - --recode --recode-INFO-all --minQ $QUAL --max-missing 0.75 --remove-indels -c | bgzip -o A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.GT75.noRep.noHet.poly.recode.vcf.gz && tabix A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.GT75.noRep.noHet.poly.recode.vcf.gz
# generate stats 
JOBNAME="bcftools_stats"
WALLTIME=2:00:00
MEM=32
NCORES=8
find . -name "*.vcf.gz" | parallel --dry-run "bcftools stats -s - {} > {.}.bcfstats.txt" > $JOBNAME.cmds
# send to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/parallel_jobs_run.slurm 

An in-house R script (estimate_error_rates_vcf_files.R) was used to estimate the error rates based on the presence of duplicated samples.

3.4.2 Call SNPs with GATK

The SNPs were also called from the WGRS data using gatk v4.3.0.0 (Van der Auwera et al. 2002; Heldenbrand et al. 2019), following this tutorial from Melbourne Bioinformatics. GATK uses a reference variant file (“known sites”) to mask true variant positions during Base Quality Score Recalibration (BQSR, highly recommended), so we need to generate that “Gold Standard” reference first. It can be done by either of these options:
1. Use a set of variants called by CCDM from their assemblies (if it exists)
2. Use the variants generated by Freebayes (or Snippy) for the 2020 WGS batch sequenced at AGRF (which might introduce a bias towards those pre-selected sites)
3. Use GATK to call variants from the 2020 WGS batch sequenced at AGRF without BQSR, then use Variant Filtering (VQSR) to select a set of high-confidence variants and use them as the reference for subsequent recalibration and variant calling (in a bootstrap approach, as described in the GATK BQSR documentation)
We chose the 3rd option, as follows:

CONDA_NAME="genomics"
# install tools
mamba install -y -n $CONDA_NAME bwa-mem2 bowtie2 biobambam sambamba qualimap multiqc fastp gatk4
WORK_DIR="/home/ibar/adna/A_rabiei"
REF_DIR="/scratch/project/adna/A_rabiei/A_rabiei_TECAN_2022/ref_genome"
GENOME="$REF_DIR/ArME14_v2_CCDM"

# prepare working folder and reference genome
DATE=`date +%d_%m_%Y`
BATCH=AGRF
RUN="${BATCH}_gatk_${DATE}" # day of run was 02_02_2019
RUN_DIR=$WORK_DIR/${RUN}

mkdir -p $RUN_DIR/aligned_reads && cd $RUN_DIR

NCORES=2
MEM=8
WALLTIME="2:00:00"
JOBNAME="gatk-prep-genome"
# prepare genome
echo "picard CreateSequenceDictionary R=$GENOME.fa  O=$GENOME.dict; gatk IndexFeatureFile -I $REF_VCF" > $RUN_DIR/$JOBNAME.cmds
# send to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm 

# prepare commands to run mapping and initial gatk
# define variables 
NCORES=12
MEM=64
WALLTIME=2:00:00
RGPM="NovaSeq"
RGPL="ILLUMINA"
RGPU="H3HGFDSX2"
RGCN="AGRF"
FQ_DIR="/scratch/project/adna/A_rabiei/AGRF_snippy_05_03_2025"
JOBNAME="agrf-initial-gatk"

# Create the bwa-mem2 commands to align all read pairs and create GVCF files
find $FQ_DIR -name "*_R1.trimmed.fastq.gz" | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:" --rpl "{sample} s:.+\/(.+?)_R1.+:\1:"  "ALIGN_DIR=$RUN_DIR/aligned_reads && bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\tLB:{sample}\tPU:$RGPU\tPL:$RGPL\tPM:$RGPM\tCN:$RGCN\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa {} {file2} | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=\$ALIGN_DIR/{sample}.dedup.rg.csorted.bam.bai > \$ALIGN_DIR/{sample}.dedup.rg.csorted.bam; gatk --java-options \"-Xmx7g\" HaplotypeCaller -I \$ALIGN_DIR/{sample}.dedup.rg.csorted.bam -R $GENOME.fa -ERC GVCF -O $RUN_DIR/$JOBNAME/{sample}.g.vcf.gz"  > $RUN_DIR/$JOBNAME.cmds

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# find which jobs failed
FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -sd,)
MEM=64
sbatch -a $FAILED_TASKS --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm 


# step 3  - Combine GVCF files and call variants
GCVF_FILES=$(find $RUN_DIR/agrf-initial-gatk -maxdepth 1 -name "*.g.vcf.gz" | gawk '{printf " -V %s", $1}')
JOBNAME="agrf-initial-gatk-combine-gvcf"
mkdir -p $RUN_DIR/output
echo "gatk --java-options \"-Xmx7g\" CombineGVCFs -R $GENOME.fa $GCVF_FILES -O $RUN_DIR/output/AGRF2020_cohort.g.vcf.gz
gatk IndexFeatureFile -I $RUN_DIR/output/AGRF2020_cohort.g.vcf.gz
gatk --java-options \"-Xmx7g\" GenotypeGVCFs -R $GENOME.fa -V $RUN_DIR/output/AGRF2020_cohort.g.vcf.gz -O $RUN_DIR/output/AGRF2020_cohort.gatk.vcf.gz"> $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# step 4a  - Filter variants
QUAL=30 # 30
MQ=40
MAX_DP=100000
MIN_DP=10
IND_DP=10

JOBNAME="agrf-initial-gatk-filter-snps"
echo "gatk SelectVariants -V $RUN_DIR/output/AGRF2020_cohort.gatk.vcf.gz -select-type SNP -O $RUN_DIR/output/AGRF2020_cohort.gatk.snps.vcf.gz 
gatk VariantFiltration -V $RUN_DIR/output/AGRF2020_cohort.gatk.snps.vcf.gz \
    -filter \"QD < 2.0\" --filter-name \"QD2\" \
    -filter \"QUAL < $QUAL.0\" --filter-name \"QUAL$QUAL\" \
    -filter \"SOR > 3.0\" --filter-name \"SOR3\" \
    -filter \"FS > 60.0\" --filter-name \"FS60\" \
    -filter \"MQ < $MQ.0\" --filter-name \"MQ$MQ\" \
    -filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
    -filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
    --genotype-filter-expression \"isHet == 1\" \
    --genotype-filter-name \"isHetFilter\" \
    --genotype-filter-expression \"DP < $IND_DP.0\" \
    --genotype-filter-name \"DP$IND_DP\" \
    --genotype-filter-expression \"GQ < $QUAL.0\" \
    --genotype-filter-name \"DP$QUAL\" \
    -O $RUN_DIR/output/AGRF2020_cohort.gatk.snps.filtered.vcf.gz
gatk SelectVariants \
  -V $RUN_DIR/output/AGRF2020_cohort.gatk.snps.filtered.vcf.gz \
  --set-filtered-gt-to-nocall \
  --max-fraction-filtered-genotypes 0.5 \
  -O $RUN_DIR/output/AGRF2020_cohort.gatk.snps.gt_filtered.vcf.gz" > $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# step 4b  - Filter variants
JOBNAME="agrf-initial-gatk-filter-indels"
echo "gatk SelectVariants -V $RUN_DIR/output/AGRF2020_cohort.gatk.vcf.gz -select-type INDEL -O $RUN_DIR/output/AGRF2020_cohort.gatk.indels.vcf.gz 
gatk VariantFiltration \
    -V $RUN_DIR/output/AGRF2020_cohort.gatk.indels.vcf.gz \
    -filter \"QD < 2.0\" --filter-name \"QD2\" \
    -filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
    -filter \"FS > 200.0\" --filter-name \"FS200\" \
    -filter \"ReadPosRankSum < -20.0\" --filter-name \"ReadPosRankSum-20\" \
    --genotype-filter-expression \"isHet == 1\" \
    --genotype-filter-name \"isHetFilter\" \
    --genotype-filter-expression \"DP < 10.0\" \
    --genotype-filter-name \"DP10\" \
    --genotype-filter-expression \"GQ < 30.0\" \
    --genotype-filter-name \"DP30\" \
    -O $RUN_DIR/output/AGRF2020_cohort.gatk.indels.filtered.vcf.gz
gatk SelectVariants \
  -V $RUN_DIR/output/AGRF2020_cohort.gatk.indels.filtered.vcf.gz \
  --set-filtered-gt-to-nocall \
  --max-fraction-filtered-genotypes 0.5 \
  -O $RUN_DIR/output/AGRF2020_cohort.gatk.indels.gt_filtered.vcf.gz" > $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# Step 5 - Combine VCFs

JOBNAME="agrf-initial-gatk-combine-vcfs"
echo "picard MergeVcfs I=$RUN_DIR/output/AGRF2020_cohort.gatk.snps.gt_filtered.vcf.gz I=$RUN_DIR/output/AGRF2020_cohort.gatk.indels.gt_filtered.vcf.gz O=$RUN_DIR/output/AGRF2020_cohort.gatk.gt_filtered.combined.vcf.gz" > $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

Now we can run the full GATK pipeline.

WORK_DIR="/home/ibar/adna/A_rabiei"
RUN_DIR="$WORK_DIR/AGRF_gatk_13_03_2025"
CONDA_NAME="genomics"
REF_DIR="/scratch/project/adna/A_rabiei/A_rabiei_TECAN_2022/ref_genome"
GENOME="$REF_DIR/ArME14_v2_CCDM"
REF_VCF="$RUN_DIR/output/AGRF2020_cohort.gatk.gt_filtered.combined.vcf.gz"
BAM_DIR="$RUN_DIR/aligned_reads"
JOBNAME="agrf-gatk-bqsr"

mkdir -p $RUN_DIR/bqsr && cd $RUN_DIR

# step 1  - Build BQSR model and apply BQSR
find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" -size +1M | parallel --dry-run "gatk --java-options \"-Xmx7g\" BaseRecalibrator -I {} -R $GENOME.fa --known-sites $REF_VCF -O $RUN_DIR/bqsr/{/.}.recal_data.table; gatk --java-options \"-Xmx7g\" ApplyBQSR -I {} -R $GENOME.fa --bqsr-recal-file $RUN_DIR/bqsr/{/.}.recal_data.table -O bqsr/{/.}.bqsr.bam" > $RUN_DIR/$JOBNAME.cmds


# submit to the cluster
sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm


NCORES=2
MEM=16
WALLTIME="2:00:00"
JOBNAME="agrf-gatk-gvcf"
mkdir -p $RUN_DIR/output2
# step 2  - Create GVCF files per sample
find $RUN_DIR/bqsr -maxdepth 1 -name "*.rg.csorted.bqsr.bam" -size +1M | parallel --dry-run --rpl "{sample} s:.+\/(.+?).dedup.rg.csorted.bqsr.bam:\1:" "gatk --java-options \"-Xmx7g\" HaplotypeCaller -I {} -R $GENOME.fa -ERC GVCF -O output/{sample}.g.vcf.gz"> $RUN_DIR/$JOBNAME.cmds


# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

JOBNAME="gatk-combine-gvcf"
# step 3  - Combine GVCF files and call variants
GCVF_FILES=$(find $RUN_DIR/output -maxdepth 1 -name "AR*.g.vcf.gz" | gawk '{printf " -V %s", $1}')
echo "gatk --java-options \"-Xmx7g\" CombineGVCFs -R $GENOME.fa $GCVF_FILES -O $RUN_DIR/output2/AGRF2020_cohort.g.vcf.gz
gatk IndexFeatureFile -I $RUN_DIR/output2/AGRF2020_cohort.g.vcf.gz
gatk --java-options \"-Xmx7g\" GenotypeGVCFs -R $GENOME.fa -V $RUN_DIR/output2/AGRF2020_cohort.g.vcf.gz -O $RUN_DIR/output2/AGRF2020_cohort.gatk.vcf.gz"> $RUN_DIR/$JOBNAME.cmds

# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f4 -d " ")

# step 4a  - Filter variants (maybe ExcessHet > 10)
QUAL=30 # 30
MQ=40
GQ=30
MAX_DP=100000
MIN_DP=10
IND_DP=10
JOBNAME="agrf-gatk-filter-snps"
echo "gatk SelectVariants -V $RUN_DIR/output2/AGRF2020_cohort.gatk.vcf.gz -select-type SNP -O $RUN_DIR/output2/AGRF2020_cohort.gatk.snps.vcf.gz 
gatk VariantFiltration -V $RUN_DIR/output2/AGRF2020_cohort.gatk.snps.vcf.gz \
    -filter \"QD < 2.0\" --filter-name \"QD2\" \
    -filter \"QUAL < $QUAL.0\" --filter-name \"QUAL$QUAL\" \
    -filter \"SOR > 3.0\" --filter-name \"SOR3\" \
    -filter \"FS > 60.0\" --filter-name \"FS60\" \
    -filter \"MQ < $MQ.0\" --filter-name \"MQ$MQ\" \
    -filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
    -filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
    --genotype-filter-expression \"isHet == 1\" \
    --genotype-filter-name \"isHetFilter\" \
    --genotype-filter-expression \"DP < $IND_DP.0\" \
    --genotype-filter-name \"DP$IND_DP\" \
    --genotype-filter-expression \"GQ < $GQ.0\" \
    --genotype-filter-name \"GQ$GQ\" \
    -O $RUN_DIR/output2/AGRF2020_cohort.gatk.snps.filtered.vcf.gz
gatk SelectVariants \
  -V $RUN_DIR/output2/AGRF2020_cohort.gatk.snps.filtered.vcf.gz \
  --set-filtered-gt-to-nocall \
  --max-fraction-filtered-genotypes 0.5 \
  --exclude-filtered true \
  -O $RUN_DIR/output2/AGRF2020_cohort.gatk.snps.gt_filtered.vcf.gz" > $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm| cut -f4 -d " ")

# step 4b  - Filter variants
JOBNAME="agrf-gatk-filter-indels"
echo "gatk SelectVariants -V $RUN_DIR/output2/AGRF2020_cohort.gatk.vcf.gz -select-type INDEL -O $RUN_DIR/output2/AGRF2020_cohort.gatk.indels.vcf.gz 
gatk VariantFiltration \
    -V $RUN_DIR/output2/AGRF2020_cohort.gatk.indels.vcf.gz \
    -filter \"QD < 2.0\" --filter-name \"QD2\" \
    -filter \"QUAL < $QUAL.0\" --filter-name \"QUAL$QUAL\" \
    -filter \"FS > 200.0\" --filter-name \"FS200\" \
    -filter \"ReadPosRankSum < -20.0\" --filter-name \"ReadPosRankSum-20\" \
    --genotype-filter-expression \"isHet == 1\" \
    --genotype-filter-name \"isHetFilter\" \
    --genotype-filter-expression \"DP < $IND_DP.0\" \
    --genotype-filter-name \"DP$IND_DP\" \
    --genotype-filter-expression \"GQ < $GQ.0\" \
    --genotype-filter-name \"GQ$GQ\" \
    -O $RUN_DIR/output2/AGRF2020_cohort.gatk.indels.filtered.vcf.gz
gatk SelectVariants \
  -V $RUN_DIR/output2/AGRF2020_cohort.gatk.indels.filtered.vcf.gz \
  --set-filtered-gt-to-nocall \
  --max-fraction-filtered-genotypes 0.5 \
  -O $RUN_DIR/output2/AGRF2020_cohort.gatk.indels.gt_filtered.vcf.gz" > $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f4 -d " ")

# Step 5 - Combine VCFs

JOBNAME="agrf-gatk-combine-vcfs"
echo "picard MergeVcfs I=$RUN_DIR/output2/AGRF2020_cohort.gatk.snps.gt_filtered.vcf.gz I=$RUN_DIR/output2/AGRF2020_cohort.gatk.indels.gt_filtered.vcf.gz O=$RUN_DIR/output2/AGRF2020_cohort.gatk.gt_filtered.combined.vcf.gz" > $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f4 -d " ")

# clean output folders
rm -rf $RUN_DIR/output $RUN_DIR/aligned_reads $RUN_DIR/bqsr
CONDA_NAME="genomics"
# install tools
mamba install -y -n $CONDA_NAME bwa-mem2 bowtie2 biobambam sambamba qualimap multiqc fastp gatk4

# create working directory
WORK_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS"
RUN_DIR="$WORK_DIR/GATK_Murdoch_WGRS_04_04_2025"
mkdir -p $RUN_DIR/bqsr $RUN_DIR/gvcf $RUN_DIR/output && cd $RUN_DIR
# Set variables
GENOME="$REF_DIR/ArME14_v2_CCDM"
AGRF_GATK_DIR="/scratch/project/adna/A_rabiei/AGRF_gatk_13_03_2025"
REF_VCF="$AGRF_GATK_DIR/output/AGRF2020_cohort.gatk.gt_filtered.combined.vcf.gz"
BAM_DIR="$AGRF_GATK_DIR/aligned_reads"
JOBNAME="agrf-gatk-bqsr-gvcf"

# prepare AGRF batch (original isolates)
# select isolates
ISOLATES="Ar0020|Ar0023|Ar0212|AR0242|AR0052|AR0210|AR0022|AR0128"
find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" -size +1M | egrep -i $ISOLATES | parallel --dry-run --rpl "{sample} s:.+\/(.+?).dedup.rg.csorted.bam:\1:" "gatk --java-options \"-Xmx7g\" BaseRecalibrator -I {} -R $GENOME.fa --known-sites $REF_VCF -O $RUN_DIR/bqsr/{/.}.recal_data.table; gatk --java-options \"-Xmx7g\" ApplyBQSR -I {} -R $GENOME.fa --bqsr-recal-file $RUN_DIR/bqsr/{/.}.recal_data.table -O bqsr/{/.}.bqsr.bam; gatk --java-options \"-Xmx7g\" HaplotypeCaller -I $RUN_DIR/bqsr/{/.}.bqsr.bam -R $GENOME.fa -ERC GVCF -O $RUN_DIR/gvcf/{sample}.g.vcf.gz" > $RUN_DIR/$JOBNAME.cmds

NCORES=2
MEM=8
WALLTIME="3:00:00"
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")


# assign variables
RGPM="DNBseq_T7"
RGPL="MGI"
RGPU="E250038400"
RGCN="MU_SABC"
FQ_DIR="$WORK_DIR/fungaldata"
ln -s $WORK_DIR/fungal/*.fq.gz $FQ_DIR/
BAM_DIR="$RUN_DIR/aligned_reads"
mkdir -p $BAM_DIR $FQ_DIR/trimmed_reads/QC
JOBNAME="Murdoch-fastp-bwa"


find $FQ_DIR/*.R1.fq.gz | parallel -k --dry-run --rpl "{file2} s:.R1:.R2:; uq()" --rpl "{sample} s:.+\/(.+?).R1.fq.gz:\1:"  "fastp -i {} -I {file2} --detect_adapter_for_pe -c -l 30 -p -w \$SLURM_CPUS_PER_TASK -z 7 -o $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz -O $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz -j $FQ_DIR/trimmed_reads/QC/{sample}.fastp.json && bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\tLB:{sample}\tPU:$RGPU\tPL:$RGPL\tPM:$RGPM\tCN:$RGCN\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=$BAM_DIR/{sample}.dedup.rg.csorted.bam.bai > $BAM_DIR/{sample}.dedup.rg.csorted.bam"  > $RUN_DIR/$JOBNAME.cmds

NCORES=12
MEM=96
WALLTIME="5:00:00"

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# step 0 - assess coverage of files
JOBNAME="Murdoch-gatk-coverage"
mkdir -p $RUN_DIR/coverage

find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" -size +1M | parallel --dry-run --rpl "{sample} s:.+\/(.+?).dedup.rg.csorted.bam:\1:" "samtools idxstats {} | gawk -vreadlen=150 '{len += \$2; nreads += \$3} END {cov=nreads * readlen / len; printf \"x%s\n\", cov}' > $RUN_DIR/coverage/{sample}.coverage" > $RUN_DIR/$JOBNAME.cmds

NCORES=12
MEM=64
WALLTIME="3:00:00"

# submit the job array to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/parallel_jobs_run.slurm
# summarise coverage
find $RUN_DIR/coverage -maxdepth 1 -name "*.coverage"  | parallel "printf \"{}\t %s\n\" \$(cat {})" > coverage_summary.txt


# step 1  - Build BQSR model and apply BQSR
JOBNAME="Murdoch-gatk-bqsr-gvcf"

find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" -size +1M | parallel --dry-run --rpl "{sample} s:.+\/(.+?).dedup.rg.csorted.bam:\1:" "gatk --java-options \"-Xmx7g\" BaseRecalibrator -I {} -R $GENOME.fa --known-sites $REF_VCF -O $RUN_DIR/bqsr/{/.}.recal_data.table; gatk --java-options \"-Xmx7g\" ApplyBQSR -I {} -R $GENOME.fa --bqsr-recal-file $RUN_DIR/bqsr/{/.}.recal_data.table -O $RUN_DIR/bqsr/{/.}.bqsr.bam; gatk --java-options \"-Xmx7g\" HaplotypeCaller -I $RUN_DIR/bqsr/{/.}.bqsr.bam -R $GENOME.fa -ERC GVCF -O $RUN_DIR/gvcf/{sample}.g.vcf.gz" > $RUN_DIR/$JOBNAME.cmds

NCORES=4
MEM=12
WALLTIME="3:00:00"

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# step 3  - Combine GVCF files, call and filter variants
QUAL=30
MQ=30
DP=10
MAX_DP=100000
JOBNAME="Murdoch-gatk-combine-vcf"
GCVF_FILES=$(find $RUN_DIR/gvcf -maxdepth 1 -name "*.g.vcf.gz" | gawk '{printf " -V %s", $1}')
echo "gatk --java-options \"-Xmx7g\" CombineGVCFs -R $GENOME.fa $GCVF_FILES -O $RUN_DIR/gvcf/Murdoch_WGRS_cohort.g.vcf.gz
gatk IndexFeatureFile -I $RUN_DIR/gvcf/Murdoch_WGRS_cohort.g.vcf.gz
gatk --java-options \"-Xmx7g\" GenotypeGVCFs -R $GENOME.fa -V $RUN_DIR/gvcf/Murdoch_WGRS_cohort.g.vcf.gz -O $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.vcf.gz"> $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

# step 4  - Filter variants
JOBNAME="Murdoch-gatk-vcf-filter"
echo "bcftools filter -S . -e \"GT=='het' | FMT/DP<$DP\" $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.vcf.gz -O v | SnpSift filter \"( QUAL>=$QUAL ) & ( MQ>=$MQ ) & ( DP<=$MAX_DP ) & ( countRef()>=1 & countVariant()>=1 )\" | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.gt_filtered.vcf.gz 
vcftools --gzvcf $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.gt_filtered.vcf.gz --recode --recode-INFO-all --minQ $QUAL --remove-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.snps.gt_filtered.vcf.gz 
vcftools --gzvcf $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.gt_filtered.vcf.gz --recode --recode-INFO-all --minQ $QUAL --keep-only-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/output/Murdoch_WGRS_cohort.gatk.indels.gt_filtered.vcf.gz "> $RUN_DIR/$JOBNAME.cmds
# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

# Step 5 - generate stats 
JOBNAME="bcftools_stats"
WALLTIME=2:00:00
MEM=8
NCORES=4
find . -name "*.vcf.gz" | parallel --dry-run "bcftools stats -s - {} > {.}.bcfstats.txt" > $JOBNAME.cmds
# send to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/parallel_jobs_run.slurm | cut -f 4 -d " ")

# clean output folders
rm -r $RUN_DIR/gvcf/Ar*.g.vcf.gz* $RUN_DIR/bqsr $RUN_DIR/aligned_reads

3.4.3 MultiQC

Quality metrics were collected from the raw read QC, alignment and variant calling steps and were consolidated into a single, interactive report for each batch using MultiQC v1.21 (Ewels and Magnusson 2016).

NCORES=8
MEM=32
WALLTIME="10:00:00"
JOBNAME="Multiqc_Murdoch_WGRS"
# multiqc report
MULTIQC_JOB=QC_$(date +%d_%m_%Y)
# submit it as a Slurm job
echo "multiqc --interactive --force -i $MULTIQC_JOB -o $MULTIQC_JOB ." > $JOBNAME.cmds
# submit the job 
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " " )
# Done!

# Copy files (excluding HTML reports) to SharePoint
rclone copy -P --exclude "**/*.html" $RUN_DIR GRDC_rabiei:General/Projects/Hayley_PhD/Genetics/Murdoch_WGRS/SNP_calling_24_01_2025
# Copy html files to SharePoint
rclone copy -P --ignore-checksum --ignore-size --include "**/*.html" $RUN_DIR GRDC_rabiei:General/Projects/Hayley_PhD/Genetics/Murdoch_WGRS/SNP_calling_24_01_2025

3.5 Assessing the Sequencing Depth Effect on Clustering

  • What depth of sequencing is required to achieve good (accurate) clustering of the samples (and how do we measure this)?

To do this we can use the same set of samples (and controls) used to determine the optimal pipeline, but resample the reads to x5, x10, x20 and x30 coverage, which is equivalent to roughly 700,000, 1,400,000, 2,800,000 and 4,200,000 PE 150 bp read pairs. It is easiest to do this on the aligned BAM files (e.g. with BBTools’ reformat.sh or seqtk, see this BioStar thread), as sketched below.
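A minimal sketch of this resampling step, assuming the deduplicated BAM files from Section 3.3, a ~41 Mb reference and that reformat.sh counts individual reads (not pairs); the paths, seed and job resources are illustrative:

# subsample each deduplicated BAM to ~x5/x10/x20/x30 coverage with BBTools reformat.sh
CONDA_NAME="genomics"
BAM_DIR="$RUN_DIR/aligned_reads"
SUB_DIR="$RUN_DIR/subsampled_bams"
mkdir -p $SUB_DIR
GENOME_SIZE=41000000  # approximate ArME14 assembly size (assumption)
READ_LEN=150
JOBNAME="subsample-bams"
# one command per BAM x coverage combination (target reads = coverage * genome size / read length)
parallel --dry-run "reformat.sh in={1} out=$SUB_DIR/{1/.}.x{2}.bam sampleseed=42 samplereadstarget=\$(( {2} * $GENOME_SIZE / $READ_LEN ))" ::: $BAM_DIR/*.dedup.rg.csorted.bam ::: 5 10 20 30 > $RUN_DIR/$JOBNAME.cmds
# submit as a job array, as in the other steps
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=4 --mem=16G --time=1:00:00 --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")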

3.6 Comparison of Alignment and SNP calling pipelines

We will test different alignment tools and parameters (BWA-MEM2, Bowtie2) with different variant callers (GATK, FreeBayes, Clair3) and a complete Nextflow pipeline implementing those (Sarek) to assess and identify the optimal SNP-calling pipeline that strikes a good balance between sensitive SNP discovery and the lowest error rates (as measured by heterozygote calls and inconsistencies between technical replicates [AR0039, AR0052, AR0242] and low-coverage samples [Ar23640, Ar23497, Ar23483]). To compare the VCF files we will use RTG Tools (Cleary et al. 2015). A similar approach was described for another haploid fungus, Candida auris, by Li et al. (2023).
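A minimal sketch of the VCF comparison with RTG Tools vcfeval; the input file names, the chosen control sample and the output directory are placeholders, and both VCFs must be bgzip-compressed and tabix-indexed:

# install RTG Tools and build the SDF-formatted reference once
mamba install -y -n $CONDA_NAME rtg-tools
conda activate $CONDA_NAME
rtg format -o $REF_DIR/ArME14_v2_CCDM.sdf $GENOME.fa
# compare one pipeline's calls against another's for a shared control isolate
# (--squash-ploidy ignores zygosity differences, useful for the haploid-called-as-diploid setup)
rtg vcfeval -b pipelineA.calls.vcf.gz -c pipelineB.calls.vcf.gz \
    -t $REF_DIR/ArME14_v2_CCDM.sdf --sample AR0242 \
    --squash-ploidy -o $RUN_DIR/vcfeval_AR0242_A_vs_B
# precision, recall and F-measure are reported in summary.txt inside the output folder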

3.6.1 Compare alignment tools

We will test the following alignment options:
1. BWA-MEM2 with -B ranging from 4 to 8 and -O 6/7/8 (which includes the default -B 4 -O 6).
2. Bowtie2 in local alignment mode: --fast-local, --sensitive-local (the local-mode default) and --very-sensitive-local.
3. The same presets in end-to-end (global) alignment mode: --fast, --sensitive (default), --very-sensitive.
Running each combination in parallel takes up too much space on the cluster. A better approach is to align the reads, call variants with the different callers for each combination, and then remove the BAM files.

WORK_DIR="/scratch/ibar/A_rabiei"
WORK_DIR="/scratch/project/adna/A_rabiei"

# GENOME="$WORK_DIR/A_rabiei_TECAN_2022/ref_genome/ArME14_v2_CCDM"
CONDA_NAME="genomics"
# install tools
mamba install -y -n $CONDA_NAME bwa-mem2 bowtie2 biobambam sambamba qualimap multiqc fastp
# assign variables
REF_DIR="/scratch/project/adna/A_rabiei/A_rabiei_TECAN_2022/ref_genome"
GENOME="$REF_DIR/ArME14_v2_CCDM"
# build index for bowtie2
# conda activate $CONDA_NAME
# or run with a container (need to login to a compute node)
start_debug_interactive_job
apptainer exec $NXF_APPTAINER_CACHEDIR/depot.galaxyproject.org-singularity-bowtie2-2.4.4--py39hbb4e92a_0.img bowtie2-build $GENOME.fa $GENOME

MURDOCH_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS/fungal"
AGRF_DIR="/scratch/project/adna/A_rabiei/AGRF_CAGRF20114478"
# create a working folder
RUN_DIR="$WORK_DIR/SNP_calling_comparison"
FQ_DIR="$RUN_DIR/data"
mkdir -p $RUN_DIR/aligned_reads $RUN_DIR/trimmed_reads/QC $FQ_DIR && cd $RUN_DIR

# select isolates
ISOLATES="Ar23401_|Ar0020|Ar0023|Ar0212|AR0242|AR0052|AR0210|AR0022|AR0128|Ar23640|Ar23653_|Ar23497|Ar23483"

# Create symlinks to Murdoch reads
ls -1 $MURDOCH_DIR/*.fq.gz | egrep -i $ISOLATES | parallel ln -s {} $FQ_DIR/
# Create symlinks to AGRF reads
ls -1 $AGRF_DIR/*.fastq.gz | egrep -i $ISOLATES | parallel ln -s {} $FQ_DIR/

# standardise names
# AR0070_H3HGFDSX2_CTGACTCTAC-TGACGGCCGT_L002_R1.fastq.gz
cd $FQ_DIR
rename -v 's|^(.+?)_H3HGFDSX2_[A-Z\-]+_L002_(R[0-9])\.fastq\.gz$|\1.\2.fq.gz|' *.fastq.gz

cd $RUN_DIR

NCORES=12
MEM=64
WALLTIME="2:00:00"
JOBNAME="process_reads"
# create commands
find $FQ_DIR/*.R1.fq.gz | parallel -k --dry-run --rpl "{file2} s:.R1:.R2:; uq()" --rpl "{sample} s:.+\/(.+?).R1.fq.gz:\1:"  "fastp -i {} -I {file2} --detect_adapter_for_pe -c -l 30 -p -w \$SLURM_CPUS_PER_TASK -z 7 -o $RUN_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz -O $RUN_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz -j $RUN_DIR/trimmed_reads/QC/{sample}.fastp.json" > $JOBNAME.cmds

# submit to the cluster
sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm



NCORES=12
MEM=64
WALLTIME="1:00:00"
JOBNAME="bwa-mem-align-AGRF"
# Create the bwa-mem2 commands to align the reads to the genome.
parallel -k --dry-run --rpl "{file2} s:_R1:_R2:" --rpl "{sample} s:.+\/(.+?)_R1.+.gz:\1:"  "ALIGN_DIR=$RUN_DIR/aligned_reads/bwamem_B{2}_O{3}; mkdir -p \$ALIGN_DIR && bwa-mem2 mem -B {2} -O {3} -R \"@RG\tID:{1 sample}\tSM:{1 sample}\tLB:{1 sample}\tPU:H3HGFDSX2\tPL:ILLUMINA\tCN:AGRF\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa {1} {1 file2} | bamsormadup tmpfile=\${TMPDIR:-/scratch/project/adna/tmp/}/\${SLURM_JOB_NAME} inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] > \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam; unset DISPLAY; qualimap bamqc -bam \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir \$ALIGN_DIR/{1 sample}_bamqc" ::: $(find $RUN_DIR/trimmed_reads -name "AR*_R1.trimmed.fastq.gz") ::: 4 5 6 7 8 ::: 6 7 8 > $RUN_DIR/$JOBNAME.cmds
# submit it as a Slurm job array
sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME  ~/bin/array.slurm

JOBNAME="bwa-mem-align-Murdoch"
# Create the bwa-mem2 commands to align the reads to the genome.
parallel -k --dry-run --rpl "{file2} s:_R1:_R2:" --rpl "{sample} s:.+\/(.+?)_R1.+.gz:\1:"  "ALIGN_DIR=$RUN_DIR/aligned_reads/bwamem_B{2}_O{3}; mkdir -p \$ALIGN_DIR && bwa-mem2 mem -B {2} -O {3} -R \"@RG\tID:{1 sample}\tSM:{1 sample}\tLB:{1 sample}\tPU:E250038400\tPL:MGI-DNBSEQ-T7\tCN:Murdoch\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa {1} {1 file2} | bamsormadup tmpfile=\${TMPDIR:-/scratch/project/adna/tmp/}/\${SLURM_JOB_NAME} inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] > \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam; unset DISPLAY; qualimap bamqc -bam \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir \$ALIGN_DIR/{1 sample}_bamqc" ::: $(find $RUN_DIR/trimmed_reads -name "Ar*_R1.trimmed.fastq.gz") ::: 4 5 6 7 8 ::: 6 7 8 > $RUN_DIR/$JOBNAME.cmds
# submit it as a Slurm job array
sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME  ~/bin/array.slurm

NCORES=12
MEM=64
WALLTIME="5:00:00"
JOBNAME="bowtie2-align-Murdoch"
# Create the bowtie2 commands to align all read pairs (Murdoch batch).
parallel -k --dry-run --rpl "{file2} s:_R1:_R2:" --rpl "{sample} s:.+/(.+?)_R1.+:\1:"  "ALIGN_DIR=$RUN_DIR/aligned_reads/bt2_{2}{3}; mkdir -p \$ALIGN_DIR  && bowtie2 --{2}{3} --rg-id {1 sample} --rg 'SM:{1 sample}' --rg 'LB:{1 sample}' --rg 'PU:E250038400' --rg 'PL:MGI-DNBSEQ-T7' --rg 'CN:Murdoch' -p \$[SLURM_CPUS_PER_TASK - 2] -x $GENOME -1 {1} -2 {1 file2} | bamsormadup tmpfile=\${TMPDIR:-/scratch/project/adna/tmp/}/\${SLURM_JOB_NAME} inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] > \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam; unset DISPLAY; qualimap bamqc -bam \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir \$ALIGN_DIR/{1 sample}_bamqc" ::: $(find $RUN_DIR/trimmed_reads -name "Ar*_R1.trimmed.fastq.gz") ::: "fast" "sensitive" "very-sensitive" ::: "-local" ""  > $RUN_DIR/$JOBNAME.cmds

# submit it as a Slurm job array
sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME  ~/bin/array.slurm


JOBNAME="bowtie2-align-AGRF"
# Create the bowtie2 commands to align all read pairs (AGRF batch).
parallel -k --dry-run --rpl "{file2} s:_R1:_R2:" --rpl "{sample} s:.+/(.+?)_R1.+:\1:"  "ALIGN_DIR=$RUN_DIR/aligned_reads/bt2_{2}{3}; mkdir -p \$ALIGN_DIR  && bowtie2 --{2}{3} --rg-id {1 sample} --rg 'SM:{1 sample}' --rg 'LB:{1 sample}' --rg 'PU:H3HGFDSX2' --rg 'PL:ILLUMINA' --rg 'CN:AGRF' -p \$[SLURM_CPUS_PER_TASK - 2] -x $GENOME -1 {1} -2 {1 file2} | bamsormadup tmpfile=\${TMPDIR:-/scratch/project/adna/tmp/}/\${SLURM_JOB_NAME} inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] > \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam; unset DISPLAY; qualimap bamqc -bam \$ALIGN_DIR/{1 sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir \$ALIGN_DIR/{1 sample}_bamqc" ::: $(find $RUN_DIR/trimmed_reads -name "AR*_R1.trimmed.fastq.gz") ::: "fast" "sensitive" "very-sensitive" ::: "-local" ""  > $RUN_DIR/$JOBNAME.cmds

# submit it as a Slurm job array
sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME  ~/bin/array.slurm

#  Multi-bamqc: create a file listing each sample name and its bamqc folder (multibamqc.samples) and use it with qualimap multi-bamqc

find `pwd` -name "*_bamqc" | gawk '{sample_name=gensub(/.+\/(.+)_bamqc/, "\\1", $1); printf "%s\t%s\n", sample_name, $1}' > multibamqc.samples
 

JOBNAME="multi-bamqc"
echo "unset DISPLAY; qualimap multi-bamqc --java-mem-size=16G -d $RUN_DIR/multibamqc.samples -outformat PDF:HTML -outdir $RUN_DIR/multi_bamqc" > $RUN_DIR/$JOBNAME.cmds
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=32G --time=5:00:00 --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

3.6.2 Compare variant calling tools

Next, we will call variants with different variant callers (GATK, FreeBayes, Clair3) and with a complete Nextflow pipeline implementing those (Sarek), as sketched below.
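A minimal sketch of launching nf-core/sarek for the FreeBayes and HaplotypeCaller runs; the samplesheet layout, release pin and parameter choices (including skipping BQSR because no known-sites VCF is supplied for this custom genome) are assumptions to verify against the Sarek documentation, and Clair3 is not bundled with Sarek so it would be run separately:

# run nf-core/sarek with a custom reference genome and two of the variant callers under test
# (samplesheet.csv must follow the sarek input schema - patient,sample,lane,fastq_1,fastq_2 in recent releases)
nextflow run nf-core/sarek -r 3.4.0 -profile singularity \
    --input samplesheet.csv \
    --fasta $GENOME.fa --igenomes_ignore \
    --skip_tools baserecalibration \
    --tools freebayes,haplotypecaller \
    --outdir $RUN_DIR/sarek_results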

3.7 Run Variant Calling with Selected Pipeline (Murdoch 2024)

The selected pipeline (using bwa-mem2 for alignment and freebayes in diploid mode for variant calling; see Sections 3.3 and 3.4) was then used to call SNPs from the entire population.

3.7.1 Process reads and align to the genome

See detailed methods in Section 3.3.

# setup workspace
CONDA_NAME="genomics" 
mamba install -n $CONDA_NAME mosdepth # megadepth 
REF_DIR="/scratch/project/adna/A_rabiei/A_rabiei_TECAN_2022/ref_genome"
GENOME="$REF_DIR/ArME14_v2_CCDM"
WORK_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS"
FQ_DIR="$WORK_DIR/fungal344"
# _H3HGFDSX2_CCTCCGGTTG-TGGTGTTATG_L002_R2.fastq.gz

# DATE=$(date +%d_%m_%Y)
RUN_DIR="$WORK_DIR/Batch2024_FB_SNP_calling"
BAM_DIR="$RUN_DIR/aligned_reads"
mkdir -p $BAM_DIR $FQ_DIR/combined_libs/trimmed_reads/QC && cd $RUN_DIR

NCORES=8
MEM=32
WALLTIME="3:00:00"
JOBNAME="MU-batch24-combine-reads"

# merge files from duplicated libraries
ls -1 $FQ_DIR/*.R1.fq.gz | egrep "E[0-9]+" | sed -r 's:.+\/(.+?).E[0-9]+.R([12]).fq.gz:\1:' | uniq | parallel -a - --dry-run "cat $FQ_DIR/{1}.*.R{2}.fq.gz > $FQ_DIR/combined_libs/{1}.R{2}.fq.gz" ::: 1 2 > $JOBNAME.cmds

# submit the job array to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/parallel_jobs_run.slurm | cut -f 4 -d " ")


# copy across control isolates reads from WGS 2020 batch
ISOLATES="Ar0020|Ar0023|Ar0212|AR0210|AR0022|AR0128" # AR0242|AR0052|
find /scratch/project/adna/A_rabiei/AGRF_CAGRF20114478 -name "*.fastq.gz" | egrep -i $ISOLATES | parallel --rpl  "{target} s:.+\/(.+?)_H3HGFDSX2_[CATG-]+_L002_R([12]).fastq.gz:\1.R\2.fq.gz:" ln -s {} $FQ_DIR/combined_libs/{target}

FQ_DIR="$FQ_DIR/combined_libs"

JOBNAME="MU-batch24-fastp-bwa"
NCORES=12
MEM=96
WALLTIME="3:00:00"
RGPM="DNBseq_T7"
RGPL="MGI"
RGPU="E250076899" # also E250076916
RGCN="MU_SABC"

find $FQ_DIR/*.R1.fq.gz | parallel -k --dry-run --rpl "{file2} s:.R1:.R2:; uq()" --rpl "{sample} s:.+\/(.+?).R1.fq.gz:\1:"  "fastp -i {} -I {file2} --detect_adapter_for_pe -c -l 30 -p -w \$SLURM_CPUS_PER_TASK -z 7 -o $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz -O $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz -j $FQ_DIR/trimmed_reads/QC/{sample}.fastp.json -h $FQ_DIR/trimmed_reads/QC/{sample}.fastp.html && bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\tLB:{sample}\tPU:$RGPU\tPL:$RGPL\tPM:$RGPM\tCN:$RGCN\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=$BAM_DIR/{sample}.dedup.rg.csorted.bam.bai > $BAM_DIR/{sample}.dedup.rg.csorted.bam ; unset DISPLAY ; qualimap bamqc -bam $BAM_DIR/{sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir $BAM_DIR/{sample}_bamqc; mosdepth -t \$SLURM_CPUS_PER_TASK -x -n $BAM_DIR/{sample}_bamqc/{sample} $BAM_DIR/{sample}.dedup.rg.csorted.bam"  > $RUN_DIR/$JOBNAME.cmds

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# extract coverage information
find aligned_reads -name "genome_results.txt" | xargs grep "mean coverageData" | sed 's/X$//' | gawk '{sum+=$NF}END{print "# Mean Coverage =", sum/NR}' > MU-batch24-coverage-summary.txt
find aligned_reads -name "genome_results.txt" | xargs grep "mean coverageData" >> MU-batch24-coverage-summary.txt

3.7.2 Calling variants using Freebayes

See detailed methods in Section 3.4.

WORK_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS"
RUN_DIR="$WORK_DIR/Batch2024_FB_SNP_calling"
FQ_DIR="$WORK_DIR/fungal344"
REF_DIR="/scratch/project/adna/A_rabiei/A_rabiei_TECAN_2022/ref_genome"
GENOME="$REF_DIR/ArME14_v2_CCDM"

CONDA_NAME="genomics"

BAM_DIR="$RUN_DIR/aligned_reads"

# bring in BAM files (if not prepared in the previous step)
# select isolates
# ISOLATES="Ar0020|Ar0023|Ar0212|AR0210|AR0022|AR0128" # AR0242|AR0052|
# find /scratch/project/adna/A_rabiei/AGRF_gatk_13_03_2025/aligned_reads -maxdepth 1 -name "*.rg.csorted.bam*" | egrep -i $ISOLATES | parallel ln -s {} $BAM_DIR/
# ln -s /scratch/project/adna/A_rabiei/Murdoch_WGRS/GATK_Murdoch_WGRS_04_04_2025/aligned_reads/*.rg.csorted.bam* $BAM_DIR/

# Distributed freebayes (each node runs freebayes-parallel on one contig)
# start_interactive_job
conda activate $CONDA_NAME
# fix library dependencies
find $CONDA_PREFIX -name "libtabixpp.so*" | parallel ln -s {} {.}.0
# ln -s $CONDA_PREFIX/lib/libtabixpp.so.1 $CONDA_PREFIX/lib/libtabixpp.so.0
# split each contig/chromosome to smaller 1e6 bits
# prepare BAM files
JOBNAME="prep_bams"
NCORES=2
MEM=16
WALLTIME="1:00:00"
# submit it as a Slurm job
echo "~/bin/split_ref_by_bai_datasize.py -s 1e6 -r $GENOME.fa.fai $(ls -1S $BAM_DIR/*.dedup.rg.csorted.bam | tail -n1) > $RUN_DIR/ArME14_target_1e6_regions_chr.tsv" > $JOBNAME.cmds
# submit the job 
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

JOBNAME="Batch2024_WGRS_FB_diploid"
NCORES=10
MEM=64
WALLTIME="10:00:00"
# RUN_DIR=$WORK_DIR/SNP_calling_24_01_2025
PLOIDY=2
MIN_DP=7
MAX_DP=100000
# create a folder
mkdir -p $RUN_DIR/FB_array_output
# prepare commands
BAM_FILES=$( find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" | xargs )
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' $RUN_DIR/ArME14_target_1e6_regions_chr.tsv | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$SLURM_CPUS_PER_TASK -f $GENOME.fa --genotype-qualities -g $MAX_DP -C $MIN_DP -p $PLOIDY $BAM_FILES > $RUN_DIR/FB_array_output/{}.combined.vcf" > $RUN_DIR/$JOBNAME.cmds

# exit interactive job
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -sd,)


# merge VCFs
JOBNAME="freebayes-merge"
echo "cat $RUN_DIR/FB_array_output/ArME14_ctg_*.combined.vcf | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq | bgzip -@ \$SLURM_CPUS_PER_TASK -c > A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz; bcftools index A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz" > $JOBNAME.cmds
# submit job to cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

3.7.3 Create and filter final variant file

See detailed methods in Section 3.4.1.

CONDA_NAME="genomics"
# Recode genotypes as missing if below a certain threshold, such as genotyping quality or depth (GQ:DP)  
# keep only polymorphic bi-allelic SNPs, using QUAL>=30 (thresholds set below)

# filter Freebayes variants with SnpSift and vcftools (recode any heterozygote genotype or genotype with DP<10 as missing with bcftools)
QUAL=30 # 30
MQ=30
MAX_DP=100000
MIN_DP=10
IND_DP=7
JOBNAME="Batch2024-wgrs-fb-filter"

echo "bcftools filter -S . -e \"GT=='het' | FMT/DP<$MIN_DP\" $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz -O v | SnpSift filter \"( QUAL>=$QUAL ) & ( countRef()>=1 & countVariant()>=1 )\" | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz; bcftools index $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz 
vcftools --gzvcf $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz --recode --recode-INFO-all --minQ $QUAL --remove-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.snps.Q$QUAL.noHet.poly.recode.vcf.gz ; bcftools index $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.snps.Q$QUAL.noHet.poly.recode.vcf.gz
vcftools --gzvcf $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz --recode --recode-INFO-all --minQ $QUAL --keep-only-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c  > $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.indels.Q$QUAL.noHet.poly.recode.vcf.gz; bcftools index $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.indels.Q$QUAL.noHet.poly.recode.vcf.gz "> $RUN_DIR/$JOBNAME.cmds

NCORES=6
MEM=32
WALLTIME="1:00:00"
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")
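# optional sanity check once the filtering job has finished: count records in the raw and filtered VCFs
for f in $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid*.vcf.gz; do
  printf "%s\t%s\n" "$f" "$(bcftools view -H "$f" | wc -l)"
done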

#bcftools filter -S . -e "GT=='het' | FMT/DP<$MIN_DP" A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.vcf.gz -O v | SnpSift filter "( QUAL>=$QUAL ) & ( DP<=$MAX_DP ) & ( DP>=$MIN_DP ) & ( countRef()>=1 & countVariant()>=1 )" | vcftools --vcf - --recode --recode-INFO-all --minQ $QUAL --max-missing 0.75 --remove-indels -c | bgzip -o A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.GT75.noRep.noHet.poly.recode.vcf.gz && tabix A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.Q$QUAL.GT75.noRep.noHet.poly.recode.vcf.gz
# generate stats 
JOBNAME="bcftools_stats"
WALLTIME=2:00:00
MEM=32
NCORES=8
find . -name "*.vcf.gz" | parallel --dry-run "bcftools stats -s - {} > {.}.bcfstats.txt" > $JOBNAME.cmds
# send to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/parallel_jobs_run.slurm | cut -f 4 -d " ")

An in-house R script (estimate_error_rates_vcf_files.R) was used to estimate genotyping error rates based on the duplicated samples included in the run.
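The script itself is not reproduced here; as a minimal sketch of the underlying idea (the sample names ISOLATE_A and ISOLATE_A_dup are hypothetical placeholders for one duplicated pair), the genotype discordance between two replicates can be computed directly from the filtered VCF with bcftools and gawk:

# hypothetical duplicated pair; repeat (or loop) over all known duplicate pairs and average the rates
bcftools query -s ISOLATE_A,ISOLATE_A_dup -f '[%GT\t]\n' $RUN_DIR/A_rabiei_2024_Murdoch_WGRS_ArME14_v2.bwa2.fb.diploid.snps.Q$QUAL.noHet.poly.recode.vcf.gz | gawk '$1!~/\./ && $2!~/\./ {n++; if($1!=$2) d++} END{if(n) printf "genotype discordance: %.4f (%d/%d sites)\n", d/n, d, n}'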

3.7.4 MultiQC

See detailed methods in Section 3.4.3.

WORK_DIR="/scratch/project/adna/A_rabiei/Murdoch_WGRS"
RUN_DIR="$WORK_DIR/Batch2024_FB_SNP_calling"
cd $RUN_DIR
NCORES=8
MEM=32
WALLTIME="10:00:00"
JOBNAME="Batch2024-wgrs-multiqc"
# link fastp reports
ln -s $FQ_DIR/trimmed_reads/QC $RUN_DIR/
# multiqc report
MULTIQC_JOB=QC_$(date +%d_%m_%Y)
# Prepare MultiQC command
echo "multiqc --interactive -z --force -i $MULTIQC_JOB -o $MULTIQC_JOB ." > $JOBNAME.cmds
# submit the job 
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")
# Done!
mkdir -p slurm_logs
mv *.log slurm_logs/
# Copy files to SharePoint
rclone copy -P --exclude "**/*.html" $RUN_DIR GRDC_rabiei:General/Projects/Hayley_PhD/Genetics/Murdoch_WGRS/Batch2024_FB_SNP_calling
# Copy html files to SharePoint
rclone copy -P --ignore-checksum --ignore-size --include "**/*.html" $RUN_DIR GRDC_rabiei:General/Projects/Hayley_PhD/Genetics/Murdoch_WGRS/Batch2024_FB_SNP_calling

3.8 Run Variant Calling with Selected Pipeline (AGRF 2021)

The selected pipeline (bwa-mem2 for alignment and Freebayes in diploid mode for variant calling; see Sections 3.3 and 3.4) was then used to call SNPs from the entire population.

3.8.1 Process reads and align to the genome

See detailed methods in Section 3.3.

# setup workspace
CONDA_NAME="genomics" 
mamba install -n $CONDA_NAME mosdepth # megadepth 
WORK_DIR="/scratch/project/adna/A_rabiei"
REF_DIR="/scratch/project/adna/A_rabiei/ref_genome"
rclone copy -P GRDC_rabiei:General/GRI2007-001RTX-data/A_rabiei_TECAN_2022/ref_genome $REF_DIR
GENOME="$REF_DIR/ArME14_v2_CCDM"

FQ_DIR="/scratch/project/adna/A_rabiei/AGRF_snippy_05_03_2025"
rclone copy -P --include "*.trimmed.fastq.gz" GRDC_rabiei:General/GRI2007-001RTX-data/A_rabiei_WGS/Variant_Calling/AGRF_snippy_05_03_2025 $FQ_DIR
# _H3HGFDSX2_CCTCCGGTTG-TGGTGTTATG_L002_R2.fastq.gz

# DATE=$(date +%d_%m_%Y)
RUN_DIR="$WORK_DIR/AGRF2021_FB_SNP_calling"
BAM_DIR="$RUN_DIR/aligned_reads"
mkdir -p $BAM_DIR $FQ_DIR/trimmed_reads/QC && cd $RUN_DIR

JOBNAME="AGRF2021-bwa-mem2"
NCORES=12
MEM=96
WALLTIME="3:00:00"
RGPM="NovaSeq"
RGPL="ILLUMINA"
RGPU="H3HGFDSX2"
RGCN="AGRF"


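# per sample: align with bwa-mem2 (embedding read-group info), mark duplicates and coordinate-sort with bamsormadup,
# then run Qualimap bamqc and mosdepth for mapping/coverage QC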
find $FQ_DIR -name "*_R1.trimmed.fastq.gz" | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:" --rpl "{sample} s:.+\/(.+?)_R1.+:\1:" "bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\tLB:{sample}\tPU:$RGPU\tPL:$RGPL\tPM:$RGPM\tCN:$RGCN\" -t \$[SLURM_CPUS_PER_TASK - 2] $GENOME.fa {} {file2} | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=$BAM_DIR/{sample}.dedup.rg.csorted.bam.bai > $BAM_DIR/{sample}.dedup.rg.csorted.bam ; unset DISPLAY ; qualimap bamqc -bam $BAM_DIR/{sample}.dedup.rg.csorted.bam --java-mem-size=32G -c -gff $GENOME.bed -outdir $BAM_DIR/{sample}_bamqc; mosdepth -t \$SLURM_CPUS_PER_TASK -x -n $BAM_DIR/{sample}_bamqc/{sample} $BAM_DIR/{sample}.dedup.rg.csorted.bam"  > $RUN_DIR/$JOBNAME.cmds

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# extract coverage information
find aligned_reads -name "genome_results.txt" | xargs grep "mean coverageData" | sed 's/X$//' | gawk '{sum+=$NF}END{print "# Mean Coverage =", sum/NR}' > AGRF2021-coverage-summary.txt
find aligned_reads -name "genome_results.txt" | xargs grep "mean coverageData" >> AGRF2021-coverage-summary.txt
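# optionally list samples whose mean coverage falls below an arbitrary threshold (10x here; adjust as needed)
grep -v "^#" AGRF2021-coverage-summary.txt | sed 's/X$//' | gawk '$NF+0 < 10 {print}'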

3.8.2 Calling variants using Freebayes

See detailed methods in Section 3.4.

WORK_DIR="/scratch/project/adna/A_rabiei"
REF_DIR="/scratch/project/adna/A_rabiei/ref_genome"
RUN_DIR="$WORK_DIR/AGRF2021_FB_SNP_calling"
FQ_DIR="/scratch/project/adna/A_rabiei/AGRF_snippy_05_03_2025"

GENOME="$REF_DIR/ArME14_v2_CCDM"

CONDA_NAME="genomics"

BAM_DIR="$RUN_DIR/aligned_reads"

# bring in BAM files (if not prepared in the previous step)
# select isolates
# ISOLATES="Ar0020|Ar0023|Ar0212|AR0210|AR0022|AR0128" # AR0242|AR0052|
# find /scratch/project/adna/A_rabiei/AGRF_gatk_13_03_2025/aligned_reads -maxdepth 1 -name "*.rg.csorted.bam*" | egrep -i $ISOLATES | parallel ln -s {} $BAM_DIR/
# ln -s /scratch/project/adna/A_rabiei/Murdoch_WGRS/GATK_Murdoch_WGRS_04_04_2025/aligned_reads/*.rg.csorted.bam* $BAM_DIR/

# Distributed freebayes (each node runs freebayes-parallel on one contig)
# start_interactive_job
conda activate $CONDA_NAME
# fix library dependencies
find $CONDA_PREFIX -name "libtabixpp.so*" | parallel ln -s {} {.}.0
# ln -s $CONDA_PREFIX/lib/libtabixpp.so.1 $CONDA_PREFIX/lib/libtabixpp.so.0
# split each contig/chromosome to smaller 1e6 bits
# prepare BAM files
JOBNAME="prep_ref"
NCORES=2
MEM=16
WALLTIME="1:00:00"
# submit it as a Slurm job
echo "~/bin/split_ref_by_bai_datasize.py -s 1e6 -r $GENOME.fa.fai $(ls -1S $BAM_DIR/*.dedup.rg.csorted.bam | tail -n1) > $RUN_DIR/ArME14_target_1e6_regions_chr.tsv" > $JOBNAME.cmds
# submit the job 
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")
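# Note: split_ref_by_bai_datasize.py is an in-house script that writes a 3-column TSV (contig, start, end)
# consumed by the freebayes-parallel commands below. As a rough, data-size-agnostic fallback (a sketch, not
# the script itself), fixed ~1 Mbp windows can be generated straight from the .fai index:
# gawk -v W=1000000 'BEGIN{OFS="\t"} {for(s=0; s<$2; s+=W){e=(s+W<$2)?s+W:$2; print $1, s, e}}' $GENOME.fa.fai > $RUN_DIR/ArME14_target_1e6_regions_chr.tsv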

JOBNAME="AGRF2021_WGRS_FB_diploid"
NCORES=10
MEM=64
WALLTIME="10:00:00"
# RUN_DIR=$WORK_DIR/SNP_calling_24_01_2025
PLOIDY=2
MIN_DP=7
MAX_DP=100000
# create a folder
mkdir -p $RUN_DIR/FB_array_output
# prepare commands
BAM_FILES=$( find $BAM_DIR -maxdepth 1 -name "*.rg.csorted.bam" | xargs )
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' $RUN_DIR/ArME14_target_1e6_regions_chr.tsv | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$SLURM_CPUS_PER_TASK -f $GENOME.fa --genotype-qualities -g $MAX_DP -C $MIN_DP -p $PLOIDY $BAM_FILES > $RUN_DIR/FB_array_output/{}.combined.vcf" > $RUN_DIR/$JOBNAME.cmds

# exit interactive job
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $RUN_DIR/$JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | cut -f 4 -d " ")

# collect indices of any failed array tasks (these can be resubmitted as a new array over just those indices, as shown earlier)
FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -sd,)


# merge VCFs
JOBNAME="freebayes-merge"
echo "cat $RUN_DIR/FB_array_output/ArME14_ctg_*.combined.vcf | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq | bgzip -@ \$SLURM_CPUS_PER_TASK -c > A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.vcf.gz; bcftools index A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.vcf.gz" > $JOBNAME.cmds
# submit job to cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

3.8.3 Create and filter final variant file

See detailed methods in Section 3.4.1.

CONDA_NAME="genomics"
# Recode genotypes as missing if they fall below a given threshold, such as genotype quality or depth (GQ:DP)
# keep only polymorphic variants with QUAL>=30; genotypes that are heterozygous or have FMT/DP<10 are recoded as missing

# filter Freebayes variants with SnpSift and vcftools (set heterozygous/low-depth genotypes to missing with bcftools)
QUAL=30
MQ=30
MAX_DP=100000
MIN_DP=10
IND_DP=7
JOBNAME="AGRF2021-wgs-fb-filter"

echo "bcftools filter -S . -e \"GT=='het' | FMT/DP<$MIN_DP\" $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.vcf.gz -O v | SnpSift filter \"( QUAL>=$QUAL ) & ( countRef()>=1 & countVariant()>=1 )\" | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz; bcftools index $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz 
vcftools --gzvcf $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz --recode --recode-INFO-all --minQ $QUAL --remove-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c > $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.snps.Q$QUAL.noHet.poly.recode.vcf.gz ; bcftools index $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.snps.Q$QUAL.noHet.poly.recode.vcf.gz
vcftools --gzvcf $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.Q$QUAL.noHet.poly.recode.vcf.gz --recode --recode-INFO-all --minQ $QUAL --keep-only-indels -c | bgzip -@ \$SLURM_CPUS_PER_TASK -c  > $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.indels.Q$QUAL.noHet.poly.recode.vcf.gz; bcftools index $RUN_DIR/A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.indels.Q$QUAL.noHet.poly.recode.vcf.gz "> $RUN_DIR/$JOBNAME.cmds

NCORES=6
MEM=32
WALLTIME="1:00:00"
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")

#bcftools filter -S . -e "GT=='het' | FMT/DP<$MIN_DP" A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.vcf.gz -O v | SnpSift filter "( QUAL>=$QUAL ) & ( DP<=$MAX_DP ) & ( DP>=$MIN_DP ) & ( countRef()>=1 & countVariant()>=1 )" | vcftools --vcf - --recode --recode-INFO-all --minQ $QUAL --max-missing 0.75 --remove-indels -c | bgzip -o A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.Q$QUAL.GT75.noRep.noHet.poly.recode.vcf.gz && tabix A_rabiei_AGRF_WGS2021_ArME14_v2.bwa2.fb.diploid.Q$QUAL.GT75.noRep.noHet.poly.recode.vcf.gz
# generate stats 
JOBNAME="bcftools_stats"
WALLTIME=2:00:00
MEM=32
NCORES=8
find . -name "*.vcf.gz" | parallel --dry-run "bcftools stats -s - {} > {.}.bcfstats.txt" > $JOBNAME.cmds
# send to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/parallel_jobs_run.slurm | cut -f 4 -d " ")

An in-house R script (estimate_error_rates_vcf_files.R) was used to estimate genotyping error rates based on the duplicated samples included in the run (see the sketch in Section 3.7.3).

3.8.4 MultiQC

See detailed methods in Section 3.4.3.

WORK_DIR="/scratch/project/adna/A_rabiei"
REF_DIR="/scratch/project/adna/A_rabiei/ref_genome"
RUN_DIR="$WORK_DIR/AGRF2021_FB_SNP_calling"
cd $RUN_DIR
NCORES=8
MEM=32
WALLTIME="10:00:00"
JOBNAME="AGRF2021-wgrs-multiqc"
# link fastp reports
mkdir -p $RUN_DIR/QC
find $FQ_DIR -name "*.json" | xargs -I{} ln -s {} $RUN_DIR/QC/
# ln -s $FQ_DIR/trimmed_reads/QC $RUN_DIR/
# multiqc report
MULTIQC_JOB=${JOBNAME}_$(date +%d_%m_%Y)
# Prepare MultiQC command
echo "multiqc --interactive -z --force -i $MULTIQC_JOB -o $MULTIQC_JOB ." > $JOBNAME.cmds
# submit the job 
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$RUN_DIR/$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " ")
# Done!
mkdir -p logs
mv *.log logs/
# Copy files to SharePoint
# Copy html files to SharePoint
rclone copy -P --ignore-checksum --ignore-size --include "**/*.html" $RUN_DIR GRDC_rabiei:General/GRI2007-001RTX-data/A_rabiei_WGS/Variant_Calling/AGRF2021_FB_SNP_calling
# copy rest of files
rclone copy -P --exclude "**/*.html" $RUN_DIR GRDC_rabiei:General/GRI2007-001RTX-data/A_rabiei_WGS/Variant_Calling/AGRF2021_FB_SNP_calling

4 General information

This document was last updated on 25 June 2025 using R Markdown (built with R version 4.2.1 (2022-06-23 ucrt)). The source code for this website can be found at https://github.com/IdoBar/HPC_SNP_calling_documentation.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful for authoring documents and reports that embed code, execute it and incorporate the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com, R Markdown: The Definitive Guide and the R Markdown cheatsheet.


Bibliography

Cleary JG, Braithwaite R, Gaastra K, et al. (2015) Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines. bioRxiv 023754. doi: 10.1101/023754
Danecek P, Auton A, Abecasis G, et al. (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158. doi: 10.1093/bioinformatics/btr330
Danecek P, Bonfield JK, Liddle J, et al. (2021) Twelve years of SAMtools and BCFtools. GigaScience 10:giab008. doi: 10.1093/gigascience/giab008
Ewels P, Magnusson M (2016) MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32:3047–3048. doi: 10.1093/bioinformatics/btw354
Garrison E, Marth G (2012) Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 [q-bio.GN]
Heldenbrand JR, Baheti S, Bockol MA, et al. (2019) Recommendations for performance optimizations when using GATK3.8 and GATK4. BMC Bioinformatics 20:557. doi: 10.1186/s12859-019-3169-7
Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993. doi: 10.1093/bioinformatics/btr509
Li X, Muñoz JF, Gade L, et al. (2023) Comparing genomic variant identification protocols for Candida auris. Microb Genom 9:000979. doi: 10.1099/mgen.0.000979
Okonechnikov K, Conesa A, García-Alcalde F (2016) Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32:292–294. doi: 10.1093/bioinformatics/btv566
Ruden DM, Cingolani P, Patel VM, et al. (2012) Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in Genetics 3:35. doi: 10.3389/fgene.2012.00035
Tischler G, Leonard S (2014) Biobambam: Tools for read pair collation based algorithms on BAM files. Source Code for Biology and Medicine 9:13. doi: 10.1186/1751-0473-9-13
Van der Auwera GA, Carneiro MO, Hartl C, et al. (2013) From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics
Vasimuddin Md, Misra S, Li H, Aluru S (2019) Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp 314–324