Experimental Design

Aims

  • Identify genes that play a role in fasting during migration in humpback whales (Megaptera novaeangliae), through differential gene expression analysis between north- and south-migrating whales

Objectives

  1. Identify genes that play a role in fasting during migration in humpback whales (Megaptera novaeangliae), through differential gene expression analysis between north- and south-migrating whales
  2. Assemble the Megaptera novaeangliae adipose tissue transcriptome

Analysis Pipeline

General overview:

  1. Data pre-processing:
    1. Quality check
    2. Adaptor trimming
    3. Post-trim quality check
    4. Error correction (?)
    5. Combine reads from sequencing lanes
  2. De Novo transcriptome assembly of the reads (?)
  3. Homology-based functional annotation of transcripts
  4. Generate counts table per gene (based on reference genome or assembled transcriptome)
  5. Differential expression analysis
  6. Summary statistics and visualisation

Methods

RNA Extraction and Sequencing

RNA was extracted from adipose tissue of 22 north-migrating and 12 south-migrating male humpback whales (Megaptera novaeangliae) using the QIAzol reagent and kit. The RNA was sent for Illumina paired-end sequencing, generating 100 bp reads (Ramaciotti Genome Centre, Sydney).

The reads were downloaded to the QCIF High Performance Computing (HPC) cluster (Bunya) for bioinformatics processing and analyses, following the guidelines by Harvard Informatics (see link) (Freedman et al. 2021).

These included error correction of the reads from each set with Rcorrector v1.5.0 (Song and Florea 2015) and removal of “unfixable” reads. (?)

Reference Transcriptome Annotation

The most recent reference genome of M. novaeangliae (Carminati et al. 2024) was downloaded along with the predicted gene models and annotations from a Dryad repository (doi:10.5061/dryad.dv41ns271; for some reason these are not included in the NCBI submission GCA_041834305.1).

Assemble reference-guided transcriptome

The reads were aligned to the reference genome with HISAT2 (Kim et al. 2019), followed by reconstruction of transcripts with StringTie (or PsiCLASS). Transcript and gene counts were generated from the alignments using featureCounts (Liao et al. 2014), HTSeq-count (Anders et al. 2015) or Ballgown, as detailed in Pertea et al. (2016).

CONDA_NAME="ref-trans" # genomics
mamba create -n $CONDA_NAME hisat2 stringtie htseq psiclass subread gffcompare biobambam fastp rseqc
WORKDIR="/home/ibar/adna/sandbox/OTE14085"
FQ_DIR="$WORKDIR/combined_reads"
GENOME="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta" # HumpbackWhale_Final_Genome_forNCBI.fasta # 
GFF="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.annotation.gff"
gzip -cd $GFF.gz > $GFF
gzip -cd $GENOME.gz > $GENOME
mkdir -p $FQ_DIR/trimmed_reads/QC && cd $FQ_DIR

# process the reads
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="Mnova-fastp"
# find $WORKDIR/combined_reads -maxdepth 1 -name "*_R1.fq.gz" | parallel --dry-run
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.fq.gz:\1:" --rpl "{file2} s:_R1:_R2:" "fastp -i {} -I {file2} --detect_adapter_for_pe -c -l 30 -p -w \$SLURM_CPUS_PER_TASK -z 7 -o $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz -O $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz -j $FQ_DIR/trimmed_reads/QC/{sample}.fastp.json" ::: $(ls -1 $FQ_DIR/*R1.fq.gz) > $JOBNAME.cmds 

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')
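
After the trimming array completes, the fastp JSON reports can be aggregated into a single report for the post-trim quality check. A minimal sketch, assuming MultiQC is available in the environment:

# aggregate all fastp JSON reports into a single MultiQC report
multiqc $FQ_DIR/trimmed_reads/QC -n Mnova_fastp_multiqc -o $FQ_DIR/trimmed_reads/QC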

# align the reads to the genome
mkdir -p $WORKDIR/Mnova_ref_based_assembly/aligned_reads $WORKDIR/Mnova_ref_based_assembly/assembly && cd $WORKDIR/Mnova_ref_based_assembly
CONDA_NAME="ref-trans" # genomics
NCORES=10
MEM=64
WALLTIME=2:00:00
JOBNAME="hisat-build"
# prepare the genome index 
echo "extract_splice_sites.py $GFF > $GFF.ss && extract_exons.py $GFF > $GFF.exon && hisat2-build -p \$[SLURM_CPUS_PER_TASK] --ss $GFF.ss --exon $GFF.exon $GENOME $WORKDIR/Mnova_genome/GIU3625-index" > $JOBNAME.cmds 
# submit the job to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# align the reads to the genome
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="hisat-align"
# find $WORKDIR/combined_reads -maxdepth 1 -name "*_R1.fq.gz" | parallel --dry-run
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.trimmed.fastq.gz:\1:" --rpl "{file2} s:_R1:_R2:" "hisat2 --dta -p \$[SLURM_CPUS_PER_TASK] -x $WORKDIR/Mnova_genome/GIU3625-index -1 {} -2 {file2} | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=aligned_reads/{sample}.dedup.csorted.bam.bai > aligned_reads/{sample}.dedup.csorted.bam" ::: $(ls -1 $FQ_DIR/trimmed_reads/*_R1.trimmed.fastq.gz | grep -v "26S23") > $JOBNAME.cmds 

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# find which jobs failed
FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -s -d ',')

# rerun failed tasks
sbatch -a $FAILED_TASKS --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm 
# assemble transcripts with StringTie
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="stringtie"
parallel --dry-run --rpl "{sample} s:.+/(.+).dedup.csorted.bam:\1:" "stringtie {} --rf -l {sample} -p \$[SLURM_CPUS_PER_TASK] -G $GFF -o assembly/{sample}.gtf" ::: $(ls -1 aligned_reads/*.dedup.csorted.bam | grep -v "26S23") > $JOBNAME.cmds 
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# find which jobs failed
FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -s -d ',')

# merge all transcripts from the different samples
find assembly -name "*.gtf" > mergelist.txt
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="stringtie-merge"

echo "stringtie --merge -p \$[SLURM_CPUS_PER_TASK] -G $GFF -o stringtie_merged.gtf mergelist.txt; gffcompare -r $GFF -G -o merged stringtie_merged.gtf" > $JOBNAME.cmds 
# submit to the cluster
ARRAY_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | gawk '{print $4}')
    
# Estimate transcript abundance with Ballgown
JOBNAME="ballgown"

parallel --dry-run --rpl "{sample} s:.+/(.+).dedup.csorted.bam:\1:" "mkdir -p ballgown/{sample}; stringtie -e -B  -p \$[SLURM_CPUS_PER_TASK] -G stringtie_merged.gtf -o ballgown/{sample}/{sample}.gtf {}" ::: $(ls -1 aligned_reads/*.dedup.csorted.bam | grep -v "26S23") > $JOBNAME.cmds 
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')
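
As an alternative to Ballgown, gene-level counts (step 4 of the overview) can be generated directly from the alignments with featureCounts, as mentioned above. A minimal sketch, assuming a reverse-stranded library (hence -s 2) and the paired-end syntax of subread >= 2.0:

# count read pairs per gene against the merged StringTie annotation
featureCounts -p --countReadPairs -s 2 -T $NCORES -a stringtie_merged.gtf -o gene_counts.txt aligned_reads/*.dedup.csorted.bam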

Annotate transcriptome with BLAST

The reference transcriptome was annotated against the non-redundant nucleotide database of the NCBI (nt) to achieve more accurate species-specific annotations. BLASTn was run with blast-nf, a Nextflow implementation that uses a “split-combine” approach to split the input query (entire transcriptome or proteome) into smaller “chunks” that are run in parallel on the HPC cluster.

WORKDIR="/home/ibar/adna/sandbox/OTE14085/Mnova_genome"
# unzip transcript and protein files
pigz -cd $WORKDIR/GIU3625_Humpback_whale.transcript.fasta.gz > $WORKDIR/GIU3625_Humpback_whale.transcript.fasta
JOBNAME=blastn_tax_nf
mkdir -p $WORKDIR/Annotation/$JOBNAME && cd $WORKDIR/Annotation/$JOBNAME
# run Nextflow Blast pipeline
~/bin/nextflow-22.11.1-edge-all run ~/etc/tools/blast-nf/main.nf --app blastn --dbDir /scratch/project/adna/tools/ncbi_db --dbName nt --query $WORKDIR/GIU3625_Humpback_whale.transcript.fasta --outfmt "6 std stitle staxids ssciname scomname" --options '-evalue 1e-10 -max_target_seqs 20' --chunkSize 500 --outdir  $WORKDIR/Annotation/$JOBNAME/results --outfileName GIU3625_Humpback_whale.transcript.nt_tax.outfmt6  -c ~/.nextflow/bunya.config -profile bunya,apptainer -with-tower
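
The tabular results can then be reduced to a single best hit per transcript for downstream annotation tables. A minimal sketch (the output file name is illustrative), sorting by bitscore (column 12) and keeping the first record per query:

# keep only the highest-bitscore hit per transcript
sort -k1,1 -k12,12gr results/GIU3625_Humpback_whale.transcript.nt_tax.outfmt6 | sort -u -k1,1 --merge > GIU3625_Humpback_whale.transcript.nt_tax.best_hits.outfmt6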

Considering that we’re dealing with a mammalian transcriptome (similar to well-annotated mammal species), it may be useful to annotate the transcripts against the refseq database (which only contains curated gene transcripts).

# setup environment
CONDA_NAME="genomics"
mamba install -n $CONDA_NAME rcorrector agat ncbi-datasets-cli pbgzip google-cloud-sdk awscli
# download recent BLAST databases
start_interactive_job
cd /scratch/project/adna/tools/ncbi_db # on Bunya
apptainer exec -B /home/ibar/adna/tools  -B $TMPDIR:/temp $NXF_SINGULARITY_CACHEDIR/ncbi-blast-latest.img update_blastdb.pl --decompress --verbose --source gcp --num_threads $SLURM_CPUS_PER_TASK swissprot refseq_protein refseq_rna

WORKDIR="/home/ibar/adna/sandbox/OTE14085/Mnova_genome"
# unzip transcript and protein files
pigz -cd $WORKDIR/GIU3625_Humpback_whale.transcript.fasta.gz > $WORKDIR/GIU3625_Humpback_whale.transcript.fasta
JOBNAME="nf-blastn-refseq"
DB="refseq_rna"
mkdir -p $WORKDIR/Annotation/$JOBNAME && cd $WORKDIR/Annotation/$JOBNAME
# run Nextflow Blast pipeline
~/bin/nextflow-22.11.1-edge-all run ~/etc/tools/blast-nf/main.nf --app blastn --dbDir /scratch/project/adna/tools/ncbi_db --dbName $DB --query $WORKDIR/GIU3625_Humpback_whale.transcript.fasta --outfmt "6 std stitle staxids ssciname scomname" --options '-evalue 1e-10 -max_target_seqs 20' --chunkSize 500 --outdir  $WORKDIR/Annotation/$JOBNAME/results --outfileName GIU3625_Humpback_whale.transcript.refseq.tax.outfmt6  -c ~/.nextflow/bunya.config -profile bunya,apptainer -with-tower

Annotate proteome with DIAMOND

Despite using blast-nf, BLASTp was painfully slow (taking ~80 hours for a chunk of 500 proteins), so an alternative approach was taken using DIAMOND (Buchfink et al. 2015), which offers similar accuracy for protein homology searches and produces output identical to BLASTp, at much greater speed (up to 1,000× faster). We used the Nextflow-enabled version of DIAMOND (reducing the run time to ~1.5 h for the entire proteome).

start_long_interactive_job # start interactive job on Bunya
WORKDIR="/home/ibar/adna/sandbox/OTE14085/Mnova_genome"

# unzip protein fasta
pigz -cd $WORKDIR/GIU3625_Humpback_whale.protein.fasta.gz > $WORKDIR/GIU3625_Humpback_whale.protein.fasta
# create an output folder
JOBNAME='nf_diamond_blastp_tax'
mkdir -p $WORKDIR/Annotation/$JOBNAME
cd $WORKDIR/Annotation/$JOBNAME

# mkdir -p $WORKDIR/Annotation/Annotation_results/$JOBNAME
~/bin/nextflow-22.11.1-edge-all run ~/adna/tools/nf-blast/nf-blast.nf -work-dir $TMPDIR/$JOBNAME/work -profile bunya,apptainer,diamond_tax --query $WORKDIR/GIU3625_Humpback_whale.protein.fasta --app 'diamond blastp' --db ~/adna/tools/ncbi_db/nr --diamondOpts '--very-sensitive -e 1e-10 -k 20' --chunkSize 10000 --outDir  $WORKDIR/Annotation/$JOBNAME/results --out GIU3625_Humpback_whale.protein.dmnd_blastp.nr_tax.outfmt6  -c ~/.nextflow/bunya.config -with-tower

Considering that we’re dealing with a mammalian proteome (similar to well-annotated mammal species), it may be useful to annotate the proteins against the swissprot or refseq_protein databases (which contain curated proteins only).

start_long_interactive_job # start interactive job on Bunya
WORKDIR="/home/ibar/adna/sandbox/OTE14085/Mnova_genome"

# unzip protein fasta
pigz -cd $WORKDIR/GIU3625_Humpback_whale.protein.fasta.gz > $WORKDIR/GIU3625_Humpback_whale.protein.fasta
# create an output folder
JOBNAME='nf-diamondblastp-refseq'
DB='refseq_protein' #swissprot
mkdir -p $WORKDIR/Annotation/$JOBNAME
cd $WORKDIR/Annotation/$JOBNAME

# mkdir -p $WORKDIR/Annotation/Annotation_results/$JOBNAME
~/bin/nextflow-22.11.1-edge-all run ~/adna/tools/nf-blast/nf-blast.nf -work-dir $TMPDIR/$JOBNAME/work -profile bunya,apptainer,diamond_tax --query $WORKDIR/GIU3625_Humpback_whale.protein.fasta --app 'diamond blastp' --db ~/adna/tools/ncbi_db/$DB --diamondOpts '--very-sensitive -e 1e-10 -k 20' --chunkSize 10000 --outDir  $WORKDIR/Annotation/$JOBNAME/results --out GIU3625_Humpback_whale.protein.dmnd_blastp.refseq_protein.tax.outfmt6  -c ~/.nextflow/bunya.config -with-tower

Functional annotation of proteins

The predicted proteins in the transcriptome were further annotated using InterProScan (or FA-nf) to assign protein families, motifs and ontologies to assist with transcript-to-gene annotation. Note the issues mentioned above with SignalP and TMHMM, and a new one for Phobius (see discussions and solutions on GitHub and BioStar).
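
If the licensed analyses are not essential, one way to avoid these issues altogether is to restrict InterProScan to the freely-distributable analyses with the -appl flag. A minimal sketch (the input file name and the analysis subset are illustrative):

# run only analyses that do not require licensed binaries (SignalP/TMHMM/Phobius excluded)
interproscan.sh -i proteins.fasta -appl Pfam,PANTHER,SUPERFAMILY,Gene3D -goterms -f TSV -cpu $NCORES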

IPSCAN_VERSION=5.66-98.0
NCORES=12
MEM=96
WALLTIME=50:00:00
# annotate proteins 
WORKDIR="/home/ibar/adna/sandbox/OTE14085/Mnova_genome"
cd $WORKDIR/Annotation
JOBNAME="Mnova_prot_ipscan"
mkdir -p $WORKDIR/Annotation/$JOBNAME
# prepare the commands (don't forget to remove the asterisk at the end of the proteins!)
ls -1 $WORKDIR/GIU3625_Humpback_whale.protein.fasta | parallel --dry-run "sed 's/[*]//g' {} > \$TMPDIR/{/} ; apptainer exec -B /home/ibar/scratch/tools -B $WORKDIR/Annotation/$JOBNAME:/output -B \$TMPDIR:/temp $NXF_SINGULARITY_CACHEDIR/interproscan_${IPSCAN_VERSION}.sif /opt/interproscan/interproscan.sh -i /temp/{/} -d /output -pa -dp -goterms -f TSV -T /temp -cpu \$SLURM_CPUS_PER_TASK && gawk '\$4~/PANTHER/' $WORKDIR/Annotation/$JOBNAME/{/}.tsv > $WORKDIR/Annotation/$JOBNAME/{/.}.panther.tsv" > $JOBNAME.cmds

# Prepare a general array Slurm script
echo '#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%A.%a.log'"
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

set -Eeo pipefail
source ~/.bashrc
conda activate \$CONDA_NAME
cd \$SLURM_SUBMIT_DIR
srun gawk -v ARRAY_IND=\$SLURM_ARRAY_TASK_ID 'NR==ARRAY_IND' \$CMDS_FILE | bash" > ~/bin/array.slurm
chmod +x ~/bin/array.slurm # make it available from any folder
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%10 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=base  ~/bin/array.slurm


# send results back to Research Space
rclone copy -P $WORKDIR/Annotation/$JOBNAME GURS_shared:Erika_Whale_fasting_genomics/OTE14085/Mnova_genome/Annotation/$JOBNAME # /Annotation_results/ipscan

FA-nf pipeline is not yet working!

WORKDIR="/home/ibar/adna/sandbox/OTE14085/Mnova_genome"
cd $WORKDIR/Annotation
JOBNAME="Mnova_prot_FA-nf"
mkdir -p $WORKDIR/Annotation/$JOBNAME

# download databases
cd ~/adna/tools/FA-nf
# with Nextflow pipeline
# ~/bin/nextflow-22.11.1-edge-all run download.nf --config params.download.config -profile bunya,apptainer -c ~/.nextflow/bunya.config &> download.logfile

# manually download dbs
mkdir -p ~/adna/tools/FA-nf/DBs/kegg && cd ~/adna/tools/FA-nf/DBs/kegg

aria2c -x5 https://www.genome.jp/ftp/db/kofam/ko_list.gz && pigz -d ko_list.gz
aria2c -x5 https://www.genome.jp/ftp/db/kofam/profiles.tar.gz && tar xzf profiles.tar.gz

# download and build FA-NF container
start_long_interactive_job # start interactive job
apptainer pull $NXF_SINGULARITY_CACHEDIR/guigolab.fa-nf.0.4.0.sif docker://guigolab/fa-nf:0.4.0
# Process KEGG db
mkdir -p ~/adna/tools/FA-nf/DBs/kegg/ko_store
apptainer exec -B /home/ibar/adna/tools -B $WORKDIR/Annotation/$JOBNAME:/output -B $TMPDIR:/temp $NXF_SINGULARITY_CACHEDIR/guigolab.fa-nf.0.4.0.sif bulkDownloadKEGG.pl ko_list ko_store

# download GO database
cd ~/adna/tools/FA-nf/DBs
aria2c -x5 http://www.geneontology.org/ontology/gene_ontology.obo

~/bin/nextflow-22.11.1-edge-all run ~/adna/tools/FA-nf/download.nf --config ~/adna/tools/FA-nf/params.download.config -profile bunya,apptainer -work-dir $TMPDIR/$JOBNAME/work --proteinFile $WORKDIR/GIU3625_Humpback_whale.protein.fasta --gffFile $WORKDIR/GIU3625_Humpback_whale.annotation.gff.gz --speciesName "M.novaeangliae" --dbname "Mnova"

These annotation tables were uploaded to Griffith Research Space.


rclone copy -P -L --exclude "**work/*" --exclude "**nextflow/*" --exclude "**.nextflow*" Mnova_genome GURS_shared:Erika_Whale_fasting_genomics/OTE14085/Mnova_genome

De-Novo Transcriptome Assembly and Annotation

Since the alignment rates to the reference transcriptome were very low (~20-30%), we assembled a de novo transcriptome from the samples, which should provide a better representation of the actual transcripts. To reduce the computational load, a subset of 25% of the reads from each sample was used.

# setup environment
CONDA_NAME="genomics"
mamba install -n $CONDA_NAME rcorrector agat ncbi-datasets-cli pbgzip # can be in the base environment
# go to the tools folder
cd $HOME/etc/tools/Trinity # ~/aDNA/tools 
# install additional utilities
git clone https://github.com/harvardinformatics/TranscriptomeAssemblyTools.git

# create our working directory and enter it
WORKDIR="/home/ibar/adna/sandbox/OTE14085"
mkdir -p $WORKDIR/combined_reads
cd $WORKDIR
# download the reference genome (only genome is currently available)
mkdir -p $WORKDIR/Mnova_genome && cd $WORKDIR/Mnova_genome
datasets download genome accession GCA_041834305.1 --include gff3,rna,cds,protein,genome,seq-report 
unzip ncbi_dataset.zip
# download the reference genome of the Minke whale
mkdir -p $WORKDIR/Minke_genome && cd $WORKDIR/Minke_genome
datasets download genome accession GCF_949987535.1 --include gff3,rna,cds,protein,genome,seq-report
unzip ncbi_dataset.zip

# prepare slurm script
JOBNAME="combine-reads"
NCORES=8
MEM="16G"
WALLTIME=1:00:00

echo '#!/bin/bash --login'"
#SBATCH --job-name=$JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%j.log
#SBATCH --cpus-per-task=$NCORES
#SBATCH --mem=${MEM}
#SBATCH --time=$WALLTIME
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

source ~/.bashrc
set -Eeo pipefail
cd \$SLURM_SUBMIT_DIR

srun parallel --rpl \"{sample} s:.+/([^_]+)_.+.fastq.gz:\1:\" -k \"cat $WORKDIR/{1 sample}*_R{2}*.fastq.gz > $WORKDIR/combined_reads/{1 sample}_R{2}.fq.gz\" ::: \$(find $WORKDIR -name \"*_L007_R1*.fastq.gz\") ::: 1 2 " > $JOBNAME.slurm
# submit the job 
sbatch $JOBNAME.slurm

WORKDIR="/home/ibar/adna/sandbox/OTE14085" # Bunya
CONDA_NAME='genomics'
JOBNAME="Sample_reads"
SAMPLING_RATE=0.5
NCORES=12
MEM=64
WALLTIME=1:00:00
# export TOWER_ACCESS_TOKEN=eyJ0aWQiOiA0NDYxfS5hNzkzY2M3NTI1YjZkZGFlMjVlZjJiZmIxNTU1MDAyMGE4YjNjZWQ0 # only needed if it is not already set in ~/.bashrc
# export NXF_VER=22.04.5
# prepare the folder and input files
mkdir -p $WORKDIR/combined_reads/subsample_$SAMPLING_RATE && cd $WORKDIR
# subsample a fraction (SAMPLING_RATE) of the reads from each input file for the assembly
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.fq.gz:\1:" --rpl "{input} s:_R1:_R#:" "reformat.sh in={input} out=$WORKDIR/combined_reads/subsample_$SAMPLING_RATE/sampled_{sample}_R#.fq.gz samplerate=$SAMPLING_RATE sampleseed=\$RANDOM vpair ow" ::: $(ls -1 $WORKDIR/combined_reads/*R1.fq.gz) > $JOBNAME.cmds 

# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%20 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm


# upload files to Research Space
rclone copy -P -L --exclude "**work/*" --exclude "**nextflow/*" --exclude "**.nextflow*" Mnova_genome GURS_shared:Erika_Whale_fasting_genomics/OTE14085/Mnova_genome

Transcriptome Assembly with TransPi

A De-Novo transcriptome was assembled from the reads using TransPi. In brief, TransPi performs a complete transcriptome assembly, basic annotation and assessment pipeline using well-established methods (Rivera-Vicéns et al. 2022). Detailed information on setting up and configuring TransPi, including detailed code snippets can be found at https://idobar.github.io/QX_bioinfo_analysis.

WORKDIR="/home/ibar/adna/sandbox/OTE14085" # Bunya
CONDA_NAME="genomics"
SAMPLING_RATE=0.5
NCORES=12
MEM=64
WALLTIME=1:00:00

# export TOWER_ACCESS_TOKEN=eyJ0aWQiOiA0NDYxfS5hNzkzY2M3NTI1YjZkZGFlMjVlZjJiZmIxNTU1MDAyMGE4YjNjZWQ0 # only needed if it is not already set in ~/.bashrc
# export NXF_VER=22.04.5
# prepare the folder and input files
cd $WORKDIR
# prepare the folder and input files
JOBNAME="combine_sampled_reads"
mkdir -p $WORKDIR/Mnova_TransPi_assembly/input_files && cd $WORKDIR/Mnova_TransPi_assembly
parallel --dry-run "cat $WORKDIR/combined_reads/subsample_$SAMPLING_RATE/sampled_*_R{}.fq.gz > input_files/Mnova_sampled$SAMPLING_RATE_{}.fq.gz" ::: 1 2 > $JOBNAME.cmds
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%20 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm
# download BUSCO lineage db
cd /home/ibar/adna/tools/TransPi/DBs/busco_db/
grep "mammalia" /home/ibar/adna/tools/TransPi/conf/busV4list.txt | cut -f2 -d";" | parallel "aria2c -x5 {} && tar xf {} && rm {}"
# start TransPi (Bunya) [optional tools and databases configured in nextflow.config]
nextflow-22.11.1-edge-all run /home/ibar/adna/tools/TransPi/TransPi.nf -with-tower --all -profile apptainer,TransPiContainer,bunya -c /home/ibar/.nextflow/bunya.config --k 37,47,53,59,67,73 --maxReadLen 105 --pipeInstall /scratch/project/adna/tools/TransPi --busco4db /home/ibar/adna/tools/TransPi/DBs/busco_db/mammalia_odb10 --reads "./input_files/Mnova_sampled${SAMPLING_RATE}_[1,2].fq.gz" -resume # --allBuscos --buscoDist

Transcriptome Assembly with nf-core/denovotranscript

Considering the difficulties in running TransPi and the inability to choose which assemblers to use, we tried a more mature and recent Nextflow pipeline, nf-core/denovotranscript (v1.1.0).

WORKDIR="/home/ibar/adna/sandbox/OTE14085" # Bunya
CONDA_NAME='genomics'
SAMPLING_RATE=0.25
SRT=1000000
JOBNAME="sample_reads-$SAMPLING_RATE"
NCORES=12
MEM=64
WALLTIME=1:00:00
# export TOWER_ACCESS_TOKEN=eyJ0aWQiOiA0NDYxfS5hNzkzY2M3NTI1YjZkZGFlMjVlZjJiZmIxNTU1MDAyMGE4YjNjZWQ0 # only needed if it is not already set in ~/.bashrc
# export NXF_VER=22.04.5
# prepare the folder and input files
mkdir -p $WORKDIR/combined_reads/subsample_$SAMPLING_RATE && cd $WORKDIR
# subsample a fraction (SAMPLING_RATE) of the reads from each input file for the assembly
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.fq.gz:\1:" --rpl "{input} s:_R1:_R#:" "reformat.sh in={input} out=$WORKDIR/combined_reads/subsample_$SAMPLING_RATE/sampled_{sample}_R#.fq.gz samplerate=$SAMPLING_RATE sampleseed=\$RANDOM vpair ow" ::: $(ls -1 $WORKDIR/combined_reads/*R1.fq.gz) > $JOBNAME.cmds 
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%20 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm

# update nextflow (v24.04.4) and nf-core tools (v3.0.2)
# aria2c -x5 -d ~/bin/ https://github.com/nextflow-io/nextflow/releases/download/v24.04.4/nextflow-24.04.4-all && chmod +x ~/bin/nextflow-24.04.4-all
mamba update -n base nf-core # update via conda
# download a recent Trinity image
aria2c -x5 -d $NXF_APPTAINER_CACHEDIR https://data.broadinstitute.org/Trinity/TRINITY_SINGULARITY/trinityrnaseq.v2.15.2.simg 


mkdir -p $WORKDIR/Mnova_denovotranscript_assembly && cd $WORKDIR/Mnova_denovotranscript_assembly
# create a sample sheet (remove outlier 26S23)
echo "sample,fastq_1,fastq_2" > Mnova_samplesheet.csv
find $WORKDIR/combined_reads/subsample_$SAMPLING_RATE/ -name "sampled_*R1.fq.gz" | gawk 'BEGIN{i=0;j=0}{read2 = gensub("_R1", "_R2", "g")}; $0 ~ /sampled_[0-9]+N[0-9]+/{i++; printf "EM_REP%s,%s,%s\n", i, $1, read2}$0 ~ /sampled_[0-9]+S[0-9]+/{j++; printf "LM_REP%s,%s,%s\n", j, $1, read2}' | sort | grep -v "26S23" >> Mnova_samplesheet.csv

# start interactive job (to be able to pull the images with apptainer)
start_interactive_job
# pull the pipeline and images
nf-core pipelines download -r 1.1.0 -s singularity -x none nf-core/denovotranscript -o $NXF_APPTAINER_CACHEDIR
#nextflow pull nf-core/denovotranscript

# prepare slurm script
JOBNAME="Mnova_denovotranscript"
NCORES=4
MEM="16G"
WALLTIME=200:00:00

echo '#!/bin/bash --login'"
#SBATCH --job-name=$JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%j.log
#SBATCH --cpus-per-task=$NCORES
#SBATCH --mem=${MEM}
#SBATCH --time=$WALLTIME
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

source ~/.bashrc
set -Eeo pipefail
cd \$SLURM_SUBMIT_DIR

srun ~/bin/nextflow-24.04.4-all run nf-core/denovotranscript -r 1.1.0 --input Mnova_samplesheet.csv --transrate_reference $WORKDIR/Minke_genome/ncbi_dataset/data/GCF_949987535.1/protein.faa --fasta $WORKDIR/Mnova_genome/HumpbackWhale_Final_Genome_forNCBI.fasta --outdir assembly_results --assemblers 'trinity_no_norm,rnaspades' --ss rf --extra_trinity_args '--SS_lib_type RF' -with-tower -profile apptainer,bunya -c /home/ibar/.nextflow/bunya.config -resume" > $JOBNAME.slurm

# submit the job 
sbatch $JOBNAME.slurm
# --busco_lineage mammalia

Repeat the assembly using 5M paired-end reads from each sample, following the recommendations of the Oyster River Protocol detailed in MacManes (2018).

WORKDIR="/home/ibar/adna/sandbox/OTE14085" # Bunya
CONDA_NAME='genomics'
SRT=5000000
JOBNAME="sample_reads-5M"
NCORES=6
MEM=12
WALLTIME=0:20:00
# export TOWER_ACCESS_TOKEN=eyJ0aWQiOiA0NDYxfS5hNzkzY2M3NTI1YjZkZGFlMjVlZjJiZmIxNTU1MDAyMGE4YjNjZWQ0 # only needed if it is not already set in ~/.bashrc
# export NXF_VER=22.04.5
# prepare the folder and input files
mkdir -p $WORKDIR/combined_reads/$JOBNAME && cd $WORKDIR/combined_reads/$JOBNAME
# subsample 5M reads from each input file and combine for assembly
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.fq.gz:\1:" --rpl "{input} s:_R1:_R#:" "reformat.sh in={input} out=$WORKDIR/combined_reads/$JOBNAME/sampled_{sample}_R#.fq.gz srt=$SRT sampleseed=\$RANDOM vpair ow" ::: $(ls -1 $WORKDIR/combined_reads/*R1.fq.gz | grep -v "26S23") > $JOBNAME.cmds 
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%20 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm

# Run Rcorrect on EM and LM samples separately
cd $WORKDIR/combined_reads/$JOBNAME
R1_FILES_EM=$(find $WORKDIR/combined_reads/$JOBNAME -name "*_R1.fq.gz" | egrep "_[0-9]+N[0-9]+_"| paste -sd,)
R2_FILES_EM=$(find $WORKDIR/combined_reads/$JOBNAME -name "*_R2.fq.gz" | egrep "_[0-9]+N[0-9]+_"| paste -sd,)
# submit job
echo "mkdir -p $WORKDIR/combined_reads/$JOBNAME/corrected_reads_EM; run_rcorrector.pl -1 $R1_FILES_EM -2 $R2_FILES_EM -t \$SLURM_CPUS_PER_TASK -od $WORKDIR/combined_reads/$JOBNAME/corrected_reads_EM -tmpd \$TMPDIR" > $JOBNAME-rcorrect.cmds

R1_FILES_LM=$(find $WORKDIR/combined_reads/$JOBNAME -name "*_R1.fq.gz" | egrep "_[0-9]+S[0-9]+_"| paste -sd,)
R2_FILES_LM=$(find $WORKDIR/combined_reads/$JOBNAME -name "*_R2.fq.gz" | egrep "_[0-9]+S[0-9]+_"| paste -sd,)
# submit job
echo "mkdir -p $WORKDIR/combined_reads/$JOBNAME/corrected_reads_LM; run_rcorrector.pl -1 $R1_FILES_LM -2 $R2_FILES_LM -t \$SLURM_CPUS_PER_TASK -od $WORKDIR/combined_reads/$JOBNAME/corrected_reads_LM -tmpd \$TMPDIR -verbose " >> $JOBNAME-rcorrect.cmds

# submit the job array to the cluster
NCORES=8
MEM=64
WALLTIME=10:00:00
sbatch -a 1-$(cat $JOBNAME-rcorrect.cmds | wc -l)%10 --job-name=$JOBNAME-rcorrect --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,PATH=$PATH,CMDS_FILE=$JOBNAME-rcorrect.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm

# remove unfixable reads (not sure if it's needed)
mkdir -p $WORKDIR/combined_reads/sample_reads-5M/unfixrm
find $WORKDIR/combined_reads/sample_reads-5M/ -name "sampled_*R1.cor.fq.gz" | parallel --dry-run --rpl "{base} s:.+\/(.+)_R1.+$:\1:" --rpl "{read2} s:_R1:_R2:" "python  ~/etc/tools/TranscriptomeAssemblyTools/utilities/FilterUncorrectabledPEfastq.py -1 {} -2 {read2} -s {base} " > $JOBNAME-rm-unfix.cmds

MEM=32
WALLTIME=2:00:00

sbatch -a 1-$(cat $JOBNAME-rm-unfix.cmds | wc -l)%10 --job-name=$JOBNAME-rm-unfix --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,PATH=$PATH,CMDS_FILE=$JOBNAME-rm-unfix.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm

Now we can run the assembly pipeline (make sure to specify --min_kmer_cov 2 --min_contig_length 300 for Trinity to reduce the number of temporary files generated).
Note that we needed to change the ORP_TRANSRATE module to use the ORP Docker container, make sure it activates the internal orp conda environment within the container, and disable binding of the /home folder.

WORKDIR="/scratch/project/adna/sandbox/OTE14085" # Bunya
# update nextflow (v24.04.4) and nf-core tools (v3.0.2)
# aria2c -x5 -d ~/bin/ https://github.com/nextflow-io/nextflow/releases/download/v24.04.4/nextflow-24.04.4-all && chmod +x ~/bin/nextflow-24.04.4-all
mamba update -n base nf-core # update via conda
# download a recent Trinity image
aria2c -x5 -d $NXF_APPTAINER_CACHEDIR https://data.broadinstitute.org/Trinity/TRINITY_SINGULARITY/trinityrnaseq.v2.15.2.simg 

JOBNAME="Mnova_denovotranscript2"
mkdir -p $WORKDIR/$JOBNAME && cd $WORKDIR/$JOBNAME
# create a sample sheet (remove outlier 26S23)
echo "sample,fastq_1,fastq_2" > ${JOBNAME}.samplesheet.csv
# need to choose whether to use the .cor files or the unfixrm files (not sure if it's needed)
find $WORKDIR/combined_reads/sample_reads-5M -name "unfixrm_sampled_*R1.cor.fq.gz"  | gawk 'BEGIN{i=0;j=0}{read2 = gensub("_R1", "_R2", "g")}; $0 ~ /sampled_[0-9]+N[0-9]+/{i++; printf "EM_REP%s,%s,%s\n", i, $1, read2}$0 ~ /sampled_[0-9]+S[0-9]+/{j++; printf "LM_REP%s,%s,%s\n", j, $1, read2}' | grep -v "26S23" >> ${JOBNAME}.samplesheet.csv

# start interactive job (to be able to pull the images with apptainer)
start_interactive_job
# pull the pipeline and images
nf-core pipelines download -r 1.1.0 -s singularity -x none nf-core/denovotranscript -o $NXF_APPTAINER_CACHEDIR
exit
#nextflow pull nf-core/denovotranscript

# run from the login node
~/bin/nextflow-24.04.4-all run nf-core/denovotranscript -r 1.1.0 --input ${JOBNAME}.samplesheet.csv --transrate_reference $WORKDIR/Minke_genome/ncbi_dataset/data/GCF_949987535.1/protein.faa --fasta $WORKDIR/Mnova_genome/HumpbackWhale_Final_Genome_forNCBI.fasta --outdir assembly_results --assemblers 'trinity,rnaspades' --ss rf --extra_trinity_args '--SS_lib_type RF --min_kmer_cov 2 --min_contig_length 300 --full_cleanup' -with-tower -profile apptainer,bunya -c /home/ibar/.nextflow/bunya.config -resume

# prepare slurm script
JOBNAME="Mnova_denovotranscript2"
NCORES=4
MEM="16G"
WALLTIME=120:00:00

echo '#!/bin/bash --login'"
#SBATCH --job-name=$JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%j.log
#SBATCH --cpus-per-task=$NCORES
#SBATCH --mem=${MEM}
#SBATCH --time=$WALLTIME
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

source ~/.bashrc
set -Eeo pipefail
cd \$SLURM_SUBMIT_DIR

srun ~/bin/nextflow-24.04.4-all run nf-core/denovotranscript -r 1.1.0 --input ${JOBNAME}.samplesheet.csv --transrate_reference $WORKDIR/Minke_genome/ncbi_dataset/data/GCF_949987535.1/protein.faa --fasta $WORKDIR/Mnova_genome/HumpbackWhale_Final_Genome_forNCBI.fasta --outdir assembly_results --assemblers 'trinity,rnaspades' --ss rf --extra_trinity_args '--SS_lib_type RF --min_kmer_cov 2 --min_contig_length 300 --full_cleanup' -with-tower -profile apptainer,bunya -c /home/ibar/.nextflow/bunya.config -resume" > $JOBNAME.slurm

# submit the job 
sbatch $JOBNAME.slurm
# --busco_lineage mammalia
# upload results to Erika's Sharepoint
rclone copy -P -L --exclude "**work/*" --exclude "**nextflow/*" --exclude "**.log*" $WORKDIR/$JOBNAME Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/$JOBNAME
rm -rf $WORKDIR/$JOBNAME/.nextflow* $WORKDIR/$JOBNAME/work

Or with the Oyster River Protocol pipeline (MacManes 2018).

# prepare slurm script
WORKDIR="/scratch/project/adna/sandbox/OTE14085"
JOBNAME="Mnova_orp_pipeline"
NCORES=24
MEM=320
WALLTIME=200:00:00
mkdir -p $WORKDIR/$JOBNAME/input_files && cd $WORKDIR/$JOBNAME
seq 1 2 | parallel "cat $WORKDIR/combined_reads/sample_reads-5M/sampled_*R{}.fq.gz > input_files/sample_reads-5M_R{}.fq.gz"

echo '#!/bin/bash --login'"
#SBATCH --job-name=$JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%j.log
#SBATCH --cpus-per-task=$NCORES
#SBATCH --mem=${MEM}G
#SBATCH --time=$WALLTIME
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

source ~/.bashrc
set -Eeo pipefail
cd \$SLURM_SUBMIT_DIR

apptainer exec -B /scratch/project/adna --no-mount /home $NXF_APPTAINER_CACHEDIR/docker.io-macmaneslab-orp-2.3.3.img bash -c \"cd $WORKDIR/$JOBNAME && sed 's|--config \\\${MAKEDIR}/software/config.ini|--config \\\${BUSCO_CONFIG_FILE}|g;  s|\\\${MAKEDIR}/busco_dbs|\\\${BUSCODB}|g; s|\\\${MAKEDIR}version.txt|\\\${MAKEDIR}/version.txt|g' /home/orp/Oyster_River_Protocol/oyster.mk > ./oyster.mk && chmod +x ./oyster.mk && source /home/orp/Oyster_River_Protocol/software/anaconda/install/bin/activate orp && unset -f which &&  ./oyster.mk TPM_FILT=1 CPU=\$SLURM_CPUS_PER_TASK MEM=$MEM MAKEDIR=/home/orp/Oyster_River_Protocol BUSCODB=/scratch/project/adna/tools/TransPi/DBs/busco_db  READ1=$WORKDIR/$JOBNAME/input_files/sample_reads-5M_R1.fq.gz READ2=$WORKDIR/$JOBNAME/input_files/sample_reads-5M_R2.fq.gz RUNOUT=Mnova_trans_orp\" " >  $JOBNAME.slurm

sbatch $JOBNAME.slurm

The ORP failed to create the SNAP index needed to run the TransRate step due to a very large assembly size. We therefore created a modified and updated version of ORP (that should be able to handle larger assemblies).

# prepare slurm script
WORKDIR="/scratch/project/adna/sandbox/OTE14085"
JOBNAME="Mnova_orp_pipeline"
NCORES=24
MEM=320
WALLTIME=200:00:00
mkdir -p $WORKDIR/$JOBNAME/input_files && cd $WORKDIR/$JOBNAME
# download the container from Dockerhub
start_debug_interactive_job
apptainer pull --dir $NXF_APPTAINER_CACHEDIR docker.io-idobarlab-orp-2.3.3-p1.img docker://idobarlab/orp:2.3.3-p1 && exit
# concatenate input files
seq 1 2 | parallel "cat $WORKDIR/combined_reads/sample_reads-5M/sampled_*R{}.fq.gz > input_files/sample_reads-5M_R{}.fq.gz"

echo '#!/bin/bash --login'"
#SBATCH --job-name=$JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x.%j.log
#SBATCH --cpus-per-task=$NCORES
#SBATCH --mem=${MEM}G
#SBATCH --time=$WALLTIME
#SBATCH --account=a_agri_genomics
#SBATCH --partition=general

source ~/.bashrc
set -Eeo pipefail
cd \$SLURM_SUBMIT_DIR

apptainer exec -B /scratch/project/adna --no-mount /home $NXF_APPTAINER_CACHEDIR/docker.io-idobarlab-orp-2.3.3-p1.img bash -c \"cd $WORKDIR/$JOBNAME && source /home/orp/Oyster_River_Protocol/software/anaconda/install/bin/activate orp && unset -f which &&  /home/orp/Oyster_River_Protocol/oyster.mk TPM_FILT=1 CPU=\$SLURM_CPUS_PER_TASK MEM=$MEM MAKEDIR=/home/orp/Oyster_River_Protocol BUSCODB=/scratch/project/adna/tools/TransPi/DBs/busco_db  READ1=$WORKDIR/$JOBNAME/input_files/sample_reads-5M_R1.fq.gz READ2=$WORKDIR/$JOBNAME/input_files/sample_reads-5M_R2.fq.gz RUNOUT=Mnova_trans_orp\" " >  $JOBNAME.slurm

sbatch $JOBNAME.slurm

Assess the quality of assemblies

The transcriptome assemblies were assessed and compared using rnaQUAST v2.2.3 (Bushmanova et al. 2016) and TransRate v1.0.3 (Smith-Unna et al. 2016), as suggested in the Trinity documentation and by Behera et al. (2021).

The transcripts were compared to the minke whale (Balaenoptera acutorostrata) proteins, which were obtained from NCBI (Accession GCF_949987535.1) (Brownlow et al. 2024).
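
No rnaQUAST commands are shown below, but as a minimal sketch (assuming rnaQUAST is installed in the conda environment, and using the genome and annotation paths defined earlier), an assembly could be assessed with:

# compare an assembly against the humpback whale genome and gene models (output folder name is illustrative)
rnaQUAST.py --transcripts $WORKDIR/Mnova_denovotranscript2/assembly_results/evigene/okayset/all_assembled.okay.mrna --reference $WORKDIR/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta --gtf $WORKDIR/Mnova_genome/GIU3625_Humpback_whale.annotation.gff --threads $NCORES --output_dir $WORKDIR/rnaquast_results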

CONDA_NAME='rnaseq'
mamba create -n $CONDA_NAME transrate ncbi-datasets-cli detonate
WORKDIR="/scratch/project/adna/sandbox/OTE14085" # Bunya
# mkdir -p $WORKDIR/references && cd $WORKDIR/references
#datasets download genome accession GCF_902806645.1 --include gff3,rna,cds,protein,genome --filename GCF_902806645.1.zip
# unzip GCF_902806645.1.zip
REF_PROT="$WORKDIR/Minke_genome/ncbi_dataset/data/GCF_949987535.1/protein.faa"
# $WORKDIR/SRO_TransPi_assembly/results/assemblies
# install the ORP version of transrate
cd ~/etc/tools 
wget https://github.com/macmanes-lab/Oyster_River_Protocol/raw/master/software/orp-transrate.tar.gz
tar xzf orp-transrate.tar.gz

# or use the full ORP container
cd $NXF_APPTAINER_CACHEDIR
apptainer pull docker://macmaneslab/orp:2.3.3 --dir $NXF_APPTAINER_CACHEDIR
# check strandness of the denovotranscript assembly
ASSEMBLY="Mnova_denovotranscript2"
JOBNAME="check-strandness-$ASSEMBLY"
NCORES=8
MEM=24
WALLTIME=2:00:00
RNASEQ_DIR="$WORKDIR/combined_reads/subsample_0.25"


cd $WORKDIR/$ASSEMBLY
# find ./assembly_results/trinity -name "pooled_reads*.fa.gz" 

echo "apptainer exec -B $WORKDIR:/scratch --no-mount /home  $NXF_APPTAINER_CACHEDIR/docker.io-macmaneslab-orp-2.3.3.img bash -c \"source /home/orp/Oyster_River_Protocol/software/anaconda/install/bin/activate orp_trinity && unset -f which && cd /scratch/$ASSEMBLY && sed -r 's|MAKEDIR :=.+|MAKEDIR := /home/orp/Oyster_River_Protocol|; s|MAKEDIR}version|MAKEDIR}/version|' /home/orp/Oyster_River_Protocol/strandeval.mk > ./strandeval.mk && chmod +x ./strandeval.mk &&  ./strandeval.mk main ASSEMBLY=$WORKDIR/$ASSEMBLY/assembly_results/evigene/okayset/all_assembled.okay.mrna CPU=\$SLURM_CPUS_PER_TASK READ1=/scratch/combined_reads/subsample_0.25/sampled_8S23_R1.fq.gz READ2=/scratch/combined_reads/subsample_0.25/sampled_8S23_R2.fq.gz RUNOUT=${ASSEMBLY}_strandness\" " >  $JOBNAME.slurm
# submit the job array to the cluster
sbatch $JOBNAME.slurm

# check strandness of the TransPi assembly
ASSEMBLY="Mnova-TransPi"
JOBNAME="check-strandness-$ASSEMBLY"
NCORES=8
MEM=24
WALLTIME=2:00:00


cd $WORKDIR/Mnova_TransPi_assembly
find ./results/assemblies/ -name "subsampled_Mnova.*.fa" | egrep -v "k[0-9]+" | parallel --dry-run --rpl "{ass} s:.+/subsampled_Mnova\.(.+)\.fa:\1:" "apptainer exec -B $WORKDIR:/scratch --no-mount /home  $NXF_APPTAINER_CACHEDIR/docker.io-macmaneslab-orp-2.3.3.img bash -c \"source /home/orp/Oyster_River_Protocol/software/anaconda/install/bin/activate orp_trinity && unset -f which && cd /scratch/Mnova_TransPi_assembly && sed -r 's|MAKEDIR :=.+|MAKEDIR := /home/orp/Oyster_River_Protocol|; s|MAKEDIR}version|MAKEDIR}/version|' /home/orp/Oyster_River_Protocol/strandeval.mk > ./strandeval.mk && chmod +x ./strandeval.mk &&  ./strandeval.mk main ASSEMBLY={} CPU=\$SLURM_CPUS_PER_TASK READ1=/scratch/Mnova_TransPi_assembly/input_files/subsampled_Mnova_1.fq.gz READ2=/scratch/Mnova_TransPi_assembly/input_files/subsampled_Mnova_2.fq.gz RUNOUT=${ASSEMBLY}_strandness\" " >  $JOBNAME.cmds
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%10 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,PATH=$PATH,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=base ~/bin/array.slurm

NCORES=24
MEM=96
WALLTIME=100:00:00

#mkdir -p $WORKDIR/SRO_TransPi_assembly/results/transrate
# cp /home/ibar/adna/Oyster_QX_Disease/SRO_TransPi_assembly/work/27/72b07a2131e97971f644da74fc9dc6/unfixrm_SRO_QX_no_rRNA.R1.fq /home/ibar/adna/Oyster_QX_Disease/SRO_TransPi_assembly/work/2b/f20013a2ae897e646817eac5d0997c/SRO_QX_no_rRNA.R1.fq /home/ibar/adna/Oyster_QX_Disease/SRO_TransPi_assembly/work/e5/25ba5d1510380c3631ee6c18bf2e89/SRO_control_no_rRNA.R1.fq /home/ibar/adna/Oyster_QX_Disease/SRO_TransPi_assembly/work/81/5bd2ffe863bf0ee37388694b4ad254/unfixrm_SRO_control_no_rRNA.R1.fq $WORKDIR/SRO_TransPi_assembly/input_files/

# use no_rRNA.R1.fq files as input 
JOBNAME="transrate_no_rRNA"
OUTDIR=$WORKDIR/SRO_TransPi_assembly/results/$JOBNAME # or OUTDIR=~/scratch/sandbox/SRO_TransPi_assembly/$JOBNAME
mkdir -p $OUTDIR
parallel --dry-run --rpl "{sample} s:.+/(.+)_no_rRNA.R1.fq:\1:" --rpl "{read2} s:.R1:.R2:" "mkdir -p $OUTDIR/{sample}; ASSEMBLIES=\$(ls -1 $WORKDIR/SRO_TransPi_assembly/results/assemblies/{sample}*[!k0-9].fa | gawk -v ORS=\",\" '1' | sed 's/,$//'); $HOME/etc/tools/orp-transrate/transrate --assembly \$ASSEMBLIES --left {} --right {read2} --threads \$SLURM_CPUS_PER_TASK --reference $REF_PROT --output $OUTDIR/{sample}" ::: $(ls -1 $WORKDIR/SRO_TransPi_assembly/input_files/*SRO*no_rRNA.R1.fq) > $JOBNAME.cmds
# --merge-assemblies {sample}.merged.fa 
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%10 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ${WORKDIR}/array.slurm

# use normalised (.norm.fq) files as input 
JOBNAME="transrate_norm"
OUTDIR=$WORKDIR/SRO_TransPi_assembly/results/$JOBNAME # or OUTDIR=$HOME/scratch/sandbox/SRO_TransPi_assembly/$JOBNAME
mkdir -p $OUTDIR
parallel --dry-run --rpl "{sample} s:.+/\w+-(.+).norm.fq:\1:" --rpl "{read2} s:left-:right-:" "mkdir -p $OUTDIR/{sample}; ASSEMBLIES=\$(ls -1 $WORKDIR/SRO_TransPi_assembly/results/assemblies/{sample}*[!k0-9].fa | gawk -v ORS=\",\" '1' | sed 's/,$//'); $HOME/etc/tools/orp-transrate/transrate --assembly \$ASSEMBLIES --left {} --right {read2} --threads \$SLURM_CPUS_PER_TASK --reference $REF_PROT --output $OUTDIR/{sample}" ::: $(find $WORKDIR/SRO_TransPi_assembly/work -type f -name left-*SRO*.norm.fq) > $JOBNAME.cmds
# --merge-assemblies {sample}.merged.fa 
# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l)%10 --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ${WORKDIR}/array.slurm

Annotate the assembly

The assembled transcriptome was annotated against the non-redundant nucleotide database of the NCBI (nt) to achieve more accurate species-specific annotations. BLASTn was run with blast-nf, a Nextflow implementation that uses a “split-combine” approach to split the input query (entire transcriptome or proteome) into smaller “chunks” that are run in parallel on the HPC cluster.
Considering that we’re dealing with a mammalian transcriptome (similar to well-annotated mammalian species), it may be useful to annotate the transcripts against the refseq database (which only contains curated gene transcripts).

# setup environment
CONDA_NAME="genomics"
mamba install -n $CONDA_NAME rcorrector agat ncbi-datasets-cli pbgzip google-cloud-sdk awscli
# download recent BLAST databases (if needed)
start_interactive_job
cd /scratch/project/adna/tools/ncbi_db # on Bunya
apptainer exec -B /home/ibar/adna/tools  -B $TMPDIR:/temp $NXF_SINGULARITY_CACHEDIR/ncbi-blast-latest.img update_blastdb.pl --decompress --verbose --source gcp --num_threads $SLURM_CPUS_PER_TASK swissprot refseq_protein refseq_rna
exit # quit interactive job
WORKDIR="/scratch/project/adna/sandbox/OTE14085/Mnova_denovotranscript2"

JOBNAME="nf-blastn-refseq"
DB="refseq_rna"
mkdir -p $WORKDIR/Annotation/$JOBNAME && cd $WORKDIR/Annotation/$JOBNAME
# run Nextflow Blast pipeline
~/bin/nextflow-22.11.1-edge-all run ~/etc/tools/blast-nf/main.nf --app blastn --dbDir /scratch/project/adna/tools/ncbi_db --dbName $DB --query $WORKDIR/assembly_results/evigene/okayset/all_assembled.okay.mrna --outfmt "6 std stitle staxids ssciname scomname" --options '-evalue 1e-10 -max_target_seqs 20' --chunkSize 2500 --outdir  $WORKDIR/Annotation/$JOBNAME/results --outfileName Mnova_adipose_all_assembled.okay.mrna.refseq.tax.outfmt6  -c ~/.nextflow/bunya.config -profile bunya,apptainer -with-tower

Annotate proteome with DIAMOND

Despite using blast-nf, BLASTp was painfully slow (taking ~80 hours for a chunk of 500 proteins), so an alternative approach was taken using DIAMOND (Buchfink et al. 2015), which offers similar accuracy for protein homology searches and produces output identical to BLASTp, at much greater speed (up to 1,000× faster). We used the Nextflow-enabled version of DIAMOND (reducing the run time to ~1.5 h for the entire proteome).
Considering that we’re dealing with a mammalian proteome (similar to well-annotated mammalian species), it may be useful to annotate the proteins against the swissprot or refseq_protein databases (which contain curated proteins only).

# start_long_interactive_job # start interactive job on Bunya
WORKDIR="/scratch/project/adna/sandbox/OTE14085/Mnova_denovotranscript2"

# create an output folder
JOBNAME='nf-diamondblastp-refseq'
DB='refseq_protein' #swissprot
mkdir -p $WORKDIR/Annotation/$JOBNAME
cd $WORKDIR/Annotation/$JOBNAME

# mkdir -p $WORKDIR/Annotation/Annotation_results/$JOBNAME
~/bin/nextflow-22.11.1-edge-all run /scratch/project/adna/tools/nf-blast/nf-blast.nf -work-dir /scratch/project/adna/tmp/$JOBNAME/work -profile bunya,apptainer,diamond_tax --query $WORKDIR/assembly_results/evigene/okayset/all_assembled.okay.aa --app 'diamond blastp' --db /scratch/project/adna/tools/ncbi_db/$DB --diamondOpts '--very-sensitive -e 1e-10 -k 20' --chunkSize 10000 --outDir  $WORKDIR/Annotation/$JOBNAME/results --out Mnova_adipose_all_assembled.okay.prots.dmnd_blastp.refseq_protein.tax.outfmt6  -c ~/.nextflow/bunya.config -with-tower

Functional annotation of proteins

The predicted proteins in the transcriptome were further annotated using InterProScan (or FA-nf) to assign protein families, motifs and ontologies to assist with transcript-to-gene annotation. Note the issues mentioned above with SignalP and TMHMM, and a new one for Phobius (see discussions and solutions on GitHub and BioStar).

IPSCAN_VERSION=5.66-98.0
NCORES=12
MEM=96
WALLTIME=50:00:00
# annotate proteins 
WORKDIR="/scratch/project/adna/sandbox/OTE14085/Mnova_denovotranscript2"
cd $WORKDIR/Annotation
JOBNAME="Mnova_prot_ipscan"
mkdir -p $WORKDIR/Annotation/$JOBNAME
# prepare the commands (don't forget to remove the asterisk at the end of the proteins!)
ls -1 $WORKDIR/assembly_results/evigene/okayset/all_assembled.okay.aa | parallel --dry-run "sed 's/[*]//g' {} > \$TMPDIR/{/} ; apptainer exec -B /home/ibar/scratch/tools -B $WORKDIR/Annotation/$JOBNAME:/output -B \$TMPDIR:/temp $NXF_SINGULARITY_CACHEDIR/interproscan_${IPSCAN_VERSION}.sif /opt/interproscan/interproscan.sh -i /temp/{/} -d /output -pa -dp -goterms -f TSV -T /temp -cpu \$SLURM_CPUS_PER_TASK && gawk '\$4~/PANTHER/' $WORKDIR/Annotation/$JOBNAME/{/}.tsv > $WORKDIR/Annotation/$JOBNAME/{/.}.panther.tsv" > $JOBNAME.cmds

# submit the job to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=base ~/bin/serial_jobs_run.slurm


# upload results to SharePoint
rclone copy -P -L --exclude "**work/*" --exclude "**nextflow/*" --exclude "**.nextflow*" --exclude "**/*.html" $WORKDIR/Annotation Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/Mnova_denovotranscript2/Annotation
# repeat to copy html files
rclone copy -P -L --ignore-checksum --ignore-size --include "**/*.html" $WORKDIR/Annotation Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/Mnova_denovotranscript2/Annotation

Gene Expression Analysis

The most representative transcriptome assembly was chosen as the reference for quantifying gene/transcript abundance with Salmon v1.10.3 (Patro et al. 2017; Love et al. 2018).

CONDA_NAME=picard
mamba install -n $CONDA_NAME kallisto salmon 
WORKDIR="/home/ibar/adna/sandbox/OTE14085"

TRANS="$WORKDIR/Mnova_denovotranscript2/assembly_results/evigene/okayset/all_assembled.okay.mrna"
GENOME="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta"
GENEMAP="$WORKDIR/Mnova_denovotranscript2/assembly_results/e"


cd $WORKDIR
# Create a reference (combined transcriptome and genome as decoy) 

JOBNAME="Salmon_index"
NCORES=12
MEM=64
WALLTIME=10:00:00
echo "grep \"^>\" $GENOME | cut -d \" \" -f 1 | sed -e 's/>//g' > decoys.txt; cat $TRANS $GENOME | pigz > Mnova_transgenome.fasta.gz; salmon index -t Mnova_transgenome.fasta.gz -d decoys.txt -p \$SLURM_CPUS_PER_TASK -i Mnova_transgenome.index" > $JOBNAME.cmds 

# submit the job array to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# run Salmon quant
DATE=`date +%d_%m_%Y`
JOBNAME="Salmon_quant_$DATE"
WORKDIR="/home/ibar/adna/sandbox/OTE14085"
RUNDIR="$WORKDIR/Gene_expression/$JOBNAME"
mkdir -p $RUNDIR && cd $RUNDIR
# create tx2gene map
echo "head -n1 $WORKDIR/Mnova_denovotranscript2/assembly_results/tx2gene/all_assembled.tx2gene.tsv > all_assembled.okay.tx2gene.tsv
grep ">" $TRANS | gawk '{printf "%s\t\n", $1}' | sed -r 's/>//' > all_assembled.okay.headers
grep -f all_assembled.okay.headers $WORKDIR/Mnova_denovotranscript2/assembly_results/tx2gene/all_assembled.tx2gene.tsv >> all_assembled.okay.tx2gene.tsv" > tx2gene-map.cmds
sbatch --job-name=tx2gene-map --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=tx2gene-map.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# Prepare Salmon commands (probably better to use --numGibbsSamples 50 instead of 500 Bootstraps)
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.fq.gz:\1:" --rpl "{read2} s:_R1:_R2:" "salmon quant -i $WORKDIR/Mnova_genome/Mnova_transgenome.index -l A -1 {} -2 {read2} -o {sample}_salmon_eq --numBootstraps 500 --useVBOpt --seqBias --dumpEq -p \$SLURM_CPUS_PER_TASK --geneMap all_assembled.okay.tx2gene.tsv" ::: $(ls -1 $WORKDIR/combined_reads/*_R1.fq.gz ) > $JOBNAME.cmds

# submit the job array to the cluster
sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm
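
Once the quantification jobs finish, the mapping rate of each sample can be summarised from Salmon's run metadata. A minimal sketch, assuming jq is available:

# tabulate the percentage of mapped reads per sample (from aux_info/meta_info.json)
for d in *_salmon_eq; do printf "%s\t%s\n" "$d" "$(jq '.percent_mapped' $d/aux_info/meta_info.json)"; done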

# upload results to SharePoint
rclone copy -P -L --exclude "**/*.html" $WORKDIR/Gene_expression Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/Gene_expression
# repeat to copy html files
rclone copy -P -L --ignore-checksum --ignore-size --include "**/*.html" $WORKDIR/Gene_expression Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/Gene_expression

Expression-based clustering of transcripts was performed with a modified version of RapClust (Trapnell et al. 2013).

CONDA_NAME=genomics
conda activate $CONDA_NAME
mamba install mcl # mcl is required by RapClust (installed below via pip)
# install updated version of rapclust
pip install git+https://github.com/IdoBar/RapClust@patch-2
# mamba install -n $CONDA_NAME kallisto salmon rapclust
WORKDIR="/home/ibar/adna/sandbox/OTE14085"
cd $WORKDIR/Gene_expression/Salmon_quant_20_05_2025

# unzip the eq_classes files
# find . -name eq_classes.txt.gz | parallel  "gzip -cd {} > {.}"

JOBNAME="Mnova_rapclust"
NCORES=12
MEM=64
WALLTIME=10:00:00
# prepare config file for RapClust
echo "conditions:
  - Mnova_EM
  - Mnova_LM
samples:" > $JOBNAME.yaml
find . -maxdepth 1 -type d -name  "*N*_salmon_eq" | sort | grep -v "26S23" | gawk -F "/" 'BEGIN{printf "  Mnova_EM:\n"}{printf "    - %s\n", $2}' >> $JOBNAME.yaml

find . -maxdepth 1 -type d -name  "*S*_salmon_eq" | sort | grep -v "26S23" | gawk -F "/" 'BEGIN{printf "  Mnova_LM:\n"}{printf "    - %s\n", $2}' >> $JOBNAME.yaml
echo "outdir: $JOBNAME" >> $JOBNAME.yaml

# prepare SLURM script 
echo "find . -name eq_classes.txt.gz | parallel  \"gzip -cd {} > {.}\"; RapClust --config $JOBNAME.yaml" > $JOBNAME.cmds
# submit the job array to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

All the generated data was copied to cloud-based storage (Griffith Research Space and SharePoint).

WORKDIR=/scratch/project/adna/sandbox/OTE14085
rclone copy -P -L --exclude "**work/*" --exclude "**nextflow/*" --exclude "**.nextflow*" --exclude "**/*.html" $WORKDIR Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085

Then repeat with these flags to copy the html files (working around a known SharePoint-specific issue).

rclone copy -P -L --ignore-checksum --ignore-size --include "**/*.html" $WORKDIR Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/

General information

This document was last updated at 2025-05-21 16:55:28 using R Markdown (built with R version 4.2.1 (2022-06-23 ucrt)). The source code for this webpage can be found at https://github.com/IdoBar/Mnova_transcriptome_pipeline (or via the GitHub logo at the top right corner of this page).

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful for authoring documents and reports that include code, as it can execute the code and use the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com, R Markdown: The Definitive Guide and the Rmarkdown cheatsheet.


Bibliography

Anders S, Pyl PT, Huber W (2015) HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31:166–169. doi: 10.1093/bioinformatics/btu638
Behera S, Voshall A, Moriyama EN (2021) Plant Transcriptome Assembly: Review and Benchmarking. Bioinformatics
Brownlow A, Davison NJ, Morin PA, et al. (2024) The genome sequence of the minke whale, Balaenoptera acutorostrata (Lacépède, 1804). Wellcome Open Res 9:706. doi: 10.12688/wellcomeopenres.23367.1
Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nature Methods 12:59–60. doi: 10.1038/nmeth.3176
Bushmanova E, Antipov D, Lapidus A, et al. (2016) rnaQUAST: A quality assessment tool for de novo transcriptome assemblies. Bioinformatics 32:2210–2212. doi: 10.1093/bioinformatics/btw218
Carminati M-V, Gashi VL, Li R, et al. (2024) Novel Megaptera novaeangliae (Humpback whale) haplotype chromosome-level reference genome. Sci Data 11:1113. doi: 10.1038/s41597-024-03922-9
Freedman AH, Clamp M, Sackton TB (2021) Error, noise and bias in de novo transcriptome assemblies. Mol Ecol Resour 21:18–29. doi: 10.1111/1755-0998.13156
Kim D, Paggi JM, Park C, et al. (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37:907–915. doi: 10.1038/s41587-019-0201-4
Liao Y, Smyth GK, Shi W (2014) featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923–930. doi: 10.1093/bioinformatics/btt656
Love MI, Soneson C, Patro R (2018) Swimming downstream: Statistical analysis of differential transcript usage following Salmon quantification. F1000Research 7:952. doi: 10.12688/f1000research.15398.1
MacManes MD (2018) The Oyster River Protocol: A multi-assembler and kmer approach for de novo transcriptome assembly. PeerJ 6:e5428. doi: 10.7717/peerj.5428
Patro R, Duggal G, Love MI, et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417–419. doi: 10.1038/nmeth.4197
Pertea M, Kim D, Pertea GM, et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protocols 11:1650–1667. doi: 10.1038/nprot.2016.095
Rivera-Vicéns RE, Garcia-Escudero CA, Conci N, et al. (2022) TransPi—a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly. Molecular Ecology Resources 22:2070–2086. doi: 10.1111/1755-0998.13593
Smith-Unna R, Boursnell C, Patro R, et al. (2016) TransRate: Reference-free quality assessment of de novo transcriptome assemblies. Genome Res 26:1134–1144. doi: 10.1101/gr.196469.115
Song L, Florea L (2015) Rcorrector: Efficient and accurate error correction for Illumina RNA-seq reads. GigaScience 4:48. doi: 10.1186/s13742-015-0089-y
Trapnell C, Hendrickson DG, Sauvageau M, et al. (2013) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31:46–53. doi: 10.1038/nbt.2450