Experimental Design

Aims

  • Identify genes that play a role in metabolism and body condition during migration of Humpback whales (Megaptera novaeangliae) through differential gene expression between North and South migrating whales

Objectives

  1. Identify genes that play a role in metabolism and body condition during migration of Humpback whales (Megaptera novaeangliae) through differential gene expression between North and South migrating whales
  2. Assemble the Megaptera novaeangliae adipose tissue transcriptome

Analysis Pipeline

General overview:

  1. Data pre-processing:
    1. Quality check
    2. Adaptor trimming
    3. Post-trim quality check
    4. Error correction (?)
    5. Combine reads from sequencing lanes
  2. De Novo transcriptome assembly of the reads (?)
  3. Homology-based functional annotation of transcripts
  4. Generate counts table per gene (based on reference genome or assembled transcriptome)
  5. Differential expression analysis
  6. Summary statistics and visualisation

Methods

RNA Extraction and Sequencing

RNA was extracted from adipose tissue of 22 North-migrating male humpback whales (Megaptera novaeangliae) and 12 South-migrating males using the QIAzol method and kit. The RNA was sent for Illumina paired-end sequencing, generating 100 bp reads (Ramaciotti Genome Centre, Sydney).

The reads were downloaded to the QCIF High Performance Computing (HPC) cluster (Bunya) for bioinformatics processing and analyses, following the guidelines of Harvard Informatics (Freedman et al. 2021).

These guidelines include error correction of the reads from each set with Rcorrector v1.5.0 (Song and Florea 2015) and removal of “unfixable” reads. (?)

Reference Transcriptome Annotation

The most recent reference genome of M. novaeangliae (Carminati et al. 2024) was downloaded, along with the predicted gene models and annotations, from a Dryad repository (doi:10.5061/dryad.dv41ns271; the annotation is not included in the NCBI submission GCA_041834305.1).

CONDA_NAME="ref-trans" # genomics
mamba create -n $CONDA_NAME hisat2 stringtie htseq psiclass subread gffcompare biobambam fastp rseqc gff2bed samtools gffread

WORKDIR="/home/ibar/adna/sandbox/OTE14085"
GENOME="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta"
GFF="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.annotation.gff"

cd $WORKDIR

NCORES=8
MEM=16
WALLTIME=2:00:00
JOBNAME="prep-genome"
echo "gzip -cd $GFF.gz > $GFF
gzip -cd $GENOME.gz > $GENOME
samtools faidx $GENOME
gff2bed < $GFF > ${GFF%.*}.bed
gffread -E $GFF -T -o ${GFF%.*}.gtf
hisat2_extract_splice_sites.py ${GFF%.*}.gtf > ${GFF%.*}.ss
hisat2_extract_exons.py ${GFF%.*}.gtf > ${GFF%.*}.exon" > $JOBNAME.cmds
# submit the job to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | gawk '{print $4}')
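The derived file names in the commands above all come from the `${GFF%.*}` parameter expansion, which strips the last extension from the annotation path. The short sketch below shows what each name resolves to. The relative path is a stand-in for the real `$WORKDIR` location, and the variable names `PREFIX`, `BED`, `GTF`, `SS` and `EXON` are ours, used for illustration only.

```shell
# Stand-in annotation path (the real one lives under $WORKDIR/Mnova_genome)
GFF="Mnova_genome/GIU3625_Humpback_whale.annotation.gff"

# ${GFF%.*} removes the shortest trailing ".suffix", here ".gff"
PREFIX="${GFF%.*}"

# File names as used by gff2bed, gffread and the HISAT2 helper scripts
BED="$PREFIX.bed"
GTF="$PREFIX.gtf"
SS="$PREFIX.ss"
EXON="$PREFIX.exon"

printf '%s\n' "$BED" "$GTF" "$SS" "$EXON"
```

Because only the last extension is removed, the `.annotation` part of the name is kept in every derived file.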

Assemble reference-guided transcriptome

The reads were aligned to the reference genome with HISAT2 v2.2.1 (Kim et al. 2019), followed by the reconstruction of transcripts with StringTie v3.0.0. Transcript- and gene-level abundances were then estimated from the alignments with StringTie v3.0.0 in its Ballgown-compatible mode (-e -B), as detailed in Pertea et al. (2016).

CONDA_NAME="ref-trans" # genomics
mamba create -n $CONDA_NAME hisat2 stringtie htseq psiclass subread gffcompare biobambam fastp rseqc gffread samtools gff2bed
WORKDIR="/home/ibar/adna/sandbox/OTE14085"
FQ_DIR="$WORKDIR/combined_reads"
GENOME="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta" # HumpbackWhale_Final_Genome_forNCBI.fasta # 
# index the genome
GFF="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.annotation.gff"


mkdir -p $FQ_DIR/trimmed_reads/QC && cd $FQ_DIR

# process the reads
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="Mnova-fastp"
# find $WORKDIR/combined_reads -maxdepth 1 -name "*_R1.fq.gz" | parallel --dry-run
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.fq.gz:\1:" --rpl "{file2} s:_R1:_R2:" "fastp -i {} -I {file2} --detect_adapter_for_pe -c -l 30 -p -w \$SLURM_CPUS_PER_TASK -z 7 -o $FQ_DIR/trimmed_reads/{sample}_R1.trimmed.fastq.gz -O $FQ_DIR/trimmed_reads/{sample}_R2.trimmed.fastq.gz -j $FQ_DIR/trimmed_reads/QC/{sample}.fastp.json" ::: $(ls -1 $FQ_DIR/*R1.fq.gz) > $JOBNAME.cmds 

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# align the reads to the genome
ASS="Mnova_ref_based_assembly"
mkdir -p $WORKDIR/$ASS/aligned_reads $WORKDIR/$ASS/assembly && cd $WORKDIR/$ASS
CONDA_NAME="ref-trans" # genomics
NCORES=10
MEM=64
WALLTIME=2:00:00
JOBNAME="hisat-build"
# prepare the genome index 
echo "hisat2-build -p \$[SLURM_CPUS_PER_TASK] --ss ${GFF%.*}.ss --exon ${GFF%.*}.exon $GENOME GIU3625-ht2-index" > $JOBNAME.cmds 
# submit the job to the cluster
sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm

# align the reads to the genome
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="hisat-align"
# find $WORKDIR/combined_reads -maxdepth 1 -name "*_R1.fq.gz" | parallel --dry-run
# the index prefix must match the one written by hisat2-build above (GIU3625-ht2-index in $WORKDIR/$ASS)
parallel --dry-run --rpl "{sample} s:.+/(.+)_R1.trimmed.fastq.gz:\1:" --rpl "{file2} s:_R1:_R2:" "hisat2 --dta -p \$[SLURM_CPUS_PER_TASK] -x $WORKDIR/$ASS/GIU3625-ht2-index -1 {} -2 {file2} | bamsormadup tmpfile=\$TMPDIR/bamsormadup_\$(hostname)_\$SLURM_ARRAY_JOB_ID inputformat=sam threads=\$[SLURM_CPUS_PER_TASK - 2] indexfilename=aligned_reads/{sample}.dedup.csorted.bam.bai > aligned_reads/{sample}.dedup.csorted.bam" ::: $(ls -1 $FQ_DIR/trimmed_reads/*_R1.trimmed.fastq.gz | grep -v "26S23") > $JOBNAME.cmds

# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# find which jobs failed
FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -s -d ',')

# rerun failed tasks
sbatch -a $FAILED_TASKS --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm 
# assemble transcripts with StringTie
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="stringtie"
parallel --dry-run --rpl "{sample} s:.+/(.+).dedup.csorted.bam:\1:" "stringtie {} --rf -l {sample} -p \$[SLURM_CPUS_PER_TASK] -G $GFF -o assembly/{sample}.gtf" ::: $(ls -1 aligned_reads/*.dedup.csorted.bam | grep -v "26S23") > $JOBNAME.cmds 
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# find which jobs failed
FAILED_TASKS=$(sacct -n -X -j $ARRAY_ID -o state%20,jobid%20 | grep -v COMPLETED | gawk '{print $2}' | cut -d"_" -f2 | paste -s -d ',')

# merge all transcripts from the different samples
find assembly -name "*.gtf" > mergelist.txt
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="stringtie-merge"

echo "stringtie --merge -p \$[SLURM_CPUS_PER_TASK] -G $GFF -o ${ASS}_stringtie_merged.gtf mergelist.txt; gffcompare -r $GFF -G -o merged ${ASS}_stringtie_merged.gtf" > $JOBNAME.cmds
# submit to the cluster
ARRAY_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | gawk '{print $4}')
    
# Estimate transcript abundance with StringTie (-e -B writes Ballgown input tables)
JOBNAME="ballgown"

parallel --dry-run --rpl "{sample} s:.+/(.+).dedup.csorted.bam:\1:" "mkdir -p ballgown/{sample}; stringtie -e -B -p \$[SLURM_CPUS_PER_TASK] -G ${ASS}_stringtie_merged.gtf -o ballgown/{sample}/{sample}.gtf {}" ::: $(ls -1 aligned_reads/*.dedup.csorted.bam | grep -v "26S23") > $JOBNAME.cmds
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')
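The failed-task lookup used above (`sacct ... | grep -v COMPLETED | ...`) can be tried without a cluster. The sketch below replays the same text processing on mock `sacct` output. The job and task IDs are invented, `mock_sacct` is a hypothetical stand-in for the real `sacct` call, and plain `awk` is used instead of `gawk` for portability.

```shell
# Mock of `sacct -n -X -j <array_id> -o state%20,jobid%20` (invented values)
mock_sacct() {
cat <<'EOF'
           COMPLETED            1234567_1
              FAILED            1234567_2
           COMPLETED            1234567_3
             TIMEOUT            1234567_4
EOF
}

# Keep non-completed rows, take the job ID column, keep the task index
# after the underscore and join the indices with commas
FAILED_TASKS=$(mock_sacct | grep -v COMPLETED | awk '{print $2}' | cut -d"_" -f2 | paste -s -d ',' -)
echo "$FAILED_TASKS"   # prints 2,4

# Only resubmit when something actually failed (an empty -a value is an error)
if [ -n "$FAILED_TASKS" ]; then
  echo "would run: sbatch -a $FAILED_TASKS ..."
fi
```

One caveat with this column order: a state that contains spaces (for example `CANCELLED by <uid>`) shifts the job ID out of the second column, so requesting `jobid` before `state` may be more robust.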

The reads were mapped back to the assembled transcriptome using bwa-mem2 (on Galaxy AU). Based on the high alignment rates (~97%), we concluded that this reference-based assembly provides a good representation of the transcripts in the samples and proceeded with it for downstream gene annotation and quantification analyses.

MultiQC

Quality metrics were collected from the raw read QC and alignment steps and were consolidated into a single, interactive report for each batch using MultiQC v1.21 (Ewels et al. 2016).

CONDA_NAME="genomics" # 
WORKDIR="/scratch/project/adna/sandbox/OTE14085"
FQ_DIR="$WORKDIR/combined_reads"
GFF="$WORKDIR/Mnova_genome/GIU3625_Humpback_whale.annotation.gff"
ASS="Mnova_ref_based_assembly"
cd $WORKDIR/$ASS

# Alignment QC
NCORES=12
MEM=64
WALLTIME=2:00:00
JOBNAME="align-qc"

parallel --dry-run --rpl "{sample} s:.+/(.+).dedup.csorted.bam:\1:" "unset DISPLAY ; qualimap bamqc -bam {} --java-mem-size=32G -c -gff ${GFF} -outdir aligned_reads/{sample}_bamqc; mosdepth -t \$SLURM_CPUS_PER_TASK -x -n aligned_reads/{sample}_bamqc/{sample} {}" ::: $(ls -1 aligned_reads/*.dedup.csorted.bam) > $JOBNAME.cmds 
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# set job resources
NCORES=8
MEM=32
WALLTIME="10:00:00"
JOBNAME="Multiqc_Mnova_RNAseq"

# link fastp results
ln -s $FQ_DIR/trimmed_reads/QC ./

# submit it as a Slurm job
echo "multiqc --interactive --force -i $JOBNAME -o $JOBNAME ." > $JOBNAME.cmds
# submit the job 
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | cut -f 4 -d " " )
# Done!

# Copy files to SharePoint
rclone copy -P --exclude "**/*.html" $WORKDIR/$ASS Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/$ASS
# Copy html files to SharePoint
rclone copy -P --ignore-checksum --ignore-size --include "**/*.html" $WORKDIR/$ASS Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/$ASS
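All array submissions in this document pass a commands file (`CMDS_FILE`) to `~/bin/array.slurm`, which is not reproduced here. The sketch below shows one plausible core of such a script: each array task runs the line of the commands file that matches its task index. This is an assumption about how the helper works, not its actual content. The function `run_task` and the demo file are hypothetical, and on the cluster the index would come from `$SLURM_ARRAY_TASK_ID` after activating the `$CONDA_NAME` environment.

```shell
# Hypothetical core of an array job script: run line N of a commands file
run_task() {
  cmds_file=$1
  task_id=$2
  # select the command on the line matching the task index
  cmd=$(sed -n "${task_id}p" "$cmds_file")
  if [ -z "$cmd" ]; then
    echo "no command found for task $task_id" >&2
    return 1
  fi
  bash -c "$cmd"
}

# Demo with a three-line commands file (stand-in for $JOBNAME.cmds)
TMP=$(mktemp -d)
printf '%s\n' "echo sample_A" "echo sample_B" "echo sample_C" > "$TMP/demo.cmds"

# On Slurm this would be: run_task "$CMDS_FILE" "$SLURM_ARRAY_TASK_ID"
OUT=$(run_task "$TMP/demo.cmds" 2)
echo "$OUT"   # prints sample_B
rm -rf "$TMP"
```

This layout is why the array range is set with `-a 1-$(cat $JOBNAME.cmds | wc -l)`: there is exactly one task per line.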

The sequences of the assembled genome-guided transcripts were extracted from the reference genome using gffread v0.12.7 (Pertea and Pertea 2020).

CONDA_NAME="ref-trans" # genomics
ASS="Mnova_ref_based_assembly"
WORKDIR="/scratch/project/adna/sandbox/OTE14085/Mnova_ref_based_assembly"
GENOME="/scratch/project/adna/sandbox/OTE14085/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta" # HumpbackWhale_Final_Genome_forNCBI.fasta # 

cd $WORKDIR

NCORES=4
MEM=16
WALLTIME=2:00:00
JOBNAME="stringtie-gtf2fasta"

echo "gzip -cd $GENOME.gz > $GENOME
samtools faidx $GENOME
gffread -w ${ASS}_stringtie_merged.fa -g $GENOME ${ASS}_stringtie_merged.gtf" > $JOBNAME.cmds
# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | gawk '{print $4}')
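A quick sanity check after this step is to confirm that the FASTA written by `gffread -w` holds one sequence per transcript in the merged GTF. The sketch below does this on toy files. The file names, IDs and sequences are invented, and on real data the two counts would be taken from the merged GTF and FASTA files instead.

```shell
TMP=$(mktemp -d)

# Toy merged GTF with two single-exon transcripts (invented IDs)
printf 'chr1\tStringTie\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n' > "$TMP/toy.gtf"
printf 'chr1\tStringTie\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n' >> "$TMP/toy.gtf"
printf 'chr1\tStringTie\ttranscript\t200\t300\t.\t-\t.\tgene_id "G2"; transcript_id "G2.1";\n' >> "$TMP/toy.gtf"
printf 'chr1\tStringTie\texon\t200\t300\t.\t-\t.\tgene_id "G2"; transcript_id "G2.1";\n' >> "$TMP/toy.gtf"

# Toy transcript FASTA, as gffread -w would write it
printf '>G1.1\nACGT\n>G2.1\nTTGA\n' > "$TMP/toy.fa"

# Count transcript features in the GTF and sequence headers in the FASTA
N_GTF=$(awk -F'\t' '$3 == "transcript"' "$TMP/toy.gtf" | wc -l | tr -d ' ')
N_FA=$(grep -c '^>' "$TMP/toy.fa")

if [ "$N_GTF" -eq "$N_FA" ]; then
  echo "OK: $N_FA transcripts in both files"
else
  echo "Mismatch: $N_GTF in GTF vs $N_FA in FASTA" >&2
fi
rm -rf "$TMP"
```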

Predict proteins with ORFanage

The gene annotation of the assembled transcriptome was ‘sanitized’ using gffread v0.12.7 (Pertea and Pertea 2020), followed by ORFanage v1.2.0 to predict open reading frames of proteins (see detailed instructions in the Documentation and the publication by Varabyou et al. (2023)).
Note: consider using AGAT to sanitize the gene annotation files and convert them from GTF to GFF.

# setup environment
CONDA_NAME="ref-trans"
# mamba install -n $CONDA_NAME orfanage 
ASS="Mnova_ref_based_assembly"
WORKDIR="/scratch/project/adna/sandbox/OTE14085/$ASS"
GENOME="/scratch/project/adna/sandbox/OTE14085/Mnova_genome/GIU3625_Humpback_whale.RepeatMasked.fasta"
GFF="/scratch/project/adna/sandbox/OTE14085/Mnova_genome/GIU3625_Humpback_whale.annotation.gff"
TRANS="$WORKDIR/${ASS}_stringtie_merged.fa"

JOBNAME="orfanage"
NCORES=8
MEM=32
WALLTIME=10:00:00
echo "gffread -g $GENOME --adj-stop -T -F -J -o ${GFF%.*}.corrected.gtf $GFF
orfanage --stats ${ASS}_stringtie_merged.orfanage.stats --query ${ASS}_stringtie_merged.gtf --output ${ASS}_stringtie_merged.orfanage.gtf --reference $GENOME --rescue --cleant --minlen 20 --mode BEST --non_aug --threads \$SLURM_CPUS_PER_TASK ${GFF%.*}.corrected.gtf
gffread -g $GENOME -y ${ASS}_stringtie_merged.orfanage.prots.faa ${ASS}_stringtie_merged.orfanage.gtf
gffread -g $GENOME -w ${ASS}_stringtie_merged.orfanage.mrna.fna ${ASS}_stringtie_merged.orfanage.gtf" > $JOBNAME.cmds

# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | gawk '{print $4}')

Annotate transcriptome and proteome

Annotate transcriptome with BLAST

The assembled genome-guided transcriptome was annotated using BLASTn v2.16.0 (Camacho et al. 2009) against the non-redundant nucleotide database of the NCBI (nt) to achieve more accurate species-specific annotations. Considering that we’re dealing with a cetacean transcriptome (hopefully similar to well-annotated mammalian species), it may also be useful to annotate the transcripts against the refseq_rna database (which only contains curated gene transcripts).
BLASTn was run with nf-blast v0.4.2, a Nextflow pipeline that uses a “split-combine” approach to split the input query (entire transcriptome or proteome) to smaller “chunks” that are run in parallel on the HPC cluster.

ASS="Mnova_ref_based_assembly"
WORKDIR="/scratch/project/adna/sandbox/OTE14085/$ASS"
TRANS="$WORKDIR/${ASS}_stringtie_merged.orfanage.mrna.fna"

mkdir -p $WORKDIR/Annotation && cd $WORKDIR/Annotation

DBS="nt refseq_rna"
JOBNAME="nf-blastn-tax"
CHUNKSIZE=500
CONDA_NAME="base"
NCORES=4
MEM=16
WALLTIME=50:00:00
# run Nextflow Blast pipeline
parallel --dry-run "mkdir -p $WORKDIR/Annotation/$JOBNAME-{} && cd $WORKDIR/Annotation/$JOBNAME-{}; ~/bin/nextflow-22.11.1-edge-all run /scratch/project/adna/tools/nf-blast/nf-blast.nf --app blastn --dbDir /scratch/project/adna/tools/ncbi_db --dbName {} --query $TRANS --outfmt \"6 std stitle staxids ssciname scomname\" --options '-evalue 1e-10 -max_target_seqs 20' --chunkSize $CHUNKSIZE --outdir  $WORKDIR/Annotation/$JOBNAME-{}/results --outfileName ${ASS}_stringtie_merged.orfanage.mrna.{}.tax.outfmt6  -c ~/.nextflow/bunya.config -profile bunya,apptainer -with-tower" ::: $DBS > $JOBNAME.cmds
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')
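To make the “split-combine” idea concrete, the sketch below splits a toy FASTA file into chunks of a fixed number of sequences with `awk` and then checks that no sequence was lost when the chunks are concatenated again. This only illustrates the principle and is not how nf-blast itself is implemented. The file names, sequences and the chunk size of 2 are invented for the demo.

```shell
TMP=$(mktemp -d)

# Toy query with five sequences (one of them spans two lines)
printf '>t1\nACGT\n>t2\nGGCC\nTTAA\n>t3\nAAAA\n>t4\nCCCC\n>t5\nGGGG\n' > "$TMP/query.fa"
CHUNKSIZE=2

# Split: start a new chunk file every CHUNKSIZE headers
awk -v size="$CHUNKSIZE" -v prefix="$TMP/chunk_" '
  /^>/ {
    if (n % size == 0) {
      if (out) close(out)
      out = sprintf("%s%03d.fa", prefix, n / size + 1)
    }
    n++
  }
  { print > out }
' "$TMP/query.fa"

# Combine: concatenating the chunks should recover every sequence
N_CHUNKS=$(ls "$TMP"/chunk_*.fa | wc -l | tr -d ' ')
N_SEQS=$(cat "$TMP"/chunk_*.fa | grep -c '^>')
N_LAST=$(grep -c '^>' "$TMP/chunk_003.fa")
echo "$N_CHUNKS chunks, $N_SEQS sequences in total, $N_LAST in the last chunk"
rm -rf "$TMP"
```

In the real pipeline each chunk is searched as a separate cluster job, and the per-chunk tabular outputs are concatenated into the final results file.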

Annotate proteome with DIAMOND

Despite using nf-blast, BLASTp was too slow to annotate the predicted proteome (taking ~80 hours for a chunk of 500 proteins). We therefore used DIAMOND v2.1.10 (Buchfink et al. 2015), which offers similar accuracy for protein homology searches and produces output in the same tabular format as BLASTp, at a much greater speed (up to 1,000× faster). We adapted the nf-blast Nextflow pipeline to use DIAMOND (reducing the run time to 1.5 h for the entire proteome) and searched against the NCBI non-redundant protein database (nr) to achieve more accurate species-specific annotations. Again, it may also be useful to annotate the proteins against the curated refseq_protein database.

start_long_interactive_job # to be able to use apptainer to download images
ASS="Mnova_ref_based_assembly"
WORKDIR="/scratch/project/adna/sandbox/OTE14085/$ASS"
PROT="$WORKDIR/${ASS}_stringtie_merged.orfanage.prots.faa"

# set environment variables
PROT_DBS="nr refseq_protein"
JOBNAME="nf-dmnd-tax"
CHUNKSIZE=5000
CONDA_NAME="base"
NCORES=4
MEM=16
WALLTIME=50:00:00
# run Nextflow Blast pipeline
parallel --dry-run "mkdir -p $WORKDIR/Annotation/$JOBNAME-{} && cd $WORKDIR/Annotation/$JOBNAME-{}; ~/bin/nextflow-22.11.1-edge-all run /scratch/project/adna/tools/nf-blast/nf-blast.nf -profile bunya,apptainer,diamond_tax --query $PROT --app 'diamond blastp' --db ~/adna/tools/ncbi_db/{} --diamondOpts '--very-sensitive -e 1e-10 -k 20' --chunkSize $CHUNKSIZE --outDir $WORKDIR/Annotation/$JOBNAME-{}/results --out ${ASS}_stringtie_merged.orfanage.prots.dmnd.{}.tax.outfmt6  -c ~/.nextflow/bunya.config -with-tower" ::: $PROT_DBS > $JOBNAME.cmds
# submit to the cluster
ARRAY_ID=$(sbatch -a 1-$(cat $JOBNAME.cmds | wc -l) --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/array.slurm | gawk '{print $4}')

# mkdir -p $WORKDIR/Annotation/Annotation_results/$JOBNAME
# ~/bin/nextflow-22.11.1-edge-all run ~/adna/tools/nf-blast/nf-blast.nf -profile bunya,apptainer,diamond_tax --query $PROT --app 'diamond blastp' --db ~/adna/tools/ncbi_db/$DB --diamondOpts '--very-sensitive -e 1e-10 -k 20' --chunkSize 10000 --outDir  $WORKDIR/Annotation/$JOBNAME/results --out ${ASS}_stringtie_merged.orfanage.prots.dmnd.$DB.tax.outfmt6  -c ~/.nextflow/bunya.config -with-tower -w $TMPDIR/$JOBNAME/work
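Both the BLASTn and DIAMOND searches report up to 20 hits per query in tabular (outfmt 6) format, so a common next step is to keep only the best hit for each query. The sketch below does this by bit score (column 12 of the standard fields) on a toy table. The query and subject names and all the values are invented, and only the 12 standard columns are included; the extra title and taxonomy columns requested above would simply follow them.

```shell
TMP=$(mktemp -d)
TAB=$(printf '\t')

# Toy outfmt6 table: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore (invented values)
printf 't1\tsubjA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t180\n' > "$TMP/hits.outfmt6"
printf 't1\tsubjB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-55\t190\n' >> "$TMP/hits.outfmt6"
printf 't2\tsubjC\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-20\t95\n' >> "$TMP/hits.outfmt6"

# Sort by query, then by descending bit score, and keep the first row per query
BEST=$(sort -t "$TAB" -k1,1 -k12,12nr "$TMP/hits.outfmt6" \
  | awk -F'\t' '!seen[$1]++ { print $1 "\t" $2 "\t" $12 }')
echo "$BEST"
rm -rf "$TMP"
```

Here `t1` keeps `subjB` (bit score 190) and `t2` keeps its only hit, `subjC`.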

Functional annotation of proteins

The predicted proteins in the transcriptome were further annotated using InterProScan v5.66-98.0 to assign protein families, motifs and ontologies to assist with transcript-to-gene annotation. Note that there are known issues with the SignalP, TMHMM and Phobius analyses (see discussions and solutions on GitHub and BioStar).

IPSCAN_VERSION=5.66-98.0
NCORES=12
MEM=96
WALLTIME=50:00:00
# annotate proteins 
ASS="Mnova_ref_based_assembly"
WORKDIR="/scratch/project/adna/sandbox/OTE14085/$ASS"
PROT="$WORKDIR/${ASS}_stringtie_merged.orfanage.prots.faa"

cd $WORKDIR/Annotation
JOBNAME="Mnova_prot_ipscan"

# prepare the commands (don't forget to remove the asterisk at the end of the proteins!)
ls -1 $PROT | parallel --dry-run "mkdir -p $WORKDIR/Annotation/$JOBNAME; sed 's/[*]//g' {} > \$TMPDIR/{/} ; apptainer exec -B /home/ibar/scratch/tools -B $WORKDIR/Annotation/$JOBNAME:/output -B \$TMPDIR:/temp $NXF_SINGULARITY_CACHEDIR/interproscan_${IPSCAN_VERSION}.sif /opt/interproscan/interproscan.sh -i /temp/{/} -d /output -pa -dp -goterms -f TSV -T /temp -cpu \$SLURM_CPUS_PER_TASK && gawk '\$4~/PANTHER/' $WORKDIR/Annotation/$JOBNAME/{/}.tsv > $WORKDIR/Annotation/$JOBNAME/{/.}.panther.tsv" > $JOBNAME.cmds

# submit to the cluster
JOB_ID=$(sbatch --job-name=$JOBNAME --cpus-per-task=$NCORES --mem=${MEM}G --time=$WALLTIME --export=ALL,CMDS_FILE=$JOBNAME.cmds,CONDA_NAME=$CONDA_NAME ~/bin/serial_jobs_run.slurm | gawk '{print $4}')
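The two text-processing steps wrapped around InterProScan in the command above (removing the stop-codon asterisks from the protein file, and pulling the PANTHER rows out of the TSV output) can be checked on toy data. In the sketch below the protein IDs, sequences and TSV rows are invented, only the first five TSV columns are included, and `awk` replaces `gawk` for portability.

```shell
TMP=$(mktemp -d)

# Toy protein FASTA with trailing stop-codon asterisks, as written by gffread -y
printf '>p1\nMKTAYIAK*\n>p2\nMAAGLL*\n' > "$TMP/prots.faa"

# Remove the asterisks, which InterProScan does not accept in its input
sed 's/[*]//g' "$TMP/prots.faa" > "$TMP/prots.clean.faa"
N_STARS=$(grep -c '[*]' "$TMP/prots.clean.faa")

# Toy InterProScan TSV: protein, md5, length, analysis, signature accession
printf 'p1\tmd5a\t8\tPANTHER\tPTHR00001\n' > "$TMP/prots.tsv"
printf 'p1\tmd5a\t8\tPfam\tPF00001\n' >> "$TMP/prots.tsv"
printf 'p2\tmd5b\t6\tPfam\tPF00002\n' >> "$TMP/prots.tsv"

# Keep only the rows whose analysis column (4) is PANTHER
awk -F'\t' '$4 ~ /PANTHER/' "$TMP/prots.tsv" > "$TMP/prots.panther.tsv"
N_PANTHER=$(wc -l < "$TMP/prots.panther.tsv" | tr -d ' ')

echo "asterisk lines left: $N_STARS, PANTHER rows: $N_PANTHER"
rm -rf "$TMP"
```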

The transcriptome assembly and annotation tables were uploaded to Griffith SharePoint.

ASS="Mnova_ref_based_assembly"
WORKDIR="/scratch/project/adna/sandbox/OTE14085/$ASS"
# Copy html files to SharePoint
rclone copy -P --ignore-checksum --ignore-size --include "**/*.html" $WORKDIR "Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/$ASS"

# Copy files to SharePoint
rclone copy -P --exclude "**/*.html" $WORKDIR/Annotation "Erika_PhD:General/Erika_Whale_fasting_genomics/OTE14085/$ASS/Annotation"

General information

This document was last updated at 2025-10-05 01:07:07.424896 using R Markdown (built with R version 4.5.1 (2025-06-13 ucrt)). The source code for this webpage can be found at https://github.com/IdoBar/Mnova_transcriptome_pipeline.



Bibliography

Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nature Methods 12:59–60. doi: 10.1038/nmeth.3176
Camacho C, Coulouris G, Avagyan V, et al. (2009) BLAST+: Architecture and applications. BMC Bioinformatics 10:421. doi: 10.1186/1471-2105-10-421
Carminati M-V, Gashi VL, Li R, et al. (2024) Novel Megaptera novaeangliae (Humpback whale) haplotype chromosome-level reference genome. Sci Data 11:1113. doi: 10.1038/s41597-024-03922-9
Ewels P, Magnusson M, Lundin S, Käller M (2016) MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32:3047–3048. doi: 10.1093/bioinformatics/btw354
Freedman AH, Clamp M, Sackton TB (2021) Error, noise and bias in de novo transcriptome assemblies. Mol Ecol Resour 21:18–29. doi: 10.1111/1755-0998.13156
Kim D, Paggi JM, Park C, et al. (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37:907–915. doi: 10.1038/s41587-019-0201-4
Pertea G, Pertea M (2020) GFF Utilities: GffRead and GffCompare. F1000Res 9:ISCB Comm J–304. doi: 10.12688/f1000research.23297.2
Pertea M, Kim D, Pertea GM, et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protocols 11:1650–1667. doi: 10.1038/nprot.2016.095
Song L, Florea L (2015) Rcorrector: Efficient and accurate error correction for Illumina RNA-seq reads. GigaScience 4:48. doi: 10.1186/s13742-015-0089-y
Varabyou A, Erdogdu B, Salzberg SL, Pertea M (2023) Investigating open reading frames in known and novel transcripts using ORFanage. Nat Comput Sci 3:700–708. doi: 10.1038/s43588-023-00496-1