August 2025

Introduction to Genome Assembly with Oxford Nanopore Sequencing

Welcome to the Workshop

Genome assembly represents one of the most fundamental and transformative processes in modern genomics, serving as the foundation for virtually all downstream genomic analyses. In this workshop, you will learn the essential principles and practical techniques for assembling genomes using Oxford Nanopore Technology (ONT) sequencing data, gaining hands-on experience with one of the most revolutionary sequencing platforms in contemporary biology.

The Critical Importance of Draft Genome Assemblies

Foundation of Genomic Understanding

Draft genome assemblies serve as the cornerstone of modern biological research, providing the essential framework for understanding life at its most fundamental level. A genome assembly is essentially a computational reconstruction of an organism's complete DNA sequence, assembled from millions of overlapping sequence fragments. While "draft" assemblies may not achieve the perfection of finished reference genomes, they provide invaluable insights that drive scientific discovery across multiple disciplines.

The importance of draft genomes extends far beyond academic curiosity. In medicine, genome assemblies enable the identification of disease-causing genetic variants, facilitate personalized treatment approaches, and accelerate drug discovery. Agricultural applications include crop improvement through the identification of beneficial traits and resistance genes. In evolutionary biology, comparative genomics using draft assemblies reveals the relationships between species and the mechanisms driving adaptation and speciation.

Democratizing Genomic Research

Draft genome assemblies have democratized genomic research by making genome sequencing accessible to laboratories worldwide. Unlike finished genomes, which require substantial time and computational resources, draft assemblies can be generated relatively quickly and cost-effectively while still providing sufficient quality for most research applications. This accessibility has led to an explosion in the number of sequenced genomes, with thousands of new species being sequenced annually.

The impact of this democratization cannot be overstated. Small research groups can now tackle questions that were previously the exclusive domain of large genome centers. This has led to breakthrough discoveries in non-model organisms, rare diseases, and microbial communities that would otherwise remain unstudied.

Enabling Personalized Medicine and Precision Agriculture

In clinical applications, draft genome assemblies of pathogens enable rapid identification and characterization during disease outbreaks. The COVID-19 pandemic demonstrated the critical importance of quick genome assembly capabilities, as researchers worldwide used draft assemblies to track viral evolution, identify variants of concern, and develop targeted interventions in near real-time.

Similarly, in agriculture, draft assemblies of crop plants and their associated microbiomes provide insights into plant health, stress resistance, and yield optimization. These assemblies enable breeders to make informed decisions about crop improvement and help farmers optimize growing conditions based on the genetic characteristics of their crops.

Oxford Nanopore Technology: Revolutionizing Genome Assembly

The Long-Read Advantage

Oxford Nanopore Technology has fundamentally transformed the landscape of genome assembly through its unique approach to DNA sequencing. Unlike traditional short-read sequencing technologies that produce fragments of 150-300 base pairs, ONT generates extraordinarily long reads that can span tens of thousands of base pairs, with some reads exceeding 100,000 base pairs in length.

This long-read capability addresses one of the most persistent challenges in genome assembly: repetitive sequences. Traditional short-read assemblies often struggle with repetitive elements, structural variants, and complex genomic regions, resulting in fragmented assemblies with gaps and misassemblies. ONT's long reads can span these problematic regions entirely, providing the contiguity necessary for high-quality assemblies.

Real-Time Sequencing and Portability

One of ONT's most revolutionary features is its real-time sequencing capability. Unlike other sequencing platforms that require hours or days to complete a run, ONT devices stream data continuously during sequencing. This enables researchers to monitor assembly quality in real-time and make informed decisions about when to stop sequencing, optimizing both time and cost.

The portability of ONT devices, particularly the MinION, has brought genomic sequencing capabilities to field stations, remote locations, and resource-limited settings. This portability has enabled groundbreaking research in biodiversity hotspots, rapid pathogen identification during outbreaks, and real-time monitoring of environmental samples.

Direct DNA Sequencing and Methylation Detection

ONT's nanopore sequencing technology directly measures native DNA molecules as they pass through biological nanopores, eliminating the need for amplification steps that can introduce bias. This direct approach preserves important epigenetic modifications, particularly DNA methylation, which plays crucial roles in gene regulation and development. The ability to simultaneously assemble genomes and characterize their epigenetic landscapes provides unprecedented insights into genome function.

The Perfect Synergy: ONT and Genome Assembly

Simplified Assembly Workflows

The combination of ONT's long reads and modern assembly algorithms has dramatically simplified genome assembly workflows. While short-read assemblies often require complex strategies involving multiple library types and careful parameter optimization, ONT assemblies can frequently be generated using straightforward approaches with minimal preprocessing.

Modern ONT assembly pipelines can transform raw sequencing data into high-quality draft assemblies in hours rather than days or weeks. This speed enables iterative approaches where researchers can quickly assess assembly quality and adjust sequencing strategies if needed.

Resolving Complex Genomic Features

ONT's long reads excel at resolving complex genomic features that have traditionally been challenging for genome assembly. These include structural variations, tandem repeats, centromeres, and highly variable regions such as immune system genes. The ability to accurately assemble these regions provides more complete and biologically meaningful genome representations.

Cost-Effective High-Quality Assemblies

The efficiency of ONT-based assembly workflows has made high-quality genome assembly more cost-effective than ever before. Researchers can often achieve draft assemblies suitable for most research purposes using a single ONT flow cell, dramatically reducing the cost per genome compared to traditional approaches.

Learning Objectives and Workshop Overview

Throughout this workshop, you will develop practical skills in handling ONT sequencing data, understanding quality metrics, and implementing assembly workflows. You will learn to evaluate assembly quality, identify potential issues, and optimize assembly parameters for different types of genomic data.

By the end of this workshop, you will understand how ONT's unique characteristics enable high-quality genome assemblies and how these assemblies serve as powerful tools for addressing diverse biological questions. You will be equipped with the knowledge and skills necessary to implement genome assembly projects in your own research, contributing to the expanding universe of genomic knowledge that continues to transform our understanding of life itself.

The skills you develop here will prepare you for the exciting frontier of genomics, where new discoveries await in every genome assembled and every question asked. Welcome to the world of genome assembly with Oxford Nanopore Technology – where the blueprint of life becomes accessible, interpretable, and transformative.

Workshop: 1KSA Genome Assembly

This handbook provides step-by-step instructions for performing de novo genome assembly using Oxford Nanopore Technology (ONT) sequencing data using the Flye assembler on the SAIAB High Performance Computer. The workshop is designed for individuals with minimal bioinformatics.

What You Will Learn

Pre-requisites

SAIAB account & project access

Participants must have an active SAIAB account. If you don’t have an account or haven’t been granted project access, please reach out to SAIAB support well in advance to set this up.

A basic understanding of command line operations is required for participants, including:

Navigating directories (cd, ls, pwd)

Viewing and editing files (nano, less, cat)/p>

Copying, moving, and removing files (cp, mv, rm)

Running basic shell commands (echo, grep, find)

Prerequisite training:

Participants must work through the SAIAB Introduction to Unix for Bioinformaticians.

Schedule

Day 1: Introduction & data preparation

09:00 – 09:30 | Welcome & workshop overview

Introduction to the 1KSA project

Goals and learning outcomes of the workshop

Overview of the workshop schedule

09:30 – 10:30 |Getting started & logging into the SAIAB

Computer setup

Logging into SAIAB via SSH

Navigating the command line

Linux refresher exercises

10:30 – 10:45 | Coffee Break

10:45 – 11:30 | Understanding the SAIAB setup & submitting jobs

Loading modules

Understanding the SLURM job scheduler

SLURM commands

Job script parameters

11:30 – 12:30 | Pipeline overview & download

Overview of pipeline steps

SAIAB setup

Clone the pipeline repository with the necessary scripts

Dependencies

12:30 – 13:30 | Lunch Break

13:30 – 15:00 | Running the pipeline

Data preparation

Running each step individually

Interpretation of results

15:00 – 15:15 | Break

15:15 – 16:30 | Running the pipeline

Running each step individually

Running the pipeline with a script

Interpretation of results

Day 2: Quality assessment & troubleshooting

09:00 – 10:30 | Evaluating assembly quality

Species Identification

Evaluating assembly quality and generating a report

Generate snail plots

10:30 – 10:45 | Coffee Break

10:45 – 12:30 | Troubleshooting common issues

Kmer-Analysis (no peaks detected)

Diagnosing poor assembly results (low coverage, fragmentation)

Addressing issues with read quality and contamination

Handling computational errors and resource limitations

Low BUSCO completeness: improving assembly quality

12:30 – 13:30 | Lunch Break

13:30 – 15:00 | Wrap-up & feedback

Q&A and feedback session

Getting started

# Terminal
These black coloured boxes are code blocks.

1. Accessing the SAIAB server

Logging into SAIAB via SSH

ssh username@lab417.saiab.ac.za                             #use your username and password
# You are on the login node

2. Linux & SLURM Basics

2.1 Essential Linux commands

pwd                  # print current directory
ls                   # list files
ls -lh               # list with sizes and permissions
cd /path/to/dir      # change directory
mkdir my_folder      # create a directory
cp file1 file2       # copy a file
mv file1 file2       # move or rename a file
cat file.txt         # print file contents
head -20 file.txt    # show first 20 lines
rm file.txt          # delete a file

2.2 Managing sessions with screen

Use screen to keep your work running after you disconnect from SSH.

# Start a new named session
screen -S workshop

# Detach (leaves session running in background)
Ctrl + A, then D

# List all sessions
screen -ls

# Reattach to a session
screen -r workshop

2.3 Understanding SLURM

The SAIAB cluster uses SLURM to schedule compute jobs. You submit a job to a queue; SLURM runs it when resources are free. Jobs are prioritised by wait time and fair-share usage. The only partition on SAIAB is agrp. If your job exceeds its requested time it will be killed automatically.

2.4 SLURM commands

Command Description
squeue -u username View your queued and running jobs
sbatch script.sh Submit a job script to the scheduler
scancel 1133 Cancel job with ID 1133
srun --cpus-per-task=2 -t 0-6:30 --mem 10G --pty /bin/bash Start an interactive session (2 CPUs, 10 GB RAM, 6h30m)

2.5 Job script example

Job parameters are set as #SBATCH lines at the top of a bash script.

#!/bin/bash
#SBATCH --partition=agrp          # queue to use
#SBATCH --time=14-00:00:00        # max runtime (days-hours:minutes:seconds)
#SBATCH --cpus-per-task=2         # CPUs per task
#SBATCH --mem=100G                 # RAM per node
#SBATCH --error=job.%J.err        # stderr log (%J = job ID)
#SBATCH --output=job.%J.out       # stdout log
#SBATCH --job-name=my_job         # name shown in squeue

# Your commands below:
bash master.sh /path/to/fastq/file
# Submit the script:
sbatch script.sh

2.6 Setting up your environment

# 1. Start a screen session so work survives disconnection
screen -S workshop

# 2. Request an interactive compute node via SLURM
srun --cpus-per-task=2 -t 0-6:30 --mem 10G --pty /bin/bash

# 3. Activate the software environment
source /opt/miniconda3/bin/activate biotools

# 4. Confirm your job is running
squeue -u $USER

Once the biotools environment is active, check that every tool used in this workshop is available. Each line below should print a path; if any prints nothing, tell an instructor before you start.

# 5. Verify the workshop tools are installed
for tool in NanoFilt NanoPlot kmc flye minimap2 samtools racon busco quast nextflow; do
  printf '%-10s ' "$tool"; command -v "$tool" || echo "MISSING";
done

BlobTools (used for the snail plot near the end) lives in a separate btk environment. Check it like this — then switch back to biotools to continue:

# 6. Verify BlobTools in the btk environment
source /opt/miniconda3/bin/activate btk
command -v blobtools || echo "blobtools MISSING"
source /opt/miniconda3/bin/activate biotools

3. Pipeline overview and download

3.1 Overview of pipeline

The genome assembly pipeline is designed to process raw sequencing data into a polished, high-quality draft genome. Below, we outline the pipeline’s steps, their importance, and benchmarking metrics for evaluating their success.

Figure 1: 1KSA Genome Assembly Workflow
Figure 1. 1KSA Genome Assembly Workflow showing QC, assembly (HiFiASM/Flye), mapping, polishing (Racon), and evaluation (QUAST/BUSCO) stages.

In this course we will run the 1KSA workflow step by step on the command line and in the end we will show how one can run it automatically as a pipeline using Nextflow.

Nextflow is a workflow management system designed specifically for computational pipelines, especially popular in bioinformatics and data science.

3.2 Clone the pipeline repository with the necessary scripts

# Navigate to your home folder
cd ~
# Copy the pipeline files to your directory
cp -r /opt/courses/ont_assembly_course/ .
# Navigate inside the folder of the pipeline:
cd ont_assembly_course
# List what you have downloaded
ls -1 
01_qc_and_trimming.nf
03_mapping.nf
04_polishing.nf
05_eval_final.nf
data
kmer-Analysis.sh
kmerPlot.R
master.sh
params.config
README.md
run_analysis.sh
run_flye.sh

3.3 Dependencies:

# Download the BUSCO lineage folder
busco --download bacteria_odb10
# Make sure the download was completed:
ls ./busco_downloads/lineages/bacteria_odb10  

It should return the following:

ancestral  ancestral_variants  dataset.cfg  hmms  info  lengths_cutoff  links_to_ODB10.txt  refseq_db.faa.gz  refseq_db.faa.gz.md5  scores_cutoff

BUSCO uses a set of evolutionarily-informed expectations of gene content of near-universal single-copy orthologs to assess the completeness of genome assemblies, gene sets, and transcriptomes.

Running the pipeline

Data preparation

# Change to your working directory and the data folder
cd ont_assembly_course/Genome-Assembly-Pipeline-Nextflow/data
# Unzip files 			
gunzip -d escherichia_coli_data.fastq.gz
# List files in folder
ls -1 
escherichia_coli_data.fastq
# Move up to previous directory
cd ..

K-mer Analysis

The primary goal of k-mer analysis is to evaluate the distribution of sequence fragments (k-mers) in raw sequencing data (Manekar and Sathe, 2018). This analysis provides insights into the quality of sequencing data, genome size estimation, sequencing coverage, and the complexity of the genome. It is a critical step in optimizing genome assembly processes, especially for de novo genome assemblies without reference genomes.

Genome size estimation: K-mer analysis provides an approximate estimate of the genome size based on k-mer frequencies, particularly useful for organisms without reference genomes.

Coverage assessment: Helps assess the sequencing depth and whether there is sufficient coverage for a reliable genome assembly.

Error detection: Identifies sequencing errors through over-represented or under-represented k-mers, which may indicate biases or library preparation issues.

Data complexity: Evaluates the complexity of the genome, including repetitive regions or unusual GC content, which impacts the assembly process.

K-mer distribution: A histogram representing k-mer frequencies. A well-defined peak indicates consistent coverage, while the absence of a peak may suggest insufficient coverage or sequencing issues.

Peak location: The position of the peak on the histogram reflects the estimated genome size. A high, sharp peak typically indicates accurate sequencing coverage.

K-mer depth (coverage): The height of the peak represents the sequencing depth. Adequate coverage (e.g., 50x–100x) ensures reliable assembly.

K-mer error rate: An excessive number of low-frequency k-mers can indicate sequencing errors or poor library preparation.

GC content: Confirms whether the genome’s GC content aligns with expected values, identifying potential sequencing biases.

K-mer size and cutoff values: Appropriate k-mer size and frequency cutoffs help filter out sequencing noise and improve assembly accuracy.

Two scripts are used for this step, which is included in the Genome-Assembly-Pipeline-Nextflow repository:

This first script loads the necessary modules, calculates k-mer frequencies from a FASTQ file using KMC, generates a k-mer histogram, and runs an R script for k-mer frequency analysis on raw sequencing reads to estimate genome size, coverage, and ploidy.

Output files:

Example of the kmer21_k_mers.png:

Figure 3: K-mer frequency histogram
Figure 3. K-mer frequency histogram (k=21) showing estimated haploid genome length of 36.61 Mb and coverage peak at ~97x.

Key metrics from the plot

Estimated Haploid Length: 36.61 Mb

Expected Assembly Length: 36.61 Mb

Understanding the Plot

X-axis (Coverage):

Y-axis (Frequency):

Peaks in the Histogram

Left Peak (Coverage ~1–30):

Main Peak (Coverage ~97):

Right Tail (Coverage >150–250):

Interpretation and Quality Insights

A sharp, well-defined main peak at coverage 97 suggests:

Genome Size Estimation Accuracy:

In summary, k-mer analysis acts as a quality checkpoint before genome assembly, ensuring that the sequencing data is sufficient, accurate, and free from major biases or errors. By interpreting the histogram and key metrics, researchers can optimize their assembly pipelines and improve the final genome reconstruction.

# Resume screen session (if you detached previously)
screen -r workshop
# Navigate to the Genome-Assembly-Pipeline-Nextflow working directory:
cd ont_assembly_course/Genome-Assembly-Pipeline-Nextflow                
# Run kmer analysis script
bash kmer-Analysis.sh data/escherichia_coli_data.fastq escherichia_coli
# Record genome coverage and size
less k_mers_Stats_escherichia_coli.txt

escherichia_coli: k-mer=17

Total input bases 176565874

Peaks or Plateaus detected=1

Estimated 2n at?n

2n Genome Length=0.0049 Gb at 36 X Coverage

Expected Assembly Length if fully collapsed=0.0098 Gb at 18 X Coverage

Figure 4: K-mer histogram for E. coli course dataset
Figure 4. K-mer frequency histogram (k=17) for the Escherichia coli course dataset showing 18x coverage and estimated genome length of 4.9 Mb.

You will use the above output as input parameters when running the next steps.

# You can view the kmer plots in either png or .pdf format by opening the files in the SSH browser of Mobaxterm.

Assessing Read Quality

Purpose:

To assess the quality of sequencing reads before any processing begins. High-quality raw reads are critical for successful downstream assembly and analysis.

Why It’s Done:

Important Metrics:

# Resume screen session
screen -r workshop
# Run NanoPlot on your fastq file:
NanoPlot -t 2 --fastq ./data/escherichia_coli_data.fastq --tsv_stats -o NanoPlot_CHECK_1 --no_static
# View the output file:
less ./NanoPlot_CHECK_1/NanoStats.txt

Metrics dataset

number_of_reads 21452

number_of_bases 176565874.0

median_read_length 5178.5

mean_read_length 8230.7

read_length_stdev 9134.9

n50 15335.0

mean_qual 10.8

median_qual 11.8

longest_read_(with_Q):1 81743 (9.5)

longest_read_(with_Q):2 81102 (13.9)

longest_read_(with_Q):3 79307 (9.6)

longest_read_(with_Q):4 78113 (13.0)

longest_read_(with_Q):5 77919 (12.7)

highest_Q_read_(with_length):1 26.7 (37)

highest_Q_read_(with_length):2 25.6 (47)

highest_Q_read_(with_length):3 23.2 (91)

highest_Q_read_(with_length):4 22.9 (103)

highest_Q_read_(with_length):5 22.9 (49)

Reads >Q10: 15584 (72.6%) 133.0Mb

Reads >Q15: 2276 (10.6%) 17.7Mb

Reads >Q20: 27 (0.1%) 0.0Mb

Reads >Q25: 2 (0.0%) 0.0Mb

Reads >Q30: 0 (0.0%) 0.0Mb

To remove low-quality bases and adaptors from sequencing reads, ensuring that only high-confidence data is used for assembly.

Why It’s Done:

Important Metrics:

# Resume screen session
screen -r workshop
# Run NanoFilt on your fastq file:
mkdir trimmed_fastq
NanoFilt -q 10 ./data/escherichia_coli_data.fastq > trimmed_fastq/escherichia_coli_data_trimmed.fastq

Purpose:

To confirm the quality improvement of reads after trimming.

Why It’s Done:

Important Metrics:

# Run NanoPlot on your trimmed fastq file:
NanoPlot -t 1 --fastq ./trimmed_fastq/escherichia_coli_data_trimmed.fastq --tsv_stats -o NanoPlot_CHECK_2
# View the output file:
less ./NanoPlot_CHECK_2/NanoStats.txt

Metrics dataset

number_of_reads 15584

number_of_bases 132950703.0

median_read_length 5443.0

mean_read_length 8531.2

read_length_stdev 9289.3

n50 15618.0

mean_qual 12.6

median_qual 12.8

longest_read_(with_Q):1 81102 (13.9)

longest_read_(with_Q):2 78113 (13.0)

longest_read_(with_Q):3 77919 (12.7)

longest_read_(with_Q):4 76635 (15.4)

longest_read_(with_Q):5 72781 (12.7)

highest_Q_read_(with_length):1 26.7 (37)

highest_Q_read_(with_length):2 25.6 (47)

highest_Q_read_(with_length):3 23.2 (91)

highest_Q_read_(with_length):4 22.9 (103)

highest_Q_read_(with_length):5 22.9 (49)

Reads >Q10: 15584 (100.0%) 133.0Mb

Reads >Q15: 2276 (14.6%) 17.7Mb

Reads >Q20: 27 (0.2%) 0.0Mb

Reads >Q25: 2 (0.0%) 0.0Mb

Reads >Q30: 0 (0.0%) 0.0Mb

Genome Assembly

De novo genome assembly reconstructs a genome from raw sequencing reads without using a reference genome. In order to do that, overlapping DNA sequences are used to build longer DNA sequences (contigs) and these contigs are arranged into a complete genome with the help of various statistical methods and algorithms.

Key Terms

To generate a draft genome by assembling sequencing reads into contiguous sequences (contigs).

# Resume screen session
screen -r workshop
# Ensure that the ont conda environment has been activated
source /opt/miniconda3/bin/activate biotools 
# Run Flye on the trimmed fastq file
# Value for --asm-coverage and estimated genome size –g is obtained from kmer analysis
flye --nano-raw trimmed_fastq/escherichia_coli_data_trimmed.fastq -o assembly --asm-coverage 18 -g 0.01g -t 2
# View the output file:
tail ./assembly/flye.log

Example INFO: Assembly statistics:

Total length: 5101716

Fragments: 6

Fragments N50: 4862946

Largest frg: 4862946

Scaffolds: 0

Mean coverage: 24

Mapping and Polishing

Genome assembly produces a draft — a best-guess reconstruction from overlapping reads. This draft contains errors: insertions, deletions, and substitutions introduced by the sequencer. Mapping and polishing correct these errors using the original reads as evidence.

Why map reads back to the assembly?

After assembly, the raw reads are aligned back to the draft genome. This shows exactly where reads agree with the assembly and where they disagree. Regions with low coverage or many mismatches flag potential misassemblies. The alignment file (SAM) is also the input for polishing.

We use minimap2 for mapping. It is fast, handles long noisy ONT reads, and outputs a SAM file.

# Map the reads using minimap2
minimap2 -ax map-ont -t 2 ./assembly/assembly.fasta ./trimmed_fastq/escherichia_coli_data_trimmed.fastq > escherichia_coli.sam

Why polish?

ONT reads have a raw base-calling error rate of ~5-10%. Even after assembly, the consensus sequence retains some of these errors. Polishing uses the read alignments to vote on the correct base at each position, replacing draft errors with the most supported base call.

We use Racon for polishing. It is lightweight, works directly with minimap2 SAM output, and is well-suited to ONT data.

# Polish the assembly with Racon
mkdir Racon_results
racon -m 3 -x -5 -g -4 -w 500 -t 2 ./trimmed_fastq/escherichia_coli_data_trimmed.fastq escherichia_coli.sam assembly/assembly.fasta > ./Racon_results/Racon_polished.fasta

Genome Assembly Evaluation

Assembly evaluation answers the question: is this genome good enough to use? Raw assembly and polishing steps can introduce errors, leave gaps, or produce fragmented contigs. Evaluation catches these problems before downstream analysis (annotation, comparative genomics, publication).

Two complementary tools are used:

Run both after polishing. QUAST flags fragmentation; BUSCO flags missing gene content. Together they give a full picture of assembly quality.

# Run BUSCO on the polished assembly:
busco -m genome -i ./Racon_results/Racon_polished.fasta -o Busco_outputs -l bacteria_odb10 --download_path ./busco_downloads --metaeuk_parameters METAEUK_PARAMETERS --offline
# View the output file:
less ./Busco_outputs/short_summary.specific.bacteria_odb10.Busco_outputs.txt

# BUSCO version is: 6.0.0

# The lineage dataset is: bacteria_odb10 (Creation date: 2024-01-08, number of genomes: 4085, number of BUSCOs: 124)

# Summarized benchmarking in BUSCO notation for file /home/evilliers/work/ont_assembly_course/Genome-Assembly-Pipeline-Nextflow/Racon_results/Racon_polished.fasta

# BUSCO was run in mode: prok_genome_prod

# Gene predictor used: prodigal

***** Results: *****

C:99.2%[S:99.2%,D:0.0%],F:0.8%,M:0.0%,n:124

123 Complete BUSCOs (C)

123 Complete and single-copy BUSCOs (S)

0 Complete and duplicated BUSCOs (D)

1 Fragmented BUSCOs (F)

0 Missing BUSCOs (M)

124 Total BUSCO groups searched

Assembly Statistics:

6 Number of scaffolds

6 Number of contigs

5103316 Total length

0.000% Percent gaps

4 Mbp Scaffold N50

4 Mbp Contigs N50

Dependencies and versions:

hmmsearch: 3.4

bbtools: None

prodigal: 2.6.3

To validate the final genome assembly’s quality and structural accuracy.

QUAST (Quality Assessment Tool for Genome Assemblies) evaluates assembly quality without a reference genome. It reports key metrics: number of contigs, total assembly length, N50 (the length at which 50% of the assembly is contained in contigs of that size or longer), and largest contig. A good assembly has few contigs, high N50, and total length close to the expected genome size.

# Run QUAST on the polished assembly file:
quast.py -t 2 -o Quast_output --gene-finding ./Racon_results/Racon_polished.fasta --fragmented
    
# View the output file:
less ./Quast_output/report.txt

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly escherichia_coli_Racon_polished

# contigs (>= 0 bp) 6

# contigs (>= 1000 bp) 6

# contigs (>= 5000 bp) 6

# contigs (>= 10000 bp) 4

# contigs (>= 25000 bp) 3

# contigs (>= 50000 bp) 3

Total length (>= 0 bp) 5103316

Total length (>= 1000 bp) 5103316

Total length (>= 5000 bp) 5103316

Total length (>= 10000 bp) 5084379

Total length (>= 25000 bp) 5067109

Total length (>= 50000 bp) 5067109

# contigs 6

Largest contig 4864756

Total length 5103316

GC (%) 50.83

N50 4864756

N90 4864756

auN 4641540.1

L50 1

L90 1

# N's per 100 kbp 0.00

Run the entire pipeline in one script

The 1KSA pipeline to perform de novo genome assembly with Oxford Nanopore long-read sequencing data has been implemented in Nextflow, a workflow management system designed specifically for computational pipelines, especially popular in bioinformatics and data science.

# Open and edit the params.config file 
nano params.config

Change the following path and values:

# File: params.config
species_name='1ksa_test'    # Change species_name
assembler='flye'            # Options: 'flye' or 'hifiasm'
# Shared parameters
threads=2
LINEAGE='bacteria_odb10'      #Mandatory BUSCO lineage (Other options: insecta_odb10, viridiplantae_odb10)
# Flye-specific parameters (mandatory)
genome_size='0.0098g'   # Mandatory genome size - adjust to the values obtained from the k-mer analysis
flye_coverage=18      # Mandatory coverage - adjust to the values obtained from the k-mer analysis
flye_read_type='nano-raw'      # Default read type (can change to nano-hq for Q15 reads)
# Save the script: 
^X  # Exit  
y   # Confirm Save  
Enter  

The script master.sh is calling the different nextflow scripts for the analysis:

01_qc_and_trimming.nf

03_mapping.nf

04_polishing.nf

05_eval_final.nf

# Create a SLURM job script:
nano name_of_your_script_1.sh                          #change the name of the script
# Copy and paste the following
#!/bin/bash
#SBATCH --partition=agrp
#SBATCH --time=12:00:00
#SBATCH -N 1 # Nodes
#SBATCH -n 1 # Tasks
#SBATCH --cpus-per-task=2
#SBATCH --mem 100G
#SBATCH --error=/home/username/flye_stderr.%J.txt
#SBATCH --output=/home/username/flye_stdout.%J.txt
#SBATCH --mail-user=username@gmail.com
#SBATCH --mail-type=ALL
#SBATCH --job-name=1ksa_test_flye
cd /path/to/folder/with/species/fastq/files/Genome-Assembly-Pipeline-Nextflow  #change path
bash master.sh ./data/escherichia_coli_data.fastq
## Save and Exit
^X  # Exit  
y   # Confirm Save  
Enter  
## Submit SLURM job script  
sbatch name_of_your_script_1.sh

Results from the bash script is saved to the folder ./results. These results can be summarized with the following script.

Evaluating assembly quality and generating a report

# Create a script: 
nano generate_report.sh
# Copy and paste the following 
#!/bin/bash
set -euo pipefail
# === USER INPUT ===
species_name="species_name"             # Change species name (same as the species_name in params.config)
cd /path/to/results                     # Change to actual results folder
################################################################################
# PART 1 — SAMTOOLS PROCESSING
################################################################################
echo "=== PART 1: SAMTOOLS ==="
if [[ ! -f ./sam_file/${species_name}.sam ]]; then
    echo "❌ ${species_name}.sam not found in ./sam_file/"
    exit 1
fi
# Sort SAM → BAM
if [[ ! -f ./sam_file/${species_name}_sorted.bam ]]; then
    echo "▶️ Sorting SAM..."
    samtools view -bS ./sam_file/${species_name}.sam | \
        samtools sort -o ./sam_file/${species_name}_sorted.bam
else
    echo "✅ Sorted BAM already exists"
fi
# Coverage
if [[ ! -f ./sam_file/minimap2_coverage.txt ]]; then
    echo "▶️ Calculating depth coverage..."
    samtools depth ./sam_file/${species_name}_sorted.bam |
        awk '{sum+=$3} END { print "Average = ",sum/NR}' \
        > ./sam_file/minimap2_coverage.txt
else
    echo "✅ Coverage already calculated"
fi
# Stats
if [[ ! -f ./sam_file/sam_stats.txt ]]; then
    echo "▶️ Generating SAM stats..."
    samtools stats ./sam_file/${species_name}_sorted.bam |
        grep ^SN | cut -f 2- > ./sam_file/sam_stats.txt
else
    echo "✅ SAM stats already exist"
fi
################################################################################
# PART 2 — Detect Assembler + Handle Coverage
################################################################################
echo "=== PART 2: Assembler Detection ==="
assembler=""
if [[ -d ./Flye_results ]]; then
    assembler="flye"
elif [[ -d ./Hifiasm_results ]]; then
    assembler="hifiasm"
fi
if [[ "$assembler" == "flye" ]]; then
    echo "🔎 Flye assembly detected — extracting mean coverage..."
    OUTPUT_FILE="${species_name}_flye_mean_coverage.txt"
    > "$OUTPUT_FILE"
    for logfile in $(find ./Flye_results -type f -name "flye.log"); do
        mean_cov=$(grep "Mean coverage:" "$logfile" | awk '{print $3}')
        [[ -n "$mean_cov" ]] && echo "$logfile: $mean_cov" >> "$OUTPUT_FILE"
    done
elif [[ "$assembler" == "hifiasm" ]]; then
    echo "🔎 Hifiasm assembly detected — no flye.log coverage to extract."
else
    echo "⚠️ No assembler results directory found."
fi
################################################################################
# PART 3 — Collect Key Outputs
################################################################################
echo "=== PART 3: Collect Outputs ==="
mkdir -p "${species_name}" "${species_name}_other_results_outputs"
# NanoPlot results
mv ./nanoplot_before_trim/NanoPlot_CHECK_1/NanoStats.txt "${species_name}_NanoStats_before_trim.txt" 2>/dev/null || true
mv ./nanoplot_before_trim/NanoPlot_CHECK_1/NanoPlot-report.html "${species_name}_NanoPlot_before_trim.html" 2>/dev/null || true
mv ./nanoplot_after_trim/NanoPlot_CHECK_2/NanoStats.txt "${species_name}_NanoStats_after_trim.txt" 2>/dev/null || true
mv ./nanoplot_after_trim/NanoPlot_CHECK_2/NanoPlot-report.html "${species_name}_NanoPlot_after_trim.html" 2>/dev/null || true
# Trimmed FASTQ
mv ./trimmed_fastq/${species_name}.trimmed.fastq ./"${species_name}" 2>/dev/null || true
# Assemblies
if [[ "$assembler" == "flye" ]]; then
    mv ./Flye_results/${species_name}_assembly.fasta ./"${species_name}/${species_name}_flye_assembly.fasta" 2>/dev/null || true
    mv ./Racon_results/${species_name}_Racon_polished.fasta ./"${species_name}/${species_name}_racon_polished.fasta" 2>/dev/null || true
elif [[ "$assembler" == "hifiasm" ]]; then
    mv ./Hifiasm_results/${species_name}_ctg.fasta ./"${species_name}/${species_name}_hifiasm_assembly.fasta" 2>/dev/null || true
fi
# BUSCO + QUAST
mv ./Busco_results/Busco_output/short_summary.specific.*.txt "${species_name}_busco_summary.txt" 2>/dev/null || true
mv ./quast_report/Quast_result/report.txt "${species_name}_quast_report.txt" 2>/dev/null || true
# Coverage + stats
mv ./sam_file/minimap2_coverage.txt "${species_name}_minimap2_coverage.txt"
mv ./sam_file/sam_stats.txt "${species_name}_sam_stats.txt"
################################################################################
# PART 4 — Compile Final Report
################################################################################
echo "=== PART 4: Final Report ==="
ordered_files=(
    "${species_name}_kmer_total_number_bases.txt"
    "${species_name}_kmer_cov_size.txt"
    "${species_name}_NanoStats_before_trim.txt"
    "${species_name}_NanoStats_after_trim.txt"
    "${species_name}_minimap2_coverage.txt"
    "${species_name}_sam_stats.txt"
    "${species_name}_quast_report.txt"
    "${species_name}_busco_summary.txt"
)
[[ -f "${species_name}_flye_mean_coverage.txt" ]] && ordered_files+=("${species_name}_flye_mean_coverage.txt")
report="${species_name}_report.txt"
{
    echo "${species_name^} Assembly Report for RedCap"
    echo "Generated on: $(date)"
    echo -e "\n========================================\n"
    for file in "${ordered_files[@]}"; do
        if [[ -f "$file" ]]; then
            echo "========== $file =========="
            cat "$file"
            echo -e "\n\n"
        fi
    done
} > "$report"
echo "✅ Report generated: $report"
################################################################################
# PART 5 — Organize Output Folders
################################################################################
echo "=== PART 5: Organizing Outputs ==="
mv *_NanoStats_before_trim.txt *_NanoStats_after_trim.txt *_NanoPlot_*.html "${species_name}" 2>/dev/null || true
mv "${species_name}"_busco_summary.txt "${species_name}" 2>/dev/null || true
mv "${species_name}"_quast_report.txt "${species_name}" 2>/dev/null || true
mv "${species_name}_minimap2_coverage.txt" "${species_name}_other_results_outputs" 2>/dev/null || true
mv "${species_name}_sam_stats.txt" "${species_name}_other_results_outputs" 2>/dev/null || true
mv "${species_name}_flye_mean_coverage.txt" "${species_name}_other_results_outputs" 2>/dev/null || true
mv nanoplot_before_trim nanoplot_after_trim quast_report Busco_results trimmed_fastq sam_file \
   Flye_results Racon_results Hifiasm_results "${species_name}_other_results_outputs" 2>/dev/null || true
echo "🎉 All done!"

Generate snail plots

BlobTools is a bioinformatics software package designed for visualizing and analyzing genome assemblies, particularly for quality control and contamination detection. The following commands will generate a visual summary of the assembly.

BlobTools lives in its own btk environment, so activate it before running the blobtools commands below. The samtools step still uses biotools.

# Convert sam file to bam file format and sort and index (biotools)
samtools view -@ 2 -b escherichia_coli.sam| samtools sort -@ 2 -o escherichia_coli_sorted.bam && samtools index -c escherichia_coli_sorted.bam
# Switch to the BlobTools environment
source /opt/miniconda3/bin/activate btk

# Create blobtools dataset from assembly
blobtools create --fasta Racon_results/Racon_polished.fasta escherichia_coli
# Add read coverage plot 
blobtools add --cov escherichia_coli_sorted.bam --taxdump /data/db/taxdump/ escherichia_coli
# Add busco results
blobtools add --busco Busco_outputs/run_bacteria_odb10/full_table.tsv escherichia_coli
# Run blobtools view to generate a snail plot
blobtools view --plot --view snail escherichia_coli

# Switch back to biotools for the remaining steps
source /opt/miniconda3/bin/activate biotools

Download the escherichia_coli.snail.png file to your local device or open it up using the MobaXterm file viewer.

Figure 2: Blobtools snail plot
Figure 2. Blobtools snail plot showing genome assembly statistics and quality metrics for the Escherichia coli dataset.

Assembly Statistics (Concentric Rings)

The data set used is a smaller version of a larger genome project to enable participants to generate an assembly during the course duration.

Scaffold Statistics

Total scaffolds: 6 (very low - good!)

Total length: 5.1 Mb (typical for E. coli ~4.6-5.5 Mb)

Longest scaffold: 4.86 Mb (represents ~95% of the genome)

N50 = N90 = 4.86 Mb (extremely high contiguity!)

BUSCO Results: Outstanding

Complete BUSCOs: 99.2% (123/124 genes) ✅

Duplicated: 0% (no contamination) ✅

Fragmented: 0.81% (1 gene partially found)

Missing: 0.81% (1 gene not found)

Key Insights

Highly contiguous assembly: One major contig (~4.86 Mb) contains almost the entire genome

Minimal fragmentation: Only 6 scaffolds total

No contamination: 0% duplicated BUSCOs

Proper GC content: 50.8% GC is typical for E. coli

Clean assembly: 0% Ns (no gaps)

Visual Elements

Large orange area: Main chromosome (4.86 Mb)

Small red wedge: Longest scaffold marker

Blue rings: Smaller scaffolds/contigs

Minimal fragmentation: Very few small pieces

The genome assembly pipeline is a meticulous process that transforms raw sequencing data into a polished draft genome. Each step is carefully designed to ensure the final assembly is accurate, complete, and ready for biological interpretation.

Troubleshooting common issues

1. “No Peaks or Plateaus Detected” in K-mer Analysis

If the k-mer histogram does not show a clear peak or plateau:

Possible Causes:

Solution:

2. Diagnosing poor assembly results (low coverage, fragmentation)

Common symptoms include low N50, fragmentation, unexpected assembly size, or low genome completeness.

Possible Causes:

Solution:

Figure 3: Recommended Nanopore assembly strategies based on read length N50.

3. Addressing issues with read quality and contamination

Poor-quality reads or contamination can skew k-mer distributions and assembly.

Indicators: Excess low-frequency k-mers, unexpected peaks, or unusually high GC content.

Solution:

4. Handling computational errors and resource limitations

K-mer counting and assembly are resource-intensive, especially with large genomes.

Common Errors: Job failures, memory limits exceeded, or storage issues.

Solution:

5. Low BUSCO completeness: improving assembly quality

Certain genomes, such as the Western Leopard Toad, can be highly complex due to repetitive regions, high heterozygosity, or structural variation. Despite using Oxford Nanopore (ONT) long-read sequencing, assemblies may still show low BUSCO completeness due to fragmentation or missing genes.

1. If the assembly is fragmented (many short contigs)

Cause: Even with ONT long reads, assembly fragmentation can occur if reads are too short, error rates are high, or repeats are unresolved.

Solution:

Increase read length (N50) by optimizing library prep (e.g., high-molecular-weight DNA extraction, size selection).

Use higher-coverage data (e.g., ≥30x) to improve consensus accuracy and contiguity.

Experiment with alternative assemblers (e.g., Hifiasm, Flye, NextDenovo) to handle complex regions better.

Result: Longer, more contiguous scaffolds improve BUSCO scores by recovering full-length genes.

2. If the assembly is smaller than expected genome size

Cause: Insufficient coverage or filtering out of heterozygous/duplicated regions can result in missing genomic content.

Solution:

Increase sequencing depth to ≥20x–30x to ensure full genome representation.

Evaluate assembly parameters, some assemblers may collapse repeats or filter heterozygous regions, leading to underrepresentation.

Check for excessive read filtering/trimming, retaining more reads (even slightly lower quality) can help recover missing sequences.

Result: A more complete genome reconstruction with improved gene recovery.

3. Prioritizing longer Sequencing N50 vs. higher coverage of shorter reads

Tradeoff:

Longer N50 sequences improve contiguity and can resolve complex structures.

Higher coverage with shorter N50 reads can improve consensus accuracy and fill in missing regions.

Recommended Approach:

Aim for a balance, longer reads (>50 kb where possible) with sufficient coverage (≥20x) for complex genomes. Consider sequencing with two flow cells, using fragmented and non-fragmented DNA respectively.

If coverage is limited, focus on maximizing read length rather than accumulating more short reads.

Consider additional polishing (e.g., Medaka, Racon) to refine base accuracy post-assembly.

Result: Improved BUSCO completeness through better-contiguous and high-accuracy assembly.

Appendices

Genome assembly pipeline parameters

Tool Parameter Description Parameter Flag Value/Notes
NanoFilt Quality -q 10
Minimum read length -l 1000
Flye ONT regular reads (Q>10) --nano-raw
ONT high-quality reads (Q>15) --nano-hq
Threads -t 15
Reduced coverage --asm-coverage (k-mer analysis) e.g. 36
Genome size -g (k-mer analysis) e.g. 2.4g
Racon Match -m 3
Mismatch -x -5
Gap -g -4
Window length -w 500
Threads -t 15
BUSCO Database -l eukaryota_odb10
Mode -m genome
Metaeuk gene predictor --metaeuk_parameters METAEUK_PARAMETERS --offline

Glossary

A

B

C

D

E

F

G

H

I

K

L

M

N

P

Q

R

S

T

U

W

Kmer-Analysis Interpretation continued

Two Peaks in K-mer Analysis

When performing k-mer analysis on a diploid organism, two peaks in the k-mer histogram can indicate the presence of homozygous and heterozygous regions, which is a key feature of diploid genomes. Here's a breakdown of what each peak could represent:

Interpretation of Two Peaks:

First Peak (Lower coverage, usually at coverage ~1–30):

This is the heterozygous portion of the genome, where you have one allele from each parent, resulting in slightly different k-mers due to genetic variation.

In diploid organisms, there will be some regions that are heterozygous (i.e., different alleles from each parent). These regions might have lower k-mer counts because the sequences from each allele may be different and thus contribute to a wider variety of k-mers.

Second Peak (Higher coverage, typically at coverage ~97):

This represents the homozygous regions of the genome, where both alleles are identical. These regions will have higher k-mer counts because both copies of the sequence from the two chromosomes are identical, leading to more frequent occurrence of the same k-mer.

Why Two Peaks Occur in Diploid Organisms:

In a diploid organism, the genome consists of two sets of chromosomes, one from each parent. In some regions, both copies of the chromosome may be identical (homozygous), leading to high k-mer coverage for those regions.

In other regions, there might be differences between the two parental copies (heterozygous), leading to the lower coverage peak because k-mers representing the two variants from each allele may not match as frequently.

General Significance of Two Peaks:

Diploidy Detection: The presence of two peaks indicates that the organism is likely diploid, as the two distinct peaks reflect the different copies (alleles) of chromosomes inherited from the two parents.

Coverage Information:

The left peak (lower coverage) often reflects heterozygous regions, while the right peak (higher coverage) corresponds to homozygous regions.

The ratio of heights and positions of these peaks can give further insights into the degree of heterozygosity and the overall complexity of the genome.

Troubleshooting/Considerations:

Genome Size Estimation: In diploid organisms, you’ll typically need to estimate the haploid genome size from the k-mer distribution. The two peaks help in understanding the genome's structure and heterozygosity.

Quality Check: If the peaks are well-defined and clearly separated, it indicates high-quality data, suggesting good sequencing depth and minimal sequencing errors.

In Summary:

Two peaks in k-mer analysis of a diploid organism can signify homozygous and heterozygous regions. The higher peak represents the homozygous regions, while the lower peak represents heterozygous regions. This pattern is normal and indicates a diploid organism. A clear distinction between the peaks typically suggests good sequencing data quality and sufficient coverage for assembly.

Three Peaks in K-mer Analysis

If your k-mer analysis shows three peaks in the histogram, it can be an indicator of a polyploid organism (i.e., an organism with more than two sets of chromosomes). The three peaks may represent the distinct k-mer coverage corresponding to homozygous, heterozygous, and possibly additional levels of ploidy.

Interpretation of Three Peaks:

First Peak (Lower coverage, typically at coverage ~1–30):

This peak likely represents heterozygous regions of the genome, where there are differences between the multiple copies of chromosomes. Since polyploid organisms have more than two copies of each chromosome, these regions will have variation between the copies, leading to lower k-mer counts due to differences in sequence.

Second Peak (Intermediate coverage, typically at coverage ~50–100):

This peak corresponds to homozygous regions where the copies of chromosomes are identical. In a polyploid organism, there will be regions where all copies of the chromosome are the same, leading to higher k-mer counts because multiple identical copies are contributing to the same k-mer.

Third Peak (Higher coverage, typically at coverage >100):

This peak may correspond to additional copies of sequences in polyploid organisms. The higher coverage is indicative of repeated or additional genomic copies present in the organism's genome. This peak could be seen in highly repetitive or duplicated regions, which are common in polyploid species.

Why Three Peaks Occur in Polyploid Organisms:

In polyploid organisms (e.g., tetraploid organisms with 4 copies of each chromosome), you'll observe more complex k-mer distributions. Here's why:

Homozygous regions will appear with higher k-mer counts, as multiple identical copies of the sequence align.

Heterozygous regions will show lower k-mer counts because the variants (alleles) at those loci do not align perfectly, resulting in fewer matching k-mers.

The third peak often arises due to additional genome copies. In polyploid organisms, there are more than two copies of each chromosome, leading to a third peak at higher coverage.

General Significance of Three Peaks:

Polyploidy Detection: The presence of three peaks indicates that the organism might be polyploid (such as tetraploid or hexaploid). Polyploid organisms can have different numbers of chromosome sets (e.g., 4 sets for tetraploid, 6 sets for hexaploid), leading to more complex k-mer distributions.

Genome Complexity: The three peaks provide insights into the complexity of the organism’s genome, such as repetitive sequences and the number of chromosome copies.

Coverage Evaluation: The three peaks also help assess if you have sufficient coverage for each of the different genomic layers (e.g., heterozygous, homozygous, and additional copies).

Quality Check:

If the three peaks are well-defined and separated, it suggests that the sequencing data is of good quality, with accurate estimation of genome size and coverage.

A lack of clear separation between the peaks, or overlapping peaks, could indicate sequencing errors or insufficient data quality.

In Summary:

Three peaks in k-mer analysis often indicate polyploidy and correspond to different genomic regions (homozygous, heterozygous, and additional copies in polyploid organisms).

The first peak represents heterozygous regions, the second peak corresponds to homozygous regions, and the third peak could reflect the additional genome copies or repetitive sequences found in polyploid organisms.

Cheat Sheet

Download / Open Cheat Sheet (PDF)