August 2025

Introduction to Genome Assembly with Oxford Nanopore Sequencing

Welcome to the Workshop

Genome assembly represents one of the most fundamental and transformative processes in modern genomics, serving as the foundation for virtually all downstream genomic analyses. In this workshop, you will learn the essential principles and practical techniques for assembling genomes using Oxford Nanopore Technology (ONT) sequencing data, gaining hands-on experience with one of the most revolutionary sequencing platforms in contemporary biology.

The Critical Importance of Draft Genome Assemblies

Foundation of Genomic Understanding

Draft genome assemblies serve as the cornerstone of modern biological research, providing the essential framework for understanding life at its most fundamental level. A genome assembly is essentially a computational reconstruction of an organism's complete DNA sequence, assembled from millions of overlapping sequence fragments. While "draft" assemblies may not achieve the perfection of finished reference genomes, they provide invaluable insights that drive scientific discovery across multiple disciplines.

The importance of draft genomes extends far beyond academic curiosity. In medicine, genome assemblies enable the identification of disease-causing genetic variants, facilitate personalized treatment approaches, and accelerate drug discovery. Agricultural applications include crop improvement through the identification of beneficial traits and resistance genes. In evolutionary biology, comparative genomics using draft assemblies reveals the relationships between species and the mechanisms driving adaptation and speciation.

Democratizing Genomic Research

Draft genome assemblies have democratized genomic research by making genome sequencing accessible to laboratories worldwide. Unlike finished genomes, which require substantial time and computational resources, draft assemblies can be generated relatively quickly and cost-effectively while still providing sufficient quality for most research applications. This accessibility has led to an explosion in the number of sequenced genomes, with thousands of new species being sequenced annually.

The impact of this democratization cannot be overstated. Small research groups can now tackle questions that were previously the exclusive domain of large genome centers. This has led to breakthrough discoveries in non-model organisms, rare diseases, and microbial communities that would otherwise remain unstudied.

Enabling Personalized Medicine and Precision Agriculture

In clinical applications, draft genome assemblies of pathogens enable rapid identification and characterization during disease outbreaks. The COVID-19 pandemic demonstrated the critical importance of quick genome assembly capabilities, as researchers worldwide used draft assemblies to track viral evolution, identify variants of concern, and develop targeted interventions in near real-time.

Similarly, in agriculture, draft assemblies of crop plants and their associated microbiomes provide insights into plant health, stress resistance, and yield optimization. These assemblies enable breeders to make informed decisions about crop improvement and help farmers optimize growing conditions based on the genetic characteristics of their crops.

Oxford Nanopore Technology: Revolutionizing Genome Assembly

The Long-Read Advantage

Oxford Nanopore Technology has fundamentally transformed the landscape of genome assembly through its unique approach to DNA sequencing. Unlike traditional short-read sequencing technologies that produce fragments of 150-300 base pairs, ONT generates extraordinarily long reads that can span tens of thousands of base pairs, with some reads exceeding 100,000 base pairs in length.

This long-read capability addresses one of the most persistent challenges in genome assembly: repetitive sequences. Traditional short-read assemblies often struggle with repetitive elements, structural variants, and complex genomic regions, resulting in fragmented assemblies with gaps and misassemblies. ONT's long reads can span these problematic regions entirely, providing the contiguity necessary for high-quality assemblies.

Real-Time Sequencing and Portability

One of ONT's most revolutionary features is its real-time sequencing capability. Unlike other sequencing platforms that require hours or days to complete a run, ONT devices stream data continuously during sequencing. This enables researchers to monitor assembly quality in real-time and make informed decisions about when to stop sequencing, optimizing both time and cost.

The portability of ONT devices, particularly the MinION, has brought genomic sequencing capabilities to field stations, remote locations, and resource-limited settings. This portability has enabled groundbreaking research in biodiversity hotspots, rapid pathogen identification during outbreaks, and real-time monitoring of environmental samples.

Direct DNA Sequencing and Methylation Detection

ONT's nanopore sequencing technology directly measures native DNA molecules as they pass through biological nanopores, eliminating the need for amplification steps that can introduce bias. This direct approach preserves important epigenetic modifications, particularly DNA methylation, which plays crucial roles in gene regulation and development. The ability to simultaneously assemble genomes and characterize their epigenetic landscapes provides unprecedented insights into genome function.

The Perfect Synergy: ONT and Genome Assembly

Simplified Assembly Workflows

The combination of ONT's long reads and modern assembly algorithms has dramatically simplified genome assembly workflows. While short-read assemblies often require complex strategies involving multiple library types and careful parameter optimization, ONT assemblies can frequently be generated using straightforward approaches with minimal preprocessing.

Modern ONT assembly pipelines can transform raw sequencing data into high-quality draft assemblies in hours rather than days or weeks. This speed enables iterative approaches where researchers can quickly assess assembly quality and adjust sequencing strategies if needed.

Resolving Complex Genomic Features

ONT's long reads excel at resolving complex genomic features that have traditionally been challenging for genome assembly. These include structural variations, tandem repeats, centromeres, and highly variable regions such as immune system genes. The ability to accurately assemble these regions provides more complete and biologically meaningful genome representations.

Cost-Effective High-Quality Assemblies

The efficiency of ONT-based assembly workflows has made high-quality genome assembly more cost-effective than ever before. Researchers can often achieve draft assemblies suitable for most research purposes using a single ONT flow cell, dramatically reducing the cost per genome compared to traditional approaches.

Learning Objectives and Workshop Overview

Throughout this workshop, you will develop practical skills in handling ONT sequencing data, understanding quality metrics, and implementing assembly workflows. You will learn to evaluate assembly quality, identify potential issues, and optimize assembly parameters for different types of genomic data.

By the end of this workshop, you will understand how ONT's unique characteristics enable high-quality genome assemblies and how these assemblies serve as powerful tools for addressing diverse biological questions. You will be equipped with the knowledge and skills necessary to implement genome assembly projects in your own research, contributing to the expanding universe of genomic knowledge that continues to transform our understanding of life itself.

The skills you develop here will prepare you for the exciting frontier of genomics, where new discoveries await in every genome assembled and every question asked. Welcome to the world of genome assembly with Oxford Nanopore Technology – where the blueprint of life becomes accessible, interpretable, and transformative.

Workshop: 1KSA Genome Assembly

This handbook provides step-by-step instructions for performing de novo genome assembly using Oxford Nanopore Technology (ONT) sequencing data using the Flye assembler on the SAIAB High Performance Computer. The workshop is designed for individuals with minimal bioinformatics.

What You Will Learn

Navigating the command line: Basic Linux commands, file navigation, and working with remote servers.
Fundamentals of genome assembly: Understanding de novo assembly, its importance, and how ONT sequencing data is used in the process.
Accessing and using SAIAB: How to log in, navigate the system, and configure your environment for genome assembly.
Running the Flye assembler: Step-by-step instructions for assembling a genome, including input preparation, command execution, and parameter selection.
Evaluating assembly quality: How to assess genome completeness, contiguity, and accuracy using tools such as QUAST and BUSCO.
Troubleshooting common issues: Identifying and resolving problems related to read quality, assembly fragmentation, and computational errors.

Pre-requisites

SAIAB account & project access

Participants must have an active SAIAB account. If you don’t have an account or haven’t been granted project access, please reach out to SAIAB support well in advance to set this up.

Basic command line knowledge

A basic understanding of command line operations is required for participants, including:

Navigating directories (cd, ls, pwd)

Viewing and editing files (nano, less, cat)/p>

Copying, moving, and removing files (cp, mv, rm)

Running basic shell commands (echo, grep, find)

Prerequisite training:

Participants must work through the SAIAB Introduction to Unix for Bioinformaticians.

Course Outline

The course is built from self-contained modules that follow the genome assembly workflow from raw reads to an evaluated draft genome. The modules are designed to be flexible: a one-day course can cover the core path (Modules 1–5), while a longer course can add the troubleshooting and automation material and allow more time for hands-on practice. Instructors can expand or compress each module to fit the time available.

Module 1 — Foundations

Introduction to genome assembly and the role of draft genomes
Oxford Nanopore (ONT) long-read sequencing and why it suits de novo assembly
Goals and learning outcomes of the course

Module 2 — Getting Started on the HPC

Logging into the server via SSH and navigating the command line
Essential Linux refresher and managing sessions with screen
Loading software modules and configuring your environment
Understanding the SLURM job scheduler, common commands, and job script parameters

Module 3 — Pipeline Overview & Setup

Overview of the assembly pipeline steps
Cloning / copying the pipeline repository and scripts
Installing dependencies (e.g. BUSCO lineage data)

Module 4 — Running the Assembly Pipeline

Data preparation and read quality assessment
K-mer analysis and genome size estimation
Genome assembly, mapping, and polishing
Running each step individually, then as a single automated script
Interpreting the results at each stage

Module 5 — Assembly Quality Assessment

Species identification
Evaluating assembly quality and generating a report (QUAST, BUSCO)
Generating and interpreting snail plots

Module 6 — Troubleshooting (optional / extended)

K-mer analysis with no peaks detected
Diagnosing poor assembly results (low coverage, fragmentation)
Read quality and contamination issues
Computational errors and resource limitations
Improving low BUSCO completeness

Module 7 — Automation & Wrap-up (optional / extended)

Running the full workflow automatically with Nextflow
Q&A, feedback, and next steps

Getting started

# Terminal
These black coloured boxes are code blocks.

1. Accessing the SAIAB server

Logging into SAIAB via SSH

ssh username@lab417.saiab.ac.za                             #use your username and password
# You are on the login node

2. Linux & SLURM Basics

2.1 Essential Linux commands

pwd                  # print current directory
ls                   # list files
ls -lh               # list with sizes and permissions
cd /path/to/dir      # change directory
mkdir my_folder      # create a directory
cp file1 file2       # copy a file
mv file1 file2       # move or rename a file
cat file.txt         # print file contents
head -20 file.txt    # show first 20 lines
rm file.txt          # delete a file

2.2 Managing sessions with screen

Use screen to keep your work running after you disconnect from SSH.

# Start a new named session
screen -S workshop

# Detach (leaves session running in background)
Ctrl + A, then D

# List all sessions
screen -ls

# Reattach to a session
screen -r workshop

2.3 Understanding SLURM

The SAIAB cluster uses SLURM to schedule compute jobs. You submit a job to a queue; SLURM runs it when resources are free. Jobs are prioritised by wait time and fair-share usage. The only partition on SAIAB is agrp. If your job exceeds its requested time it will be killed automatically.

2.4 SLURM commands

Command	Description
`squeue -u username`	View your queued and running jobs
`sbatch script.sh`	Submit a job script to the scheduler
`scancel 1133`	Cancel job with ID 1133
`srun --cpus-per-task=2 -t 0-6:30 --mem 10G --pty /bin/bash`	Start an interactive session (2 CPUs, 10 GB RAM, 6h30m)

2.5 Job script example

Job parameters are set as #SBATCH lines at the top of a bash script.

#!/bin/bash
#SBATCH --partition=agrp          # queue to use
#SBATCH --time=14-00:00:00        # max runtime (days-hours:minutes:seconds)
#SBATCH --cpus-per-task=2         # CPUs per task
#SBATCH --mem=100G                 # RAM per node
#SBATCH --error=job.%J.err        # stderr log (%J = job ID)
#SBATCH --output=job.%J.out       # stdout log
#SBATCH --job-name=my_job         # name shown in squeue

# Your commands below:
bash master.sh /path/to/fastq/file

# Submit the script:
sbatch script.sh

2.6 Setting up your environment

# 1. Start a screen session so work survives disconnection
screen -S workshop

# 2. Request an interactive compute node via SLURM
srun --cpus-per-task=2 -t 0-6:30 --mem 10G --pty /bin/bash

# 3. Activate the software environment
source /opt/miniconda3/bin/activate ont_course

# 4. Confirm your job is running
squeue -u $USER

Once the ont_course environment is active, check that every tool used in this workshop is available. Each line below should print a path; if any prints nothing, tell an instructor before you start.

# 5. Verify the workshop tools are installed
for tool in chopper NanoPlot kmc genomescope2 flye minimap2 samtools busco quast nextflow; do
  printf '%-10s ' "$tool"; command -v "$tool" || echo "MISSING";
done

Medaka (the polishing step) and BlobTools (the snail plot near the end) each live in their own conda environment. Check them like this — then switch back to ont_course to continue:

# 6. Verify Medaka in the medaka environment
source /opt/miniconda3/bin/activate medaka
command -v medaka || echo "medaka MISSING"

# 7. Verify BlobTools in the btk environment
source /opt/miniconda3/bin/activate btk
command -v blobtools || echo "blobtools MISSING"

# Switch back to the course environment
source /opt/miniconda3/bin/activate ont_course

3. Pipeline overview and download

3.1 Overview of pipeline

The genome assembly pipeline is designed to process raw sequencing data into a polished, high-quality draft genome. Below, we outline the pipeline’s steps, their importance, and benchmarking metrics for evaluating their success.

Figure 1: 1KSA Genome Assembly Workflow — **Figure 1.** 1KSA Genome Assembly Workflow showing QC, assembly (HiFiASM/Flye), mapping, polishing (Medaka), and evaluation (QUAST/BUSCO) stages.

In this course we will run the 1KSA workflow step by step on the command line and in the end we will show how one can run it automatically as a pipeline using Nextflow.

Nextflow is a workflow management system designed specifically for computational pipelines, especially popular in bioinformatics and data science.

3.2 Clone the pipeline repository with the necessary scripts

# Navigate to your home folder
cd ~

# Copy the pipeline files to your directory
cp -r /opt/courses/ont_assembly/ .

# Navigate inside the folder of the pipeline:
cd ont_assembly
# List what you have downloaded
ls -1

01_qc_and_trimming.nf
02_decontamination.nf
03_mapping.nf
04_polishing.nf
05_eval_final.nf
data
generate_report.sh
kmer-Analysis.sh
master.sh
nextflow.config
params.config
run_flye.sh
run_hifiasm.sh
submit_run.sh
tar_report.sh

3.3 Dependencies:

# Download the BUSCO lineage folder
busco --download bacteria_odb10
# Make sure the download was completed:
ls ./busco_downloads/lineages/bacteria_odb10

It should return the following:

ancestral  ancestral_variants  dataset.cfg  hmms  info  lengths_cutoff  links_to_ODB10.txt  refseq_db.faa.gz  refseq_db.faa.gz.md5  scores_cutoff

BUSCO uses a set of evolutionarily-informed expectations of gene content of near-universal single-copy orthologs to assess the completeness of genome assemblies, gene sets, and transcriptomes.

Running the pipeline

Data preparation

# Change to the data folder
cd data
# Unzip files 			
gunzip -d escherichia_coli_data.fastq.gz

# List files in folder
ls -1 
escherichia_coli_data.fastq

# Move up to previous directory
cd ..

Create an output directory so all analysis results are collected in one place. The subfolders below mirror the layout the automated pipeline (master.sh) writes to, keeping your manual results tidy and separate from the scripts.

# Create the results directory tree
mkdir -p results/trimmed_fastq results/sam_file results/Busco_results results/quast_report results/kmer results/Flye_results

K-mer Analysis

The primary goal of k-mer analysis is to evaluate the distribution of sequence fragments (k-mers) in raw sequencing data (Manekar and Sathe, 2018). This analysis provides insights into the quality of sequencing data, genome size estimation, sequencing coverage, and the complexity of the genome. It is a critical step in optimizing genome assembly processes, especially for de novo genome assemblies without reference genomes.

Genome size estimation: K-mer analysis provides an approximate estimate of the genome size based on k-mer frequencies, particularly useful for organisms without reference genomes.

Coverage assessment: Helps assess the sequencing depth and whether there is sufficient coverage for a reliable genome assembly.

Error detection: Identifies sequencing errors through over-represented or under-represented k-mers, which may indicate biases or library preparation issues.

Data complexity: Evaluates the complexity of the genome, including repetitive regions or unusual GC content, which impacts the assembly process.

K-mer distribution: A histogram representing k-mer frequencies. A well-defined peak indicates consistent coverage, while the absence of a peak may suggest insufficient coverage or sequencing issues.

Peak location: The position of the peak on the histogram reflects the estimated genome size. A high, sharp peak typically indicates accurate sequencing coverage.

K-mer depth (coverage): The height of the peak represents the sequencing depth. Adequate coverage (e.g., 50x–100x) ensures reliable assembly.

K-mer error rate: An excessive number of low-frequency k-mers can indicate sequencing errors or poor library preparation.

GC content: Confirms whether the genome’s GC content aligns with expected values, identifying potential sequencing biases.

K-mer size and cutoff values: Appropriate k-mer size and frequency cutoffs help filter out sequencing noise and improve assembly accuracy.

One script handles this step; it is included in the genome assembly pipeline repository:

kmer-Analysis.sh

This script counts k-mer frequencies from a FASTQ file using KMC (k=17, appropriate for ONT reads), builds a k-mer histogram, and then runs GenomeScope2 on that histogram to estimate genome size, coverage, heterozygosity, and ploidy.

# Run kmer analysis (3rd argument sends all outputs to results/kmer)
bash kmer-Analysis.sh data/escherichia_coli_data.fastq escherichia_coli results/kmer
# List the output folder
ls -l results/kmer/

The folder should contain:

escherichia_coli_17_mers_histo.txt
escherichia_coli_17_mers.kmc_pre
escherichia_coli_17_mers.kmc_suf
escherichia_coli_genomescope
escherichia_coli_total_bases.txt

Output files:

escherichia_coli_total_bases.txt — total input bases (sanity check)
escherichia_coli_17_mers_histo.txt — KMC k-mer frequency histogram
escherichia_coli_17_mers.kmc_pre / escherichia_coli_17_mers.kmc_suf — KMC k-mer count database (intermediate files)
escherichia_coli_genomescope/ — GenomeScope2 output folder, containing:
- summary.txt — genome size, coverage, heterozygosity
- linear_plot.png — k-mer histogram with fitted model
- log_plot.png — log-scale histogram

# Record genome size and coverage from the GenomeScope2 summary
cat results/kmer/escherichia_coli_genomescope/summary.txt

For the E. coli course dataset, summary.txt looks like this:

GenomeScope version 2.0
k = 17
property                      min            max
Homozygous (aa)               99.7189%       99.7496%
Heterozygous (ab)             0.250368%      0.281093%
Genome Haploid Length         4,986,082 bp   4,993,847 bp
Genome Repeat Length          244,651 bp     245,032 bp
Genome Unique Length          4,741,431 bp   4,748,814 bp
Model Fit                     95.7734%       99.1079%
Read Error Rate               0.271157%      0.271157%
K-mer coverage (kcov)         10.0
Coverage peak (2 x kcov)      20        use for Flye --asm-coverage

Two numbers carry forward to the assembly step, and kmer-Analysis.sh writes both into summary.txt. The Genome Haploid Length (~5 Mb) is Flye's genome-size estimate — set genome_size to 0.005g (≈ 5 Mb). The Coverage peak (2 × kcov ≈ 20x — GenomeScope2 does not normally print a coverage, so the script derives it from the fitted kmercov) is the value for asm_coverage (e.g. 20). Set both in params.config, or pass them to Flye directly. The k-mer histogram is shown in results/kmer/escherichia_coli_genomescope/linear_plot.png.

Figure 2. GenomeScope2 k-mer profile (k=17) for the Escherichia coli course dataset: estimated haploid length ~4.99 Mb, coverage peak ~18–20x, model fit ~99%.

You will use the above output as input parameters when running the next steps.

# You can view the kmer plots in either png or .pdf format by opening the files in the SSH browser of Mobaxterm.

The sections below explain how to read the GenomeScope2 profile (Figure 2 above) and interpret the metrics you just generated.

Key metrics from the plot

Genome Haploid Length: ~4.99 Mb

GenomeScope2's estimate of the haploid genome size, read from summary.txt (and shown as len at the top of the plot). For the E. coli dataset this is ~5 Mb. This is the value you pass to Flye as -g.

Homozygous (aa) / Heterozygous (ab): ~99.7% / ~0.25%

The proportion of homozygous vs heterozygous k-mer sites. E. coli is haploid, so heterozygosity is near zero; a diploid organism would show a clear second (heterozygous) peak.

Model fit / Read error rate: ~99% / ~0.27%

How well GenomeScope2's model matches the observed k-mers, and the estimated per-base error rate. A high fit and low error rate indicate clean, well-modelled data.

Understanding the Plot

X-axis (Coverage):

Represents how often each unique k-mer occurs in your sequencing data (i.e., sequencing depth or k-mer coverage).

Y-axis (Frequency):

The number of distinct k-mers at each coverage level. On the log_plot.png shown here this axis is log-scaled, which makes the low-coverage error spike and the high-coverage tail easy to see; the linear_plot.png shows the same data on a linear axis (handy for judging peak height).

Coloured curves (legend):

observed — the actual k-mer counts; full model — GenomeScope2's fitted curve; unique sequence — the non-repetitive fraction; errors — low-coverage k-mers attributed to sequencing error. The dashed kmer-peaks lines mark the fitted coverage peak.

Peaks in the Histogram

Left Peak (Coverage ~1–10):

Represents sequencing errors or low-frequency k-mers (the red errors curve). These are filtered out during assembly.

Main Peak (Coverage ~18–20):

Represents correct, high-confidence k-mers.
The peak coverage (~20x here = 2 × kcov, marked by the dashed kmer-peaks line) indicates your k-mer sequencing depth. For this small E. coli dataset ~20x is sufficient for a complete assembly. This is the value (reported as Coverage peak in summary.txt) that you pass to Flye as --asm-coverage.

Right Tail (Coverage >~30):

Usually corresponds to repetitive regions (e.g., rRNA genes, repeats) that occur multiple times in the genome, leading to higher coverage.

Interpretation and Quality Insights

A sharp, well-defined main peak at coverage ~20x suggests:

Good sequencing quality.
Sufficient coverage for a reliable genome assembly.
Low noise from sequencing errors, since the left error peak is clearly separated from the main peak.

Genome Size Estimation Accuracy:

The haploid length of ~4.99 Mb is a reliable estimate for E. coli, assuming your k-mer size and filtering settings were appropriate.

In summary, k-mer analysis acts as a quality checkpoint before genome assembly, ensuring that the sequencing data is sufficient, accurate, and free from major biases or errors. By interpreting the histogram and key metrics, researchers can optimize their assembly pipelines and improve the final genome reconstruction.

Assessing Read Quality

Purpose:

To assess the quality of sequencing reads before any processing begins. High-quality raw reads are critical for successful downstream assembly and analysis.

Why It’s Done:

Identifies sequencing errors, read length distribution, and data completeness.
Helps detect any systematic issues with the sequencing run.

Important Metrics:

Read N50: The length at which 50% of the total sequence data is contained in reads of equal or greater length.
Average Read Length: Should align with the expected output of the sequencer (e.g., 10-20 kb for Oxford Nanopore).
GC Content: Should reflect the expected range for the organism sequenced.

# Run NanoPlot on your fastq file:
NanoPlot -t 2 --fastq ./data/escherichia_coli_data.fastq --tsv_stats -o results/nanoplot_before_trim/NanoPlot_CHECK_1 --no_static

# View the output file:
less ./results/nanoplot_before_trim/NanoPlot_CHECK_1/NanoStats.txt

Metrics dataset
number_of_reads 21452
number_of_bases 176565874.0
median_read_length      5178.5
mean_read_length        8230.7
read_length_stdev       9134.9
n50     15335.0
mean_qual       10.8
median_qual     11.8
longest_read_(with_Q):1 81743 (9.5)
longest_read_(with_Q):2 81102 (13.9)
longest_read_(with_Q):3 79307 (9.6)
longest_read_(with_Q):4 78113 (13.0)
longest_read_(with_Q):5 77919 (12.7)
highest_Q_read_(with_length):1  26.7 (37)
highest_Q_read_(with_length):2  25.6 (47)
highest_Q_read_(with_length):3  23.2 (91)
highest_Q_read_(with_length):4  22.9 (103)
highest_Q_read_(with_length):5  22.9 (49)
Reads >Q10:     15584 (72.6%) 133.0Mb
Reads >Q15:     2276 (10.6%) 17.7Mb
Reads >Q20:     27 (0.1%) 0.0Mb
Reads >Q25:     2 (0.0%) 0.0Mb
Reads >Q30:     0 (0.0%) 0.0Mb

Trimming Reads with Chopper

Purpose:

To remove low-quality bases and adaptors from sequencing reads, ensuring that only high-confidence data is used for assembly.

Why It’s Done:

Improves the accuracy of the assembly.
Reduces computational burden by eliminating unnecessary or misleading data.

Important Metrics:

Percentage of Reads Retained: Typically, >80% of reads should remain post-trimming.
Mean Quality Score: Post-trim scores should exceed Q10 for base accuracy.
Minlength:Sets a minimum read length.

# Run chopper on your fastq file (output goes to results/trimmed_fastq):
chopper -q 10 -l 500 -i ./data/escherichia_coli_data.fastq > results/trimmed_fastq/escherichia_coli_data_trimmed.fastq

Re-checking Read Quality After Trimming

Purpose:

To confirm the quality improvement of reads after trimming.

Why It’s Done:

Verifies that trimming has effectively removed errors while retaining biologically meaningful data.
Detects any anomalies introduced during the trimming process.

Important Metrics:

Improved Mean Quality Score: An increase from raw read quality.
Read Length Distribution: Consistency with expectations for trimmed reads.

# Run NanoPlot on your trimmed fastq file:
NanoPlot -t 1 --fastq ./results/trimmed_fastq/escherichia_coli_data_trimmed.fastq --tsv_stats -o results/nanoplot_after_trim/NanoPlot_CHECK_2

# View the output file:
less ./results/nanoplot_after_trim/NanoPlot_CHECK_2/NanoStats.txt

Metrics dataset
number_of_reads 14447
number_of_bases 132669299.0
median_read_length      5946.0
mean_read_length        9183.2
read_length_stdev       9341.1
n50     15634.0
mean_qual       12.6
median_qual     12.8
longest_read_(with_Q):1 81102 (13.9)
longest_read_(with_Q):2 78113 (13.0)
longest_read_(with_Q):3 77919 (12.7)
longest_read_(with_Q):4 76635 (15.4)
longest_read_(with_Q):5 72781 (12.7)
highest_Q_read_(with_length):1   22.2 (1758)
highest_Q_read_(with_length):2  22.1 (599)
highest_Q_read_(with_length):3   21.0 (7959)
highest_Q_read_(with_length):4   20.7 (752)
highest_Q_read_(with_length):5  20.2 (893)
Reads >Q10:     14447 (100.0%) 132.7Mb
Reads >Q15:     2052 (14.2%) 17.6Mb
Reads >Q20:      6 (0.0%) 0.0Mb
Reads >Q25:     0 (0.0%) 0.0Mb
Reads >Q30:     0 (0.0%) 0.0Mb

Genome Assembly

De novo genome assembly reconstructs a genome from raw sequencing reads without using a reference genome. In order to do that, overlapping DNA sequences are used to build longer DNA sequences (contigs) and these contigs are arranged into a complete genome with the help of various statistical methods and algorithms.

Key Terms

Reads: DNA sequences generated by the sequencer
Contigs: Continuous sequences assembled from overlapping reads
Scaffolds: Larger sequences created by linking contigs
N50: A metric that represents assembly continuity
BUSCO: A tool to assess assembly completeness

To generate a draft genome by assembling sequencing reads into contiguous sequences (contigs).

Converts raw sequence reads into a coherent representation of the genome.
Provides the structural foundation for downstream annotation and analysis.
Assembly N50: Longer contigs indicate a more contiguous assembly.
Total Assembly Size: Should approximate the expected genome size (e.g., 800 Mb).
Number of Contigs: Fewer contigs suggest a more complete assembly.

Now feed the two numbers from the k-mer step into Flye:

-g (estimated genome size) — the Genome Haploid Length from results/kmer/escherichia_coli_genomescope/summary.txt (~5 Mb here). so we pass 0.005g (≈ the 5 Mb haploid length). Flye only uses -g together with --asm-coverage to decide how many of the longest reads to subsample.
--asm-coverage — the Coverage peak line in summary.txt (2 × kcov ≈ 20x for this dataset).

For your own species, read both values off your own summary.txt or kmer plot and substitute them below in the flye commandline or set the matching genome_size / asm_coverage in params.config when you run the full pipeline.

# Run Flye on the trimmed fastq file
# -g (genome size) and --asm-coverage (coverage peak) come from the k-mer analysis
flye --nano-raw results/trimmed_fastq/escherichia_coli_data_trimmed.fastq -o results/Flye_results/assembly --asm-coverage 20 -g 0.005g -t 2

# View the output file:
tail results/Flye_results/assembly/flye.log

Example output:

INFO: Assembly statistics:
        Total length:   5075826
        Fragments:      4
        Fragments N50:  4862942
        Largest frg:    4862942
        Scaffolds:      0
        Mean coverage:  24

Mapping and Polishing

Genome assembly produces a draft — a best-guess reconstruction from overlapping reads. This draft contains errors: insertions, deletions, and substitutions introduced by the sequencer. Mapping and polishing correct these errors using the original reads as evidence.

Why map reads back to the assembly?

After assembly, the raw reads are aligned back to the draft genome. This shows exactly where reads agree with the assembly and where they disagree. Regions with low coverage or many mismatches flag potential misassemblies.

Alignment rate — aim for >95% of reads mapping. Lower rates suggest the assembly is missing sequence or reads are contaminated.
Coverage depth — should match the expected sequencing depth (e.g., 30x). Regions with very low coverage are unreliable.

We use minimap2 for mapping. It is fast, handles long noisy ONT reads, and outputs a SAM file.

# Map the reads using minimap2
minimap2 -ax map-ont -t 2 ./results/Flye_results/assembly/assembly.fasta ./results/trimmed_fastq/escherichia_coli_data_trimmed.fastq > results/sam_file/escherichia_coli.sam

Why polish?

ONT reads have a raw base-calling error rate of ~5-10%. Even after assembly, the consensus sequence retains some of these errors. Polishing uses the read alignments to vote on the correct base at each position, replacing draft errors with the most supported base call.

Polishing rounds — Medaka usually needs a single round for modern SUP/HAC-basecalled ONT data. Each round uses the previous consensus as the new reference; gains beyond one round are typically small.
Expected improvement — error rate drops significantly after polishing; BUSCO completeness and QUAST metrics should both improve.

We use Medaka, Oxford Nanopore's neural-network consensus polisher. Medaka maps the reads to the draft internally (it does not use the SAM file produced above — that mapping is kept for coverage QC and the snail plot later), then calls an improved consensus. Match the -m model to your flowcell and basecaller; the course E. coli dataset is R9.4.1, so we use r941_min_hac_g507 (list available models with medaka tools list_models).

Medaka has its own heavy dependencies, so it lives in a separate medaka conda environment. Activate it just for this step, then switch back to ont_course afterwards.

# Switch to the medaka environment
source /opt/miniconda3/bin/activate medaka

# Polish the assembly with Medaka
medaka_consensus -i ./results/trimmed_fastq/escherichia_coli_data_trimmed.fastq -d ./results/Flye_results/assembly/assembly.fasta -o results/Medaka_results -t 2 -m r941_min_hac_g507

# Switch back to ont_course for the remaining steps
source /opt/miniconda3/bin/activate ont_course

The final assembled genome sequence is found in the file results/Flye_results/assembly/assembly.fasta that is used downstream for genome annotation or comparative genomics.

Genome Assembly Evaluation

Assembly evaluation answers the question: is this genome good enough to use? Raw assembly and polishing steps can introduce errors, leave gaps, or produce fragmented contigs. Evaluation catches these problems before downstream analysis (annotation, comparative genomics, publication).

Two complementary tools are used:

QUAST — measures structural quality: number of contigs, total length, N50, and largest contig. Tells you whether the assembly is contiguous and the right size.
BUSCO — measures biological completeness by checking for the presence of conserved single-copy genes expected in the organism's lineage. A score >95% complete indicates a high-quality assembly.

Run both after polishing. QUAST flags fragmentation; BUSCO flags missing gene content. Together they give a full picture of assembly quality.

# Run BUSCO on the polished assembly:
busco -m genome -i results/Medaka_results/consensus.fasta -o Busco_output --out_path results/Busco_results -l bacteria_odb10 --download_path busco_downloads --offline

# View the output file:
less results/Busco_results/Busco_output/short_summary.specific.bacteria_odb10.Busco_output.txt

# BUSCO version is: 6.1.0
# The lineage dataset is: bacteria_odb10 (Creation date: 2024-01-08, number of genomes: 4085, number of BUSCOs: 124)
# Summarized benchmarking in BUSCO notation for file /home/edv/ont_assembly/results/Medaka_results/consensus.fasta
# BUSCO was run in mode: prok_genome_prod
# Gene predictor used: prodigal
        ***** Results: *****
        C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124
        124     Complete BUSCOs (C)
        124     Complete and single-copy BUSCOs (S)
        0       Complete and duplicated BUSCOs (D)
        0       Fragmented BUSCOs (F)
        0       Missing BUSCOs (M)
        124     Total BUSCO groups searched
Assembly Statistics:
        5       Number of scaffolds
        5       Number of contigs
        5106391 Total length
        0.000%  Percent gaps
        4 Mbp   Scaffold N50
        4 Mbp   Contigs N50
Dependencies and versions:
        hmmsearch: 3.4
        bbtools: None
        prodigal: 2.6.3

Reading the BUSCO summary:

The header lines record the BUSCO version, the lineage dataset (bacteria_odb10 — a set of 124 genes expected to be present and single-copy in essentially all bacteria), the input file, and the prediction mode and tool. The result you care about is the single-line score:

C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124

Each letter is a category, given as a percentage of the n genes searched:

C — Complete: the gene was found at close to its expected full length. This is the headline completeness number; >95% is considered a high-quality assembly.
S — Complete & single-copy: found exactly once, which is what you expect for these genes.
D — Complete & duplicated: found more than once. A high value can signal an over-assembled or redundant genome (e.g. uncollapsed haplotypes or contamination).
F — Fragmented: only partially recovered — the gene is split or incomplete, often a sign of low coverage or a broken assembly.
M — Missing: not found at all, indicating gaps in gene content.
n — Total searched: the number of BUSCO genes in the lineage (124 for bacteria_odb10).

For this E. coli assembly the score is 100% complete, all single-copy, with zero duplicated, fragmented or missing genes (124/124) — an excellent result that means the polished assembly is biologically complete. The Assembly Statistics block beneath the score (scaffolds, contigs, total length, N50) reports the same structural metrics that QUAST covers in more detail below.

QUAST (Quality Assessment Tool for Genome Assemblies) evaluates assembly quality without a reference genome. It reports key metrics: number of contigs, total assembly length, N50 (the length at which 50% of the assembly is contained in contigs of that size or longer), and largest contig. A good assembly has few contigs, high N50, and total length close to the expected genome size.

# Run QUAST on the polished assembly file:
quast.py -t 2 -o results/quast_report/Quast_output --gene-finding results/Medaka_results/consensus.fasta --fragmented

# View the output file:
less results/quast_report/Quast_output/report.txt

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
Assembly                    consensus
# contigs (>= 0 bp)         5
# contigs (>= 1000 bp)      5
# contigs (>= 5000 bp)      5
# contigs (>= 10000 bp)     4
# contigs (>= 25000 bp)     4
# contigs (>= 50000 bp)     3
Total length (>= 0 bp)      5106391
Total length (>= 1000 bp)   5106391
Total length (>= 5000 bp)   5106391
Total length (>= 10000 bp)  5096683
Total length (>= 25000 bp)  5096683
Total length (>= 50000 bp)  5067095
# contigs                   5
Largest contig              4863813
Total length                5106391
GC (%)                      50.81
N50                         4863813
N90                         4863813
auN                         4637077.2
L50                         1
L90                         1
# N's per 100 kbp           0.00

Reading the QUAST report:

The first block (# contigs (>= X bp) and Total length (>= X bp)) shows how the count and total sequence change as you raise the minimum contig size — a quick way to see how much of the genome sits in large versus small contigs. The headline metrics below them (based on contigs ≥ 500 bp) are what matter most:

# contigs = 5 — the whole genome is in just five pieces. Few contigs means high contiguity.
Largest contig = 4,863,813 bp and Total length = 5,106,391 bp — one contig holds almost the entire ~5.1 Mb assembly (the E. coli chromosome), and the total is close to the expected E. coli genome size.
N50 = 4,863,813 bp — half of the assembly is contained in contigs at least this long. An N50 nearly equal to the genome size is excellent.
L50 = 1 (and L90 = 1) — only a single contig is needed to cover 50% (and even 90%) of the assembly, i.e. near chromosome-level contiguity. auN (~4.64 M) is a threshold-free summary of the same contiguity.
GC (%) = 50.81 — matches the expected GC content for E. coli (~50%), a good sign there is no major contamination.
# N's per 100 kbp = 0.00 — no ambiguous bases or internal gaps in the assembly.

Taken together — few contigs, N50 close to the genome size, L50 of 1, expected total length and GC, and no Ns — this is a highly contiguous, high-quality draft assembly.

Evaluating assembly quality and generating a report

The pipeline ships a unified generate_report.sh that auto-detects the assembler and collects the NanoPlot, k-mer, assembly, BUSCO and QUAST outputs from ./results into a single summary report. Run it from the pipeline folder:

# Generate the summary report (reads species_name and assembler from params.config)
bash generate_report.sh

This produces <species_name>_report.txt plus a staging folder of the key result files. To bundle everything into one archive for transfer, run:

# Bundle the report files into a tar.gz archive
bash tar_report.sh

Generate snail plots

BlobTools is a bioinformatics software package designed for visualizing and analyzing genome assemblies, particularly for quality control and contamination detection. The following commands will generate a visual summary of the assembly.

BlobTools lives in its own btk environment, so activate it before running the blobtools commands below. The samtools step still uses ont_course.

# Convert sam file to bam file format and sort and index (ont_course)
samtools view -@ 2 -b results/sam_file/escherichia_coli.sam| samtools sort -@ 2 -o results/sam_file/escherichia_coli_sorted.bam && samtools index -c results/sam_file/escherichia_coli_sorted.bam

# Switch to the BlobTools environment
source /opt/miniconda3/bin/activate btk

# Create blobtools dataset from assembly
blobtools create --fasta results/Medaka_results/consensus.fasta escherichia_coli
# Add read coverage plot 
blobtools add --cov results/sam_file/escherichia_coli_sorted.bam --taxdump /data/db/taxdump/ escherichia_coli

# Add busco results
blobtools add --busco results/Busco_results/Busco_output/run_bacteria_odb10/full_table.tsv escherichia_coli

# Run blobtools view to generate a snail plot
blobtools view --plot --view snail escherichia_coli

# Switch back to ont_course for the remaining steps
source /opt/miniconda3/bin/activate ont_course

Download the escherichia_coli.snail.png file to your local device or open it up using the MobaXterm file viewer.

Figure 3: Blobtools snail plot — **Figure 3.** Blobtools snail plot showing genome assembly statistics and quality metrics for the *Escherichia coli* dataset.

Assembly Statistics (Concentric Rings)

The data set used is a smaller version of a larger genome project to enable participants to generate an assembly during the course duration.

Scaffold Statistics

Total scaffolds: 6 (very low - good!)

Total length: 5.1 Mb (typical for E. coli ~4.6-5.5 Mb)

Longest scaffold: 4.86 Mb (represents ~95% of the genome)

N50 = N90 = 4.86 Mb (extremely high contiguity!)

BUSCO Results: Outstanding

Complete BUSCOs: 100% (124/124 genes) ✅

Duplicated: 0% (no contamination) ✅

Fragmented: 0.0%

Missing: 0.0%

Key Insights

Highly contiguous assembly: One major contig (~4.86 Mb) contains almost the entire genome

No contamination: 0% duplicated BUSCOs

Proper GC content: 50.8% GC is typical for E. coli

Clean assembly: 0% Ns (no gaps)

Visual Elements

Large orange area: Main chromosome (4.86 Mb)

Small red wedge: Longest scaffold marker

Blue rings: Smaller scaffolds/contigs

The genome assembly pipeline is a meticulous process that transforms raw sequencing data into a polished draft genome. Each step is carefully designed to ensure the final assembly is accurate, complete, and ready for biological interpretation.

Run the entire pipeline in one script

We have ran the 1KSA pipeline to perform de novo genome assembly with Oxford Nanopore long-read sequencing data on the command line as single line scripts but it has also been implemented in Nextflow, a workflow management system designed specifically for computational pipelines, especially popular in bioinformatics and data science.

In this section we will run the Nextflow implemented 1KSA workflow.

# Open and edit the params.config file 
nano params.config

Change the following path and values:

# =============================================================================
# File: params.config
# Description: Master parameter file for the Genome Assembly Pipeline.
#              Supports Hifiasm or Flye assembly, with optional Medaka polishing.
#              All workflow options are controlled from here.
# =============================================================================

# --- Species ---
species_name='escherichia_coli'         # Change per species

# =============================================================================
# WORKFLOW SWITCHES
# =============================================================================

run_decontamination='false'             # Set to 'true' to run Kraken2 decontamination
run_polishing='true'                    # Set to 'true' to run Medaka polishing

assembler='flye'                        # Options: 'hifiasm' | 'flye'

# =============================================================================
# SHARED PARAMETERS
# =============================================================================

threads=2
lineage='bacteria_odb10'                # BUSCO lineage (lowercase for consistency)
                                        # Other options: actinopterygii_odb10,
                                        #   vertebrata_odb10, insecta_odb10,
                                        #   viridiplantae_odb10

# =============================================================================
# QC / TRIMMING PARAMETERS
# =============================================================================

min_length=500                          # Minimum read length (bp)
min_qual=10                             # Minimum mean quality score (Phred)

# =============================================================================
# DECONTAMINATION PARAMETERS (required when run_decontamination='true')
# =============================================================================

kraken_db='/home/evilliers/work/1KSA_workflow/kraken_db'   # Path to Kraken2 database
kraken_confidence=0.1                   # Kraken2 confidence threshold (0.0-1.0)

# =============================================================================
# ASSEMBLY PARAMETERS
# =============================================================================

# --- Flye-specific (required when assembler='flye') ---
# Populate genome_size and asm_coverage from your kmer analysis output
# (e.g. GenomeScope2 / Smudgeplots estimates)

genome_size='0.005g'                    # Estimated genome size (e.g. 800m, 1.2g)
asm_coverage='20'                       # Subsampling coverage for Flye (e.g. 50)
read_type='nano-raw'                    # Flye read type:
                                        #   nano-raw   : Guppy <5 / older basecalling
                                        #   nano-hq    : Guppy 5+ / Dorado sup/hac (R10)
                                        #   nano-corr  : error-corrected ONT reads
                                        #   pacbio-raw : PacBio CLR
                                        #   pacbio-hifi: PacBio HiFi / CCS

# =============================================================================
# POLISHING PARAMETERS (required when run_polishing='true')
# =============================================================================

medaka_model='r941_min_hac_g507'        # Match to your flowcell + basecalling model
                                        # (course E. coli data is R9.4.1; verify with
                                        #  'medaka tools list_models')

# =============================================================================
# BUSCO PARAMETERS
# =============================================================================

# Shared BUSCO lineage downloads directory (use real path, not symlink)
busco_downloads='busco_downloads'

# Save the script: 
^X  # Exit  
y   # Confirm Save  
Enter

The script master.sh ties the steps together, calling the Nextflow scripts and assembler wrappers in order. It skips any step whose outputs already exist:

01_qc_and_trimming.nf — NanoPlot QC + chopper trimming

02_decontamination.nf — optional Kraken2 decontamination (off by default)

run_flye.sh / run_hifiasm.sh — assembly

04_polishing.nf — Medaka polishing (when run_polishing='true')

05_eval_final.nf — BUSCO + QUAST evaluation

The pipeline already ships a ready-made SLURM submission script, submit_run.sh (listed in the folder above). The easiest path is to edit that file rather than write one from scratch: open it, set the FASTQ path near the bottom, then submit it with sbatch submit_run.sh. It already requests the right partition (agrp), 2 CPUs, and activates the course conda environment.

# Edit the shipped submission script:
nano submit_run.sh        # set the FASTQ path, then save

First, protect the results from your manual run. You already have a results folder from running the steps by hand. The pipeline writes to ./results too, so before submitting, rename your manual folder out of the way. The pipeline then creates a fresh ./results for the automated run, and your hand-run outputs stay safe in results_manual:

# Keep your manual results, then start the automated run with a clean folder
mv results results_manual

# Submit it:
sbatch submit_run.sh

After the job finishes you will have both: results_manual (your hand-run) and results (the automated run) side by side to compare.

You can follow the progress of your job with the SLURM command squeue -u $USER.

The pipeline's own progress and any errors are written to two log files in the submission directory, where <job_id> is the number SLURM assigned your job: assembly_<job_id>.out (normal output) and assembly_<job_id>.err (warnings and errors). Watch them live as the job runs with:

# Follow the job output as it runs (Ctrl + C to stop watching)
tail -f assembly_<job_id>.out
tail -f assembly_<job_id>.err

Troubleshooting common issues

1. “No Peaks or Plateaus Detected” in K-mer Analysis

If the k-mer histogram does not show a clear peak or plateau:

Possible Causes:

Insufficient or excessive sequencing data, leading to noisy or flattened histograms.
Poor-quality reads or contamination obscuring the true k-mer distribution.

Solution:

Ensure balanced coverage: Both too little and too much data can cause problems. In one case with a fungal genome, downsampling excessively large datasets improved peak clarity and genome size estimation. (You can use seqtk to subsample a percentage of the reads).
seqtk sample -s42 input.fastq 0.1 > subsampled.fastq
-s42: Sets a random seed (change the number to get different subsamples).
0.1: Keeps 10% of the reads (adjust as needed).
input.fastq: Your original FASTQ file.
subsampled.fastq: Output file with the downsampled reads.
Why it’s unbiased? Uses uniform random sampling while preserving diversity.
Check read quality with FastQC and remove low-quality reads or adapters.
Verify appropriate k-mer size (k=17 is used here for ONT reads; k=21 is common for high-accuracy short reads) for your genome.
If contamination is suspected, use tools like Kraken or FastQ Screen to assess and clean your dataset.

2. Diagnosing poor assembly results (low coverage, fragmentation)

Common symptoms include low N50, fragmentation, unexpected assembly size, or low genome completeness.

Possible Causes:

Low coverage leading to incomplete assemblies.
Low BUSCO completeness scores (e.g., <90%) suggests missing or fragmented conserved genes.
Assembly size significantly smaller or larger than estimated genome size.
High number of short contigs or scaffolds.

Solution:

Increase sequencing depth.
Use assemblers optimized for your read type and genome complexity (Figure 3).
Pre-process data to remove adapters, contaminants, and low-quality reads.
Ensure sufficient and balanced coverage (typically 30–60x for short reads, 20x for long reads).
Re-evaluate read quality and trimming steps.
Consider alternative assemblers or adjust parameters (e.g., error correction, minimum contig length).
Assess for contamination affecting assembly with tools like BlobTools

Figure 3: Recommended Nanopore assembly strategies based on read length N50.

3. Addressing issues with read quality and contamination

Poor-quality reads or contamination can skew k-mer distributions and assembly.

Indicators: Excess low-frequency k-mers, unexpected peaks, or unusually high GC content.

Solution:

Perform thorough quality trimming (e.g., with Trimmomatic, fastp).
Remove contaminants using tools like Kraken or DecontaMiner.
Evaluate GC content and coverage plots to identify anomalies.

4. Handling computational errors and resource limitations

K-mer counting and assembly are resource-intensive, especially with large genomes.

Common Errors: Job failures, memory limits exceeded, or storage issues.

Solution:

Use a SAIAB queue with higher RAM and more cores to accommodate large-genome assemblies (bigmem queue; see under Bigmem queue on the SAIAB in the Appendix).
Optimize job scheduling by monitoring resource usage and adjusting memory and core requests via PBS.
Downsample input data where appropriate to reduce computational load without compromising assembly quality.
Increase quality thresholds during trimming and filtering to improve data quality and minimize unnecessary processing.
Use efficient tools like KMC for k-mer counting while fine-tuning parameters for memory efficiency.
Modify workflow strategies by splitting large jobs, implementing checkpointing, or restructuring pipelines for better resource management.

5. Low BUSCO completeness: improving assembly quality

Certain genomes, such as the Western Leopard Toad, can be highly complex due to repetitive regions, high heterozygosity, or structural variation. Despite using Oxford Nanopore (ONT) long-read sequencing, assemblies may still show low BUSCO completeness due to fragmentation or missing genes.

1. If the assembly is fragmented (many short contigs)

Cause: Even with ONT long reads, assembly fragmentation can occur if reads are too short, error rates are high, or repeats are unresolved.

Solution:

Increase read length (N50) by optimizing library prep (e.g., high-molecular-weight DNA extraction, size selection).

Use higher-coverage data (e.g., ≥30x) to improve consensus accuracy and contiguity.

Experiment with alternative assemblers (e.g., Hifiasm, Flye, NextDenovo) to handle complex regions better.

Result: Longer, more contiguous scaffolds improve BUSCO scores by recovering full-length genes.

2. If the assembly is smaller than expected genome size

Cause: Insufficient coverage or filtering out of heterozygous/duplicated regions can result in missing genomic content.

Solution:

Increase sequencing depth to ≥20x–30x to ensure full genome representation.

Evaluate assembly parameters, some assemblers may collapse repeats or filter heterozygous regions, leading to underrepresentation.

Check for excessive read filtering/trimming, retaining more reads (even slightly lower quality) can help recover missing sequences.

Result: A more complete genome reconstruction with improved gene recovery.

3. Prioritizing longer Sequencing N50 vs. higher coverage of shorter reads

Tradeoff:

Longer N50 sequences improve contiguity and can resolve complex structures.

Higher coverage with shorter N50 reads can improve consensus accuracy and fill in missing regions.

Recommended Approach:

Aim for a balance, longer reads (>50 kb where possible) with sufficient coverage (≥20x) for complex genomes. Consider sequencing with two flow cells, using fragmented and non-fragmented DNA respectively.

If coverage is limited, focus on maximizing read length rather than accumulating more short reads.

Consider additional polishing (e.g., Medaka, Racon) to refine base accuracy post-assembly.

Result: Improved BUSCO completeness through better-contiguous and high-accuracy assembly.

Appendices

Genome assembly pipeline parameters

Tool	Parameter Description	Parameter Flag	Value/Notes
chopper	Minimum mean quality (Q)	-q	10
	Minimum read length	-l	500
Flye	ONT regular reads (Q>10)	--nano-raw
	ONT high-quality reads (Q>15)	--nano-hq
	Threads	-t	2
	Reduced coverage	--asm-coverage	(Coverage peak in summary.txt) e.g. 20
	Genome size	-g	(k-mer analysis) e.g. 2.4g
Medaka	Input reads	-i	trimmed FASTQ
	Draft assembly	-d	assembly.fasta
	Output directory	-o	results/Medaka_results
	Basecaller model	-m	r941_min_hac_g507
	Threads	-t	2
BUSCO	Database	-l	bacteria_odb10
	Mode	-m	genome
	Offline lineage path	--download_path / --offline	./busco_downloads

Glossary

Assembly (Genome Assembly): The process of reconstructing a complete genome sequence from smaller DNA sequence fragments.
Adaptive Traits: Genetic features that enhance an organism's ability to survive and reproduce in specific environments. These traits may evolve due to natural selection and can be studied through comparative genomics.
Annotation (Genome Annotation): The process of identifying and labeling functional elements in a genome, such as protein-coding genes, non-coding regions, and regulatory elements, often through sequence comparison with known genes.
Amino Acid: The building blocks of proteins, coded by DNA sequences, which link together in chains to form proteins.
Antimicrobial Resistance (AMR): The ability of microorganisms to resist the effects of drugs that once killed them or inhibited their growth, posing a serious threat to public health.

Batch Job: A non-interactive job that runs on a high-performance computing (HPC) system when resources become available.
Bigmem Queue: A SAIAB queue with high memory resources (1 TB RAM) used for memory-intensive tasks such as genome assembly.
BUSCO (Benchmarking Universal Single-Copy Orthologs): A tool used to assess genome assembly completeness by comparing it to a database of conserved genes.

SAIAB (Centre for High-Performance Computing): South Africa’s national supercomputing facility.
Concatenation: The process of merging multiple files into a single file, e.g., combining multiple FASTQ files into one.
Contigs: Continuous sequences assembled from overlapping reads.
Coverage (Sequencing Coverage): The average number of times each nucleotide in a genome is sequenced, affecting accuracy and completeness.
Contigs: Continuous sequences of DNA that result from the assembly of smaller sequencing reads. Contigs are used to build the larger genome assembly.
Comparative Genomics: The field that compares the genomes of different species to understand their evolutionary history, functional genes, and adaptations to the environment.

Detached Screen Session: A process running in the background after disconnecting from an SSH session, which can later be resumed.
Draft Genome Assembly: A preliminary genome sequence generated from raw DNA sequencing data. While it may be incomplete or contain errors, it serves as the foundation for further analysis, including genome annotation and population genetics studies.

Error Detection (K-mer Analysis): Identifying sequencing errors based on abnormal k-mer frequency distributions.
Expected Assembly Length: The anticipated genome size after the assembly process, estimated from k-mer analysis.
Evolutionary Relationships: The connections between different species based on shared ancestry. These are often studied using comparative genomics, which compares genetic sequences to understand the evolutionary history of organisms.

FASTA File: A text file format that stores nucleotide or protein sequences without quality scores. Each sequence is represented by a header line (starting with >), followed by the sequence on the next lines.
FASTQ File: A text file format that stores raw sequencing reads and their quality scores.
Fairshare (SLURM): A priority adjustment system that ensures equal distribution of SAIAB computing resources among users.
Flye: A genome assembler specifically designed for long-read sequencing technologies like Oxford Nanopore.
Functional Genomics: The study of gene functions and their interactions, often using genome sequencing data.

GC Content: The percentage of guanine (G) and cytosine (C) bases in a DNA sequence, used to evaluate sequencing quality.
Genome Size Estimation: The process of predicting the total size of a genome using k-mer frequency distributions.
grep: A command-line utility used to search text within files, often used to filter job queue information in SAIAB.

Haploid Length (Estimated Haploid Length): The predicted genome size assuming a single set of chromosomes, calculated using k-mer analysis.
Histogram (K-mer Histogram): A graphical representation of k-mer frequency distribution, helping assess sequencing quality.
High-Performance Computing (HPC): The use of powerful computers to process complex tasks, such as genome assembly.

Interactive Job: A real-time session on the SAIAB system that allows users to execute commands directly, often used for debugging.
Inbreeding: Mating between closely related individuals, which can lead to a higher frequency of deleterious recessive traits. Studying inbreeding levels in populations is important for conservation biology.

K-mer: A sequence of ‘k’ nucleotides used to analyze sequencing data and estimate genome size.
K-mer Analysis: A computational approach to evaluate sequencing data quality, estimate genome size, and detect sequencing errors.
K-mer Coverage: The number of times a specific k-mer appears in the dataset, used to infer sequencing depth.

Lengau: The SAIAB supercomputer, accessed via SSH for computational tasks.
Lineage Folder (BUSCO Lineage Folder): A reference dataset used by BUSCO to assess genome completeness.

Medaka: Oxford Nanopore's neural-network consensus tool used to polish a draft assembly. It maps the reads internally and calls an improved consensus, correcting base errors left after assembly.
Module (Environment Module): A system that allows users to load and unload software packages dynamically on SAIAB.
mpiprocs: The number of MPI (Message Passing Interface) processes allocated per node in a job submission.
Metagenomics: The study of genetic material recovered directly from environmental samples, useful for understanding microbial diversity and function.

N50: A metric that represents assembly continuity. It is the length of the shortest sequence in the largest group of sequences that together make up at least 50% of the total nucleotides.
Nextflow: A workflow management tool used for running bioinformatics pipelines.
ncpus: The number of CPU cores allocated per node in a PBS job submission.
Node: A single computer within the SAIAB cluster used for running computational tasks.

PBS (Portable Batch System): A job scheduling system used on SAIAB to manage resource allocation and job execution.
SLURM: The specific implementation of the PBS job scheduler used on SAIAB.
Pipeline: A series of computational steps used for genome assembly and data processing.
Polishing: The process of improving the accuracy of a genome assembly by correcting errors such as insertions, deletions, and mismatches.
Protein-Coding Genes: Genes that contain instructions for synthesizing proteins, which perform a vast array of functions in living organisms.

qdel: A command to cancel or remove a submitted job from the SAIAB job queue.
qstat: A command used to check the status of submitted jobs in the SAIAB queue.
Queue: A waiting list of submitted jobs on SAIAB, organized by priority and resource availability.

Racon: A tool used for genome polishing to improve assembly accuracy.
Read: DNA sequences generated by the sequencer.
Read Length: The length of a sequencing read, typically measured in base pairs.
rsync: A command used to transfer files between a local machine and SAIAB securely.

Scaffolds: Larger sequences created by linking contigs
Scheduler (SLURM Scheduler): A system that manages job execution based on priority and resource availability.
Screen Session: A tool that allows users to run processes in the background on SAIAB and reconnect later.
Seriallong Queue: A SAIAB queue used for long-running jobs with moderate memory requirements.
SSH (Secure Shell): A protocol used to securely log into the SAIAB system from a remote computer.
Standard Output (stdout): The default destination for output data from a program or script.

Timeline (-with-timeline flag in Nextflow): A Nextflow option that records the execution timeline of a pipeline.
Total Number of Bases: The total number of nucleotides in the sequencing dataset, calculated during k-mer analysis.

Unzip: The process of decompressing a .gz file to access raw sequencing data.
User Priority: The dynamically adjusted ranking in SLURM that determines when a user’s jobs will be executed.

Walltime: The maximum duration a job is allowed to run on SAIAB before being terminated.
Workflow: A structured sequence of computational steps in a bioinformatics pipeline.

Kmer-Analysis Interpretation continued

Two Peaks in K-mer Analysis

When performing k-mer analysis on a diploid organism, two peaks in the k-mer histogram can indicate the presence of homozygous and heterozygous regions, which is a key feature of diploid genomes. Here's a breakdown of what each peak could represent:

Interpretation of Two Peaks:

First Peak (Lower coverage, usually at coverage ~1–30):

This is the heterozygous portion of the genome, where you have one allele from each parent, resulting in slightly different k-mers due to genetic variation.

In diploid organisms, there will be some regions that are heterozygous (i.e., different alleles from each parent). These regions might have lower k-mer counts because the sequences from each allele may be different and thus contribute to a wider variety of k-mers.

Second Peak (Higher coverage, typically at coverage ~97):

This represents the homozygous regions of the genome, where both alleles are identical. These regions will have higher k-mer counts because both copies of the sequence from the two chromosomes are identical, leading to more frequent occurrence of the same k-mer.

Why Two Peaks Occur in Diploid Organisms:

In a diploid organism, the genome consists of two sets of chromosomes, one from each parent. In some regions, both copies of the chromosome may be identical (homozygous), leading to high k-mer coverage for those regions.

In other regions, there might be differences between the two parental copies (heterozygous), leading to the lower coverage peak because k-mers representing the two variants from each allele may not match as frequently.

General Significance of Two Peaks:

Diploidy Detection: The presence of two peaks indicates that the organism is likely diploid, as the two distinct peaks reflect the different copies (alleles) of chromosomes inherited from the two parents.

Coverage Information:

The left peak (lower coverage) often reflects heterozygous regions, while the right peak (higher coverage) corresponds to homozygous regions.

The ratio of heights and positions of these peaks can give further insights into the degree of heterozygosity and the overall complexity of the genome.

Troubleshooting/Considerations:

Genome Size Estimation: In diploid organisms, you’ll typically need to estimate the haploid genome size from the k-mer distribution. The two peaks help in understanding the genome's structure and heterozygosity.

Quality Check: If the peaks are well-defined and clearly separated, it indicates high-quality data, suggesting good sequencing depth and minimal sequencing errors.

In Summary:

Two peaks in k-mer analysis of a diploid organism can signify homozygous and heterozygous regions. The higher peak represents the homozygous regions, while the lower peak represents heterozygous regions. This pattern is normal and indicates a diploid organism. A clear distinction between the peaks typically suggests good sequencing data quality and sufficient coverage for assembly.

Three Peaks in K-mer Analysis

If your k-mer analysis shows three peaks in the histogram, it can be an indicator of a polyploid organism (i.e., an organism with more than two sets of chromosomes). The three peaks may represent the distinct k-mer coverage corresponding to homozygous, heterozygous, and possibly additional levels of ploidy.

Interpretation of Three Peaks:

First Peak (Lower coverage, typically at coverage ~1–30):

This peak likely represents heterozygous regions of the genome, where there are differences between the multiple copies of chromosomes. Since polyploid organisms have more than two copies of each chromosome, these regions will have variation between the copies, leading to lower k-mer counts due to differences in sequence.

Second Peak (Intermediate coverage, typically at coverage ~50–100):

This peak corresponds to homozygous regions where the copies of chromosomes are identical. In a polyploid organism, there will be regions where all copies of the chromosome are the same, leading to higher k-mer counts because multiple identical copies are contributing to the same k-mer.

Third Peak (Higher coverage, typically at coverage >100):

This peak may correspond to additional copies of sequences in polyploid organisms. The higher coverage is indicative of repeated or additional genomic copies present in the organism's genome. This peak could be seen in highly repetitive or duplicated regions, which are common in polyploid species.

Why Three Peaks Occur in Polyploid Organisms:

In polyploid organisms (e.g., tetraploid organisms with 4 copies of each chromosome), you'll observe more complex k-mer distributions. Here's why:

Homozygous regions will appear with higher k-mer counts, as multiple identical copies of the sequence align.

Heterozygous regions will show lower k-mer counts because the variants (alleles) at those loci do not align perfectly, resulting in fewer matching k-mers.

The third peak often arises due to additional genome copies. In polyploid organisms, there are more than two copies of each chromosome, leading to a third peak at higher coverage.

General Significance of Three Peaks:

Polyploidy Detection: The presence of three peaks indicates that the organism might be polyploid (such as tetraploid or hexaploid). Polyploid organisms can have different numbers of chromosome sets (e.g., 4 sets for tetraploid, 6 sets for hexaploid), leading to more complex k-mer distributions.

Genome Complexity: The three peaks provide insights into the complexity of the organism’s genome, such as repetitive sequences and the number of chromosome copies.

Coverage Evaluation: The three peaks also help assess if you have sufficient coverage for each of the different genomic layers (e.g., heterozygous, homozygous, and additional copies).

Quality Check:

If the three peaks are well-defined and separated, it suggests that the sequencing data is of good quality, with accurate estimation of genome size and coverage.

A lack of clear separation between the peaks, or overlapping peaks, could indicate sequencing errors or insufficient data quality.

In Summary:

Three peaks in k-mer analysis often indicate polyploidy and correspond to different genomic regions (homozygous, heterozygous, and additional copies in polyploid organisms).

The first peak represents heterozygous regions, the second peak corresponds to homozygous regions, and the third peak could reflect the additional genome copies or repetitive sequences found in polyploid organisms.

Cheat Sheet

Download / Open Cheat Sheet (PDF)

Genome Assembly Workshop Handbook

Introduction to Genome Assembly with Oxford Nanopore Sequencing

Welcome to the Workshop

The Critical Importance of Draft Genome Assemblies

Foundation of Genomic Understanding

Democratizing Genomic Research

Enabling Personalized Medicine and Precision Agriculture

Oxford Nanopore Technology: Revolutionizing Genome Assembly

The Long-Read Advantage

Real-Time Sequencing and Portability

Direct DNA Sequencing and Methylation Detection

The Perfect Synergy: ONT and Genome Assembly

Simplified Assembly Workflows

Resolving Complex Genomic Features

Cost-Effective High-Quality Assemblies

Learning Objectives and Workshop Overview

Workshop: 1KSA Genome Assembly

What You Will Learn

Pre-requisites

Course Outline

Module 1 — Foundations

Module 2 — Getting Started on the HPC

Module 3 — Pipeline Overview & Setup

Module 4 — Running the Assembly Pipeline

Module 5 — Assembly Quality Assessment

Module 6 — Troubleshooting (optional / extended)

Module 7 — Automation & Wrap-up (optional / extended)

Getting started

1. Accessing the SAIAB server

Logging into SAIAB via SSH

2. Linux & SLURM Basics

2.1 Essential Linux commands

2.2 Managing sessions with screen

2.3 Understanding SLURM

2.4 SLURM commands

2.5 Job script example

2.6 Setting up your environment

3. Pipeline overview and download

3.1 Overview of pipeline

3.2 Clone the pipeline repository with the necessary scripts

3.3 Dependencies:

Running the pipeline

Data preparation

K-mer Analysis

Key metrics from the plot

Understanding the Plot

Peaks in the Histogram

Main Peak (Coverage ~18–20):

Right Tail (Coverage >~30):

Interpretation and Quality Insights

Assessing Read Quality

Trimming Reads with Chopper

Re-checking Read Quality After Trimming

Genome Assembly

Key Terms

Mapping and Polishing

Why map reads back to the assembly?

Why polish?

Genome Assembly Evaluation

Evaluating assembly quality and generating a report

Generate snail plots

Assembly Statistics (Concentric Rings)

Scaffold Statistics

BUSCO Results: Outstanding

Key Insights

Visual Elements

Run the entire pipeline in one script

Troubleshooting common issues

1. “No Peaks or Plateaus Detected” in K-mer Analysis

2. Diagnosing poor assembly results (low coverage, fragmentation)

3. Addressing issues with read quality and contamination

4. Handling computational errors and resource limitations

5. Low BUSCO completeness: improving assembly quality

1. If the assembly is fragmented (many short contigs)

2. If the assembly is smaller than expected genome size

3. Prioritizing longer Sequencing N50 vs. higher coverage of shorter reads

Appendices

Genome assembly pipeline parameters

Glossary

Kmer-Analysis Interpretation continued