Conda Practical — Installing ONT Bioinformatics Tools

Learning objectives

Build a dedicated Conda environment for Oxford Nanopore (ONT) read QC
Install real ONT tools from the Bioconda channel
Run nanoq and seqkit on a real ONT FASTQ file and interpret the output
Export the environment to a YAML file for reproducibility

Approximate time: 40 minutes

This is a hands-on practical that brings together everything from the Conda sessions. You will build a real working environment for quality-checking Oxford Nanopore sequencing data — using the same tools you will need in the ONT genome assembly course later this week.

Before you start: make sure Conda is installed and your Bioconda channels are configured (covered in the earlier Conda sessions). You should be on a compute node — your prompt should show [SLURM].

1. Create and activate the environment

Keeping each project in its own environment avoids dependency clashes. Create one named ont-qc with a specific Python version, then activate it:

conda create -n ont-qc python=3.10
conda activate ont-qc

Your prompt should now start with (ont-qc), showing the environment is active.
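
If your prompt has been customised and does not show the prefix, you can also check the active environment from the shell. This is a minimal sketch that relies on the CONDA_DEFAULT_ENV variable, which conda sets on activation; the helper name active_env is made up for this example.

```shell
# Print the name of the active Conda environment, or "none" if nothing is active.
# CONDA_DEFAULT_ENV is set by `conda activate`; active_env is a hypothetical helper.
active_env() {
    echo "${CONDA_DEFAULT_ENV:-none}"
}

active_env
```

After activating the environment as above, this should print ont-qc.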

2. Install real ONT tools from Bioconda

Install two widely used QC tools in a single command, pulling from both the bioconda and conda-forge channels:

conda install -c bioconda -c conda-forge nanoq seqkit

nanoq — a fast QC tool built specifically for long-read (Nanopore / PacBio) data.
seqkit — a general-purpose FASTA/FASTQ toolkit that works for both short and long reads.
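
Before moving on, it is worth confirming that both tools are actually on your PATH. The sketch below is one way to do that; check_tools is a hypothetical helper written for this example, not part of Conda or either tool.

```shell
# Report whether each named tool can be found on the PATH.
# check_tools is a hypothetical helper, not part of Conda.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: NOT found (is the ont-qc environment active?)"
        fi
    done
}

check_tools nanoq seqkit
```

Once both are found, nanoq --version and seqkit version should print the installed versions.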

3. Run the tools on an ONT FASTQ file

A small subsampled ONT read set (~500 reads) is provided for this exercise in the course folder unix_lesson:

unix_lesson/ont_demo_reads.fastq.gz

Notice that it is a .gz file. Both tools read gzip-compressed input directly, so there is no need to decompress it first.
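
Standard Unix tools can work the same way, streaming the compressed file instead of unpacking it to disk. This is a sketch on a tiny made-up FASTQ (the file name toy.fastq.gz and its two reads are invented for illustration), and it assumes the usual four lines per FASTQ record.

```shell
# Build a two-read toy FASTQ and compress it (invented data, not the course file).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n' | gzip > toy.fastq.gz

# Stream-decompress and count records: each FASTQ record spans four lines.
n_reads=$(gzip -dc toy.fastq.gz | awk 'END { print NR / 4 }')
echo "reads: $n_reads"
```

Pointing the second command at the course file instead should give a read count you can compare with the tools' reports.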

Generate read statistics with nanoq (the -s flag prints a summary report):

nanoq -i unix_lesson/ont_demo_reads.fastq.gz -s

Now run seqkit stats on the same file:

seqkit stats unix_lesson/ont_demo_reads.fastq.gz

4. Interpret the output

The two tools report overlapping summary numbers. The key ones for Nanopore QC:

Number of reads: how many reads are in the file.
Total bases: the total amount of sequence (your raw "yield").
Read N50: the read length such that half of all sequenced bases are in reads of that length or longer. Longer reads span more repeats, so a higher read N50 generally makes a contiguous assembly easier to achieve.
Mean / median read length: typical read length.
Mean quality: the average per-read quality score (reported by nanoq).
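
To make the N50 definition concrete, here is a sketch that computes it by hand: sort the read lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The five lengths are made up for illustration.

```shell
# Toy read lengths in bp (invented); total = 10+8+6+4+2 = 30, so half = 15.
lengths="2 10 4 8 6"

n50=$(printf '%s\n' $lengths | sort -rn | awk '
    { len[NR] = $1; total += $1 }          # store sorted lengths, sum all bases
    END {
        running = 0
        for (i = 1; i <= NR; i++) {
            running += len[i]
            if (running >= total / 2) {    # first read that crosses half the bases
                print len[i]
                exit
            }
        }
    }')
echo "N50: $n50"
```

Here the longest read (10) covers only a third of the bases, but the two longest together (10 + 8 = 18) pass the halfway mark of 15, so the N50 is 8.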

Worked example — the demo dataset

Running nanoq on the provided demo file:

nanoq -i unix_lesson/ont_demo_reads.fastq.gz -s

produces this single line of summary statistics:

500 4807721 12335 47072 1025 9615 8548 30.0 30.0

Reading the columns left to right:

Number of reads: 500
Total bases: 4,807,721 (~4.8 Mb)
Read N50: 12,335 bp
Longest read: 47,072 bp
Shortest read: 1,025 bp
Mean read length: 9,615 bp
Median read length: 8,548 bp
Mean read quality: 30.0
Median read quality: 30.0
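
Because the summary is a bare line of numbers, it can help to label the columns yourself. The sketch below does that with awk, using the line from this worked example as input, and cross-checks the mean length against total bases divided by read count. The label names are chosen for this example and are not nanoq's own.

```shell
# The summary line from the worked example, in the column order listed above.
summary="500 4807721 12335 47072 1025 9615 8548 30.0 30.0"

labelled=$(echo "$summary" | awk '{
    printf "reads=%d bases=%d n50=%d longest=%d shortest=%d\n", $1, $2, $3, $4, $5
    printf "mean_len=%d median_len=%d mean_q=%.1f median_q=%.1f\n", $6, $7, $8, $9
    # Cross-check: mean length should equal total bases / number of reads.
    printf "bases/reads=%.1f\n", $2 / $1
}')
echo "$labelled"
```

The last line prints bases/reads=9615.4, which matches the reported mean read length of 9,615 bp.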

Running seqkit stats on the same file gives the overlapping numbers (read count, total bases, and minimum, mean and maximum read length) in a labelled table:

seqkit stats unix_lesson/ont_demo_reads.fastq.gz

file                                 format  type  num_seqs    sum_len  min_len  avg_len  max_len
unix_lesson/ont_demo_reads.fastq.gz  FASTQ   DNA        500  4,807,721    1,025  9,615.4   47,072

Note how the two tools agree: num_seqs = 500 reads, sum_len = 4,807,721 bases, min_len = 1,025, avg_len = 9,615, max_len = 47,072. seqkit stats gives a quick labelled overview; nanoq's summary adds read quality and N50 (seqkit stats reports N50 only when run with the -a flag).
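
The same comparison can be scripted rather than done by eye. This sketch takes the two outputs from the worked example as literal strings, pulls out the read count and total bases from each, and checks that they match; the variable names are made up for this example.

```shell
# Outputs copied from the worked example.
nanoq_line="500 4807721 12335 47072 1025 9615 8548 30.0 30.0"
seqkit_row="unix_lesson/ont_demo_reads.fastq.gz  FASTQ   DNA        500  4,807,721    1,025  9,615.4   47,072"

# nanoq: reads and bases are columns 1 and 2.
nanoq_pair=$(echo "$nanoq_line" | awk '{ print $1, $2 }')
# seqkit: strip thousands separators, then num_seqs and sum_len are columns 4 and 5.
seqkit_pair=$(echo "$seqkit_row" | tr -d ',' | awk '{ print $4, $5 }')

if [ "$nanoq_pair" = "$seqkit_pair" ]; then
    echo "agree: $nanoq_pair"
else
    echo "differ: nanoq=$nanoq_pair seqkit=$seqkit_pair"
fi
```

When working from live output, seqkit stats -T prints a tab-separated table that is easier to parse than the aligned default.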

So this small demo set has 500 long reads with a healthy N50 of ~12 kb and a mean quality around Q30 — exactly the kind of read length and quality you want going into an assembly.

Why this matters: these are exactly the numbers you check before attempting an assembly. Too few bases, short N50, or low quality all predict a poor assembly — so QC first, assemble second.
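
One way to turn "QC first, assemble second" into practice is a simple pass/fail check on the summary line. The sketch below shows the idea only: the three thresholds are placeholders invented for this example, not course recommendations, and sensible values depend on your genome size and project.

```shell
# Placeholder thresholds for illustration only; choose real ones per project.
min_bases=1000000
min_n50=5000
min_quality=10

# Summary line from the worked example: bases, N50 and mean quality
# are columns 2, 3 and 8.
summary="500 4807721 12335 47072 1025 9615 8548 30.0 30.0"

verdict=$(echo "$summary" | awk -v b="$min_bases" -v n="$min_n50" -v q="$min_quality" '{
    if ($2 >= b && $3 >= n && $8 >= q) print "PASS"
    else print "FAIL"
}')
echo "QC verdict: $verdict"
```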

5. Export the environment for reproducibility

An analysis is only reproducible if someone else can rebuild the exact software environment. Export it to YAML in two ways:

conda env export > ont-qc-environment.yml
conda env export --from-history > ont-qc-minimal.yml

ont-qc-environment.yml — the full export: every package and exact version, including dependencies. Fully reproducible but platform-specific.
ont-qc-minimal.yml — the --from-history export: only the packages you explicitly asked for. Cleaner and more portable across systems.

Open the YAML files and look at their structure — the environment name, the channels, and the dependency list:

cat ont-qc-minimal.yml

Anyone can later recreate the environment from either file, for example from the full export:

conda env create -f ont-qc-environment.yml
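
For orientation, the sketch below writes a hand-made file with the typical shape of a minimal export and then lists only the requested packages. The file example-minimal.yml is illustrative, not real conda output: your own export will list the channels from your Conda configuration and usually ends with a machine-specific prefix line.

```shell
# Write an illustrative minimal environment file (hand-written for this sketch,
# not produced by conda).
cat > example-minimal.yml <<'EOF'
name: ont-qc
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - nanoq
  - seqkit
EOF

# List just the requested packages: the indented "- " entries under dependencies.
deps=$(awk '/^dependencies:/ { in_deps = 1; next }
            /^[^ ]/          { in_deps = 0 }
            in_deps          { sub(/^ *- */, ""); print }' example-minimal.yml)
echo "$deps"
```

Running the same awk command on your full export should print a much longer list, because it includes every dependency with its exact version and build.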

6. Useful extras

mamba is a drop-in replacement for the conda command. If it is available on your system, it takes the same arguments and typically resolves dependencies much faster:
```
mamba install -c bioconda -c conda-forge nanoq seqkit
```
Once tools are installed, free disk space by clearing cached package downloads:
```
conda clean --all
```

Bridge to the assembly course

You have now built a real environment, installed real ONT tools, run QC, and exported a reproducible recipe. This week, in the ONT genome assembly course, you will use these same tools plus flye (assembler) and medaka (polisher) — installed into a new environment using the exact workflow you just practiced here.

Exercises

Create the ont-qc environment and confirm it is active (check your prompt).
Install nanoq and seqkit, then run both on the demo FASTQ file.
Compare the read count and total bases reported by nanoq and seqkit stats. Do they agree? (To get an N50 from seqkit as well, run seqkit stats -a.)
Export both a full and a minimal YAML, then open each and describe how they differ.

SAIAB AGRP Bioinformatics Training. Open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0).