Conda Practical — Installing ONT Bioinformatics Tools
Learning objectives
- Build a dedicated Conda environment for Oxford Nanopore (ONT) read QC
- Install real ONT tools from the Bioconda channel
- Run nanoq and seqkit on a real ONT FASTQ file and interpret the output
- Export the environment to a YAML file for reproducibility
Approximate time: 40 minutes
This is a hands-on practical that brings together everything from the Conda sessions. You will build a real working environment for quality-checking Oxford Nanopore sequencing data — using the same tools you will need in the ONT genome assembly course later this week.
Before you start: make sure Conda is installed and your Bioconda channels are configured (covered in the earlier Conda sessions). You should be on a compute node; your prompt should show [SLURM].
1. Create and activate the environment
Keeping each project in its own environment avoids dependency clashes. Create one named ont-qc with a specific Python version, then activate it:
conda create -n ont-qc python=3.10
conda activate ont-qc
Your prompt should now start with (ont-qc), showing the environment is active.
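The prompt is the quickest check, but a script cannot read your prompt. Conda also sets the CONDA_DEFAULT_ENV variable to the name of the active environment, so the same check can be sketched in shell. The function name env_is_active is our own illustration, not part of Conda:

```shell
# env_is_active NAME: succeed if NAME is the currently active conda environment.
# Relies on CONDA_DEFAULT_ENV, which `conda activate` sets in the current shell.
env_is_active() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if env_is_active ont-qc; then
    echo "ont-qc is active"
else
    echo "ont-qc is NOT active; run: conda activate ont-qc"
fi
```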
2. Install real ONT tools from Bioconda
Install two widely-used QC tools in a single command, pulling from both the bioconda and conda-forge channels:
conda install -c bioconda -c conda-forge nanoq seqkit
- nanoq: a fast QC tool built specifically for long-read (Nanopore / PacBio) data.
- seqkit: a general-purpose FASTA/FASTQ toolkit that works for both short and long reads.
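After installing, it is worth confirming that both commands are actually on your PATH before moving on. One possible sketch uses the standard `command -v` lookup; the helper name check_tools is hypothetical:

```shell
# check_tools TOOL...: report whether each named command can be found on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools nanoq seqkit || echo "Install the missing tools before continuing."
```

If a tool is reported missing, the usual cause is that the ont-qc environment is not active in the current shell.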
3. Run the tools on an ONT FASTQ file
A small subsampled ONT read set (~500 reads) is provided for this exercise in the course folder unix_lesson:
unix_lesson/ont_demo_reads.fastq.gz
Notice it is a .gz file — both tools read it compressed, no need to decompress.
Generate read statistics with nanoq (the -s flag prints a summary report):
nanoq -i unix_lesson/ont_demo_reads.fastq.gz -s
Now run seqkit stats on the same file:
seqkit stats unix_lesson/ont_demo_reads.fastq.gz
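As an independent cross-check of the read count and total bases, the same numbers can be derived with standard Unix tools. This is only a sketch: it assumes the FASTQ uses plain 4-line records (header, sequence, `+`, qualities), and the function name fastq_summary is our own:

```shell
# fastq_summary FILE.gz: print "<reads> <total_bases>" for a gzipped FASTQ.
# Assumes 4-line records, so the sequence is line 2 of every group of 4.
fastq_summary() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                         END { print reads + 0, bases + 0 }'
}

demo=unix_lesson/ont_demo_reads.fastq.gz
if [ -f "$demo" ]; then
    fastq_summary "$demo"
else
    echo "demo file not found: $demo"
fi
```

The two values printed should match the read count and total bases reported by nanoq and seqkit.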
4. Interpret the output
Both tools report the same kinds of summary numbers. The key ones for Nanopore QC:
- Number of reads: how many reads are in the file.
- Total bases: the total amount of sequence (your raw "yield").
- Read N50: the read length such that half of all sequenced bases are in reads of that length or longer. For long-read data, a higher read N50 generally makes a more contiguous assembly easier to achieve.
- Mean / median read length: typical read length.
- Mean quality: average per-read quality score (reported by nanoq).
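The N50 definition is easier to follow on a toy example. The sketch below sorts read lengths longest-first and walks down the list until the running total reaches half of all bases. It illustrates the definition only; it is not how nanoq computes N50 internally, and tools can differ slightly in edge cases.

```shell
# n50: read one read length per line on stdin and print the N50.
# Sort longest-first, then accumulate lengths until the running
# total covers at least half of all bases.
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Toy example: five reads totalling 20 bases.
# Longest-first: 6, then 6 + 5 = 11, which is at least half of 20.
printf '%s\n' 2 3 4 5 6 | n50    # prints 5
```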
Worked example — the demo dataset
Running nanoq on the provided demo file:
nanoq -i unix_lesson/ont_demo_reads.fastq.gz -s
produces this single line of summary statistics:
500 4807721 12335 47072 1025 9615 8548 30.0 30.0
Reading the columns left to right:
- Number of reads: 500
- Total bases: 4,807,721 (~4.8 Mb)
- Read N50: 12,335 bp
- Longest read: 47,072 bp
- Shortest read: 1,025 bp
- Mean read length: 9,615 bp
- Median read length: 8,548 bp
- Mean read quality: 30.0
- Median read quality: 30.0
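Because the summary is a single unlabelled line, it can help to attach names to the columns. A minimal sketch, using the line from the worked example and the column order listed above; the variable names are our own choice:

```shell
# The summary line from the worked example.
summary="500 4807721 12335 47072 1025 9615 8548 30.0 30.0"

# Split the line on whitespace into named variables (names are illustrative).
read -r reads bases read_n50 longest shortest mean_len median_len mean_q median_q <<EOF
$summary
EOF

printf 'reads:        %s\n' "$reads"
printf 'total bases:  %s\n' "$bases"
printf 'read N50:     %s bp\n' "$read_n50"
printf 'longest read: %s bp\n' "$longest"
printf 'mean quality: %s\n' "$mean_q"
```

On your own data, the first assignment could instead capture the tool's output, for example `summary=$(nanoq -i unix_lesson/ont_demo_reads.fastq.gz -s)`.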
Running seqkit stats on the same file gives the overlapping numbers (read count, total bases, and minimum, mean and maximum read length) in a labelled table:
seqkit stats unix_lesson/ont_demo_reads.fastq.gz
file                                 format  type  num_seqs    sum_len  min_len  avg_len  max_len
unix_lesson/ont_demo_reads.fastq.gz  FASTQ   DNA        500  4,807,721    1,025  9,615.4   47,072
Note how the two tools agree: num_seqs = 500 reads, sum_len = 4,807,721 bases, min_len = 1,025, avg_len = 9,615, max_len = 47,072. seqkit gives a quick labelled overview; nanoq adds long-read-specific metrics like N50 and quality.
So this small demo set has 500 long reads with a healthy N50 of ~12 kb and a mean quality around Q30 — exactly the kind of read length and quality you want going into an assembly.
Why this matters: these are exactly the numbers you check before attempting an assembly. Too few bases, short N50, or low quality all predict a poor assembly — so QC first, assemble second.
5. Export the environment for reproducibility
An analysis is only reproducible if someone else can rebuild the exact software environment. Export it to YAML two ways:
conda env export > ont-qc-environment.yml
conda env export --from-history > ont-qc-minimal.yml
- ont-qc-environment.yml: the full export, listing every package and exact version, including dependencies. Fully reproducible but platform-specific.
- ont-qc-minimal.yml: the --from-history export, listing only the packages you explicitly asked for. Cleaner and more portable across systems.
Open the YAML files and look at their structure — the environment name, the channels, and the dependency list:
cat ont-qc-minimal.yml
Anyone can later recreate the environment from either file, for example:
conda env create -f ont-qc-environment.yml
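A quick way to see how different the two exports are is to count the entries under `dependencies:` in each file. The sketch below assumes the usual layout of an exported file (a top-level `dependencies:` key followed by list items indented two spaces); count_deps is a hypothetical helper name:

```shell
# count_deps FILE: count top-level entries in the "dependencies:" list
# of a conda environment YAML file.
count_deps() {
    awk '
        /^dependencies:/   { in_deps = 1; next }   # start of the list
        /^[^ ]/            { in_deps = 0 }         # any other top-level key ends it
        in_deps && /^  - / { n++ }                 # one list item = one entry
        END { print n + 0 }
    ' "$1"
}

for f in ont-qc-environment.yml ont-qc-minimal.yml; do
    if [ -f "$f" ]; then
        echo "$f: $(count_deps "$f") dependencies"
    else
        echo "$f: not found (run the export commands first)"
    fi
done
```

The full export should list far more entries than the minimal one, because it also records every dependency that Conda pulled in for you.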
6. Useful extras
- mamba is a faster drop-in replacement for conda install, with the same arguments and much quicker dependency solving:
mamba install -c bioconda -c conda-forge nanoq seqkit
- Once tools are installed, free disk space by clearing cached package downloads:
conda clean --all
Bridge to the assembly course
You have now built a real environment, installed real ONT tools, run QC, and exported a reproducible recipe. This week, in the ONT genome assembly course, you will use these same tools plus flye (assembler) and medaka (polisher) — installed into a new environment using the exact workflow you just practiced here.
Exercises
- Create the ont-qc environment and confirm it is active (check your prompt).
- Install nanoq and seqkit, then run both on the demo FASTQ file.
- Compare the read count and N50 reported by nanoq and seqkit stats. Do they agree? (By default seqkit stats does not print N50; add the -a flag to include it.)
- Export both a full and a minimal YAML, then open each and describe how they differ.
SAIAB AGRP Bioinformatics Training. Open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0).