South African Institute for Aquatic Biodiversity AGRP Bioinformatics Training
SAIAB Workshop

Introduction to Unix for Bioinformaticians

Essential command-line, software management, and HPC skills for bioinformatics research at SAIAB

2-Day Workshop
~12 Hours
Beginner Level

Course Overview

This two-day course provides a practical introduction to the Unix command line and essential software management skills for bioinformatics research. You will learn to navigate the Unix file system, manipulate bioinformatics data files, write shell scripts for automation, manage software environments with Conda, and submit jobs on the SAIAB lab417 compute cluster using SLURM.

By the end of this course, you will be equipped with the foundational computing skills required to run bioinformatics analyses efficiently and reproducibly.

Learning Objectives

Prerequisites

No prior command-line experience is required. You should have a SAIAB lab417 account (or training account) and a laptop with an SSH client installed (Terminal on macOS; MobaXterm on Windows).

Day 1 — Unix Fundamentals & Shell Scripting

Day 1 covers the core Unix command-line skills: navigating the file system, working with files, understanding file permissions and environment variables, searching data, writing scripts, and using variables and loops for automation.

Schedule
1

Introduction to the Shell

Log in to the lab417 cluster and understand HPC architecture — login nodes vs. compute nodes, shared storage, and the SLURM scheduler.

Topics covered: Logging in via SSH (ssh username@lab417.saiab.ac.za). Starting an interactive session with srun. Copying exercise data with cp -r. Navigating the file system: cd, ls, ls -l, ls -F, pwd. Understanding the root directory (/) and home directory (~). Full vs. relative paths. Using man pages for help. Tab completion for efficient typing. Copying (cp), creating directories (mkdir), moving/renaming (mv), and removing (rm, rm -ri) files and directories.
30 min
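
The navigation and file-management commands listed above can be tried safely in a scratch directory. The sketch below is illustrative only: the directory and file names (unix_demo, raw_data, sample1.fq) are hypothetical and are not the workshop's exercise data, and touch is used simply to create an empty placeholder file.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"
pwd                                  # full (absolute) path of the current directory

mkdir unix_demo                      # create a directory
cd unix_demo                         # relative path: resolved from where we are now
mkdir raw_data
touch raw_data/sample1.fq            # create an empty placeholder file

cp raw_data/sample1.fq backup.fq     # copy a file
mv backup.fq sample1_copy.fq         # rename (move) a file
cp -r raw_data raw_data_copy         # copy a whole directory
ls -F                                # a trailing / marks directories

rm sample1_copy.fq                   # remove a file (there is no undo)
rm -r raw_data_copy                  # remove a directory and its contents
cd ..                                # go up one level, back to the scratch directory
```

Adding -i to rm (for example rm -ri raw_data_copy) asks for confirmation before each deletion, which is a sensible habit while learning.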
2

Wildcards & Shortcuts

Learn time-saving shortcuts for navigating and selecting files on the command line.

Topics covered: The * wildcard, which matches any number of characters (e.g. ls *fq, ls Mov10*fq). The ? wildcard, which matches exactly one character (e.g. ls /bin/d?). The home directory shortcut (~), the parent directory (..), and the current directory (.). Command history with the up/down arrow keys and the history command. Cancelling commands with Ctrl+C. Jumping to the start/end of the line with Ctrl+A / Ctrl+E. Tab completion for file and directory names.
30 min
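
A short sketch of how the wildcards behave, using empty files with made-up names (only the Mov10 prefix comes from the lesson's examples). Printing a pattern with echo is a convenient way to see that the shell expands the wildcard into file names before the command runs.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical file names, created empty just to have something to match.
touch Mov10_oe_1.fq Mov10_oe_2.fq Irrel_kd_1.fq notes.txt

ls *fq              # * matches any number of characters: all three .fq files
ls Mov10*fq         # only names that start with Mov10 and end in fq
ls Mov10_oe_?.fq    # ? matches exactly one character: replicates 1 and 2

# echo shows what the shell turns the pattern into before running a command.
mov10_files=$(echo Mov10*fq)
echo "$mov10_files"
```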
3

Examining & Creating Files

View, create, and edit files from the command line using a variety of tools.

Topics covered: Printing file contents with cat (e.g. viewing FASTA files). Paging through large files with less — navigation keys (SPACE, b, g, G, q) and searching with /. Viewing the start and end of files with head -n and tail -n. Understanding the FASTQ file format (4-line records: header, sequence, separator, quality). Checking file sizes with ls -lh. Introduction to the Nano text editor — opening, navigating (Ctrl+A/E/Y/V), editing, cutting/pasting (Ctrl+K/U), saving (Ctrl+O), exiting (Ctrl+X), and search/replace (Ctrl+W, Ctrl+\).
30 min
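
To make the 4-line FASTQ record structure concrete, the sketch below writes a tiny file with two invented reads and inspects it with head, tail, and ls -lh. The file name and read contents are assumptions for illustration; the here-document (cat > file <<'EOF') is only a convenient way to create the file without opening an editor.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write a tiny two-read FASTQ file (made-up reads, for illustration only).
cat > example.fq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
NNNNNNNNNN
+
##########
EOF

head -n 4 example.fq     # first record: header, sequence, separator, quality
tail -n 4 example.fq     # last record
ls -lh example.fq        # file size in human-readable units

# Each record is exactly 4 lines, so reads = total lines / 4.
n_lines=$(wc -l < example.fq)
n_reads=$((n_lines / 4))
echo "Reads: $n_reads"
```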
4

Permissions & Environment Variables

Understand how Unix controls access to files and how the shell environment is configured.

Topics covered: File ownership — user, group, and others. The groups command. Reading the permissions string (e.g. -rw-rw-r--): r (read), w (write), x (execute/traverse). Interpreting permissions for files vs. directories. Changing permissions with chmod — adding (+) and removing (-) permissions (e.g. chmod o-r file, chmod u+x script.sh). Checking directory sizes with du -sh. Environment variables: $HOME, $PATH, $USER. Viewing with echo and printenv. Adding directories to $PATH using export. The .bashrc file for persistent configuration.
40 min
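
The permission and $PATH ideas above can be seen working together in a few lines. This is a sketch under assumptions: hello.sh is a hypothetical script, and the exact permission string you see first depends on your account's default settings (umask).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A two-line script (hypothetical name) to practise on.
printf '#!/bin/bash\necho "hello from hello.sh"\n' > hello.sh
ls -l hello.sh             # typically -rw-r--r-- : no x yet, so it cannot be executed

chmod u+x hello.sh         # add execute permission for the owner (user)
chmod o-r hello.sh         # remove read permission for others
ls -l hello.sh             # owner part is now rwx; others can no longer read

./hello.sh                 # run the script by giving its path

# Put the directory on PATH so the script can be run by name alone.
export PATH="$workdir:$PATH"
hello_output=$(hello.sh)
echo "$hello_output"

echo "$HOME"               # your home directory
```

The export only lasts for the current session; adding the same export line to ~/.bashrc is what makes it persistent.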
☕ Tea/Coffee Break — 15 min
5

Searching & Redirection

Search within files and chain commands together — one of the most powerful concepts in Unix.

Topics covered: Searching files with grep — basic pattern matching in FASTQ files (e.g. finding reads with NNNNNNNNNN). Using grep options: -B/-A for context lines, -n for line numbers, -c for counts, -v for inverted matches, --no-group-separator. Output redirection: writing to files with > and appending with >>. Piping with | — combining grep with less, head, wc -l. The wc command for counting lines, words, and characters. Extracting columns with cut -f. Sorting with sort and removing duplicates with sort -u. Practical exercise: counting unique exons in a GTF gene annotation file by chaining grep, cut, sort -u, and wc -l. Introduction to awk for column-based processing — printing specific columns (awk '{print $1, $3}'), filtering rows by column value (awk '$3 > 100'), and combining with pipes (cat file.txt | awk '{print $2}' | sort | uniq -c). Practical example: extract read names and lengths from a FASTQ file using awk.
60 min
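
The unique-exon exercise can be sketched on a four-line annotation. The file below is invented for illustration (it follows the GTF column order but is not the workshop's annotation), and defining an exon as unique by chromosome, start, end, and strand is an assumption about how the exercise counts them.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A made-up, tab-separated annotation in GTF column order:
# chrom, source, feature, start, end, score, strand, frame, attributes
printf 'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n' > demo.gtf
printf 'chr1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> demo.gtf
printf 'chr1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n' >> demo.gtf
printf 'chr1\tdemo\texon\t500\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> demo.gtf

# Keep exon rows, keep only the coordinate columns, drop duplicates, count.
n_unique_exons=$(grep "exon" demo.gtf | cut -f1,4,5,7 | sort -u | wc -l)
echo "Unique exons: $n_unique_exons"

# How many rows of each feature type are there?
cut -f3 demo.gtf | sort | uniq -c

# awk: print feature type and length for features longer than 300 bases.
awk -F'\t' '$5 - $4 + 1 > 300 {print $3, $5 - $4 + 1}' demo.gtf > long_features.txt
cat long_features.txt
```

The same exon appears in two transcripts here, which is why sort -u is needed before counting.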
🍽️ Lunch Break — 60 min
6

Shell Scripts & Variables

Capture commands in reusable scripts and introduce variables for flexible automation.

Topics covered: What is a shell script — text files containing commands, .sh extension convention. Creating a simple script with nano (e.g. listing.sh) combining pwd, ls -l, and echo. Running scripts with sh or bash. Bash variables — defining (num=25), referencing ($num), rules (no spaces around =). Using variables as input to commands (e.g. wc -l $file). The basename command for extracting filenames and trimming extensions. Assigning command output to variables using backticks (`basename ...`). Adding comments with # for documentation. Building a practical directory_info.sh script that takes a directory path, reports its contents, and counts files.
60 min
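
One possible shape for a script like directory_info.sh is sketched below. It is not the workshop's actual script: reading the directory from the first command-line argument ($1), the variable names, and the demo_dir test directory are all assumptions. It uses $(...) to capture command output, which does the same job as backticks.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir demo_dir
touch demo_dir/a.fq demo_dir/b.fq demo_dir/notes.txt

# Write the script with a here-document (in class you would type it in nano).
cat > directory_info.sh <<'EOF'
#!/bin/bash
# Usage: bash directory_info.sh <directory>
# Reports the contents of a directory and how many entries it holds.

dir=$1                         # first command-line argument
name=$(basename "$dir")        # last path component, e.g. demo_dir

echo "Contents of $name:"
ls -l "$dir"

count=$(ls "$dir" | wc -l)     # $(...) captures command output, like backticks
echo "Number of entries: $count"
EOF

bash directory_info.sh "$workdir/demo_dir"

# basename can also trim a known extension.
sample=$(basename demo_dir/a.fq .fq)
echo "$sample"
```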
7

Loops & Automation

Iterate commands over multiple files and build automated analysis scripts.

Topics covered: The for loop structure — for, in, do, done keywords. Step-by-step loop execution: variable initialisation, body execution, reassignment. Using wildcards in loop lists (e.g. for file in Mov10*.fq). Best practices: meaningful variable names, not using ls in loop definitions. Combining loops with echo, wc -l, grep, and redirection. Building a complete automation script (generate_bad_reads_summary.sh): the shebang line (#!/bin/bash), iterating over FASTQ files, using basename to generate output file prefixes, extracting bad reads with grep, counting and logging results to a summary file.
60 min
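
The loop at the heart of a script like generate_bad_reads_summary.sh might look roughly like the sketch below. This is an illustration, not the workshop's script: the input files are invented, and the output names (the _badreads.fq suffix and bad_reads_summary.txt) are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny made-up FASTQ files; sample_A contains one read that is all Ns.
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nNNNNNNNNNN\n+\n##########\n' > sample_A.fq
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n' > sample_B.fq

for file in *.fq                       # the shell expands the wildcard; no ls needed
do
    prefix=$(basename "$file" .fq)     # sample_A.fq -> sample_A
    echo "Processing $file"

    # Save each bad read as a full 4-line record (1 line before, 2 after the match).
    grep -B1 -A2 NNNNNNNNNN "$file" > "${prefix}_badreads.fq"

    # Log the number of bad reads for this file.
    bad=$(grep -c NNNNNNNNNN "$file")
    echo "$prefix $bad" >> bad_reads_summary.txt
done

cat bad_reads_summary.txt
```

The wildcard is expanded once, before the loop starts, so the _badreads.fq files created inside the loop are not themselves processed on this run.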
Day 1 — Exercise Answer Key

Solutions to all Day 1 exercises are available here: Day 1 Answer Key

Day 2 — HPC, Data Handling & Conda

Day 2 begins with HPC and SLURM job scheduling on the lab417 cluster, then covers essential data handling skills — file transfer, compressed files, and keeping jobs running with screen — before moving into Conda for managing bioinformatics software environments. The day closes with a hands-on practical installing real Oxford Nanopore tools, bridging directly into the ONT genome assembly course later this week.

~4 h of content plus breaks (starts 09:00, finishes ~14:30).

Schedule
1

Introduction to HPC & SLURM

Understand cluster computing and learn to submit jobs on the lab417 server.

Topics covered: Computer components review — CPU (cores, processors), data storage (HDD vs. SSD), memory (RAM). HPC architecture: nodes, shared storage, why the cluster is more powerful. Parallelisation concepts — serial vs. parallel processing, multithreading vs. true parallelisation. SLURM architecture: slurmctld (controller) and slurmd (node daemons), partitions, jobs, job steps. Key SLURM commands: sinfo (cluster status), squeue (job queue), srun (interactive jobs), sbatch (batch submission), scancel (cancel jobs), scontrol show job. Interactive jobs vs. batch jobs. Writing SLURM batch scripts with #SBATCH directives: --job-name, --output, --error, --cpus-per-task, --mem, --time, --partition=agrp. The lab417 agrp partition. Being a good cluster citizen — resource requests, monitoring, and etiquette.
60 min
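
One way to keep resource requests and actual usage in step is to read the thread count from SLURM_CPUS_PER_TASK, an environment variable SLURM sets inside a job when --cpus-per-task is requested. The sketch below writes such a batch script and checks it locally; the script name sort_bam.sh and the resource values are hypothetical, and the samtools line is left as a comment so the example runs without SLURM or samtools installed.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > sort_bam.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=sort_bam
#SBATCH --output=sort_bam_%j.out
#SBATCH --error=sort_bam_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=0-1:00:00
#SBATCH --partition=agrp

# Inside a job SLURM sets SLURM_CPUS_PER_TASK; fall back to 1 outside SLURM.
threads=${SLURM_CPUS_PER_TASK:-1}
echo "Using $threads thread(s)"

# The real work would go here, for example:
#   samtools sort -@ "$threads" -o sorted.bam input.bam
EOF

bash -n sort_bam.sh && echo "syntax OK"   # check the syntax without running anything
bash sort_bam.sh                          # run locally to test the logic

# On the cluster you would submit and monitor it with:
#   sbatch sort_bam.sh
#   squeue -u $USER
```

Because the tool reads its thread count from the same variable as the request, changing --cpus-per-task is the only edit needed to scale the job up or down.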
2

File Transfer — Moving Data On and Off the Server

Transfer sequencing data and results between your local computer and the lab417 server using scp and rsync.

Topics covered: Copying a file from your local machine to the server (scp localfile.fastq username@lab417.saiab.ac.za:~/data/). Copying a file from the server to your local machine (scp username@lab417.saiab.ac.za:~/results/assembly.fasta ./). Copying a directory recursively with scp -r. Introduction to rsync: smarter transfers that only copy what has changed (rsync -avh username@lab417.saiab.ac.za:~/results/ ./results/). Key rsync flags: -a (archive), -v (verbose), -h (human-readable), --progress (show transfer progress), --dry-run (preview without copying). When to use scp vs. rsync: scp for quick single files, rsync for directories and repeated transfers. Practical exercise: transfer a FASTQ file from the server to your local machine using both scp and rsync.
20 min
3

Compressed Files and Keeping Jobs Running with screen

Work with compressed sequencing files and keep long-running jobs alive after disconnecting from the server.

Topics covered: Why sequencing data is usually stored compressed: storage and transfer efficiency. Compressing and decompressing files with gzip and gunzip (gzip file.fastq produces file.fastq.gz; gunzip file.fastq.gz restores it). Viewing compressed files without decompressing (zcat file.fastq.gz | head). Piping compressed files directly into tools (zcat reads.fastq.gz | grep "^@" | wc -l). Archiving directories with tar: create (tar -czf results.tar.gz results/), extract (tar -xzf results.tar.gz), list contents (tar -tzf results.tar.gz). Persistent terminal sessions with screen: the problem that SSH disconnects kill running jobs; starting a session (screen, screen -S jobname); detaching with Ctrl+A then D; listing sessions (screen -ls); reattaching (screen -r jobname); closing with exit. Practical: start a screen session, run a long command, detach, log out, log back in, and reattach.
25 min
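
The compression and archiving commands above can be rehearsed on a tiny invented file. The sketch assumes a Linux system (where zcat reads .gz files) and uses hypothetical file names; the -C option to tar, which chooses where to extract, is an addition not listed in the topics. It also shows why counting lines that start with @ can over-count reads: a quality string may legitimately begin with @.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up FASTQ file. The quality string of read2 starts with "@"
# on purpose, which is legal in FASTQ.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@III\n' > reads.fastq

gzip reads.fastq                      # replaces reads.fastq with reads.fastq.gz
zcat reads.fastq.gz | head -n 4       # peek at the first record without unpacking to disk

# Counting lines that start with "@" over-counts here because of the quality line.
at_lines=$(zcat reads.fastq.gz | grep -c "^@")
# Dividing the total line count by 4 is the more robust way to count reads.
n_reads=$(( $(zcat reads.fastq.gz | wc -l) / 4 ))
echo "lines starting with @: $at_lines, reads: $n_reads"

# Bundle a results directory into one archive, list it, and unpack it elsewhere.
mkdir results
cp reads.fastq.gz results/
tar -czf results.tar.gz results/
tar -tzf results.tar.gz
mkdir unpacked
tar -xzf results.tar.gz -C unpacked
```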
☕ Tea/Coffee Break — 15 min
4

Introduction to Conda

Install and configure Conda for managing bioinformatics software without admin rights.

Topics covered: What is Conda: package and environment management in user space, no root required. Why Conda: dependency management, environment isolation, cross-platform, scientific computing focus. Choosing a distribution: Miniconda (recommended, minimal) vs. Anaconda (full). Installing Miniconda on Linux: downloading, running the installer, initialising with source ~/.bashrc. Verifying installation: conda --version, conda info, conda list. Understanding the base environment. Setting up bioinformatics channels by running conda config --add channels once per channel (bioconda, conda-forge, defaults). Setting strict channel priority. Searching and installing packages: conda search samtools, conda install -c bioconda samtools. Installing specific versions for reproducibility (e.g. samtools=1.15).
30 min
5

Conda Environments for Bioinformatics

Create isolated environments for different analyses — the key to reproducible bioinformatics.

Topics covered: Why environments are critical — dependency conflicts, version-specific tool behaviour. Environment patterns: project-specific (cancer-rnaseq-2024), analysis-specific (RNA-seq, variant calling, assembly), pipeline-specific. Creating environments: conda create --name rnaseq python=3.9. Activating (conda activate) and deactivating (conda deactivate). Installing bioinformatics tools into environments — alignment (bwa, star, minimap2), variant calling (gatk4, bcftools), assembly (flye, spades), QC (fastqc, multiqc). Version pinning for reproducibility (e.g. samtools=1.15). Listing environments (conda env list) and packages (conda list). Exporting to YAML: conda env export > environment.yml. Recreating from YAML: conda env create -f environment.yml. Minimal exports with --from-history. Sharing environments for reproducible research.
45 min
🍽️ Lunch Break — 60 min
6

Conda Practical — Installing ONT Bioinformatics Tools

Apply everything learned about Conda by building a real environment for Oxford Nanopore sequencing data QC — the same tools used in the ONT genome assembly course later this week.

Topics covered: Create a dedicated environment for ONT QC work (conda create -n ont-qc python=3.10, conda activate ont-qc). Install real ONT tools from Bioconda (conda install -c bioconda -c conda-forge nanoq seqkit). Run nanoq on a provided ONT FASTQ file to generate read statistics (nanoq -i reads.fastq.gz -s). Run seqkit stats on the same file (seqkit stats reads.fastq.gz). Inspect and interpret the output — read count, N50, mean quality. Export the environment for reproducibility (conda env export > ont-qc-environment.yml, conda env export --from-history > ont-qc-minimal.yml). View the YAML and understand its structure. Using mamba as a faster drop-in replacement for conda install. Housekeeping to free disk space (conda clean --all). Bridge to the assembly course: this week you will use these same tools plus Flye and Medaka — installed in a new environment on Day 1 of the assembly course using the same workflow you just practiced.
40 min
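
To give a feel for the YAML structure before you export your own, the sketch below writes a hand-made stand-in for a minimal environment file and pulls out its dependency list. The contents are an assumption modelled on the packages named in this practical; a real conda env export --from-history will differ in detail (for example, it also records a prefix line).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hand-written stand-in for a minimal export; real output will differ in detail.
cat > ont-qc-minimal.yml <<'EOF'
name: ont-qc            # environment name used by "conda activate"
channels:               # where packages are searched for, highest priority first
  - conda-forge
  - bioconda
dependencies:           # packages that were explicitly requested
  - python=3.10
  - nanoq
  - seqkit
EOF

# List just the requested packages (the entries under "dependencies:").
deps=$(awk '/^dependencies:/ {found = 1; next} found && /^  - / {print $2}' ont-qc-minimal.yml)
echo "$deps"

# On a machine with Conda installed, the environment could then be rebuilt with:
#   conda env create -f ont-qc-minimal.yml
```

A full export (without --from-history) lists every package with exact versions and builds, which is more reproducible on the same platform but less portable between operating systems.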
☕ Tea/Coffee Break — 15 min
7

Wrap-up and Q&A

Bringing it all together — reproducibility, documentation, and next steps into the ONT assembly course.

Topics covered: Reproducibility recap — pin tool versions during active analysis and export environment.yml with publications. Recap of the day's data-handling skills: file transfer, compressed files, screen, and Conda environments. The CHPC alternative: PBS job scheduling on Lengau. Open Q&A and bridge into the ONT genome assembly course later this week.
20 min

Quick Reference

Essential Unix Commands

# Navigation
pwd                     # print working directory
ls -lh                  # list files with sizes
cd directory/           # change directory
cd ~                    # go home
cd ..                   # go up one level

# Files
cp source dest          # copy
mv source dest          # move or rename
rm file                 # remove (careful!)
mkdir dirname           # create directory

# Viewing
cat file                # print file contents
less file               # page through file
head -n 20 file         # first 20 lines
tail -n 20 file         # last 20 lines

# Searching & Processing
grep "pattern" file     # search for pattern
grep -c "pattern" file  # count matches
wc -l file              # count lines
sort file               # sort lines
sort file | uniq        # remove duplicate lines (uniq only drops adjacent repeats, so sort first)

# Redirection & Pipes
command > file          # write output to file
command >> file         # append output to file
cmd1 | cmd2             # pipe output of cmd1 into cmd2

# Permissions
chmod u+x script.sh     # make script executable
chmod o-r file          # remove read for others

# Variables
var="value"             # assign
echo $var               # use

Essential Conda Commands

# Environment Management
conda create -n myenv python=3.9        # create environment
conda activate myenv                    # activate
conda deactivate                        # deactivate
conda env list                          # list environments
conda env remove -n myenv               # remove environment

# Package Management
conda install -c bioconda samtools      # install from bioconda
conda install -c bioconda "bwa=0.7.17"  # specific version
conda search -c bioconda toolname       # search for packages
conda list                              # list installed packages
conda update toolname                   # update package
mamba install -c bioconda toolname      # faster alternative

# Reproducibility
conda env export > environment.yml             # export environment
conda env create -f environment.yml            # recreate from file
conda env export --from-history > minimal.yml  # minimal export

# Maintenance
conda clean --all                       # clear package cache

File Transfer & Compression

# Copy file to server
scp file.fastq username@lab417.saiab.ac.za:~/data/

# Copy file from server
scp username@lab417.saiab.ac.za:~/results/file.fasta ./

# Sync directory with rsync
rsync -avh --progress username@lab417.saiab.ac.za:~/results/ ./results/

# Compress / decompress
gzip file.fastq
gunzip file.fastq.gz
zcat file.fastq.gz | head -8

# Archive a directory
tar -czf archive.tar.gz directory/
tar -xzf archive.tar.gz

# screen — persistent sessions
screen -S mysession       # start named session
# Ctrl+A then D           # detach
screen -ls                # list sessions
screen -r mysession       # reattach

Essential SLURM Commands

# Interactive Session
srun --cpus-per-task=1 -t 0-3:00 --mem 1G --pty /bin/bash

# Job Submission
sbatch myscript.sh          # submit batch job
squeue -u $USER             # check your jobs
scancel JOBID               # cancel a job
sinfo                       # cluster/partition status

# Batch Script Template
#!/bin/bash
#SBATCH --job-name=myanalysis
#SBATCH --output=myanalysis_%j.out
#SBATCH --error=myanalysis_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=0-4:00:00
#SBATCH --partition=agrp

# Your commands here
echo "Job started at $(date)"
samtools sort input.bam -o sorted.bam -@ 4
echo "Job finished at $(date)"

Resources

Cheat Sheets

Online Tutorials

Shell
Explain Shell — paste any command for a visual breakdown
Conda
Bioconda Project — 6000+ bioinformatics packages

CHPC Users

If you work on the CHPC Lengau cluster (PBS scheduler), refer to the CHPC-specific lesson for PBS job submission and module usage.