Course Overview
This two-day course provides a practical introduction to the Unix command line and essential software management skills for bioinformatics research. You will learn to navigate the Unix file system, manipulate bioinformatics data files, write shell scripts for automation, manage software environments with Conda, and submit jobs on the SAIAB lab417 compute cluster using SLURM.
By the end of this course, you will be equipped with the foundational computing skills required to run bioinformatics analyses efficiently and reproducibly.
Learning Objectives
- Navigate the Unix file system and work confidently at the command line
- Create, examine, search, and manipulate text-based bioinformatics files
- Use wildcards, pipes, and redirection to chain commands efficiently
- Write reusable shell scripts with variables and loops
- Understand file permissions and environment variables
- Install and manage bioinformatics software using Conda environments
- Understand HPC architecture and submit jobs using SLURM on lab417
- Transfer data between local and remote systems using scp and rsync
- Work with compressed sequencing files using gzip, zcat, and tar
- Maintain persistent terminal sessions with screen to keep jobs running
- Install and run real ONT bioinformatics tools in a Conda environment
- Apply best practices for reproducibility in computational research
Prerequisites
No prior command-line experience is required. You should have a SAIAB lab417 account (or training account) and a laptop with an SSH client installed (Terminal on macOS; MobaXterm on Windows).
Day 1 — Unix Fundamentals & Shell Scripting
Day 1 covers the core Unix command-line skills: navigating the file system, working with files, understanding file permissions and environment variables, searching data, writing scripts, and using variables and loops for automation.
Introduction to the Shell
Log in to the lab417 cluster and understand HPC architecture — login nodes vs. compute nodes, shared storage, and the SLURM scheduler.
Connecting to the cluster with SSH (ssh username@lab417.saiab.ac.za). Starting an interactive session with srun. Copying exercise data with cp -r. Navigating the file system: cd, ls, ls -l, ls -F, pwd. Understanding the root directory (/) and home directory (~). Full vs. relative paths. Using man pages for help. Tab completion for efficient typing. Copying (cp), creating directories (mkdir), moving/renaming (mv), and removing (rm, rm -ri) files and directories.
Wildcards & Shortcuts
Learn time-saving shortcuts for navigating and selecting files on the command line.
The * wildcard: matching any number of characters (e.g. ls *fq, ls Mov10*fq). The ? wildcard: matching exactly one character (e.g. ls /bin/d?). Home directory shortcut (~). Parent directory (..) and current directory (.). Command history with the up/down arrow keys and the history command. Cancelling commands with Ctrl+C. Jumping to the start/end of a line with Ctrl+A / Ctrl+E. Tab completion for file and directory names.
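As a rough illustration of the two wildcards, the sketch below creates a few empty files in a temporary directory and expands patterns against them. The file names are hypothetical, modelled on the Mov10*fq examples above, and are not the course data.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical file names for illustration only.
touch Mov10_oe_1.fq Mov10_oe_2.fq Irrel_kd_1.fq notes.txt

# * matches any number of characters (including none).
all_fq=$(echo *fq)          # every name ending in "fq"
mov10_fq=$(echo Mov10*fq)   # only names starting with "Mov10"

# ? matches exactly one character.
one_char=$(echo Mov10_oe_?.fq)

echo "All FASTQ files:   $all_fq"
echo "Mov10 FASTQ files: $mov10_fq"
echo "Single-char match: $one_char"
```

The shell expands the pattern before the command runs, so echo (or ls) only ever sees the list of matching names.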
Examining & Creating Files
View, create, and edit files from the command line using a variety of tools.
Viewing file contents with cat (e.g. FASTA files). Paging through large files with less: navigation keys (SPACE, b, g, G, q) and searching with /. Viewing the start and end of files with head -n and tail -n. Understanding the FASTQ file format (4-line records: header, sequence, separator, quality). Checking file sizes with ls -lh. Introduction to the Nano text editor: opening, navigating (Ctrl+A/E/Y/V), editing, cutting/pasting (Ctrl+K/U), saving (Ctrl+O), exiting (Ctrl+X), and search/replace (Ctrl+W, Ctrl+\).
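To make the 4-line FASTQ record concrete, here is a minimal sketch that writes a made-up two-read file and inspects it with head, tail, wc, and ls -lh. The reads and file name are invented for illustration.

```shell
workdir=$(mktemp -d)
fastq="$workdir/example.fastq"

# A made-up FASTQ file: each record is exactly four lines
# (header, sequence, separator, quality).
cat > "$fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
NNNNNNNNNN
+
##########
EOF

head -n 4 "$fastq"      # first record
tail -n 4 "$fastq"      # last record

# Number of reads = number of lines / 4.
n_lines=$(wc -l < "$fastq")
n_reads=$(( n_lines / 4 ))
echo "$n_reads reads"

ls -lh "$fastq"         # human-readable file size
```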
Permissions & Environment Variables
Understand how Unix controls access to files and how the shell environment is configured.
Checking group membership with the groups command. Reading the permissions string (e.g. -rw-rw-r--): r (read), w (write), x (execute/traverse). Interpreting permissions for files vs. directories. Changing permissions with chmod: adding (+) and removing (-) permissions (e.g. chmod o-r file, chmod u+x script.sh). Checking directory sizes with du -sh. Environment variables: $HOME, $PATH, $USER. Viewing with echo and printenv. Adding directories to $PATH using export. The .bashrc file for persistent configuration.
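The following sketch shows how chmod and $PATH fit together: a script becomes runnable by name once it is executable and its directory is on $PATH. The script name hello.sh and the temporary directory are assumptions made for this example.

```shell
workdir=$(mktemp -d)
script="$workdir/hello.sh"

# A tiny script; new files are normally created without execute permission.
printf '#!/bin/bash\necho "hello from the shell"\n' > "$script"

ls -l "$script"           # permissions string, e.g. -rw-r--r-- (depends on your umask)
chmod u+x "$script"       # add execute permission for the owner
chmod o-r "$script"       # remove read permission for others
ls -l "$script"

# Add the directory to PATH for this shell session only.
# To make a change like this permanent, the export line would go in ~/.bashrc.
export PATH="$workdir:$PATH"

# The script can now be run by name from any directory.
hello.sh
```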
Searching & Redirection
Search within files and chain commands together — one of the most powerful concepts in Unix.
grep — basic pattern matching in FASTQ files (e.g. finding reads with NNNNNNNNNN). Using grep options: -B/-A for context lines, -n for line numbers, -c for counts, -v for inverted matches, --no-group-separator. Output redirection: writing to files with > and appending with >>. Piping with | — combining grep with less, head, wc -l. The wc command for counting lines, words, and characters. Extracting columns with cut -f. Sorting with sort and removing duplicates with sort -u. Practical exercise: counting unique exons in a GTF gene annotation file by chaining grep, cut, sort -u, and wc -l. Introduction to awk for column-based processing — printing specific columns (awk '{print $1, $3}'), filtering rows by column value (awk '$3 > 100'), and combining with pipes (cat file.txt | awk '{print $2}' | sort | uniq -c). Practical example: extract read names and lengths from a FASTQ file using awk.
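The two practical examples above might look roughly like the sketch below. The GTF and FASTQ records are made up, and the choice of columns 1, 4, 5, and 7 (chromosome, start, end, strand) to define a unique exon is an assumption, not necessarily the one used in the course exercise.

```shell
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"

# Made-up GTF records (tab-separated). The same exon appears twice because
# it belongs to two transcripts.
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'  > "$gtf"
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n' >> "$gtf"
printf 'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n' >> "$gtf"
printf 'chr1\ttoy\tCDS\t120\t200\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n'  >> "$gtf"

# Keep exon lines, cut out chromosome/start/end/strand, drop duplicates, count.
# -w matches "exon" as a whole word, so attributes like exon_number are ignored.
n_exons=$(grep -w "exon" "$gtf" | cut -f 1,4,5,7 | sort -u | wc -l)
echo "Unique exons: $n_exons"

# Made-up FASTQ file with two reads of different lengths.
fastq="$workdir/toy.fastq"
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTA\n+\nIIIII\n' > "$fastq"

# Line 1 of each 4-line record is the header; line 2 is the sequence.
read_lengths=$(awk 'NR % 4 == 1 {name = substr($1, 2)}
                    NR % 4 == 2 {print name, length($0)}' "$fastq")
echo "$read_lengths"
```

Each stage of the pipe does one small job, which makes it easy to check intermediate output by removing the later stages and piping into head or less instead.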
Shell Scripts & Variables
Capture commands in reusable scripts and introduce variables for flexible automation.
The .sh extension convention. Creating a simple script with nano (e.g. listing.sh) combining pwd, ls -l, and echo. Running scripts with sh or bash. Bash variables: defining (num=25), referencing ($num), rules (no spaces around =). Using variables as input to commands (e.g. wc -l $file). The basename command for extracting filenames and trimming extensions. Assigning command output to variables using backticks (`basename ...`). Adding comments with # for documentation. Building a practical directory_info.sh script that takes a directory path, reports its contents, and counts files.
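A minimal sketch of these ideas follows. The file path is hypothetical, and the directory_info.sh shown here is only a guess at what such a script could contain, not the course's own version. It uses $(...) for command substitution, which behaves like backticks but is easier to read and nest.

```shell
# Variables: no spaces around "=", and use $ to read the value back.
file="/data/raw/Mov10_oe_1.subset.fq"    # hypothetical path; the file need not exist

# basename strips the directory, and optionally a trailing suffix.
name=$(basename "$file")                 # Mov10_oe_1.subset.fq
prefix=$(basename "$file" .subset.fq)    # Mov10_oe_1

echo "File name: $name"
echo "Prefix:    $prefix"

# A sketch of a directory_info.sh-style script, written to a temporary file.
workdir=$(mktemp -d)
cat > "$workdir/directory_info.sh" <<'EOF'
#!/bin/bash
# Usage: bash directory_info.sh <directory>
dir=$1                                   # first argument: the directory to report on

echo "Contents of $dir:"
ls -l "$dir"

# Count regular files directly inside the directory (not in subdirectories).
n_files=$(find "$dir" -maxdepth 1 -type f | wc -l)
echo "Number of files:" $n_files
EOF

# Try it on a directory containing two extra files.
touch "$workdir/a.txt" "$workdir/b.txt"
bash "$workdir/directory_info.sh" "$workdir"
```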
Loops & Automation
Iterate commands over multiple files and build automated analysis scripts.
The for loop structure: the for, in, do, and done keywords. Step-by-step loop execution: variable initialisation, body execution, reassignment. Using wildcards in loop lists (e.g. for file in Mov10*.fq). Best practices: meaningful variable names, not using ls in loop definitions. Combining loops with echo, wc -l, grep, and redirection. Building a complete automation script (generate_bad_reads_summary.sh): the shebang line (#!/bin/bash), iterating over FASTQ files, using basename to generate output file prefixes, extracting bad reads with grep, counting and logging results to a summary file.
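The automation script described above could be sketched along these lines. This is an illustrative version only: the input files are made up, "bad" reads are assumed to be those containing NNNNNNNNNN, and the output file names are hypothetical. The --no-group-separator option is specific to GNU grep.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up FASTQ files; "bad" reads here are those containing NNNNNNNNNN.
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nNNNNNNNNNN\n+\n##########\n' > Mov10_oe_1.fq
printf '@r3\nNNNNNNNNNN\n+\n##########\n@r4\nNNNNNNNNNN\n+\n##########\n' > Mov10_oe_2.fq

summary="bad_reads_summary.txt"
: > "$summary"    # start with an empty summary file

# Loop over the files with a wildcard rather than parsing the output of ls.
for file in Mov10*.fq
do
    prefix=$(basename "$file" .fq)

    # Save each bad read as a full 4-line record:
    # -B1 adds the header line, -A2 adds the separator and quality lines.
    grep -B1 -A2 --no-group-separator NNNNNNNNNN "$file" > "${prefix}_bad_reads.txt"

    # Count the bad reads and append one line per file to the summary.
    count=$(grep -c NNNNNNNNNN "$file")
    echo "$prefix $count" >> "$summary"
done

cat "$summary"
```

On each pass the loop variable file takes the next matching name, so adding more Mov10*.fq files requires no change to the script.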
Day 1 — Exercise Answer Key
Solutions to all Day 1 exercises are available here: Day 1 Answer Key
Day 2 — HPC, Data Handling & Conda
Day 2 begins with HPC and SLURM job scheduling on the lab417 cluster, then covers essential data handling skills — file transfer, compressed files, and keeping jobs running with screen — before moving into Conda for managing bioinformatics software environments. The day closes with a hands-on practical installing real Oxford Nanopore tools, bridging directly into the ONT genome assembly course later this week.
~4 h of content plus breaks (starts 09:00, finishes ~14:30).
Introduction to HPC & SLURM
Understand cluster computing and learn to submit jobs on the lab417 server.
SLURM components: slurmctld (controller) and slurmd (node daemons), partitions, jobs, and job steps. Key SLURM commands: sinfo (cluster status), squeue (job queue), srun (interactive jobs), sbatch (batch submission), scancel (cancel jobs), scontrol show job (job details). Interactive jobs vs. batch jobs. Writing SLURM batch scripts with #SBATCH directives: --job-name, --output, --error, --cpus-per-task, --mem, --time, --partition=agrp. The lab417 agrp partition. Being a good cluster citizen: resource requests, monitoring, and etiquette.
File Transfer — Moving Data On and Off the Server
Transfer sequencing data and results between your local computer and the lab417 server using scp and rsync.
Copying a file from local machine to server (scp localfile.fastq username@lab417.saiab.ac.za:~/data/). Copying a file from server to local machine (scp username@lab417.saiab.ac.za:~/results/assembly.fasta ./). Copying a directory recursively with scp -r. Introduction to rsync: smarter transfers that only copy what has changed (rsync -avh username@lab417.saiab.ac.za:~/results/ ./results/). Key rsync flags: -a (archive), -v (verbose), -h (human-readable), --progress (show transfer progress), --dry-run (preview without copying). When to use scp vs. rsync: scp for quick single files, rsync for directories and repeated transfers. Practical exercise: transfer a FASTQ file from the server to your local machine using both scp and rsync.
Compressed Files and Keeping Jobs Running with screen
Work with compressed sequencing files and keep long-running jobs alive after disconnecting from the server.
Compressing and decompressing with gzip and gunzip (gzip file.fastq → file.fastq.gz; gunzip file.fastq.gz). Viewing compressed files without decompressing (zcat file.fastq.gz | head). Piping compressed files directly into tools (zcat reads.fastq.gz | grep "^@" | wc -l). Archiving directories with tar: create (tar -czf results.tar.gz results/), extract (tar -xzf results.tar.gz), list contents (tar -tzf results.tar.gz). Persistent terminal sessions with screen: the problem that SSH disconnects kill running jobs; starting a session (screen, screen -S jobname); detaching with Ctrl+A then D; listing sessions (screen -ls); reattaching (screen -r jobname); closing with exit. Practical: start a screen session, run a long command, detach, log out, log back in, and reattach.
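The compression commands above can be tried end to end on a small made-up file, as in the sketch below. The directory and file names are hypothetical. Reads are counted here by dividing the line count by four, which is a slightly more robust alternative to grep "^@" because FASTQ quality strings can also begin with @.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A made-up results directory holding one small FASTQ file.
mkdir results
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > results/reads.fastq

gzip results/reads.fastq                  # replaces the file with reads.fastq.gz
zcat results/reads.fastq.gz | head -n 4   # peek without decompressing on disk

# Count reads straight from the compressed file (4 lines per record).
n_lines=$(zcat results/reads.fastq.gz | wc -l)
n_reads=$(( n_lines / 4 ))
echo "$n_reads reads"

tar -czf results.tar.gz results/              # create a compressed archive
archive_listing=$(tar -tzf results.tar.gz)    # list contents without extracting
echo "$archive_listing"

mkdir restored
tar -xzf results.tar.gz -C restored       # extract into another directory
ls -l restored/results
```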
Introduction to Conda
Install and configure Conda for managing bioinformatics software without admin rights.
source ~/.bashrc. Verifying installation: conda --version, conda info, conda list. Understanding the base environment. Setting up bioinformatics channels: conda config --add channels bioconda, conda-forge, defaults. Setting strict channel priority. Searching and installing packages: conda search samtools, conda install -c bioconda samtools. Installing specific versions for reproducibility (e.g. samtools=1.15).
Conda Environments for Bioinformatics
Create isolated environments for different analyses — the key to reproducible bioinformatics.
Environment naming strategies: project-specific (e.g. cancer-rnaseq-2024), analysis-specific (RNA-seq, variant calling, assembly), pipeline-specific. Creating environments: conda create --name rnaseq python=3.9. Activating (conda activate) and deactivating (conda deactivate). Installing bioinformatics tools into environments: alignment (bwa, star, minimap2), variant calling (gatk4, bcftools), assembly (flye, spades), QC (fastqc, multiqc). Version pinning for reproducibility (e.g. samtools=1.15). Listing environments (conda env list) and packages (conda list). Exporting to YAML: conda env export > environment.yml. Recreating from YAML: conda env create -f environment.yml. Minimal exports with --from-history. Sharing environments for reproducible research.
Conda Practical — Installing ONT Bioinformatics Tools
Apply everything learned about Conda by building a real environment for Oxford Nanopore sequencing data QC — the same tools used in the ONT genome assembly course later this week.
Create and activate a new environment (conda create -n ont-qc python=3.10, conda activate ont-qc). Install real ONT tools from Bioconda (conda install -c bioconda -c conda-forge nanoq seqkit). Run nanoq on a provided ONT FASTQ file to generate read statistics (nanoq -i reads.fastq.gz -s). Run seqkit stats on the same file (seqkit stats reads.fastq.gz). Inspect and interpret the output: read count, N50, mean quality. Export the environment for reproducibility (conda env export > ont-qc-environment.yml, conda env export --from-history > ont-qc-minimal.yml). View the YAML and understand its structure. Using mamba as a faster drop-in replacement for conda install. Housekeeping to free disk space (conda clean --all). Bridge to the assembly course: this week you will use these same tools plus Flye and Medaka, installed in a new environment on Day 1 of the assembly course using the same workflow you just practiced.
Wrap-up and Q&A
Bringing it all together — reproducibility, documentation, and next steps into the ONT assembly course.
Sharing environment.yml with publications. Recap of the day's data-handling skills: file transfer, compressed files, screen, and Conda environments. The CHPC alternative: PBS job scheduling on Lengau. Open Q&A and bridge into the ONT genome assembly course later this week.
Quick Reference
Essential Unix Commands
# Navigation
pwd # print working directory
ls -lh # list files with sizes
cd directory/ # change directory
cd ~ # go home
cd .. # go up one level
# Files
cp source dest # copy
mv source dest # move or rename
rm file # remove (careful!)
mkdir dirname # create directory
# Viewing
cat file # print file contents
less file # page through file
head -n 20 file # first 20 lines
tail -n 20 file # last 20 lines
# Searching & Processing
grep "pattern" file # search for pattern
grep -c "pattern" file # count matches
wc -l file # count lines
sort file # sort lines
sort file | uniq # remove duplicate lines (uniq needs sorted input)
# Redirection & Pipes
command > file # write output to file
command >> file # append output to file
cmd1 | cmd2 # pipe output of cmd1 into cmd2
# Permissions
chmod u+x script.sh # make script executable
chmod o-r file # remove read for others
# Variables
var="value" # assign
echo $var # use
Essential Conda Commands
# Environment Management
conda create -n myenv python=3.9 # create environment
conda activate myenv # activate
conda deactivate # deactivate
conda env list # list environments
conda env remove -n myenv # remove environment
# Package Management
conda install -c bioconda samtools # install from bioconda
conda install -c bioconda "bwa=0.7.17" # specific version
conda search -c bioconda toolname # search for packages
conda list # list installed packages
conda update toolname # update package
mamba install -c bioconda toolname # faster alternative
# Reproducibility
conda env export > environment.yml # export environment
conda env create -f environment.yml # recreate from file
conda env export --from-history > minimal.yml # minimal export
# Maintenance
conda clean --all # clear package cache
File Transfer & Compression
# Copy file to server
scp file.fastq username@lab417.saiab.ac.za:~/data/
# Copy file from server
scp username@lab417.saiab.ac.za:~/results/file.fasta ./
# Sync directory with rsync
rsync -avh --progress username@lab417.saiab.ac.za:~/results/ ./results/
# Compress / decompress
gzip file.fastq
gunzip file.fastq.gz
zcat file.fastq.gz | head -8
# Archive a directory
tar -czf archive.tar.gz directory/
tar -xzf archive.tar.gz
# screen — persistent sessions
screen -S mysession # start named session
# Ctrl+A then D # detach
screen -ls # list sessions
screen -r mysession # reattach
Essential SLURM Commands
# Interactive Session
srun --cpus-per-task=1 -t 0-3:00 --mem 1G --pty /bin/bash
# Job Submission
sbatch myscript.sh # submit batch job
squeue -u $USER # check your jobs
scancel JOBID # cancel a job
sinfo # cluster/partition status
# Batch Script Template
#!/bin/bash
#SBATCH --job-name=myanalysis
#SBATCH --output=myanalysis_%j.out
#SBATCH --error=myanalysis_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=0-4:00:00
#SBATCH --partition=agrp
# Your commands here
echo "Job started at $(date)"
samtools sort input.bam -o sorted.bam -@ 4
echo "Job finished at $(date)"
Resources
Cheat Sheets
Online Tutorials
CHPC Users
If you work on the CHPC Lengau cluster (PBS scheduler), refer to the CHPC-specific lesson for PBS job submission and module usage.