Introduction to Unix for Bioinformaticians

Course Overview

This two-day course provides a practical introduction to the Unix command line and essential software management skills for bioinformatics research. You will learn to navigate the Unix file system, manipulate bioinformatics data files, write shell scripts for automation, manage software environments with Conda, and submit jobs on the SAIAB lab417 compute cluster using SLURM.

By the end of this course, you will be equipped with the foundational computing skills required to run bioinformatics analyses efficiently and reproducibly.

Learning Objectives

Navigate the Unix file system and work confidently at the command line
Create, examine, search, and manipulate text-based bioinformatics files
Use wildcards, pipes, and redirection to chain commands efficiently
Write reusable shell scripts with variables and loops
Understand file permissions and environment variables
Install and manage bioinformatics software using Conda environments
Understand HPC architecture and submit jobs using SLURM on lab417
Transfer data between local and remote systems using scp and rsync
Work with compressed sequencing files using gzip, zcat, and tar
Maintain persistent terminal sessions with screen to keep jobs running
Install and run real ONT bioinformatics tools in a Conda environment
Apply best practices for reproducibility in computational research

Prerequisites

No prior command-line experience is required. You should have a SAIAB lab417 account (or training account) and a laptop with an SSH client installed (Terminal on macOS; MobaXterm on Windows).

Day 1 — Unix Fundamentals & Shell Scripting

Day 1 covers the core Unix command-line skills: navigating the file system, working with files, understanding file permissions and environment variables, searching data, writing scripts, and using variables and loops for automation.

Schedule

Introduction to the Shell

Log in to the lab417 cluster and understand HPC architecture — login nodes vs. compute nodes, shared storage, and the SLURM scheduler.

Topics covered: Logging in via SSH (ssh username@lab417.saiab.ac.za). Starting an interactive session with srun. Copying exercise data with cp -r. Navigating the file system: cd, ls, ls -l, ls -F, pwd. Understanding the root directory (/) and home directory (~). Full vs. relative paths. Using man pages for help. Tab completion for efficient typing. Copying (cp), creating directories (mkdir), moving/renaming (mv), and removing (rm, rm -ri) files and directories.
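
The navigation and file-management commands listed above can be tried safely in a scratch directory. The following is a minimal sketch, not the course's own exercise: the directory name unix_demo and the file names are made up for illustration, and touch (not in the topic list) is used only to create an empty placeholder file.

```shell
# Work inside a scratch directory so nothing important is touched.
# "unix_demo" and the file names below are illustrative, not course data.
mkdir -p unix_demo/raw_fastq
cd unix_demo                  # relative path: resolved from the current directory
pwd                           # prints the full (absolute) path of unix_demo

# Create an empty placeholder file, then list with type markers
touch raw_fastq/sample1.fq
ls -F                         # -F appends / to directory names

# Copy, rename, and remove
cp raw_fastq/sample1.fq raw_fastq/sample1_backup.fq
mv raw_fastq/sample1_backup.fq raw_fastq/sample1_copy.fq
mkdir -p results
cp -r raw_fastq results/      # -r copies a directory and its contents
rm raw_fastq/sample1_copy.fq  # rm is permanent; rm -i asks before each deletion
ls -l raw_fastq results/raw_fastq

cd ..                         # back up one level, to where we started
```

After running it, results/raw_fastq still holds both files because the copy was made before the rm.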

30 min

Wildcards & Shortcuts

Learn time-saving shortcuts for navigating and selecting files on the command line.

Topics covered: The * wildcard, which matches any number of characters (e.g. ls *fq, ls Mov10*fq). The ? wildcard, which matches exactly one character (e.g. ls /bin/d?). The home directory shortcut (~). The parent directory (..) and current directory (.) shortcuts. Command history with the up/down arrow keys and the history command. Cancelling commands with Ctrl+C. Jumping to the start/end of the line with Ctrl+A / Ctrl+E. Tab completion for file and directory names.
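
To see how the shell expands the wildcards described above, the sketch below creates a few empty files and lists them with different patterns. The directory wildcard_demo and the file names are illustrative assumptions, chosen to resemble the Mov10 examples.

```shell
# Create a scratch directory with a few empty, illustratively named files
mkdir -p wildcard_demo
touch wildcard_demo/Mov10_oe_1.fq wildcard_demo/Mov10_oe_2.fq \
      wildcard_demo/Irrel_kd_1.fq wildcard_demo/notes.txt

ls wildcard_demo/*fq             # every name ending in "fq" (three files)
ls wildcard_demo/Mov10*fq        # names starting with Mov10 and ending in fq
ls wildcard_demo/Mov10_oe_?.fq   # ? matches exactly one character (1 or 2 here)

# The shell expands the pattern before the command runs,
# so echo shows exactly the file names that ls receives
echo wildcard_demo/Mov10*fq
```

Because the expansion happens in the shell, the same patterns work with any command that accepts file names, not only ls.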

30 min

Examining & Creating Files

View, create, and edit files from the command line using a variety of tools.

Topics covered: Printing file contents with cat (e.g. viewing FASTA files). Paging through large files with less — navigation keys (SPACE, b, g, G, q) and searching with /. Viewing the start and end of files with head -n and tail -n. Understanding the FASTQ file format (4-line records: header, sequence, separator, quality). Checking file sizes with ls -lh. Introduction to the Nano text editor — opening, navigating (Ctrl+A/E/Y/V), editing, cutting/pasting (Ctrl+K/U), saving (Ctrl+O), exiting (Ctrl+X), and search/replace (Ctrl+W, Ctrl+\).
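
The 4-line FASTQ record structure is easiest to see on a tiny file. The sketch below writes two made-up reads (the file name, read names, and sequences are assumptions for illustration) and inspects them with the viewing commands listed above.

```shell
# Write a tiny two-read FASTQ file (made-up reads, for illustration only)
mkdir -p fastq_demo
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'NNNNNNNNNN' '+' '##########' \
  > fastq_demo/example.fq

cat fastq_demo/example.fq          # print the whole file (fine for small files)
head -n 4 fastq_demo/example.fq    # first record: header, sequence, separator, quality
tail -n 4 fastq_demo/example.fq    # last record
ls -lh fastq_demo/example.fq       # human-readable file size

# Each record is 4 lines, so the read count is the line count divided by 4
lines=$(wc -l < fastq_demo/example.fq)
echo "reads: $((lines / 4))"
```

For real sequencing files, less is the better choice than cat, since it pages through the file instead of printing all of it.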

30 min

Permissions & Environment Variables

Understand how Unix controls access to files and how the shell environment is configured.

Topics covered: File ownership — user, group, and others. The groups command. Reading the permissions string (e.g. -rw-rw-r--): r (read), w (write), x (execute/traverse). Interpreting permissions for files vs. directories. Changing permissions with chmod — adding (+) and removing (-) permissions (e.g. chmod o-r file, chmod u+x script.sh). Checking directory sizes with du -sh. Environment variables: $HOME, $PATH, $USER. Viewing with echo and printenv. Adding directories to $PATH using export. The .bashrc file for persistent configuration.
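
As a rough illustration of how chmod changes the permissions string, the sketch below sets a file to a known state and then applies the two example changes. The directory perm_demo, the script hello.sh, and the $HOME/bin location are hypothetical; the u=rw,g=rw,o=r form is standard chmod syntax used here only to get a predictable starting point.

```shell
mkdir -p perm_demo
printf '#!/bin/bash\necho "hello"\n' > perm_demo/hello.sh

# Start from a known state: -rw-rw-r--
chmod u=rw,g=rw,o=r perm_demo/hello.sh
ls -l perm_demo/hello.sh

chmod o-r perm_demo/hello.sh    # remove read permission for others
chmod u+x perm_demo/hello.sh    # let the owner execute the script
ls -l perm_demo/hello.sh        # permissions now read -rwxrw----

./perm_demo/hello.sh            # runs because the owner has the x bit

du -sh perm_demo                # total size of the directory

# Environment variables
echo "$HOME"
echo "$PATH"

# Prepend a (hypothetical) personal bin directory to PATH for this session;
# adding the same line to ~/.bashrc makes it persistent across logins
export PATH="$HOME/bin:$PATH"
echo "$PATH"
```

The export only affects the current shell and programs started from it, which is why the persistent version belongs in .bashrc.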

40 min

☕ Tea/Coffee Break — 15 min

Searching & Redirection

Search within files and chain commands together — one of the most powerful concepts in Unix.

Topics covered: Searching files with grep — basic pattern matching in FASTQ files (e.g. finding reads with NNNNNNNNNN). Using grep options: -B/-A for context lines, -n for line numbers, -c for counts, -v for inverted matches, --no-group-separator. Output redirection: writing to files with > and appending with >>. Piping with | — combining grep with less, head, wc -l. The wc command for counting lines, words, and characters. Extracting columns with cut -f. Sorting with sort and removing duplicates with sort -u. Practical exercise: counting unique exons in a GTF gene annotation file by chaining grep, cut, sort -u, and wc -l. Introduction to awk for column-based processing — printing specific columns (awk '{print $1, $3}'), filtering rows by column value (awk '$3 > 100'), and combining with pipes (cat file.txt | awk '{print $2}' | sort | uniq -c). Practical example: extract read names and lengths from a FASTQ file using awk.
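
The exon-counting exercise and the awk examples above can be sketched on small made-up inputs. Everything in search_demo (the GTF-style rows, the FASTQ reads, and the output file name) is an assumption for illustration; the real exercise uses the course annotation file.

```shell
mkdir -p search_demo

# A tiny tab-separated, GTF-style file: chromosome, source, feature, start, end, ...
{
  printf 'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n'
  printf 'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n'
  printf 'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\texon\t700\t900\t.\t-\t.\tgene_id "g2"; transcript_id "t3";\n'
} > search_demo/example.gtf

# Lines mentioning exon: 4 (one exon is listed twice, once per transcript)
grep -c "exon" search_demo/example.gtf

# Unique exons: keep chromosome, start, end, strand; drop duplicates; count
grep "exon" search_demo/example.gtf | cut -f 1,4,5,7 | sort -u | wc -l

# awk addresses columns as $1, $2, ...
awk '{print $3, $4}' search_demo/example.gtf        # feature type and start
awk '$5 > 300' search_demo/example.gtf              # rows whose end is above 300
cut -f 3 search_demo/example.gtf | sort | uniq -c   # how often each feature occurs

# Read names and lengths from a FASTQ file: line 1 of each record is the
# header, line 2 the sequence
printf '%s\n' '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
              '@read2' 'ACGTAC' '+' 'IIIIII' > search_demo/example.fq
awk 'NR % 4 == 1 {name = substr($1, 2)}
     NR % 4 == 2 {print name, length($0)}' search_demo/example.fq \
  > search_demo/read_lengths.txt
cat search_demo/read_lengths.txt
```

Note that grep "exon" matches the word anywhere on the line; awk '$3 == "exon"' is a stricter alternative when other columns might contain the same text.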

60 min

🍽️ Lunch Break — 60 min

Shell Scripts & Variables

Capture commands in reusable scripts and introduce variables for flexible automation.

Topics covered: What is a shell script — text files containing commands, .sh extension convention. Creating a simple script with nano (e.g. listing.sh) combining pwd, ls -l, and echo. Running scripts with sh or bash. Bash variables — defining (num=25), referencing ($num), rules (no spaces around =). Using variables as input to commands (e.g. wc -l $file). The basename command for extracting filenames and trimming extensions. Assigning command output to variables using backticks (`basename ...`). Adding comments with # for documentation. Building a practical directory_info.sh script that takes a directory path, reports its contents, and counts files.
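
A script along the lines of directory_info.sh might look like the sketch below. This is a guess at its general shape, not the course's actual script: the demo directory script_demo, the variable names, and the choice to set the path in a variable are all assumptions. It uses $(...) for command substitution, which is equivalent to the backtick form mentioned above.

```shell
#!/bin/bash
# directory_info.sh (sketch): report what a directory contains

# Demo directory so the sketch runs anywhere; point dir at a real path in practice
mkdir -p script_demo
touch script_demo/Mov10_oe_1.fq script_demo/Mov10_oe_2.fq

dir="script_demo"                  # no spaces around = when assigning

# Command substitution stores a command's output in a variable
dir_name=$(basename "$dir")        # basename strips any leading path

echo "Reporting on directory: $dir_name"
ls -l "$dir"

# Count the entries by piping the listing into wc -l
num_files=$(ls "$dir" | wc -l)
echo "Number of entries: $num_files"

# basename also trims an extension given as a second argument
sample=$(basename "$dir/Mov10_oe_1.fq" .fq)
echo "Sample name: $sample"

# A variable used as input to a command
file="$dir/Mov10_oe_1.fq"
wc -l "$file"
```

Saved as directory_info.sh, it can be run with bash directory_info.sh. Quoting variables ("$dir") keeps paths containing spaces intact.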

60 min

Loops & Automation

Iterate commands over multiple files and build automated analysis scripts.

Topics covered: The for loop structure — for, in, do, done keywords. Step-by-step loop execution: variable initialisation, body execution, reassignment. Using wildcards in loop lists (e.g. for file in Mov10*.fq). Best practices: meaningful variable names, not using ls in loop definitions. Combining loops with echo, wc -l, grep, and redirection. Building a complete automation script (generate_bad_reads_summary.sh): the shebang line (#!/bin/bash), iterating over FASTQ files, using basename to generate output file prefixes, extracting bad reads with grep, counting and logging results to a summary file.
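
The pieces described above (shebang, loop over FASTQ files, basename prefixes, grep extraction, summary log) could fit together roughly as follows. This is a sketch under assumptions, not the course's generate_bad_reads_summary.sh: the loop_demo directory, the made-up reads, and the output file names are illustrative, and --no-group-separator requires GNU grep.

```shell
#!/bin/bash
# generate_bad_reads_summary.sh (sketch)

# Two tiny made-up FASTQ files so the sketch runs anywhere
mkdir -p loop_demo
printf '%s\n' '@r1' 'ACGTACGTACGT' '+' 'IIIIIIIIIIII' \
              '@r2' 'NNNNNNNNNNAC' '+' '############' > loop_demo/Mov10_oe_1.fq
printf '%s\n' '@r3' 'ACGTACGTACGT' '+' 'IIIIIIIIIIII' > loop_demo/Mov10_oe_2.fq

summary="loop_demo/bad_reads_summary.txt"
rm -f "$summary"                     # start with a fresh summary file

# Let the shell expand the wildcard; do not parse the output of ls
for fq in loop_demo/*.fq
do
  base=$(basename "$fq" .fq)         # e.g. Mov10_oe_1
  echo "Processing $base"

  # -B1 keeps the header before each match, -A2 the separator and quality after it
  grep -B1 -A2 --no-group-separator NNNNNNNNNN "$fq" \
    > "loop_demo/${base}_badreads.fastq"

  # Count matching reads and append one line per file to the summary
  count=$(grep -c NNNNNNNNNN "$fq")
  echo "$base $count" >> "$summary"
done

cat "$summary"
```

The extracted reads are written with a .fastq extension so that rerunning the script does not pick them up through the *.fq pattern.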

60 min

Day 1 — Exercise Answer Key

Solutions to all Day 1 exercises are available in the Day 1 Answer Key.

Day 2 — HPC, Data Handling & Conda

Day 2 begins with HPC and SLURM job scheduling on the lab417 cluster, then covers essential data handling skills — file transfer, compressed files, and keeping jobs running with screen — before moving into Conda for managing bioinformatics software environments. The day closes with a hands-on practical installing real Oxford Nanopore tools, bridging directly into the ONT genome assembly course later this week.

Day 2 has about 4 hours of content plus breaks; it starts at 09:00 and finishes at about 14:30.

Schedule

Introduction to HPC & SLURM

Understand cluster computing and learn to submit jobs on the lab417 server.

Topics covered: Computer components review — CPU (cores, processors), data storage (HDD vs. SSD), memory (RAM). HPC architecture: nodes, shared storage, why the cluster is more powerful. Parallelisation concepts — serial vs. parallel processing, multithreading vs. true parallelisation. SLURM architecture: slurmctld (controller) and slurmd (node daemons), partitions, jobs, job steps. Key SLURM commands: sinfo (cluster status), squeue (job queue), srun (interactive jobs), sbatch (batch submission), scancel (cancel jobs), scontrol show job. Interactive jobs vs. batch jobs. Writing SLURM batch scripts with #SBATCH directives: --job-name, --output, --error, --cpus-per-task, --mem, --time, --partition=agrp. The lab417 agrp partition. Being a good cluster citizen — resource requests, monitoring, and etiquette.
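
One possible way to practise the batch workflow is to write a small script, check its syntax, and submit it only where SLURM is available. In the sketch below the job name, the resource values, and the file name slurm_demo_job.sh are placeholders; the #SBATCH directives and the agrp partition are the ones listed above.

```shell
# Write the batch script to a file (job name and workload are placeholders)
cat > slurm_demo_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --output=demo_%j.out
#SBATCH --error=demo_%j.err
#SBATCH --cpus-per-task=2
#SBATCH --mem=1G
#SBATCH --time=0-00:10:00
#SBATCH --partition=agrp

# SLURM sets SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is given;
# fall back to 1 if the script is run by hand
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(hostname) with $threads CPU(s)"
EOF

# Check the script for syntax errors without running it
bash -n slurm_demo_job.sh && echo "syntax OK"

# Submit only where SLURM is installed, then look at your queue
if command -v sbatch >/dev/null 2>&1; then
  sbatch slurm_demo_job.sh
  squeue -u "$USER"
else
  echo "sbatch not found: run this step on the cluster"
fi
```

Reading the CPU count from SLURM_CPUS_PER_TASK keeps the tool's thread setting in step with the resources actually requested.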

60 min

File Transfer — Moving Data On and Off the Server

Transfer sequencing data and results between your local computer and the lab417 server using scp and rsync.

Topics covered: Copying a file from local machine to server (scp localfile.fastq username@lab417.saiab.ac.za:~/data/). Copying a file from server to local machine (scp username@lab417.saiab.ac.za:~/results/assembly.fasta ./). Copying a directory recursively with scp -r. Introduction to rsync — smarter transfers that only copy what has changed (rsync -avh username@lab417.saiab.ac.za:~/results/ ./results/). Key rsync flags: -a (archive), -v (verbose), -h (human-readable), --progress (show transfer progress), --dry-run (preview without copying). When to use scp vs. rsync — scp for quick single files, rsync for directories and repeated transfers. Practical exercise: transfer a FASTQ file from the server to your local machine using both scp and rsync.
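
The behaviour of rsync can be explored without a server by syncing between two local directories. The sketch below is an illustration only: the transfer_demo directories stand in for the server and laptop sides, and the block skips the rsync steps if rsync is not installed.

```shell
# Local stand-ins for the server side and the laptop side (illustrative names)
mkdir -p transfer_demo/server_results transfer_demo/local_results
echo ">contig1" > transfer_demo/server_results/assembly.fasta

if command -v rsync >/dev/null 2>&1; then
  # Preview: --dry-run lists what would be copied without copying anything
  rsync -avh --dry-run transfer_demo/server_results/ transfer_demo/local_results/

  # Real transfer; the trailing slash on the source copies its contents,
  # not the directory itself
  rsync -avh transfer_demo/server_results/ transfer_demo/local_results/

  # Running it again transfers nothing, because the files are already up to date
  rsync -avh transfer_demo/server_results/ transfer_demo/local_results/
else
  echo "rsync is not installed here"
fi

# Against the real server, the source is a remote path instead, e.g.
# rsync -avh --progress username@lab417.saiab.ac.za:~/results/ ./results/
```

The second run illustrates why rsync suits repeated transfers: only new or changed files are sent.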

20 min

Compressed Files and Keeping Jobs Running with screen

Work with compressed sequencing files and keep long-running jobs alive after disconnecting from the server.

Topics covered: Why sequencing data is almost always stored compressed: storage and transfer efficiency. Compressing and decompressing files with gzip and gunzip (gzip file.fastq → file.fastq.gz; gunzip file.fastq.gz). Viewing compressed files without decompressing (zcat file.fastq.gz | head). Piping compressed files directly into tools (zcat reads.fastq.gz | grep "^@" | wc -l; note that quality lines can also begin with @, so dividing the total line count by four gives a more reliable read count). Archiving directories with tar: create (tar -czf results.tar.gz results/), extract (tar -xzf results.tar.gz), list contents (tar -tzf results.tar.gz). Persistent terminal sessions with screen: the problem that SSH disconnects kill running jobs; starting a session (screen, screen -S jobname); detaching with Ctrl+A then D; listing sessions (screen -ls); reattaching (screen -r jobname); closing with exit. Practical: start a screen session, run a long command, detach, log out, log back in, and reattach.
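
A compress, inspect, archive, and extract round trip can be sketched on a tiny made-up FASTQ file. The gz_demo directory and the reads are assumptions for illustration; the second read's quality line deliberately starts with @ to show why counting headers with grep can overcount.

```shell
mkdir -p gz_demo/results
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'GGCC' '+' '@III' > gz_demo/reads.fastq

gzip -f gz_demo/reads.fastq                # replaces the file with reads.fastq.gz
                                           # (-f overwrites an existing .gz)
zcat gz_demo/reads.fastq.gz | head -n 4    # first record, nothing written to disk

# Count reads: 4 lines per record
total_lines=$(zcat gz_demo/reads.fastq.gz | wc -l)
echo "reads: $((total_lines / 4))"

# Counting lines that start with @ reports 3 here, because the quality
# line of read2 also begins with @
zcat gz_demo/reads.fastq.gz | grep -c "^@"

# Archive a directory, list the archive, then extract it again
cp gz_demo/reads.fastq.gz gz_demo/results/
cd gz_demo
tar -czf results.tar.gz results/           # create
tar -tzf results.tar.gz                    # list contents
rm -r results
tar -xzf results.tar.gz                    # extract: results/ is back
cd ..
```

On systems where zcat does not handle .gz files directly, gzip -dc is an equivalent way to stream the decompressed content.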

25 min

☕ Tea/Coffee Break — 15 min

Introduction to Conda

Install and configure Conda for managing bioinformatics software without admin rights.

Topics covered: What is Conda: package and environment management in user space, no root required. Why Conda: dependency management, environment isolation, cross-platform, scientific computing focus. Choosing a distribution: Miniconda (recommended, minimal) vs. Anaconda (full). Installing Miniconda on Linux: downloading, running the installer, initialising with source ~/.bashrc. Verifying installation: conda --version, conda info, conda list. Understanding the base environment. Setting up bioinformatics channels with conda config --add channels: defaults, then bioconda, then conda-forge (each --add places the channel at the top of the list, so the channel added last gets the highest priority). Setting strict channel priority. Searching and installing packages: conda search samtools, conda install -c bioconda samtools. Installing specific versions for reproducibility (e.g. samtools=1.15).

30 min

Conda Environments for Bioinformatics

Create isolated environments for different analyses — the key to reproducible bioinformatics.

Topics covered: Why environments are critical — dependency conflicts, version-specific tool behaviour. Environment patterns: project-specific (cancer-rnaseq-2024), analysis-specific (RNA-seq, variant calling, assembly), pipeline-specific. Creating environments: conda create --name rnaseq python=3.9. Activating (conda activate) and deactivating (conda deactivate). Installing bioinformatics tools into environments — alignment (bwa, star, minimap2), variant calling (gatk4, bcftools), assembly (flye, spades), QC (fastqc, multiqc). Version pinning for reproducibility (e.g. samtools=1.15). Listing environments (conda env list) and packages (conda list). Exporting to YAML: conda env export > environment.yml. Recreating from YAML: conda env create -f environment.yml. Minimal exports with --from-history. Sharing environments for reproducible research.

45 min

🍽️ Lunch Break — 60 min

Conda Practical — Installing ONT Bioinformatics Tools

Apply everything learned about Conda by building a real environment for Oxford Nanopore sequencing data QC — the same tools used in the ONT genome assembly course later this week.

Topics covered: Create a dedicated environment for ONT QC work (conda create -n ont-qc python=3.10, conda activate ont-qc). Install real ONT tools from Bioconda (conda install -c bioconda -c conda-forge nanoq seqkit). Run nanoq on a provided ONT FASTQ file to generate read statistics (nanoq -i reads.fastq.gz -s). Run seqkit stats on the same file (seqkit stats reads.fastq.gz). Inspect and interpret the output: read count, N50, mean quality. Export the environment for reproducibility (conda env export > ont-qc-environment.yml, conda env export --from-history > ont-qc-minimal.yml). View the YAML and understand its structure. Using mamba as a faster drop-in replacement for conda install. Housekeeping to free disk space (conda clean --all). Bridge to the assembly course: this week you will use these same tools plus Flye and Medaka, installed in a new environment on Day 1 of the assembly course using the same workflow you just practised.

40 min

☕ Tea/Coffee Break — 15 min

Wrap-up and Q&A

Bringing it all together — reproducibility, documentation, and next steps into the ONT assembly course.

Topics covered: Reproducibility recap — pin tool versions during active analysis and export environment.yml with publications. Recap of the day's data-handling skills: file transfer, compressed files, screen, and Conda environments. The CHPC alternative: PBS job scheduling on Lengau. Open Q&A and bridge into the ONT genome assembly course later this week.

20 min

Quick Reference

Essential Unix Commands

# Navigation
pwd                     # print working directory
ls -lh                  # list files with sizes
cd directory/           # change directory
cd ~                    # go home
cd ..                   # go up one level

# Files
cp source dest          # copy
mv source dest          # move or rename
rm file                 # remove (careful!)
mkdir dirname           # create directory

# Viewing
cat file                # print file contents
less file               # page through file
head -n 20 file         # first 20 lines
tail -n 20 file         # last 20 lines

# Searching & Processing
grep "pattern" file     # search for pattern
grep -c "pattern" file  # count matches
wc -l file              # count lines
sort file               # sort lines
sort file | uniq        # remove duplicate lines (uniq only drops adjacent repeats, so sort first)

# Redirection & Pipes
command > file          # write output to file
command >> file         # append output to file
cmd1 | cmd2             # pipe output of cmd1 into cmd2

# Permissions
chmod u+x script.sh     # make script executable
chmod o-r file          # remove read for others

# Variables
var="value"             # assign
echo "$var"             # use (quotes keep spaces in the value intact)

Essential Conda Commands

# Environment Management
conda create -n myenv python=3.9        # create environment
conda activate myenv                    # activate
conda deactivate                        # deactivate
conda env list                          # list environments
conda env remove -n myenv               # remove environment

# Package Management
conda install -c bioconda samtools      # install from bioconda
conda install -c bioconda "bwa=0.7.17"  # specific version
conda search -c bioconda toolname       # search for packages
conda list                              # list installed packages
conda update toolname                   # update package
mamba install -c bioconda toolname      # faster alternative

# Reproducibility
conda env export > environment.yml             # export environment
conda env create -f environment.yml            # recreate from file
conda env export --from-history > minimal.yml  # minimal export

# Maintenance
conda clean --all                       # clear package cache

File Transfer & Compression

# Copy file to server
scp file.fastq username@lab417.saiab.ac.za:~/data/

# Copy file from server
scp username@lab417.saiab.ac.za:~/results/file.fasta ./

# Sync directory with rsync
rsync -avh --progress username@lab417.saiab.ac.za:~/results/ ./results/

# Compress / decompress
gzip file.fastq
gunzip file.fastq.gz
zcat file.fastq.gz | head -8

# Archive a directory
tar -czf archive.tar.gz directory/
tar -xzf archive.tar.gz

# screen — persistent sessions
screen -S mysession       # start named session
# Ctrl+A then D           # detach
screen -ls                # list sessions
screen -r mysession       # reattach

Essential SLURM Commands

# Interactive Session
srun --cpus-per-task=1 -t 0-3:00 --mem 1G --pty /bin/bash

# Job Submission
sbatch myscript.sh          # submit batch job
squeue -u $USER             # check your jobs
scancel JOBID               # cancel a job
sinfo                       # cluster/partition status

# Batch Script Template (save everything from #!/bin/bash down as myscript.sh)
#!/bin/bash
#SBATCH --job-name=myanalysis
#SBATCH --output=myanalysis_%j.out
#SBATCH --error=myanalysis_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=0-4:00:00
#SBATCH --partition=agrp

# Your commands here
echo "Job started at $(date)"
samtools sort -@ "$SLURM_CPUS_PER_TASK" -o sorted.bam input.bam   # thread count follows --cpus-per-task
echo "Job finished at $(date)"