Compressed Files and screen

Learning objectives

Approximate time: 25 minutes


Part 1 — Compressed files

Why is sequencing data always compressed?

Raw sequencing files are enormous — a single Nanopore or Illumina run can produce many gigabytes of reads. Storing and transferring these as plain text would be slow and wasteful. Compression shrinks them dramatically (often 3–5×), so sequencing data is almost always distributed as .gz files (gzip format), for example reads.fastq.gz.

The good news: most bioinformatics tools read .gz files directly, so you rarely need to decompress them at all.

Compressing and decompressing: gzip and gunzip

Compress a file with gzip. The original is replaced by a .gz version:

gzip file.fastq        # produces file.fastq.gz (file.fastq is removed)

Decompress it again with gunzip:

gunzip file.fastq.gz   # produces file.fastq (file.fastq.gz is removed)

Note: by default both commands replace the input file. To keep the original as well, use gzip -k file.fastq (the -k flag = "keep").
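
As a minimal sketch of the commands above, the following builds a small throwaway FASTQ file, compresses it while keeping the original, and checks the result. The file name demo.fastq and its contents are made up for illustration, and the example assumes a gzip recent enough to support -k (version 1.6 or later).

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A made-up one-read FASTQ file, for illustration only.
printf '@read1\nACGT\n+\nIIII\n' > demo.fastq

gzip -k demo.fastq       # keeps demo.fastq and writes demo.fastq.gz
gzip -t demo.fastq.gz    # integrity test: silent, exit status 0 if intact
gzip -l demo.fastq.gz    # lists compressed size, uncompressed size, ratio

# Decompress to standard output and compare with the original.
gunzip -c demo.fastq.gz | cmp - demo.fastq && echo "round trip OK"
```

On a file this small the .gz version may be larger than the original; the savings only show up on realistically sized data.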

Reading compressed files without decompressing

You usually do not want to decompress a multi-gigabyte file just to peek at it. Use zcat — it is like cat, but reads gzip files on the fly. Combine it with head to view just the first few lines:

zcat file.fastq.gz | head
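
A FASTQ record normally spans four lines, so showing a multiple of four lines displays whole reads. The sketch below wraps that in a small helper; the function name first_reads is hypothetical, and it assumes the standard four-line layout (no wrapped sequences). It uses gzip -dc, which does the same job as zcat and may be the safer choice on systems (macOS, for example) where zcat can expect a different file suffix.

```shell
# Hypothetical helper: print the first N reads (default 1) of a gzipped FASTQ.
# Assumes each record is 4 lines: header, sequence, '+' separator, qualities.
first_reads() {
    local file="$1"
    local n="${2:-1}"
    gzip -dc "$file" | head -n "$(( n * 4 ))"
}
```

For example, first_reads reads.fastq.gz 2 would print the first two reads (eight lines).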

Because zcat writes to standard output, you can pipe it straight into any tool. For example, count the number of reads in a compressed FASTQ file. Each read occupies four lines, so the read count is the line count divided by four:

echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))

This reads the compressed file, counts its lines, and divides by four, without ever writing an uncompressed copy to disk. Counting header lines with grep "^@" looks simpler but can overcount, because quality strings may also begin with @.
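
To see why the counting method matters, note that a FASTQ quality line may itself begin with @, so a pattern match on ^@ can count more lines than there are reads. The sketch below builds a made-up two-read file where this happens and compares both approaches. The helper name count_reads is hypothetical, and it assumes the standard four-line FASTQ layout.

```shell
# Hypothetical helper: count reads as (number of lines) / 4.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Made-up two-read file; the second quality line starts with '@'.
demo=$(mktemp -d)/demo.fastq.gz
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\n@@@@\n' | gzip > "$demo"

count_reads "$demo"                 # prints 2 (8 lines / 4)
gzip -dc "$demo" | grep -c '^@'     # prints 3 (two headers plus one quality line)
```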

Archiving directories with tar

gzip compresses a single file. To bundle a whole directory into one file, use tar. The common flags are -c (create), -x (extract), -t (list), -z (compress or decompress with gzip), and -f (the archive file name, which must come directly after it).

Create a compressed archive of a directory:

tar -czf results.tar.gz results/

Extract it again:

tar -xzf results.tar.gz

List the contents without extracting (useful to check what is inside first):

tar -tzf results.tar.gz
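
The three commands above can be tried end to end on a throwaway directory. This is a sketch only: the directory layout and file contents are made up, and the extra -C flag (extract into a given directory) is used so the extracted copy does not overwrite the original.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Made-up results directory, for illustration only.
mkdir -p results/plots
echo "contig_1 1200" > results/summary.txt
echo "placeholder"   > results/plots/coverage.txt

tar -czf results.tar.gz results/    # create the compressed archive
tar -tzf results.tar.gz             # list contents without extracting

# Extract into a separate directory so the original stays untouched.
mkdir restored
tar -xzf results.tar.gz -C restored

# diff -r is silent and exits 0 when the two trees are identical.
diff -r results restored/results && echo "archive matches original"
```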

Tip: a .tar.gz archive is the standard way to send a whole results folder to a colleague, or to transfer many small files in one go with scp. One large file usually transfers much faster than thousands of tiny ones, because each separate file adds its own overhead.


Part 2 — Keeping jobs running with screen

The problem: disconnects kill your jobs

When you start a long command in an SSH session and then close your laptop, lose Wi-Fi, or log out, the connection drops, and programs started from that terminal are normally killed along with it (they receive a hangup signal). For an assembly that runs for hours, that is a disaster.

screen solves this. It creates a terminal session that lives on the server itself, independent of your SSH connection. You can detach from it, disconnect entirely, come back later, and reattach — with your job still running.

The screen workflow

Start a new named session (give it a meaningful name):

screen -S assembly

You are now inside the screen session. Start your long-running command as normal, e.g. a long assembly or download. Then detach — leave it running and return to your normal terminal — by pressing:

Ctrl+A   then   D

(Hold Ctrl and press A, release both, then press D.) Your job keeps running on the server. You can now safely log out.

Later — even from a different computer — log back in and list your sessions:

screen -ls

Reattach to your session and pick up where you left off:

screen -r assembly

When the work is finished and you no longer need the session, close it permanently by typing exit inside it.
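
For scripted use, screen can also start a session that is already detached (the -d -m options), so no key presses are needed. The sketch below wraps this in a hypothetical helper called run_in_screen; the session name and the command you pass to it are placeholders, and the helper simply reports an error if screen is not installed.

```shell
# Hypothetical helper: start a command in a new, already-detached screen session.
# Usage: run_in_screen SESSION_NAME COMMAND [ARGS...]
run_in_screen() {
    local name="$1"
    shift
    if ! command -v screen >/dev/null 2>&1; then
        echo "screen is not installed on this machine" >&2
        return 127
    fi
    # -d -m: start the session detached; -S: give it a name.
    screen -dmS "$name" "$@"
}
```

For example, run_in_screen assembly bash -c 'sleep 300; echo finished > assembly.done' starts the job and returns immediately. screen -ls should then list the session, screen -r assembly attaches to it, and a session started this way ends on its own when the command exits.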

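Quick command summary

screen -S name      start a new session called name
Ctrl+A then D       detach from the current session (the job keeps running)
screen -ls          list your sessions
screen -r name      reattach to the session called name
exit                close the current session permanently (typed inside it)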

Note: Ctrl+A is the command key for screen — it tells screen "the next key is an instruction for you." That is why detaching is two steps: Ctrl+A (get screen's attention) then D (detach).

Exercises

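  1. Count the reads in a compressed FASTQ file by piping zcat into wc -l and dividing the line count by 4. Compare the result with zcat ... | grep "^@" | wc -l, which can overcount when quality lines begin with @.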
  2. Archive a directory with tar -czf, then list its contents with tar -tzf without extracting.
  3. Start a named screen session, run a long command (e.g. sleep 300 or a real download), detach with Ctrl+A then D, log out of the server, log back in, and reattach with screen -r. Confirm the command kept running.

SAIAB AGRP Bioinformatics Training. Open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0).