Compressed Files and screen
Learning objectives
- Explain why sequencing data is stored compressed
- Compress and decompress files with gzip and gunzip
- Read and process compressed files without decompressing them, using zcat and pipes
- Bundle directories into a single archive with tar
- Keep long-running jobs alive after disconnecting, using screen
Approximate time: 25 minutes
Part 1 — Compressed files
Why is sequencing data always compressed?
Raw sequencing files are enormous — a single Nanopore or Illumina run can produce many gigabytes of reads. Storing and transferring these as plain text would be slow and wasteful. Compression shrinks them dramatically (often 3–5×), so sequencing data is almost always distributed as .gz files (gzip format), for example reads.fastq.gz.
The good news: most bioinformatics tools read .gz files directly, so you rarely need to decompress them at all.
Compressing and decompressing: gzip and gunzip
Compress a file with gzip. The original is replaced by a .gz version:
gzip file.fastq # produces file.fastq.gz (file.fastq is removed)
Decompress it again with gunzip:
gunzip file.fastq.gz # produces file.fastq (file.fastq.gz is removed)
Note: by default both commands replace the input file. To keep the original as well, use gzip -k file.fastq (the -k flag = "keep").
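As a minimal sketch of the keep behaviour, the following builds a made-up FASTQ file in a temporary directory (the file name demo.fastq and its contents are illustrative only), compresses it with -k, and compares the two sizes. It assumes a gzip version that supports -k (1.6 or newer); on older versions, gzip -c demo.fastq > demo.fastq.gz gives the same result.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical toy input: 1000 short FASTQ records.
for i in $(seq 1 1000); do
    printf '@read%s\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' "$i"
done > demo.fastq

# -k keeps demo.fastq and writes demo.fastq.gz next to it.
gzip -k demo.fastq

# Compare sizes in bytes (reading from stdin makes wc print only the number).
orig_bytes=$(wc -c < demo.fastq)
gz_bytes=$(wc -c < demo.fastq.gz)
echo "original: $orig_bytes bytes, compressed: $gz_bytes bytes"
```

Highly repetitive toy data like this compresses far better than real reads, so the ratio printed here is not representative of real sequencing files.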
Reading compressed files without decompressing
You usually do not want to decompress a multi-gigabyte file just to peek at it. Use zcat — it is like cat, but reads gzip files on the fly. Combine it with head to view just the first few lines:
zcat file.fastq.gz | head
Because zcat writes to standard output, you can pipe it straight into any tool. For example, count the number of reads in a compressed FASTQ file (each read header starts with @):
zcat reads.fastq.gz | grep "^@" | wc -l
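This reads the compressed file, keeps only lines that start with @, and counts them, without ever writing an uncompressed copy to disk. Treat the result as an estimate: quality strings can also begin with @, so for standard four-line FASTQ records the exact count is the number of lines reported by zcat reads.fastq.gz | wc -l divided by 4.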
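A minimal sketch of an exact read count, assuming standard four-line FASTQ records (no wrapped sequence lines). It builds a made-up two-read file whose second quality line happens to start with @, then compares the grep count with the line count divided by four. The sketch uses gzip -dc, which is equivalent to zcat.

```shell
workdir=$(mktemp -d)

# Hypothetical two-read FASTQ; the second read's quality line begins with "@".
printf '@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n@III\n' > "$workdir/reads.fastq"
gzip "$workdir/reads.fastq"

# Count lines that start with "@" (headers plus any such quality lines).
header_like=$(gzip -dc "$workdir/reads.fastq.gz" | grep -c '^@')

# Count all lines and divide by 4, since each read occupies four lines.
total_lines=$(gzip -dc "$workdir/reads.fastq.gz" | wc -l)
reads=$(( total_lines / 4 ))

echo "lines starting with @: $header_like"
echo "reads (lines / 4):     $reads"
```

On this toy file the grep count is 3 while the true number of reads is 2.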
Archiving directories with tar
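gzip compresses a single file. To bundle a whole directory into one file, use tar. The common flags are -c (create), -x (extract), -t (list), -z (gzip-compress), and -f (the archive filename, which must come directly after the flag, so f goes last when flags are combined as in -czf).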
Create a compressed archive of a directory:
tar -czf results.tar.gz results/
Extract it again:
tar -xzf results.tar.gz
List the contents without extracting (useful to check what is inside first):
tar -tzf results.tar.gz
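The three commands can be tried end to end on a throwaway folder. This is a sketch under stated assumptions: the directory and file names are made up, and the -C option (extract into a given existing directory) is an addition not covered above, used here so the extracted copy does not overwrite the original folder.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical results folder with two small files.
mkdir -p results
echo "contig_1 length=1200" > results/summary.txt
echo "done" > results/run.log

# Create the archive, then list its contents without extracting.
tar -czf results.tar.gz results/
tar -tzf results.tar.gz

# Extract into a separate directory; -C must name an existing directory.
mkdir restored
tar -xzf results.tar.gz -C restored
ls restored/results
```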
Tip: a .tar.gz archive is the standard way to send a whole results folder to a colleague, or to transfer many small files in one go with scp. One big file transfers far faster than thousands of tiny ones.
Part 2 — Keeping jobs running with screen
The problem: disconnects kill your jobs
When you start a long command in an SSH session and then close your laptop, lose Wi-Fi, or log out, the connection drops. The shell then sends a hangup signal to the programs started from that terminal, which normally kills them. For an assembly that runs for hours, that is a disaster.
screen solves this. It creates a terminal session that lives on the server itself, independent of your SSH connection. You can detach from it, disconnect entirely, come back later, and reattach — with your job still running.
The screen workflow
Start a new named session (give it a meaningful name):
screen -S assembly
You are now inside the screen session. Start your long-running command as normal, e.g. a long assembly or download. Then detach — leave it running and return to your normal terminal — by pressing:
Ctrl+A then D
(Hold Ctrl and press A, release both, then press D.) Your job keeps running on the server. You can now safely log out.
Later — even from a different computer — log back in and list your sessions:
screen -ls
Reattach to your session and pick up where you left off:
screen -r assembly
When the work is finished and you no longer need the session, close it permanently by typing exit inside it.
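The same workflow can also be driven without ever attaching, which is handy inside scripts. The sketch below is an alternative to the interactive steps, not part of them: it uses screen's -d -m options to start a session that is already detached, with sleep 30 standing in for a real job and a made-up session name. It skips itself if screen is not installed.

```shell
# Hypothetical session name; $$ (this shell's PID) keeps it unique.
session="demo_$$"

if command -v screen >/dev/null 2>&1; then
    # -d -m: start the session already detached; the command runs inside it.
    screen -dmS "$session" sleep 30
    sleep 1

    # The session should now appear in the list.
    screen -ls | grep "$session" || echo "session not listed"

    # Clean up: tell the session to quit, which also stops the placeholder job.
    screen -S "$session" -X quit
    outcome="ran"
else
    echo "screen is not installed on this machine"
    outcome="skipped"
fi
echo "demo $outcome"
```

For a real job you would replace sleep 30 with your own command and omit the quit step; the session ends on its own when the command finishes, and until then screen -r followed by the session name attaches to it.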
Quick command summary
- screen: start a new (unnamed) session
- screen -S jobname: start a named session
- Ctrl+A then D: detach (job keeps running)
- screen -ls: list active sessions
- screen -r jobname: reattach to a session
- exit: close the current session permanently
Note: Ctrl+A is the command key for screen. It tells screen "the next key is an instruction for you." That is why detaching is two steps: Ctrl+A (get screen's attention), then D (detach).
Exercises
- Count the reads in a compressed FASTQ file using zcat ... | grep "^@" | wc -l.
- Archive a directory with tar -czf, then list its contents with tar -tzf without extracting.
- Start a named screen session, run a long command (e.g. sleep 300 or a real download), detach with Ctrl+A then D, log out of the server, log back in, and reattach with screen -r. Confirm the command kept running.
SAIAB AGRP Bioinformatics Training. Open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0).