Conda for Bioinformatics

Managing Bioinformatics Software in Your Home Directory

Page 1: Introduction to Conda

What is Conda?

Conda is a powerful package and environment management system that allows you to install, update, and manage software packages and their dependencies entirely within your home directory. Unlike system-wide package managers that require administrator privileges, conda gives you complete control over your software environment without needing root access.

Why Use Conda?

  • No Admin Rights Required: Install complex software stacks in your home directory without bothering system administrators.
  • Dependency Management: Conda automatically resolves and installs all required dependencies, preventing the "dependency hell" that often plagues manual installations.
  • Environment Isolation: Create separate environments for different projects, preventing conflicts between different versions of the same software.
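  • Cross-Platform: Conda itself runs on Linux, macOS, and Windows. Bioconda packages, however, are built for Linux and macOS only, which is why Windows users should work inside WSL.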
  • Scientific Computing Focus: Excellent support for Python, R, scientific libraries, and data science tools.

What You Can Install with Conda

  • Python and R interpreters with different versions
  • Scientific libraries (NumPy, SciPy, Pandas, Matplotlib)
  • Machine learning frameworks (TensorFlow, PyTorch, scikit-learn)
  • Bioinformatics tools (BWA, SAMtools, BLAST, GATK, STAR)
  • Development tools (Git, editors, compilers)
  • System utilities and command-line tools

Course Prerequisites

  • Basic familiarity with command-line interface
  • Access to a terminal (Linux, macOS, or Windows with WSL)
  • At least 2GB of free space in your home directory

Page 2: Installing Conda

Choosing Your Conda Distribution

Miniconda (Recommended): Minimal installation with just conda and Python. Smaller download, faster installation, and you install only what you need.

Anaconda: Full distribution with 250+ pre-installed packages. Larger but comes with many common scientific computing tools.

Installing Miniconda (Linux/macOS)

Step 1: Download the installer

# For Linux (64-bit)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# For macOS (Intel); macOS ships curl rather than wget
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

# For macOS (Apple Silicon)
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh

Step 2: Run the installer

# Make it executable and run
chmod +x Miniconda3-latest-*.sh
./Miniconda3-latest-*.sh

Step 3: Follow the prompts

  • Press Enter to read the license
  • Type "yes" to accept the license
  • Accept the default installation location (in your home directory)
  • Type "yes" when asked to initialize conda

Step 4: Restart your terminal or reload your shell configuration

source ~/.bashrc   # bash; with zsh (the macOS default) use: source ~/.zshrc

Installing on Windows

Download the Windows installer from https://conda.io/miniconda.html and run it. The installer will guide you through the process.

Verifying Installation

# Check conda version
conda --version

# Check conda info
conda info

# List installed packages
conda list

If these commands work, conda is successfully installed!

Installation Location

By default, conda installs to:

  • Linux/macOS: ~/miniconda3/ or ~/anaconda3/
  • Windows: C:\Users\<username>\miniconda3\

Everything conda manages stays within this directory in your home space.

Page 3: Getting Started with Conda

Understanding the Base Environment

When conda is installed, it creates a "base" environment containing conda itself and a Python installation. For bioinformatics work, you'll primarily create specialized environments for different analyses rather than working in the base environment.

# List all environments (the active one is marked with *)
conda info --envs

# Check conda version
conda --version

# See what's installed in current environment
conda list

Essential Conda Commands for Bioinformatics

Keeping conda updated

# Keep conda itself updated (it lives in the base environment)
conda update -n base conda

Setting up bioinformatics channels

The bioconda channel is essential for bioinformatics software. Set it up first:

# Add essential channels for bioinformatics
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# Set channel priority (recommended)
conda config --set channel_priority strict
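
The order of the three `--add` commands matters because each `conda config --add channels` call places the new channel at the top of the list. The sketch below only mimics that prepend behaviour in plain shell (it does not call conda) to show the priority order that results:

```shell
# Mimic `conda config --add channels`: every added channel goes to the front of the list.
channels=""
for ch in defaults bioconda conda-forge; do
    channels="$ch${channels:+ $channels}"
done

# Highest priority first
echo "$channels"
# conda-forge bioconda defaults
```

Running `conda config --show channels` after the real commands should list the channels in the same order.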

Basic bioinformatics package operations

# Search for bioinformatics tools
conda search samtools
conda search bwa
conda search blast

# Install bioinformatics software
conda install samtools

# Install specific version (important for reproducibility)
conda install samtools=1.15

# Install multiple related tools
conda install samtools bcftools htslib

# Update bioinformatics tools
conda update samtools

# Remove tools
conda remove samtools

Your First Bioinformatics Package Installation

Let's install some commonly used bioinformatics tools:

# Install essential sequence analysis tools
conda install -c bioconda samtools bcftools bwa bowtie2 hisat2

# Install quality control tools
conda install -c bioconda fastqc multiqc trimmomatic

# Verify installation
samtools --version
bwa
fastqc --version
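
Not every tool accepts a `--version` flag (`bwa`, for example, prints its usage text and returns a non-zero exit status when run without arguments), so a check based on `command -v` is a more uniform way to confirm that the executables are on your PATH. A minimal sketch, where `check_tools` is a helper name made up for this example:

```shell
# check_tools NAME...: report which of the named executables are on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "OK       $tool -> $(command -v "$tool")"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools samtools bwa fastqc || echo "Install the missing tools or activate the right environment."
```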

Understanding Bioinformatics Channels

Key channels for bioinformatics:

  • bioconda: Primary source for bioinformatics software (6000+ packages)
  • conda-forge: Community-maintained packages, including Python libraries
  • defaults: Anaconda's main channel
  • r: Anaconda's legacy channel for R packages (conda-forge now carries most R packages)

# Install from specific channels
conda install -c bioconda blast
conda install -c conda-forge biopython
conda install -c conda-forge r-ggplot2

# Search in bioconda specifically
conda search -c bioconda "gatk*"

Bioinformatics Tip: Version Control

Always specify exact versions for critical analysis tools to ensure reproducibility. Many bioinformatics tools have version-specific behaviors that can affect results.

Practice Exercises

  1. Search for available versions of BLAST
  2. Install FastQC and check its version
  3. Install the latest version of BWA-MEM2
  4. List all currently installed bioinformatics packages
  5. Search for packages related to "assembly" in bioconda

Page 4: Bioinformatics Environments

Why Environments Are Critical in Bioinformatics

Bioinformatics workflows often require specific tool versions, and different analyses may need conflicting dependencies. Environments solve this by creating isolated spaces for each project or analysis type.

Common Bioinformatics Environment Patterns

  • Project-specific: One environment per research project
  • Analysis-specific: Separate environments for RNA-seq, ChIP-seq, variant calling, etc.
  • Tool-specific: Environments for complex tools with many dependencies (e.g., GATK, Nextflow)
  • Pipeline-specific: Environments matching published workflow requirements

Creating Bioinformatics Environments

RNA-seq analysis environment

# Create RNA-seq analysis environment
conda create --name rnaseq python=3.9

# Activate and install RNA-seq tools
conda activate rnaseq
conda install -c bioconda star salmon hisat2 stringtie htseq subread
conda install -c conda-forge numpy pandas matplotlib seaborn jupyter

Variant calling environment

# Create variant calling environment
conda create --name variantcalling python=3.8
conda activate variantcalling
conda install -c bioconda gatk4 samtools bcftools bwa picard
conda install -c conda-forge pandas
conda install -c bioconda vcftools   # vcftools is distributed through bioconda

Genome assembly environment

# Create assembly environment
conda create --name assembly python=3.9
conda activate assembly
conda install -c bioconda spades flye canu quast busco
conda install -c conda-forge matplotlib seaborn

Managing Bioinformatics Environments

Working with environments

# List all environments
conda env list

# Activate specific environment
conda activate rnaseq

# Check what's installed in current environment
conda list

# Show details about the active environment and the conda configuration
conda info

# Deactivate environment
conda deactivate

Environment documentation for reproducibility

# Export environment for sharing/publication
conda env export > rnaseq_environment.yml

# Export pinned versions without build strings (more portable across platforms)
conda env export --no-builds > rnaseq_environment_exact.yml

# Create environment from published requirements
conda env create -f published_workflow.yml

Bioinformatics Environment Files

RNA-seq environment.yml example

name: rnaseq-analysis
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.9
  - star=2.7.10a
  - salmon=1.9.0
  - hisat2=2.2.1
  - stringtie=2.2.1
  - htseq=2.0.2
  - subread=2.0.3
  - samtools=1.15
  - pandas=1.5.0
  - numpy=1.23.0
  - matplotlib=3.6.0
  - seaborn=0.11.2
  - jupyter=1.0.0
  - pip=22.0
  - pip:
    - multiqc==1.13

Complex workflow environment

name: chip-seq-pipeline
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.8
  - bwa=0.7.17
  - samtools=1.15
  - bedtools=2.30.0
  - macs2=2.2.7.1
  - deeptools=3.5.1
  - homer=4.11
  - fastqc=0.11.9
  - trim-galore=0.6.7
  - picard=2.27.4
  - r-base=4.2.0
  - bioconductor-diffbind=3.8.0
  - bioconductor-chipseeker=1.34.0

Environment Management Best Practices

Bioinformatics Environment Guidelines

  1. Descriptive naming: Use names like cancer-rnaseq-2024 or assembly-pacbio
  2. Version pinning: Always specify versions for critical tools
  3. Documentation: Export environment.yml files with your analysis
  4. Minimal environments: Don't install everything in one environment
  5. Testing environments: Create test environments for new tool versions
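
The version-pinning guideline can be checked mechanically. The following is a rough sketch rather than a full YAML parser: it assumes the two-space indentation used in the environment files in this course, and `list_unpinned` is a helper name invented for this example. It prints every conda dependency that carries no version constraint.

```shell
# list_unpinned FILE: print conda dependencies in FILE that have no version constraint.
list_unpinned() {
    awk '
        /^dependencies:/ { in_deps = 1; next }
        /^[^ #-]/        { in_deps = 0 }     # any other top-level key ends the section
        in_deps && /^  - [^ ]/ {             # conda dependencies only, not the nested pip list
            dep = $2
            if (dep ~ /:$/) next             # skip the "pip:" sub-list marker
            if (dep !~ /[=<>]/) print dep
        }
    ' "$1"
}

# Check the file in the current directory, if there is one.
if [ -f environment.yml ]; then
    list_unpinned environment.yml
fi
```

An empty result means every conda dependency is pinned. Packages in the nested `pip:` list are not inspected.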

Sharing Environments for Reproducible Research

# Create environment from collaborator's file
conda env create -f collaborator_environment.yml

# Update existing environment from new requirements
conda env update -f updated_requirements.yml --prune

# Export minimal requirements (only explicitly installed)
conda env export --from-history > minimal_requirements.yml

Reproducibility Tip

Always include your environment.yml file with your analysis code and data. This allows others to recreate your exact computational environment, ensuring reproducible results.

Practice Exercises

  1. Create a metagenomics environment with Kraken2, MetaPhlAn, and QIIME2
  2. Set up a phylogenetics environment with RAxML, IQ-TREE, and FigTree
  3. Export your RNA-seq environment to a YAML file
  4. Create an environment from a provided environment.yml file
  5. Set up separate environments for Python and R-based analyses

Page 5: Managing Bioinformatics Software

Understanding the Bioinformatics Package Ecosystem

Bioinformatics software comes from multiple sources with different update frequencies and dependency requirements:

  • Bioconda packages: Pre-compiled bioinformatics tools
  • PyPI packages: Python libraries via pip
  • R/Bioconductor: Statistical and genomics packages
  • Direct downloads: Some tools still require manual installation

Advanced Bioinformatics Package Installation

Version-specific installations

# Install specific tool versions for reproducibility
conda install -c bioconda "bwa=0.7.17"
conda install -c bioconda "gatk4=4.2.6.1"

# Install compatible version ranges
conda install -c bioconda "samtools>=1.12,<1.16"

# Install the latest patch version
conda install -c bioconda "blast=2.13.*"
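
To make the range syntax concrete, the sketch below tests a few version strings against `>=1.12,<1.16` using `sort -V` (available in GNU sort and recent BSD sort). It is only an illustration of what the range means: conda's own matcher also handles build strings and other cases, and the helper names `version_lt` and `in_range` are made up here.

```shell
# version_lt A B: succeed when version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# in_range V LOW HIGH: succeed when LOW <= V < HIGH.
in_range() {
    ! version_lt "$1" "$2" && version_lt "$1" "$3"
}

for v in 1.11 1.12 1.15.1 1.16; do
    if in_range "$v" 1.12 1.16; then
        echo "$v: satisfies >=1.12,<1.16"
    else
        echo "$v: excluded"
    fi
done
```

Note that a single `=` in conda (as in `bwa=0.7.17`) is a prefix match, so it also accepts later builds of that version, while `==` requires the exact version string.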

Installing complex bioinformatics suites

# Install GATK with all dependencies
conda install -c bioconda gatk4

# Install Nextflow for workflow management
conda install -c bioconda nextflow

# Install complete R/Bioconductor environment
conda install -c conda-forge r-base r-essentials
conda install -c bioconda bioconductor-deseq2 bioconductor-edger

Mixing conda and pip for Python packages

# Install conda packages first (compiled dependencies)
conda install -c conda-forge numpy scipy pandas matplotlib
conda install -c bioconda pysam pyvcf

# Then install pip packages
pip install multiqc
pip install pydeseq2
pip install scanpy  # Single-cell analysis
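
After mixing the two installers it helps to know which packages pip put into the environment, since conda does not manage their dependencies. `conda list` marks such packages with `pypi` in its channel column, and the sketch below filters on that column. `pip_installed` is a helper name chosen for this example.

```shell
# pip_installed: read `conda list` output on stdin and print "name version"
# for every package whose channel column is "pypi".
pip_installed() {
    awk '!/^#/ && $4 == "pypi" { print $1, $2 }'
}

# Run against the active environment when conda is available.
if command -v conda >/dev/null 2>&1; then
    conda list | pip_installed
fi
```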

Managing Tool Versions and Dependencies

Checking installed versions

# Check versions of key tools
samtools --version
bwa
gatk --version
python -c "import pandas; print(pandas.__version__)"

# List all packages with versions
conda list

# List the versions available in your configured channels
conda search samtools

Handling version conflicts

# Create minimal environment for conflicting tools
conda create -n gatk-latest python=3.8
conda activate gatk-latest
conda install -c bioconda gatk4

# Use mamba for faster conflict resolution
conda install -c conda-forge mamba
mamba install -c bioconda complex-tool-set

Essential Bioinformatics Tool Categories

Sequence alignment and mapping

# Short read aligners
conda install -c bioconda bwa bowtie2 hisat2 star

# Long read aligners
conda install -c bioconda minimap2 ngmlr

# Multiple sequence alignment
conda install -c bioconda muscle mafft clustalw

Variant calling and analysis

# Variant callers
conda install -c bioconda gatk4 freebayes bcftools

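# Variant annotation (VEP is packaged as ensembl-vep; ANNOVAR is not distributed
# through Bioconda and has to be obtained from its authors)
conda install -c bioconda snpeff ensembl-vep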

# Variant filtering and analysis
conda install -c bioconda vcftools bedtools

Assembly and annotation

# Genome assemblers
conda install -c bioconda spades megahit flye canu

# Assembly quality assessment
conda install -c bioconda quast busco

# Gene prediction and annotation
conda install -c bioconda augustus prokka maker

Transcriptomics tools

# RNA-seq quantification (featureCounts is part of the subread package)
conda install -c bioconda salmon kallisto htseq subread

# Transcript assembly
conda install -c bioconda stringtie trinity cufflinks

# Differential expression (R packages)
conda install -c bioconda bioconductor-deseq2 bioconductor-edger

Quality Control and Visualization

# Quality control tools
conda install -c bioconda fastqc multiqc trimmomatic

# Visualization tools
conda install -c bioconda igv deeptools

# Statistical analysis (R)
conda install -c conda-forge r-base r-ggplot2 r-dplyr

Workflow Management Tools

# Workflow engines
conda install -c bioconda nextflow snakemake

# Container tools
conda install -c conda-forge singularity

# Interactive notebooks
conda install -c conda-forge jupyterlab notebook

Common Installation Issues

  • Dependency conflicts: Use separate environments for conflicting tools
  • Channel mixing: Keep the channel order consistent (conda-forge, then bioconda, then defaults) and use strict channel priority
  • Version pinning: Some tools require specific Python versions
  • Memory issues: Large tools may need more RAM during installation

Updating and Maintaining Bioinformatics Software

# Update specific tools (be careful with versions)
conda update -c bioconda samtools

# Update all tools in environment (risky for reproducibility)
conda update --all

# Check what would be updated without doing it
conda update --dry-run --all

# Downgrade if needed
conda install -c bioconda "samtools=1.12"

Reproducibility Warning

Be very careful when updating bioinformatics tools during an active analysis. Different versions can produce different results. Always test updates in a separate environment first.
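
One way to see exactly what an update changed is to compare `conda list --export` snapshots taken before and after. Each non-comment line of that output has the form `name=version=build`. The sketch below assumes two such snapshot files; the file names, environment names, and the helper name `changed_packages` are placeholders.

```shell
# changed_packages BEFORE AFTER: compare two `conda list --export` snapshots and
# print packages that are new, or whose version differs, in AFTER.
# Packages that were removed are not reported.
changed_packages() {
    awk -F= '
        /^#/ || NF < 2 { next }                  # skip comments and blank lines
        FNR == NR { before[$1] = $2; next }      # first file: remember versions
        !($1 in before) { print $1 ": (new) " $2; next }
        before[$1] != $2 { print $1 ": " before[$1] " -> " $2 }
    ' "$1" "$2"
}

# Typical use:
#   conda list -n production-env --export > before.txt
#   conda list -n test-updated   --export > after.txt
#   changed_packages before.txt after.txt
```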

Practice Exercises

  1. Install a complete variant calling pipeline (BWA, GATK4, Picard)
  2. Set up an environment for single-cell RNA analysis (scanpy, cellranger)
  3. Install phylogenetic analysis tools (RAxML, IQ-TREE, FigTree)
  4. Create a metagenomics analysis environment (Kraken2, Bracken, MetaPhlAn)
  5. Install and configure Nextflow with required dependencies

Page 6: Advanced Workflow Management

Project-Based Environment Organization

For complex bioinformatics projects, organize environments by analysis workflow rather than individual tools:

Multi-environment project structure

# Main analysis environment
conda create -n cancer-study-main python=3.9
conda activate cancer-study-main
conda install -c bioconda bwa gatk4 samtools bcftools

# Quality control environment  
conda create -n cancer-study-qc python=3.9
conda activate cancer-study-qc
conda install -c bioconda fastqc multiqc trimmomatic

# Visualization and reporting environment
conda create -n cancer-study-viz r-base=4.2
conda activate cancer-study-viz
conda install -c conda-forge r-ggplot2 r-dplyr jupyter

Environment Variables for Bioinformatics

Setting up project-specific variables

# Activate your project environment
conda activate cancer-study-main

# Set project-specific environment variables
conda env config vars set PROJECT_DIR=/home/user/cancer_study
conda env config vars set REFERENCE_GENOME=/data/genomes/hg38/hg38.fa
conda env config vars set SAMPLE_SHEET=/home/user/cancer_study/samples.csv
conda env config vars set RESULTS_DIR=/home/user/cancer_study/results

# Database paths
conda env config vars set GATK_BUNDLE=/data/gatk_bundle
conda env config vars set ANNOTATION_DB=/data/annotations

# Tool-specific settings
conda env config vars set JAVA_OPTS="-Xmx8g"
conda env config vars set OMP_NUM_THREADS=8

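# Reactivate the environment so the new variables take effect
conda deactivate
conda activate cancer-study-main

# List all environment variables
conda env config vars list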
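
Most tools do not read these variables on their own, so analysis scripts have to pass them on explicitly. The sketch below assembles a GATK command from `JAVA_OPTS`, `REFERENCE_GENOME`, and `RESULTS_DIR`, falling back to defaults when a variable is unset. It prints the command instead of running it so that it works without GATK installed. The function name `variant_cmd`, the fallback values, and the sample name are assumptions made for illustration.

```shell
# variant_cmd SAMPLE: print (not run) a HaplotypeCaller command built from the
# variables defined in the environment, with fallbacks when they are unset.
variant_cmd() {
    sample="$1"
    java_opts="${JAVA_OPTS:--Xmx4g}"                # fallback heap size
    reference="${REFERENCE_GENOME:-reference.fa}"   # fallback path is a placeholder
    results="${RESULTS_DIR:-results}"
    echo "gatk --java-options \"$java_opts\" HaplotypeCaller -R $reference -I ${sample}.bam -O $results/${sample}.vcf.gz"
}

variant_cmd tumor01
```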

Automated Environment Setup

Advanced environment.yml with variables

name: rnaseq-pipeline
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.9
  - star=2.7.10a
  - salmon=1.9.0
  - samtools=1.15
  - stringtie=2.2.1
  - htseq=2.0.2
  - multiqc=1.13
  - fastqc=0.11.9
  - trim-galore=0.6.7
  - r-base=4.2.0
  - bioconductor-deseq2=1.38.0
  - jupyter=1.0.0
  - pip=22.0
  - pip:
    - pydeseq2
variables:
  PROJECT_DIR: "/home/user/rnaseq_project"
  REFERENCE_DIR: "/data/references/human"
  THREADS: "8"
  MEMORY: "32G"

Activation scripts for automated setup

# Create activation script directory
mkdir -p ~/miniconda3/envs/rnaseq-pipeline/etc/conda/activate.d/

# Create environment setup script
cat > ~/miniconda3/envs/rnaseq-pipeline/etc/conda/activate.d/setup.sh << 'EOF'
#!/bin/bash
echo "==================================="
echo "RNA-seq Pipeline Environment Active"
echo "==================================="
echo "Project Directory: $PROJECT_DIR"
echo "Reference Directory: $REFERENCE_DIR"
echo "Available threads: $THREADS"
echo "Memory allocation: $MEMORY"
echo "==================================="

# Create project directories if they don't exist (skipped when PROJECT_DIR is unset)
if [ -n "$PROJECT_DIR" ]; then
    mkdir -p "$PROJECT_DIR"/{data,results,scripts,logs}
    mkdir -p "$PROJECT_DIR"/results/{alignments,counts,diffexp,qc}
fi

# Set up useful aliases
alias run-star="STAR --runThreadN $THREADS"
alias run-salmon="salmon quant --threads $THREADS"
alias check-samples="ls $PROJECT_DIR/data/*.fastq.gz | wc -l"

EOF
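
Aliases defined by an activation script stay defined after `conda deactivate` unless something removes them. Conda also sources scripts from `etc/conda/deactivate.d/` when an environment is deactivated, so a matching cleanup script can undo the setup. A minimal sketch, assuming the alias names used in the activation script; `write_deactivate_hook` is a helper name invented here, and it takes the environment prefix as its argument.

```shell
# write_deactivate_hook PREFIX: create a cleanup script that conda sources
# when the environment located at PREFIX is deactivated.
write_deactivate_hook() {
    hook_dir="$1/etc/conda/deactivate.d"
    mkdir -p "$hook_dir"
    cat > "$hook_dir/cleanup.sh" << 'EOF'
# Remove the aliases created by the activation script (ignore any that are absent).
unalias run-star run-salmon check-samples 2>/dev/null || true
EOF
}

# Example call for the environment used in this section:
#   write_deactivate_hook "$HOME/miniconda3/envs/rnaseq-pipeline"
```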

Template Environments for Common Analyses

Creating analysis templates

# Create template for genomics analyses
conda create -n genomics-template python=3.9
conda activate genomics-template
conda install -c bioconda samtools bcftools bedtools tabix
conda install -c conda-forge pandas numpy matplotlib

# Clone template for specific projects
conda create -n project1-genomics --clone genomics-template
conda create -n project2-genomics --clone genomics-template

# Create RNA-seq template
conda create -n rnaseq-template python=3.9
conda activate rnaseq-template
conda install -c bioconda star salmon stringtie htseq multiqc
conda install -c conda-forge jupyter pandas matplotlib seaborn

# Clone for different RNA-seq projects
conda create -n mouse-rnaseq --clone rnaseq-template
conda create -n human-rnaseq --clone rnaseq-template

Integration with Workflow Managers

Nextflow with conda environments

// Nextflow configuration for conda
// nextflow.config
conda.enabled = true
conda.cacheDir = "$HOME/conda-cache"

process {
    withName: 'FASTQC' {
        conda = 'bioconda::fastqc=0.11.9'
    }
    
    withName: 'TRIMMING' {
        conda = 'bioconda::trim-galore=0.6.7'
    }
    
    withName: 'ALIGNMENT' {
        conda = 'bioconda::star=2.7.10a bioconda::samtools=1.15'
    }
    
    withName: 'QUANTIFICATION' {
        conda = 'bioconda::salmon=1.9.0'
    }
}

Snakemake with conda environments

# Snakefile with conda environments
rule fastqc:
    input: "samples/{sample}.fastq.gz"
    output: "qc/{sample}_fastqc.html"
    conda: "envs/qc.yaml"
    shell: "fastqc {input} -o qc/"

rule align:
    input: "samples/{sample}.fastq.gz"
    output: "alignments/{sample}.bam"
    conda: "envs/alignment.yaml"
    threads: 8
    shell: "bwa mem -t {threads} reference.fa {input} | samtools sort -o {output}"
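
The `conda:` directives in the Snakefile point at environment files that are not shown. A plausible way to create them is sketched below. The versions are taken from earlier examples in this course, the channel order follows the Bioconda recommendation (conda-forge first), and the `ENV_DIR` variable is an assumption of this example.

```shell
# Directory that the Snakefile's `conda:` paths refer to (override by setting ENV_DIR).
ENV_DIR="${ENV_DIR:-envs}"
mkdir -p "$ENV_DIR"

# Environment for the fastqc rule
cat > "$ENV_DIR/qc.yaml" << 'EOF'
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc=0.11.9
EOF

# Environment for the align rule (bwa piped into samtools)
cat > "$ENV_DIR/alignment.yaml" << 'EOF'
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa=0.7.17
  - samtools=1.15
EOF
```

Snakemake builds these environments on first use when started with `--use-conda`, for example `snakemake --use-conda --cores 8` (newer releases also accept `--software-deployment-method conda`).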

Performance Optimization for Large Datasets

Memory and CPU optimization

# Set memory limits for Java tools
conda env config vars set JAVA_OPTS="-Xmx32g -XX:ParallelGCThreads=4"

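# Set thread counts. OMP_NUM_THREADS is read by OpenMP-based tools; STAR_THREADS and
# BWA_THREADS are project conventions that your scripts must pass on,
# e.g. STAR --runThreadN "$STAR_THREADS"
conda env config vars set OMP_NUM_THREADS=16
conda env config vars set STAR_THREADS=16
conda env config vars set BWA_THREADS=16

# Use the faster libmamba solver (the default since conda 23.10; older versions need the plugin)
conda install -n base conda-libmamba-solver
conda config --set solver libmamba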

Sharing and Archiving Analysis Environments

Complete project packaging

# Export pinned versions without build strings
conda env export --no-builds > analysis_environment.yml

# Create explicit package list
conda list --explicit > explicit_packages.txt

# Export minimal requirements
conda env export --from-history > minimal_requirements.yml

# Create archive with environment and scripts
tar -czf project_archive.tar.gz \
    analysis_environment.yml \
    scripts/ \
    README.md \
    parameters.txt

Publication Tip

When publishing research, include your environment.yml file as supplementary material. This allows other researchers to recreate your exact computational environment and reproduce your results.

Practice Exercises

  1. Create a multi-environment setup for a ChIP-seq analysis project
  2. Set up environment variables for a variant calling pipeline
  3. Create an activation script that sets up project directories automatically
  4. Design template environments for your most common analysis types
  5. Configure Nextflow or Snakemake to use conda environments

Page 7: Best Practices and Troubleshooting

Best Practices for Bioinformatics Conda Usage

Environment naming and organization

  1. Descriptive names: cancer-rnaseq-2024, covid-assembly-oxford
  2. Include analysis type: variantcalling, metagenomics, phylogeny
  3. Add organism or dataset: mouse-rnaseq, bacterial-assembly
  4. Version important analyses: paper-analysis-v1, paper-analysis-final
  5. Separate by workflow stage: preprocessing, analysis, visualization

Version control strategy

# Good: Pin critical tool versions
conda install -c bioconda "gatk4=4.2.6.1" "samtools=1.15" "bwa=0.7.17"

# Good: Use version ranges for minor updates
conda install -c conda-forge "python>=3.8,<3.10" "numpy>=1.20,<1.24"

# Avoid: Constantly updating during analysis
# conda update --all  # Don't do this mid-analysis!

# Good: Test updates in separate environment
conda create -n test-updated --clone production-env
conda activate test-updated
conda update gatk4

Common Bioinformatics Issues and Solutions

Issue 1: Memory errors during large dataset processing

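# Solution 1: Increase the Java heap size for GATK/Picard
# (JAVA_OPTS only helps if your scripts or wrappers pass it on; GATK4 also takes the
# setting directly, e.g. gatk --java-options "-Xmx64g" ...)
conda env config vars set JAVA_OPTS="-Xmx64g"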

# Solution 2: Use tools with lower memory requirements
conda install -c bioconda sambamba  # Instead of samtools for some operations
conda install -c bioconda minimap2  # Often more memory efficient than BWA

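# Solution 3: Test on a random subsample first
# (seqtk sample draws reads at random, it does not split the file; use the same
# seed for R1 and R2 so that paired reads stay matched)
conda install -c bioconda seqtk
seqtk sample -s100 input.fastq.gz 1000000 > subset.fastq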

Issue 2: Tool version conflicts in complex pipelines

# Solution 1: Use separate environments for conflicting tools
conda create -n old-gatk python=3.7 gatk=3.8
conda create -n new-gatk python=3.9 gatk4=4.2.6.1

# Solution 2: Use container-based solutions
conda install -c conda-forge singularity
# Pull specific tool versions in containers

# Solution 3: Use workflow managers with environment isolation
conda install -c bioconda nextflow snakemake

Issue 3: Bioconda package installation failures

# Solution 1: Use mamba for faster solving
conda install -c conda-forge mamba
mamba install -c bioconda problematic-package

# Solution 2: Specify compatible Python version
conda create -n newenv python=3.8  # Some tools need older Python
conda activate newenv
conda install -c bioconda tool-name

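# Solution 3: Install from source as a last resort (placeholder repository URL)
conda install -c conda-forge git cmake make gcc
git clone https://github.com/tool/repository.git
# Install into the active environment rather than a system path that needs root;
# this assumes the project's Makefile honours PREFIX
cd repository && make && make install PREFIX="$CONDA_PREFIX"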

Performance Optimization for Bioinformatics

Speed up conda operations

# Use mamba for faster dependency resolution
mamba install -c bioconda large-tool-set

# Enable parallel downloads
conda config --set default_threads 8

# Strict priority narrows the solver's search space; always_yes skips the
# confirmation prompts (use with care)
conda config --set channel_priority strict
conda config --set always_yes true

Optimize for large-scale analyses

# Set appropriate thread counts
conda env config vars set OMP_NUM_THREADS=32
conda env config vars set MKL_NUM_THREADS=32

# Project-level thread settings (the tools do not read these themselves;
# pass them on the command line in your scripts)
conda env config vars set STAR_THREADS=32
conda env config vars set BWA_THREADS=32
conda env config vars set GATK_THREADS=32

# Set memory limits appropriately
conda env config vars set JAVA_OPTS="-Xmx128g"
conda env config vars set TMPDIR="/fast/tmp"

Maintenance and Monitoring

Regular maintenance tasks

# Monthly: Clean package cache (can be large)
conda clean --all

# Quarterly: Update non-critical environments
conda activate development-env
conda update --all

# Check environment disk usage
du -sh ~/miniconda3/envs/*/

# Archive completed project environments
mkdir -p archive
conda env export -n completed-project > archive/project-env.yml
conda env remove -n completed-project
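
To decide which environments are worth archiving, it helps to rank them by size. A minimal sketch, assuming the default `~/miniconda3/envs` location from the installation step; `env_sizes` is a helper name made up for this example.

```shell
# env_sizes DIR: list the sub-directories of DIR, largest first, with sizes in MB.
env_sizes() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue          # nothing matched the glob
        du -sk "$d"
    done | sort -rn | awk '{ printf "%10.1f MB  %s\n", $1 / 1024, $2 }'
}

env_sizes "$HOME/miniconda3/envs"
```

Because conda hard-links files from its package cache where it can, the sizes reported per environment may add up to more than the disk space actually used.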

Monitoring environment health

# Check for broken environments
conda activate suspicious-env
conda list --explicit > test.txt

# Verify key tools work
samtools --version
python -c "import pandas; print('OK')"

# Check environment integrity (requires conda 23.5 or later)
conda doctor

Reproducibility and Documentation

Complete documentation template

# Create project documentation
cat > PROJECT_README.md << 'EOF'
# Project: Cancer Genome Analysis

## Environment Setup
```bash
conda env create -f environment.yml
conda activate cancer-analysis
```

## Software Versions
- BWA: 0.7.17
- GATK: 4.2.6.1  
- Samtools: 1.15
- Python: 3.9.7

## Analysis Commands
```bash
# Quality control
fastqc samples/*.fastq.gz

# Alignment  
bwa mem reference.fa sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o sample.bam

# Variant calling
gatk HaplotypeCaller -R reference.fa -I sample.bam -O variants.vcf
```

## Data Sources
- Reference genome: GRCh38
- Sample source: TCGA database
- Analysis date: 2024-01-15
EOF

Environment backup and recovery

# Create complete backup
mkdir -p backups/environments
conda env export -n important-analysis > backups/environments/important-analysis.yml

# Test recovery process
conda env remove -n important-analysis
conda env create -f backups/environments/important-analysis.yml

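# Automated backup script
cat > backup_environments.sh << 'EOF'
#!/bin/bash
# Export every named environment except base to backups/environments/
mkdir -p backups/environments
for env in $(conda env list | awk '!/^#/ && NF > 1 && $1 != "*" {print $1}' | grep -vx base); do
    conda env export -n "$env" > "backups/environments/${env}.yml"
done
EOF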

Troubleshooting Advanced Issues

Debugging environment problems

# Verbose installation for debugging
conda install -v -c bioconda problematic-tool

# Check channel configuration
conda config --show channels
conda config --show channel_priority

# Reset conda configuration if needed
conda config --remove-key channels
conda config --add channels bioconda
conda config --add channels conda-forge

Emergency environment recovery

# If environment is corrupted, recreate from history
conda env export --from-history -n broken-env > recovery.yml
conda env remove -n broken-env
conda env create -f recovery.yml -n recovered-env

# If that fails, recreate from explicit list
conda list --explicit -n broken-env > explicit-recovery.txt
conda create --name recovered-env --file explicit-recovery.txt

Quick Reference: Essential Commands

# Environment management
conda create -n analysis-name python=3.9
conda activate analysis-name
conda deactivate
conda env list
conda env remove -n analysis-name

# Bioinformatics package management
conda install -c bioconda tool-name
conda install -c bioconda "tool-name=version"
conda search -c bioconda tool-name
conda list
mamba install -c bioconda tool-set

# Reproducibility
conda env export > environment.yml
conda env create -f environment.yml
conda env export --from-history > minimal.yml

# Maintenance
conda clean --all
conda update -n base conda
mamba update tool-name

Final Project: Complete Bioinformatics Workflow Setup

Set up a complete variant calling analysis environment with the following requirements:

  1. Create environment named "variant-calling-hg38"
  2. Install: BWA, GATK4, Picard, Samtools, BCFtools, FastQC, MultiQC
  3. Set environment variables for reference genome and project directories
  4. Create activation script that sets up project directory structure
  5. Pin all critical tool versions for reproducibility
  6. Export environment file for sharing
  7. Create documentation with exact commands and versions

Solution Template:

# Create and setup environment
conda create -n variant-calling-hg38 python=3.8
conda activate variant-calling-hg38

# Install tools with specific versions
conda install -c bioconda \
    "bwa=0.7.17" \
    "gatk4=4.2.6.1" \
    "picard=2.27.4" \
    "samtools=1.15" \
    "bcftools=1.15" \
    "fastqc=0.11.9" \
    "multiqc=1.13"

# Set environment variables
conda env config vars set REFERENCE_GENOME="/data/genomes/GRCh38/GRCh38.fa"
conda env config vars set PROJECT_DIR="/home/user/variant_calling"
conda env config vars set GATK_BUNDLE="/data/gatk_bundle"
conda env config vars set JAVA_OPTS="-Xmx32g"
conda env config vars set THREADS="16"

# Export environment
conda env export > variant-calling-hg38.yml

# Test installation
gatk --version && echo "GATK OK"
samtools --version && echo "Samtools OK"
command -v bwa >/dev/null && echo "BWA OK"   # bwa exits non-zero when run without arguments
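
Requirement 7 asks for documentation of the exact versions in use. One way to produce that section is to filter `conda list` for the tools of interest. A sketch follows; `versions_md` is a name chosen for this example, and package names must be given as conda spells them (lower case).

```shell
# versions_md PKG...: read `conda list` output on stdin and print a Markdown
# bullet list with the version of each named package that is present.
versions_md() {
    wanted=" $* "
    awk -v wanted="$wanted" '
        /^#/ || NF < 2 { next }                              # skip header and blank lines
        index(wanted, " " $1 " ") { print "- " $1 ": " $2 }  # exact package-name match
    '
}

# Generate the section for the active environment when conda is available.
if command -v conda >/dev/null 2>&1; then
    echo "## Software Versions"
    conda list | versions_md bwa gatk4 picard samtools bcftools fastqc multiqc
fi
```

Redirecting the output into the project README keeps the documented versions in step with what is actually installed.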

Congratulations!

You now have comprehensive knowledge of using conda for bioinformatics software management. You can create reproducible analysis environments, manage complex tool dependencies, and maintain organized computational workflows for your research projects.