FastQC

Quality control tool for high throughput sequencing data

FastQC runs a modular set of analyses on raw sequencing data to give you a quick impression of whether the data has any problems before you do any further analysis.

Overview

FastQC can be run in one of two modes: interactive or non-interactive. In non-interactive mode, the program will process all specified files and produce an HTML report for each one.

Key Features:

  • Import data from BAM, SAM or FastQ files
  • Quick overview of quality control metrics
  • Summary graphs and tables
  • Identify problems in sequencing data
  • Export reports in HTML format

Docker Compose Configuration

version: '3.8'

services:
  fastqc:
    image: biocontainers/fastqc:v0.11.9_cv8
    container_name: dxflow-fastqc

    # Working directory
    working_dir: /data

    # Mount volumes for input/output
    volumes:
      - ./input:/data/input:ro
      - ./output:/data/output

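    # Command to run FastQC. Compose does not pass the command through a
    # shell, so sh -c is needed for the *.fastq.gz wildcard to expand.
    command:
      - sh
      - -c
      - "fastqc /data/input/*.fastq.gz --outdir /data/output --threads 4"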

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
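
As a rough sizing check for the limits above: the FastQC 0.11.x help text states that each thread is allocated 250 MB of memory, so the memory limit should exceed the thread count times 250 MB, with headroom for the JVM and the container itself. A minimal sketch of that arithmetic follows; `fastqc_heap_mb` is a hypothetical helper name, not part of FastQC or dxflow.

```shell
# Approximate Java heap (in MB) that FastQC 0.11.x requests for a given
# --threads value: 250 MB per thread, per the FastQC help text.
# fastqc_heap_mb is an illustrative name only.
fastqc_heap_mb() {
  echo $(( $1 * 250 ))
}

fastqc_heap_mb 4   # heap requested with --threads 4
```

With `--threads 4` this comes to about 1 GB, so the 4G limit in the compose file leaves ample headroom.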

Usage

Step 1: Prepare Data

Upload your FASTQ files:

# Create input and output directories
mkdir -p input output

# Upload your sequencing files
dxflow fs upload /local/sample_R1.fastq.gz input/
dxflow fs upload /local/sample_R2.fastq.gz input/

Step 2: Deploy Workflow

# Deploy FastQC workflow
dxflow compose create --identity fastqc-analysis fastqc.yml

# Start analysis
dxflow compose start fastqc-analysis

Step 3: Monitor Progress

# View logs
dxflow compose logs fastqc-analysis

# Check status
dxflow compose list

Step 4: Retrieve Results

# Download HTML reports
dxflow fs download output/ /local/fastqc-results/
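
After downloading, it can be useful to confirm that every input file produced a report. FastQC names each report after the input file with its extension removed (`sample_R1.fastq.gz` becomes `sample_R1_fastqc.html`). The sketch below relies on that naming; `missing_reports` is a hypothetical helper, and the directory layout after `dxflow fs download` is an assumption you may need to adjust.

```shell
# Print every *.fastq.gz in INPUT_DIR ($1) that has no matching
# <name>_fastqc.html in RESULTS_DIR ($2).
# missing_reports is an illustrative name, not a dxflow or FastQC command.
missing_reports() {
  for fq in "$1"/*.fastq.gz; do
    [ -e "$fq" ] || continue                 # the glob matched nothing
    name=$(basename "$fq" .fastq.gz)         # strip directory and suffix
    [ -f "$2/${name}_fastqc.html" ] || echo "$fq"
  done
}
```

For example, `missing_reports /local /local/fastqc-results` prints nothing when all reports are present.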

Configuration Options

Basic Options

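Option         Description                                                    Default
--threads      Number of files processed simultaneously (one thread per file) 1
--outdir       Output directory (must already exist)                          Directory of each input file
--extract      Unzip the output archive after it is created                   false
--noextract    Do not unzip the output archive                                false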

Advanced Options

Option            Description
--casava          Input files are raw CASAVA output
--nofilter        With --casava, keep reads that CASAVA flagged as poor quality
--format          Force the input format instead of auto-detecting it (fastq, bam, sam, bam_mapped, sam_mapped)
--contaminants    Custom contaminants file
--adapters        Custom adapters file
--limits          Custom limits file

Output Files

FastQC writes two files for each input file:

  • *_fastqc.html - HTML report with all graphs and tables
  • *_fastqc.zip - ZIP archive containing the report and detailed data files

The ZIP archive contains, among other files:

  • summary.txt - PASS/WARN/FAIL status of each module
  • fastqc_data.txt - Raw data for all analyses
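
Because summary.txt is tab-separated with one line per module (status, module name, input file name), the modules that need attention can be listed with a short filter. The sketch below assumes the archive has already been unzipped, for example with `--extract`; `flagged_modules` is a hypothetical helper name.

```shell
# List every module whose status is not PASS in a FastQC summary.txt.
# Each input line is: STATUS <TAB> module name <TAB> input file name.
# Output columns: input file, status, module name.
# flagged_modules is an illustrative name, not part of FastQC.
flagged_modules() {
  awk -F '\t' '$1 != "PASS" { printf "%s\t%s\t%s\n", $3, $1, $2 }' "$1"
}
```

A call such as `flagged_modules output/sample_R1_fastqc/summary.txt` prints one line per WARN or FAIL module and nothing for a clean sample.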

Quality Metrics

FastQC evaluates several quality metrics:

Basic Statistics

  • File name, type, encoding
  • Total sequences, sequences flagged as poor quality
  • Sequence length, %GC content
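
These values can also be read without opening the HTML report. In fastqc_data.txt each module starts with a `>>Module name` line and ends with `>>END_MODULE`, with tab-separated rows in between. The following is a minimal sketch under that layout; `basic_statistics` is a hypothetical helper name.

```shell
# Print the Basic Statistics module of a fastqc_data.txt as
# "measure=value" lines. basic_statistics is an illustrative name.
basic_statistics() {
  awk -F '\t' '
    /^>>Basic Statistics/ { inside = 1; next }   # module header line
    /^>>END_MODULE/       { inside = 0 }         # end of any module
    inside && !/^#/       { print $1 "=" $2 }    # skip the column header row
  ' "$1"
}
```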

Per Base Sequence Quality

  • Quality score distribution at each base position across all reads
  • Identifies low quality regions
  • PASS: Lower quartile ≥ 10 and median ≥ 25 at every position
  • WARN: Lower quartile < 10 or median < 25 at any position
  • FAIL: Lower quartile < 5 or median < 20 at any position
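
The thresholds above can be written as a small decision rule for a single base position. This is only a sketch of the logic, not FastQC's own code; `classify_base_quality` is a hypothetical helper that takes the lower quartile and the median as arguments.

```shell
# Classify one base position from its lower-quartile ($1) and median ($2)
# quality using the thresholds listed above. FAIL is tested first because
# it is the stricter condition. awk is used so that decimal values such
# as 32.0 compare correctly.
classify_base_quality() {
  awk -v lq="$1" -v med="$2" 'BEGIN {
    if (lq < 5 || med < 20)       print "FAIL"
    else if (lq < 10 || med < 25) print "WARN"
    else                          print "PASS"
  }'
}
```

FastQC reports the worst result over all positions, so a whole file fails as soon as one position fails.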

Per Sequence Quality Scores

  • Distribution of mean quality scores across all sequences
  • PASS: Most frequently observed mean quality ≥ 27
  • WARN: Most frequently observed mean quality < 27
  • FAIL: Most frequently observed mean quality < 20
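
To see where the peak lies for a given sample, the distribution can be read from the "Per sequence quality scores" module of fastqc_data.txt, which lists one quality value and its read count per row. The sketch below finds the most frequent quality and applies the thresholds above; `peak_sequence_quality` is a hypothetical helper name.

```shell
# Print the most frequent mean quality in the "Per sequence quality scores"
# module of a fastqc_data.txt, followed by PASS, WARN or FAIL.
# peak_sequence_quality is an illustrative name, not part of FastQC.
peak_sequence_quality() {
  awk -F '\t' '
    /^>>Per sequence quality scores/ { inside = 1; next }
    /^>>END_MODULE/                  { inside = 0 }
    inside && !/^#/ && $2 + 0 > max  { max = $2 + 0; peak = $1 + 0 }
    END {
      if (max == 0) { print "module not found"; exit 1 }
      if (peak < 20)      status = "FAIL"
      else if (peak < 27) status = "WARN"
      else                status = "PASS"
      print peak, status
    }
  ' "$1"
}
```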

Sequence Duplication Levels

  • Degree of duplication in the library
  • High duplication may indicate PCR amplification issues

Adapter Content

  • Presence of adapter sequences
  • Important for trimming decisions

Example Workflow

Complete analysis workflow:

# 1. Upload raw FASTQ files
dxflow fs upload raw_data/ input/

# 2. Deploy and run FastQC
dxflow compose create --identity qc fastqc.yml
dxflow compose start qc

# 3. Wait for completion (monitor logs)
dxflow compose logs -f qc

# 4. Download reports
dxflow fs download output/ results/

# 5. Review HTML reports in browser
# 6. Decide on trimming/filtering based on results

System Requirements

Minimum:

  • CPU: 2 cores
  • RAM: 2GB
  • Storage: 10GB

Recommended:

  • CPU: 4+ cores for faster processing
  • RAM: 4GB for large files
  • Storage: 50GB for multiple samples

Performance Tips

Optimize Processing:

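  • Process multiple files in parallel with --threads
  • Match --threads to the number of input files: FastQC uses one thread per file, so extra threads do not speed up a single file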
  • Use local SSD for better I/O performance

Batch Processing:

# Process all FASTQ files in parallel (assumes GNU parallel is installed in the image)
command: ["sh", "-c", "parallel -j 4 fastqc {} --outdir /data/output ::: /data/input/*.fastq.gz"]

Interpreting Results

Good Quality Data

  • Per base quality scores mostly in green zone (>28)
  • Even GC content distribution
  • Low duplication levels
  • No adapter contamination

Issues to Watch For

  • Declining quality at 3' end → Consider trimming
  • Unusual GC content → Possible contamination
  • High duplication → Library complexity issues
  • Adapter content → Trimming required
  • Overrepresented sequences → Possible contamination

Troubleshooting

Container fails to start:

  • Check input file permissions
  • Verify volume mount paths exist
  • Ensure sufficient disk space

Out of memory errors:

  • Increase memory limit in compose file
  • Process files individually
  • Use smaller file chunks

No output files:

  • Check command syntax
  • Verify output directory is writable
  • Review container logs for errors
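
Several of the checks above can be run locally before deploying. The following sketch tests that the input directory exists, that it holds readable *.fastq.gz files, and that the output directory is writable; `preflight` is a hypothetical helper name, and it only inspects the local directories that the compose file mounts.

```shell
# Check the mounted directories before starting the container.
# $1 = input directory, $2 = output directory.
# Prints one line per problem and returns non-zero if any was found.
# preflight is an illustrative name, not a dxflow command.
preflight() {
  status=0
  [ -d "$1" ] || { echo "missing input directory: $1"; status=1; }
  [ -d "$2" ] && [ -w "$2" ] || { echo "output directory missing or not writable: $2"; status=1; }
  for fq in "$1"/*.fastq.gz; do
    # An unmatched glob stays literal and fails this test as well.
    [ -r "$fq" ] || { echo "unreadable or no FASTQ input: $fq"; status=1; }
  done
  return "$status"
}
```

A call such as `preflight input output` prints nothing and returns 0 when the directories are ready.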

Next Steps

After FastQC analysis:

  1. If quality is good: Proceed to alignment/assembly
  2. If trimming needed: Use Trimmomatic or fastp
  3. If contamination found: Filter/remove contaminant sequences
  4. If adapters are present: Perform adapter trimming
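
The decision list above can be roughly automated from summary.txt. The mapping from module names to actions in this sketch is a simplification of that list, not a FastQC feature, and `suggest_next_steps` is a hypothetical helper name; the HTML report should still be reviewed before acting on its output.

```shell
# Suggest follow-up actions from a FastQC summary.txt
# (STATUS <TAB> module name <TAB> input file name).
# suggest_next_steps is an illustrative name; the mapping is a simplification.
suggest_next_steps() {
  awk -F '\t' '
    $1 == "PASS" { next }                  # nothing to do for this module
    { flagged = 1 }
    $2 == "Adapter Content"           { print "trim adapters"; next }
    $2 == "Per base sequence quality" { print "trim low-quality bases (Trimmomatic or fastp)"; next }
    $2 == "Overrepresented sequences" || $2 == "Per sequence GC content" {
      print "check for contamination: " $2; next
    }
    { print "review in the HTML report: " $2 }
    END { if (!flagged) print "proceed to alignment/assembly" }
  ' "$1"
}
```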

References

Citation

If you use FastQC in your research, please cite:

Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data.
Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Need help? Check the troubleshooting section or report issues on GitHub.