After building over 50 production Nextflow pipelines for genomics labs, pharmaceutical companies, and research institutions, we’ve distilled what actually matters when learning Nextflow. This tutorial skips the theory and gets you writing real workflows.
Nextflow is a workflow management system that makes computational pipelines portable, reproducible, and scalable. Whether you’re processing genomics data, running machine learning experiments, or automating any data pipeline, Nextflow handles the complexity of distributed computing so you can focus on the science.
What You’ll Learn
- Install Nextflow and run your first pipeline
- Understand DSL2 syntax (the modern Nextflow)
- Write processes and connect them with channels
- Use operators to transform data
- Configure pipelines for Docker, Singularity, and Conda
- Run nf-core community pipelines
- Deploy to HPC clusters and cloud platforms
Prerequisites
Before starting, you should have:
- Basic command line familiarity
- Understanding of any scripting language (Bash, Python, R)
- Docker or Singularity installed (optional but recommended)
Installing Nextflow
Recent Nextflow releases require Java 17 or later (older releases accepted Java 11). Installation is straightforward:
# Install with curl (recommended)
curl -s https://get.nextflow.io | bash
# Move to your PATH
sudo mv nextflow /usr/local/bin/
# Verify installation
nextflow -version
Alternative installation methods:
# Using Conda
conda install -c bioconda nextflow
# Using Homebrew (macOS)
brew install nextflow
# Using SDKMAN
sdk install nextflow
As of January 2026, the latest stable version is 25.12.x. Always check the official Nextflow documentation for the most current version.
Your First Nextflow Pipeline
Create a file called hello.nf:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process SAYHELLO {
output:
stdout
script:
"""
echo 'Hello, Nextflow!'
"""
}
workflow {
SAYHELLO()
}
Run it:
nextflow run hello.nf
You’ll see output like:
N E X T F L O W ~ version 25.12.0
Launching `hello.nf` [friendly_darwin] - revision: abc123
executor > local (1)
[ab/123456] process > SAYHELLO [100%] 1 of 1 ✔
Hello, Nextflow!
Congratulations—you’ve run your first Nextflow pipeline.
Understanding DSL2 Syntax
DSL2 (Domain Specific Language version 2) is the modern Nextflow syntax. It separates process definitions from their invocation, enabling modular, reusable code.
Enable DSL2
DSL2 has been the default syntax since Nextflow 22.10, so this declaration is optional on current releases. Including it explicitly does no harm and keeps scripts compatible with older versions:
nextflow.enable.dsl=2
Key DSL2 Concepts
| Concept | Description |
|---|---|
| Process | A unit of execution (runs a script/command) |
| Channel | A queue that connects processes (data flows through channels) |
| Workflow | Defines how processes connect together |
| Module | A file containing reusable processes |
| Operator | Transforms channel data (map, filter, collect, etc.) |
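A minimal, self-contained sketch (hypothetical process name) showing how these concepts fit together: a channel is transformed by an operator and fed to a process inside a workflow.

```nextflow
#!/usr/bin/env nextflow

// A process: one unit of execution
process TO_UPPER {
    input:
    val word

    output:
    stdout

    script:
    """
    echo -n '${word}' | tr '[:lower:]' '[:upper:]'
    """
}

// The workflow block wires channels, operators, and processes together
workflow {
    ch_words = Channel.of('alpha', 'beta', 'gamma')  // a channel of values
        .filter { it != 'beta' }                     // an operator
    TO_UPPER(ch_words)                               // process invocation
    TO_UPPER.out.view()                              // print each result
}
```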
Processes: The Building Blocks
A process defines what commands to run and how to handle inputs/outputs.
Basic Process Structure
process PROCESS_NAME {
// Directives (optional)
container 'ubuntu:latest'
cpus 4
memory '8 GB'
// Input declaration
input:
path input_file
// Output declaration
output:
path "output.txt"
// The actual command
script:
"""
cat ${input_file} > output.txt
"""
}
Process Directives
Directives configure how a process runs:
process ALIGN_READS {
// Container to use
container 'biocontainers/bwa:v0.7.17'
// Resource requirements
cpus 8
memory '32 GB'
time '4h'
// Publish outputs to a directory
publishDir "results/aligned", mode: 'copy'
// Error handling
errorStrategy 'retry'
maxRetries 3
// Labels for configuration
label 'high_memory'
input:
tuple val(sample_id), path(reads)
path reference
output:
tuple val(sample_id), path("${sample_id}.bam"), emit: aligned_bam
script:
"""
bwa mem -t ${task.cpus} ${reference} ${reads} | \
samtools sort -o ${sample_id}.bam
"""
}
Conditional Scripts
Use different scripts based on conditions:
process COMPRESS {
input:
path input_file
output:
path "*.gz"
script:
if (params.algorithm == 'gzip')
"""
gzip -c ${input_file} > ${input_file}.gz
"""
else if (params.algorithm == 'pigz')
"""
pigz -p ${task.cpus} -c ${input_file} > ${input_file}.gz
"""
else
error "Unknown compression algorithm: ${params.algorithm}"
}
Channels: Connecting Processes
Channels are asynchronous queues that transport data between processes. Understanding channels is essential for effective Nextflow programming.
Creating Channels
// From values
ch_numbers = Channel.of(1, 2, 3, 4, 5)
// From a list
ch_samples = Channel.fromList(['sample1', 'sample2', 'sample3'])
// From files
ch_fastq = Channel.fromPath('data/*.fastq.gz')
// From file pairs (common for paired-end reads)
ch_reads = Channel.fromFilePairs('data/*_{1,2}.fastq.gz')
// Emits: [sample_id, [read1.fastq.gz, read2.fastq.gz]]
// From a CSV/TSV file
ch_samplesheet = Channel
.fromPath('samplesheet.csv')
.splitCsv(header: true)
.map { row -> tuple(row.sample_id, file(row.fastq_1), file(row.fastq_2)) }
// Value channel (can be consumed multiple times)
ch_reference = Channel.value(file('reference.fasta'))
Channel Types
Queue channels can only be consumed once:
ch_data = Channel.of(1, 2, 3)
// First process consumes all data
// Second process gets nothing
Value channels can be consumed multiple times:
ch_reference = Channel.value(file('reference.fasta'))
// Both processes receive the same reference
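The distinction matters when one channel must feed several processes. A queue channel can be converted to a value channel with first() (its first element) or collect() (all elements as a list); a sketch:

```nextflow
// Queue channel: its items are consumed by the first process that reads them
ch_queue = Channel.of(1, 2, 3)

// Value channel: holds a single value, readable any number of times
ch_ref = Channel.value(file('reference.fasta'))

// first() converts a queue channel into a value channel holding its
// first element; collect() does the same with all elements as a list
ch_one = Channel.of('a', 'b', 'c').first()
ch_all = Channel.of('a', 'b', 'c').collect()
ch_all.view()  // [a, b, c]
```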
Operators: Transforming Data
Operators manipulate channel contents. Here are the most commonly used ones:
map
Transform each element:
Channel.of(1, 2, 3, 4, 5)
.map { it * 2 }
.view()
// Output: 2, 4, 6, 8, 10
// With tuples
Channel.fromFilePairs('data/*_{1,2}.fastq.gz')
.map { sample_id, files ->
def meta = [id: sample_id, single_end: false]
[meta, files]
}
filter
Keep elements matching a condition:
Channel.of(1, 2, 3, 4, 5)
.filter { it > 2 }
.view()
// Output: 3, 4, 5
// Filter by file size
Channel.fromPath('data/*.fastq.gz')
.filter { it.size() > 1000000 } // Files > 1MB
collect
Gather all elements into a list:
Channel.of(1, 2, 3, 4)
.collect()
.view()
// Output: [1, 2, 3, 4]
// Common use: collect outputs for MultiQC
FASTQC.out.zip.collect()
flatten
Flatten nested structures:
Channel.of([1, [2, 3]], [4, 5])
.flatten()
.view()
// Output: 1, 2, 3, 4, 5
combine
Combine elements from two channels:
ch_samples = Channel.of('A', 'B')
ch_treatments = Channel.of('control', 'treated')
ch_samples.combine(ch_treatments).view()
// Output: [A, control], [A, treated], [B, control], [B, treated]
join
Join channels by a common key:
ch_reads = Channel.of(['sample1', 'reads1.fq'], ['sample2', 'reads2.fq'])
ch_bams = Channel.of(['sample1', 'sample1.bam'], ['sample2', 'sample2.bam'])
ch_reads.join(ch_bams).view()
// Output: [sample1, reads1.fq, sample1.bam], [sample2, reads2.fq, sample2.bam]
groupTuple
Group elements by a key:
Channel.of(['chr1', 'file1.vcf'], ['chr1', 'file2.vcf'], ['chr2', 'file3.vcf'])
.groupTuple()
.view()
// Output: [chr1, [file1.vcf, file2.vcf]], [chr2, [file3.vcf]]
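Operators are typically chained. A sketch (hypothetical samplesheet with sample_id and fastq columns) that reads a CSV, keys each file by its sample, and groups files belonging to the same sample:

```nextflow
Channel
    .fromPath('samplesheet.csv')
    .splitCsv(header: true)                                // one item per row
    .map { row -> tuple(row.sample_id, file(row.fastq)) }  // key by sample
    .filter { sample_id, fq -> fq.name.endsWith('.gz') }   // gzipped files only
    .groupTuple()                                          // [sample_id, [fq1, fq2, ...]]
    .view()
```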
Building a Real Workflow
Let’s build a simple RNA-seq quality control pipeline:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
/*
* Parameters
*/
params.reads = "data/*_{1,2}.fastq.gz"
params.outdir = "results"
/*
* Processes
*/
process FASTQC {
tag "${sample_id}"
container 'biocontainers/fastqc:v0.11.9'
publishDir "${params.outdir}/fastqc", mode: 'copy'
input:
tuple val(sample_id), path(reads)
output:
path "*.html", emit: html
path "*.zip", emit: zip
script:
"""
fastqc -t ${task.cpus} ${reads}
"""
}
process TRIMGALORE {
tag "${sample_id}"
container 'quay.io/biocontainers/trim-galore:0.6.7'
publishDir "${params.outdir}/trimmed", mode: 'copy'
input:
tuple val(sample_id), path(reads)
output:
tuple val(sample_id), path("*_val_{1,2}.fq.gz"), emit: trimmed_reads
path "*_trimming_report.txt", emit: log
script:
"""
trim_galore --paired --gzip ${reads}
"""
}
process MULTIQC {
container 'ewels/multiqc:latest'
publishDir "${params.outdir}/multiqc", mode: 'copy'
input:
path '*'
output:
path "multiqc_report.html"
path "multiqc_data"
script:
"""
multiqc .
"""
}
/*
* Workflow
*/
workflow {
// Create channel from input files
ch_reads = Channel.fromFilePairs(params.reads, checkIfExists: true)
// Run FastQC on raw reads
FASTQC(ch_reads)
// Trim reads
TRIMGALORE(ch_reads)
// Note: a process defined in the same script can only be invoked once.
// To run FastQC on the trimmed reads as well, import it as a module
// under an alias (see the Modules section below).
// Collect all QC reports for MultiQC
ch_multiqc = FASTQC.out.zip
.mix(TRIMGALORE.out.log)
.collect()
MULTIQC(ch_multiqc)
}
Run with:
nextflow run rnaseq_qc.nf --reads 'data/*_{1,2}.fastq.gz' --outdir results
Configuration
Nextflow uses configuration files to separate pipeline logic from execution settings.
Configuration File Hierarchy
Nextflow looks for configuration in this order (later overrides earlier):
1. $HOME/.nextflow/config (user defaults)
2. nextflow.config in the pipeline directory
3. Files passed with -c custom.config on the command line
4. Parameters from -params-file params.yaml (parameters only)
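As a concrete example, a small custom.config (hypothetical values) supplied with -c overrides matching settings from nextflow.config while leaving everything else intact:

```nextflow
// custom.config -- apply with: nextflow run main.nf -c custom.config
params.outdir = 'results_rerun'   // overrides params.outdir

process {
    executor = 'slurm'            // overrides the default executor
    withName: 'FASTQC' {
        cpus = 4                  // raise resources for one process only
    }
}
```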
Basic Configuration
Create nextflow.config:
// Pipeline parameters
params {
reads = "data/*_{1,2}.fastq.gz"
outdir = "results"
genome = "GRCh38"
}
// Process defaults
process {
cpus = 2
memory = '4 GB'
time = '1h'
// Per-process settings
withName: 'ALIGN_READS' {
cpus = 16
memory = '64 GB'
time = '8h'
}
// By label
withLabel: 'high_memory' {
memory = '128 GB'
}
}
// Execution profiles
profiles {
standard {
process.executor = 'local'
}
docker {
docker.enabled = true
docker.runOptions = '-u $(id -u):$(id -g)'
}
singularity {
singularity.enabled = true
singularity.autoMounts = true
}
conda {
conda.enabled = true
}
slurm {
process.executor = 'slurm'
process.queue = 'normal'
}
aws {
process.executor = 'awsbatch'
aws.region = 'us-east-1'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
}
}
// Execution reports
timeline {
enabled = true
file = "${params.outdir}/pipeline_info/timeline.html"
}
report {
enabled = true
file = "${params.outdir}/pipeline_info/report.html"
}
trace {
enabled = true
file = "${params.outdir}/pipeline_info/trace.txt"
}
dag {
enabled = true
file = "${params.outdir}/pipeline_info/dag.svg"
}
Using Profiles
# Run with Docker
nextflow run main.nf -profile docker
# Run on SLURM cluster with Singularity
nextflow run main.nf -profile slurm,singularity
# Multiple profiles (comma-separated)
nextflow run main.nf -profile test,docker
Modules: Reusable Code
DSL2 allows you to organize processes into modules for reuse.
Creating a Module
Create modules/fastqc.nf:
process FASTQC {
tag "${meta.id}"
label 'process_medium'
container 'biocontainers/fastqc:v0.11.9'
input:
tuple val(meta), path(reads)
output:
tuple val(meta), path("*.html"), emit: html
tuple val(meta), path("*.zip"), emit: zip
script:
"""
fastqc --threads ${task.cpus} ${reads}
"""
}
Importing Modules
In your main workflow:
include { FASTQC } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'
// Import with alias
include { FASTQC as FASTQC_RAW } from './modules/fastqc'
include { FASTQC as FASTQC_TRIMMED } from './modules/fastqc'
workflow {
FASTQC_RAW(ch_raw_reads)
FASTQC_TRIMMED(ch_trimmed_reads)
}
Subworkflows
Group related processes into subworkflows:
// subworkflows/qc.nf
include { FASTQC } from '../modules/fastqc'
include { MULTIQC } from '../modules/multiqc'
workflow QC {
take:
reads
main:
FASTQC(reads)
MULTIQC(FASTQC.out.zip.collect())
emit:
reports = MULTIQC.out.report
}
Use in main workflow:
include { QC } from './subworkflows/qc'
workflow {
QC(ch_reads)
}
Using nf-core Pipelines
nf-core is a community-driven collection of Nextflow pipelines built to shared best practices. With 90+ production-ready pipelines, you often don’t need to write your own.
Running nf-core Pipelines
# Show a pipeline's options and usage
nextflow run nf-core/rnaseq --help
# Run RNA-seq pipeline
nextflow run nf-core/rnaseq \
-profile docker \
--input samplesheet.csv \
--genome GRCh38 \
--outdir results
# Run with specific version
nextflow run nf-core/rnaseq -r 3.14.0 -profile singularity
# Test with minimal dataset
nextflow run nf-core/rnaseq -profile test,docker
Popular nf-core Pipelines
| Pipeline | Description |
|---|---|
| nf-core/rnaseq | RNA-seq analysis |
| nf-core/sarek | Variant calling for germline/somatic |
| nf-core/atacseq | ATAC-seq analysis |
| nf-core/chipseq | ChIP-seq analysis |
| nf-core/fetchngs | Download from SRA/ENA |
| nf-core/viralrecon | Viral genome analysis |
Creating Samplesheets
nf-core pipelines use standardized CSV samplesheets:
sample,fastq_1,fastq_2,strandedness
SAMPLE1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz,reverse
SAMPLE2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz,reverse
Essential Commands
Running Pipelines
# Basic run
nextflow run main.nf
# With parameters
nextflow run main.nf --reads 'data/*.fq.gz' --outdir results
# With profile
nextflow run main.nf -profile docker
# Resume from cache
nextflow run main.nf -resume
# Run specific entry point
nextflow run main.nf -entry WORKFLOW_NAME
# Run directly from GitHub, pinning a branch or tag with -r
nextflow run nf-core/rnaseq -r master
# Pull latest version
nextflow pull nf-core/rnaseq
Debugging and Information
# View run history
nextflow log
# Show detailed log for specific run
nextflow log <run_name> -f hash,name,status,exit
# Clean work directory
nextflow clean -f
# Clean but keep specific run
nextflow clean -but <run_name>
# View the resolved configuration
nextflow config
# Preview the pipeline without executing any tasks
nextflow run main.nf -preview
Common Flags
| Flag | Description |
|---|---|
| -resume | Resume from last checkpoint |
| -profile | Use configuration profile |
| -work-dir | Set work directory location |
| -params-file | Load parameters from YAML/JSON |
| -with-report | Generate HTML execution report |
| -with-timeline | Generate timeline HTML |
| -with-dag | Generate DAG visualization |
| -with-tower | Monitor on Seqera Platform |
| -ansi-log false | Disable ANSI colors (for logs) |
Error Handling and Debugging
Common Errors and Solutions
Process exits with non-zero code:
# Check the .command.log in work directory
cat work/ab/123456/.command.log
# Check the error output
cat work/ab/123456/.command.err
Out of memory:
process MEMORY_INTENSIVE {
memory { 8.GB * task.attempt }
errorStrategy 'retry'
maxRetries 3
// ...
}
File not found:
// Use checkIfExists
Channel.fromPath(params.reads, checkIfExists: true)
// Debug: print what the channel contains
Channel.fromPath(params.reads).view()
The Work Directory
Every process execution creates a directory under work/:
work/ab/123456789abcdef/
├── .command.sh # The actual script run
├── .command.run # Wrapper script
├── .command.log # Combined stdout/stderr
├── .command.out # Stdout only
├── .command.err # Stderr only
├── .exitcode # Exit status
└── output_file.txt # Output files are created here (input files are staged as symlinks)
Resume Behavior
Nextflow caches task results based on:
- Input file content (checksum)
- Process script
- Process directives
- Container/conda environment
If any of these change, the task re-runs.
# Resume from cache
nextflow run main.nf -resume
# Force re-run everything
nextflow run main.nf -cache false
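Caching can also be tuned per process with the cache directive: disable it for processes with nondeterministic outputs, or use 'lenient' mode on shared filesystems where timestamps are unreliable. A sketch (hypothetical process names):

```nextflow
process DOWNLOAD_DB {
    cache false        // always re-run: the remote database may have changed

    // ...
}

process ALIGN {
    cache 'lenient'    // hash file path and size only, ignoring timestamps

    // ...
}
```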
Best Practices
Based on our experience building production pipelines:
1. Use Containers
Always specify containers for reproducibility:
process ALIGN {
container 'quay.io/biocontainers/bwa:0.7.17'
// ...
}
2. Use Labels for Resource Management
process SMALL_TASK {
label 'process_single'
}
process BIG_TASK {
label 'process_high'
}
// In config:
process {
withLabel: 'process_single' { cpus = 1; memory = '2 GB' }
withLabel: 'process_high' { cpus = 16; memory = '64 GB' }
}
3. Validate Inputs
// Check files exist
Channel.fromPath(params.input, checkIfExists: true)
// Validate parameters
if (!params.genome) {
error "Please specify a genome with --genome"
}
4. Use emit for Named Outputs
output:
path "*.bam", emit: bam
path "*.bai", emit: bai
// Access in workflow:
ALIGN.out.bam
ALIGN.out.bai
5. Publish Important Results
publishDir "${params.outdir}/aligned", mode: 'copy', pattern: '*.bam'
6. Handle Failures Gracefully
errorStrategy { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries 3
maxErrors -1 // no cap on total errors; 137/143 usually mean the job was killed (e.g. out of memory)
7. Document Your Pipeline
/*
* Pipeline: RNA-seq Analysis
* Author: Your Name
* Description: Quality control and quantification of RNA-seq data
*/
Next Steps
You now have the fundamentals to build Nextflow pipelines. Here’s where to go next:
- Explore nf-core - Use existing pipelines before building your own
- Take the official training - training.nextflow.io offers in-depth courses
- Join the community - The nf-core Slack has active support channels
- Read the docs - The official documentation covers advanced topics
For a comparison with other workflow managers, see our Nextflow vs Snakemake guide.
Get Expert Nextflow Support
Building production-grade Nextflow pipelines requires expertise in bioinformatics, cloud infrastructure, and workflow optimization. Many teams spend months debugging container issues, optimizing HPC configurations, and scaling pipelines—time better spent on research.
Our Nextflow managed services help you:
- Design custom pipelines tailored to your research workflows
- Optimize existing pipelines for performance and cost efficiency
- Deploy to cloud platforms (AWS, Google Cloud, Azure) with proper configuration
- Integrate with Seqera Platform for monitoring and collaboration
- Train your team on Nextflow best practices
We’ve built pipelines processing petabytes of genomics data for pharmaceutical companies, research institutions, and clinical labs.
Get Nextflow consulting support →
Related Resources
- Nextflow vs Snakemake: Comprehensive Comparison
- Top Workflow Automation Tools 2025
- DevOps for Bioinformatics
External Resources: