Engineering

Nextflow Tutorial 2026: We Built 50+ Pipelines—Here's How to Start

Engineering Team

After building over 50 production Nextflow pipelines for genomics labs, pharmaceutical companies, and research institutions, we’ve distilled what actually matters when learning Nextflow. This tutorial skips the theory and gets you writing real workflows.

Nextflow is a workflow management system that makes computational pipelines portable, reproducible, and scalable. Whether you’re processing genomics data, running machine learning experiments, or automating any data pipeline, Nextflow handles the complexity of distributed computing so you can focus on the science.

What You’ll Learn

  • Install Nextflow and run your first pipeline
  • Understand DSL2 syntax (the modern Nextflow)
  • Write processes and connect them with channels
  • Use operators to transform data
  • Configure pipelines for Docker, Singularity, and Conda
  • Run nf-core community pipelines
  • Deploy to HPC clusters and cloud platforms

Prerequisites

Before starting, you should have:

  • Basic command line familiarity
  • Understanding of any scripting language (Bash, Python, R)
  • Docker or Singularity installed (optional but recommended)

Installing Nextflow

Nextflow requires a recent Java runtime (Java 17 or later for current releases). Installation is straightforward:

# Install with curl (recommended)
curl -s https://get.nextflow.io | bash

# Move to your PATH
sudo mv nextflow /usr/local/bin/

# Verify installation
nextflow -version

Alternative installation methods:

# Using Conda
conda install -c bioconda nextflow

# Using Homebrew (macOS)
brew install nextflow

# Using SDKMAN
sdk install nextflow

As of January 2026, the latest stable version is 25.12.x. Always check the official Nextflow documentation for the most current version.


Your First Nextflow Pipeline

Create a file called hello.nf:

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

process SAYHELLO {
    output:
    stdout

    script:
    """
    echo 'Hello, Nextflow!'
    """
}

workflow {
    SAYHELLO()
}

Run it:

nextflow run hello.nf

You’ll see output like:

N E X T F L O W  ~  version 25.12.0
Launching `hello.nf` [friendly_darwin] - revision: abc123

executor >  local (1)
[ab/123456] process > SAYHELLO [100%] 1 of 1 ✔
Hello, Nextflow!

Congratulations—you’ve run your first Nextflow pipeline.


Understanding DSL2 Syntax

DSL2 (Domain Specific Language version 2) is the modern Nextflow syntax. It separates process definitions from their invocation, enabling modular, reusable code.

Enable DSL2

DSL2 has been the default since Nextflow 22.03, so this line is optional on current releases, but declaring it keeps scripts explicit and compatible with older installations:

nextflow.enable.dsl=2

Key DSL2 Concepts

Concept    Description
Process    A unit of execution (runs a script/command)
Channel    A queue that connects processes (data flows through channels)
Workflow   Defines how processes connect together
Module     A file containing reusable processes
Operator   Transforms channel data (map, filter, collect, etc.)
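
These pieces combine in just a few lines. A minimal sketch (the DOUBLE process and its script are illustrative):

nextflow.enable.dsl=2

// Process: runs a shell command that doubles a number
process DOUBLE {
    input:
    val x

    output:
    stdout

    script:
    """
    echo \$(( ${x} * 2 ))
    """
}

// Workflow: a channel feeds the process; an operator transforms the output
workflow {
    ch_in = Channel.of(1, 2, 3)          // channel
    DOUBLE(ch_in)                        // process invocation
    DOUBLE.out.map { it.trim() }.view()  // operator on the output
}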

Processes: The Building Blocks

A process defines what commands to run and how to handle inputs/outputs.

Basic Process Structure

process PROCESS_NAME {
    // Directives (optional)
    container 'ubuntu:latest'
    cpus 4
    memory '8 GB'

    // Input declaration
    input:
    path input_file

    // Output declaration
    output:
    path "output.txt"

    // The actual command
    script:
    """
    cat ${input_file} > output.txt
    """
}

Process Directives

Directives configure how a process runs:

process ALIGN_READS {
    // Container to use
    container 'biocontainers/bwa:v0.7.17'

    // Resource requirements
    cpus 8
    memory '32 GB'
    time '4h'

    // Publish outputs to a directory
    publishDir "results/aligned", mode: 'copy'

    // Error handling
    errorStrategy 'retry'
    maxRetries 3

    // Labels for configuration
    label 'high_memory'

    input:
    tuple val(sample_id), path(reads)
    path reference

    output:
    tuple val(sample_id), path("${sample_id}.bam"), emit: aligned_bam

    script:
    """
    bwa mem -t ${task.cpus} ${reference} ${reads} | \
        samtools sort -o ${sample_id}.bam
    """
}
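
To wire this process up, pass the reads as a queue channel and the reference as a value channel so every sample can reuse it. A minimal sketch (file paths are placeholders):

workflow {
    ch_reads     = Channel.fromFilePairs('data/*_{1,2}.fastq.gz')
    ch_reference = Channel.value(file('reference.fasta'))

    ALIGN_READS(ch_reads, ch_reference)

    // Access the named output declared with emit:
    ALIGN_READS.out.aligned_bam.view()
}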

Conditional Scripts

Use different scripts based on conditions:

process COMPRESS {
    input:
    path input_file

    output:
    path "*.gz"

    script:
    if (params.algorithm == 'gzip')
        """
        gzip -c ${input_file} > ${input_file}.gz
        """
    else if (params.algorithm == 'pigz')
        """
        pigz -p ${task.cpus} -c ${input_file} > ${input_file}.gz
        """
    else
        error "Unknown compression algorithm: ${params.algorithm}"
}
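
Any params.* value can be set from the command line with a double-dash flag. Assuming the process above is saved as compress.nf (name illustrative):

# Select the pigz branch at runtime
nextflow run compress.nf --algorithm pigz

# Or give the parameter a default at the top of the script:
# params.algorithm = 'gzip'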

Channels: Connecting Processes

Channels are asynchronous queues that transport data between processes. Understanding channels is essential for effective Nextflow programming.

Creating Channels

// From values
ch_numbers = Channel.of(1, 2, 3, 4, 5)

// From a list
ch_samples = Channel.of('sample1', 'sample2', 'sample3')

// From files
ch_fastq = Channel.fromPath('data/*.fastq.gz')

// From file pairs (common for paired-end reads)
ch_reads = Channel.fromFilePairs('data/*_{1,2}.fastq.gz')
// Emits: [sample_id, [read1.fastq.gz, read2.fastq.gz]]

// From a CSV/TSV file
ch_samplesheet = Channel
    .fromPath('samplesheet.csv')
    .splitCsv(header: true)
    .map { row -> tuple(row.sample_id, file(row.fastq_1), file(row.fastq_2)) }

// Value channel (can be consumed multiple times)
ch_reference = Channel.value(file('reference.fasta'))

Channel Types

Queue channels can only be consumed once:

ch_data = Channel.of(1, 2, 3)
// First process consumes all data
// Second process gets nothing

Value channels can be consumed multiple times:

ch_reference = Channel.value(file('reference.fasta'))
// Both processes receive the same reference
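
If you need a queue channel's contents more than once, convert it to a value channel instead of trying to read it twice. A small sketch:

// first() turns a queue channel into a value channel (its first element)
ch_ref = Channel.fromPath('reference.fasta').first()

// collect() also returns a value channel (all elements as one list),
// so its result can feed any number of downstream processes
ch_all = Channel.of(1, 2, 3).collect()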

Operators: Transforming Data

Operators manipulate channel contents. Here are the most commonly used ones:

map

Transform each element:

Channel.of(1, 2, 3, 4, 5)
    .map { it * 2 }
    .view()
// Output: 2, 4, 6, 8, 10

// With tuples
Channel.fromFilePairs('data/*_{1,2}.fastq.gz')
    .map { sample_id, files ->
        def meta = [id: sample_id, single_end: false]
        [meta, files]
    }

filter

Keep elements matching a condition:

Channel.of(1, 2, 3, 4, 5)
    .filter { it > 2 }
    .view()
// Output: 3, 4, 5

// Filter by file size
Channel.fromPath('data/*.fastq.gz')
    .filter { it.size() > 1000000 }  // Files > 1MB

collect

Gather all elements into a list:

Channel.of(1, 2, 3, 4)
    .collect()
    .view()
// Output: [1, 2, 3, 4]

// Common use: collect outputs for MultiQC
FASTQC.out.zip.collect()

flatten

Flatten nested structures:

Channel.of([1, [2, 3]], [4, 5])
    .flatten()
    .view()
// Output: 1, 2, 3, 4, 5

combine

Combine elements from two channels:

ch_samples = Channel.of('A', 'B')
ch_treatments = Channel.of('control', 'treated')

ch_samples.combine(ch_treatments).view()
// Output: [A, control], [A, treated], [B, control], [B, treated]

join

Join channels by a common key:

ch_reads = Channel.of(['sample1', 'reads1.fq'], ['sample2', 'reads2.fq'])
ch_bams = Channel.of(['sample1', 'sample1.bam'], ['sample2', 'sample2.bam'])

ch_reads.join(ch_bams).view()
// Output: [sample1, reads1.fq, sample1.bam], [sample2, reads2.fq, sample2.bam]

groupTuple

Group elements by a key:

Channel.of(['chr1', 'file1.vcf'], ['chr1', 'file2.vcf'], ['chr2', 'file3.vcf'])
    .groupTuple()
    .view()
// Output: [chr1, [file1.vcf, file2.vcf]], [chr2, [file3.vcf]]

Building a Real Workflow

Let’s build a simple RNA-seq quality control pipeline. Save it as rnaseq_qc.nf:

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

/*
 * Parameters
 */
params.reads = "data/*_{1,2}.fastq.gz"
params.outdir = "results"

/*
 * Processes
 */
process FASTQC {
    tag "${sample_id}"
    container 'biocontainers/fastqc:v0.11.9'
    publishDir "${params.outdir}/fastqc", mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*.html", emit: html
    path "*.zip", emit: zip

    script:
    """
    fastqc -t ${task.cpus} ${reads}
    """
}

process TRIMGALORE {
    tag "${sample_id}"
    container 'quay.io/biocontainers/trim-galore:0.6.7'
    publishDir "${params.outdir}/trimmed", mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("*_val_{1,2}.fq.gz"), emit: trimmed_reads
    path "*_trimming_report.txt", emit: log

    script:
    """
    trim_galore --paired --gzip ${reads}
    """
}

process MULTIQC {
    container 'ewels/multiqc:latest'
    publishDir "${params.outdir}/multiqc", mode: 'copy'

    input:
    path '*'

    output:
    path "multiqc_report.html"
    path "multiqc_data"

    script:
    """
    multiqc .
    """
}

/*
 * Workflow
 */
workflow {
    // Create channel from input files
    ch_reads = Channel.fromFilePairs(params.reads, checkIfExists: true)

    // Run FastQC on raw reads
    FASTQC(ch_reads)

    // Trim reads
    TRIMGALORE(ch_reads)

    // Note: a process can only be invoked once per workflow. To run FastQC
    // on the trimmed reads too, import it under an alias (see "Importing
    // Modules" below).

    // Merge all QC reports into one list for MultiQC
    ch_multiqc = FASTQC.out.zip
        .mix(TRIMGALORE.out.log)
        .collect()

    MULTIQC(ch_multiqc)
}

Run with:

nextflow run rnaseq_qc.nf --reads 'data/*_{1,2}.fastq.gz' --outdir results

Configuration

Nextflow uses configuration files to separate pipeline logic from execution settings.

Configuration File Hierarchy

Nextflow looks for configuration in this order (later overrides earlier):

  1. $HOME/.nextflow/config (user defaults)
  2. nextflow.config in the project directory, then in the launch directory
  3. -c custom.config (command line)
  4. -params-file params.yaml (parameters only)
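
For example, parameters can live in a YAML file passed with -params-file (file name illustrative). Create params.yaml:

reads: "data/*_{1,2}.fastq.gz"
outdir: "results"
genome: "GRCh38"

Then run:

nextflow run main.nf -params-file params.yaml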

Basic Configuration

Create nextflow.config:

// Pipeline parameters
params {
    reads = "data/*_{1,2}.fastq.gz"
    outdir = "results"
    genome = "GRCh38"
}

// Process defaults
process {
    cpus = 2
    memory = '4 GB'
    time = '1h'

    // Per-process settings
    withName: 'ALIGN_READS' {
        cpus = 16
        memory = '64 GB'
        time = '8h'
    }

    // By label
    withLabel: 'high_memory' {
        memory = '128 GB'
    }
}

// Execution profiles
profiles {
    standard {
        process.executor = 'local'
    }

    docker {
        docker.enabled = true
        docker.runOptions = '-u $(id -u):$(id -g)'
    }

    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
    }

    conda {
        conda.enabled = true
    }

    slurm {
        process.executor = 'slurm'
        process.queue = 'normal'
    }

    aws {
        process.executor = 'awsbatch'
        aws.region = 'us-east-1'
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}

// Execution reports
timeline {
    enabled = true
    file = "${params.outdir}/pipeline_info/timeline.html"
}

report {
    enabled = true
    file = "${params.outdir}/pipeline_info/report.html"
}

trace {
    enabled = true
    file = "${params.outdir}/pipeline_info/trace.txt"
}

dag {
    enabled = true
    file = "${params.outdir}/pipeline_info/dag.svg"
}

Using Profiles

# Run with Docker
nextflow run main.nf -profile docker

# Run on SLURM cluster with Singularity
nextflow run main.nf -profile slurm,singularity

# Multiple profiles (comma-separated)
nextflow run main.nf -profile test,docker

Modules: Reusable Code

DSL2 allows you to organize processes into modules for reuse.

Creating a Module

Create modules/fastqc.nf:

process FASTQC {
    tag "${meta.id}"
    label 'process_medium'
    container 'biocontainers/fastqc:v0.11.9'

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*.html"), emit: html
    tuple val(meta), path("*.zip"), emit: zip

    script:
    """
    fastqc --threads ${task.cpus} ${reads}
    """
}

Importing Modules

In your main workflow:

include { FASTQC } from './modules/fastqc'
include { MULTIQC } from './modules/multiqc'

// Import with alias
include { FASTQC as FASTQC_RAW } from './modules/fastqc'
include { FASTQC as FASTQC_TRIMMED } from './modules/fastqc'

workflow {
    FASTQC_RAW(ch_raw_reads)
    FASTQC_TRIMMED(ch_trimmed_reads)
}

Subworkflows

Group related processes into subworkflows:

// subworkflows/qc.nf
include { FASTQC } from '../modules/fastqc'
include { MULTIQC } from '../modules/multiqc'

workflow QC {
    take:
    reads

    main:
    FASTQC(reads)
    MULTIQC(FASTQC.out.zip.collect())

    emit:
    reports = MULTIQC.out.report
}

Use in main workflow:

include { QC } from './subworkflows/qc'

workflow {
    QC(ch_reads)
}

Using nf-core Pipelines

nf-core is a community-curated collection of Nextflow pipelines built to shared best practices. With 90+ production-ready pipelines, you often don’t need to write your own.

Running nf-core Pipelines

# Show a pipeline's options
nextflow run nf-core/rnaseq --help

# Run RNA-seq pipeline
nextflow run nf-core/rnaseq \
    -profile docker \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir results

# Run with specific version
nextflow run nf-core/rnaseq -r 3.14.0 -profile singularity

# Test with minimal dataset
nextflow run nf-core/rnaseq -profile test,docker

Popular nf-core pipelines:

Pipeline            Description
nf-core/rnaseq      RNA-seq analysis
nf-core/sarek       Variant calling for germline/somatic
nf-core/atacseq     ATAC-seq analysis
nf-core/chipseq     ChIP-seq analysis
nf-core/fetchngs    Download from SRA/ENA
nf-core/viralrecon  Viral genome analysis

Creating Samplesheets

nf-core pipelines use standardized CSV samplesheets:

sample,fastq_1,fastq_2,strandedness
SAMPLE1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz,reverse
SAMPLE2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz,reverse

Essential Commands

Running Pipelines

# Basic run
nextflow run main.nf

# With parameters
nextflow run main.nf --reads 'data/*.fq.gz' --outdir results

# With profile
nextflow run main.nf -profile docker

# Resume from cache
nextflow run main.nf -resume

# Run specific entry point
nextflow run main.nf -entry WORKFLOW_NAME

# Run directly from GitHub at a given revision
nextflow run nf-core/rnaseq -r main

# Pull latest version
nextflow pull nf-core/rnaseq

Debugging and Information

# View run history
nextflow log

# Show detailed log for specific run
nextflow log <run_name> -f hash,name,status,exit

# Clean work directory
nextflow clean -f

# Clean but keep specific run
nextflow clean -but <run_name>

# View the resolved configuration
nextflow config

# Preview execution without running any tasks
nextflow run main.nf -preview

Common Flags

Flag               Description
-resume            Resume from last checkpoint
-profile           Use configuration profile
-work-dir          Set work directory location
-params-file       Load parameters from YAML/JSON
-with-report       Generate HTML execution report
-with-timeline     Generate timeline HTML
-with-dag          Generate DAG visualization
-with-tower        Monitor on Seqera Platform
-ansi-log false    Disable ANSI colors (for logs)
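
These flags compose on one command line. A typical invocation (file paths are illustrative):

nextflow run main.nf \
    -profile docker \
    -resume \
    -with-report report.html \
    -with-timeline timeline.html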

Error Handling and Debugging

Common Errors and Solutions

Process exits with non-zero code:

# Check the .command.log in work directory
cat work/ab/123456/.command.log

# Check the error output
cat work/ab/123456/.command.err

Out of memory:

process MEMORY_INTENSIVE {
    memory { 8.GB * task.attempt }
    errorStrategy 'retry'
    maxRetries 3
    // ...
}

File not found:

// Use checkIfExists
Channel.fromPath(params.reads, checkIfExists: true)

// Debug: print what the channel contains
Channel.fromPath(params.reads).view()

The Work Directory

Every process execution creates a directory under work/:

work/ab/123456789abcdef/
├── .command.sh      # The actual script run
├── .command.run     # Wrapper script
├── .command.log     # Combined stdout/stderr
├── .command.out     # Stdout only
├── .command.err     # Stderr only
├── .exitcode        # Exit status
└── output_file.txt  # Output files (symlinked)
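
The ab/123456 prefix shown in the console output maps to these directories. nextflow log can print the full path for each task:

# Show name, status, and work directory for every task in the last run
nextflow log last -f name,status,workdir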

Resume Behavior

Nextflow caches task results based on:

  • Input file content (checksum)
  • Process script
  • Process directives
  • Container/conda environment

If any of these change, the task re-runs.

# Resume from cache
nextflow run main.nf -resume

# Force re-run everything
nextflow run main.nf -cache false

Best Practices

Based on our experience building production pipelines:

1. Use Containers

Always specify containers for reproducibility:

process ALIGN {
    container 'quay.io/biocontainers/bwa:0.7.17'
    // ...
}

2. Use Labels for Resource Management

process SMALL_TASK {
    label 'process_single'
}

process BIG_TASK {
    label 'process_high'
}

// In config:
process {
    withLabel: 'process_single' { cpus = 1; memory = '2 GB' }
    withLabel: 'process_high' { cpus = 16; memory = '64 GB' }
}

3. Validate Inputs

// Check files exist
Channel.fromPath(params.input, checkIfExists: true)

// Validate parameters
if (!params.genome) {
    error "Please specify a genome with --genome"
}

4. Use emit for Named Outputs

output:
path "*.bam", emit: bam
path "*.bai", emit: bai

// Access in workflow:
ALIGN.out.bam
ALIGN.out.bai

5. Publish Important Results

publishDir "${params.outdir}/aligned", mode: 'copy', pattern: '*.bam'

6. Handle Failures Gracefully

errorStrategy { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries 3
maxErrors '-1'  // -1 disables the limit on total task failures

7. Document Your Pipeline

/*
 * Pipeline: RNA-seq Analysis
 * Author: Your Name
 * Description: Quality control and quantification of RNA-seq data
 */

Next Steps

You now have the fundamentals to build Nextflow pipelines. Here’s where to go next:

  1. Explore nf-core - Use existing pipelines before building your own
  2. Take the official training - training.nextflow.io offers in-depth courses
  3. Join the community - The nf-core Slack has active support channels
  4. Read the docs - The official documentation covers advanced topics

For a comparison with other workflow managers, see our Nextflow vs Snakemake guide.


Get Expert Nextflow Support

Building production-grade Nextflow pipelines requires expertise in bioinformatics, cloud infrastructure, and workflow optimization. Many teams spend months debugging container issues, optimizing HPC configurations, and scaling pipelines—time better spent on research.

Our Nextflow managed services help you:

  • Design custom pipelines tailored to your research workflows
  • Optimize existing pipelines for performance and cost efficiency
  • Deploy to cloud platforms (AWS, Google Cloud, Azure) with proper configuration
  • Integrate with Seqera Platform for monitoring and collaboration
  • Train your team on Nextflow best practices

We’ve built pipelines processing petabytes of genomics data for pharmaceutical companies, research institutions, and clinical labs.

Get Nextflow consulting support →

