Tools & Database Creation

Overview

PhaLP 2.0 is a comprehensive database of phage lytic proteins built by integrating data from two major sources: curated protein databases and metagenomic datasets. This page describes the key tools and resources used to construct and analyze the database content.

Primary Data Sources

1. UniProt - Curated Protein Database

UniProt serves as our primary source for experimentally validated and well-annotated phage lytic proteins. UniProt provides:

High-quality protein sequences with manual curation
Functional annotations and literature references
Domain architecture from InterPro and Pfam
Gene Ontology (GO) terms and Enzyme Commission (EC) numbers
Cross-references to structures (PDB) and genomic data (NCBI)
Sequence clusters (UniRef) and sequence archives (UniParc)

2. EnVhog - Environmental Metagenomic Database

EnVhog is a specialized database of environmental viral proteins identified from metagenomic data. It represents a rich source of novel phage lytic proteins from diverse environments:

Metagenomic sequences from various environmental samples
Predicted protein structures using AlphaFold
Quality metrics including pLDDT scores for structure confidence
Greater diversity of phage proteins beyond those from cultured phages
Access to understudied and novel protein families
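
As an illustration of the pLDDT quality metric listed above, the following minimal sketch computes a mean pLDDT for one predicted structure. It assumes the structure is available as an AlphaFold-style PDB file, where the per-residue pLDDT is written to the B-factor column; the function name mean_plddt is hypothetical and is not part of EnVhog or PhaLP.

```python
def mean_plddt(pdb_lines):
    """Return the mean pLDDT over C-alpha atoms of an AlphaFold-style PDB file.

    Assumption: the B-factor field (columns 61-66 of each ATOM record)
    holds the per-residue pLDDT, as in AlphaFold and ColabFold output.
    """
    scores = []
    for line in pdb_lines:
        # One C-alpha atom per residue, so each residue is counted once.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)
```

A typical call would pass an open file handle, for example mean_plddt(open("model.pdb")), where model.pdb is a placeholder path.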

Key Analysis Tools

SPAED - Segmentation of PhAge Endolysin Domains

SPAED is a specialized tool for identifying domains in phage endolysins using AlphaFold's Predicted Aligned Error (PAE) matrices. SPAED enables accurate domain boundary prediction by leveraging structural prediction confidence.

Key Features:

Analyzes PAE files from AlphaFold2/3 or ColabFold to identify domain boundaries
Hierarchical clustering approach for automatic domain segmentation
High-throughput processing of multiple proteins
Visualization tools for PyMOL to display predicted domains on structures
Extraction of domain sequences in FASTA format
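
To give an intuition for how hierarchical clustering of a PAE matrix can group residues into domains, here is a minimal sketch. It is not SPAED's actual algorithm: the average-linkage criterion, the cutoff value, and the function names are assumptions chosen for illustration, and the naive implementation is only practical for small matrices.

```python
def symmetrize(pae):
    """Average PAE[i][j] and PAE[j][i] so the matrix can serve as a distance."""
    n = len(pae)
    return [[(pae[i][j] + pae[j][i]) / 2 for j in range(n)] for i in range(n)]


def cluster_residues(pae, cutoff=10.0):
    """Naive average-linkage agglomerative clustering of residues.

    Residue groups are merged while the mean PAE between the two closest
    groups stays at or below `cutoff` (an illustrative value, in angstroms).
    Returns a list of residue-index lists, one per putative domain.
    """
    dist = symmetrize(pae)
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        if best[0] > cutoff:
            break  # remaining groups are too uncertain relative to each other
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in sorted(clusters, key=min)]
```

Residues with low mutual PAE end up in the same group, while a block of high PAE between two groups keeps them apart. For real proteins, SPAED itself or the web interface should be used.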

Web Interface:

SPAED is available through a user-friendly web interface at www.spaed.ca for quick analysis of individual proteins.

Citation:

Alexandre Boulay, Emma Cremelie, Clovis Galiez, Yves Briers, Elsa Rousseau, Roberto Vázquez. "SPAED: harnessing AlphaFold output for accurate segmentation of phage endolysin domains." Bioinformatics, Volume 41, Issue 10, October 2025, btaf531. https://doi.org/10.1093/bioinformatics/btaf531

InterPro

InterPro was used to annotate the domains delineated by SPAED. Only Pfam hits with an E-value < 0.001 were considered for downstream analyses, although hits to other member databases are also included in PhaLP 2.0.

Citation:

Jones, P. et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240.
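
The Pfam E-value filter described above can be sketched as follows. This assumes the annotations are available as standard InterProScan 5 TSV output, in which the fourth column names the analysis, the fifth the signature accession, and the ninth the score or E-value; the function name filter_pfam_hits is hypothetical and does not reflect the actual PhaLP pipeline code.

```python
def filter_pfam_hits(tsv_text, max_evalue=1e-3):
    """Keep Pfam rows of InterProScan TSV output with E-value below max_evalue."""
    kept = []
    for line in tsv_text.splitlines():
        cols = line.split("\t")
        # Skip short or malformed rows and hits from other member databases.
        if len(cols) < 9 or cols[3] != "Pfam":
            continue
        try:
            evalue = float(cols[8])
        except ValueError:
            continue  # some analyses report "-" instead of a score
        if evalue < max_evalue:
            kept.append({
                "protein": cols[0],
                "signature": cols[4],
                "start": int(cols[6]),
                "stop": int(cols[7]),
                "evalue": evalue,
            })
    return kept
```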

SUBLYME - Software for Uncovering Bacteriophage LYsins in MEtagenomic datasets

SUBLYME is a machine learning tool designed to identify bacteriophage lysins in metagenomic datasets. It was trained on proteins from the PhaLP database and uses ProtT5 protein embeddings for prediction.

Key Features:

Machine learning-based prediction using ProtT5 protein embeddings
Two-stage classification: lysin detection, then endolysin vs. VAL (Virion-Associated Lysin) classification
Includes Prodigal for gene prediction from genomic sequences
High-throughput processing with GPU support
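
The two-stage classification listed above can be pictured with the following minimal sketch. It only illustrates the control flow; the scorer callables, the 0.5 threshold, and the function name are hypothetical and do not represent SUBLYME's actual models or interface.

```python
from typing import Callable, Sequence

# A scorer maps a protein embedding to a probability in [0, 1].
Scorer = Callable[[Sequence[float]], float]


def two_stage_label(embedding: Sequence[float],
                    lysin_prob: Scorer,
                    endolysin_prob: Scorer,
                    threshold: float = 0.5) -> str:
    """Stage 1 decides lysin vs. non-lysin; stage 2 splits lysins by type."""
    if lysin_prob(embedding) < threshold:
        return "non-lysin"
    # Only proteins accepted as lysins reach the second classifier.
    if endolysin_prob(embedding) >= threshold:
        return "endolysin"
    return "VAL"
```

Chaining the classifiers this way means the second model only has to separate endolysins from VALs and never sees non-lysin proteins.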

Integration with EnVhog:

SUBLYME significantly expands the catalog of phage lytic proteins by mining the vast metagenomic resources available through the EnVhog database. This allows PhaLP 2.0 to include:

Novel lysin candidates from environmental samples
Lysins from uncultured phages
Expanded diversity of protein families and domain architectures
Proteins from diverse and extreme environments

Availability:

SUBLYME is available via PyPI (pip install sublyme) or as an Apptainer/Singularity container for complete genomic analysis pipelines.

MMseqs2 - Sequence Clustering

MMseqs2 (Many-against-Many sequence searching) is used in the SUBLYME pipeline for fast and sensitive sequence clustering and homology searching. MMseqs2 enables efficient processing of large metagenomic datasets by providing ultra-fast sequence comparisons while maintaining high sensitivity.

Citation:

Steinegger, M. and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028. doi:10.1038/nbt.3988
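
For readers who want to cluster their own sequences, the sketch below assembles an MMseqs2 easy-cluster command from Python. The identity and coverage values are illustrative assumptions, not the parameters used in the SUBLYME pipeline, and the helper name is hypothetical.

```python
def build_easy_cluster_cmd(fasta, out_prefix, tmp_dir,
                           min_seq_id=0.3, coverage=0.8):
    """Return the argument list for an `mmseqs easy-cluster` run.

    min_seq_id and coverage are placeholder values; choose them to match
    the clustering granularity needed for your own analysis.
    """
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
    ]
```

With MMseqs2 installed, the command can be executed with subprocess.run(build_easy_cluster_cmd("proteins.fasta", "clusters", "tmp"), check=True), where the file names are placeholders.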

Structural Data Integration

Experimental Structures

Some proteins have experimentally determined structures, which are obtained from the Protein Data Bank (PDB).

Predicted Structures

For proteins without experimental structures, we integrate structures predicted with AlphaFold v2 via ColabFold v1.5.5.

Structure predictions from PhaLP 1.0, generated with AlphaFold v3, are also included.

Technologies & Frameworks

Database & Web Infrastructure

MySQL/MariaDB - Relational database management system
Django - Web framework and ORM
Bootstrap 5 - Responsive web interface
JavaScript - Interactive features and visualizations

External APIs

NCBI Entrez - Genomic and sequence data
InterPro API - Domain annotations
PDB API - Structure data and metadata
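
As a rough guide to how these services are addressed, the sketch below builds request URLs for each of them. The endpoint patterns reflect the public NCBI E-utilities, InterPro, and RCSB PDB REST interfaces as we understand them; the helper names are hypothetical, the exact queries used by PhaLP 2.0 are not shown here, and each provider's documentation should be checked before use.

```python
from urllib.parse import urlencode


def entrez_efetch_url(protein_id):
    """URL for fetching a protein record in FASTA format via NCBI E-utilities."""
    params = {"db": "protein", "id": protein_id,
              "rettype": "fasta", "retmode": "text"}
    return ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
            + urlencode(params))


def interpro_pfam_url(uniprot_acc):
    """URL for the Pfam entries matching a UniProt accession (InterPro API)."""
    return ("https://www.ebi.ac.uk/interpro/api/entry/pfam/protein/uniprot/"
            + uniprot_acc)


def rcsb_entry_url(pdb_id):
    """URL for entry-level metadata in the RCSB PDB Data API."""
    return "https://data.rcsb.org/rest/v1/core/entry/" + pdb_id.upper()
```

The returned URLs can be passed to any HTTP client; the InterPro and RCSB endpoints return JSON.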

Citation & Acknowledgments

If you use PhaLP 2.0 in your research, please cite our publication. We also encourage users to cite the key resources and tools that make this database possible:

PhaLP 2.0 Database:

[Citation information will be added upon publication]

Key Tools & Resources:

SPAED: Alexandre Boulay et al. (2025). Bioinformatics, 41(10):btaf531. doi:10.1093/bioinformatics/btaf531
SUBLYME: github.com/Rousseau-Team/sublyme
EnVhog: Perez-Bucio R., Enault F., Galiez C. "EnVhogDB: an extended view of the viral protein families on Earth through a vast collection of HMM profiles." bioRxiv 2025-09-18. doi:10.24072/pcjournal.627
UniProt: The UniProt Consortium. Nucleic Acids Research.
AlphaFold: Jumper et al. (2021). Nature, 596:583-589.