Tools & Database Creation

Overview

PhaLP 2.0 is a comprehensive database of phage lytic proteins built by integrating data from two major sources: curated protein databases and metagenomic datasets. This page describes the key tools and resources used to construct and analyze the database content.

Primary Data Sources

1. UniProt - Curated Protein Database

UniProt serves as our primary source for experimentally validated and well-annotated phage lytic proteins. UniProt provides:

  • High-quality protein sequences with manual curation
  • Functional annotations and literature references
  • Domain architecture from InterPro and Pfam
  • Gene Ontology (GO) terms and Enzyme Commission (EC) numbers
  • Cross-references to structures (PDB) and genomic data (NCBI)
  • Sequence clusters (UniRef) and sequence archives (UniParc)

2. EnVhog - Environmental Metagenomic Database

EnVhog is a specialized database of environmental viral proteins identified from metagenomic data. It represents a rich source of novel phage lytic proteins from diverse environments:

  • Metagenomic sequences from various environmental samples
  • Predicted protein structures using AlphaFold
  • Quality metrics including pLDDT scores for structure confidence
  • Expands the diversity of phage proteins beyond cultured phages
  • Provides access to understudied and novel protein families
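
AlphaFold-style PDB files store the per-residue pLDDT in the B-factor column, so a structure's overall confidence can be summarized directly from the file. The following is a minimal sketch of that idea, not the EnVhog or PhaLP pipeline; the function name mean_plddt is an illustrative assumption.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average per-residue pLDDT, read from the B-factor column of CA atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no CA atoms found")
    return sum(scores) / len(scores)
```

Reading only CA atoms counts each residue once, which matches how pLDDT is reported per residue.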

Key Analysis Tools

SPAED - Segmentation of PhAge Endolysin Domains

SPAED is a specialized tool for identifying domains in phage endolysins using AlphaFold's Predicted Aligned Error (PAE) matrices. SPAED enables accurate domain boundary prediction by leveraging structural prediction confidence.

Key Features:

  • Analyzes PAE files from AlphaFold2/3 or ColabFold to identify domain boundaries
  • Hierarchical clustering approach for automatic domain segmentation
  • High-throughput processing of multiple proteins
  • Visualization tools for PyMOL to display predicted domains on structures
  • Extraction of domain sequences in FASTA format
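
To illustrate the general idea of segmenting a protein by hierarchical clustering of a PAE matrix, the sketch below groups residues with plain average-linkage clustering. It is a generic illustration, not SPAED's implementation: the linkage rule, the symmetrization step, the cutoff parameter and the function name segment_by_pae are all assumptions, and the naive loop is only suitable for small inputs.

```python
def segment_by_pae(pae, cutoff):
    """Group residue indices by average-linkage clustering of a PAE matrix.

    Merging stops once the closest pair of clusters has a mean
    inter-cluster PAE above `cutoff` (same units as the matrix).
    """
    n = len(pae)
    # PAE is not symmetric, so use the mean of both directions as a distance.
    dist = [[(pae[i][j] + pae[j][i]) / 2 for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Average distance over all residue pairs between two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in sorted(clusters, key=min)]
```

Residues with low mutual PAE end up in the same group, which is the intuition behind treating such a group as a candidate domain.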

Web Interface:

SPAED is available through a user-friendly web interface at www.spaed.ca for quick analysis of individual proteins.

Citation:

Alexandre Boulay, Emma Cremelie, Clovis Galiez, Yves Briers, Elsa Rousseau, Roberto Vázquez. "SPAED: harnessing AlphaFold output for accurate segmentation of phage endolysin domains." Bioinformatics, Volume 41, Issue 10, October 2025, btaf531. https://doi.org/10.1093/bioinformatics/btaf531


InterPro

InterPro was used to annotate the domains delineated by SPAED. Only Pfam hits with an E-value < 0.001 were considered for downstream analyses, but hits to other databases are also included in PhaLP 2.0.
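
The Pfam E-value filter described above can be sketched as follows, assuming the annotations come from InterProScan's standard tab-separated output (analysis name in column 4, signature accession in column 5, start and stop in columns 7 and 8, score in column 9). The function name pfam_hits is hypothetical and this is not the PhaLP pipeline code.

```python
import csv
import io

def pfam_hits(tsv_text, max_evalue=1e-3):
    """Yield (protein, pfam_accession, start, stop, evalue) for Pfam hits below the cutoff."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip short lines and hits from analyses other than Pfam.
        if len(row) < 9 or row[3] != "Pfam":
            continue
        try:
            evalue = float(row[8])
        except ValueError:
            # Some analyses report "-" instead of a numeric score.
            continue
        if evalue < max_evalue:
            yield row[0], row[4], int(row[6]), int(row[7]), evalue
```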

Citation:

Jones, P. et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240.

SUBLYME - Software for Uncovering Bacteriophage LYsins in MEtagenomic datasets

SUBLYME is a machine learning tool specifically designed to identify bacteriophage lysins in metagenomic datasets. SUBLYME was trained using proteins from the PhaLP database and utilizes highly informative ProtT5 protein embeddings for accurate predictions.

Key Features:

  • Machine learning-based prediction using ProtT5 protein embeddings
  • Two-stage classification: lysin detection, then endolysin vs. VAL (Virion-Associated Lysin) classification
  • Includes Prodigal for gene prediction from genomic sequences
  • High-throughput processing with GPU support
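
The two-stage logic can be illustrated with a small cascade: the second decision is only made for proteins that pass the first. This is a sketch of the control flow only, not SUBLYME's API; the probability inputs, the 0.5 cutoffs and the function name classify are hypothetical.

```python
def classify(p_lysin, p_endolysin, lysin_cutoff=0.5, type_cutoff=0.5):
    """Cascade: stage 1 decides lysin vs. non-lysin; stage 2 runs only on lysins."""
    if p_lysin < lysin_cutoff:
        # Stage 2 is never consulted for proteins rejected in stage 1.
        return "non-lysin"
    return "endolysin" if p_endolysin >= type_cutoff else "VAL"
```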

Integration with EnVhog:

SUBLYME significantly expands the catalog of phage lytic proteins by mining the vast metagenomic resources available through the EnVhog database. This allows PhaLP 2.0 to include:

  • Novel lysin candidates from environmental samples
  • Lysins from uncultured phages
  • Expanded diversity of protein families and domain architectures
  • Proteins from diverse and extreme environments

Availability:

SUBLYME is available via PyPI (pip install sublyme) or as an Apptainer/Singularity container for complete genomic analysis pipelines.


MMseqs2 - Sequence Clustering

MMseqs2 (Many-against-Many sequence searching) is used in the SUBLYME pipeline for fast and sensitive sequence clustering and homology searching. MMseqs2 enables efficient processing of large metagenomic datasets by providing ultra-fast sequence comparisons while maintaining high sensitivity.
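
MMseqs2 clustering results are commonly exported as a two-column TSV, with the cluster representative in the first column and a member in the second. As a minimal sketch of how such a file could be read, assuming that format (the function name read_clusters is hypothetical and not part of SUBLYME or MMseqs2):

```python
from collections import defaultdict

def read_clusters(tsv_text):
    """Map each representative sequence ID to the list of its cluster members."""
    clusters = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)
```

Each representative is also listed as a member of its own cluster, so cluster sizes can be read directly from the list lengths.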

Citation:

Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).

Structural Data Integration

Experimental Structures

Some proteins have experimentally determined structures, which are obtained from the Protein Data Bank (PDB).

Predicted Structures

For proteins without experimental structures, we include structures predicted with AlphaFold v2 through ColabFold v1.5.5.

Structure predictions from PhaLP 1.0, generated with AlphaFold v3, are also included.

Technologies & Frameworks

Database & Web Infrastructure

  • MySQL/MariaDB - Relational database management system
  • Django - Web framework and ORM
  • Bootstrap 5 - Responsive web interface
  • JavaScript - Interactive features and visualizations

External APIs

  • NCBI Entrez - Genomic and sequence data
  • InterPro API - Domain annotations
  • PDB API - Structure data and metadata
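
As an example of how one of these services is addressed, the sketch below builds an NCBI E-utilities efetch URL that requests a single protein record in FASTA format. It only constructs the URL and makes no network call; the function name efetch_fasta_url is hypothetical, and how PhaLP 2.0 actually queries Entrez is not described here.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accession: str) -> str:
    """Build an E-utilities efetch URL that returns one protein record as FASTA."""
    query = urlencode({
        "db": "protein",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"
```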

Citation & Acknowledgments

If you use PhaLP 2.0 in your research, please cite our publication. We also encourage users to cite the key resources and tools that make this database possible:

PhaLP 2.0 Database:

[Citation information will be added upon publication]

Key Tools & Resources:

  • SPAED: Alexandre Boulay et al. (2025). Bioinformatics, 41(10):btaf531. doi:10.1093/bioinformatics/btaf531
  • SUBLYME: github.com/Rousseau-Team/sublyme
  • EnVhog: Perez-Bucio R., Enault F., Galiez C. "EnVhogDB: an extended view of the viral protein families on Earth through a vast collection of HMM profiles." bioRxiv 2025-09-18. doi:10.24072/pcjournal.627
  • UniProt: The UniProt Consortium. Nucleic Acids Research.
  • AlphaFold: Jumper et al. (2021). Nature, 596:583-589.