Poster: bio_embeddings: Making Protein Language Models Accessible to the Wider Research Community

Konstantin Schütze, Christian Dallago, Michael Heinzinger, Burkhard Rost
TUM (Technical University of Munich), Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany

Abstract

Language models (LMs) have recently been adapted from natural language to protein sequences. Protein LMs show enormous potential for generating descriptive vector representations (embeddings) of proteins from their sequences alone, in a fraction of the time required by previous approaches. However, access to protein LMs is scattered across the web, which limits their use. bio_embeddings makes them readily accessible to the wider research community. The bio_embeddings pipeline offers a unified interface to protein LMs for simply and quickly embedding large protein sets, projecting the embeddings into lower-dimensional spaces, visualizing proteins on interactive scatter plots, and extracting annotations using either supervised models or unsupervised techniques. The package includes a web server, YAML-based pipelines, and a Python API, all with a focus on documentation and user experience.

Keywords: language model, protein embedding, embedding, protein language model, Python pipeline, feature extraction, protein visualization