Journal Article

MoRF_ESM: Prediction of MoRFs in Disordered Proteins Based on a Deep Transformer Protein Language Model

Abstract
Molecular recognition features (MoRFs) are particular functional segments of disordered proteins. They play crucial roles in regulating the phase transition of membrane-less organelles and frequently serve as central sites in cellular interaction networks. As more associations between disordered proteins and severe diseases are discovered, identifying MoRFs has become increasingly important. Because few MoRFs have been experimentally validated, existing MoRF prediction algorithms still perform inadequately and need improvement. In this research, we present a model named MoRF_ESM, which uses deep-learning protein representations to predict MoRFs in disordered proteins. The approach employs a pretrained ESM-2 protein language model to generate embedding representations of residues in the form of attention map matrices. These representations are combined with a self-learned TextCNN model for feature extraction and prediction. In addition, an averaging step is incorporated at the end of the MoRF_ESM model to refine the output and generate the final prediction results. Compared with other leading methods on benchmark datasets, MoRF_ESM demonstrates state-of-the-art performance, achieving an AUC 0.024 to 0.181 higher than other methods on TEST1 and 0.052 to 0.3 higher on TEST2. These results imply that the combination of ESM-2 and TextCNN can effectively extract deep evolutionary features related to protein structure and function, while also capturing shallow pattern features in protein sequences, and is well suited to the MoRF prediction task.
Given that ESM-2 is a highly versatile protein language model, the methodology proposed in this study can be readily applied to other tasks involving the classification of protein sequences.
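The abstract mentions an averaging step that refines per-residue outputs and reports results as AUC, but it gives no details of either. The following is only a minimal sketch of one plausible reading: a centred moving average over neighbouring residue scores, followed by a rank-based ROC AUC. The function names `smooth_scores` and `auc`, the default window size of 5, and the moving-average form itself are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def smooth_scores(scores: Sequence[float], window: int = 5) -> List[float]:
    """Average each residue's score with its neighbours in a centred window.

    Assumption: the "averaging step" is a simple moving average. The window
    is truncated at the sequence ends, so each residue is averaged only over
    positions that actually exist.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) residue pairs in which
    the positive residue scores higher; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative residue")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical per-residue MoRF propensities and labels (1 = MoRF residue).
    raw = [0.10, 0.90, 0.20, 0.80, 0.70, 0.10]
    labels = [0, 0, 0, 1, 1, 0]
    refined = smooth_scores(raw, window=3)
    print("AUC before smoothing:", auc(labels, raw))
    print("AUC after smoothing: ", auc(labels, refined))
```

The pairwise AUC is quadratic in sequence length, which is acceptable for a single protein but would normally be replaced by a sorted-rank computation or a library routine on full benchmark sets.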
Chun Fang (0000-0002-0161-0412) [1]; Jiasheng He [2]; Hayato Yamana [3]. MoRF_ESM: Prediction of MoRFs in Disordered Proteins Based on a Deep Transformer Protein Language Model [J]. Journal of Bioinformatics and Computational Biology, 2024, 22(02): 2450006.