Journal article

A new pipeline for structural characterization and classification of RNA-Seq microbiome data

Abstract
Background: High-throughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. Identifying dependencies within these systems requires analyzing the underlying interaction patterns among all the variables that make up the system. This task is challenging because data from DNA-sequencing experiments are compositional: traditional interaction metrics (e.g., correlation) produce unreliable results when applied to relative fractions instead of absolute abundances. The compositionality-associated challenges extend to the classification task, as it usually involves characterizing the interactions between the principal descriptive variables of the datasets. Classifying new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., controls and cases) could help researchers in the development of treatments/drugs.

Results: Here, we develop and exemplify a new approach, applicable to compositional data, for the classification of new samples into two groups with different biological settings. We propose a new metric to characterize and quantify the overall correlation structure deviation between these groups, and a technique for dimensionality reduction to facilitate graphical representation. We conduct simulation experiments with synthetic data to assess the proposed method’s classification accuracy. Moreover, we illustrate the performance of the proposed approach using Operational Taxonomic Unit (OTU) count tables obtained through 16S rRNA gene sequencing data from two microbiota experiments. We also compare our method’s performance with that of two state-of-the-art methods.

Conclusions: Simulation experiments show that our method achieves a classification accuracy equal to or greater than 98% when using synthetic data. Finally, our method outperforms the other classification methods with real datasets from gene sequencing experiments.
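The abstract's point that correlation is unreliable on relative fractions can be illustrated with a small simulation. The sketch below is not the authors' pipeline or metric; it only assumes three hypothetical taxa with independent log-normal absolute abundances, normalizes each sample to fractions (closure), and compares the correlation matrices. The helper names `closure` and `clr` (the centered log-ratio transform commonly used for compositional data) are introduced here for illustration.

```python
import numpy as np


def closure(counts):
    """Convert absolute abundances to relative fractions (each row sums to 1)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def clr(fractions):
    """Centered log-ratio transform: log of each part minus the row's mean log."""
    logs = np.log(fractions)
    return logs - logs.mean(axis=1, keepdims=True)


# Assumption for illustration: 500 samples, 3 taxa, independent abundances.
rng = np.random.default_rng(0)
absolute = rng.lognormal(mean=[5.0, 5.0, 5.0], sigma=0.5, size=(500, 3))
relative = closure(absolute)

# Correlation between taxa (columns) before and after closure.
corr_abs = np.corrcoef(absolute, rowvar=False)
corr_rel = np.corrcoef(relative, rowvar=False)

print("absolute abundances:\n", np.round(corr_abs, 2))
print("relative fractions:\n", np.round(corr_rel, 2))
```

Because the fractions in each sample must sum to one, an increase in one taxon forces the others down, so the off-diagonal correlations of `relative` come out clearly negative even though the underlying abundances are independent. Transforms such as `clr` are one common way to work with such data; whether the paper uses this particular transform is not stated in the abstract.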
Sebastian Racedo[1];Ivan Portnoy[1];Jorge I. Vélez[1];Homero San-Juan-Vergara[1];Marco Sanjuan[1];Eduardo Zurek[1]. A new pipeline for structural characterization and classification of RNA-Seq microbiome data[J]. BioData Mining, 2021,14(1): 1-18