Analyzing large biological datasets with association networks
Tatiana V. Karpinets, Byung H. Park, and Edward C. Uberbacher
2012 May 24, Nucleic Acids Research 40(17): 1-8
Due to advances in high-throughput biotechnologies, biological information is being collected in databases at an amazing rate, requiring novel computational approaches that process collected data into new knowledge in a timely manner. In this study, we propose a computational framework for discovering modular structure, relationships and regularities in complex data. The framework utilizes a semantic-preserving vocabulary to convert records of biological annotations of an object, such as an organism, gene, chemical or sequence, into networks (Anets) of the associated annotations. An association between a pair of annotations in an Anet is determined by the similarity of their co-occurrence patterns with all other annotations in the data. This feature captures associations between annotations that do not necessarily co-occur with each other and facilitates discovery of the most significant relationships in the collected data through clustering and visualization of the Anet. To demonstrate this approach, we applied the framework to the analysis of metadata from the Genomes OnLine Database (GOLD) and produced a biological map of sequenced prokaryotic organisms with three major clusters of metadata that represent pathogens, environmental isolates and plant symbionts.
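The association measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy records, the `TYPE=value` token format, and the function names are assumptions made for illustration, and the choice to exclude the pair's own entries from each profile is a simplification that the abstract does not specify. The sketch counts pairwise co-occurrences across records and then scores two annotations by the Pearson correlation of their co-occurrence profiles with all other annotations.

```python
from itertools import combinations
from math import sqrt

# Hypothetical records: each organism is a set of "TYPE=value" annotations.
records = [
    {"HABITAT=host", "DISEASE=yes", "OXYGEN=aerobe"},
    {"HABITAT=host", "DISEASE=yes", "OXYGEN=anaerobe"},
    {"HABITAT=soil", "DISEASE=no", "OXYGEN=aerobe"},
    {"HABITAT=soil", "DISEASE=no", "OXYGEN=anaerobe"},
    {"HABITAT=marine", "DISEASE=no", "OXYGEN=aerobe"},
]


def cooccurrence(records):
    """Count, for every pair of annotations, the records that contain both."""
    items = sorted(set().union(*records))
    counts = {a: {b: 0 for b in items} for a in items}
    for rec in records:
        for a, b in combinations(sorted(rec), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return items, counts


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def profile_similarity(a, b, items, counts):
    """Correlate the co-occurrence profiles of a and b over all other annotations."""
    others = [c for c in items if c not in (a, b)]
    return pearson([counts[a][c] for c in others],
                   [counts[b][c] for c in others])


items, counts = cooccurrence(records)
soil_marine = profile_similarity("HABITAT=soil", "HABITAT=marine", items, counts)
soil_host = profile_similarity("HABITAT=soil", "HABITAT=host", items, counts)
print(counts["HABITAT=soil"]["HABITAT=marine"], soil_marine, soil_host)
```

In this toy data, `HABITAT=soil` and `HABITAT=marine` never appear in the same record, yet their profiles correlate positively because both co-occur with `DISEASE=no`. That is the kind of indirect association the abstract refers to.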
Computational framework for analysis of annotations collected in biological databases. Steps 1 and 2 (blue) are described in more detail in the text. Step 3 uses the classic Apriori algorithm to learn association rules (Arules) from the type-value formatted transactions. Step 4 employs known visualization and clustering tools to analyze the generated Anets. Step 5 uses filtering tools available in spreadsheet applications.
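For readers unfamiliar with Step 3, the following is a compact sketch of Apriori-style rule mining over type-value transactions. It is an illustration under stated assumptions, not the tool used in the study: the transactions, the thresholds, and the function names `apriori` and `rules` are hypothetical. The sketch finds frequent itemsets level by level, pruning any candidate that has an infrequent subset, and then derives rules that meet a confidence threshold.

```python
from itertools import combinations

# Hypothetical type-value transactions, one per annotated organism.
transactions = [
    {"HABITAT=host", "DISEASE=yes", "OXYGEN=aerobe"},
    {"HABITAT=host", "DISEASE=yes", "OXYGEN=anaerobe"},
    {"HABITAT=soil", "DISEASE=no", "OXYGEN=aerobe"},
    {"HABITAT=soil", "DISEASE=no", "OXYGEN=anaerobe"},
    {"HABITAT=marine", "DISEASE=no", "OXYGEN=aerobe"},
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([i]) for i in set().union(*transactions)}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates of size k that are frequent.
        level = {c: s for c, s in ((c, support(c)) for c in current)
                 if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Derive (lhs, rhs, support, confidence) rules from frequent itemsets."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, supp, confidence))
    return found


frequent = apriori(transactions, min_support=0.4)
arules = rules(frequent, min_confidence=0.9)
for lhs, rhs, supp, conf in arules:
    print(sorted(lhs), "->", sorted(rhs), supp, conf)
```

The lookup `frequent[lhs]` relies on the downward-closure property: every subset of a frequent itemset is itself frequent, so its support has already been stored.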
Biological maps of sequenced prokaryotic organisms based on their metadata collected in GOLD. The maps are based on the Anet generated from the metadata using Pearson correlation as the similarity measure and two P-value thresholds: 0.05 (a) and 0.01 (b). The maps link environmental, physiological, genomic and phenotypic characteristics based on the similarity of their co-occurrence profiles across the sequenced prokaryotic organisms and reveal similar communities (clusters) of annotations, indicated by color. Names of the seven most populated clusters were assigned by manual curation of the PROJECT_RELEVANCE annotations within each cluster.
Analyzing large biological datasets with association networks. Tatiana V. Karpinets, Byung H. Park, Edward C. Uberbacher. Nucleic Acids Res. 2012 September; 40(17): e131. PMCID: PMC3458522