Speaker: Ziying Liu
National Research Council of Canada, Institute for Information Technology
Knowledge Discovery in Genomics: A Multi-strategy Approach
Knowledge discovery is the process of discovering useful and, ideally, previously unknown knowledge from historical or real-time data obtained from sources such as biological experiments or clinical trials. It involves developing strategies for data preprocessing and analysis, such as classification. In high-throughput genomics applications, knowledge discovery supports various research and development activities, including (i) identifying relationships between genes and their functions based on time-series data (e.g., drug response over time or developmental stages), and (ii) investigating gene responses to various treatments at one discrete time point.

In this presentation we provide an overview of knowledge discovery in genomics and emphasize multi-strategy approaches, in which a number of methods are applied to identify differentially expressed genes in a given dataset. It is well recognized that different methods for identifying differentially expressed genes produce different lists of genes. Recently, several studies have compared the similarity and dissimilarity of the gene lists produced by different methods; however, none has addressed the problem of how best to consolidate the results of these methods. In this research, we develop a novel consolidation strategy based on the principle of characteristic similarity, such as gene co-regulation. First, a set of core genes is formed based on certain confidence criteria (e.g., the genes identified by all methods for a stringent analysis, or those common to more than one method for a less stringent study). This set of core genes is then clustered according to common characteristics (e.g., common expression behaviour, similarity in function and gene regulation) through data mining, literature mining, pathway database search, and/or functional similarity evaluation based on GO annotations. Second, the common characteristics of each cluster (e.g., DNA motifs for each of the co-expressed gene clusters) are identified. Each remaining gene identified by an individual method is then either selected or excluded, depending on the extent to which it shares certain characteristics (e.g., motifs, functional similarity) with one of the core clusters.
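The consolidation steps described above can be illustrated with a minimal Python sketch. It is not the implementation presented in the talk. As a stand-in for "characteristic similarity" it assumes Pearson correlation between expression profiles, whereas the approach described here also draws on motifs, pathway information, and GO annotations. The greedy clustering, the function names, the method names, the thresholds, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def core_genes(gene_lists, min_methods):
    """Genes reported by at least `min_methods` of the individual methods."""
    counts = Counter(g for genes in gene_lists.values() for g in set(genes))
    return {g for g, c in counts.items() if c >= min_methods}


def cluster_core(core, profiles, threshold):
    """Greedy clustering (an assumption): join the first cluster whose seed gene correlates."""
    clusters = []
    for gene in sorted(core):
        for cluster in clusters:
            if pearson(profiles[gene], profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


def centroid(cluster, profiles):
    """Mean expression profile of a cluster, used as its common characteristic."""
    return [sum(col) / len(cluster) for col in zip(*(profiles[g] for g in cluster))]


def consolidate(gene_lists, profiles, min_methods, threshold):
    """Core genes plus any non-core candidate that resembles a core cluster."""
    core = core_genes(gene_lists, min_methods)
    centroids = [centroid(c, profiles) for c in cluster_core(core, profiles, threshold)]
    candidates = set().union(*gene_lists.values()) - core
    selected = {g for g in candidates
                if any(pearson(profiles[g], c) >= threshold for c in centroids)}
    return core | selected


# Toy data (hypothetical): three methods, five genes, four conditions.
gene_lists = {
    "method_a": ["g1", "g2", "g3"],
    "method_b": ["g1", "g2", "g4"],
    "method_c": ["g1", "g2", "g5"],
}
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [4.0, 3.0, 2.0, 1.0],
    "g3": [1.1, 2.0, 3.2, 3.9],   # tracks g1
    "g4": [2.0, 2.1, 1.9, 2.0],   # flat, resembles neither core cluster
    "g5": [4.0, 3.1, 2.0, 0.9],   # tracks g2
}
result = consolidate(gene_lists, profiles, min_methods=3, threshold=0.9)
print(sorted(result))
```

In this toy run, setting `min_methods=3` corresponds to the stringent analysis (genes common to all methods), and lowering it to 2 corresponds to the less stringent study. Candidates reported by only one method are kept only when they correlate strongly with a core cluster.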

The effectiveness of the proposed methodology is demonstrated through its application to two in-house datasets from our research projects on the identification of biomarkers for breast and brain cancers.