Kerr, Gráinne (2009) Computational analysis of gene expression data. PhD thesis, Dublin City University.
Abstract
Gene expression is central to the function of living cells. While advances in sequencing and expression measurement technology over the past decade have greatly facilitated the understanding of the genome and its functions, the characterisation of functional groups of genes remains one of the most important problems in modern biology. Technological advancements have resulted in massive information output, with the priority objective shifting to the development of data analysis methods. As such, a large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments and, consequently, there is confusion regarding the best approach to take. Commonly applied techniques are not necessarily the most applicable for the analysis of patterns in microarray data. This confusion is clarified through provision of a framework for the analysis of clustering techniques. To this end, the properties of microarray data are examined, followed by an examination of the properties of clustering techniques and how well they apply to gene expression data.
Clearly, each technique will find patterns even if the structures are not meaningful in a biological context, and these structures are not usually the same for different algorithms. Also, these algorithms are inherently biased, as the properties of the clusters reflect built-in clustering criteria. From these considerations, it is clear that cluster validation, which is usually based on a manual, lengthy and subjective exploration process, is critical for algorithm development and verification of results. Consequently, it is key to the interpretation of gene expression data. We carry out a critical analysis of current methods used to evaluate clustering results. Clusters obtained from real and synthetic datasets are compared between algorithms.
To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented as a bipartite graph, with weighted edges between gene-sample node pairs corresponding to significant expression measurements of interest. In this research, this method of representation is studied extensively and is used, in combination with probabilistic models, to develop new clustering techniques for the analysis of gene expression data. The performance of these techniques can be influenced both by the search algorithm and by the graph weighting scheme, and both merit vigorous investigation. A novel edge-weighting scheme, based on empirical evidence, is presented. The scheme is tested using several benchmark datasets at various levels of granularity, and comparisons are provided with a popular data analysis method currently used in the bioinformatics community. The analysis shows that the new empirically based scheme out-performs current edge-weighting methods by accounting for the subtleties in the data through a data-dependent threshold analysis, and by selecting ‘interesting’ gene-sample couples based on relative values.
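As a rough illustration of the bipartite representation described above, the following Python sketch links a gene to a sample only when the expression value stands out relative to that gene's own profile. It is not the edge-weighting scheme developed in the thesis: the per-gene z-score, the function name `bipartite_edges` and the cut-off `z_cut` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def bipartite_edges(expr, genes, samples, z_cut=1.0):
    """Return {(gene, sample): weight} for 'interesting' measurements.

    Assumption: a measurement is kept when its z-score, computed against
    the gene's own mean and standard deviation across samples, has an
    absolute value of at least z_cut. The z-score is used as the weight.
    """
    edges = {}
    for gene, row in zip(genes, expr):
        mu, sigma = mean(row), pstdev(row)
        if sigma == 0:
            # A flat profile carries no relative information; skip it.
            continue
        for sample, value in zip(samples, row):
            z = (value - mu) / sigma
            if abs(z) >= z_cut:
                edges[(gene, sample)] = z
    return edges


# Toy data: rows are genes, columns are samples.
expr = [[1.0, 1.0, 1.0, 5.0],
        [2.0, 2.0, 2.0, 2.0]]
edges = bipartite_edges(expr, ["g1", "g2"], ["s1", "s2", "s3", "s4"])
```

Because the threshold is relative to each gene's profile, the cut-off adapts to the data rather than relying on a single absolute expression level; any real scheme would need to be validated against benchmark datasets as the abstract describes.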
The graphical theme of gene expression analysis is further developed by construction of a one-mode gene expression network which focuses specifically on local interactions among genes. Classical network theory is used to identify and examine organisational properties in the resulting graphs. A new algorithm, GraphCreate, is presented which finds functional modules in the one-mode graph, i.e. sets of genes which are coherently expressed over subsets of samples, and a scoring scheme is developed (using bipartite graph properties as a basis) to weight these modules. This representation is used to study published gene expression datasets extensively and to identify functional modules of genes with GraphCreate. This work is important as it advances research in the area of transcriptome analysis, beyond simply finding groups of coherently expressed genes, by developing a general framework to understand how and when gene sets are interacting.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | November 2009 |
Refereed: | No |
Supervisor(s): | Ruskin, Heather J. and Crane, Martin |
Uncontrolled Keywords: | microarray data analysis; gene expression data; supervised and unsupervised clustering methods; graph theory |
Subjects: | Biological Sciences > Bioinformatics; Computer Science > Computer simulation |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
Funders: | National Institute for Cellular Biotechnology (NICB) |
ID Code: | 14837 |
Deposited On: | 17 Nov 2009 15:11 by Martin Crane. Last Modified 27 Sep 2019 11:35 |
Documents
Full text available as:
PDF (5MB)
PDF, 3rd party copyright material has been removed (4MB)