ABSTRACT
The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data. GEO provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments. GEO is not intended to replace in-house gene expression databases that benefit from coherent data sets, and which are constructed to facilitate a particular analytic method, but rather to complement these by acting as a tertiary, central data distribution hub. The three central data entities of GEO are platforms, samples and series, which were designed with gene expression and genomic hybridization experiments in mind. A platform is, essentially, a list of probes that define what set of molecules may be detected. A sample describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data. A series organizes samples into the meaningful data sets which make up an experiment. The GEO repository is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.

BACKGROUND
Molecular biological experiments utilizing high-throughput hybridization array- and sequencing-based techniques have become extremely popular in recent years (1–3).
These techniques have been used to measure the molecular abundance of mRNA and genomic DNA in either absolute or relative terms. The main contributors to this popularity are the highly parallel nature of these techniques and the concomitant conservation of time and resources brought about by the large number of simultaneous (or near-simultaneous) molecular sampling events performed under very similar conditions.

For a number of years there has been a growing desire for these high-throughput data sets to be made publicly available once research findings have been published in the scientific literature, similar to journal and public funding requirements for the public release of biological sequence data. There have also been calls for the establishment of a public repository for (at least the gene expression microarray subset of) these data sets (4–6), and journals and public funding agencies have begun to make public availability of high-throughput data a condition of publication (7) or funding (e.g. NINDS request for proposals BAA-RFP-NIH-NINDS-01–03, p. 76 at http://www.ninds.nih.gov/funding/2rfp_01_03.pdf), respectively. Recognizing the desire that these data should be made widely available, several laboratories and institutions have constructed primary and secondary Internet resources to distribute these high-throughput data sets (Table 1). Over the last several years, there has been an international effort to catalog the minimal set of information necessary for microarray experiments to be properly interpreted and to be comparable with one another (6). The codification and publication of this set of guidelines will be invaluable as a guide for high-throughput gene expression and genomic hybridization data producers and data repositories. We feel, however, that over-zealous application of these guidelines in setting standards and requirements must be avoided because it will stifle a rapidly developing and technically challenging field.
Therefore, our primary goal in creating the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo) was to attempt to cover the broadest spectrum of high-throughput experimental methods possible and remain flexible and responsive to future trends, rather than setting rigid requirements and standards for entry. In taking this approach, however, we recognize that there are obvious, inherent limitations to the functionality and analysis that can be provided on such heterogeneous data sets. Hence, GEO is not intended to replace or match primary and secondary resources that operate on homogeneous data sets, but instead to serve as a complementary tertiary resource for the storage and retrieval of public high-throughput gene expression and genomic hybridization data.

REPOSITORY DESIGN
GEO segregates data into three principal components, platform, sample and series (Table 2), each of which is accessioned (i.e. given a unique and constant identifier) in a relational database (Fig. 1). To achieve an open and flexible design that allows storage and retrieval of very diverse data types, the data are not fully granulated within the database. Instead, a tab-delimited ASCII table is stored for each platform and each sample. The table consists of multiple columns with accompanying column header names. The data within this table are currently partially extracted for indexing, but may be further extracted for more extensive search and retrieval. In addition, any number of supplementary columns may be provided by the submitter for the inclusion of additional, submitter-defined information.

An instance of a platform is, essentially, a list of probes that define what set of molecules may be detected in any experiment utilizing that platform.
For example, the platform data table may contain GEO-defined columns identifying the position and biological reagent contents of each probe (spot), such as a GenBank accession number, open reading frame (ORF) name and clone identifier, as well as submitter-defined columns. Platform accession numbers have a ‘GPL’ prefix.

An instance of a sample describes the derivation of the set of molecules that are being probed and utilizes a platform to generate molecular abundance data. Each sample has one, and only one, parent platform, which must be previously defined. For example, a sample data table may contain columns indicating the final, relevant abundance value of the corresponding spot defined in its platform, as well as any other GEO-defined (e.g. raw signal, background signal) and submitter-defined columns. Sample accession numbers have a ‘GSM’ prefix.

An instance of a series organizes samples that are bound together by a common attribute into the meaningful data sets which make up an experiment. Series accession numbers have a ‘GSE’ prefix.
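The relationship between the three entities described above can be illustrated with a minimal Python sketch. It is not GEO's actual schema or software: the class names, the column names (ID, ACC, ORF, VALUE), the accession values and all table contents are hypothetical, chosen only to show a tab-delimited table with column headers, the accession prefixes, the single parent platform of each sample, and the grouping of samples into a series.

```python
import csv
import io
from dataclasses import dataclass, field


def parse_table(text):
    """Parse a tab-delimited table with a header row into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_prefix(accession, prefix):
    """Reject accessions that lack the expected prefix (GPL, GSM or GSE)."""
    if not accession.startswith(prefix):
        raise ValueError(f"{accession!r} does not start with {prefix!r}")


@dataclass
class Platform:
    accession: str
    rows: list = field(default_factory=list)  # one row per probe (spot)

    def __post_init__(self):
        check_prefix(self.accession, "GPL")


@dataclass
class Sample:
    accession: str
    platform: Platform  # one, and only one, previously defined parent platform
    rows: list = field(default_factory=list)  # one row per measured spot

    def __post_init__(self):
        check_prefix(self.accession, "GSM")
        if not isinstance(self.platform, Platform):
            raise TypeError("a sample must reference an existing Platform")


@dataclass
class Series:
    accession: str
    samples: list = field(default_factory=list)  # samples forming one experiment

    def __post_init__(self):
        check_prefix(self.accession, "GSE")


# Hypothetical tables; the column names and values are made up for illustration.
platform = Platform("GPL1", parse_table("ID\tACC\tORF\n1\tACC_A\tORF_A\n2\tACC_B\tORF_B\n"))
sample = Sample("GSM1", platform, parse_table("ID\tVALUE\n1\t0.52\n2\t1.87\n"))
series = Series("GSE1", [sample])

# Resolve each sample value to its probe through the shared spot identifier.
probes = {row["ID"]: row for row in sample.platform.rows}
joined = [(probes[row["ID"]]["ORF"], float(row["VALUE"])) for row in sample.rows]
print(joined)
```

Keeping each table as a list of rows keyed by column header mirrors the choice of not fully granulating the data: submitter-defined columns pass through without any change to the classes.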