Institution

Abstract:

All the computational tasks envisioned here begin with running a battery of sequence analysis and protein structure/function prediction programs for entirely sequenced organisms. We will begin with a set of 45,000 proteins of particular biological interest; this will create over 50 Gb of results. Next, we will select about 100 representative eukaryotic and prokaryotic organisms, together comprising about 500,000 proteins. Our ultimate selection will be constrained by what we can achieve in this project. We pursue two important research objectives with this project. The first is the selection of targets for large-scale structural genomics, an initiative that attempts to experimentally determine high-resolution structures of proteins for which we currently have no experimental and little or no in silico evidence about structure. Beyond this lack of structural evidence, we want targets that are biologically important, that are experimentally feasible, and for which a single experimental structure will lead to many in silico models, thereby extending the experimental information. Toward this end, we need the output of many of our methods, and we need to cluster a large fraction of all known proteins. The second scientific objective is the study of proteins that are natively unstructured, i.e. that adopt regular three-dimensional structures only upon binding to substrates. Such proteins are particularly abundant in higher eukaryotes. In fact, aside from alternative splicing and the number of proteins, their abundance is the most dramatic difference between the proteomes of higher eukaryotes and those of simple bacteria. Here too, we need to run a variety of prediction methods systematically across a wide biological diversity of entirely sequenced organisms. All data generated will be made publicly available.
In this way, other potential users who are less experienced with the battery of tools that we want to apply can benefit from this work. Indeed, these data will be available to the readers of the next issue of the journal CELL. The editors of CELL have already made the necessary provisions for linking to the predictions coming out of this project via a new automatic tagging system named Reflect.