Institution

Abstract:

RESEARCH OBJECTIVES: (1) Find targets for structural genomics. (2) Explore the evolution of disorder.

TECHNICAL OBJECTIVE: Port all software in use by the Rost lab to the environment of the TUM.

PROJECTED RESOURCES:
Data: 10 TB total;
 * 1.5 TB for the databases and resources needed to run jobs;
 * 1 TB of scratch space;
 * 8.5 TB of long-term storage space (until I arrive at the TUM and mount my own RAIDs to hold these data)
CPU: ~400,000 hours (see below)

BACKGROUND and RATIONALE: Today, we have data for about 1000 organisms; for about 700 of those, we have high-resolution information about their protein sequences (note that for most higher eukaryotes the protein sequences are still not assigned, i.e. the genome is not yet mapped to the proteome). All the computational tasks envisioned here begin with running a battery of sequence analysis and protein structure/function prediction programs on entirely sequenced organisms. These programs need databases and other resources adding up to about 1.5 TB. We will have to execute all programs for each protein in the ~700 organisms, which together contain ~1 million proteins. The first job will run all these proteins one by one and will create over 7 TB of data. The second job will cluster, i.e. cross-connect, the data; this will generate another 1 TB.

Estimating the CPU time is difficult for me. On our current cluster (2.33 GHz multithreaded) we need about 20 min per protein, i.e. the whole job would take 1 M * 1/3 hour, or about 330,000 CPU hours. The number is difficult to estimate because I do not know how this multithreaded 2.33 GHz machine will scale to your machine. Clustering is also difficult to estimate, as we have never done it on this scale. We are currently considering splitting the job, which would presumably add another ~50,000 CPU hours.

We pursue two important research objectives with this project. The first is related to the selection of targets for large-scale structural genomics.
Structural genomics is an initiative that attempts to experimentally determine high-resolution structures of proteins for which we today have no experimental and little or no in silico evidence about structure. A further set of constraints to optimize is that targets should be biologically important, experimentally feasible, and chosen such that a single experimental structure leads to many in silico models, thereby extending the experimental information. Toward this end we need the output of many of our methods, and we need to cluster a large fraction of all known proteins.

The second scientific objective pertains to the study of proteins that have a particular structural feature, namely that they are natively unstructured, i.e. they adopt regular three-dimensional structures only upon binding to substrates. Such proteins are particularly abundant in higher eukaryotes. In fact, the abundance of these proteins is, aside from alternative splicing and the number of proteins, the most dramatic difference between the proteomes of higher eukaryotes and those of simple bacteria. Again, in this context we need to run a variety of prediction methods systematically over a great biological diversity of entirely sequenced organisms.

All data generated will be made publicly available. Other potential users who are less experienced with the battery of tools that we will apply will thereby still benefit from this work. As soon as we get closer to planning this undertaking, we will reach out to MIPS and UniProt to discuss ways of making the data available through those major outlets for protein data.
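
The CPU estimate given under BACKGROUND and RATIONALE can be written out as a short calculation. This is only a sketch of the arithmetic: the function name `cpu_hours` and the `speed_factor` parameter are hypothetical and not part of any lab software. `speed_factor` stands in for the unknown scaling between our current cluster and the target machine and is left at 1.0.

```python
def cpu_hours(n_proteins: int, minutes_per_protein: float,
              speed_factor: float = 1.0) -> float:
    """Total CPU hours for running every protein once.

    speed_factor is a hypothetical knob: values above 1.0 mean the
    target machine is faster than the current 2.33 GHz cluster.
    """
    return n_proteins * minutes_per_protein / 60.0 / speed_factor


# Figures taken from the text above.
N_PROTEINS = 1_000_000        # ~1 million proteins in ~700 organisms
MINUTES_PER_PROTEIN = 20      # measured on the current cluster
CLUSTERING_HOURS = 50_000     # rough allowance for the clustering job

prediction = cpu_hours(N_PROTEINS, MINUTES_PER_PROTEIN)
total = prediction + CLUSTERING_HOURS

print(f"prediction jobs: {prediction:,.0f} CPU hours")
print(f"with clustering: {total:,.0f} CPU hours")
```

With the figures above, the total comes to roughly 383,000 CPU hours, which is where the rounded request of ~400,000 hours comes from. Changing `speed_factor` shows how strongly the request depends on the unknown machine scaling.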