Peter F. Lemkin(1), Geoffrey A. Orr(2), Mark P. Goldstein(3), G. Joseph Creed(4), James E. Myrick(5), Carl R. Merril(4)
Keywords: Protein Disease Database, PDD, World-Wide-Web,
hypertext, Internet, relational database, fold change, two-dimensional
electrophoresis, acute phase proteins, databases, factual/standards,
proteins/genetics, human, electrophoresis, gel, two-dimensional, blood
proteins/analysis, CSF proteins/analysis, urinary proteins/analysis.
Note: See the first paper of this two-part series, The Protein Disease Database of human body fluids: I. Rationale for the development of this database.
Correspondence:
(1)
P. F. Lemkin,
Image Processing Section/LMMB,
Bld 469 Room 150, NCI-FCRDC/NIH, Frederick, MD 21702.
(2) PRI/Dyncorp, NCI-FCRDC, Frederick, MD (now at UNISYS, Reston, VA);
(3) Monoclonetics Int'l., Houston, TX;
(4) LBG, NIMH/NIH Neuroscience Center, Washington, D.C.;
(5) NCEH/CDCP Div. Environ. Health Lab. Sci., Atlanta, GA. Note: This work was first introduced at the Sept. 5-7, 1994
conference "2-D Electrophoresis: from Protein Maps to Genomes" in
Siena, Italy - a working conference on two-dimensional electrophoresis
and its link to genomes.
We are using the Internet World Wide Web (WWW) and the Web browser paradigm as an access method for wide distribution and querying of the Protein Disease Database. The WWW hypertext transfer protocol and its Common Gateway Interface make it possible to build powerful graphical user interfaces that can support easy-to-use data retrieval using query specification forms or images. The details of these interactions are totally transparent to the users of these forms. Using a client-server SQL relational database, user query access, initial data entry and database maintenance are all performed over the Internet with a Web browser. We discuss the underlying design issues, mapping mechanisms and assumptions that we used in constructing the system, data entry, access to the database server, security, and synthesis of derived two-dimensional gel image maps and hypertext documents resulting from SQL database searches.
Emphasis of the database is on proteins from human body fluids such as plasma, serum, CSF, and urine. These matrices were selected because of the ease of sample collection and data availability. The initial focus described in the companion paper is on the acute phase proteins (APP) (Mackiewicz, 1993). The purpose of collecting these data in the form of relative concentrations ("fold changes", defined below) is to allow investigators to explore relationships between disease states and protein patterns observed in a battery of protein assays. Improvements in 2-DE gel reproducibility make it more feasible to compare 2-DE protein gels between laboratories and provide another method for putative protein identification.
While the initial focus is on the APP, emphasis is being placed on collecting data on diseases of interest to the NIMH (neurological diseases), the NCI (cancer), and the CDC (toxicant exposures and genetic diseases).
We will first describe the data and the paradigm for the database. We then discuss the use of the World Wide Web on the Internet for delivering such an interactive database system to users. Figure 1 illustrates the general scheme for using quantitative fold change information derived from the literature.
The PDD currently runs on a dedicated SparcStation-2 computer with SOLARIS, which hosts both the WWW server and the RDBMS server used by the Working PDD. The Staging PDD server and database run on a separate SparcStation-2 with SUNOS. Data are entered and checked on the Staging PDD and then manually copied to the Working PDD. This process will become more automated, as we show later in the discussion. The initial database did not contain much data, as we were primarily concerned with eliminating bugs and developing a smoothly functioning data entry paradigm.
While great progress has been made in the National Library of Medicine's bibliographic databases (MEDLINE and the Entrez subset) and in highly focused single-entity databases (SWISS-PROT, GenBank, etc.), these sources remain of limited utility for searches driven by quantitative relationships between entities, such as those between multiple protein concentration and/or activity profiles and disease states.
The use of such multiple-entity quantitative research tools is possible only through a method of literature evaluation and summarization that is closer to meta-analysis (Mann, 1990; Peto, 1993; Powe, 1994) than to the traditional literature review. In meta-analysis, results from a number of studies (typically large-scale clinical trials) are combined to support statistical decisions, taking into account the fact that the studies were probably performed with different protocols. Our method resembles meta-analysis in that both are retrospective, generally rely on "data mining" the published literature, and must deal with issues of publication bias, research quality, and comparability of data from different sources.
These methods differ in two important ways. While meta-analyses normally seek a conclusion, the approach taken in the PDD allows searches of the protein disease literature to suggest hypotheses by finding protein patterns associated with diseases. These patterns may then be further investigated using standard research methods. Further, while the statistical requirements of meta-analysis argue in favor of studying a well-constrained set of therapies/diseases, the PDD approach encourages examination of relationships among sets of proteins and sets of diseases.
Fold-change = (mean of disease values) / (mean of normal values)
Qualitative changes are indicated by a fold change of 0 (if the protein is missing in the disease state) or infinity (if missing in the normal). An estimated value is just that - a value estimated from the data in a particular study. There may be several different studies on the same diseases and proteins. We therefore need to search the PDD database on a range of fold-values where the normal and abnormal states have non-overlapping ranges. The fold change ranges may be calculated from the estimated value ranges, or may be calculated directly in the PDD when adequate data are present. The worst case fold change ranges [Fl : Fu]n and [Fl : Fu]a can be estimated from the upper (u) and lower (l) bounds of the normal (N) and abnormal (A) values:
[Fl : Fu]n = [Au / Nl : Au / Nu] and [Fl : Fu]a = [Al / Nl : Au / Nl].
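As a brief worked illustration of the basic fold change (the concentrations below are hypothetical values chosen for this example, not data from any study in the PDD), a protein whose mean concentration rises from 40 mg/L in normals to 120 mg/L in a disease group is recorded as:

Fold-change = (mean of disease values) / (mean of normal values)
            = 120 mg/L / 40 mg/L
            = 3.0

Had the same protein been undetectable in the disease group, its fold change would be recorded as 0.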
We recognize the statistical limitations of this method and strongly warn the user against reaching conclusions based solely on fold change data presented in the PDD because of differences in study design and protein detection methods (cf. Merril, 1995). Results from a PDD search should be treated as partially complete prescreening data which are subject to further analysis by the researcher to determine if they are relevant (much as is done in a Medline search).
For a literature reference to be used, it must be in refereed or certified literature and must, at a minimum, present: 1) quantitative mean protein changes between normal and disease (or condition) states, including lower bound, upper bound, estimated value, and standard deviation, 2) a p-value associated with these changes, and 3) the number of patients (for both control and disease groups) used in the study.
Data from some of these well maintained genomic databases are used to supplement PDD data when possible. If such federated data are available from remote WWW servers, then they are used internally by the PDD for various functions as well as being available to users as hypertext links. These include: 1) proteins-to-spots in viewable reference 2-DE gel maps - both in the PDD and in external databases, 2) proteins-to-identifiers in external databases available over the Internet such as the ExPASy system for SWISS-PROT (Bairoch, 1994) and SWISS-2DPAGE (Appel, 1993; Appel, 1994), GDB(TM) (Fasman, 1994), GenBank (Benson, 1994), and others, 3) synonyms of diseases or conditions linked to the Unified Medical Language System (UMLS(TM)) of the National Library of Medicine (UMLS, 1994), and 4) literature references to external MEDLINE(TM) servers for recovering titles, authors, abstracts, and journal references. A synonym database for resolving terms will be critical for both data capture and queries across a wide variety of biomedical literature, where the same concept may appear under different names. The types of external genomic databases discussed above are examples of members of this loose federation of databases. The PDD uses this information and makes its disease correlation data available for other federation database servers to use.
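As a hedged sketch of how such a cross-reference might be retrieved for building a hypertext link, the query below looks up a SWISS-PROT accession for a selected protein. The external_id table, its db_name and accession columns, and the protein name used are hypothetical illustrations and are not part of the published PDD schema; only the protein table and its prot_id and prot_name columns appear in the example query of Table 5.

-- Hypothetical sketch: external_id, db_name and accession are illustrative
-- names only; the returned accession could be appended to the external
-- server's base URL to form the hypertext link.
SELECT protein.prot_name, external_id.accession
FROM protein, external_id
WHERE
  (protein.prot_name = 'C-reactive protein') AND
  (external_id.prot_id = protein.prot_id) AND
  (external_id.db_name = 'SWISS-PROT')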
The WWW, developed at CERN in Geneva, Switzerland, is a network-based hypertext system for dynamically linking various information sources. The network in this case is the Internet, which spans the world. An information source is any accessible body of multimedia files (text, image and sound) or databases that may be accessed from a "server" computer located on the Internet. Currently, most information sources are available free, although there is some interest in eventually charging for high quality, commercially valuable information sources. The WWW was initially conceived by Tim Berners-Lee to link high energy physics researchers, but its use has expanded exponentially to include all types of information (Berners-Lee, 1994). [For more information on the World Wide Web, see URL http://www.w3.org/hypertext/WWW/TheProject.html maintained by the W3 Consortium.]
The Mosaic Web browser program (Andreessen, 1994; Schatz, 1994) was developed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. Another popular Web browser is Netscape by Netscape Communications Corporation. Whereas the WWW is the collection of information sources distributed over the Internet, Web browsers are computer tools that run on your computer and allow you to browse this information (just as you might browse a book, journal or newspaper). Since there are several WWW browsers available in addition to Mosaic and Netscape, in the remainder of the paper we will refer to browsers with similar capabilities as generic Web browsers. Browsers are available for UNIX, Macintosh and Windows-PC computers. Web browsers access WWW servers using the Internet TCP/IP protocol and a client-server message passing method. The use of TCP/IP implies that a full Internet connection is required. (See Dougherty, 1994 for a good introductory discussion of this.) Web browsers offer simple, intuitive point-and-click graphical user interfaces to the WWW and the Internet. The user clicks on underlined text or active images called hypertext references, and the Web browser responds by following the link to the associated information source and retrieving new information from the WWW servers corresponding to those text or image objects. Associated with each hypertext reference is a Uniform Resource Locator (URL) that points to the exact information source at a particular Internet site. The Internet therefore may be thought of as a repository for distributed hypermedia. Figure 2 illustrates the relation of the Web browser to the PDD's WWW information source.
There are several advantages in using a Web browser as a user interface. It is the first universal interface for the Internet that encourages world-wide collaboration. All information sources are handled transparently from the user's perspective as hypertext, which is all they usually see. The behavior of the Web browser program is the same regardless of whether it is running on a Windows PC, a Macintosh or a UNIX workstation. Finally, all of these cross-platform Web browsers are available free from the Internet for those who want to download them using anonymous FTP (ftp.ncsa.uiuc.edu/Web). [The name ftp.ncsa.uiuc.edu is an Internet address of an "anonymous FTP" site. Anyone may attach to this site by running FTP on his/her computer and connecting to it. The login name is "anonymous" and the password is your e-mail address. Typing "Help" will list the commands and instruct you on how to use them. After attaching, move to the directory /Web. At this point you may look at the "directory" of files available and "get" files you are interested in, using "binary" transfer mode. Typically there is a README file that can be retrieved, read and then used to determine the files you should finally retrieve.] The Netscape browser by Netscape Communications Corporation is available by anonymous FTP (ftp.mcom.com). Commercially supported versions of WWW browsers are also starting to become available.
The HTTP and HTML paradigms support data entry forms, in-line and popup images in documents, and point-and-click interactive images backed by executable programs (called Common Gateway Interface, or CGI, programs). This paradigm enables us to implement fill-in form queries as well as data entry. The interactive images allow us to interrogate 2-DE gel maps by clicking on spots to access the associated data in the database. Finally, the CGI program extensions allow us to interact with the database in more interesting ways than just retrieving a document. In this context, Figures 2 and 8 show the collection of interdependent HTML documents, special HTML templates, and CGI programs using the RDBMS client-server SQL database used in the PDD.
The main steps in PDD data entry and validation are: 1. Readers record the data extracted from each article on a standard paper form developed by the PDD Editorial Board. Use of a paper record helps enforce data quality because we can go back to the paper form at any point, if necessary, to verify data captured in the database. This of course will not eliminate errors made by the reader or in the articles themselves.
2. After the paper form is filled out by the Reader, the data are then entered into a Staging PDD database using the Web browser with PDD data entry forms.
3. Data are proofread and corrected by another individual using the PDD data browser in the Staging PDD.
4. The PDD Editorial Board members can review (i.e. browse) and validate new data in the Staging PDD by indicating whether or not they are acceptable for migration to the publicly accessible Working PDD.
5. At this point, the new data set is allowed to be migrated to the Working PDD. The actual transfer will be done on a scheduled basis with checkpointing of the Working PDD before and after each update, to ensure the integrity of the database by allowing restoration of a previous version if required. Finally, periodic audits of the Working PDD for data quality are also planned. Such careful checking during data entry will help ensure data quality and robustness of the data in the database.
Initial data entry to the Staging RDBMS can only be performed by authorized personnel on remote Internet hosts using a distributed data entry paradigm. Note that these preliminary data are stored in the Staging database - not the Working PDD database. This allows several groups (e.g. NIMH in Washington DC, CDC in Atlanta, NCI/IPS in Frederick MD, and other sites) to participate in the data collection, review, and entry effort. All data captured and entered have time-stamps and the reader's, data entry clerk's, proofreader's, and reviewer's names automatically attached to the RDBMS records for accounting purposes. Data entry using the SimpleForm program is discussed further in Appendix B. We can expect backlogs of work to occur in some stages of the pipeline because of personnel bottlenecks. This problem might be handled by distributing the load to different individuals, since the work can be done from anywhere on the Internet.
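As a simple sketch of the kind of record produced during distributed data entry, an insertion carrying the time-stamp and personnel names might look like the statement below. The prot_id, cond_id, source_id and fold_change columns appear in the published example query of Table 5; the audit columns (reader_name, entered_by, proofread_by, reviewed_by, entry_time) and all of the values are hypothetical illustrations, not the PDD's actual schema.

-- Hypothetical sketch: audit column names and values are illustrative only.
INSERT INTO dis_correl
  (prot_id, cond_id, source_id, fold_change,
   reader_name, entered_by, proofread_by, reviewed_by, entry_time)
VALUES
  (101, 42, 7, 3.0,
   'A. Reader', 'B. Clerk', 'C. Proofreader', 'D. Reviewer', '1995-03-01 14:30')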
Search results may be further analyzed and reported back to the user as lists of objects and their attributes, 2-DE gel maps, hypertext links to external federated WWW databases, and numerical tables or graphs derived from these data. These hypertext links to federated databases are generated by adding specific identifiers associated with the data in the PDD to the base Uniform Resource Locator. The federated database WWW servers then build a dynamic hypertext link to the PDD.
Summarizing the steps in this two stage process, users:
Since only the Web browser and an Internet connection are required to access the PDD, your direct on-line investigation of other aspects of the PDD system is encouraged. Figures 3, 4, 5, 6 illustrate some of the aspects of the query forms as seen by users when requesting a search for diseases that change as a function of fold changes of a set of proteins.
Most bibliographic databases such as MEDLINE are single-entity databases and are organized around proteins, sequences, genes, diseases, etc. Here, the data mined from the literature are organized around the concept of the quantitative relations of protein measurements to disease conditions.
The PDD represents a merging of several technologies that lets the user query for quantitative associations between protein patterns and diseases. The initial design of the PDD was specified as a stand-alone system for use on a workstation or high-end personal computer, with the database distributed on CD-ROM. However, as we observed the explosion of the WWW and the Web browser paradigm on the Internet, more interesting possibilities for a better distribution scheme became apparent. [The number of Internet hosts from 1974-84 grew from 0 to 1000; from 1984-92, from 1000 to 1,000,000; and from 1992-2000 it is expected to reach 100,000,000, as estimated by the Internet Society.] Use of the Web browser as a client to the WWW server freed us from the problems of supporting different versions of software on different system platforms (Unix, MS Windows, Macintosh). Nor would we have to deal with CD-ROMs (with the many problems they entail, including time, expense, distribution, inability to correct errors, and slow update cycles). We saw, along with many others, the utility of the WWW and Web browser client/server paradigm as a powerful data distribution medium for this type of biomedical data. We therefore redesigned the PDD to use the new Internet paradigm.
We selected the RDBMS data model for the PDD because the type of data we would use in protein disease correlations fits naturally into the relational model. Our decision to construct the initial underlying RDBMS SQL software engine in our laboratory enabled us to optimize it for the WWW environment (discussed in Appendix A), as well as to easily and inexpensively move it to more powerful UNIX multiprocessor platforms as computational demands require. We are in the process of converting the database engine to a commercial RDBMS to take advantage of a more robust environment.
Other related issues arose. First, since all access to the database is from the Web browser, for both users and data entry personnel, we had to implement security measures to protect the integrity of the PDD database from accidental or malicious use. Security is required at several levels and is discussed in Appendix C. These security measures limit user access to the data entry facility as well as prevent users from changing the hidden SQL in the generated forms from their Web browser window. Second, because of the dynamic nature of queries, we had to dynamically generate query and other forms based on the current content of the RDBMS. This was solved by developing an HTML template language used for expanding templates into HTML using data from the RDBMS as required (as discussed in Appendix D).
Because the WWW HTTP protocol is stateless, it is generally more difficult to build a graphical user interface that is as intuitively satisfying as one constructed entirely under X-Windows or Microsoft Windows. Therefore, solutions were implemented for the PDD that work, although they are not as interactive as we would like. As the HTTP/HTML protocols improve and these problems are addressed by the general WWW community, we will improve our interface to take advantage of these advances.
Use of concentration or activity fold change is a convenient means of making protein measurement unit differences transparent and allowing comparability between studies, but this comparability can be misleading. This "normalization" method uses only an estimate of central tendency (mean, median, or mode) to represent the distribution of the data, and ignores the role of dispersion (sample variance and other distributional characteristics such as skewness and kurtosis). The PDD does record sample ranges, estimated values, and standard deviations when reported, but this information is frequently not included in the articles. Because of these factors, comparisons between studies cannot be reliably made at the quantitative level unless details of the protocols used in the studies described in the papers are taken into account by the PDD user. For these reasons the user should always consult the original papers presented in the PDD query result.
It is also true that the database will likely be biased in favor of positive results, since negative findings tend to be underreported (Szklo, 1991; Dickersin, 1993; DuRant, 1994). The user of the PDD must consider all these factors, and bear in mind that use of the PDD may result in erroneous conclusions unless the data and experimental details provided in the reference publications are critically analyzed.
The goal of the PDD is to provide a powerful search tool for the protein disease biomedical literature, and to allow rapid identification of patterns of protein disease correlations. The degree of sophistication of a search depends on the level of detail of the quantitative data collected and on the software provided. The user must also be reminded that powerful tools do not automatically provide accurate results.
At the current stage of development, use of the PDD should be limited to exploratory use, i.e. rapid and flexible identification of relationships that should be further studied through direct review of the literature, and most certainly through standard laboratory and clinical research techniques.
We believe that this database is an early precursor of systems that will ultimately be used to help guide the diagnosis and treatment process through identification of relevant literature. Despite the limitations discussed above, the PDD provides a useful tool to the researcher of today, and a versatile testbed for more powerful systems of the future.
Although we have initially concentrated on the APP class of body fluid proteins, we do not limit the PDD to the APP and are pursuing other families of proteins. We do not require that proteins be found or identified in 2-DE gels to be used in the PDD, only that they can be assayed in human body fluids and that there be useful protein disease correlations based on a quality biomedical literature source.
It should be pointed out that although most proteins are not identified in 2-D PAGE gels, increasing numbers of proteins are being identified by various methods, including microsequencing, immunoblotting, amino acid composition, and mass spectrometry (Celis, 1992). These identified proteins are being included in Internet-accessible databases such as ExPASy (Appel, 1994) and others listed in Table 1. Because more and more 2-DE gel groups are running similar IPG gel protocols (Chiari, 1992) and incorporating better cross-linkers in the gel chemistry (Hochstrasser, 1988), it is becoming easier to compare gels of similar material between groups to identify many of the proteins.
In the PDD, data proceed from the Readers (who collect and record data), to Data Entry personnel, to Proof-readers, to the PDD Editorial Board, and finally to the PDD that is accessed by end users.
The fundamental philosophy here is that no database is completely error-free, and that while aggressive steps must be taken to prevent data errors from occurring, a full program designed to detect and correct errors must also be carried out. So, we are using two separate databases: the Staging database, where data are entered and verified, and the Working database, which provides data to the PDD system as seen by users. Data are transferred from the Staging to the Working database only after three key elements of data quality have been examined and verified.
The first of these, data entry accuracy, is established and maintained by performing a 100% verification of all data entry before its transfer to the Working database. This verification will be performed by Proof-readers who compare each record entered with the paper form from which it was transcribed. These Proof-readers, who are trained to access the Staging database through their Web browser, could be office staff associated with the Readers.
Next, interpretation and data recording quality are maintained by the Editorial Board and by Readers, who will conduct periodic quality audits of data held as paper form records. These audits involve comparison of the paper form with the source article by someone other than the original Reader, and will require resolution of inconsistencies or ambiguity of interpretation.
Finally, research article quality is maintained by the PDD Editorial Board, which will publish and enforce guidelines and a checklist of inclusion/exclusion criteria for papers used in the database. Compliance with these standards is verified before data from a paper are placed in the Working database.
In each of these cases, deviations from established criteria and inconsistencies between the article, the paper form, or the computer record will be reported to a central data manager/database administrator, who coordinates resolution of the error and correction of the database. The challenge is to create a distributed data entry system that preserves data quality without creating a bureaucracy. We have been, and will continue, developing computer tools to support this paradigm.
Other aspects of this data acquisition model will be reported in future papers. Also being refined are mechanisms for comparing 2-DE gel map images in the PDD database (or on other Web servers) with user gels on the users' own computers. This makes sense if the user gels were produced with similar protocols. It would help users locate putative protein identifications in their own gels, suggesting targeted experiments to confirm these identifications.
Although the initial RDBMS was adequate for testing the feasibility of the system, we are migrating the RDBMS to a commercial system such as Oracle. Although we expect this to be somewhat slower, it should be more robust and easier to maintain.
The Uniform Resource Locator (URL) used by a Web browser to access the PDD is: http://www-lecb.ncifcrf.gov/PDD. When optional registration is used, a user name and password are returned to the user for future access to the PDD. Users who choose not to register may simply enter the system by clicking on the Access Protein Disease Database button, and the PDD will assign them the user name demo with password demo for the duration of the session.
We solicit comments from users on errors, new data or types of data desired, and other suggestions. We may be contacted by E-mail at pdd@ncifcrf.gov.
2. Appel, R. D., Sanchez, J.-C., Bairoch, A., Golaz, O., Miu, M., Vargas, J. R. & Hochstrasser, D. F. (1993). SWISS-2DPAGE: A database of two-dimensional gel electrophoresis images. Electrophoresis, 14(11), 1232-1238.
3. Appel, R. D., Sanchez, J.-C., Bairoch, A., Golaz, O., Rivier, F., Pasquali, C., Hughes, G. J. & Hochstrasser, D. F. (1994) SWISS-2DPAGE database of two-dimensional polyacrylamide gel electrophoresis. Nucleic Acids Res., 22(17), 3581-3582.
4. Bairoch, A. & Boeckmann, B. (1994). The SWISS-PROT protein sequence data bank. Nucleic Acids Res., 22(17), 3578-3580.
5. Benson, D. A., Boguski, M., Lipman, D. J. & Ostell, J. (1994). GenBank. Nucleic Acids Res., 22(17), 3441-3444.
6. Berners-Lee, T. J., Cailliau, R., Luotonen, A., Nielsen, H. F. & Secret, A. (1994). World Wide Web. Comm. Assoc. Comp. Mach., 37(8), 76-82.
7. Celis J. E. (Ed). (1992). Special issue: Two-dimensional Gel Protein Databases. Electrophoresis, 13, 891-1062.
8. Celis J. E. (Ed). (1994). Special issue: Electrophoresis in Cancer Research. Electrophoresis, 15, 305-556.
9. Chiari, M. and Righetti, P. G. (1992). The immobiline family: "vacuum" to "plenum" chemistry. Electrophoresis, 13, 187-191.
10. Dickersin, K. & Min, Y. I. (1993). Publication bias: The problem that won't go away. Ann. NY Acad. Sci., 703, 135-148.
11. Dougherty, D., Koman, R. & Ferguson, P. (1994). The Mosaic Handbook for the X Window System, O'Reilly & Associates, Sebastopol, CA. ISBN 1-56592-094-5; ... for Microsoft Windows, ISBN 1-56592-095-3; ... for the Macintosh, ISBN 1-56592-096-1.
12. DuRant, R. H. (1994). Checklist for the evaluation of research articles. J. Adolescent Health, 15, 4-8.
13. Elmasri, R. & Navathe, S. B. (1994). Fundamentals of Database Systems, 2nd ed. Benjamin/Cummings Pub., NY.
14. Fasman, K.H., Cuticchia, A.J. & Kingsbury, D.T. (1994). The GDB Human Genome Data Base anno 1994. Nucleic Acids Res., 22(17), 3462-3469.
15. Hochstrasser, D., Harrington, M. G., Hochstrasser, A.C., Miller, M.J. and Merril, C.R. (1988). Methods for increasing the resolution of two-dimensional protein electrophoresis. Anal. Biochem., 173, 424-435.
16. Hochstrasser, D. & Tissot, J. (1993). Clinical Applications of two-dimensional gel electrophoresis. In Advances in Electrophoresis - Vol 6, A. Chrambach, M.J. Dunn, B.J. Radola (Eds), VCH Pub., NY, pp 267-375.
17. Krol, E. (1992). The Whole INTERNET User's Guide & Catalog. O'Reilly & Associates, Sebastopol, CA. ISBN 1-56592-025-2.
18. Mackiewicz, A., Kushner, I. & Baumann, H. (1993). Acute Phase Proteins - Molecular Biology, Biochemistry, and Clinical Applications. CRC Press: Boca Raton, Florida.
19. Mann, C. (1990). Meta-analysis in the breach. Science, 249, 476-480.
20. Mansfield, B. K. (ed). (1994). Future plans for databases. Human Genome News, 6(3), 4.
21. McCray, A. T. and Razi, A. (1995). The UMLS Knowledge Source Server. To appear in: Proceedings of MEDINFO '95, Vancouver, B.C., Canada, July 23-25, 1995. [Info on UMLS may be obtained from wth@nlm.nih.gov].
22. Merril, C., Goldstein, M., Myrick, J., Creed, J. & Lemkin, P. F. (1995). The Protein Disease Database of human body fluids: I. Information management needs and fundamental design considerations. Applied Theoretical Electrophoresis.
23. Pallini, V., Bini, L. & Hochstrasser, D. (1994). Proceedings: 2D electrophoresis: from protein maps to genomes. Univ. of Siena, Italy, Sept 5-7, 1994.
24. Pennington, S. (1994). 2-D protein gel electrophoresis: an old method with future potential. Trends Cell Biol., 4, 439-441.
25. Peto, R., Collins, R. and Gray, R. (1993). Large scale randomized evidence: Large, simple trials and overviews of trials. Ann. NY Acad. Sci., 703, 314-340.
26. Powe, N. R., Turner, J. A., Maklan, C. W. & Ersek, M. (1994). Alternative methods for formal literature review and meta-analysis in AHCPR patient outcomes research teams. Medical Care, 32(7), JS22-JS37.
27. Schatz, B. R. and Hardin, J. B. (1994). NCSA Mosaic and the World Wide Web: Global hypermedia protocols for the Internet. Science, 265, 895-901.
28. Szklo, M. (1991). Issues in publication and interpretation of research findings. J. Clin. Epidemiol., 44, Supl.I, 109S-113S.
1. memory-based relational tables, for speed and simplicity, rather than disk-cache based tables. Data tables are loaded into memory from disk files when the database starts up, and are checkpointed (written) back to disk files when SQL UPDATEs are performed,
2. a grammar based parser that supports an increasingly complete SQL subset,
3. query optimization support in query evaluation routines,
4. a multi-threaded architecture that can take advantage of multiprocessor computers for increased throughput,
5. SQL client/server access uses a TCP/IP socket interface with a simple message based protocol protected by time-stamped encryption,
6. special HTML relational table output formatting support for WWW servers,
7. runs on SUNOS and SOLARIS on SUN UNIX workstations, and
8. POSIX standard C as implementation language allowing porting to other hardware and operating systems.
This RDBMS design is modular and therefore easy to modify to support object-oriented operations for enhancing the PDD. Being memory based, it is also faster than disk-based cached RDBMSs (although the latter, in the form of commercial systems such as Oracle or Sybase, or the Berkeley Postgres client-server system, could be substituted). We are investigating migrating the RDBMS to a commercial system such as Oracle. Figure 7 shows a simplified version of the RDBMS entities used to describe the PDD data. The full relational schema may be accessed from the PDD server itself using the Web browser interface.
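The following is a simplified sketch of the core relations, reconstructed only from the table and column names that appear in the example query of Table 5; the data types and the comments are illustrative assumptions rather than the PDD's actual schema definition.

-- Simplified, hypothetical reconstruction; types are assumptions.
CREATE TABLE protein (
  prot_id     INTEGER,        -- unique protein identifier
  prot_name   VARCHAR(80)     -- protein name
);
CREATE TABLE condition (
  cond_id     INTEGER,        -- unique disease/condition identifier
  cond_name   VARCHAR(80)     -- e.g. 'Bladder Carcinoma N0'
);
CREATE TABLE source (
  source_id   INTEGER,        -- unique source identifier
  source_name VARCHAR(80)     -- source name
);
CREATE TABLE dis_correl (
  prot_id     INTEGER,        -- references protein.prot_id
  cond_id     INTEGER,        -- references condition.cond_id
  source_id   INTEGER,        -- references source.source_id
  fold_change FLOAT           -- mean of disease values / mean of normal values
);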
Query processing is a two stage process. The first stage selects the objects to be used in the query (e.g. a specific list of proteins, as well as options specifying how to compute the search and how to show the results). The CreateQuery CGI program then dynamically synthesizes a specific query form using these parameters to generate the second stage query. Figure 3 shows a first stage query form. Figure 4 shows the resulting second stage query form that was generated.
Submitting the second stage query invokes the EvalSQLquery CGI program, which expands and evaluates the resulting SQL query statements as described in Appendix D.
As discussed earlier under Data Capture, Readers fill out paper forms as they read the papers and extract the relevant information. These paper forms are next proof-read, reviewed by the PDD Editorial Board, and then entered using the SimpleForm CGI program, which processes the data by mapping them to the RDBMS schema. There are two steps in entering data: basic data (references, proteins, protein assays) and protein disease correlations. Correlations may only be entered after the basic data have been entered, since correlation data are selected from scrollable lists of basic data.
The data are entered in the simple Web browser fill-in forms either by typing in a text input window, cutting it from a scrollable review window and pasting it into the text input window, or selecting it from pull-down or scrollable menus (not shown). The SimpleForm program then translates the fill-in form data into the internal format required by the SQL schema used by the PDD to update the RDBMS tables and checkpoint them to disk files. Figure 9 shows the data mapping used in the PDD for entering data.
Selected object lists may be used for several purposes. Search results for these objects are appended to the relevant list (e.g. proteins found are added to the Selected Protein List). In addition, proteins selected by clicking on spots in a 2-DE gel map are also appended to the protein list. When generating queries, a selected object list may be used to specify what to use in a search (cf. Figure 3). Once a list exists, users may then review, edit or specify objects that are in the list. For the list of proteins, this may generate a 2-DE gel map in the specified sample domain (e.g. plasma, CSF, urine) showing the proteins currently selected. The list of independent variables mentioned above in querying the PDD database may be set from the selected object list. The list is thus used both for saving results and for requesting searches.
We have designed the system to support several classes of users, divided into three groups: system maintainers, PDD data maintainers and general users. The ability to change SQL statements is restricted to system maintainers. The ability to enter data is reserved for system and PDD data maintainers to protect database integrity. Furthermore, our goal of distributed maintenance of the database using Web browsers conflicted with the use of the less secure, but more readily available, versions of Web browsers (although this will change with time as more secure versions of these programs are offered). Therefore, it was necessary to implement a minimum level of security as part of our CGI programs for the PDD.
When the PDD is first accessed, it requires a Username, Password and remote Internet "IP" host address (a default of demo, demo is used if nothing is specified). It then repeatedly passes this and other user-specific parameters as time-encrypted information to later PDD operations. [Consequently, attempting to access intermediate PDD forms from your Web browser "hotlist" or "bookmarks" at a later time will not work.] These user-specific parameters are passed from one PDD template expansion to the next using HTML "hidden" encrypted variables. System and PDD data maintenance is further limited to PDD database maintainers by using protected templates limiting access to specific hosts and accounts. Protected directories are also employed where appropriate, using the standard HTTP server password authentication mechanism. As security in the WWW protocols improves, we will incorporate better methods to protect the database.
The $form-variable$ instances in the SQL statement are expanded to the corresponding values specified by the user in the query form. When the SQL statement is evaluated by the PDD, the form conditional expressions {+...}+ and {-...}- are used to include or omit that part of the SQL query depending on whether the conditional form-variable exists or not, respectively.
There are also cases where the Web browser form submitted to the PDD WWW server may supply multiple instances of a form variable. These multi-list variables, used in {&...}& or {|...}| expressions, are expanded to multiple relational clauses in the "WHERE" Boolean expression of the SQL query statement. Finally, the user may optionally view the resulting SQL query statement template. Only privileged users are allowed to modify SQL query statements in the forms, to protect the integrity of the PDD database. This is enforced by computing a time-stamped sequential checksum function of the SQL query statement in the BuildForm program and later comparing this checksum in the EvalSQLquery program when the SQL is returned for evaluation. We now illustrate the syntax of the template language by examples, with a complete example of a SQL query in Table 5.
Example of $form-variable$ expansion; the first line shows the template and the second its expanded result:
(condition.cond_name = '$cond_name$') AND
(condition.cond_name = 'Bladder Carcinoma') AND

Syntax of the conditional expressions; the {+...}+ form includes the enclosed SQL code if the form-variable exists, and the {-...}- form includes it if the form-variable does not exist:
{+ form-variable
...SQL code ...
}+
{- form-variable
...SQL code ...
}-

Example of {&...}& multi-list expansion; the template is followed by its expansion (here for three instances of $user_value$) into AND clauses:
{&
(user_value .EQ. '$user_value$') AND
}&
((user_value = "var1") AND
(user_value = "var2") AND
(user_value = "var3")
) AND

Example of {|...}| multi-list expansion; the template is followed by its expansion into OR clauses:
{|
(user_value .EQ. '$user_value$') AND
}|
((user_value = "var1") OR
(user_value = "var2") OR
(user_value = "var3")
) AND
Table 1. Some World Wide Web 2-D electrophoretic gel database servers. This list of 2-D electrophoresis WWW databases is maintained by our Laboratory and is available at URL http://www-lecb.ncifcrf.gov/EP/EPemail.html.
Table 2. Primary types of queries that could be answered with the PDD.
Table 3. Examples of queries that could be answered with the PDD. A relational database offers flexibility in posing a wide variety of types of queries.
Table 4. Criteria for data quality of biomedical literature. These criteria are applied to biomedical literature that is to be included in the PDD. Additional data (such as external genome database identifiers and other study statistics) may also be included if available. The criteria are used by the PDD Editorial Board and Readers to validate the quality of data to be entered.
Table 5. Example of evaluation of SQL query in template. This illustrates evaluation using the $...$ form-variable syntax. The initial SQL query statement is given in a). Let the HTML check-box settings returned to the PDD server be use_source_name="off" and use_fold_change="on", and with fold change range values between 2.0 and 5.0. Then, EvalSQLquery would expand the code to the SQL shown in b) that is now of the form required by the SQL server.
a) Initial SQL query statement
SELECT protein.prot_id, protein.prot_name,
dis_correl.fold_change, condition.cond_name
FROM protein, dis_correl, condition, source
WHERE
{+ use_source_name
(dis_correl.source_id = source.source_id) AND
(source.source_name = '$source_name$') AND
}+
((
{+ use_fold_change
(dis_correl.fold_change .GT. $fold_change_lower_bnd$) AND
(dis_correl.fold_change .LT. $fold_change_upper_bnd$) AND
}+
(condition.cond_name = '$cond_name$') AND
(condition.cond_id = dis_correl.cond_id) AND
(dis_correl.prot_id = protein.prot_id))
)
b) Expanded SQL query statement
SELECT protein.prot_id, protein.prot_name,
dis_correl.fold_change, condition.cond_name
FROM protein, dis_correl, condition, source
WHERE
((
(dis_correl.fold_change > 2.0) AND
(dis_correl.fold_change < 5.0) AND
(condition.cond_name = 'Bladder Carcinoma N0') AND
(condition.cond_id = dis_correl.cond_id) AND
(dis_correl.prot_id = protein.prot_id))
)