THE ONLINE ARCHIVE OF CALIFORNIA PROJECT
A PROTOTYPE UNION DATABASE OF ENCODED ARCHIVAL FINDING AIDS

Pilot Project Proposal Presented to the
UC Digital Library Executive Working Group
January 1996


Proposal Originally written for The UC-EAD Project

Description: Encoded Archival Description (EAD), developed by an R&D team at UC Berkeley, is emerging as the national standard for encoding of archival finding aids in electronic format. This project will develop an implementation toolkit to facilitate and standardize EAD implementation throughout UC libraries; build a prototype union database; and create a broad-based UC constituency to test, evaluate, and establish EAD as the foundation of the UC digital library for archival materials.

Project Coordinator:
Charlotte B. Brown
Acting Head of Special Collections
UCLA Library, Box 951575
Los Angeles, CA 90095-1575
Phone: 310/825-7265
Fax: 310/206-1864
Email: ecz5cbb@mvs.oac.ucla.edu

Project Administrator:
Brian Schottlaender
Associate University Librarian
Collections & Technical Services
UCLA Library, Box 951575
Los Angeles, CA 90095-1575
Phone: 310/825-1202
Fax: 310/206-4109
Email: ecz5bri@mvs.oac.ucla.edu

Institutional participants: The Special Collections Units and University Archives of the nine campuses of the University of California System, the UCLA Department of Library and Information Science, and limited participation by selected non-UC libraries.

Collaborating faculty/librarians (UC EAD Governing Board):
UCB         Jack Von Euw
UCD         John Skarstad
UCI         Jackie Dooley
UCLA        Charlotte B. Brown
UCR         Sid Berger
UCSB        David Tambo
UCSC        Rita Bottoms
UCSD        Lynda Claassen
UCSF        Robin Chandler
UCLA DLIS   Anne Gilliland-Swetland

Period of project: July 1, 1996 - June 30, 1998

            Funds requested    Cost sharing
Year 1:     $357,545           $250,712
Year 2:     $237,374           $259,574
Total:      $594,919           $510,286

Project Categories: Digitization and Scholarly Communication.

Constituencies to be served: UC students, faculty, and staff; citizens of California, including K-12 students and teachers; researchers and other information seekers worldwide.

Need to be filled or problem to be solved: Standardized encoding of finding aids for archival primary source materials held at all UC campuses, coordination of centralized Internet access to this data, and linkages between finding aids and other digital information.

Existing activities that will be built on or extended by the proposed project, if applicable: The Berkeley Finding Aid Project, The NEH California Heritage Digital Image Access Project, The Bentley Historical Library Fellowship Project on Encoded Archival Description, The Berkeley Finding Aid Conference, The UCLA Planning Conference on UC Implementation of EAD, and the Library of Congress National Digital Library Program Meeting on EAD.

Other sources of support for the project, if applicable: Other than extensive cost sharing from each of the nine UC campuses, no direct sources of additional funding have yet been identified.

Table of Contents

0. BUDGET

1. PURPOSE

1.A.     Anticipated outcome of project
1.B.     Services to be provided
1.C.     Results of completion; Scaling up
1.C.1.      Governance
1.C.2.      Budget justification

2. BACKGROUND AND SIGNIFICANCE

2.A.     Needs to be met and audience served
2.B.     The state of the art
2.B.1.      Brief history of the archival access problem

3. PRELIMINARY EFFORTS

3.1.     The Berkeley Finding Aid Project
3.2.     Encoded Archival Description and the Bentley Fellowship Program
3.3.     The California Heritage Digital Image Access Project
3.4.     The Next Step: The UC EAD Union Database, or "Virtual Archive"
3.5.     Related work
3.5.1.      National Inventory of Documentary Sources in the United States
3.5.2.      National Union Catalog of Manuscript Collections
3.5.3.      National Digital Library Initiative
3.5.4.      Council on Library Resources grant
3.5.5.      Research Libraries Group
3.5.6.      Southeastern Library Network

4. PROJECT PLAN

Step I   Training workshop & production support
Step II  Creation of USMARC collection-level records and encoding of finding aids
Step III Assembling data
Step IV  Creation of virtual collections within the virtual archive
Step V   Sustaining continuing operations

5. EVALUATION

6. PROJECT TIMELINE

1. PURPOSE.

This proposal seeks funding for a two-year pilot project to develop a UC-wide prototype union database of 30,000 pages of encoded archival finding aid data. This database will serve as the foundation for development of a full-scale digital archive for the University of California System (UC) available via the Internet to diverse user communities. The timing is opportune, given the growing use of the archival and special collections of the nine UC campuses, the strong focus on special collections materials in digital library development throughout the world, and especially the emergence from within UC of a national (potentially international) standard to enable information interchange and long-term maintenance of finding aid data.

Finding aids describe in considerable detail the content of archival collections of manuscripts, photographs and other pictorial materials, moving images, and other collections of unique primary source materials that comprise the basis and inspiration for new research in the arts, humanities and social sciences. UC libraries are the repositories for an extraordinary body of such materials, including the papers of major historical and contemporary California political figures, as well as internationally prominent literary authors and luminaries in every conceivable field; the records of major organizations that have affected the course of society; and collections of images documenting the past and present of communities and activities throughout California and the world.

The emerging encoding standard for finding aids, known as Encoded Archival Description (EAD), consists of a data model and SGML Document Type Definition (DTD) for archival finding aids; EAD was developed under the direction of Daniel Pitti at the UC Berkeley Library. Now in its final developmental stages, EAD has been widely heralded throughout the U.S. archival community as a critical development and the state of the art in archival automation. It has earned the strong support of several key national agencies, including the Society of American Archivists, the Commission on Preservation and Access, and the Library of Congress (LC). LC considers EAD a key component in implementation of the National Digital Library Initiative (see Section 3.5), so much so that the Library has agreed to take on long-term maintenance of the DTD (similar to its role for the USMARC format) and has contracted with an SGML consulting firm to develop the tag library. Since EAD was developed within UC, it is appropriate that UC also take the lead in its implementation in order to demonstrate EAD's potential for structuring archival information in digital library settings and to promote the richness of UC archival collections.

At the Planning Conference on UC Implementation of EAD held at UCLA on September 28-29, 1995, the UC Archivists' Council, Heads of Special Collections, several systems librarians, and representatives from the UC Division of Library Automation (DLA), Stanford, USC, and the Getty and Huntington libraries unanimously agreed that a pilot project to prototype a union database of finding aids using EAD would play a critical role in facilitating rapid implementation of the new standard by research libraries throughout California.

The non-UC libraries are included in the project's training and information sharing aspects, but not in the funding of equipment and staff. Additional research libraries in California could also be accommodated in these aspects of the project, further extending the project's influence in building an immense Statewide online resource. The project team's expertise will be broadened by the participation of Anne Gilliland-Swetland, Assistant Professor in UCLA's Department of Library and Information Science, who will coordinate the documentation survey and evaluation phases of the project. While a specific role has not been defined for DLA, it is likely that several areas of common interest between DLA and the UC EAD Project will emerge that can be developed as the project progresses.

1.A.
Describe the objectives of the proposed project as specifically as possible. What is the anticipated product or outcome?

The project's overall goal is the development of an implementation toolkit and prototype union database, providing a testbed to evaluate the effectiveness of Encoded Archival Description, the emerging standard for electronic archival finding aids, as an encoding scheme for a union database providing integrated access to archival holdings from all UC Special Collections and University Archives units. To achieve this goal, the project will pursue the following specific objectives:

  • Devise an implementation toolkit and training package for EAD to facilitate System-wide conversion of existing archival finding aids and future encoding of new finding aids as they are produced;
  • Devise policies and procedures for decentralized creation and maintenance of encoded finding aids in order to ensure their ongoing integration into the virtual archive as new finding aids are created and new collections are acquired;
  • Devise effective mechanisms to link and integrate related collections contributed by different institutions so that they can be navigated as a single archive;
  • Create a single site for students and researchers to search for detailed information about UC archival primary source materials, regardless of where the originals reside;
  • Create a standards-based, scalable technical architecture that supports the "virtual archive" model.

1.B.
What service will be provided? To whom?

By creating a prototype union database of encoded finding aids, the UC EAD Project will provide UC faculty, students, researchers, and staff with Internet access to descriptive information about its extensive archival collections. In doing so, the project will address two serious researcher dilemmas: the geographic distribution of unique primary source collections, and the lack of standardization in the written documentation that describes them. The UC EAD database will be, in effect, a "virtual archive", integrating hundreds, perhaps thousands, of archival finding aids for collections held Statewide.

Given the rapid growth of Internet access, the client group served by the participants is extremely broad, including: the entire UC student, faculty, and staff community; scholarly researchers around the world; other students of all ages and educational levels; teachers and faculty from all types of schools and colleges; representatives of for-profit businesses; and individual citizens pursuing their personal interests. All will have ready access to descriptive information previously available only to those able to travel to collection sites. One of the most exciting aspects of this project is that it will open to UC undergraduates a domain previously known principally to advanced researchers. The work of George Landow of Brown University and others has shown the powerful effect of electronic access to primary source materials on undergraduates' ability to learn and write about humanities subjects. UC EAD will bring the archives and special collections of the entire University into classrooms throughout the State; the database of archival finding aids will be a potent research tool serving a vital function in undergraduate instruction.

UC already offers programs that help make its archival materials more available to the State's populace, and the proposed project will strengthen them. For example, for the past three years, UCLA's special collections staff have been assisting teachers in developing enrichment units for K-12 studies and California Frameworks. Each year, in coordination with the California History-Social Science Project and with the Summer Institute of the UCLA University Elementary School, approximately 50 teachers learn to use finding aids to identify and consult UCLA's special collections. In the north, the cooperative UC Berkeley Library and California State Library InFoPeople Direct Access Internet Project assists public libraries of the State in gaining direct Internet access.

1.C.
Will the result be completion of a one-time task or an ongoing operation? If the latter, how will it be sustained after project funding ends? If the project is successful, can it be scaled up or replicated to serve a larger constituency?

This project will establish the foundation for creation of a UC digital archive of primary source materials, which ultimately can provide access to all of the archival, manuscript, pictorial, and other special collections held by UC libraries. As such, a primary objective of the project is to provide the necessary training, establish the essential procedures and policies, and develop the requisite administrative commitments from all UC libraries so that the project's prototype union database can serve as the basis of UC's digital archival access system. The project is designed to demonstrate that a standard originating in two successful UC Berkeley research projects can be scaled up so that universal access can be provided in a decentralized production environment.

Because the data being generated in this project is in a standard format (SGML), its durability and portability are ensured regardless of changes in platforms. Though the union database will begin its life centralized on one server, nothing about the technical design of the project will preclude the decentralized maintenance of its resources when technology makes that possible. When the client that searches the MELVYL online catalog is ANSI/NISO Z39.50 compliant, and when URN/URL resolvers have been developed, it will be possible to link USMARC cataloging records to SGML finding aids for UC collections and to link both, in turn, to digital information in a multitude of formats, including digital surrogates of primary source materials, encoded electronic texts of scholarly interpretation, and digital data in other formats.

1.C.1. Governance.

To achieve the project's goal of ensuring UC-wide participation, as well as ongoing development and maintenance of this important resource, each campus has designated one representative from Special Collections and/or University Archives to serve on the project's Governing Board. The Board will also include a representative from the faculty of the UCLA Department of Library and Information Science (DLIS). This group will be responsible for policy decisions once the project is underway. Additional staff on each campus will assist, further contributing to the cost-sharing totals (designated in the budget as Campus Representatives, 1-3 FTE per campus @ 25% each; number of staff and percentage of time will vary from campus to campus). The UCLA Library has offered to serve as the project's "home" campus, hosting the Project Manager and other project administrators. UCLA DLIS will coordinate the research and evaluation aspects of the project.

1.C.2. Budget Justification.

The UC EAD Project will develop the infrastructure of staff expertise and basic equipment necessary for UC- and California-wide implementation of Encoded Archival Description on an ongoing basis. The project has intense support from all UC Special Collections and University Archives units, but an infusion of resources is required in order to begin implementation. Although the costs are greater than those associated with simple cataloging or data conversion projects, the project will be a major investment in implementation of a new technology enabling vastly improved access to some of UC's most important library holdings. The same technology, once thoroughly tested in this project, can be readily used to provide the same kind of flexible and universal access to the myriad specialized research collections that currently are held by the University's academic departments, independent research units, and museums. The capability to create links among collections over the network and produce "virtual archives" will be extensible to these disparate and currently relatively inaccessible, but immeasurably valuable, research collections.

The five project staff members to be hired are absolutely essential, given cutbacks in the UC libraries' staffing in recent years. Also, considering the current state of UC library budgets, even relatively minimal costs such as the ca. $5,000 necessary for purchase of basic hardware and software for one workstation, or travel to two meetings, could inhibit a campus's participation unless funding is received. A two-year project is highly advisable given the start-up time necessary for implementation of a new and sophisticated technology.

The $93,750 included (under Other Costs) for SGML mark-up and scanning/OCR will be distributed to the campuses who are doing their own mark-up and/or having manual finding aids scanned or re-keyed. It is estimated that 50% of the 30,000 pages to be encoded during the project are presently in manual form. The five project staffers will encode 15,000 pages, and the campuses will encode the other 15,000 pages, hence only the cost of the latter appears explicitly in the budget.

Cost sharing totals derive primarily from the need for Berkeley to hire and provide space and other essentials to the five project staff members, as well as for the inevitable additional administrative and equipment requirements that will have to be met by each participating campus and unit.
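
The budget figures quoted in this proposal can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes that the full $93,750 mark-up and scanning/OCR line is spread evenly over the 15,000 campus-encoded pages, which yields an implied per-page figure that the proposal itself does not state. All variable names are hypothetical.

```python
# Figures quoted in the proposal (US dollars and pages).
requested = {"Year 1": 357_545, "Year 2": 237_374}
cost_sharing = {"Year 1": 250_712, "Year 2": 259_574}

total_pages = 30_000     # pages of finding aids to be encoded
manual_share = 0.50      # estimated share currently in manual form
staff_pages = 15_000     # encoded by the five project staffers
campus_pages = 15_000    # encoded by the campuses
markup_budget = 93_750   # SGML mark-up and scanning/OCR (Other Costs)

total_requested = sum(requested.values())
total_cost_sharing = sum(cost_sharing.values())
manual_pages = int(total_pages * manual_share)

# ASSUMPTION (not stated in the proposal): the whole mark-up line item is
# divided evenly across the campus-encoded pages.
cost_per_campus_page = markup_budget / campus_pages

print(f"Total requested:    ${total_requested:,}")
print(f"Total cost sharing: ${total_cost_sharing:,}")
print(f"Pages in manual form (estimate): {manual_pages:,}")
print(f"Implied cost per campus-encoded page: ${cost_per_campus_page:.2f}")
```

Under that assumption the two-year totals agree with the cover sheet, and the staff and campus page counts together account for the full 30,000 pages.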

2. BACKGROUND AND SIGNIFICANCE.

2.A.
Describe the need that the proposed project will meet and assess its urgency: Who will use it? What will they be able to do that they cannot do now? Be specific.

The UC EAD Project seeks to meet the need of UC students, faculty, staff, and researchers worldwide for ready, economical access to the University's vast holdings of primary source materials. At present, information seekers must travel to nine separate campuses to gain access to the information that will now be available in one Internet-accessible database. The project also seeks to meet the need of library and archives staff for training and documentation to enable them to create and maintain electronic finding aids on an ongoing basis. The UC EAD database will also provide UC librarians and archivists with detailed information about the holdings of other UC Special Collections, thus helping them better determine how to spend scarce collection development funds to most effectively serve all UC faculty and students.

During the last decade, widespread implementation of the USMARC Archives and Manuscripts Control (AMC) format began to address the problem of geographic distribution of collections by facilitating creation of nearly 700,000 collection-level summary descriptions in the nation's bibliographic utilities. However, the finding aids upon which the collection-level records are based and which serve as the primary mechanism for describing, controlling, and providing access to archival collections, have remained largely in print form, accessible only onsite at the owning repository. Because of USMARC limitations, collection-level bibliographic records provide a brief and therefore limited description; much of the content of any given collection is described only in the full-text finding aids upon which the USMARC record is based. The next logical step to facilitate locating and identifying primary source materials is the creation of an online union database of finding aids accessible over the Internet. The need to take such a step is nationally recognized in the archival community.

2.B.
Discuss the current state of the art for which the project is proposed. Will the project build on or add to existing work, whether in or outside the University of California? If the project will extend existing activities, describe how leverage will be accomplished: how will the result differ from current practice? How will it fill recognized gaps in service or technology? Projects may establish standards for fields that are now being developed in heterogeneous formats. If your project may play this role, describe the current situation and why the proposed standards are likely to be adopted. Projects may serve as test cases for new concepts. If this is true for your project, describe the concept and how your pilot might be scaled up if it is successful. What might be the pedagogical, service, technological, and institutional consequences of and requirements for large-scale implementation?

The UC EAD Project will build on recent efforts at both the national level and within UC. In particular, it will utilize the technological infrastructure developed by the UC Berkeley Library in its Department of Education-funded Berkeley Finding Aid Project (BFAP) and its NEH-funded California Heritage Digital Image Access Project, in which the data model, the SGML DTD, the database, and the capability to create the "virtual archives" that will form the basis of this project were developed. Using this technology, the UC EAD Project will build a large union database of hundreds of finding aids comprising thousands of pages of text providing detailed description and access for millions of manuscripts and other primary sources in the University's vast and dispersed research collections. It will provide students and researchers with greatly improved bibliographic and physical access to these unique collections. The database will demonstrate, for the first time, the technical feasibility of unified intellectual access to geographically distributed collections through the creation of a centralized database of catalog records and finding aids that index and describe these materials.

2.B.1. Brief History of the Archival Access Problem.

One of the most vexing problems confronting information seekers is the variety and number of sites that they must search to discover whether the material they need has been saved in an archive, and if so, where it is located and how they can gain access to it. This particularly plagues those who use unique primary source materials, and it has long been recognized that integrating the catalog records for distributed collections into one catalog greatly simplifies the researcher's task.

In 1951, the National Historical Publications and Records Commission (NHPRC) began to compile a union register of archive and manuscript collections held by the nation's repositories. The objective was to provide central, intellectual access to primary source materials, focusing on collection-level description rather than on sub-collection or item-level descriptions, since individually cataloging each item in large collections is prohibitively expensive. The Bancroft Library's Sierra Club archives of over one million items, for example, illustrate this basic economic problem well.

After gathering collection-level data from 1,300 repositories nationwide in the 1950's, NHPRC published A Guide to Archives and Manuscripts in the United States in 1961. The Commission decided to revise the directory in 1974, but after assessing the situation found that the number of repositories and records had grown so dramatically that compiling even collection-level descriptions would be prohibitively expensive. The Commission decided instead to focus on repository-level information, a much coarser level of access. Despite this shift, NHPRC continued to envision a “national collection-level data base on archives and manuscripts,” but for various reasons, the idea was abandoned in 1982.

In 1951, the same year that NHPRC began planning the Directory, the Library of Congress began to plan the National Union Catalog of Manuscript Collections (NUCMC). NUCMC was intended to be for manuscripts what the National Union Catalog (NUC) was for printed works. Winston Tabb, Assistant Librarian of Congress for Collections Services, described a major factor in the decision to develop NUCMC: “Scholars, particularly in the field of American history, were instrumental in urging the establishment of a center for locating, recording, and publicizing the holdings of manuscript collections available for research. They had long been frustrated by the difficulties of locating specific manuscripts and even of identifying repositories possibly containing primary-source materials.”

In 1958, the Library began to implement its plans with a grant from the Council on Library Resources. The Manuscript Division was established in the Library's Descriptive Cataloging Division and given responsibility for initiating and maintaining the NUCMC program. The catalog would provide collection-level description for collections held in U.S. repositories, and, for particularly important manuscripts, item-level descriptions. Like the NUC, the catalog would consist of catalog cards published in book form and available by subscription. The first volume of NUCMC was published in 1962. In 1994, the Library announced that volume 29 (1993) would be the last printed volume of NUCMC, given that powerful networked computer systems can now deliver the union list much more effectively.

The emergence of nationally networked computer databases has provided us with the means to build union catalogs that can be made available everywhere, greatly facilitating the compiling of union databases. Over the course of the 1980's, the OCLC and RLG databases have emerged as de facto union catalogs not only to the nation's bibliographic holdings but also to a good share of the world's. Eleven years ago, the records in the national utilities almost exclusively represented published print materials; the primary source materials in the nation's archives and manuscript repositories were not represented. This changed, thanks to the National Information Systems Task Force (NISTF) of the Society of American Archivists, which paved the way for the development of the USMARC Archives and Manuscripts Control (AMC) format, making it feasible for archives and manuscript repositories to provide brief, synoptic bibliographic records for collections. To facilitate consistent preparation of the data content of AMC records, Steven L. Hensen of the Library of Congress developed a set of cataloging rules entitled Archives, Personal Papers, and Manuscripts (APPM). Coupled with the AMC format, APPM has enabled contribution of over 400,000 collection-level records to the Research Libraries Group's RLIN database and nearly 260,000 to the OCLC database. Through the utilities, scholars now have access to a growing accumulation of brief descriptions of the nation's archive and manuscript collections.

As revolutionary as this accomplishment has been, however, it represents but the first step in helping scholars to easily locate the primary source materials they need for study and research. The generalized descriptions in AMC records inform researchers where a collection exists, but to determine what materials are actually in a collection, they need access to the detailed inventories, registers, indexes, and guides that are generally called finding aids.

Hensen has described the inadequacy of an archival access system that depends on MARC AMC records alone: “The MARC/AACR2 catalog record ... can serve as simply one level of descriptive detail in a system of hierarchical pointers that leads inexorably from index terms to full text.” Hensen has further pointed out the ready way to overcome this limitation by automating all levels of the archival access hierarchy. “In such a system the user would ... [move] from subject headings or index terms to full MARC cataloging records to ever increasing levels of detail ... faced with such additional options as moving from the catalog record to abstracts, other indexes, tables of contents, etc. and eventually to digitized images of the collection material itself--all without leaving the computer terminal. This is a particularly happy development for archivists who, with their guides, finding aids, and records series descriptions, already have in place the sort of hierarchically intermediate descriptions that would logically sit in such a system between catalog records and full text ... It is from this idea that the concept of the 'virtual library' has sprung.”

AMC collection-level records and finding aids will work together in the hierarchical archival access and navigation model. The AMC record occupies the top position in the model and leads, through a note, to the detailed collection information in the finding aid; the finding aid, in turn, leads to the materials in the collection. The descriptive information in the collection-level record is derived from the collection's finding aid, but only a very small portion of the information contained in the finding aid finds its way into the bibliographic record. The archive of National Municipal League Records (1890-1991, bulk dates 1929-1988) in the Auraria Library in Denver, Colorado, serves as a dramatic example of the summary nature of collection-level cataloging: the finding aid comprises over 1,400 pages and 30,000 personal names, but the AMC record for this collection is only two pages long, with nine personal name access points!

Clearly, the next logical step in the evolution of electronic access to primary source materials is a union database of finding aids accessible over the Internet, anywhere in the world, at any time of the day or night. Moreover, adoption of standards for encoding and navigating finding aids, creation of inter-institutional protocols for cooperative database construction, and demonstration of the technical capability to create, in a cost-effective manner, a union database of finding aids are necessary preconditions for creating an operational union database of digital surrogates, including digital preservation copies, of primary source materials. Fortunately, the technological and intellectual foundation for such a database has been built in earlier research projects at Berkeley.

3. PRELIMINARY EFFORTS.

Describe how and by whom the feasibility and effectiveness of the proposed project have been demonstrated. Have you or your colleagues conducted pilot or test implementations? What project planning activities have you already conducted? Have you made arrangements to collaborate with specific individuals and institutions? Are there prerequisites that must be in place for the project to be feasible? If so, how will they be provided for? Refer to any planning or evaluative studies that discuss early phases of the proposed project or analyze similar efforts at other institutions.

3.1.
The Berkeley Finding Aid Project.

Recognizing that the archival finding aid, rendered in a standard, platform-independent electronic form, would add a key layer of access and control in the complex environment of the Internet, a team of researchers at Berkeley led by Daniel Pitti embarked on the Berkeley Finding Aid Project (BFAP) in 1993. It was a propitious time to undertake a project to develop a prototype standard for encoding finding aids: Standard Generalized Markup Language (SGML), an international standard since 1987, had become firmly established in a wide array of government and private enterprises, and a robust and rapidly growing market of software tools to support it was emerging. Politically, the archival community had reduced its resistance to technology as a result of positive experience with the AMC format and was ready to embrace the idea of an encoding standard for finding aids.

An encoding scheme assumes content to be structured, and this presented the Berkeley researchers with a fundamental problem: no finding aid content standard comparable to APPM exists, and so developing a prototype data model defining the structure and content of finding aids had to be the first step. Pitti identified the requirements that would need to be satisfied by any technique used to deliver enhanced archival description to network users. These include the ability to present the descriptive information found in archival finding aids, to preserve the hierarchical relationships that exist between levels of descriptive detail, to represent descriptive information that is inherited by one hierarchical level from another, to navigate within a hierarchical information architecture, and to conduct element-specific indexing and retrieval. Working with representative finding aids contributed by the archive and library community, Pitti identified both the basic elements and the logical relationships among the elements, developing a prototype data model.
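
The functional requirements listed above (hierarchical levels, inherited description, navigation, and element-specific retrieval) can be illustrated with a small data structure. The following is a minimal sketch in Python, not the BFAP data model. The class name, element names, and sample collection are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Unit:
    """One level of description in a finding aid (collection, series, folder)."""

    level: str
    title: str
    elements: dict[str, str] = field(default_factory=dict)  # stated at this level
    children: list[Unit] = field(default_factory=list)
    parent: Optional[Unit] = field(default=None, repr=False)

    def add(self, child: Unit) -> Unit:
        """Attach a subordinate unit, preserving the hierarchical relationship."""
        child.parent = self
        self.children.append(child)
        return child

    def get(self, name: str) -> Optional[str]:
        """Return an element, inheriting from the nearest ancestor if unset here."""
        node: Optional[Unit] = self
        while node is not None:
            if name in node.elements:
                return node.elements[name]
            node = node.parent
        return None

    def walk(self) -> Iterator[Unit]:
        """Visit this unit and all units below it, top down."""
        yield self
        for child in self.children:
            yield from child.walk()

    def path(self) -> str:
        """Navigation aid: titles from the top of the hierarchy to this unit."""
        titles = []
        node: Optional[Unit] = self
        while node is not None:
            titles.append(node.title)
            node = node.parent
        return " > ".join(reversed(titles))


def find(root: Unit, name: str, text: str) -> list[Unit]:
    """Element-specific retrieval: match text only within the named element."""
    needle = text.lower()
    return [u for u in root.walk() if needle in u.elements.get(name, "").lower()]


# Hypothetical sample data, invented for this sketch.
papers = Unit("collection", "Example Family Papers",
              {"repository": "Example Library", "dates": "1890-1950"})
letters = papers.add(Unit("series", "Correspondence", {"dates": "1895-1940"}))
folder = letters.add(Unit("folder", "Letters, 1901-1905",
                          {"scope": "Letters concerning water rights."}))

print(folder.path())
print(folder.get("repository"))  # inherited from the collection level
print(folder.get("dates"))       # inherited from the series level
print([u.title for u in find(papers, "scope", "water")])
```

In this sketch a folder that records no repository of its own still reports the repository stated at the collection level, which is the kind of inherited description the requirements call for.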

Candidate techniques considered by BFAP included Gopher presentation of ASCII data, HTML (hypertext markup language) tagging, USMARC tagging, and use of SGML (Standard Generalized Markup Language). The last of these, an international standard (ISO 8879), emerged as being able to meet all of the functional requirements of archival finding aids, and as being supported by a large and growing number of software products that run on a variety of platforms. Based on these results, Pitti and his colleagues at Berkeley elected to test the use of SGML in encoding archival finding aids.

SGML is a set of rules for defining and expressing the logical structure of documents, and thereby for enabling software products to control the searching, retrieval, and structured display of those documents. The rules are applied in the form of codes (or tags) that can be embedded in an electronic document to identify and establish relationships among component parts. Because consistent tagging of like documents is key to successful electronic processing of them, SGML encourages such consistency by introducing the concept of document type definition (DTD). A DTD prescribes the ordered set of SGML tags available for encoding the parts of each example in a class of similar documents. Archival finding aids that share similar parts and structure form a class of documents for which a DTD can be developed.

The encoding scheme that BFAP produced is in the form of an SGML DTD and is based on the content model developed by Pitti. The March 1995 version of the BFAP DTD defined a class of documents that, in general, consist of an optional title page, the description of a unit of archival material, and optional back matter. A title page conforming to the draft DTD could comprise any of a number of taggable elements, such as repository or finding aid type. A DTD-conformant unit description could comprise a brief description of the unit (incorporating taggable elements analogous to those of a MARC catalog record), a longer narrative description of the unit and any segregatable parts (incorporating such taggable elements as title, dates, and scope and content), and formatted container lists. As the BFAP DTD took shape, it was tested in the encoding of electronic finding aids, and by March 1995, a critical mass of data had been encoded.
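The top-level document structure described above (an optional title page, a required unit description, optional back matter, in that order) is the kind of rule a DTD lets software check automatically. The following Python sketch imitates that check. It is not the BFAP DTD, and the part names are hypothetical placeholders rather than actual element names.

```python
from typing import Sequence, Tuple

# (part name, required?) in the order the parts must appear.
# The names are placeholders; the real DTD's element names are not given here.
CONTENT_MODEL: Tuple[Tuple[str, bool], ...] = (
    ("titlepage", False),
    ("unit_description", True),
    ("backmatter", False),
)


def conforms(parts: Sequence[str]) -> bool:
    """Return True if `parts` follows the ordered content model."""
    i = 0
    for name, required in CONTENT_MODEL:
        if i < len(parts) and parts[i] == name:
            i += 1          # expected part is present; move on
        elif required:
            return False    # a mandatory part is missing or out of order
    return i == len(parts)  # nothing unexpected may be left over
```

Under these assumptions, `conforms(["unit_description"])` and `conforms(["titlepage", "unit_description", "backmatter"])` hold, while a document lacking the unit description, or placing back matter first, is rejected. An SGML parser validating against a real DTD applies the same idea at every level of the document, not only at the top.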

To test the DTD, project staff developed a prototype database to evaluate search, navigation, and display of electronic finding aids. Project planners decided that the UNIX environment was the best one for sharing the database on the Internet at this time (but developed the prototype so that it can be made available using client/server or web technology in the near future). For browsing and online presentation, Electronic Book Technologies' DynaText, an SGML-based publishing system, was selected as the best available software. After evaluating nine products, project staff selected Hummingbird's Exceed X server software to allow PC clients to connect to the UNIX server over the campus TCP/IP-based local area network to search and display finding aids in the DynaText database, which has been mounted on the library's UNIX-based server, a Digital Equipment Corporation DECstation 5000/240. An XV client that serves as image browser for images linked to the SGML-encoded finding aids is also stored on the DEC. For SGML authoring and conversion and DTD development, the excellent suite of SGML tools produced by ArborText was chosen.

The Berkeley Finding Aid Project has been deemed an unqualified success, having achieved its primary goal of driving the development of a new standard, with the Library of Congress and the Society of American Archivists rapidly moving to complete the standards process. Explaining BFAP's success in his paper, "NISTF II: The Berkeley Finding Aid Project and New Paradigms of Archival Description and Access," delivered at the Berkeley Finding Aid Conference on April 4th, 1995 (text available at Duke's World Wide Web Site: http://odyssey.lib.duke.edu/news/bfap.html), Steven Hensen has stated that the work of BFAP not only represents a logical culmination of previous work to develop standards within the archival profession but also, at the same time, moves the archival world out to the leading edge of general information access research. Hensen has also argued that BFAP is a direct descendant of NISTF, which resulted in the development of the AMC format as a data standard for collection-level cataloging records and of Hensen's own APPM, the standard cataloging rules used by archivists.

3.2. Encoded Archival Description (EAD) and the Bentley Fellowship Program.

Two developments early in 1995 served to transfer ownership of the work of BFAP to the national archive and library communities, where the finding aid standards development effort rightfully belongs. In February, the Bentley Historical Library awarded a fellowship to Pitti, a team of archivists, and an SGML expert to evaluate formally the finding aid DTD and data model, and, based on the evaluation, to develop a revised DTD and data model for formal submission as community standards. In April, the Commission on Preservation and Access (CPA) sponsored a conference at Berkeley to gather representatives of the archives and library communities for a preliminary evaluation of the results of BFAP and to make recommendations concerning next steps. The consensus was that Berkeley had achieved its limited objective of demonstrating the desirability and feasibility of an encoding standard for finding aids. While endorsing the Bentley research team, conference attendees also recommended that interested institutions begin testing the DTD, using their experience gained thereby to inform the ongoing standards development process.

When it met in Ann Arbor in July, the Bentley Team agreed to collaborate in the production of: 1) finding aid encoding standard design principles; 2) a revised finding aid data model; 3) a revised finding aid document type definition; 4) finding aid encoding guidelines and examples; and 5) an article describing the Team's understanding of the structure and content of finding aids. Early agreement was reached on the principles that should underlie design of an encoding standard, and the structure of finding aids as documents to be encoded was then analyzed in detail. Next, the Team evaluated the encoded elements that had been incorporated in the BFAP model. Having stated as one of the design principles that complementary standards should be taken into account whenever possible, the Team agreed that when elements have a close analog in the Text Encoding Initiative (TEI) guidelines, the element name and, when appropriate, the element content model, should be taken from those guidelines. By combining descriptive and generic elements with attributes in a simplified document structure, the Bentley Group was able to distill from the BFAP model the essential finding aid tag library, and Pitti then began to recast the Ann Arbor results into a revised data model and finding aid DTD, renamed the Encoded Archival Description (EAD).

The Bentley Team emphasized the importance of documentation, including a tag library and application guidelines, to make the DTD viable. Such documentation should be "friendly" enough to enable users barely acquainted with SGML to apply the DTD both routinely and intermittently in their work. While the Team focussed on elements to ease conversion of traditional finding aids, it also looked for SGML techniques that could begin to improve the delivery of finding aid information, particularly in an online environment. The Team speculated about future possibilities, involving attachment of online "help" scripts to explain descriptive practice as reflected in finding aids, links to central glossaries and shared administrative histories, and presentation of new views that might transform hierarchical data into archival family trees.

The Bentley group discussed several topics associated with prospects for profession-wide adoption and maintenance of an encoding standard for finding aids. They expect to circulate widely the design principles and revised data model; this began with a presentation at the September 1995 Annual Meeting of the Society of American Archivists. On October 16, Pitti released a revised DTD, but because it is still very much under development, this release was announced only to a small group of volunteer "early implementors," and only for testing purposes. Further work on the EAD DTD is being carried out under the auspices of the Library of Congress National Digital Library Program (see Section 3.5).

3.3. The California Heritage Digital Image Access Project.

While the Berkeley Finding Aid Project focused only on the finding aid text itself, it set the stage for a future in which collection-level records lead to finding aids, and finding aids lead through hypermedia links to digital surrogates of primary source materials in a variety of native formats (such as pictorial materials, three-dimensional objects, manuscripts, typescripts, sound recordings, and motion pictures). The NEH-funded California Heritage Digital Image Access Project currently underway at the UC Berkeley Library is linking collection-level records in the Berkeley catalog to finding aid texts, and finding aid text to digital representations of primary source materials documenting California culture and history. The primary objectives of the Project are twofold: first, to implement in the networked computer environment the three-tiered archival access model in order to determine whether it effectively achieves its purpose of describing, controlling, and providing access to primary source materials; and second, to extend the finding aid DTD development initiated in the earlier BFAP to encompass prototype standards for controlling and accessing digital representations of primary source materials.

3.4. The Next Step: The UC EAD Union Database, or “Virtual Archive”.

With the success of BFAP and the work of the California Heritage Project to make the finding aid the fulcrum of a dynamic electronic archival access system, it has become clear that it will be necessary to build a large union repository of electronic finding aids, centrally at first and later, as technology matures, in a decentralized environment. Through the prototype union database created in this project, UC archivists, librarians, students, and researchers will experience the potential of the virtual archive of the University's rich collections of primary source materials. They will also develop the expertise necessary to enlarge the database, maintain its content, and preside over its continued development. The UC EAD union database project extends the work of the earlier projects beyond the confines of a single campus in an attempt to achieve integrated access to many distributed collections.

3.5. Related Work.

Although before BFAP there had been some effort in the archival community to provide network access to machine-readable finding aids, these efforts were unilateral and not based on standards. BFAP project staff understood that archive and library community collaboration and cooperation on the Internet would depend upon standards-based communication. The finding aid encoding and content standards development now underway will provide the base needed to realize not only centralized access to collection-level summary records but also to the hierarchical and detailed descriptions found in finding aids. A summary of other recent work related to BFAP and EAD follows.

3.5.1. National Inventory of Documentary Sources in the United States (NIDS).

NIDS, a for-profit publication compiled and produced by Chadwyck-Healey Inc., represents the only effort to date to provide union access to finding aids representing collections in U.S. repositories, but it is not online. Finding aids are contributed by participating archives and libraries and are then filmed and reproduced on microfiche. An accompanying printed index provides access to the microfiche images. NIDS is divided into three parts: part one includes finding aids in the National Archives, seven presidential libraries, and the Smithsonian Institution Archives; part two contains registers for collections in the Manuscript Division of the Library of Congress; part three consists of finding aids for state archives, state libraries, state historical societies, academic libraries and other repositories. To date, 327 repositories, including 210 American institutions, have contributed almost 52,000 finding aids.

3.5.2. National Union Catalog of Manuscript Collections (NUCMC).

First published in 1962, NUCMC provides collection-level summary descriptions of manuscript collections in U.S. repositories in a published book format. From 1962 to 1991, 65,325 records were contributed by 1,369 repositories. Accompanying cumulative indices provide access through names, places, and subjects. Publication of the printed title ceases with the 1993 issue, since the Library of Congress concluded that the RLIN machine-readable database of collection-level AMC records has made the print publication obsolete. The NUCMC program continues at the Library of Congress, with records being contributed to RLIN.

3.5.3. National Digital Library Initiative.

While the Library of Congress' National Digital Library Initiative will attempt to digitize a wide range of materials, a prominent feature will be the digitizing of primary source materials that document the American cultural and historical experience. Library of Congress staff from the National Digital Library Initiative, the Manuscripts Division, the Prints and Photographs Division, and Information Technology Services have worked closely on the EAD DTD and data content model with Pitti and other Berkeley staff. As mentioned earlier, the Library has agreed to take on long-term maintenance of the DTD and has contracted with an SGML consulting firm (ATLIS) to develop the tag library. Recently (November 1-3), the Library hosted meetings to review the version of the DTD that was released on October 16 and to recommend changes that would enable development of an "alpha" version of the DTD. The "alpha" version will be distributed in early 1996, again to a small group of volunteers but this time for a more extended period of testing. Feedback from this phase of testing will then be folded into the development of a "beta" version of the DTD, which will be made generally available for wide community use and testing in spring 1996. Participants in the EAD meetings at LC/NDL are currently drafting an update on EAD development, a revised data model, and a development time line.

3.5.4. Council on Library Resources grant (CLR).

In December 1995, CLR funded the development of detailed application guidelines for the EAD DTD. The guidelines will be written by Anne Gilliland-Swetland of UCLA DLIS, who is working closely with the Bentley fellowship team and LC staff. A draft is scheduled to be completed by the end of April 1996.

3.5.5. Research Libraries Group (RLG).

In March of 1995, RLG hosted a Primary Sources Forum to discuss the future of access to primary source materials. There was consensus that RLG should begin testing the BFAP hierarchical access model and methods, and this recommendation was sent to RLG's Advisory Council. As a result, RLG submitted a proposal to the National Telecommunications and Information Administration, Telecommunications Information Infrastructure Assistance Program (NTIA/TIIAP) for a two-year project to create an initial test database of 50 machine-readable finding aids contributed by 10 RLG members. Although this proposal was not funded, RLG's commitment to implementing EAD remains strong. A finding aid union database project is currently being developed by RLG's Primary Sources in American Literature Task Force (PSALT), and UC EAD participants have been tentatively invited to participate in training sessions planned for this team.

The recommendation to RLG's Advisory Council also had an impact on PSALT. In March 1993, members of RLG had begun a one-year project to survey the collections of America's primary source materials with a long-term goal of finding a way to provide the nation's scholars with full access to all collections of primary source materials that document American literature. The Resources for Research in American Literature Project (RRAL) had carried out an intensive survey of some of the collections of a number of major repositories, and although the project succeeded in gathering a great deal of useful information about the collections it surveyed, it also discovered that it is difficult to get institutions to complete intensive surveying and recataloging projects. In following up on the work of RRAL, PSALT has recognized that an approach must be used that will make use of existing finding aids and minimal "gateway" collection-level records rather than intensive survey/cataloging projects. In addition to its EAD-style finding aid project, PSALT is looking at the possibility of creating a 'location register' (similar to a British resource of the same name) comprising very brief collection-level records containing fields noting the existence of finding aids and making it possible to link with machine-readable versions of them (as is being done in Berkeley's California Heritage Project).

3.5.6. Southeastern Library Network (SOLINET).

In February 1995, SOLINET hosted a planning workshop for the Southeast Special Collections Access Project (SESCA). Held in Atlanta and attended by archivists from throughout the Southeastern United States, the workshop's purpose was to develop a plan of action for building a regional finding aid database. SOLINET invited Daniel Pitti to speak and to demonstrate SGML-based machine-readable finding aids. As an outgrowth of the planning workshop, SOLINET partnered with the Southern Growth Policies Board (SGPB) to submit a proposal to NTIA/TIIAP for a two-year demonstration project to link and integrate distributed regional information resources. The project was funded by NTIA and will focus on library special collections and public and government information. Berkeley will maintain contact with SOLINET, providing training and other assistance as requested. The SOLINET/SGPB project will assist Berkeley in its continuing evaluation and development of the EAD DTD and will provide independent testing and verification of BFAP and California Heritage Digital Image Access Project conclusions.

4. PROJECT PLAN.

Describe what you intend to do and how long it will take. Will you install hardware and software? Will you reorganize functions or change the way you deliver services? Who will do what? How will work be coordinated among sites? Describe which aspects of the project appear certain to produce desired results and whether some aspects entail greater or unknown risks. How will the results of the project be useful throughout the University? Will the project serve as a learning experience for units not directly involved in it? If the project will lead to continuous operations, how will they be sustained after initial funding ends?

The UC EAD prototype union database project will undertake the following activities:

Preparation for the project will begin before the official start of operations with a documentation survey of collections carried out by Special Collections and University Archives staff throughout UC and by the other participating libraries; it will be coordinated by Anne Gilliland-Swetland (November 1995 through August 1996). The survey's purpose is to compile detailed information about the existing state of UC finding aids and identify variables that might affect conversion decisions (e.g., numbers of finding aids already in digital form, level of heterogeneity between finding aids, local resources available for conversion project). This data will then be used to select those finding aids to be included in the prototype union database.

While it is difficult to predict exactly how much data will be encoded in the course of the project, based on the results of a brief preliminary survey conducted in October 1995 we have established a conservative goal of 30,000 pages of finding aid text to be included in the prototype union database. In that survey, the nine UC campuses identified nearly 2,900 finding aids (72,400 pages) as priority candidates for inclusion (see Table 1). These finding aids describe collections in a wide variety of subject areas and formats, but represent particularly rich resources in the areas of regional history, music and the performing arts, literature, anthropology, and the medical and natural sciences. We estimate that the total number of existing finding aids at all nine campuses is in excess of 3,680. At a predicted average of 25 pages per finding aid (predicated on the preliminary data), the total number of pages of existing finding aid text representing UC library holdings that might eventually be included in a union database could exceed 92,000. The goal of 30,000 pages to be included in the prototype, therefore, represents approximately one third of the potential size of the database of converted UC finding aids (see tables 2-6 for additional data on the state of existing finding aids at UC campuses). Similar data are being gathered by some potential non-UC participants.
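The page estimates in the preceding paragraph can be checked with a few lines of arithmetic. The Python below uses only the figures quoted in the text, which are themselves approximate ("nearly 2,900", "in excess of 3,680"); the variable names are illustrative.

```python
# Figures quoted in the text (approximate survey results).
priority_aids = 2_900          # finding aids identified as priority candidates
priority_pages = 72_400        # pages in those finding aids
total_aids = 3_680             # estimated finding aids at all nine campuses
prototype_goal_pages = 30_000  # pages targeted for the prototype

# Average length observed in the preliminary survey, rounded to whole pages.
observed_average = priority_pages / priority_aids
pages_per_aid = round(observed_average)

# Projected size of a full union database and the prototype's share of it.
total_pages = total_aids * pages_per_aid
prototype_share = prototype_goal_pages / total_pages

print(f"average pages per finding aid: {observed_average:.1f}")
print(f"projected total pages: {total_pages:,}")
print(f"prototype share of total: {prototype_share:.1%}")
```

The survey average rounds to the 25 pages per finding aid used in the text, the projected total comes to 92,000 pages, and the 30,000-page goal is a little under one third of that total.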

The plan of work calls for a tightly scheduled sequence of events occurring over twenty-four months and will require the close cooperation of many staff members. Project administration will reside at UCLA, with Acting Head of Special Collections Charlotte B. Brown serving as Project Coordinator and Associate University Librarian Brian Schottlaender as Project Administrator. They will coordinate all administrative aspects of the project, including Governing Board communications, arrangements for training, purchase of equipment, and technical issues that must be discussed with staff at Berkeley. Daniel Pitti will play the role of Database Designer.

The project will hire a Finding Aid Conversion Specialist (FACS) at the Programmer Analyst II level and four Electronic Publishing Assistant I staffers (equivalent to Library Assistant II) to assist the FACS with encoding. Project staff will be housed at the Berkeley Library in its Electronic Text Unit. The FACS will administer the technical support unit at Berkeley, train the participants, supervise and carry out finding aid conversion for some participants, assist in the resolution of conversion problems, and monitor production schedules and data quality. Because of their experience with EAD data production, existing staff from the Berkeley Library's Electronic Text Unit will assist the FACS with training and act as technical advisors to the project.

UCLA, Berkeley, Davis, San Diego, and the non-UC participants will create their own data, adhering to the project's standards and following the production timetable. Irvine, Riverside, San Francisco, Santa Barbara, and Santa Cruz, all of which have limited staff in their Special Collections and University Archives units, will send machine-readable finding aids to the support unit at Berkeley for conversion to SGML by the FACS and assistants. All participants' encoded data will be integrated on a central server, Berkeley's Sun Sparccenter 2000, upon which the DynaText display, retrieval, and navigation software has already been installed.

The plan of operation follows. Although it suggests a linear progression, many activities will overlap, and various participants will be working concurrently on the different aspects.

Step I: Training Workshop & Production Support.

A one-week intensive training workshop for the Campus Representatives covering all aspects of the project will be given in June 1996 by the FACS with the assistance of Daniel Pitti and staff from Berkeley's Electronic Text Unit. The curriculum will include use of EAD, mark-up of finding aids, SGML authoring, SGML conversion, conversion methods for different types of source documents (including scanning, OCR, and database conversion), and discussion of various production methods. Scripts and macros developed at Berkeley will be shared with all participants, with instruction in modification to suit local needs. The participating non-UC libraries will attend.

After the workshop, the FACS will visit each UC campus to monitor progress and give further instruction. Those campuses sending their finding aids to Berkeley for conversion will be included to facilitate creation of new finding aids that will be easily convertible to EAD. Close contact will be maintained between the finding aid conversion specialist and the project coordinators at the participating institutions via project listserv, WWW home page, email, and telephone. Non-UC participants will be assisted as time permits.

Step II: Creation of USMARC Collection-level Records and Encoding of Finding Aids.

A USMARC collection-level catalog record will be created, if it does not already exist, for each collection. This is a key element in the overall digital library scenario that envisions links between existing online catalogs, encoded finding aids, and digitized source materials.

All finding aids will be encoded in compliance with EAD. Decentralized authoring will be done using the project's standard PC-based SGML authoring environment (see Budget for the hardware configuration and recommended SGML authoring software). The project will purchase this standard hardware and software package for project staff and each UC campus, while non-UC participants will use their own funds. (Hardware and software will be purchased, distributed, and installed by July 1996.)

Methods for converting existing finding aids will vary depending upon the native data format. Berkeley staff have experience converting finding aids from both word processing (WordPerfect and Word) and database programs (dBase and Access). Even within a single finding aid, several conversion methodologies may be necessary, given that no standards have previously existed for finding aid structure and content. Large blocks of text can be efficiently converted using "cut-and-paste." Container and other lists, which are characterized by repetition and high-density tagging, are efficiently converted using WordPerfect macros and Perl scripts. Finding aids that exist only on paper can be scanned and OCR'd; the resulting word processing files can then be encoded with SGML.
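Scripted conversion of a container list, as described above, can be sketched as follows. This is an illustration in Python rather than the WordPerfect macros and Perl scripts actually used at Berkeley, which are not reproduced here. The input layout (box, folder, and title separated by tabs) and the tag names are assumptions made for the example and are not taken from the EAD DTD.

```python
from html import escape
from typing import List


def encode_container_list(text: str) -> str:
    """Wrap each 'box<TAB>folder<TAB>title' line in placeholder tags.

    The tag names are hypothetical; a real conversion would emit the
    elements defined in the DTD in use.
    """
    rows: List[str] = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines in the source list
        # Escape '&', '<', and '>' so the data cannot be read as markup
        # (assumes the corresponding entities are declared in the DTD).
        parts = [escape(p.strip(), quote=False) for p in line.split("\t")]
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        box, folder, title = parts
        rows.append(
            f"<row><box>{box}</box><folder>{folder}</folder>"
            f"<title>{title}</title></row>"
        )
    return "\n".join(["<containerlist>", *rows, "</containerlist>"])


# Invented sample input, for illustration only.
sample = "1\t1\tLetters, 1861\n1\t2\tDiaries & notes\n"
encoded = encode_container_list(sample)
```

Because every row of a container list has the same shape, a short routine of this kind applies the repetitive, high-density tagging uniformly, and malformed rows are reported by line number for manual correction.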

Creation of USMARC records and encoding of finding aids will proceed at various sites as described earlier and will take twenty months (July 1996 - February 1998).

Step III: Assembling Data.

SGML finding aids gathered from all participants will be FTP'd to Berkeley, where they will be loaded into the DynaText browser/database system which runs on Berkeley's Sun Sparccenter 2000 server. The finding aids will be loaded, indexed, and archived in the DynaText database by a Berkeley Library Systems Office Programmer/Analyst II. The loading of finding aids will run roughly concurrently with the encoding process (August 1996 - March 1998).

Step IV: Creation of Virtual Collections within the Virtual Archive.

Working with UCLA DLIS researchers, selected participants will experiment with ways that hypermedia can be used to create virtual collections based on intellectual or provenantial links among finding aids. (The hypertext/hypermedia linking of related documents is familiar to users of the World Wide Web.)

Other methods may also be explored. For example, for collections dispersed among two or more institutions, participants will experiment with cooperatively creating a single finding aid, using separate components to describe the materials held at different repositories. For intellectually-related collections, "meta-finding aids" may be constructed that describe and interrelate two or more collections and provide hypertext links to the finding aids for each collection.

Selection of candidates for the virtual collection experiments will occur during the documentation survey (November 1995 - August 1996), and encoding will take place throughout the encoding phase (July 1996 - September 1997). The usefulness of virtual collections will be enhanced if non-UC libraries participate and relationships are found between UC and non-UC holdings.

Step V: Sustaining Continuing Operations.

The UC EAD Project will have set in place the necessary infrastructure of training, hardware and software, and expertise so that the Special Collections and University Archives units in the nine UC libraries will be able to continue adding new EAD-encoded finding aids to the union database on an ongoing basis. In addition to having been made vastly more accessible to the public, the quality and consistency of the product will be improved by the use for the first time of standards for content and encoding of data. The ultimate project goal is to see this state-of-the-art approach incorporated as an essential operational activity of these units.

5. EVALUATION.

Describe what measures will indicate that the project has been successful and how you intend to capture them. Who will do the evaluation? By what methods? How will the evaluation be shared?

The evaluation team, led by Anne Gilliland-Swetland of UCLA DLIS, will include archivists as well as DLIS faculty and students who have expertise in systems evaluation and user needs assessment, and who will work with the participating repositories to test the pilot database from a structural as well as a user perspective. The evaluation team will also work jointly with interested researchers from other UC campuses, including the evaluation team from the Berkeley Finding Aid Project. As the union database is developed, project evaluators will prepare guidelines for participating repositories and a variety of end-user satisfaction assessment mechanisms such as transaction log analyses and focus group interviews. This first phase will take place roughly September 1996-July 1997.

During the second phase (August 1997 - November 1997), the evaluators will build evaluation mechanisms into the pilot, distribute guidelines and questionnaires to participants, arrange for Internet access to the growing database, and conduct focus interviews of end-users. They will solicit feedback for database refinement and process the survey data. In the final phase (December 1997 - April 1998), the team will conduct follow-up individual and group interviews to solicit further feedback. They will also process user data, and collect and analyze budgetary and workflow data from each of the participating units.

The evaluation team will summarize and present the results of each of the three phases of the evaluation process in the final report of the project. The project will be completed by June 30, 1998 and project results will be disseminated in professional forums, research articles, and online.

6. PROJECT TIMELINE.

11/95-8/96 Documentation survey of UC collections; selection of “virtual collection” candidates (10 months)
3/96-7/96 Purchase, distribute, & install hardware & software (4 months)
6/96 Training workshop (1 week)
6/96-8/96 Organize finding aids for conversion (3 months)
7/96-2/98 Encode finding aids (20 months)
8/96-3/98 Load finding aids in DynaText on Berkeley Sun Sparccenter 2000 (20 months)
9/96-6/98 Testing and evaluation (see Section 5 for details)
6/98 Project ends.
11/97-12/98 Project staff write and deliver papers; demonstrate prototype at conferences and meetings; final project report.


Jackie M. Dooley, Head of Special Collections and University Archives Main Library, P.O. Box 19557, Univ. of California, Irvine, CA 92713-9557
Internet: jmdooley@uci.edu Phone: 714/824-4935 Fax (*new*): 714/824-2542