My comments are below in [].
Merrilee
--------------------------------------------------------------------------
On April 23, 1998, representatives from Penn State met to discuss the
MoA2 White Paper. We have many concerns about issues either not raised in
the White Paper or raised but not fully addressed.
1) Our major concern is that this model does not appear applicable to
archival collections. Throughout the document the authors refer to book
examples and TEI, but the kinds of materials we will be contributing to
this project neither look nor act like books. Archival materials are not
described or handled on an individual piece basis but rather as aggregate
collections.
We question if we are using the right framework for archival collections
given the examples used in the White Paper.
[What we have failed to make clear, and need to clarify in subsequent
drafts of the paper, is that EAD will be used as an overarching metadata
framework for the archival collections used in MoA2. Nothing in this
project will take us away from the conventional archival practice of
collection arrangement and navigation.]
2) This model doesn't follow user behavior. For archival patrons who rely
on subject access to find relevant collections, parts of collections, and
items within collections, we need to provide subject access in context
which isn't addressed in the White Paper. Having a search mechanism point
a user to a specific date in a minute book or diary does not help them find
what the author has written about on that date. And this level of
page-by-page subject analysis is way too labor intensive to provide.
[Again, what we need to convey is a layering of discovery mechanisms for a
user. MARC records for the collection level, EAD for collection
description and organization, and then, if resources allow, pointers to
pages that relate to a specific date in a journal. Once the user has
browsed to the desired date, they can bring forth the page image and read
the text. Or if a transcription exists (still more resources) then they
can read and search the electronic text, perhaps side by side on the
screen with the scanned page images. We are interested in identifying
metadata at a community level that will make this sort of access possible
IN ADDITION TO (not instead of) conventional methods of providing metadata
and access. Again, we need to clarify that this is the intention of the
project, not to set out in a new direction but to build on agreed-upon
community standards, and to develop ways to enhance how users search and
navigate in the current environment.
I'd like to underscore that since MARC and EAD are so well documented in
our community, they should not be the focus of this paper; but clearly
there is room for confusion in the current draft, and we need to spend
some more time developing the idea that we are not undertaking a
departure.]
3) The individual level of description required for each image is not
scalable for a model. The time to encode the metadata for each image is
much more labor-intensive than regular processing of archival collections.
The White Paper does not address how much time it will take to provide
metadata for each image at the funding level we are receiving. The size of
the testbed should be reduced for the amount of work required and the
concomitant reduced level of funding. The funding provided is inadequate
for the amount of time it will take to process the collections at the
proposed level of detail. The project seems too ambitious.
The testbed should be scaled back to a workable number of images and
instead concentrate on creating the architectural structure of the metadata
for the objects. Scanning in thousands of images doesn't answer the
question of whether this kind of digital archives (not library!) is
feasible given the nature of the collections.
[We anticipate, given our experience at Berkeley, that the metadata will
vary little from one group of objects to another. The metadata will be
captured in a database (a common practice in large scale conversion
projects) where it will be possible to batch capture metadata for groups
of objects or for the duration of the project (for example, information
about a scanner used in the project, such as scanner name and type will be
entered once for the project, and will not change unless or until the
scanner is replaced). We hope that our experience will hold, and that
this level of metadata capture does not prove too onerous for any of the
participants.
An important part of this project is identifying metadata at different
levels--the minimum (which should be a real minimum) AS WELL AS a richer
metadata environment. The minimum is probably the type of information
which is already being captured--this project is trying to generate
discussion from which standards can emerge. The richer
metadata will allow for the development and use of tools.
Again, the richer metadata, while less frequently collected and used,
should be discussed by the community so that standards can develop.]
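As a rough illustration of the batch capture described above, here is a
minimal Python sketch using the standard sqlite3 module. The table names,
field names, and values are all hypothetical and are not the schema of the
Berkeley database; the point is only that project-wide facts such as the
scanner are entered once, and an individual image carries a value of its
own only when it differs.

```python
import sqlite3


def build_capture_db():
    """Create an in-memory capture database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE project_defaults (field TEXT PRIMARY KEY, value TEXT);
        CREATE TABLE image (image_id TEXT PRIMARY KEY, sequence INTEGER);
        CREATE TABLE image_override (
            image_id TEXT, field TEXT, value TEXT,
            PRIMARY KEY (image_id, field)
        );
        """
    )
    return conn


def metadata_for(conn, image_id):
    """Return project defaults merged with any overrides for one image."""
    record = dict(conn.execute("SELECT field, value FROM project_defaults"))
    overrides = conn.execute(
        "SELECT field, value FROM image_override WHERE image_id = ?",
        (image_id,),
    )
    record.update(dict(overrides))
    return record


conn = build_capture_db()
# Entered once for the whole project (values are made up).
conn.executemany(
    "INSERT INTO project_defaults VALUES (?, ?)",
    [
        ("scanner_name", "Example Scanner"),
        ("scanner_type", "flatbed"),
        ("bit_depth", "24"),
    ],
)
# Batch-register three page images in one call.
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [(f"diary-{n:04d}", n) for n in range(1, 4)],
)
# Only the exception is recorded against an individual image.
conn.execute(
    "INSERT INTO image_override VALUES (?, ?, ?)",
    ("diary-0002", "bit_depth", "8"),
)
```

Calling metadata_for(conn, "diary-0001") returns the three project
defaults unchanged, while the same call for "diary-0002" returns the
overridden bit depth alongside the shared scanner fields.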
4) Does the software exist for the level of description expected or will
this need to be developed?
[Berkeley has used databases to hold image capture metadata for all of its
major imaging projects since the California Heritage Project. We are
working on this model and expanding the number of metadata fields to fit
the proposed model. This database will be provided for any of the
participants who do not currently have such a system. We will also
develop scripts which will take the metadata from the database and create
the components of the digital library objects--these scripts will also be
shared.]
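To give a sense of what such a script might look like, the following is a
small Python sketch that turns one record into an XML component using the
standard xml.etree.ElementTree module. The element and attribute names
(digitalObject, adminMetadata, structure, page) and the sample values are
invented for illustration; they are assumptions, not the object format the
project will actually define.

```python
import xml.etree.ElementTree as ET


def build_object_xml(object_id, admin, pages):
    """Serialize one object as XML (all element names are hypothetical).

    admin: dict of administrative metadata fields.
    pages: list of (label, image_file) pairs in reading order.
    """
    root = ET.Element("digitalObject", {"id": object_id})

    # Administrative metadata, written in a stable (sorted) order.
    admin_el = ET.SubElement(root, "adminMetadata")
    for field, value in sorted(admin.items()):
        ET.SubElement(admin_el, "field", {"name": field}).text = value

    # Structural metadata: one element per page image, numbered from 1.
    struct_el = ET.SubElement(root, "structure")
    for seq, (label, image_file) in enumerate(pages, start=1):
        ET.SubElement(
            struct_el,
            "page",
            {"seq": str(seq), "label": label, "image": image_file},
        )

    return ET.tostring(root, encoding="unicode")


example_xml = build_object_xml(
    "diary-example",
    {"scanner_type": "flatbed", "bit_depth": "24"},
    [("Front cover", "diary-0001.tif"), ("Page 1", "diary-0002.tif")],
)
```

In practice the admin dictionary and page list would come from the
database rather than being typed in by hand.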
5) There needs to be a shared template so everyone is providing the
same type of information. Is there a template in place? Has Berkeley
designed a template or will each repository have to design its own?
[The database I mentioned above should serve as such a template.]
6) The intellectual piece of analysis needs to be addressed. At what
level will correspondence, minute books, and diaries, for example, be
described in the metadata? If there is to be a shared template, who will
have input into the design and level of analysis?
[Each institution will contribute its own MARC records and EAD-encoded
finding aids for each collection in our project, following usual
archival practice at each institution.]
7) Assuming the standard is 24-bit for certain images, can the Internet as
it exists handle importing this size of data? What assumptions are we
making about document delivery?
This from Howard: "The size of image we capture is independent of the
size of image we deliver. If we don't want to have to go back and rescan
these items a few years from now when user capabilities are higher, we'd
better try to do as deep a capture of the informational value as we can
today. If we capture 24-bit images, we can still deliver 8-bit images to
users (though many public institutions are delivering 24-bit images
today). There is a similar issue for compression, where we can capture
and store uncompressed images but deliver them in compressed form."
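Howard's point can be made concrete with a little arithmetic and a toy
conversion. The sketch below is pure standard-library Python and rests on
several assumptions of mine: the 8-bit delivery copy is taken to be
grayscale (it could equally be a 256-color palette), the luma weights are
the common 299/587/114 approximation, and zlib stands in for whatever
delivery compression is actually chosen. The master pixels are never
altered; the delivery copy is derived from them.

```python
import zlib


def uncompressed_bytes(width, height, bits_per_pixel):
    """Size in bytes of an uncompressed raster image."""
    return width * height * bits_per_pixel // 8


def to_8bit_gray(rgb_pixels):
    """Reduce packed 24-bit RGB bytes to 8-bit grayscale bytes."""
    out = bytearray()
    for i in range(0, len(rgb_pixels), 3):
        r, g, b = rgb_pixels[i:i + 3]
        # Integer luma approximation; the weights sum to 1000.
        out.append((299 * r + 587 * g + 114 * b) // 1000)
    return bytes(out)


def delivery_copy(rgb_pixels):
    """Derive a compressed 8-bit delivery copy from a 24-bit master."""
    return zlib.compress(to_8bit_gray(rgb_pixels))


# A two-pixel "master": one white pixel, one black pixel.
master = bytes([255, 255, 255, 0, 0, 0])
delivered = delivery_copy(master)
```

For a 3000 x 2000 pixel scan, uncompressed_bytes gives 18,000,000 bytes
at 24 bits per pixel and 6,000,000 bytes at 8 bits, before any
compression is applied to the delivery copy.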
8) How are we going to decide among all five repositories to agree on
metadata for each object? Will only those repositories contributing
photographs discuss and agree on the depth of description that will be used
for these images or will all repositories discuss and come to an agreement
on every object?
[I think that we again need to clarify the types of metadata that are
being discussed in this paper. Descriptive metadata (MARC records and
finding aids) are already in place within the archival community.
This project is focussing on structural and administrative metadata. And
there is no reason why all of the participants cannot contribute to a
given portion of the discussion. Since the collections of the
participating institutions are different in relation to the archival
theme, I would not expect everyone to have photographs to contribute.
However, I think we will all be scanning photographs in the future, and
have a vested interest in this process, as well as experience and thoughts
to share.]
9) Do we understand correctly that the images will reside locally but the
metadata and search engine will be at Berkeley? In that case, is there a
search engine sophisticated enough to do what is proposed in the White Paper?
[The MARC records and EAD-encoded finding aids will reside at Berkeley.
The searching mechanism for these will be OCLC's SiteSearch and Inso's
DynaWeb respectively. The images will reside locally. The digital
library objects we will be creating, and the Java tools to go with them,
will live on the UNIX server at Berkeley.]
10) Scanned images of handwritten documents are not searchable. Using OCR
increases the cost and the technology doesn't exist yet that can accurately
read handwriting. Barring rekeying the text of handwritten documents, how
will the search engine seek and find the objects in context?
[Access will primarily be through MARC records and finding aids. We will
have as a sample a few fully transcribed documents. We will also have
documents that have a minimal level of transcription, such as dates in a
diary, or chapter titles in a book, to allow for some greater level of
navigation than mere page turning. We will want to have just enough of
these richer objects to test how they work. But the majority of objects
in the archive will be images only.]
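The "minimal level of transcription" idea can be sketched as a small
lookup from entry dates to page images. The Python below is only an
illustration under my own assumptions: the class name, the page numbers,
and the dates are made up, and the real structural metadata may record
this quite differently. Given a date, it returns the page on which the
most recent entry on or before that date begins.

```python
from bisect import bisect_right
from datetime import date


class DiaryIndex:
    """Map entry dates to the page sequence where each entry begins."""

    def __init__(self, entries):
        # entries: iterable of (date, page_sequence) pairs.
        self._entries = sorted(entries)
        self._dates = [d for d, _ in self._entries]

    def page_for(self, when):
        """Page where the latest entry dated on or before `when` begins.

        Returns None if `when` is earlier than every indexed entry.
        """
        pos = bisect_right(self._dates, when)
        if pos == 0:
            return None
        return self._entries[pos - 1][1]


# Made-up example: two dated entries in a diary.
index = DiaryIndex([
    (date(1863, 7, 1), 12),
    (date(1863, 7, 4), 15),
])
```

A user browsing to 2 July 1863 would be taken to page 12, the start of
the nearest preceding entry, and could turn pages from there.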
Since three of the five participating institutions are on the east coast,
it might be cost-effective to have the summer meeting in New York or
Pennsylvania. However, we might benefit from a visit to Berkeley to
observe the operations there.
[We will plan a group meeting on the East Coast for sometime this summer.]