EAD and Databases
Perl and ADO on MS Windows

Alvin Pollock
Digital Publishing Group

This is an old tutorial I wrote describing Perl's database functionality especially as applied to EAD. It is somewhat out of date now. It is geared exclusively toward EAD version 1.0 and SGML. As I have learned more and done this kind of programming many many times I have learned better ways to do things but sadly have not had the time to update this tutorial. Many of the concepts should still prove useful and so this tutorial remains. I recommend especially the fundamentals section which should give users with perl experience and database experience a nice introduction on applying the first to the second.

I've actually done some work when I can on an updated tutorial. Look for something new next year.

While you're here, please visit our EAD Tools page for all of the publicly available tools we use here at UC Berkeley.

Jump to
Introduction

Here in the UC Berkeley Library's Digital Publishing Group we have combined relational databases with EAD quite successfully and applied this combination to many interesting uses. The core of our strategy has been to create methods for easily importing EAD into databases and methods for getting EAD out of databases. While we've been able to export EAD for some time, as have many other repositories, being able to import EAD is a fairly new ability for us and has opened up a wide range of possibilities. This tutorial will describe these techniques in detail.

While many repositories are beginning to explore using databases to author new finding aids which can be exported to EAD, there has always remained the problem of what to do with legacy finding aids already encoded in EAD. Corrections and updates had to be applied directly in EAD rather than within the authoring database. The ability to import older EAD finding aids into the database means that repositories can unify their authoring under a single tool. For those repositories which use a commercial collection management system or one of a home-grown variety, importing older EAD-encoded finding aids means they now have ability to manage all of their collections within the software. Finally, databases offer exciting possibilities for displaying and manipulating finding aids. One such application described here is a web-enabled finding aid database which affords a repository much more power in publishing their finding aids online than traditional stylesheet-driven methods.

Perl and ADO

All of the methods described here use the perl programming language and the Microsoft ActiveX Data Objects (ADO) API.

Why perl?

The techniques I describe here can be implemented in just about any language with database support. MS Visual Basic, C++, and Java, are all very popular choices. However, many institutions working with EAD are very likely to already possess substantial perl expertise. We have found perl absolutely essential for retrospective conversion of our legacy finding aids. Our encoders use perl on a daily basis. In my experience perl is equally entrenched in other institutions involved with EAD. But many individuals may not even be aware of perl's excellent support for database programming.

Using perl for database programming has many advantages and a few disadvantages. Perl is far slower than C++ and somewhat slower than Visual Basic. If speed is essential, either of these languages makes a better choice. But perl is without doubt the language of choice for text manipulation. Many of the example techniques here will leverage perl's powerful regular expressions and a syntax which makes manipulating text strings easier than most any other language.

One of the most common complaints from programmers new to perl is all the "funny symbols." But those $'s and @'s and %'s are exactly what makes perl so powerful when manipulating text. Because of these symbols, perl can support variable interpolation. In perl I can write something like this:

print "<a href=\"$href\" target=\"$target\" title=\"$title\">\n";
Or if escaping quotes with backslashes isn't your cup of tea:
print qq(<a href="$href" target="$target" title="$title">\n);
You'd need to write that in Visual Basic like this:
Print "<a href=""" & Href$ & """ target=""" & Target$ & _
"""title=""" & Title$ & """>" & vbLf

While neither java or C++ supports variable interpolation, at least you can escape quotes with a backslash.

Perl supports a wonderful text construct called here documents, the darling of cgi programmers world-wide. Using a here document, I can print this:

print <<EOF;
<c0$depth level="$level">
  <did>
    <container type="carton">$carton</container>
    <container type="folder">$folder</container>
    <unitid type="$level">$unitid</unitid>
    <unittitle>$unittitle, <unitdate>$unitdate</unitdate></unittitle>
    <physdesc>$physdesc</physdesc>
  </did>
  <scopecontent>
    <p>$scopecontent</p>
  </scopecontent>
EOF

I can't even bring myself to duplicate this in Visual Basic, but in java you would need to do this:

eadComponent = "<c0" + depth + "level=\"" + level + "\">\n" +
               "<did>\n" +
               "<container type=\"carton\">" + carton + "</container>\n" +
               "<container type=\"folder\">" + folder + "</container>\n" +
               "<unitid type=\"" + level + "\">" + unitid + "</unitid>\n" +
               "<unittitle>" + unittitle + ", <unitdate>" + unitdate + 
               "</unitdate></unittitle>\n" +
               "<physdesc>" + physdesc + "</physdesc>\n" +
               "</did>\n" +
               "<scopecontent>\n" +
               "<p>" + scopecontent + "</p>\n" +
               "</scopecontent>\n"

Certainly those programmers for whom java or Visual Basic is the language of choice and who may have no more than a passing acquaintance with perl are already familiar with issues such as these and constructing strings like the one above is second nature. It's all a matter of selecting the right tool for the right job and leveraging existing expertise to its fullest extent.

Why ADO?

Just as there are many choices in the programming language you may wish to employ, there are many other database APIs to choose from besides ADO. In fact ADO is a relatively new API in comparison to OBDC, JDBC, RDO, DBI, et al. It really comes down to a matter of preference. As I am a fairly lazy programmer and am loathe to spend more time than I have to learning a new technology, I was attracted to ADO's ease of use and shallow learning curve. It bundled up typically burdensome database programming tasks into high-level abstract functions. Anyone who has used INSERT INTO SQL syntax to add new records to a complex database will immediately appreciate ADO's AddNew() method. And Microsoft's addition of Data Shaping to manipulate and retrieve hierarchical data is especially useful for EAD. ADO is the easiest method I know to program databases.

ADO's biggest drawback, however, is that it is a Microsoft-only technology. Don't expect to apply this tutorial on your Sun or Linux workstation. It's Windows only, sadly. Most repositories nowdays use or have easy access to Microsoft Windows. Importing EAD-encoded finding aids into a relational database will not usually prove burdensome. But serving out databases over the web might. ADO works with just about any ODBC-compliant database. If you're using MS Access or MS SQL Server, that's not a problem. But if you have a significant investment in a Unix-based database such as Oracle or Sybase, ADO may not be much help. You can still import and export EAD into Oracle or Sybase on Windows using ADO, but you won't be able to do it on Unix.

Next: What you'll need to get started 




Have any comments on this tutorial? Submit them Here.

Copyright © 2001 UC Regents.
All rights reserved.
Last update August 4, 2001.