Berkeley Digital Library SunSITE

Library Services in Theory and Context

Chapter 8: Retrieval

Having considered the assembling of copies of materials into collections, we shall now consider retrieval—the process whereby data or documents are found in the attempt to deal with inquiries that have been posed.

Definition: the retrieval process

We view the retrieval process as having three stages:

  1. Susceptible person with an inquiry
  2. Expressions of inquiry in the language of the retrieval system
  3. Set of retrieved signals

It is important to distinguish between these three stages because the transformations between them are significant.

By a "susceptible person with an inquiry," we refer to the fact that library services are for people to use. The "susceptibility" indicates that the person with the inquiry is vulnerable to becoming informed. For the purposes of analysis we take the unorthodox step of separating the retrieval process (this and the previous chapter) from the process of becoming informed (next chapter). Therefore, in our analysis, the susceptibility is in a strict sense not directly relevant to the retrieval process, narrowly viewed. Nevertheless, it is clearly important to the system as a whole and is, therefore, well worth noting.

The first transformation that takes place—between stages 1 and 2—is the formulation or expression of the inquiry in the language of the retrieval system. It would be an exaggeration to say that this transformation is well understood. Some features, however, can be outlined.

A person is likely to advance only those inquiries which he or she thinks the system may be able to cope with. This is no more than rational behavior on the part of the user. The expectations with respect to the system may not be accurate. Indeed, they might be wildly unrealistic. Nevertheless, we can expect the user's decision to advance the inquiry to be conditioned by these expectations of what the system can do.

This is recognized by librarians who feel the need to publicize the scope of the resources available in their libraries, and in the provision of instruction in library use. They may be correct in asserting that users are often unaware of the power and scope of the retrieval tools available. However, it is also possible for librarians who are intimately familiar with the bibliographical power of their libraries' apparatus to underrate the efforts involved from the users' perspective and to overrate the relevance, range, and accuracy of the retrieval systems. We have discussed this briefly in chapter 6 and will do so again in chapter 13.

The manner in which the inquiry is presented will be influenced by the linguistic, cultural, and technical orientation and personal knowledge of the user; and the user's expectations as to what form of inquiry will be acceptable and/or intelligible to the retrieval process. (This appears to be the case whether the inquiry is made to a librarian or to the retrieval system.)

It is not to be expected that the user will have sufficient understanding of the retrieval system to know exactly what form of expression will produce the response most appropriate to the inquiry. Indeed, only the designer of the retrieval system or someone intimately familiar with it could be expected to approach such knowledge.

Since every individual has a particular conceptual framework and vocabulary, since the retrieval system has its own technical vocabulary, and since the user cannot be expected to have an intimate knowledge of the retrieval system, a greater or lesser degree of distortion is to be expected in the transformation of the inquiry into the language of the retrieval system.

Librarians recognize that some degree of distortion is to be expected. The short-term solution is for the librarian to initiate a dialog or negotiation with the user. This is commonly referred to as the "reference interview". 1 Inevitably, as in any human interaction, there is more involved than a simple exchange of verbal statements. (See, for example, Shosid's psychological analysis of the reference encounter. 2)

At some stage the inquiry is expressed in the language of the retrieval system, since there is no other way of using it. In a sense a retrieval system can be regarded as responding to any inquiry addressed to it. But, if alien terms are used (e.g., subject terms not represented in a subject catalog), there is no positive response. The retrieval system will only be responsive to terms it recognizes.

The second transformation is the process between stages 2 and 3 of our definition—the yielding of a set of retrieved signals in response to an inquiry in the language of the retrieval system. We use the word "set" advisedly since the retrieval system may yield nothing—an empty set.

Once the inquiry has been posed in the language of the retrieval system, the outcome is automatic. The search is predetermined. Headings in the catalog lead directly or indirectly to catalog records. In this sense, retrieval systems are automatic devices, machines, in which any given input will yield one and only one output, which is predictable given a sufficient understanding of how the system works. The only exception to this is that there may be mechanical failures: a connection may be lost in a computerized system, or a card misfiled or overlooked in a card catalog. Barring temporary mechanical failures, retrieval systems are like automata in that the same inquiry repeated will yield the same predictable response. This may seem contrary to experience in that two different people with the same initial inquiry (or the same person with the same inquiry at two different times) can and often do end up with different results. There are two explanations for this, both compatible with the view of retrieval systems as automata:

  1. Retrieval systems are usually in a state of continuous, gradual revision: data are added or withdrawn; new index points inserted; syndetic relationships changed. A library catalog may seem to be a particularly undynamic object, but the appearance is misleading. It follows that, when the same inquiry is presented at different points in time to what is ostensibly the same retrieval system, the system has, in fact, changed somewhat. It is, therefore, in an altered state—updated, most likely—and, therefore, while still an automaton, it has become a slightly different automaton. The slight difference may well be significant for some inquiries.
  2. There is scope in sophisticated retrieval systems like most on-line bibliographical search systems and complex card catalogs for changes in the expression of the inquiry. For example, one might conclude that the rendering of the inquiry in the language of the retrieval system has not been done well and that a modification of the phrasing would improve it. Another possibility is that the initial formulation of the query may yield too little (or too much) by way of retrieved signals—and so the formulation of the search may be expanded (or restricted) accordingly.

On-line bibliographical retrieval systems are usually postcoordinate systems wherein complex searches are formulated from their constituent concepts at the time of searching instead of depending on selecting complex descriptions composed when the document was cataloged (as in the case of precoordinated systems such as traditional library subject catalogs). It is, therefore, comparatively easy to expand (or restrict) searches. One can add (or relax) restrictions (e.g., "written in English," "published since 1975," and "pertaining to rabbits"). In precoordinated systems, such as library card catalogs, one can simply decide to exclude some of the richness offered (e.g., by not following "see also" references or by considering only works by a particular author). Alternatively, one can expand the expression of the search by formulating it more broadly in the language of the retrieval system.

In both cases, the change in the set of retrieved signals from one posing of the inquiry to another represents not an unreliability on the part of a stable retrieval system but a change in the state of the retrieval system or in the formulation of the inquiry.
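The view of a retrieval system as an automaton can be illustrated with a minimal Python sketch. The catalog contents, the record identifiers, and the function name `retrieve` are invented for illustration; the only point is that the retrieved set depends on nothing but the state of the system and the formulation of the inquiry.

```python
# Hypothetical catalog: heading -> set of record identifiers (all values invented).
def retrieve(catalog, heading):
    """Return the records filed under a heading; an unrecognized heading yields the empty set."""
    return set(catalog.get(heading, set()))

catalog = {"RABBITS": {"rec-1", "rec-2"}}

first = retrieve(catalog, "RABBITS")
second = retrieve(catalog, "RABBITS")  # same state, same inquiry: same set

# The catalog is revised: a record is added under the same heading.
revised = {**catalog, "RABBITS": catalog["RABBITS"] | {"rec-3"}}
third = retrieve(revised, "RABBITS")   # a slightly different automaton

# A term the system does not recognize produces no positive response.
unknown = retrieve(catalog, "CONIES")
```

Here `first` and `second` are identical, whereas `third` differs only because the state of the catalog changed between inquiries.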

Data retrieval and document retrieval

The signals retrieved are data in one form or another. A distinction is sometimes made between data retrieval and document retrieval. A "data retrieval system" is one in which specific data are retrieved: the specific information requested in the inquiry.

The term "document retrieval" is ambiguous. It can refer simply to systems in which a document is retrieved. Ordinarily, however, it is used to refer to a retrieval system in which the data retrieved are the description and address of one or more documents; and it is these documents that will contain the data desired. For example, suppose one wanted a list of strong and irregular German verbs. A data retrieval system that enabled one to retrieve such a list directly would have to permit one to use a search command such that the list of strong and irregular verbs would appear. In a document retrieval system this would not happen. Instead, one would specify the need for a book on German grammar, specifying either a particular known work or any German grammar. Assuming that the system does contain details concerning German grammars, the response might be, for example,

Luscher, R. & Schapers, R.
Deutsch 2000: A Grammar of Contemporary German.
Munchen: Hueber, 1976.

This response does not, in itself, inform us about strong and irregular verbs. However, it is a reasonable assumption that, if we were to examine the book, it would contain a list of strong and irregular German verbs and their principal parts. In finding them, however, we may well use another information retrieval system, the index. Looking in the index under either "irregular" or "strong" indicates that, in this case, a list can be found starting on page 274.

There is, it would seem, a basic distinction between a data retrieval system designed to provide directly what one thinks one wants and a document retrieval (or, better, a "reference retrieval") system that merely points one toward the desired data in one or more steps. However, difficulties arise. Suppose, for example, instead of wanting a list of strong and irregular German verbs, we had wanted to know the title of a German grammar by Luscher and Schapers (or, indeed, of any German grammar). The inquiry could have been formulated in the same way and posed to the same bibliographic retrieval system in the same way. Exactly the same response might well have been received as that given above. The difference is that the desired datum has been retrieved, without further searching being needed. In this example, exactly the same system used in the same manner constitutes both a data retrieval system and a reference retrieval system. The only difference is in the intention of the user. This being so, we need to reexamine the distinction more closely. If it is not to be abandoned, then the following options remain:

  1. We can classify such a retrieval system as being a data retrieval system or a document retrieval system on the basis of the intention of the person using it; or
  2. We can define retrieval systems according to whether they are capable of functioning as a reference retrieval as well as a data retrieval system. This capability would appear to extend to any system in which the data stored are capable implicitly or explicitly of indicating further sources of data. From this perspective, we can consider data retrieval systems as a limited, primitive category within the larger class of retrieval systems.

Although the example used was bibliographical, other sorts are possible. For example, in museum documentation, the reference is likely to be to an actual artifact rather than to a book. More generally in data retrieval systems, some kind of reference to further documentation is likely to be provided, whether explicitly or implicitly, even if only to indicate the source, authority, or definition of the data retrieved.

Multiple retrieval systems

The ability of a retrieval system to retrieve signals (which, according to the intentions of the user, can be regarded either as the data desired or pointers toward the data desired) can be illustrated by considering some of the many examples of retrieval mechanisms in a library.

One user might treat the subject arrangement of the documents on the shelves (arranged according to the Dewey Decimal Classification, perhaps) as a retrieval mechanism and go directly to the shelves associated with the Dewey number(s) expected to contain the signals (books) desired. (That is, indeed, an example of document retrieval in a literal sense.)

A second user might approach the library catalog and examine subject headings and note the bibliographical data recorded on the cards there. That may be all that the user wanted if verifying a bibliographical citation and willing to rely on card catalog data instead of inspecting the documents themselves.

A third user might do exactly the same as the second user, but for this user the data recorded on the cards do not constitute the goal, only the directions to the document on the shelves.

A fourth user might go to the librarian and express the inquiry verbally. The librarian, acting as a retrieval system, would endeavor to convert the inquiry into the terms of the library system and deliver the data required, whether this is a whole document or merely a mention of specific data from within a book. When a document is found, further retrieval mechanisms—index, table of contents—may be used to identify more specifically what is wanted. Dictionaries, encyclopedias, and bibliographies—all examples of retrieval mechanisms—may well be used independently or in conjunction with other means of searching. It is clear that many examples of retrieval systems occur in libraries and that they can be used independently or sequentially.

Retrieval languages I: Notation

Describing things is a linguistic problem. Even concepts in formal logic have to be defined linguistically at some stage.

One readily recognized distinction in descriptive labels is between "natural" language and "artificial" languages. For example, in a bookshop texts on economics are likely to be shelved by a sign bearing the label "ECONOMICS." The very same books in a public library might bear the sign "330," the symbol denoting Economics in the artificial notation of the Dewey Decimal Classification.

From the point of view of achieving an acceptable address, the labeling doesn't matter very much so long as it is intelligible and accurate. Also, from the point of view of achieving acceptable definition, the choice of notation, language, or meta-language doesn't matter very much as long as the description is sufficiently precise and intelligible for the purposes intended.

It is common for a major distinction to be made between "classification schemes" based on artificial notations (such as Dewey) and natural language indexes. Both will be reviewed.

Natural language systems

In natural language indexing systems, words or terms from the language of choice are used as descriptors. The simplicity is highly attractive. However, as increasing numbers of terms are used, various problems emerge.

Since terms in natural languages overlap in meaning and in appearance, some kind of control is usually imposed. One source of overlap is homographs: two or more words spelled the same way but different in meaning. For example, the word CHARACTER means different things to a playwright, a typewriter mechanic, a theologian, and an architect. Similarly, information about a TANK that would interest a general would not necessarily help a plumber. In these cases, the terms need to be expanded or differentiated in some way, or else inappropriate signals are likely to be retrieved.

Different words for the same concept or entity (synonyms) cause other problems. A retrieval system with information under AGRICULTURE might yield nothing in a search under FARMING. Straightforward one-to-one relationships can be handled by imposing further control. Typically, one term becomes the "preferred" term and the other term is used only for redirection. "For FARMING see AGRICULTURE." In practice, more difficult to handle are near-synonyms where the overlap is incomplete, e.g., AGRICULTURE and LAND USE. A list of subject headings, or thesaurus, enumerating terms and relationships between them and other terms is used. The process of control is usually known as vocabulary control or authority control.
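The effect of a "see" reference can be sketched in a few lines of Python. The authority file, the catalog, and the book title below are hypothetical and exist only to show the redirection from a non-preferred term to the preferred one; near-synonyms such as LAND USE are deliberately not handled.

```python
# Hypothetical authority file: non-preferred term -> preferred term ("see" references).
SEE = {"FARMING": "AGRICULTURE"}

# Hypothetical subject catalog, filed only under preferred terms (title invented).
CATALOG = {"AGRICULTURE": ["Principles of Agriculture"]}

def search(term):
    """Redirect a non-preferred term to its preferred form, then look it up."""
    heading = term.upper()
    preferred = SEE.get(heading, heading)
    return CATALOG.get(preferred, [])

uncontrolled = CATALOG.get("FARMING", [])  # without the see-reference: nothing retrieved
controlled = search("farming")             # redirected to AGRICULTURE
```

A one-to-one mapping of this kind is the simple case described above; incomplete overlaps between terms would need a richer structure than a single redirection table.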

Another complexity has to do with hierarchical relationships where material on a topic, such as dairying, can be found not only in books specifically on dairying listed under the heading DAIRYING, but also within more general books on agriculture listed under AGRICULTURE. Alternatively, there might be no general work on agriculture but books dealing with aspects of agriculture such as dairying, horticulture, aquaculture, and so on. Retrieval systems vary in the extent and manner in which they contain references to narrower (more specific) topics and to broader (more general) ones.

So far, only simple definitions have been considered. In practice, the topics of inquiry often require some complexity in description. Consider, for example, "TEACHING METHODS IN AUSTRIAN UNIVERSITIES FROM 1918 TO 1939." Natural language indexing systems can handle such combinations ("coordinations") of concepts in either of two ways. In a manual system, it is convenient to compose ("precoordinate") a single (but complex) entry following agreed conventions concerning the order of the aspects (more technically "facets") of the concept. In a traditional library subject catalog one would expect to find a book on this topic listed under a heading of the following form "AUSTRIA, Higher education, 1918-1939. Teaching methods" and probably only under that heading.

If this precoordination were not done, one would need to look at each item listed under AUSTRIA, under HIGHER EDUCATION, under 1918-1939, and under TEACHING METHODS and select only those items which had been entered under all those labels. Technically, this is known as "postcoordination" since the expression of the relationship (the coordination) is done at the time of the inquiry, after the time of indexing. Done by hand, it is obviously laborious.

In computer-based systems, where the computer can take on the chore of doing the "coordinating," it is customary not to express the coordination at the time of indexing. Various appropriate index terms would be assigned. In a postcoordinate search, the searcher enumerates the index terms that are expected to be helpful in identifying desired material, and the computer assumes the task of finding the books which have been assigned whatever combination of terms the searcher has specified. In this example, one might try the following terms:

HIGHER EDUCATION or POST SECONDARY EDUCATION or UNIVERSITIES
TEACHING METHODS
AUSTRIA

One might also be able to specify the historical period though, in this case, it would probably be quite convenient to browse among the items retrieved by the above terms.
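The search just described amounts to combining sets of items: alternatives on one line are joined by OR (union) and the separate lines by AND (intersection). The following Python sketch assumes a hypothetical table of postings; the item numbers and the helper names `any_of` and `all_of` are invented for illustration.

```python
# Hypothetical postings: index term -> identifiers of items assigned that term.
postings = {
    "HIGHER EDUCATION": {1, 2, 5},
    "POST SECONDARY EDUCATION": {3},
    "UNIVERSITIES": {2, 4},
    "TEACHING METHODS": {2, 3, 4, 6},
    "AUSTRIA": {2, 4, 5, 7},
}

def any_of(*terms):
    """Union (OR) of the postings for the given terms; unknown terms contribute nothing."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

def all_of(*sets):
    """Intersection (AND) of the given sets of item identifiers."""
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
    return result

education = any_of("HIGHER EDUCATION", "POST SECONDARY EDUCATION", "UNIVERSITIES")
hits = all_of(education, any_of("TEACHING METHODS"), any_of("AUSTRIA"))
```

With these invented postings, `hits` contains only the items assigned at least one of the education terms and both of the other terms. Relaxing a restriction corresponds to dropping one argument of `all_of`.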

These coordinate relationships can be ambiguous, since the mere occurrence of words may be insufficient to define the topic. A textbook example is the difference between a "venetian blind" and a "blind Venetian." Similarly with the three words "bites," "dog," and "man"—"dog bites man" is not the same syntactical relationship or meaning, nor is it as newsworthy, as "man bites dog."

These relationships can properly be described as syntactical in exactly the same way that syntax denotes grammatical relationships. On the whole, syntactical relationships tend to be ignored in indexing languages, but there are notable exceptions, such as relational indexing. 4
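The loss of syntactical information when only the occurrence of words is recorded can be shown in a brief Python sketch. Treating a phrase as an unordered set of terms is an assumption made here for illustration, not a description of any particular indexing system.

```python
def as_unordered_terms(phrase):
    """Reduce a phrase to the set of words occurring in it, discarding their order."""
    return frozenset(phrase.lower().split())

# Word occurrence alone cannot tell the two statements apart.
same_terms = as_unordered_terms("dog bites man") == as_unordered_terms("man bites dog")

# Keeping the order of the words preserves the distinction.
same_sequence = "dog bites man".split() == "man bites dog".split()
```

Under this assumption `same_terms` is true while `same_sequence` is false, which is the ambiguity the examples above describe.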

An often overlooked attribute of natural language indexing systems is that each system is natural to a particular time, place, and cultural environment. Subject headings used in the British National Bibliography are different from those assigned for a comparable purpose by the U.S. Library of Congress. Special groups of any kind—professional, social, technical, cultural, regional—will tend to use their own particular vocabularies. A natural language indexing system appropriate to one group is likely to be more or less unnatural and inappropriate to another. This limited applicability is also true to a lesser degree in the case of indexing languages with artificial notation in that they too reflect the conceptual arrangements (even though not the language) of a particular group.

Classification schemes with artificial notation

Since they are intended to perform a similar role, it is not surprising that classification schemes with artificial notation tend to have characteristics similar to those of "natural" language systems.

  1. Homographs (words with identical spelling but different meaning) are avoided since the concepts are identified and distinguished prior to the assignment of the labels.
  2. Synonyms should also be avoided since, having established a label for a concept, subsequent labels for the same concept need not be created.
  3. Hierarchical relationships need to be handled in the same sort of way. The relationships are typically embodied in the notation, e.g.,

    942 History of England
    942.7 History of Northwestern England
    942.76 History of Lancashire
    942.769 History of Lancaster

  4. With complex topics such as "teaching methods in Austrian Universities, 1918-1939," the same solutions are available as with "natural" language, e.g. 378.1709436 in the Dewey Decimal Classification. Since most existing classification schemes are designed for human rather than computer use, complex precoordinate labels are used.
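The way hierarchy is embodied in the notation of item 3 can be sketched in Python. The sketch assumes, as holds for the four labels listed there, that dropping the last digit after the decimal point gives the next broader class; it makes no claim about the rest of the Dewey Decimal Classification, and the function name `broader_chain` is invented.

```python
# The four notations and captions given in item 3 above.
LABELS = {
    "942": "History of England",
    "942.7": "History of Northwestern England",
    "942.76": "History of Lancashire",
    "942.769": "History of Lancaster",
}

def broader_chain(notation):
    """List a notation and its broader classes, narrowest first, by dropping trailing decimal digits."""
    chain = [notation]
    while "." in notation:
        notation = notation[:-1].rstrip(".")
        chain.append(notation)
    return chain

chain = broader_chain("942.769")
captions = [LABELS[n] for n in chain]

# Narrower classes share the broader notation as a prefix.
within_northwest = [n for n in LABELS if n.startswith("942.7")]
```

The same prefix relationship that yields broader classes by truncation also allows a search on "942.7" to gather everything filed more specifically beneath it.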

Precoordination, however, is not a necessary characteristic of classification schemes with artificial notation any more than it is with natural language indexing schemes. It is true that most existing schemes are precoordinate, but in recent decades most thinking about classification has been based on analysis into a number of "facets," which, in combination, define the topic concerned. This faceted approach, which is detectable in revisions to existing schemes such as the Universal Decimal Classification (UDC), lends itself to postcoordinate searching by computer, because the elements which have been "precoordinated" remain identifiable and could, therefore, be used for a postcoordinate search also.

Similarities and dissimilarities

As the foregoing discussion illustrates, natural language indexing systems and classification schemes with artificial notation are substantially similar in structure, controls, and uses. A natural language indexing system with a fully worked-out relationship between topics will require the same degree of analysis and control as a classification scheme for the same range of topics. From this perspective, it seems helpful to regard them both as part of the generic category of indexing systems. The difference in notation is just one of the differences between them and not necessarily the most important.

Other differences are as follows:

  1. Since the construction of a classification scheme with an artificial notation implies analysis of the relationships between topics, such schemes ordinarily have a great deal of "syndetic" control, i.e., rules concerning the preferred choice of indexing term and the relationships between terms. The development of syndetic control is not a necessary part of using natural language indexing systems, although it may be desirable. Consequently, the amount of syndetic control varies in practice from essentially none to very detailed control. Intuitively, it seems needed; in practice, there is some question as to whether the benefits justify the costs. 5
  2. A classification scheme with artificial notation will need a natural language index to it—a "relative index"—if it is to be conveniently used.
  3. Classification schemes with artificial notation permit the designer to collocate topics that are similar or related. In natural language indexing systems, the juxtaposition of concepts will depend on accidents of spelling in the language concerned.
  4. The labels used in classification schemes with artificial notation tend to be shorter, though there can be considerable variations in this. For example, "Bibliography of the Economic History of Hungary" would be "330.9439016" in the Dewey Decimal Classification but "ML,E2" in the Bliss Bibliographic Classification.

The foregoing discussion indicates that indexing systems vary in several ways: in fineness of detail; in degree of vocabulary control; in syndetic structure; and in choice between pre- and postcoordination. The principal conclusions from tests of retrieval systems are that, despite these variations and the ingenuity of their designers, there is little difference between them in effectiveness, even though they tend to retrieve different material even when the same inquiries are posed in identical circumstances. 6

Retrieval languages II: Attributes

In the previous sections we restricted our attention to subject indexes using natural languages or artificial notation. This provided a convenient introduction to some semantic and syntactical aspects. However, it is important to emphasize that "subject" access is only one example of retrieval. Documents and data can have "contextual" attributes assigned to them for the purpose of retrieval, such as author, publisher, and date of creation. Indeed, in academic libraries, more use is made of author entries in the catalogs than of subject entries. The principal means of arrangement and approach for archival materials is their administrative provenance. As noted in chapter 6, attributes of authorship and provenance can also have connotations of subject content.

A citation, by means of which an author refers in one document to another document, implies a relationship between the two and constitutes prima facie evidence that if one document is relevant to an inquiry, so might the other be. 7 Citation indexes can be viewed as a form of subject index even though the attribute that forms the basis for retrieval is citedness rather than a description of what the document is about. The argument is that the usual custom in scholarly research is to cite closely related work. Such citation is, therefore, indicative of a close relationship, usually in subject matter. Hence, the citation of one article by another generally implies similar subject matter. Since the description is in the form of a citation, not in the inconstant terminology of subject indexes, a citation index has some advantages when it comes to the use of articles in foreign languages or on subjects without standardized terminology. 8 This is reminiscent of our argument in chapter 6 that a document description can be a surrogate description of a subject inquiry.
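A citation index can be thought of as the reference lists of documents turned inside out. The Python sketch below assumes a hypothetical set of reference lists; the document identifiers and the function name `build_citation_index` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical reference lists: citing document -> documents it cites.
references = {
    "doc-A": ["doc-X", "doc-Y"],
    "doc-B": ["doc-X"],
    "doc-C": ["doc-Y", "doc-Z"],
}

def build_citation_index(references):
    """Invert reference lists into a map from each cited document to the documents citing it."""
    index = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            index[cited].add(citing)
    return dict(index)

citation_index = build_citation_index(references)

# Starting from a known relevant document, find later work that cites it.
related_to_x = citation_index.get("doc-X", set())
```

Note that no subject term appears anywhere in the sketch: the attribute used for retrieval is citedness alone.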

This point has been stressed because of our assertion that retrieval is primarily a linguistic process.

Since retrieval systems are based on indexing "languages" which share (more or less) the attributes of ordinary languages, they derive from the culture of their context and their study can be asserted to be a form of linguistic study. 9 We can note that this assertion assumes that linguistics does or can include the study of "artificial" languages. How far this assertion can be pushed is a matter for debate. It is suggested that all retrieval systems could be included on the grounds that all depend on indexing "languages," that all indexing languages depend on the assignment of attributes and on the labeling of these attributes, and that a proper understanding of these systems depends on semantic, syntactical, and pragmatic analysis. This is more easily understood in the case of "word"-based subject indexes than in others, but we speculate that it is also true of all "non-word" indexes using artificial notation, all contextual indexes, all citation indexes, and even systems based on the statistical association of attributes of data or documents.

Objects, concepts, and definability

So far, we have simply assumed that a "descriptor," whether drawn from natural language or concocted in some artificial notation, can suffice to describe concepts. This is not the case. A descriptor can act as a label. It may also serve as an address. It may define the topic, more or less, but it is unlikely in all cases to provide sufficient definition to describe completely the topic in relation to other topics. The extent to which definition is needed will vary with respect to at least two considerations:

  1. The number of different entities to be described. In a trivial example of an information retrieval system that retrieves data concerning only two or three items (e.g., the number of chairs, tables, and lamps in a warehouse), the need to define will be trivial. In contrast, a chain of furniture shops or a large furniture museum will need detailed definitions to distinguish one type from another.
  2. The definability of the entities concerned. It is clear that not all concepts are equally easily defined. Consider the following sequence of concepts:
    • Seat 34C on flight PA 6 from San Francisco airport on January 29, 1980.
    • An elephant
    • Heat
    • Lassitude

The airline seat is easily and unambiguously defined. An elephant is quite difficult to describe but can be recognized from pictures. Once one knows what an elephant looks like, there is usually little doubt as to whether a particular object is an elephant or not. Heat can only be indirectly illustrated. It cannot be seen directly, though sometimes its causes or effects can be. It can be sensed, however, if one is close enough and it can be defined and measured in physical terms. Lassitude can be sensed and symptoms of lassitude can be observed. Not at all clear, however, is the relationship between lassitude and related concepts such as ennui, tiredness, weariness, exhaustion, boredom, etc. Unlike the aircraft seat, the elephant, and heat, there would appear to be no precise accepted definition of lassitude. If one were to search in an information retrieval system for material about "lassitude" one would probably have to try a variety of search terms, and the items retrieved as a result of the search would probably vary in the extent to which they were about lassitude. 10

Areas of study appear to vary in the extent to which the concepts they deal with are definable. Physical sciences tend to have "hard," i.e., relatively definable concepts such as temperature, molecular weight, size, velocity, etc. This is reasonable since the physical sciences deal with physical objects and the physical relationships between them. The "hardness" of these properties permits relatively easy measurement and calculation. One development can build upon another because the earlier achievements are clearly defined. The progress in achievement in the hard sciences is, in consequence, more palpable than in other areas of activity such as education, literature, political science, or social welfare. 11

The "softer" areas tend to be those which deal with human behavior and social values. 12 Each area of study appears to have a characteristic degree of intellectual hardness/softness. There seems to be variation within areas (e.g., welfare economics seems "softer" than econometrics) and subjects may change in hardness/softness over time. In other words, definability does not appear to be itself a linguistic problem. If it were, then coining new words might solve many difficulties. Rather, there appears to be something more basic that has effects which are difficult to handle linguistically. Bunge comments that "vagueness or blurredness has no positive aspect and is a conceptual rather than a linguistic disease, hence it is rather more difficult to cure." 13

The extent to which formal, logical notation and quantification are used is an indicator of hardness. It is an imperfect one since the use of such notation and quantification is not necessarily appropriate or well rooted in realistic definitions.

The terminology of hardness and softness is potentially misleading. One of the meanings of the word "hard" signifies a firm consistency and its opposite is "soft." In that sense, the imagery of intellectual hardness and softness is appropriate. A different meaning of the word "hard" is "difficult." For that meaning, the proper opposite is "easy" rather than "soft." It is in the second connotation of hardness as an indicator of difficulty that this imagery is misleading with respect to intellectual hardness and softness. The pages of formal notation and algebra which characterize writings in physics and chemistry look particularly unintelligible to the lay person. However, in an important sense it is even more difficult to make progress in fields where definitions are "soft" and unreliable, than when they are relatively "hard" and dependable. The foundations built by previous scholars cannot so easily be taken on trust but may need redefinition. New work may need to be built into rather than onto prior work. However one may view difficulty, let us simply accept that the definability of concepts—the intellectual "hardness" and "softness"—varies from one field of discourse to another.

To the extent to which terms are relatively low in definability, information retrieval is likely to be less satisfactory since retrieval, being a linguistic process, depends heavily on definitions.

In view of the importance of definition in communication, one would expect definability to emerge as fundamental in information studies in general and information retrieval in particular. Storer has referred to the distinction between hard and soft sciences as being possibly "the most powerful single variable in explaining disciplinary differences in the cultural realm." 14 So far, however, research appears to have been limited. Studies of the information gathering habits of social scientists indicate a general similarity with the habits of physical scientists. Hindle has suggested that in patterns in use of books and journals, "softer" subject areas are characterized by much more diffuse reading than "harder" subjects. 15 In other words, the use of documents is more widely spread over different titles and over materials of a wider range of ages.

"Signaling through time" and indirectness

Robert Fairthorne's delightful description of information retrieval as "marking" and "parking" catches nicely the expectation that what is marked and parked may be retrieved after some lapse of time. 16 If that were not the case, then "discarding" or "dumping" would be more appropriate. The same emphasis on the elapse of time as a feature of information retrieval was made more explicitly by Calvin Mooers in a short paper that is said to contain the first use of the term "information retrieval." 17 Mooers describes information retrieval as "communication through time."

The image of information retrieval as communication through time helps explain the lack of direct link between the originator of the message and the recipient. One can readily imagine messages (data, documents) as having hooks (tags, labels, or descriptors) attached to them and then being placed in some timeless void where they remain until an inquiry in the form of a set of one or more hooks reaches into the same void and pulls back any messages which have one or more hooks coinciding with those of the inquiry.
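
The image of hooks can be made concrete with a small sketch. The Python fragment below is illustrative only: the function names `park` and `retrieve`, the list used as the "void," and the sample messages and descriptors are all assumptions introduced here, not features of any system described in this chapter. Messages are parked with a set of hooks, and an inquiry pulls back every message that shares at least one hook with it.

```python
# Illustrative sketch of "marking and parking"; all names and data are hypothetical.

def park(store, message, hooks):
    """Mark a message with hooks (descriptors) and park it in the store."""
    store.append((message, frozenset(hooks)))

def retrieve(store, inquiry_hooks):
    """Pull back every parked message sharing at least one hook with the inquiry."""
    wanted = set(inquiry_hooks)
    return [message for message, hooks in store if hooks & wanted]

store = []  # stands in for the "timeless void"
park(store, "Report on elephant habitats", {"elephants", "ecology"})
park(store, "Airline seat specification", {"aircraft", "seats"})
park(store, "Heat transfer notes", {"heat", "physics"})

# The inquiry may arrive long after the messages were parked.
found = retrieve(store, {"elephants", "physics"})
print(found)
```

Nothing in the sketch links the person who parked a message to the person who later retrieves it; the only contact between them is the coincidence of hooks.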

So far so good. However, as one starts to work around the edges of this definition it becomes a little frayed. In one special case of information retrieval, there is an attempt to minimize the lapse of time. This occurs when every new batch of marked and parked information is automatically searched for material pertaining to specific inquiries. In professional jargon, a standing profile of readers' searches is routinely searched against a file in order to provide S.D.I. (selective dissemination of information). One might regard this as a sort of preemptive information retrieval, although the notion of a standing order would seem closer. Yet even here, some delay, even if minimized, is necessarily present because each of these processes of acquiring, marking, parking, and retrieving must take some time. Even in prompt on-line processing, the indirectness and sequential, discontinuous, two-stage nature of the process necessarily involves some time—even though it might be very little. A different sort of problem emerges from other examples of communication which take time and which do not easily fit accepted definitions of information retrieval. A letter sent through the mail will take time to arrive. (Strictly, all communication processes must take some time even though it may be trivial.) A notice that is posted on a fence, such as "Trespassers will be prosecuted," or on a refrigerator, "Don't drink all the milk," will continue to inform people for as long as it remains posted. A documentary article in a newspaper may consciously be intended to inform readers in posterity.

The simplest conclusion from all this would seem to be that, although information retrieval can properly be regarded as communication through time, it is not the only form of communication in which time may be significant. Delay, one could conclude, is a necessary attribute of information retrieval but not exclusive to it. In information retrieval, the indirectness or discontinuity of communication permits and, indeed, ensures delay over and above the time required for communication itself. Both time and indirectness would seem to be significant in information retrieval and will be considered further in chapter 9.

Time

In some cases, library catalogs compiled more than a century ago are still in active use. The disadvantage of old catalogs is that older cataloging practice differs from contemporary cataloging in two ways. Descriptive cataloging (choice of form of entry for author, title, etc.) has evolved over the years. An extreme example, found in an ancient English library in a book-form catalog printed in 1790 and still in use, is the entry of books by the author "Smith" under the letter F, because the genitive case of the Latin word for a smith (faber) begins with an F. Those whose work requires them to use old catalogs tend to learn how to allow for some of these vagaries. More serious is the shift in terminology for describing things over the years. In all aspects of human activity, new terms are coined and existing words change their meaning. Language evolves. Objects themselves may evolve. Consider, for example, the computer. Its appearance, power, and function have all changed radically in less than half a century. Finding antiquated terminology in library and other sorts of catalogs can be a source of amusement, of irritation, and of failure to retrieve. Inevitably, the use of words reflects the parlance and perspectives of the day. For example, Berman has drawn attention to the use of indexing terms which reflect sexist and racist attitudes which are now less acceptable in the United States than they used to be. 18 From the point of view of information retrieval, the shifting of word usage over time and the evolving of new concepts, objects, and terms (which seem inevitable and behind which retrieval systems seem bound to lag) will mean that the retrieval system itself tends to become less accurate and less effective over time. The already imperfect description and definition, which are pivotal to effective retrieval, get worse.

Ideally, of course, the books should be recataloged continually according to contemporary cataloging practice, especially with respect to the subject headings used to describe what they are about. This requires, however, relatively expensive intellectual labor and the cost and benefits need to be weighed against the alternative uses of such money as is available. Most library users would be properly upset if their librarians ceased to buy new books of current interest in order to divert resources to the recataloging of old books which may be of limited interest.

Indirectness

By the "indirectness" of information retrieval systems, we refer to the characteristic that the designers and operators do not know who will seek the indirect communication that the information retrieval system provides. Not only does one not know who, one also does not know why they will seek to use it or what perspectives and vocabulary they will have when they seek to use it. To some extent, this lack of prior knowledge is shared by other communication systems, notably in mass communication. One is, in general, unable to predict or later ascertain who heard a radio broadcast, read a newspaper, or heard a speech to a crowd—or how much of it they understood—or how beneficial the message was to them. In information retrieval situations, however, there is a further problem and that is that the user—with or without someone else acting as a mediator—needs to define what it is that needs to be retrieved. The vocabulary of the would-be user is necessarily somewhat different from that of the designer of the retrieval system, since no two persons' vocabularies are exactly alike, and may be substantially different especially if the designer (a category within which we include the indexer for present purposes) is distant in time, education, and culture from the would-be user.

This problem of predicting rather than knowing what each future user of the system is likely to ask for and how he or she is likely to ask for it has long been recognized. Indexing and cataloging are, in part, predictive pastimes since the formal description of what data represent and what documents are about has to be modified by estimation of the probability that they will be sought and of the probable ways in which they will be sought. 19 For the most part, this seems to have been assumed more or less implicitly. In analysis, it has been described in terms of "thought experiments." 20

The importance in practice of this indirectness varies according to the circumstances and, we suspect, with the degree of definability. Previously, we have noted three factors as affecting definability:

  1. The range of choices available: limited in most management information systems; unlimited in general libraries and archives.
  2. The definability of the things that might be retrieved: from specific aircraft seats to vague cultural concepts.
  3. The extent to which the individual seeking to use the system can describe what he or she needs to reduce distressing ignorance: from a telephone number to, say, background material on stoicism in modern Western culture.

Time and indirectness would both seem to reduce the closeness of match between the designer and the user in terms of approach to description and definition. Both, therefore, would seem to exacerbate the problems of matching characterized by these three dimensions.

Strictly speaking, it is the fact of indirectness which permits the matching of definitions and it is the fact of time which inhibits adaptation of the system to the user. The user can, heuristically, learn to understand the system better and the system, if computer-based, might be programmed to facilitate this heuristic learning. With either manual or computer-based systems, other humans can, and often do, play a mediating role, as, for example, in doing a literature search on behalf of someone else who is too busy or less familiar with the retrieval systems available. Significantly, ascertaining what the user wants and translating it into a form suitable for the system(s) to be used are both regarded as important processes which not only take time but require special training. We conclude that:

Relatedness, relevance, responsiveness, and retrieval

Introduction

The design, use, and evaluation of retrieval systems depend heavily on various sorts of relatedness. There have been two problems in discussions of this area: (1) the elements and relationships have not always been analyzed in enough detail; and (2) terminology has not always been clear and consistent. In particular, the utilization of retrieved data has not always been adequately distinguished from the retrieval process and the term "relevance" has been loosely used for more than one sort of relatedness. 21 An attempt will be made to clarify the concepts and terminology involved. 22

The mechanism by which retrieval systems operate is the association (usually but not necessarily the matching) of arbitrarily chosen but predetermined attributes of the set of data that is to be susceptible to retrieval. The attributes that are used include authorship (as in a library catalog), date of publication (as, sometimes, in bibliographies), age (as in museum documentation), occurrence of words (as in the searching of texts), and so on. The list of possible attributes that could be used seems endless: location, size, chemical process, origin, etc. Nor need attributes be used alone: systems retrieving bibliographical data, for example, commonly operate on two or more attributes in combination, e.g., authorship, date of publication, language, and subject matter. The retrieval system responds to an inquiry by yielding such data as it finds that are highly associated with the attributes specified in the inquiry.
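
As a minimal sketch of retrieval on attributes used alone and in combination, assume that each record is a plain dictionary of attributes and that an inquiry is a dictionary of required attribute values. The field names, the sample records, and the function names `matches` and `retrieve` are hypothetical. Exact matching is used here only because it is the simplest form of association; a system could equally rank records by degree of association.

```python
# Hypothetical records; every field name and value is invented for illustration.
records = [
    {"author": "Smith, A.", "year": 1950, "language": "en", "subject": "fiction"},
    {"author": "Smith, A.", "year": 1962, "language": "en", "subject": "travel"},
    {"author": "Jones, B.", "year": 1962, "language": "de", "subject": "travel"},
]

def matches(record, inquiry):
    """True if the record carries every attribute value specified in the inquiry."""
    return all(record.get(attr) == value for attr, value in inquiry.items())

def retrieve(records, inquiry):
    """Return the records that fit all attributes specified in the inquiry."""
    return [r for r in records if matches(r, inquiry)]

# One attribute alone, then two attributes in combination.
by_author = retrieve(records, {"author": "Smith, A."})
combined = retrieve(records, {"author": "Smith, A.", "subject": "travel"})
print(len(by_author), len(combined))
```

The same function serves for any attribute the records happen to carry, which reflects the point that subject matter is only one attribute among many.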

"Aboutness"

Writings about retrieval and especially about the evaluation of information retrieval systems have been dominated by just one of the apparently unlimited range of attributes: subject matter, i.e., what documents are about.

The term "aboutness" can be conveniently defined as referring to a coincidence of concepts, that is to say, if a book is "about" Austria, we infer that the book contains concepts that we associate with the subject "Austria." If it did not do so, we should deny that the book is "about" Austria. This is not entirely an objective matter since concepts have to be perceived and there may be some scope for disagreement in the perceptions by different people as to the concepts they perceive in a book and even in the concepts they associate with the subject "Austria" (Austria-Hungary, Austrian Empire, Republic of Austria, etc.). Hence, there is scope for honest difference of opinion as to what a book is about. Consider, for example, an allegory. A person who fails to perceive the allegorical symbols will have a different opinion concerning what the text is "about" than someone who does perceive them. In an extreme case, most persons who saw a book on Buddhist mythology written in Tibetan would be able to perceive so little of the concepts that they would, if honest, have to say that they did not know what the book was about. This is a matter of conceptual perception as well as a linguistic problem. If the book on Buddhist mythology were translated into a language they could understand, then, if the concepts were unfamiliar, they would probably still understand little of what the book was about. It is unwise to assume that a document has a single subject matter. It would seem more sensible to recognize that a document may, by general consensus, be concerned with a particular topic and yet have quite different meanings for particular individuals on particular occasions. 23

In addition to the problems of the recognition of the concepts, there is also the problem of the definability of the concepts. Even with accurate perception, if there is not a rigorous, exclusive, unambiguous use of terms, the defining of the concepts perceived in the book may vary from one reader to another. This might simply be a matter of using alternative and equivalent synonyms. However, to the extent that concepts are not susceptible to description in unambiguous terms—they are vague or "soft"—then it is to be expected that statements by different individuals as to what a book is about, i.e., which set of concepts they perceive to be represented, will vary. Different people will state different nonequivalent definitions as to what the book is about.

The scope for honest disagreement concerning what something is about is important. However, in any given point in linguistic and cultural time and space, there is likely to be a great deal of agreement. 24 If there were not, then subject indexes would not work. On the assumption that indexes are expected to indicate what things are about, one can state that the effectiveness of indexes depends on and is determined, in part, by the degree of uniformity in perception of concepts, and common definition and labeling of those concepts.

Retrieval using the attribute of what documents are about has been and can be expected to be of primary importance since subject access is difficult, useful, and technically interesting. It has dominated so much that it has, perhaps, hindered clarity of thought about the foundations of information retrieval theory. Retrieval by "aboutness" has to be seen as the use of one attribute among many. Our conceptual framework and definitions should be broad enough to include all attributes not just one.

"Utility"

We define "utility," following usage in economics, as benefit accruing. If reading a document and being informed by it leads to an enhanced state of knowledge which enables one to achieve some goal, then we can describe that process as having been beneficial or useful—as having utility.

We need to note two other outcomes: having a harmful effect, sometimes referred to awkwardly as a "negative utility," "disutility," or "disbenefit;" and having no known effect that could be regarded as useful or harmful—no utility or benefit.

Utility is meaningful only in terms of some objective, explicit or implicit. Giving somebody money has the property of utility if becoming wealthy or purchasing things are goals for that person; if that person were trying to achieve spiritual growth through poverty, then the gift of money would be unhelpful—of negative utility. Utility, then, is in all cases dependent on an objective. People have objectives. Inanimate objects do not. Organizations have objectives only to the extent to which individuals and groups have objectives which they seek to achieve through the organization. Hence, utility is dependent not only on an objective but on the objective of one or more persons. Further, objectives imply values. There are values, implicit or explicit, which make one decide that it is desirable to pursue an objective, and a particular objective in preference over other objectives. Utility and objectives both derive from human values.

We are concerned here with information retrieval and the manner in which it might have utility. In principle, it would seem that being informed might assist in any of the objectives that one or more individuals might have: spiritual, physical, intellectual, professional, or social. These objectives might also be hindered. An obvious example of harmful information would be information that was, in fact, misinformation. The objective of getting from San Francisco to New York is likely to be hindered if one receives inaccurate information about the departure time of the airline flight and, as a result, misses the plane. Even so, although inaccurate information is likely to be the major cause of disutility associated with information retrieval, it is not necessary or proper to equate disutility with inaccuracy since it can happen that accurate information can also hinder the achievement of objectives. It is, after all, not absurd for someone to state honestly: "I would never have undertaken that task had I known more about what was involved, but I am glad that I did it!" Further, it is important to remember that if the utility derived from information retrieval can pertain to any human objective, then it is only to be expected that some of these objectives will appear obscure. They might seem irrational to other people. They might be kept a secret. They might lie deep in the subconscious and not be recognized even by the individuals concerned. They can be expected to reflect values of a very private nature as well as publicly proclaimed ones. There may be inconsistencies between professed objectives and those actually pursued. There will probably be conflicting objectives even for one individual.

The elements of retrieval

We have defined the use of retrieval systems as including three distinguishable stages: the formulation of an inquiry; the retrieval of signals; and the topic of the next chapter—the utilization of what has been retrieved. The effectiveness of each process can vary:

From the analysis thus far there emerge several possibilities for things to be related to each other—or to have degrees of relatedness. These include but are not limited to:

  1. the inquiry as formulated for the retrieval system;
  2. any of a seemingly unlimited range of attributes;
  3. data retrieved; and
  4. benefit to the user.

Before considering sorts of relatedness, it is important to emphasize again the separateness of the retrieval process from the processes of formulation and of utilization. The difference between retrieval and utilization can be conveniently illustrated by what we might call the case of the disappearing user. Let us imagine that someone formulated and posed an inquiry concerning chocolate, cholesterol, and heart disease to a computer-based retrieval system. The retrieval system responds by yielding a set of data. The user becomes better informed as a result of perusing the data and benefits from a changed state of knowledge. Let us now imagine that, having posed the inquiry, the inquirer loses interest, is unable to await the response, or dies from a heart attack. In this latter scenario, there is no opportunity for utilization of the data, nor, therefore, for benefit to the user. Yet the retrieval system has performed in exactly the same way. The process of retrieval and the data retrieved are indistinguishable, in fact unchanged, from one scenario to another.

We can clarify the distinction between formulation and retrieval by extending this simple case. In some circumstances, the user may modify the formulation of the inquiry if it is thought that the data yielded would not be what is desired. In this case, what has happened is that a different search has been formulated, however slight the modification has been. Commonly, the user's knowledge has changed as a result of preliminary indications concerning the set of data that would be yielded. The response by the retrieval system to any given formulated search will not have been changed unless the retrieval system itself has also been altered in some way, e.g., new data added or the indexing modified.

Relatedness

In the evaluation of information retrieval systems, the term "relevance" has been loosely used to denote differing forms of relatedness. 26 A practical approach is to define the most useful relationships and degrees of relatedness first and then give them distinctive names. We shall consider three different relationships for each of which the terms "relevant" or "relevance" have been used.

Relatedness I: Responsiveness

Responsiveness refers to relatedness of the data retrieved to the inquiry as posed in terms of the attributes used as the basis for retrieval. To what extent did the system retrieve all and only the works that it contains by Mark Twain? The quality of the response to the inquiry by the retrieval system will be affected by several factors including the appropriateness and completeness of the data base, the suitability of the attribute(s) used as the basis for retrieval (in this case authorship), and the ability of the retrieval system to identify those data that fit the description offered by the inquiry—or fit it to the desired degree. If one wished to avoid talking of the "relevance" of the retrieved data to the inquiry, one might speak of the responsiveness of the system.
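
The "all and only" question can be expressed as two ratios: the share of fitting items that were retrieved ("all") and the share of retrieved items that fit ("only"). These correspond to the measures usually called recall and precision. The sketch below is an illustration under stated assumptions rather than a formula taken from this chapter: the function name `responsiveness`, the item identifiers, and the convention of returning 1.0 when a set is empty are all choices made here.

```python
def responsiveness(retrieved, fitting):
    """Return (all_ratio, only_ratio) for a set of retrieved item ids.

    fitting:    ids of items in the collection that fit the inquiry's description.
    all_ratio:  share of fitting items that were retrieved ("all").
    only_ratio: share of retrieved items that fit ("only").
    Empty sets yield 1.0 by convention (an assumption of this sketch).
    """
    retrieved, fitting = set(retrieved), set(fitting)
    hits = retrieved & fitting
    all_ratio = len(hits) / len(fitting) if fitting else 1.0
    only_ratio = len(hits) / len(retrieved) if retrieved else 1.0
    return all_ratio, only_ratio

# Hypothetical ids: the collection holds four works by the author sought;
# the system returns three of them plus one work by someone else.
fitting = {"t1", "t2", "t3", "t4"}
retrieved = {"t1", "t2", "t3", "x9"}
print(responsiveness(retrieved, fitting))  # (0.75, 0.75)
```

Both ratios are computed from the inquiry and the collection alone. Nothing about the user's objectives or state of knowledge enters the calculation.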

Relatedness II: Pertinence

In Relatedness I above, we were concerned with the general term: the relatedness of the retrieval system's response to the inquiry regardless of the attribute(s) being used as a basis for retrieval. We now consider one special case within the general class: when the attribute used as the basis for retrieval is topicality—the subject matter of the data. In ordinary speech one might well speak of one topic as being relevant to another topic. Such relationships (e.g., general to specific, overlap) can be difficult to understand or to define. For example, when retrieving by the attribute of topicality, data on "Freud" are related (relevant) but not identical to the topic of psychoanalysis. We might term this relatedness "pertinence." This sort of relatedness between properties of data within a given attribute could exist with other attributes than topicality.

Relatedness III: Beneficiality

A relationship that is entirely different from either of the above is that between the retrieved data and the benefit of the user. It is in this sense, for example, that Wilson has sought to limit the use of the term "relevance" in Two Kinds of Power 27 and it is implicit in all discussions of utility-theoretic indexing. Stated simply, it is assumed that retrieval systems are provided and used in order that their utilization will have beneficial effects. Social values are implied. This raises two questions: (1) Whose values? The users' or those of the providers of the service? and (2) Are we referring to actual benefits or expected benefits? These are critical questions. However, whatever answers are given, it is clear that the relationship is different in kind from both responsiveness and pertinence because factors external to the retrieval system affect the outcome: social values and the knowledge and cognitive skills of the users.

Implications for information retrieval

We have defined responsiveness as signifying the extent to which the retrieval system yields data associated with the attribute(s) specified in the formulated inquiry. We have defined beneficiality as the property of assisting in the achievement of objectives. We have noted that these objectives are necessarily the objectives of human beings and relate to human values even though they may sometimes be obscure and even seem irrational to other people. From this discussion the following conclusions would seem to follow.

An ideal information retrieval system would retrieve data and documents that would assist individuals in the pursuit of their objectives, i.e., values. This implies that the information retrieval system should be concerned with the utility of what is retrieved rather than what it is about, since utility, not aboutness, is the goal. In order to achieve utility, the ideal information retrieval system would need to know the objective(s) and value(s) of each user.

Since redundant information does not help achieve goals and may hinder their achievement, the information retrieval system would, ideally, also have to know the state of knowledge of the inquirer—both its extent and its limitations. Avoiding the retrieval of unneeded and unusable data is, after all, a major purpose of information retrieval systems. If a researcher sought material relevant to an inquiry concerning Freud, it is unlikely that it would be helpful to retrieve a document that had been written by that same researcher, even though it may be related to the subject of the inquiry.

It is not practical by any known technique to expect to know all of the objectives of people currently using an information retrieval system. It is still less reasonable to expect to be able to predict what future users' objectives and values might be. Even if one could, the most useful set of retrieved documents is likely to be unique for each inquiry. Further, since each person's mind is unique, even objectives that are ostensibly the same for different people may be different in practice. What is more, since redundant information is to be avoided, subsequent inquiries concerned with the same person's objective would call for different responses since the individual's state of knowledge will probably have changed in the meanwhile. Some things will have been learned, others forgotten. Therefore, an ideal information retrieval system based on utility would have a formidable set of design requirements. It would need to understand objectives that present inquirers might not be willing to admit to or might not consciously understand fully themselves; it would need to predict which persons might use the system in the future and what their objectives might be in that future; and it would in each case need to know not only what the individual's objective is but also that same individual's state of knowledge not at the point in time that the inquiry was made but at the point in time that the data are retrieved. This would have to be true for each and every individual who may come to use the system. What should be retrieved should vary even for different posings of the same question by the same person. All this is quite apart from the fact that the concepts and definitions used may be more or less ambiguous.

Although such an ideal system would seem to be what is needed, the compounding of inherently improbable achievements one upon another means that this ideal system is most likely to remain an inspiring but unrealized achievement. (Not that this might not be helpful. Witness the repeated homage to the inspirational role played by Vannevar Bush's seminal essay, "As We May Think," published in 1945.) 28

The notion of imagining what future inquiries might be posed may very well be a useful device (cf. W. Cooper 29), but the combination of needing to know objectives, values, and future states of knowledge—including knowledge not yet known to anybody—casts grave doubt on the achievability of such an ideal system.

In the discussion of "aboutness" above, reasons were adduced as to why complete agreement is not to be expected concerning what documents are about. Nevertheless, within a given cultural and linguistic context, considerable consensus is likely. Indexing and retrieving books according to what they are currently perceived to be about is, therefore, a more practical matter than indexing them in relation to potential future inquiries.

If indexing and retrieval based on aboutness is more practical than indexing with respect to predictions of future inquiries and future knowledge, how does aboutness relate to utility? The answer would appear to be two-fold:

  1. If we hold relentlessly to the importance of utility—of being beneficially informed—we can still regard aboutness as a sensible predictor of utility to the inquirer. If we seek to reduce our ignorance about Freud, a document about Freud is likely to make us more fully informed about Freud. Further, although this process does nothing to minimize the retrieval of knowledge that we already know, this redundancy is at worst inefficient rather than misleading since we can, presumably, ourselves filter out subsequently that which is already known to us.
  2. In our discussion, we have tended to assume that the attribute used as a basis for retrieval would be its subject aboutness. Although a subject retrieval system could operate in isolation, this is a singularly unrealistic and unnecessary assumption, since constant use is in fact made of other attributes for identifying documents which might be useful. A few examples will demonstrate: Contextual attributes such as author, origin, date of creation, and extensions of these can be helpful. Indeed, in the case of archives, the principal means of arrangement and approach for documents is their administrative provenance. In libraries, a common mode of approach is to use the author's name as a means of identifying books on a subject on which it is thought that person might have written. Citations, by means of which an author refers in a document to another document, imply a relationship between the documents—a prima facie indicator that if one document is related to an inquiry, so might the other. The success of the various citation indexes is clear evidence of their value in supplementing subject indexes. Further, a part of the formal information system is the information specialist who operates and may have designed it. Users of archives depend heavily on the archivist for guidance. Librarians have always included in their role the drawing on their experience with bibliography, with their collections and with their users in order to assist users (cf. the "reference interview"). In addition, it is foolish not to include as part of this picture the informal assistance provided by friends and colleagues who can and do play a significant role in scholarship and in bureaucracies, to give but two examples.

In brief, information retrieval based on what documents are deemed to be about (as opposed to prediction of what is unknown in relation to future inquiries) can be expected to work moderately well in practice because there is more or less consensus on what documents are about, because "aboutness" can plausibly be regarded as predictive of utility, and because, in practice, subject indexing is supplemented by other indicators of probable utility.

Information retrieval based on "aboutness" lends itself to automatic indexing since the occurrence and more especially the co-occurrence of terms in the text can indicate the apparent subject content of the document. This may be expensive and error-prone, but results likely to be of some use can be achieved.
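
As a minimal sketch of how the occurrence and co-occurrence of terms might suggest the apparent subject content of a text, the following Python counts terms and same-sentence term pairs. The stop list, the choice of the sentence as the co-occurrence window, and all function names are illustrative assumptions and not a method described in this chapter.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop list; a working system would need a far larger one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "on", "his"}


def tokenize(text):
    """Lowercase the text and return its non-stop-word terms in order."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_counts(text):
    """Count how often each term occurs in the text."""
    return Counter(tokenize(text))


def cooccurrence_counts(text):
    """Count pairs of distinct terms that appear in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text):
        terms = sorted(set(tokenize(sentence)))
        pairs.update(combinations(terms, 2))
    return pairs


def suggest_index_terms(text, n=3):
    """Propose the n most frequent terms as candidate index terms."""
    return [term for term, _ in term_counts(text).most_common(n)]
```

Counts of this kind indicate only the apparent subject of a text. Whether they predict utility to a given inquirer is a separate question, as the preceding discussion makes clear.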

The very same reasons which make the ideal information retrieval system unlikely also make implausible the concept of an "information counselor" in any sophisticated sense. Any person knowledgeable about sources of information can, in general, be helpful. However, the notion of an information counselor based on an analogy with a dietician, who can diagnose and prescribe information like a change of diet, would have to cope with the same problems as would the ideal information system: in addition to understanding objectives which the inquirer may imperfectly comprehend, the counselor would also need to understand the extent and nature of the inquirer's knowledge, and presumably be able to identify the point at which the inquirer has been beneficially informed to a sufficient extent. The analogy with a dietician would be more apt if states of knowledge could be objectively assessed by blood count, encephalograms, and the like.

Summary

The foregoing discussion leads to the following conclusions:

  1. The three sorts of relatedness—responsiveness, pertinence, and beneficiality—are different.
  2. Since beneficiality (Relatedness III) is rooted in the utilization of retrieved data and in human values, the superficially startling conclusion follows that relevance in the sense of utility cannot properly be used to evaluate information retrieval processes. In this, we are using a strict, narrow definition of retrieval: the ability of the system to yield a set of data responsive to the formulated inquiry. The proper basis for evaluation would be rooted in the responsiveness of the system in terms of the inquiry and the attributes used: "fitting the description" in Wilson's terms. 30 Any use of beneficial effects cannot be an evaluation of retrieval processes only but must be either an evaluation of utilization or some combination of utilization and other processes (e.g., utilization and retrieval and, possibly, formulation, and even other logically prior activities such as identification of need) or concurrent activities (e.g., improvement of cognitive skills in order to improve utilization). 31 Whoever uses benefit or utility in information retrieval evaluation should specify the boundaries of what is being evaluated.
  3. It would be helpful if new, distinct terms were to be adopted to distinguish different sorts of relatedness ("relevance"). Confusion might be reduced. Ideally, the terms should be applicable generally to sorts of relatedness and not restricted to specific examples of retrieval activities (e.g., subject-based retrieval as opposed to retrieval on the basis of other attributes) unless such narrower usage is justified and clearly stated.
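
To make the second conclusion concrete, here is a minimal sketch of an evaluation based purely on responsiveness: a record counts as correct if it "fits the description" given in the formulated inquiry, with no reference to whether the inquirer benefits. The representation of records and inquiries as attribute dictionaries, the function names, and the use of precision-like and recall-like ratios are assumptions made for illustration.

```python
def fits_description(record, query):
    """True if the record matches every attribute stated in the query."""
    return all(record.get(attr) == value for attr, value in query.items())


def responsiveness(retrieved, collection, query):
    """Return (precision, recall) judged only against the formulated query.

    retrieved:  list of records the system returned (assumed drawn from collection)
    collection: list of all records the system holds
    query:      dict of attribute -> required value
    """
    should_retrieve = [r for r in collection if fits_description(r, query)]
    hits = [r for r in retrieved if fits_description(r, query)]
    found = [r for r in should_retrieve if r in retrieved]
    # With nothing retrieved, or nothing to retrieve, treat the ratio as perfect.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(found) / len(should_retrieve) if should_retrieve else 1.0
    return precision, recall
```

Nothing in this calculation refers to the inquirer's state of knowledge or to any benefit obtained. Measuring those would mean evaluating utilization as well as retrieval, which is the boundary the conclusion asks evaluators to state.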

We shall reconsider the distinction between responsiveness and beneficiality further in our final chapter.

Requisite variety

We can speculate on the consequences of failure to find items. For this it is helpful to invoke a principle from cybernetics—the law of requisite variety—which states, in effect, that a system must have as many different responses as it encounters challenges. Otherwise, the system will fail for lack of being able to respond. In the context of retrieval systems, one might say that, if something appropriate is not retrieved for each and every inquiry, the system will fail. So it will if, for example:

It follows that requisite variety can be increased by attending to one or more of these causes. A more complete collection of retrievable objects (e.g., a bigger collection of appropriate books) will help. (We define completeness as being complete with respect to some range of inquiries.) A wider range of index terms would also help, provided, of course, they lead to appropriate retrieval, as would more extensive and more reliable connections between index terms and retrievable objects. 32
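
The three remedies just mentioned suggest three points at which a system can lack requisite variety. The following is a minimal sketch, assuming a hypothetical layout in which the index is a mapping from index terms to sets of object identifiers and the collection is a set of identifiers actually held. The failure categories are inferred from the paragraph above and are not a formal taxonomy from the source.

```python
def diagnose(inquiry_term, index, collection):
    """Return ("ok", held_ids) or ("fail", reason) for one inquiry term.

    index:      dict mapping index terms to sets of object ids (assumed layout)
    collection: set of object ids actually held
    """
    if inquiry_term not in index:
        return ("fail", "no matching index term")
    linked = index[inquiry_term]
    if not linked:
        return ("fail", "index term linked to no objects")
    held = linked & collection
    if not held:
        return ("fail", "linked objects not in collection")
    return ("ok", held)


def shortfalls(inquiries, index, collection):
    """Map each failing inquiry term to the reason it failed."""
    results = {}
    for term in inquiries:
        status, detail = diagnose(term, index, collection)
        if status == "fail":
            results[term] = detail
    return results
```

Each reason points to a different remedy: more index terms, more reliable links between terms and objects, or a more complete collection.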

What are the consequences of failure? They would appear to be two-fold:

Competence to use retrieval systems

We have assumed, hitherto, that those who would use retrieval systems are fully competent to do so. This is a most unwise assumption for the following reasons:

  1. There are many different retrieval systems. Consider the fact that Sheehy's Guide to Reference Books describes more than 10,000 different bibliographies and other works of reference. 33
  2. Typically, several retrieval systems are likely to contain material about a given topic, and many others may well contain at least a little material relating to that topic.
  3. Each system is more or less different from the others, even though the differences may sometimes be small and subtle. Further, retrieval systems commonly change. Published works may have new editions and computer-based retrieval systems are continuously being modified.

These are the reasons why a good reference librarian not only knows many reference sources, but also has the familiarity and understanding that come from frequent use, and keeps up-to-date. 34 It is inconceivable that one could know too many sources, could be too familiar with them, or would be unnecessarily up-to-date. This is clearly a major challenge even for the dedicated professional information specialist.

What, then, of the user who is not a professional reference librarian or information specialist? There are, of course, some exceptional individuals, but the general situation is entirely predictable:

  1. There is some vague (but generally incomplete) awareness that libraries have considerable potential for the retrieval of information.
  2. The number of retrieval systems known to any given user is likely to be small.
  3. The expertise that can come only from conscious attention to the complexities of the system is likely to be lacking in most cases.

It is difficult and unreasonable to imagine any circumstances in which this situation could be expected to be otherwise, since such expertise requires opportunity, time, and effort—and not just once but on a continuing basis. It is not at all clear that for everyone the benefits involved in being expert justify paying the price.

The consequence is that the use of retrieval systems is bound to be far less in both frequency and effectiveness of use than is possible and beneficial. But to assert that this is "wrong" would be to forget the price involved (mostly nonmonetary) and to overlook the fact that the use of retrieval systems is a means, not an end. More use of libraries can sensibly be expected to follow—and only to follow—changes in the perceived benefits and perceived costs. What then could or should be done? Three sorts of practical activities would seem sensible:

  1. Greater awareness of the existence and of the potential usefulness of retrieval systems would permit consideration by the user of more use of them.
  2. A lower price—primarily, greater ease of use—can be expected to result in an increase of use.
  3. More effectiveness in retrieval systems should also help since that should increase the benefit. Yet caution is in order, since use will follow cost-effectiveness as perceived by the user. (Because we are discussing use, we deliberately listed ease ahead of effectiveness.)

Some of the price (effort) can be reduced by using a competent intermediary (a reference librarian or information specialist). However, it is easily overlooked that using an intermediary may increase the price somewhat unless the user trusts and is accustomed to using the intermediary. Otherwise, for most people, a psychological effort and a change of habit—hence, a price—may be involved in asking for help. The style and demeanor of the reference librarian can increase or decrease the perceived price involved.

In this rather intangible area, the interpersonal skills of librarians become important. So, too, does instruction (formal or otherwise) in library skills. 35 Major problems include providing motivation and developing instruction that is good enough. In a university context, two things help: a favorable attitude on the part of faculty in the students' area of study, and arrangements for bibliographical instruction to count for credit toward the degree. Pitching the instruction at the correct level and relating the learning experience to the user's personal interest both appear to be important. Unfortunately, skill as a librarian does not, in and of itself, guarantee skill as an instructor.

In the next chapter we shall discuss the use of the data and documents that have been retrieved.

 Go to Chapter 9


1 G. Jahoda and J. S. Braunagel, The Librarian and Reference Queries: A Systematic Approach (New York: Academic Press, 1980); E. Z. Jennerich, "Before the Answer: Evaluating the Reference Process," RQ 19 (Summer 1980): 360-6; G. Jahoda and P. E. Olson, "Analyzing the Reference Process," RQ 12 (Winter 1972): 148-56; M. T. Lynch, "Reference Interviews in Public Libraries," Library Quarterly 48 (April 1978): 119-42; G. B. King, "Open and Closed Questions: The Reference Interview," RQ 12 (Winter 1972): 157-60; M. E. Murfin and L. R. Wynar, Reference Service: An Annotated Bibliographical Guide (Littleton, Colo.: Libraries Unlimited, 1977), esp. chapter 9. Barnes found very superficial levels of interaction in practice: M. Barnes, "Staff/User Interaction in Public Libraries: A Non-Participation Observation Study," in Information in Society, edited by M. Barnes et al. (Leeds: School of Librarianship, Leeds Polytechnic, 1981): 61-81. Question negotiation can have adverse effects: F. W. Lancaster, "MEDLARS: Report on the Evaluation of Operating Efficiency," American Documentation 20, no. 2 (April 1969): 119-42.

2 N. Shosid, "Problematic Interaction: The Reference Encounter," in Varieties of Work Experience, edited by P. L. Stewart and M. G. Cantor (New York: Schenkman, 1974), pp. 224-37. See also R. M. Harris and B. G. Michell, "The Social Context of Reference Work: Assessing the Effects of Gender and Communication Skill on Observers' Judgements of Competence," Library and Information Science 8, no. 1 (January-March 1986): 85-101.

3 The present discussion isolates a few aspects. For a fuller treatment, see L. M. Chan, Cataloging and Classification: An Introduction (New York: McGraw-Hill, 1981); R. Hagler and P. Simmons, The Bibliographic Record and Information Technology (Chicago: American Library Association, 1982); J. E. Rowley, Organizing Knowledge: An Introduction to Information Retrieval (Aldershot, England: Gower, 1987); and B. S. Wynar, Introduction to Cataloging and Classification, 7th ed., edited by A. G. Taylor. (Littleton, Colo.: Libraries Unlimited, 1985). For the more specialized topics of information retrieval theory and testing see G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (New York: McGraw-Hill, 1983); Information Retrieval Experiment, edited by K. Sparck Jones (London: Butterworths, 1981); Information Retrieval Research, edited by R. N. Oddy et al. (London: Butterworths, 1981). For a detailed treatment of indexing and classification systems as "documentary languages" see W. J. Hutchins, Languages of Indexing and Classification: A Linguistic Study of Structures and Functions (Stevenage, England: Peter Peregrinus, 1975).

4 J. Farradane, "Semantic Analysis. Farradane's Relational Indexing System," Journal of Information Science 1, no. 5 (January 1980): 267-76; 1, no. 6 (March 1980): 313-24. "International Symposium on Relational Factors in Classification, University of Maryland, 8-11 June 1966. Proceedings," edited by J. M. Perrault, in Information Storage and Retrieval 3, no. 4 (December 1967): 147-410.

5 "The results of Cranfield II were rather unexpected because, taking both recall and precision into account, the index languages performing best used uncontrolled single words, that is, they were natural-language systems, such as Uniterms, based on words occurring in document texts." Lancaster, Information Retrieval Systems, p. 275. Cf. E. M. Keen and J. Digger, Report of an Information Science Index Languages Test (Aberystwyth, Wales: College of Librarianship, 1972), Part 1, pp. 166-7.

6 E. M. Keen, "Laboratory Tests of Manual Systems," in Information Retrieval Experiment, edited by K. Sparck Jones (London: Butterworths, 1981), pp. 136-55.

7 For a study of the significance of citation see T. L. Hodges, Forward Citation Indexing: Its Potential for Bibliographical Control (Ph.D. dissertation, University of California, Berkeley, School of Librarianship, 1972. University Microfilms order no. BGD73-16787); B. Cronin, The Citation Process: The Role and Significance of Citations in Scientific Communication (London: Taylor Graham, 1984); T. A. Brooks, "Evidence of Complex Citer Motivations," Journal of the American Society for Information Science 37, no. 1 (January 1986): 34-36.

8 "For example, suppose you want information on the physics of simple fluids. The simple citation 'Fisher, M. E., Math. Phys., 5, 944, 1964' would lead the searcher directly to a list of papers ... a significant percentage of the citing papers are likely to be relevant. ... In other words, the citation is a precise, unambiguous representation of a subject that requires no interpretation and is immune to changes in terminology." E. Garfield, Citation Indexing—Its Theory and Application in Science, Technology, and Humanities (New York: Wiley, 1979), p. 3.

9 C. Beghtol, "Bibliographic Classification Theory and Text Linguistics: Aboutness Analysis, Intertextuality and the Cognitive Act of Classifying Documents," Journal of Documentation 42, no. 2 (June 1986): 84-113.

10 Cf. Fugmann's Axiom of Definability: "The compilation of information relevant to a topic can be delegated only to the extent to which an inquirer can define the topic in terms of concepts and concept relations," R. Fugmann, "The Five-Axiom Theory of Indexing and Information Supply," Journal of the American Society for Information Science 36, no. 2 (March 1985): 116-29.

11 Note also the distinction between "uncertainty," which ordinarily decreases with more information, and "equivocality" or "ambiguity," characterized by multiple and conflicting interpretations and less easily remedied by more information; cf. R. L. Daft et al., "Message Equivocality, Media Selection, and Manager Performance: Implications for Information Systems," MIS Quarterly 11, no. 3 (September 1987): 355-66.

12 It is sometimes argued that "soft" disciplines are immature disciplines that have not yet developed to a mature, "hard" state. It seems as likely that "soft" disciplines are, at least to some extent, inherently and incorrigibly "soft" by the nature of their concerns.

13 M. Bunge, Scientific Research 1: The Search for System (Berlin: Springer, 1967), pp. 97-8.

14 N. W. Storer, "Relations Among Scientific Disciplines," in The Social Contexts of Research, edited by S. Z. Nagi and R. G. Corwin (London: Wiley Interscience, 1972), pp. 229-68, on p. 239.

15 M. K. Buckland, "Are Obsolescence and Scattering Related?" Journal of Documentation 28, no. 3 (September 1972): 242-6; see pp. 244-5.

16 "... all retrieval systems demand marks of some kind. ... An object can be marked by changing it intrinsically in some recognizable way—as by painting it, punching a hole, or introducing it to a skunk. This I call 'inscribing.' Or it can be changed relative to its environment by putting it upside down, on one side, in an inscribed pigeon-hole, and so forth. This is called 'ordering the item.' Better terms for less formal contexts are 'marking' and 'parking'!" R. A. Fairthorne, "The Patterns of Retrieval," American Documentation 7, no. 2 (April 1956): 65-70. (Reprinted in R. A. Fairthorne, Towards Information Retrieval (London: Butterworths, 1961).)

17 C. Mooers, "Information Retrieval Viewed as Temporal Signalling," in International Congress of Mathematicians. Cambridge, Mass., 1950. Proceedings (Providence, R.I.: American Mathematical Society, 1951), vol. 1, pp. 572-3.

18 S. Berman, Prejudices and Antipathies: A Tract on the LC Subject Heads Concerning People (Metuchen, N.J.: Scarecrow Press, 1971). See also J. K. Marshall, On Equal Terms: A Thesaurus for Nonsexist Indexing and Cataloging (New York: Neal-Schuman, 1977).

19 "Among the several possible methods of attaining the objects, other things being equal, choose that entry (1) that will probably be first looked under by the class of people who use the library. ... " C. A. Cutter, Rules for a Dictionary Catalog, 4th ed. (Washington, D.C.: Government Printing Office, 1904), p. 12. This issue recurs in discussions of "probabilistic indexing": M. E. Maron and J. L. Kuhns, "On Relevance, Probabilistic Indexing and Information Retrieval," Journal of the Association for Computing Machinery 7, no. 3 (1960): 216-44; A. Bookstein, "Probability and Fuzzy-Set Applications to Information Retrieval," Annual Review of Information Science and Technology 20 (1985): 117-51.

20 W. S. Cooper, "Indexing Experiments by Gedanken Experimentation," Journal of the American Society for Information Science 29, no. 3 (May 1978): 107-19.

21 For a convenient introduction to the literature concerning relevance, see A. Bookstein, "Relevance," Journal of the American Society for Information Science 30, no. 5 (September 1979): 269-73; also T. Saracevic, "The Concept of 'Relevance' in Information Science: A Historical Review," in Introduction to Information Science, compiled by T. Saracevic (New York: Bowker, 1971), pp. 111-51.

22 For a similar discussion see M. K. Buckland, "Relatedness, Relevance, and Responsiveness in Retrieval Systems," Information Processing and Management 18, no. 4 (1983): 237-41.

23 A library user who was studying Third World dialects of English complained that a book on politics by an African politician had been classified in the politics section when "It is African use of English." The library had acquired the book specifically to support this user's research. See Beghtol, "Bibliographic Classification Theory and Text Linguistics: Aboutness Analysis, Intertextuality and the Cognitive Act of Classifying Documents," Journal of Documentation 42, no. 2 (June 1986): 84-113.

24 L. E. Leonard, Inter-Indexer Consistency Studies, 1954-1975 (Occasional Papers, 131) (Urbana, Ill.: University of Illinois, Graduate School of Library Science, 1977); K. Markey, "Interindexer Consistency Tests: A Literature Review and Report of a Test of Consistency in Indexing Visual Materials," Library and Information Science Research 6, no. 2 (April-June 1984): 155-77.

25 For a systematic review see N. J. Belkin and W. B. Croft, "Retrieval Techniques," Annual Review of Information Science and Technology 22 (1987): 109-45.

26 P. G. Wilson, Two Kinds of Power: An Essay on Bibliographical Control (Berkeley: University of California Press, 1968).

27 Ibid.

28 V. Bush, "As We May Think," Atlantic Monthly 176 (July 1945): 101-3.

29 Cf., Cooper, "Indexing Experiments."

30 Wilson, Two Kinds of Power.

31 The difficulties encountered in trying to use beneficiality where responsiveness rather than beneficiality is appropriate are illustrated in a recent discussion of the retrieval of data that are relevant (i.e., beneficial) but not topical. "The material is relevant but not topical. However to expect a retrieval system to respond to an unexpressed need seems a harsh requirement indeed. ... The instance of the relevant but untopical document is unlikely to occur in any test of relevant documents used for system evaluation. ... While it is certainly the case that topicality is not a necessary condition for relevance, it seems that we may comfortably treat it as such without great loss." B. Boyce, "Beyond Topicality: A Two Stage View of Relevance and the Retrieval Process," Information Processing and Management 18, no. 3 (1982): 105-9. See Fugmann, "The Five-Axiom Theory of Indexing and Information Supply," Journal of the American Society for Information Science 36, no. 2 (March 1985): 116-29.

32 Cf. H. Wellisch, "The Cybernetics of Bibliographical Control: Toward a Theory of Document Retrieval Systems," in Theory and Application of Information Research, edited by O. Harbo (London: Mansell, 1980), pp. 82-100.

33 E. P. Sheehy, Guide to Reference Books, 10th ed. (Chicago: American Library Association, 1986).

34 Lytle, in a study of retrieval from archives, found a significant relationship between search effectiveness and the searcher's familiarity with the retrieval technique. R. H. Lytle, "Intellectual Access to Archives: II. Report of an Experiment Comparing Provenance and Content Indexing Methods of Subject Retrieval," American Archivist 43, no. 2 (Spring 1980): 191-207. "A searcher experienced in the P method achieved good results with it, as did a searcher experienced in CI using that method," p. 194.

35 There is a large literature on instruction in library use. See, for example, P. J. Taylor, C. Harris, and D. Clark, The Education of Users of Library and Information Services: An International Bibliography, 1926-76 (Aslib bibliography, 9) (London: Aslib, 1979); A. K. Beaubien, S. A. Hogan, and M. W. George, Learning the Library: Concepts and Methods for Effective Bibliographical Instruction (New York: Bowker, 1982); C. Oberman and K. Strauch, Theories of Bibliographic Education: Designs for Teaching (New York: Bowker, 1982); A. F. Roberts, Library Instruction for Librarians (Littleton, Colo.: Libraries Unlimited, 1982); I. Malley, The Basics of Information Skills Teaching (London: Bingley, 1984).


Copyright © 1988, 1999 Michael K. Buckland.
Document maintained at http://sunsite.berkeley.edu/Literature/Library/Services/chapter8.html by the SunSITE Manager.
Last update February 26, 1999. SunSITE Manager: manager@sunsite.berkeley.edu