LIS 512: Introduction to Knowledge Organisation
Indexing, Controlled Vocabularies and Thesauri
Course website: http://myweb.liu.edu/~mkipp/512/
Readings:
Taylor, Arlene G. 2004. The Organization of Information. 2nd ed. Westport, Conn.: Libraries Unlimited. Chapter 10.
Hammond, Tony; Hannay, Timo; Lund, Ben and Scott, Joanna. 2005. Social Bookmarking Tools (I): A General Review. D-Lib Magazine 11(4). http://www.dlib.org/dlib/april05/hammond/04hammond.html
Leise, Fred; Fast, Karl, and Steckel, Mike. 2002. All About Facets and Controlled Vocabularies [series of short articles]
http://www.boxesandarrows.com/view/all_about_facets_controlled_vocabularies
http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary_
http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary
http://www.boxesandarrows.com/view/synonym_rings_and_authority_files
http://www.boxesandarrows.com/view/controlled_vocabularies_a_glosso_thesaurus
Kipp, Margaret E.I. 2005. Complementary or Discrete Contexts in Online Indexing: A Comparison of User, Creator, and Intermediary Keywords. Canadian Journal of Information and Library Science 29(4). Preprint available from http://dlist.sir.arizona.edu/1533/.
Further Readings:
Cleveland, D.B. & Cleveland, A.D. (2001). Vocabulary Control (pp.35-47). In Introduction to indexing and abstracting. Englewood, CO: Libraries Unlimited.
Lancaster, F. W. 1986. Vocabulary control for information retrieval. Arlington, Va: Information Resources Press. (skim)
Lancaster, F. W. 2003. Indexing and abstracting in theory and practice. London: Facet Pub. Chapter 3 (113-134) and Chapter 7 (100-112)
NISO. Guidelines for Abstracts. NISO Press, 1997.
Tenopir, C., & Jacso, P. 1993. Quality of Abstracts. Online. 17(3), 44,46-48,50-55.
Indexing, Controlled Vocabularies and Thesauri
Subject Analysis
Indexing Systems
Controlled Vocabularies
Subject Headings
Thesauri
Ontologies
Social Tagging
Subject Analysis
Subject analysis is the process by which you decide what a work is about. Once you know what a work is about it can be classified or indexed to aid retrieval. Call numbers (classification) and subject headings (indexing) are assigned based on subjects from the subject analysis process.
The purpose of subject analysis is to provide meaningful subject access to information packages, to provide for collocation of information by subject, to save time searching for related concepts and to allow users to see related subjects and materials. Subject analysis consists of a conceptual analysis to determine the aboutness of an information package followed by the translation of the aboutness into the terminology used in the classification system or controlled vocabulary.
The three basic questions for determining what an information package is about are: What is it? What is it for? and What is it about? Subjects can be determined by examining the information package for significant items like title, topics in the table of contents, images, tables and other significant items.
Problems with subject analysis are often caused by disagreement over what the information package is about, confusion between the subject of the item and what it will be used for, and the fact that people simply do not use the same terms to refer to the same concepts. See http://del.icio.us/ or any social tagging system for a working example of this problem.
An important question in subject analysis is: How much of the conceptual content of the work should be indexed? Generally, there are two levels of indexing: summarisation, which covers only the most important concepts, and depth indexing, which covers all major concepts. Summarisation is the most common level of indexing. With full text search, summarisation seems more limiting than it once did.
Summarisation is the process of deciding what an item is about and translating this into index terms. This process should examine three distinct areas: the discipline in which the item was produced, the specific subjects or topics treated and the form of the item. Summarisations can be written in the following format:
Discipline | Topic {Facet}1-n | Form
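As a rough illustration of the format above, a summarisation can be held as a small record with one discipline, one or more topic facets and a form. The class name, the field names, the "; " separator between facets and the sample values (borrowed from the column design example later in these notes, with Architecture assumed as the discipline) are all choices made for this sketch and are not part of any cataloguing standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Summarisation:
    """Hypothetical record for the Discipline | Topic | Form pattern."""
    discipline: str
    topics: List[str]  # topic facets 1..n
    form: str

    def render(self) -> str:
        # Facets are joined with "; " (an assumption of this sketch),
        # and the three parts with the " | " separator used in the notes.
        return " | ".join([self.discipline, "; ".join(self.topics), self.form])


summary = Summarisation(
    discipline="Architecture",  # assumed discipline for the example
    topics=["Columns", "Design", "Colonial times", "US"],
    form="Book",
)
print(summary.render())
```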
Abstracting is the creation of an abbreviated representation of a work which fully represents its intellectual content. Abstracts can be structured or unstructured, but should include the objectives and scope, methodology, results and conclusion of the paper.
Once major subjects have been identified the object can be indexed using subject headings from a list or thesaurus (e.g. Library of Congress Subject Headings) or classified using class marks from a hierarchical classification system (e.g. Dewey Decimal, Library of Congress).
Indexing Systems
Indexing is the process of assigning keywords to an item based on its subject(s). These keywords generally consist of a controlled vocabulary of selected terms. Subject headings are controlled vocabulary keywords used in catalogues while descriptors are the equivalent terms used in indexes.
Indexing (like classification) has two objectives: to identify pertinent material on a given subject and to enable a searcher to find material on related subjects. Both objectives require that the terms used match those used by users (and authors).
Subject indexing is intended to improve the recall, precision and relevance of the results of a search. (Remember that recall is the proportion of relevant responses to a query that are actually retrieved, precision is the proportion of materials retrieved that are relevant to the query and relevance is a judgement call the user makes to determine if the results match their query.) The problem in improving these three measures is one of signal transformation, the translating of the user's query into the language of the system.
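The definitions of recall and precision above can be worked through with a small calculation. This is only a sketch: the function names and the document identifiers are made up for illustration, and relevance is assumed to have already been judged by the user.

```python
def recall(retrieved, relevant):
    """Proportion of the relevant items that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Proportion of the retrieved items that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical search: four items come back, five items are relevant overall,
# and two items appear in both sets.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5", "doc6", "doc7"}

print(recall(retrieved, relevant))     # 2 of the 5 relevant items were found
print(precision(retrieved, relevant))  # 2 of the 4 retrieved items are relevant
```

A search that returns everything in the database would have perfect recall but very poor precision, which is why the two measures are usually considered together.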
There are presently two schools of thought on how to achieve this improvement. One school states that full text search is the answer since scholars can search in their own language and find other scholars using the same terms; however, this neglects anyone working on the same problem who is using different terminology. Another school of thought, and the one that currently predominates in librarianship, is that items should be indexed using unambiguous vocabularies and that the user can then search to find the chosen search terms for the subjects of interest. These unambiguous vocabularies are called controlled vocabularies. They consist of subject languages (indexing languages, indexing systems, thesauri, ontologies, classifications, etc.) in which conscious decision-making has played a role about the terms used, the terms not used, and the relationships among all terms. One problem with these systems is that they often provide too little help in locating the controlled vocabulary term related to a user's search terms.
Shera and Egan's Eight Principles
These eight principles appeared in 1956 in a publication called The Classified Catalog (Chicago: American Library Assn.) by Jesse Shera and Margaret Egan. These eight principles build on three of the objectives of a library catalogue (identifying, collocating and evaluating).
The principles say that a subject indexing system should:
1. Provide access by subject to all relevant material.
2. Provide subject access to materials through all suitable principles of subject organization, e.g., matter, process, applications, etc.
3. Bring together references to materials [that] treat of substantially the same subject regardless of disparities in terminology. (e.g. Trucks vs. Lorries)
4. Show affiliations among subject fields.
5. Provide entry to any subject field at any level of analysis, from the most general to the most specific.
6. Provide entry through any vocabulary common to any considerable group of users. (e.g. Movies, Films, Motion pictures)
7. Provide a formal description of the subject content of the bibliographic unit.
   Coextensivity implies that the indexing must equal the subject, no more and no less.
8. Provide means for the user to make selection from among all items in any particular category.
Principle 1 is the identifying objective of the catalogue. Principle 8 deals with the evaluating objective of the catalogue. The other principles (2-7) deal with the collocation objective and attempt to specify the various ways in which collocation could be accomplished. Principle 2 says that there should be a logical basis for collocation of materials. Principles 3 and 6 discuss methods for dealing with terminological issues, while principles 4 and 5 discuss syndetic structure, entry points and navigation between related terms. For example, if you were searching for materials on cats, you should be able to start with animals and follow references until you arrive at cats, just as you could follow road signs and a map to reach a city on an unfamiliar road. This is generally accomplished using hierarchies of related terms or facets.
Principle 6 also discusses warrant, or the rationale for a term choice. Literary warrant and user warrant are the most common forms. The Library of Congress Subject Headings use literary warrant, which simply means that all headings are based on material actually held by the Library of Congress. This also means that headings do not exist if the Library of Congress has no material for a subject. In essence, literary warrant means that there is a designated set of sources for the language in an indexing system. Another form of warrant is user warrant (sometimes called domain-specific ontology), which means that the controlled vocabulary is based on the usage in a user community. Principle 6 implies that we should attempt to offer both user and literary warrant in a controlled vocabulary. The simplest way to do this would be to expand the vocabulary of terms that lead users to the authorised controlled vocabulary terms. Finally, principle 7 says that the indexing system should not be confusing; that is, it should be possible for a user to understand why their search terms have been modified to those from the controlled vocabulary.
Controlled Vocabularies
Controlled vocabularies are specialised vocabularies which remove the ambiguity from language by selecting a preferred term for each important indexable concept and listing all other terms as non preferred or entry terminology. These entry terms would then be used to guide the user to the preferred term. The preferred term is intended to be the term most commonly used in the field to refer to that concept. Controlled vocabularies are often found in the form of a thesaurus with links between related concepts or a subject heading list with groups of related terms.
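The relationship between entry terms and preferred terms described above can be sketched as a simple lookup table. The vocabulary below is a toy example: the terms come from the Shera and Egan examples earlier in these notes, but which term is treated as preferred, and the function names, are assumptions made purely for illustration.

```python
# Hypothetical miniature vocabulary: preferred term -> entry (non preferred) terms.
VOCABULARY = {
    "Motion pictures": ["Movies", "Films"],
    "Trucks": ["Lorries"],
}


def build_entry_index(vocabulary):
    """Map every preferred and entry term (case-insensitively) to its preferred term."""
    index = {}
    for preferred, entry_terms in vocabulary.items():
        index[preferred.casefold()] = preferred
        for term in entry_terms:
            index[term.casefold()] = preferred
    return index


def lookup(term, index):
    """Return the preferred term for a user's term, or None if it is unknown."""
    return index.get(term.strip().casefold())


ENTRY_INDEX = build_entry_index(VOCABULARY)
print(lookup("films", ENTRY_INDEX))     # guided to the preferred term
print(lookup("Lorries", ENTRY_INDEX))
print(lookup("Bicycles", ENTRY_INDEX))  # not covered by this toy vocabulary
```

The richer the entry vocabulary, the more likely it is that a user's own term leads somewhere useful.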
There are many challenges to the creation of controlled vocabularies stemming from issues with the ambiguities in natural languages, disagreements over the categorisation of objects and cultural differences.
Specific versus General Terms
How specific should the chosen terms be? For example, a specialised vocabulary for animal breeders would contain specific terms for specific breeds such as Shih Tzu, Dalmatian, etc. whereas a vocabulary for young children could simply use the word dog.
Synonyms or Quasi-synonyms
How different are similar terms? Synonyms and quasi synonyms make written material more interesting, but make indexing hard since users may wish to search under any of a number of similar terms. For example, in different locations the terms trousers, slacks, and pants all refer to the same kind of garment and are used relatively interchangeably. Another example includes terms such as raining, pouring and dripping, which all describe similar meteorological conditions.
Word Form for One-Word Terms
An important issue in indexing is consistency in the spelling and word forms of single word terms. Issues include the use of singular versus plural forms, spelling variations, and the use of combined or hyphenated terms. For example, e-mail evolved to email, the choice between colour and color depends on the country, and fish and fishes actually have different meanings, so it is important to be consistent about the use of singulars versus plurals.
Sequence and form for Multiword Terms and Phrases
Multiword terms and phrases pose similar problems to single word terms, but also include issues of which term goes first and how to show that the terms are joined together to form a single concept. In the past, many terms were inverted in order to ensure that they would all sort together with related terms. For example, the term Education, Higher would have been used to ensure that this term would sort with other subjects in Education. However, most users think of the term Higher Education and find inverted terms to be confusing (this is similar to the confusion resulting from inverting author names and requiring that they be searched in this way). Inverted terms are no longer created, although older ones still exist.
Homographs and Homophones
Homographs are words that have the same spelling and different meanings while homophones sound the same but have different spellings. Examples:
minute (time or size)
mercury (planet, element, Roman god, vehicle)
see, sea or plain, plane
One solution to this problem is qualifications of terms:
Mercury (Planet)
Mercury (Roman deity)
Abbreviations and Acronyms
Abbreviations and acronyms are traditionally handled by spelling them out since abbreviations and acronyms differ by language (e.g. AIDS in English versus SIDA in French and Spanish).
Popular vs Technical Terms
The choice of which term to use depends on the intended audience of the thesaurus.
Multiple Languages
Multiple languages pose issues of translation of terms, idioms, and concepts. Some concepts do not exist separately in other languages, or may require multiple terms to express.
Examples:
Canada (bilingual indexing)
UN Thesaurus (6 languages)
Precoordination or Postcoordination
Precoordination is the combination of concepts to create a combined subject term. This allows a concept which does not have a unique term to be expressed and also allows the expression of complex concepts. These headings also often combine concepts, subconcepts, place names, time periods, and form in one controlled vocabulary term. An example of a precoordinate subject heading is Natural History -- Dictionaries -- Middle Ages. Postcoordination separates these terms into separate terms and leaves the combination of concepts to the searcher. One of the benefits of postcoordinate headings is that they require less effort on the part of the cataloguer or indexer, who is no longer bound to try to figure out all the possible future uses of the item and index them.
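The difference between the two approaches can be sketched in a few lines. Here a precoordinate heading is split into its component terms, each term is assigned separately, and the searcher combines terms at search time with a Boolean AND. The item identifiers, function names and the tiny index are hypothetical; real systems store and search headings in far more elaborate ways.

```python
def postcoordinate(heading):
    """Split a precoordinate heading such as 'A -- B -- C' into separate terms."""
    return [part.strip() for part in heading.split("--") if part.strip()]


# Hypothetical postcoordinate index: item id -> set of independently assigned terms.
INDEX = {
    "item1": set(postcoordinate("Natural History -- Dictionaries -- Middle Ages")),
    "item2": set(postcoordinate("Chemistry -- Dictionaries")),
    "item3": {"Natural History"},
}


def search(index, *terms):
    """Return the ids of items indexed with every requested term (Boolean AND)."""
    wanted = set(terms)
    return sorted(item for item, assigned in index.items() if wanted <= assigned)


print(postcoordinate("Natural History -- Dictionaries -- Middle Ages"))
print(search(INDEX, "Dictionaries"))
print(search(INDEX, "Dictionaries", "Natural History"))
```

Note that the searcher can list the terms in any order, whereas the precoordinate heading fixes one order chosen by the cataloguer.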
Subdivision of Terms
Terms are often subdivided to allow complex or subordinate concepts to be described. The purpose of subdivision is generally to allow searchers to narrow their search into a very specific subject area, but may also be to allow them to select specific formats or genres of items.
Examples of subdivided terms:
separating by form or genre
Chemistry -- Dictionaries
showing treatment of only part of the larger subject
Doctors -- Training
showing special aspects of the larger subject
Medicine -- Patient Care
showing geographical or chronological limitations
Natural History -- Dictionaries -- Middle Ages
The order in which subdivided terms appear shows which part of the term is considered to be the most important part, the topic. Other portions of the term are subordinate concepts. For example, the Natural History heading would be applied to a book on the subject of natural history, which happens to be a dictionary written in the Middle Ages. The main topic is considered to be Natural History. The order of subdivided terms is guided by the cataloguer's determination of which subjects are most important, followed by rules and guidelines for the use of subordinate headings for the specific controlled vocabulary.
Faceted Classification
Ranganathan created the concept of faceted classification out of a sense of dissatisfaction with existing indexing and classification schemes which he considered to be too ambiguous. It is a truism that an item can be indexed or classified in many different ways in a precoordinate system depending rather heavily on the importance the cataloguer gives to each subject and concept.
Faceted classification systems are entirely postcoordinate and concepts must be combined by the searcher. This means that they can be combined in any order based on how important each concept is to the searcher.
Each facet deals with particular attributes of the item. For example, there would be subject facets, form facets (book, audio cd, video, dvd), time facets and place facets. Facets can be described as aspects of a subject or summary.
For example, a book on the design of columns in colonial times in the US would have multiple facets:
form facet: book
subject facet: design (maybe architecture)
subject facet: columns
time facet: colonial times
place facet: US
Facets allow subjects to be complex and complete without requiring the creation of precoordinate subject headings containing form, subject and discipline. Facets fit better into the electronic world since searchers tend to think of facets rather than precoordinate subject headings.
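A faceted search over the column design example can be sketched as follows. Each item carries its facets separately and the searcher combines whichever facets matter to them, in any order. The titles, the second item and the function names are invented for the sketch.

```python
# Hypothetical catalogue records with separately assigned facets.
ITEMS = [
    {"title": "Column design in the colonial US",  # made-up title
     "form": "book", "subject": {"design", "columns"},
     "time": "colonial times", "place": "US"},
    {"title": "Column design in ancient Greece",   # made-up title
     "form": "video", "subject": {"design", "columns"},
     "time": "antiquity", "place": "Greece"},
]


def matches(assigned, wanted):
    """A facet may hold one value or several; match in either case."""
    if isinstance(assigned, (set, frozenset, list, tuple)):
        return wanted in assigned
    return assigned == wanted


def faceted_search(items, **facets):
    """Return the titles of items that match every facet value the searcher supplies."""
    return [
        item["title"]
        for item in items
        if all(matches(item.get(name), value) for name, value in facets.items())
    ]


print(faceted_search(ITEMS, subject="columns"))
print(faceted_search(ITEMS, place="US", subject="columns"))
print(faceted_search(ITEMS, form="video", time="antiquity"))
```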
Principles for the Creation and Use of Controlled Vocabularies
Creating Controlled Vocabularies
Perhaps the most important decision when creating a controlled vocabulary is the level of subject analysis which will be used with it as well as the specificity of the language itself. How specific will the terminology be? For example, will the vocabulary include specific breeds or only the term dog?
In a similar vein, the concept of direct entry states that a concept should be entered under the term that names it rather than as a subdivision of a broader term.
The use of literary versus user warrant is another consideration. It is preferable that terminology in a controlled vocabulary follow the usage of the expected user group. Should terminology be added only once it is present in the literature, or should it be added as it develops?
Using Controlled Vocabularies
Concepts should be assigned the most specific term possible from the thesaurus since a properly constructed thesaurus will allow the user to move from this narrower term to a broader term. Narrowing the focus of a search without having to skim all available abstracts is not possible if terms that are too general are used. On the other hand, absent concepts are generally described with a more general concept until the new concept has gained enough currency to appear as a subject heading so this may not be avoidable for cutting edge research.
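The rule of thumb above (use the most specific term available, and fall back to a broader one when the concept is not yet in the vocabulary) can be sketched as a walk up the broader-term chain. The broader-term links, the set of authorised headings and the function name are assumptions for illustration only.

```python
# Hypothetical broader-term links: term -> its broader term.
BROADER = {"Siamese cat": "Cats", "Cats": "Animals"}


def most_specific_heading(concept, broader, authorised):
    """Return the first authorised heading found when walking from the
    concept up through its broader terms, or None if there is none."""
    term = concept
    seen = set()  # guards against accidental cycles in the links
    while term is not None and term not in seen:
        if term in authorised:
            return term
        seen.add(term)
        term = broader.get(term)
    return None


# The specific heading exists, so it is used directly.
print(most_specific_heading("Siamese cat", BROADER, {"Siamese cat", "Cats", "Animals"}))
# The specific heading is absent, so the next broader heading is used instead.
print(most_specific_heading("Siamese cat", BROADER, {"Cats", "Animals"}))
```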
In principle there should be no arbitrary limit on the number of terms assigned to a document. The principle should be "as many as possible to fully describe the item." In practice, there are often limits assigned or suggested by database vendors or descriptive cataloguing schemes.
Names in controlled vocabularies are generally controlled by authority files.
There are three main types of controlled vocabularies: Subject Heading Lists, Thesauri, and Ontologies. The major differences between the three are that thesauri generally have single terms in hierarchical relationships while subject heading lists have precoordinate terms. Thesauri also tend to have narrower scope than subject heading lists. Finally, ontologies do not have preferred terms.
Subject Heading Lists
Subject heading lists are alphabetical lists of controlled vocabulary headings for use in subject indexing. Related terms are linked by references between the controlled vocabulary terms (preferred terms) and entry vocabulary (non preferred terms). Catalogues can then bring together all items using the same heading while directing users (through the entry vocabulary) to the preferred terms.
Headings are chosen to reflect common usage within a subject community based on usage in printed materials. For a scientific audience, the terms should be scientific, for a general audience the terms should be general. Subject headings are intended to be the word or words the user would be most likely to choose as a search term.
Subject headings are artificial languages designed to remove ambiguities, for example by allowing the user to distinguish between London (UK), London (Ontario), and London (Kentucky). To accomplish this, subject headings use a controlled vocabulary and a list of entry vocabulary. Controlled vocabulary terms are unambiguous terms for a concept which are often referred to as preferred terms. Entry vocabulary consists of a list of non preferred terms which are synonyms or quasi-synonyms pointing to the controlled vocabulary.
There are two general types of subject headings: precoordinate headings, which consist of subjects, forms and types joined together into one multi word subject heading (e.g. Studies -- College and University Students -- United States), and postcoordinate headings, in which subject, form and type are all assigned as separate subjects.
Universal controlled vocabulary lists attempt to cover all of knowledge and are heavily used in libraries as a compromise covering all the subjects in a library. Examples of this type of subject heading list are the Library of Congress Subject Headings (LCSH) and the Sears Subject Headings. Specialised controlled vocabularies are used for subject or domain specific systems in which general or universal headings would be considered too general or even inaccurate. Examples of this type of heading list are Medical Subject Headings in PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh) and Book Industry Subject Headings (http://bisg.org/standards/bisac_subject/).
LCSH – Library of Congress Subject Headings
LCSH is the Library of Congress' subject heading list and is based on the subjects in the Library of Congress. It is heavily used in academic libraries and available through Cataloguer's Desktop or LC Classweb. It is probably one of the most widely used subject heading lists, with local variations created in other countries (such as Canadian Subject Headings).
In 1898, the Library of Congress first created a unified library-wide dictionary catalogue to replace various division specific catalogues. The introduction of this dictionary catalogue patterned on Cutter's rules meant the LC had to begin collocating their material by subject in the catalogue as well. The initial version of LCSH was patterned on the ALA List of Subject Headings for Use in Dictionary Catalogs (1895) which had been developed for use in small and medium-sized public libraries. The list was then expanded, at first by subject bibliographers, and later (and still) by subject catalogers.
The LCSH list is based on literary warrant, which means the list covers the topics covered in LC materials (and only those topics). Concepts in LCSH are therefore more likely to be represented by scientific, scholarly, or technical terms than colloquial, and are less likely to change over time.
LCSH has the structure of a thesaurus but is not really a thesaurus. Terms in a thesaurus tend to cover only one concept, while subject headings may cover highly complex subjects with multiple embedded concepts. For example, fire engine is a term while Fire engines in literature could be an LCSH subject heading. Nevertheless LCSH headings appear with UF, BT, NT, and RT relationships as well as SA for see also. The entire subject heading list is printed in a multi-volume set with frequent new editions (5 big red books), and every heading is also contained in a MARC authority record online.
LCSH headings may be subdivided. Specified subdivisions are printed after dashes following the heading in the list (see Hair--Dyeing and bleaching in the Taylor example). These may be used only as printed in the list; separate authority records represent headings with specified subdivisions. Free floating subdivisions exist as well, for form and genres (e.g. Bibliography) which may be used with any heading as appropriate. These subdivisions are listed in the Subject Cataloging Manual in Cataloguer's Desktop. Geographic subdivisions exist and may be applied only as prescribed in LCSH. The term "May Subd Geog" will appear in the list if geographic subdivisions can be applied.
Pattern headings are printed in the Subject Cataloging Manual for entities such as universities, governments, monarchs, industries, and so forth. These are just convenient gatherings of subdivisions that are of limited applicability, and their usage is restricted. Scope notes appear under headings and subdivisions throughout to give catalogers guidance in applying headings.
Recall that each summarisation contains three concepts: Discipline, Topic, and Form. Subject headings are assigned to represent topics in resources and subdivisions are used to represent forms. Discipline is not included in LCSH, except by the use of Library of Congress Classification (LCC) symbols. From the example in Taylor, you can see that Hair has three disciplinary contexts: physical anthropology, comparative anatomy, and human anatomy (this is an issue in subject classification, and is one of the reasons that Dewey included a relative index to allow users to locate all the possible locations in which a topic could be located).
Items are generally indexed at the summary level (i.e. one summarisation is developed to describe the subject of the whole item). Sometimes, though, items with special components (e.g. a book of essays) may receive slightly more in-depth indexing that covers parts of the item.
Care is taken, however, not to exceed the content of the volume. If the book is about Siamese cats then the subject heading Cats would be considered too broad (unless there is no more specific heading in the list). This is the principle of coextensivity from Shera and Egan's list.
Sears Subject Headings
The Sears List of Subject Headings was created in the 1920s by Minnie Sears and intended for use in public libraries. Sears uses common terminology rather than scientific or specialist terminology. The first edition was based on the subject headings used in a set of public libraries with reputations for reliable cataloguing.
Sears placed these headings in line with LCSH format, which allowed the introduction of LC headings when necessary, and also the ability to switch to LCSH should the collection grow too large for a limited list. Like many subject heading lists, Sears has a quasi thesaural structure, meaning that it shows connections like narrower, broader and related terms. Sears does not allow the application of geographic subdivisions unless they are explicitly listed. Sears is intended to be applied to represent the topics of resources, so discipline is represented by references to the Dewey Decimal Classification.
An excerpt from Sears appears on page 278 of Taylor's book. Take a moment to compare the Sears heading for Hair to that in LCSH. You will see that the terminology is colloquial rather than technical and that the structure of the subject headings list is simpler.
MeSH - Medical Subject Headings
PubMed is a service of the National Library of Medicine that allows users to search over 17 million citations from MEDLINE and various life science journals, some with coverage back to the 1950s. Many articles are available in full text while others may be available through the university library system.
Basic PubMed searching is accomplished via a standard search box which takes user selected terms. Articles on PubMed are indexed using Medical Subject Headings (MeSH), which is a subject heading list of controlled vocabulary terms related to medicine. Unlike most such lists, MeSH contains an extensive entry vocabulary of non medical terminology, allowing users with little medical background to perform successful searches.
MeSH also has the appearance of a thesaurus and is often referred to as such. This subject heading list can be examined online through PubMed.
http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh
Canadian Subject Headings (CSH)
Canadian Subject Headings are maintained by Library and Archives Canada to supplement the Library of Congress Subject Headings where these headings provide insufficient coverage of Canadian topics or where headings are inappropriate in a Canadian context. CSH also provides equivalent French terms for indexing.
http://www.collectionscanada.gc.ca/csh/index-e.html
http://www.collectionscanada.gc.ca/rvm/index-e.html
Thesauri
Thesauri are collections of controlled vocabulary similar to subject heading lists, but with added syndetic structural links between broader and narrower terms as well as related terms.
Syndetic structure is a structure that contains connections between items. On a map these would be connections (roads, train tracks, etc.) between cities. In a thesaurus these are connections between concepts (e.g. the connection between two synonyms, slacks and trousers). Many people find maps easier to use for navigation than a list of instructions (although there are people who find the opposite). Likewise, a syndetic structure like a thesaurus can make it easier to navigate related subjects than an alphabetic list of all subjects since it has added structure linking narrower terms to broader terms and related terms.

Figure 1: Syndetic Structure of Banking Terms
Now, you can see the links between terms which are broader, like Banks, and terms which are narrower like Loans or Mortgage. To reach a concept like Mortgage Loans, you must first move from Banks to Loans to Mortgage. This is a hierarchy (also referred to as a tree since it has one root and many branches).
Banks is the topmost node of the hierarchy. The terms on the second tier are narrower terms than Banks. (The ellipsis indicates missing terms.) Each succeeding point in the hierarchy has relationships mapped as well. From Loans we see Banks as a Broader term. Loans also has three narrower terms. Terms that are on the same tier are said to be related terms. We can represent these relationships as BT for Broader Terms, NT for Narrower Terms, and RT for Related Terms. In this way we can display the entire map by relating each term alphabetically together with all of its relationships:

Figure 2: Simple Banking Thesaurus
This is a thesaurus, giving terms and their syndetic structure. Two additional thesaural relationships that are not pictured here are "Used For" or "UF", which refers to entry vocabulary or non preferred terms, and "USE", which refers to the preferred term or controlled vocabulary terms (in this picture these are shown in bold). Both pictures give the same information, but in different formats, just as AACR2 and MARC show the same information in widely different formats.
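The step from a hierarchy to an alphabetical thesaurus display can be sketched in code. Only Banks, Loans and Mortgage are taken from the discussion above; the other terms stand in for the ones elided in the figure and are hypothetical, as are the function names. Following the simplification used in these notes, terms that share a parent are treated as related terms.

```python
# Parent -> narrower terms. Deposits, Auto loans and Student loans are
# hypothetical placeholders for terms not named in the text.
HIERARCHY = {
    "Banks": ["Deposits", "Loans"],
    "Loans": ["Auto loans", "Mortgage", "Student loans"],
}


def build_thesaurus(hierarchy):
    """Derive BT, NT and RT relationships for every term in the hierarchy."""
    entries = {}

    def entry(term):
        return entries.setdefault(term, {"BT": [], "NT": [], "RT": []})

    for parent, children in hierarchy.items():
        entry(parent)["NT"].extend(children)
        for child in children:
            entry(child)["BT"].append(parent)
            # Siblings under the same parent are recorded as related terms.
            entry(child)["RT"].extend(c for c in children if c != child)
    return entries


def display(entries):
    """List each term alphabetically with all of its relationships."""
    lines = []
    for term in sorted(entries):
        lines.append(term)
        for relation in ("BT", "NT", "RT"):
            for target in sorted(entries[term][relation]):
                lines.append(f"  {relation} {target}")
    return "\n".join(lines)


THESAURUS = build_thesaurus(HIERARCHY)
print(display(THESAURUS))
```

The hierarchy and the alphabetical listing carry the same information; the listing is simply easier to consult when you already know the term you are starting from.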
Systems like this are also often described as domain-specific ontologies. That just means all of the terms we need on this one topic. The ontology is the underlying structure for all knowledge organization.
Sample Thesaurus Entry from ERIC (Excerpt)
Online Courses
Scope Note: Classes conducted remotely via computer systems, usually on the Internet
Category: Curriculum Organization
Broader Terms: Computer Uses in Education; Courses;
Narrower Terms: n/a
Related Terms: Computer Mediated Communication; Distance Education; Extension Education; Independent Study; Internet; Nontraditional Education; (etc.)
AAT - Art & Architecture Thesaurus
The Art and Architecture Thesaurus or AAT was created in response to the need for a unified vocabulary for objects of art that could be used not only for indexing library catalogs, but also for representing cultural heritage resources found in archives and museums and on the Web.
http://www.getty.edu/research/tools/vocabulary/aat/about.html
AAT is a faceted system. Existing vocabularies were merged to create the basic list of terms from which AAT was created, but the vocabularies in the list at present are entirely domain-specific for the art and architecture world. AAT is a thesaurus, in the form of a hierarchical database. The facets themselves incorporate specific hierarchies.
Here is the base structure:
ASSOCIATED CONCEPTS FACET (Hierarchy: Associated Concepts)
PHYSICAL ATTRIBUTES FACET (Hierarchies: Attributes and Properties, Conditions and Effects, Design Elements, Color)
STYLES AND PERIODS FACET (Hierarchy: Styles and Periods)
AGENTS FACET (Hierarchies: People, Organizations)
ACTIVITIES FACET (Hierarchies: Disciplines, Functions, Events, Physical and Mental Activities, Processes and Techniques)
MATERIALS FACET (Hierarchy: Materials)
OBJECTS FACET (Hierarchies: Object Groupings and Systems, Object Genres, Components) (Built Environment: Settlements and Landscapes, Built Complexes and Districts, Single Built Works, Open Spaces and Site Elements) (Furnishings and Equipment: Furnishings, Costume, Tools and Equipment, Weapons and Ammunition, Measuring Devices, Containers, Sound Devices, Recreational Artifacts, Transportation Vehicles) (Visual and Verbal Communication: Visual Works, Exchange Media, Information Forms)
Obviously the Objects Facet is the equivalent of a main list of artistic concepts, representing matter, or kinds of things, to which all of the other facets may be applied.
ERIC Thesaurus
ERIC is the Education Resources Information Center, an online digital library of materials related to the field of education. ERIC is sponsored by the Institute of Education Sciences (IES) of the U.S. Department of Education. ERIC provides access to more than 1.2 million bibliographic records of journal articles and other materials related to education. If available, ERIC also includes links to the full text.
ERIC can be searched using natural language or user-selected terms. Basic search can be used to search for keywords, title, author or thesaurus terms. ERIC has its own thesaurus of education-related terms, which are used to index the contents of the ERIC database. These terms can also be used to search for items.
Multilingual Thesauri
These thesauri contain multiple languages and are much harder to create and maintain due to disagreements over translations and the importance of various concepts. Political and cultural issues may also cause dissatisfaction with multilingual thesauri.
EUROVOC
UN Thesaurus
Ontologies
Ontologies consist of hierarchical relationships between terms, similar to those in thesauri. The term comes from computer science, where ontologies were developed to attempt to formalise abstract concepts for artificial intelligence. It is also used to refer to structures created on the web for linking metadata information. Ontologies are heavily used in semantic web applications. The semantic web is a more highly structured version of the web intended to allow intelligent robots to merge information from diverse sources.
Ontologies have similar structures to thesauri but use different terms for the relationships.
synonym (used for)
coordinate terms (related term)
hypernym (like broader term)
hyponym (like narrower term)
holonym (similar to broader term)
meronym (similar to narrower term)
antonym (not used in traditional thesauri)
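The correspondence listed above can be written down as a simple lookup table. The Python sketch below is only an illustration of that list: the function names and the sample terms (dog, animal, wheel, car, hot, cold) are hypothetical, and the mapping is lossy because holonym and meronym are only similar to broader and narrower terms.

```python
# Ontology relationship -> closest thesaurus relationship, as listed above.
# None marks a relationship with no counterpart in a traditional thesaurus.
ONTOLOGY_TO_THESAURUS = {
    "synonym": "UF",
    "coordinate term": "RT",
    "hypernym": "BT",
    "hyponym": "NT",
    "holonym": "BT",   # only similar to a broader term
    "meronym": "NT",   # only similar to a narrower term
    "antonym": None,
}


def to_thesaurus_label(relation):
    """Return the thesaurus label for an ontology relationship name."""
    key = relation.strip().lower()
    if key not in ONTOLOGY_TO_THESAURUS:
        raise KeyError(f"unknown ontology relationship: {relation!r}")
    return ONTOLOGY_TO_THESAURUS[key]


def as_thesaurus_line(term, relation, target):
    """Render 'target is the <relation> of term' in thesaurus notation.

    Returns None when the relationship has no thesaurus equivalent.
    """
    label = to_thesaurus_label(relation)
    if label is None:
        return None
    return f"{term} {label} {target}"


# Hypothetical sample statements, for illustration only.
print(as_thesaurus_line("dog", "hypernym", "animal"))
print(as_thesaurus_line("wheel", "holonym", "car"))
print(as_thesaurus_line("hot", "antonym", "cold"))
```

Going in the other direction is not possible without extra information, since BT alone does not say whether the original relationship was a hypernym or a holonym.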
Natural language processing (NLP) involves the use of a computer for term or word recognition. This may involve spoken or written words. It is useful for full text analysis and searching, but its use is complicated by the actual use of language (e.g. incomplete sentences in speech, inaccurate sentences, verbing of words, confusion of context, lack of context, etc.). Ontologies in the semantic web rely on highly structured data (similar to the structures in Dublin Core) to improve computer term recognition by eliminating some of the ambiguities in language.
Social Tagging
Vocabulary control satisfies some but not all of the principles put forward by Shera and Egan. Perhaps the most important principle that is not satisfied is that of using the terms that users use, in other words natural language. While vocabulary control can satisfy many of Shera and Egan's principles in terms of organising knowledge to bring references together, show affiliations and provide entry at varying levels of specificity, this works only when the thesaurus or subject heading list has been carefully structured.
Natural language terms, on the other hand, are more accessible to users even if they are ambiguous and create problems in separating quasi synonyms, homophones and homographs. This leaves the task of separating such terms and items to the user. Perhaps the major difference is that controlled vocabularies require more initial effort to use, though this effort may prove worthwhile when irrelevant materials are pruned away without the need to examine them all first.
One interesting new development in information organisation is social tagging, also sometimes referred to as social bookmarking. Social tagging is the process of applying natural language terms to an item. Generally social tagging exists as part of a system which allows users to bookmark or store interesting items such as web sites, videos or pictures. (See the following YouTube video on social bookmarking to see the process described visually: http://www.youtube.com/watch?v=x66lV7GOcNU)
Social tagging can be described as the act of associating a term with a link or article. It is an example of labelling or classifying for personal use. The main difference from purely private bookmarking is that items which have been tagged can be viewed by other users, and you can see their bookmarked items and tags as well. This public sharing of bookmarks creates a network of related links and related tags, all selected by users who had some interest in the item they are bookmarking.
When social tagging sites collate the words used by various users for the same item, they create lists of related terms. These terms are not related in the same way as the related terms of a thesaurus. The relationship is more like that between subject headings assigned to the same item in a library catalogue. This collection of terms forms what is called a folksonomy, a hierarchy of relationships between natural language terms.
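The collation step described above amounts to counting, for one item, how many users applied each tag. The Python sketch below illustrates the idea under stated assumptions: the user names, URLs and tags are invented for the example, and the function name collate_tags is hypothetical. It does not describe how any particular tagging site is implemented.

```python
from collections import Counter

# Hypothetical bookmarks: (user, item URL, tags chosen by that user).
BOOKMARKS = [
    ("user_a", "http://example.org/paper", ["tagging", "folksonomy", "research"]),
    ("user_b", "http://example.org/paper", ["tagging", "classification"]),
    ("user_c", "http://example.org/paper", ["folksonomy", "tagging", "web2.0"]),
    ("user_a", "http://example.org/other", ["library"]),
]


def collate_tags(bookmarks, url):
    """Count how many users applied each tag to the item at `url`."""
    counts = Counter()
    for _user, item, tags in bookmarks:
        if item == url:
            # set() ensures a tag is counted once per user.
            counts.update(set(tags))
    return counts


paper_tags = collate_tags(BOOKMARKS, "http://example.org/paper")
# Most popular first, ties broken alphabetically.
for tag, count in sorted(paper_tags.items(), key=lambda kv: (-kv[1], kv[0])):
    print(count, tag)
```

Tags that end up in the same list are related only because they were applied to the same item, which is why the result resembles co-assigned subject headings rather than thesaurus relationships.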
Here is a list of related terms from del.icio.us assigned to an academic article on tagging (Kipp, Margaret E. I. and Campbell, D. Grant (2006) Patterns and Inconsistencies in Collaborative Tagging Systems. In Proceedings Annual General Meeting of the American Society for Information Science and Technology, Austin, Texas (US). http://eprints.rclis.org/archive/00008315/). The full list of terms can be found at http://del.icio.us/url/94ac1c8406e5346260229fd266dc68f3.
33 tagging
18 folksonomy
15 del.icio.us
13 research
12 web2.0
6 delicious
6 folksonomies
5 classification
4 analysis
4 article
4 library
4 socialbookmarking
4 tags
4 taxonomy
In this list, terms like tagging and classification are linked by their relationship to this paper. The fact that the term tagging appears first simply shows the bias towards newer terms in the system. It is also worth noting that the term taxonomy, which refers to a hierarchy, is also included.
One of the biggest problems with social tagging is that there is no vocabulary control at all. Spelling variations, plurals versus singular forms and multiple quasi synonyms in the same list are common. In addition, early systems made distinctions between terms which started with a capital letter and terms which did not (Fish is different from fish). While this might seem to be a horrible situation leading directly to chaos, users themselves may alleviate some of these problems. Research has shown that the more users who tag an item, the more likely that term variations will be fully explored and that a consensus will develop on what the item is actually about. The consensus can be seen by examining the first few terms in the list (the most popular).
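A small amount of vocabulary control can be applied to a tag list after the fact. The Python sketch below uses the tag counts from the del.icio.us list earlier in this section, folds upper and lower case together, and merges a few variants through a lookup table. The variant table is a hypothetical example of an editorial decision, not a feature of del.icio.us, and the function names are invented for illustration.

```python
from collections import Counter

# Tag counts copied from the del.icio.us list earlier in this section.
RAW_COUNTS = {
    "tagging": 33, "folksonomy": 18, "del.icio.us": 15, "research": 13,
    "web2.0": 12, "delicious": 6, "folksonomies": 6, "classification": 5,
    "analysis": 4, "article": 4, "library": 4, "socialbookmarking": 4,
    "tags": 4, "taxonomy": 4,
}

# Hypothetical variant table: variant form -> preferred form.
VARIANTS = {
    "del.icio.us": "delicious",    # spelling variation
    "folksonomies": "folksonomy",  # plural versus singular
}


def normalise(tag):
    """Fold case (Fish and fish become one term) and map known variants."""
    tag = tag.strip().lower()
    return VARIANTS.get(tag, tag)


def merge_counts(raw):
    """Add together the counts of all tags that normalise to the same term."""
    merged = Counter()
    for tag, count in raw.items():
        merged[normalise(tag)] += count
    return merged


merged = merge_counts(RAW_COUNTS)
print(merged.most_common(3))
```

Note that the merged figures count tag uses rather than users: someone who applied both folksonomy and folksonomies to the item is counted twice. Quasi synonyms such as tagging and tags are deliberately left apart here, since deciding whether they mean the same thing is a judgement call.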
Sample Social Tagging Sites
See Wikipedia's list of social bookmarking, social cataloguing or social citation sites as well:
http://en.wikipedia.org/wiki/List_of_social_software#Social_bookmarking
Tag Clouds
Tag clouds show popular terms in a social tagging system. Terms that are larger have been used the most often. In some cases, lightness or darkness of text may indicate currency: terms which are darker have been used more recently.
Del.icio.us Tag Cloud http://del.icio.us/tag/
Amazon's Tag Cloud http://www.amazon.com/gp/tagging/cloud
PennTags from the University of Pennsylvania http://tags.library.upenn.edu/
By examining a tag cloud you can get an idea of what subjects are covered on a system or in a particular user's collection.
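The sizing of terms in a tag cloud can be sketched as a simple scaling of usage counts to font sizes. The Python example below uses a few counts from the del.icio.us list earlier in this section. The linear scaling and the 10 to 32 point range are assumptions chosen for illustration; real sites may scale differently (for example logarithmically), and the function name font_sizes is hypothetical.

```python
# A few tag counts copied from the del.icio.us list earlier in this section.
COUNTS = {
    "tagging": 33,
    "folksonomy": 18,
    "research": 13,
    "classification": 5,
    "taxonomy": 4,
}


def font_sizes(counts, min_pt=10, max_pt=32):
    """Scale each count linearly into a font size between min_pt and max_pt."""
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for tag, count in counts.items():
        if span == 0:
            # Every tag is equally popular, so use a single size.
            sizes[tag] = min_pt
        else:
            sizes[tag] = round(min_pt + (count - lowest) * (max_pt - min_pt) / span)
    return sizes


sizes = font_sizes(COUNTS)
# Printed alphabetically here; the size alone carries the popularity.
for tag in sorted(sizes):
    print(f"{tag} ({sizes[tag]}pt)")
```

The most used term receives the largest size and the least used the smallest, which is what lets a reader judge the coverage of a collection at a glance.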
Tag clouds can be used to display subject headings as well. Flinders University in Australia provides a tag cloud for their Library of Congress Subject Headings. Though this may or may not be an asset when searching, it does provide an interesting picture of the extent of the collection in terms of subjects and size of specific subject collections. This is a picture you would be unable to form yourself unless you worked in the library for a long time. http://www.lib.flinders.edu.au/resources/voyager/cloud.html
1 Syndetic: from a Greek root meaning connected.