GOLD is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations (the "atoms") used in the scientific description of human language. GOLD is intended to capture the knowledge of a well-trained linguist, and can thus be viewed as an attempt to codify the general knowledge of the field. It will facilitate automated reasoning over linguistic data and help establish the basic concepts through which intelligent search can be carried out. Furthermore, GOLD is meant to be compatible with the general goals of the Semantic Web.
The GOLD Community has a vision to bring together those interested in the best-practice encoding of linguistic data. The aims of the GOLD Community may be summarized as follows:
- To promote best practice as suggested by the E-MELD project.
- To encourage data interoperability through the use of ontologies.
- To encourage the broader use of software.
- To facilitate search across disparate data sets.
- To create a forum for data providers and consumers.
More than just a taxonomy of linguistic terms, GOLD is founded on principles of ontological engineering: for example, it provides a rich axiomatization of classes and relations. GOLD was initially constructed top-down using SIL International's on-line glossary of linguistic terms and standard linguistics sources such as David Crystal's Cambridge Encyclopedia of Language. To supplement the original development, a new methodology for concept acquisition is being developed by Will Lewis and Scott Farrar, whereby GOLD can be constructed on an empirical basis (see Data-Driven Linguistic Ontology project). GOLD has been mapped to the Suggested Upper Merged Ontology (SUMO). We are now also implementing the GOLD Community of Practice as a building block for a cyberinfrastructure for linguistics.
The GOLD Community of Practice has as its basis the General Ontology for Linguistic Description (GOLD). First envisioned by Scott Farrar and reported in his 2003 dissertation, GOLD was originally planned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages (Scott Farrar and D. Terence Langendoen (2003) "A linguistic ontology for the Semantic Web." GLOT International. 7 (3), pp.97-100). Originally, E-MELD intended to build a single termset that all could use, but this goal soon proved unattainable: terminology varied too widely among linguists and sub-communities in linguistics, and there was considerable reluctance to change it. An ontology through which these diverse termsets could be linked thus made the most sense.
Will Lewis created the first version of the ontology by transforming and organizing information obtained from SIL International's on-line glossary of linguistic terms. A team of research assistants at the University of Arizona then searched the linguistics literature and doubled the size of the term set. Much of GOLD's intellectual content can be attributed to Terry Langendoen, who also coined the acronym. Major development has been done by Gary Simons, Anthony Aristar, Brian Fitzsimons, and many others involved with the E-MELD project. Most recently, in 2005, E-MELD sponsored a workshop to allow critique of the content and structure of GOLD. The implementers are now revising GOLD based on the suggestions from the E-MELD 2005 workshop participants. In this they are being aided by consultation with the Surrey Morphology Group, which is participating in a series of exchange visits designed to improve the content coverage of GOLD.
The idea of a GOLD Community came together in November 2004 during the Fresno meeting, organized by Will Lewis in order to promote the GOLD effort to the larger community. The notion of community-based ontology building was also discussed at the Digital Tools Summit in Linguistics held in Lansing, Michigan in conjunction with the E-MELD 2006 workshop.
GOLD owes its initial support to the NSF-sponsored E-MELD grant. Then, in a separate effort, the Data-Driven Linguistic Ontology (DDLO) grant was awarded by the NSF in 2004 to build out GOLD, while in April 2006, a supplement to E-MELD was awarded by NSF. The supplement funded the exchange visits with the Surrey Morphology Group. Development is currently funded by the NSF via the GOLD Community of Practice project.
The General Ontology for Linguistic Description (GOLD) (http://linguistics-ontology.org/) is licensed under a Creative Commons Attribution 3.0 Unported License <http://creativecommons.org/licenses/by/3.0/> and can be used freely as long as it is properly cited.
How to cite
2010. General Ontology for Linguistic Description (GOLD). Ypsilanti, MI: Institute for Language Information and Technology (LINGUIST List), Eastern Michigan University. http://linguistics-ontology.org/.
The GOLD Community is founded on best-practice data resources. Since data may come from a variety of disparate sources, be about different languages, and be described from different theoretical perspectives, it is necessary to map these data onto a common semantic resource: GOLD. Mapping from data to knowledge is not a simple transformation, however. The various terminologies used in the best-practice resources first need to be rendered transparent and compatible with one another. Thus, best-practice data resources are mapped to a set of descriptive profile resources, shown in Figure 1. These resources in turn allow for the transition from XML (data) to RDF/OWL (knowledge), described below.
However, since GOLD is relatively new, the Community also has to rely on the large number of legacy data resources already available on the Web, i.e., those not in a best-practice format. One of the goals of the Community, then, is to promote services that will transform legacy resources into best-practice resources. In the actual implementation, legacy resources will be mapped to a set of legacy mapping resources. This is shown in Figure 1 below:
Figure 1: From legacy to best practice.
Best practice resources
Best-practice data resources minimally use Unicode and are in a consistent XML format with an accompanying XML Schema or DTD. In addition, it is recommended that such resources utilize one of the many formats suggested by the E-MELD School of Best Practice.
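The two minimal requirements above can be checked mechanically. The following is a rough sketch only, not part of any GOLD or E-MELD tooling: it assumes the resource is delivered as UTF-8 bytes (one Unicode encoding among several), the function name check_minimal_best_practice is hypothetical, and it tests well-formedness only, since validating against an XML Schema or DTD needs a library beyond the Python standard library.

```python
import xml.etree.ElementTree as ET


def check_minimal_best_practice(raw: bytes):
    """Return (ok, message) for two minimal checks.

    Assumption: 'uses Unicode' is approximated here as 'decodes as UTF-8'.
    Schema or DTD validation is deliberately out of scope for this sketch.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return False, f"not valid UTF-8: {err.reason}"
    try:
        ET.fromstring(raw)
    except ET.ParseError as err:
        return False, f"not well-formed XML: {err}"
    return True, "ok"


# Hypothetical sample inputs, invented for illustration.
good = "<entry><form>ŋa</form><gloss>1SG</gloss></entry>".encode("utf-8")
bad_encoding = "<form>café</form>".encode("latin-1")
bad_xml = b"<entry><form>na</entry>"

for sample in (good, bad_encoding, bad_xml):
    print(check_minimal_best_practice(sample))
```

A check of this kind could serve as a first filter before a resource is considered for mapping to a descriptive profile.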
A profile minimally consists of a mapping of terms used in the data source document to concepts in the ontology. We refer to this as a terminology mapping. A terminology mapping document is a simple set of terms, a termset, linked to concepts in the ontology. A terminology mapping has these minimal requirements:
- Each term is represented only once.
- Each term is defined by its relationship to one or more concepts.
- The terminology mapping document must uniquely identify the resource or resources that contain the concepts that each term references. (Valid resource types include ontologies and profiles.)
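The three requirements above can be sketched as a small validation routine. This is an illustration under stated assumptions, not the actual GOLD profile format: the TermEntry structure, the validate_termset function, and the sample terms and concept names are all hypothetical; only the resource URI is taken from this page.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TermEntry:
    term: str                  # label as used in the data source document
    concepts: Tuple[str, ...]  # concepts the term is related to
    resource: str              # ontology or profile that contains those concepts


def validate_termset(entries):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    seen = set()
    for entry in entries:
        # Requirement 1: each term is represented only once.
        if entry.term in seen:
            problems.append(f"term '{entry.term}' is represented more than once")
        seen.add(entry.term)
        # Requirement 2: each term is defined by one or more concepts.
        if not entry.concepts:
            problems.append(f"term '{entry.term}' is not linked to any concept")
        # Requirement 3: the resource containing the concepts is identified.
        if not entry.resource:
            problems.append(f"term '{entry.term}' does not identify its resource")
    return problems


GOLD_URI = "http://linguistics-ontology.org/"

# Hypothetical termset; the concept names are placeholders.
termset = [
    TermEntry("PST", ("PastTense",), GOLD_URI),
    TermEntry("SG", ("SingularNumber",), GOLD_URI),
    TermEntry("PST", ("PastTense",), GOLD_URI),  # violates requirement 1
    TermEntry("ERG", (), GOLD_URI),              # violates requirement 2
]

problems = validate_termset(termset)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a data provider see every defect in a termset at once.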
Beyond terminology mappings, a profile may include a grammatical sketch of the language in question, e.g., an enumeration of the possible features in the grammatical system.
Legacy resources come in unstructured formats, such as HTML and plain-text documents, or in proprietary formats, e.g., Microsoft Word, which cannot be read without special software that may not always be supported.
Legacy mapping resources
Legacy mapping resources map legacy materials to descriptive profiles. This methodology provides a short-cut when the entire legacy resource cannot be converted to a best-practice format. Instead, the most important aspects of the resource, e.g., a partial grammatical description, are captured.
Once a framework for best-practice data is in place, the real advantages of the GOLD Community emerge, as data can be transformed into knowledge. The following figure shows GOLD in relation to various other knowledge components: at the right is an OWL version of an upper ontology (e.g., SUMO or DOLCE); at the middle level is GOLD itself, which is actually a network of separate OWL files extending the upper ontology via the subclass relation; towards the left are various Community of Practice Extension (COPE) resources, which extend GOLD via the subclass relation into the various theory- or language-specific subdomains; and finally, on the extreme left is the RDF store consisting of instantiated best-practice resources.
Figure 2: The Knowledge Components of the GOLD Community.
Community of practice extensions
COPEs are essentially sub-ontologies that extend GOLD. COPEs provide two main benefits for the GOLD Community. First, they provide the means to create 'communities of practice', that is, communities of consensus formed around specific terminologies and services. With COPEs, communities have the ability to maintain their language-specific, theory-specific, or resource-specific knowledge in discrete, manageable packets. Second, COPEs give individual communities the means to relate their work to one another by virtue of the fact that COPEs extend a single semantic resource, GOLD. A specific community, e.g., linguists concerned with Bantu languages, could create a COPE ensuring that their terminology is interoperable with that of a completely different community, such as American Indianists or even a community centered around the use of WordNet.
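The interoperability argument rests on the subclass relation: two COPE classes become comparable once both can be traced to the same class in GOLD. The sketch below illustrates that idea with a plain Python dictionary rather than OWL. Every class name in it is a hypothetical placeholder, and it assumes each class has a single direct superclass, which is a simplification of what OWL allows.

```python
# Direct subclass links: child class -> parent class.
# All names are invented for illustration and are not taken from GOLD,
# from any COPE, or from any upper ontology.
SUBCLASS_OF = {
    "gold:TenseFeature": "upper:Attribute",
    "bantu:RemotePast": "gold:TenseFeature",
    "amerind:DistantPast": "gold:TenseFeature",
}


def ancestors(cls, subclass_of):
    """Return the superclasses of cls, nearest first."""
    chain = []
    seen = {cls}
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in seen:
            raise ValueError(f"cycle in subclass links at {cls}")
        seen.add(cls)
        chain.append(cls)
    return chain


def shared_superclass(a, b, subclass_of):
    """Return the nearest class that both a and b specialise, or None."""
    b_line = {b, *ancestors(b, subclass_of)}
    for candidate in [a, *ancestors(a, subclass_of)]:
        if candidate in b_line:
            return candidate
    return None


common = shared_superclass("bantu:RemotePast", "amerind:DistantPast", SUBCLASS_OF)
print(common)
```

In this toy setup the two community-specific classes meet at a class in the GOLD layer, which is the sense in which a query phrased in GOLD terms could reach data annotated by either community.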
Instances: the RDF store
The RDF store consists of instantiated classes from COPEs and GOLD. The instances correspond directly to the data and annotation as expressed in best-practice XML. To be maximally useful, the RDF store can be loaded into an RDF framework (e.g., Sesame) for fast knowledge retrieval.
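As a rough illustration of how annotations in best-practice XML might become instances, the sketch below turns a tiny XML fragment into subject-predicate-object tuples using only the Python standard library. The XML layout, the ex: prefix and ex:hasForm predicate, the concept names, and the xml_to_triples function are all assumptions made for the example; typing the annotated element directly with the concept is also a simplification. A real pipeline would more likely emit RDF through a dedicated library and load it into a store such as Sesame.

```python
import xml.etree.ElementTree as ET

# Hypothetical best-practice fragment; the element and attribute names are invented.
XML_DATA = """<text>
  <morpheme id="m1" form="walk" gloss="walk"/>
  <morpheme id="m2" form="-ed" gloss="PST"/>
</text>"""

# Hypothetical terminology mapping from glossing terms to ontology concepts.
TERM_TO_CONCEPT = {"PST": "gold:PastTense", "PL": "gold:Plural"}


def xml_to_triples(xml_text, term_to_concept, base="ex:"):
    """Return (subject, predicate, object) tuples for each morpheme element."""
    triples = []
    root = ET.fromstring(xml_text)
    for morpheme in root.iter("morpheme"):
        subject = base + morpheme.get("id")
        triples.append((subject, "ex:hasForm", morpheme.get("form")))
        # Only glosses covered by the terminology mapping yield a typed instance.
        concept = term_to_concept.get(morpheme.get("gloss"))
        if concept is not None:
            triples.append((subject, "rdf:type", concept))
    return triples


triples = xml_to_triples(XML_DATA, TERM_TO_CONCEPT)
for triple in triples:
    print(triple)
```

Glosses that the terminology mapping does not cover produce no typed instance here, which reflects why the profile step comes before the move from XML to RDF.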