| Lesson 3 | Define Metalanguages |
| Objective | Describe what a metalanguage is. |
Generally speaking, a metalanguage is a language used to describe a language. XML is a metalanguage used to describe markup languages. Rather than defining a fixed set of tags for a specific purpose, XML provides the rules and syntax for defining whatever tags a particular document type requires.
XML is a descendant of a more extensive metalanguage called Standard Generalized Markup Language, or SGML. The XML specification explicitly states that XML is a subset of SGML and that, by construction, XML documents are conforming SGML documents. The following series of diagrams shows the relationship between SGML, XML, and HTML.
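The diagrams above illustrate three distinct roles in the metalanguage family: SGML is the full metalanguage, XML is a simplified subset of SGML that is itself a metalanguage, and HTML is a markup language defined as an application of SGML.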
In the late 1960s, as computers began to be used more widely, a group called the Graphic Communications Association (GCA) created a layout language called GenCode. GenCode was designed to provide a standard language for specifying formatting information so that printed documents would look the same regardless of the hardware used to produce them. GenCode was primarily a procedural markup language - it focused on how documents should look, not what they meant.
In 1969, Charles Goldfarb led a group at IBM that built on the GenCode concept and created the Generalized Markup Language (GML). Where GenCode was concerned mainly with appearance, GML strove to define not only the visual presentation of a document but also, to some degree, the structure of its data. This shift toward structural description was a significant step in the evolution of markup languages.
Nearly a decade after GML emerged, the American National Standards Institute (ANSI) established a working committee to build on GML and create a broader standard. Goldfarb was asked to join this effort and has since become known as the "father of SGML." The first public draft of SGML appeared in 1980. The final version of the standard was published in 1986 as ISO 8879.
Since that time SGML has been extended as needed. In 1988, a version of SGML was developed specifically for military applications (MIL-M-28001). Further additions followed over the years. As SGML grew in scope, some in the industry came to feel that it had acquired complexity beyond what most practical applications required - a condition sometimes described as complexity bloat. This perception was a primary motivation for the development of XML as a simpler, more accessible subset.
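The design of SGML - and by inheritance, XML - is grounded in two foundational postulates about how markup should behave. First, markup should be declarative: it should describe a document's structure and other attributes rather than specify the processing to be performed on it. Second, markup should be rigorous, so that the techniques used to process rigorously defined objects such as programs and databases can also be applied to documents.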
These two postulates explain why XML, as SGML's successor, emphasizes well-formed and valid documents. A document that is not well-formed cannot be reliably processed. A document that is valid against a defined schema or DTD can be trusted to conform to the rules of its markup language.
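The difference between well-formed and valid can be illustrated with a small sketch. The MEMO, TO, and BODY element names and the rule that a memo contains TO followed by BODY are hypothetical, chosen only for this illustration. The document below is well-formed, because every element is closed and properly nested, but it is not valid, because it omits an element that its own DTD requires.

```xml
<?xml version="1.0"?>
<!DOCTYPE MEMO [
  <!-- Hypothetical rule: a MEMO holds one TO element followed by one BODY element. -->
  <!ELEMENT MEMO (TO, BODY)>
  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>
<!-- Well-formed: every start tag has a matching end tag and nesting is correct.
     Not valid: the required TO element is missing. -->
<MEMO>
  <BODY>Meeting moved to Friday.</BODY>
</MEMO>
```

Removing the closing BODY tag would go one step further and make the document not well-formed, at which point an XML parser would reject it before validation is even attempted.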
XML is a profile of the ISO SGML standard, and most of XML comes directly from SGML unchanged. Both HTML and XML derive from SGML, but they occupy different positions in the family and serve different purposes.
HTML is an application of SGML - a specific markup language defined using SGML's rules, with a fixed tag set designed for presenting documents in a browser. HTML describes how content should look. XML, by contrast, is a metalanguage - it provides the framework for defining new markup languages rather than being one itself. XML describes what content means.
This distinction clarifies a question that often arises: if HTML and XML both derive from SGML, why is one a markup language and the other a metalanguage? The answer lies in their purpose. HTML uses a fixed, predefined set of tags. XML defines the rules by which any set of tags can be created and used. A developer using HTML is constrained to the tags the HTML specification provides. A developer using XML can define whatever tags their document type requires.
XHTML extends this relationship further. XHTML is an application of XML that reformulates HTML using XML syntax. Because XHTML is written in XML, it must be well-formed - a stricter requirement than traditional HTML. XHTML is not a subset of either language. It combines HTML's vocabulary with XML's syntax rules.
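As a rough illustration of what the well-formedness requirement means in practice, the sketch below shows a small XHTML 1.0 Strict document. The page content and the file name logo.gif are hypothetical. The comment inside the body shows how the same content might be written in traditional HTML, which browsers tolerate but an XML parser would reject.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Well-formed example</title>
  </head>
  <body>
    <!-- Traditional HTML might write this content as:
           <P>First paragraph<BR>second line
           <P><IMG SRC=logo.gif ALT=Logo>
         with unclosed elements, uppercase names, and unquoted attributes. -->
    <!-- XHTML: lowercase names, every element closed, every attribute quoted. -->
    <p>First paragraph<br />second line</p>
    <p><img src="logo.gif" alt="Logo" /></p>
  </body>
</html>
```

Empty elements such as br and img use the self-closing form, since XML has no notion of an element whose end tag may simply be omitted.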
SGML itself is a metalanguage for defining markup languages. XML is also a metalanguage, but a simpler one. HTML and XHTML are markup languages - applications of their respective metalanguages, not metalanguages themselves.
Using XML, a developer can define a set of elements to describe any physical or logical entity - a customer record, a purchase order, a mathematical formula, a chemical compound, or any other structure of interest. Once that set of elements is defined, it can be used to describe many specific instances of that entity.
For example, a purchase order can be described using the following XML element set:
<PURCHASE-ORDER>
  <PURCHASE-ORDER-NUMBER>1005</PURCHASE-ORDER-NUMBER>
  <PURCHASE-ORDER-DATE>01/10/2001</PURCHASE-ORDER-DATE>
  <SUPPLIER-NAME>BACON INDUSTRIES</SUPPLIER-NAME>
</PURCHASE-ORDER>
Notice that XML does not use a fixed, predefined set of elements the way HTML does. Each purchase order described using this element set is an XML document instance. The same element set can describe purchase orders 1006, 1007, and beyond - only the values inside the elements change. The structure defined by the element set remains constant across all instances.
This flexibility is what makes XML a metalanguage rather than a markup language. The element set for purchase orders is itself a markup language - one that was defined using XML. The same approach applies to any domain: medical records, financial transactions, scientific data, or configuration files. Each domain can have its own XML-based markup language, defined to match the structure and semantics of its data.
Once a markup language has been defined using XML, the rules for creating valid documents in that language can be formalized using a Document Type Definition, or DTD. The DTD specifies which elements are allowed, in what order, and with what attributes. It is used to validate XML document instances - to confirm that a document conforms to the rules of its markup language.
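A minimal sketch of such a DTD for the purchase order element set follows, written as an internal subset so that the rules and the document instance sit in one file. The content model is an assumption made for illustration: the lesson does not say whether all three child elements are required or in what order, so this sketch requires each exactly once, in the order shown in the earlier example.

```xml
<?xml version="1.0"?>
<!DOCTYPE PURCHASE-ORDER [
  <!-- Assumed rule: a purchase order holds exactly these three children, in this order. -->
  <!ELEMENT PURCHASE-ORDER (PURCHASE-ORDER-NUMBER, PURCHASE-ORDER-DATE, SUPPLIER-NAME)>
  <!-- Each child holds plain character data. -->
  <!ELEMENT PURCHASE-ORDER-NUMBER (#PCDATA)>
  <!ELEMENT PURCHASE-ORDER-DATE (#PCDATA)>
  <!ELEMENT SUPPLIER-NAME (#PCDATA)>
]>
<PURCHASE-ORDER>
  <PURCHASE-ORDER-NUMBER>1005</PURCHASE-ORDER-NUMBER>
  <PURCHASE-ORDER-DATE>01/10/2001</PURCHASE-ORDER-DATE>
  <SUPPLIER-NAME>BACON INDUSTRIES</SUPPLIER-NAME>
</PURCHASE-ORDER>
```

A validating parser (for example, xmllint run with its --valid option) should accept this instance. Under these assumed rules, it should reject an instance that drops SUPPLIER-NAME or reorders the children. Note that a DTD cannot check that the number is numeric or that the date is a real date, because #PCDATA permits any text.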
DTDs are one of the mechanisms that make XML suitable for data interchange between organizations. When two systems agree on a DTD, they have agreed on a contract: any document that validates against the DTD is guaranteed to have the structure both systems expect. DTDs will be examined in detail later in this course. The next lesson explores the limitations of HTML and why those limitations motivated the development of XML.
A metalanguage is a language used to define other languages. SGML, developed through the 1970s and published as an ISO standard in 1986, was the first widely adopted markup metalanguage. XML emerged as a simpler, more practical subset of SGML and inherited its two foundational principles: markup should be declarative, and markup should be rigorous.
SGML, XML, and HTML occupy distinct positions in the metalanguage family. SGML is the complex superset metalanguage. XML is a simplified subset metalanguage that can be used to define new markup languages. HTML is an application of SGML - a markup language with a fixed tag set designed for browser presentation, not a metalanguage. XHTML reformulates HTML using XML syntax, combining HTML's vocabulary with XML's well-formedness rules.
XML's power as a metalanguage is demonstrated by its ability to define domain-specific element sets - such as the purchase order example in this lesson - and validate document instances against those definitions using a DTD. This combination of flexibility and rigor has made XML a widely used foundation for data interchange across many domains of software development.