XML Programming

Lesson 3: Define Metalanguages
Objective: Describe what a metalanguage is.

XML Metalanguages: SGML, XML, and HTML

Generally speaking, a metalanguage is a language used to describe a language. XML is a metalanguage used to describe markup languages. Rather than defining a fixed set of tags for a specific purpose, XML provides the rules and syntax for defining whatever tags a particular document type requires.

XML is a descendant of a more extensive metalanguage called Standard Generalized Markup Language, or SGML. The XML specification describes XML as a subset of SGML and states that, by construction, XML documents are conforming SGML documents. The following series of diagrams shows the relationship between SGML, XML, and HTML.

Diagram 1: SGML is the superset metalanguage. XML is a subset of SGML. HTML is an application of SGML; red arrows show the derivation relationship.
Diagram 2: SGML is a powerful but complex metalanguage. XML is a simplified subset designed to be easier to learn and implement, illustrated by the relative sizes of the blocks.
Diagram 3: XML sits within SGML as a contained subset. Red arrows point from the metalanguages to HTML, which is an application of SGML.


SGML, XML, and HTML: Three Distinct Roles

The diagrams above illustrate three distinct roles in the metalanguage family:

  1. SGML is a superset and a complex metalanguage. It defines the rules for creating markup languages and has been used as the foundation for both XML and HTML. SGML is powerful but its complexity makes it difficult to implement and work with directly.
  2. XML is a subset of SGML and a metalanguage in its own right. It inherits SGML's core principles while eliminating much of the complexity. XML can be used to define other markup languages, making it a practical and widely adopted tool for data interchange and document definition.
  3. HTML is a markup language defined using SGML. It provides a fixed, predefined set of tags that specify how documents are rendered in a browser. Because HTML's tag set is fixed and its purpose is presentation rather than language definition, HTML is not a metalanguage.

History of SGML

In the late 1960s, as computers began to be used more widely, a group called the Graphic Communications Association (GCA) created a layout language called GenCode. GenCode was designed to provide a standard language for specifying formatting information so that printed documents would look the same regardless of the hardware used to produce them. GenCode was primarily a procedural markup language - it focused on how documents should look, not what they meant.

In 1969, Charles Goldfarb led a group at IBM that built on the GenCode concept and created the Generalized Markup Language (GML). Where GenCode was concerned mainly with appearance, GML strove to define not only the visual presentation of a document but also, to some degree, the structure of its data. This shift toward structural description was a significant step in the evolution of markup languages.

Nearly a decade after GML emerged, the American National Standards Institute (ANSI) established a working committee to build on GML and create a broader standard. Goldfarb was asked to join this effort and has since become known as the "father of SGML." The first public draft of SGML appeared in 1980. The final version of the standard was published in 1986 as ISO 8879.

Since that time SGML has been extended as needed. In 1988, a version of SGML was developed specifically for military applications (MIL-M-28001). Further additions followed over the years. As SGML grew in scope, some in the industry came to feel that it had acquired complexity beyond what most practical applications required - a condition sometimes described as complexity bloat. This perception was a primary motivation for the development of XML as a simpler, more accessible subset.

Two Postulates of Generalized Markup

The design of SGML - and by inheritance, XML - is grounded in two foundational postulates about how markup should behave:

  1. Markup should be declarative. It should describe a document's structure and attributes rather than specify the processing to be performed on it. Declarative markup is less likely to conflict with unforeseen future processing needs and techniques. A document that describes what its content is can be processed in many ways; a document that prescribes how it must be processed is brittle in the face of change.
  2. Markup should be rigorous. The same techniques used for processing formally defined objects such as programs and databases should be applicable to processing documents. Rigorous markup enables validation, automated processing, and reliable interchange between systems.
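
As a rough illustration of the first postulate, the sketch below marks up the same heading in two ways. All element and attribute names here (set-font, skip-lines, chapter-title, and the wrapper elements) are hypothetical and are not taken from GenCode, GML, or any real vocabulary.

```xml
<?xml version="1.0"?>
<!-- Hypothetical element names, for illustration only. -->
<examples>
    <!-- Procedural style: the markup tells a formatter what to do. -->
    <procedural>
        <set-font family="Helvetica" size="14" weight="bold"/>
        <skip-lines count="2"/>
        History of SGML
    </procedural>

    <!-- Declarative style: the markup states what the content is. -->
    <declarative>
        <chapter-title>History of SGML</chapter-title>
    </declarative>
</examples>
```

The declarative form leaves decisions about fonts and spacing to whatever program processes the document, so the same file could be printed, indexed, or converted without changing its markup.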

These two postulates explain why XML, as SGML's successor, emphasizes well-formed and valid documents. A document that is not well-formed cannot be reliably processed. A document that is valid against a defined schema or DTD can be trusted to conform to the rules of its markup language.
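
A minimal sketch of what well-formedness requires, reusing two of the purchase order elements from the example later in this lesson. The comments list the rules this document satisfies and a few changes that would violate them. Validity is a further check against a DTD or schema and is not shown here.

```xml
<?xml version="1.0"?>
<!-- Well-formed: one root element, every start tag has a matching
     end tag, and elements nest without overlapping. -->
<PURCHASE-ORDER>
    <PURCHASE-ORDER-NUMBER>1005</PURCHASE-ORDER-NUMBER>
    <SUPPLIER-NAME>BACON INDUSTRIES</SUPPLIER-NAME>
</PURCHASE-ORDER>
<!-- Changes that would break well-formedness:
     - deleting the </SUPPLIER-NAME> end tag
     - closing PURCHASE-ORDER before SUPPLIER-NAME is closed
     - adding a second root element after </PURCHASE-ORDER> -->
```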


Relationship Between SGML, XML, and HTML

XML is a profile of the ISO SGML standard, and most of XML comes directly from SGML unchanged. HTML and XML both derive from SGML, but they occupy different positions in the family and serve different purposes.

HTML is an application of SGML - a specific markup language defined using SGML's rules, with a fixed tag set designed for presenting documents in a browser. HTML describes how content should look. XML, by contrast, is a metalanguage - it provides the framework for defining new markup languages rather than being one itself. XML describes what content means.

This distinction clarifies a question that often arises: if HTML and XML both derive from SGML, why is one a markup language and the other a metalanguage? The answer lies in their purpose. HTML uses a fixed, predefined set of tags. XML defines the rules by which any set of tags can be created and used. A developer using HTML is constrained to the tags the HTML specification provides. A developer using XML can define whatever tags their document type requires.

XHTML extends this relationship further. XHTML is an application of XML that reformulates HTML using XML syntax. Because XHTML is written in XML, it must be well-formed - a stricter requirement than traditional HTML. In effect, XHTML combines HTML's vocabulary with XML's syntax rules.
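
A minimal sketch of an XHTML 1.0 Strict document, shown only to make the "HTML vocabulary, XML syntax" idea concrete. The page content is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>Purchase Order 1005</title>
    </head>
    <body>
        <!-- HTML vocabulary (p, br), written with XML syntax rules -->
        <p>Supplier: BACON INDUSTRIES<br />
        Order number: 1005</p>
    </body>
</html>
```

Element names are lowercase, every element is explicitly closed, and the empty br element is written as a self-closing tag. Traditional HTML would tolerate an unclosed p or br; an XML parser would reject them.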

SGML itself is a metalanguage for defining markup languages. XML is also a metalanguage, but a simpler one. HTML and XHTML are markup languages - applications of their respective metalanguages, not metalanguages themselves.

Defining Markup Languages with XML

Using XML, a developer can define a set of elements to describe any physical or logical entity - a customer record, a purchase order, a mathematical formula, a chemical compound, or any other structure of interest. Once that set of elements is defined, it can be used to describe many specific instances of that entity.

For example, a purchase order can be described using the following XML element set:


<PURCHASE-ORDER>
    <PURCHASE-ORDER-NUMBER>1005</PURCHASE-ORDER-NUMBER>
    <PURCHASE-ORDER-DATE>01/10/2001</PURCHASE-ORDER-DATE>
    <SUPPLIER-NAME>BACON INDUSTRIES</SUPPLIER-NAME>
</PURCHASE-ORDER>

XML Document Instance

Notice that XML does not use a fixed, predefined set of elements the way HTML does. Each purchase order described using this element set is an XML document instance. The same element set can describe purchase orders 1006, 1007, and beyond - only the values inside the elements change. The structure defined by the element set remains constant across all instances.
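
As a sketch of a second document instance, purchase order 1006 might look like the following. The date and supplier name are made-up sample values and do not come from the lesson.

```xml
<?xml version="1.0"?>
<!-- Same element set as purchase order 1005; only the values differ.
     The date and supplier below are made-up sample values. -->
<PURCHASE-ORDER>
    <PURCHASE-ORDER-NUMBER>1006</PURCHASE-ORDER-NUMBER>
    <PURCHASE-ORDER-DATE>01/17/2001</PURCHASE-ORDER-DATE>
    <SUPPLIER-NAME>EXAMPLE SUPPLY COMPANY</SUPPLIER-NAME>
</PURCHASE-ORDER>
```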

This flexibility is what makes XML a metalanguage rather than a markup language. The element set for purchase orders is itself a markup language - one that was defined using XML. The same approach applies to any domain: medical records, financial transactions, scientific data, or configuration files. Each domain can have its own XML-based markup language, defined to match the structure and semantics of its data.

Defining Rules with a DTD

Once a markup language has been defined using XML, the rules for creating valid documents in that language can be formalized using a Document Type Definition, or DTD. The DTD specifies which elements are allowed, in what order, and with what attributes. It is used to validate XML document instances - to confirm that a document conforms to the rules of its markup language.
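
One possible DTD for the purchase order element set is sketched below, written as an internal subset so the rules and a document instance appear together. The content model (three required children in a fixed order, each holding text only) is an assumption inferred from the sample document, not a rule stated in this lesson.

```xml
<?xml version="1.0"?>
<!DOCTYPE PURCHASE-ORDER [
    <!-- Assumed rule: a purchase order holds exactly these three
         children, in this order. -->
    <!ELEMENT PURCHASE-ORDER (PURCHASE-ORDER-NUMBER,
                              PURCHASE-ORDER-DATE,
                              SUPPLIER-NAME)>
    <!-- Assumed rule: each child holds character data only. -->
    <!ELEMENT PURCHASE-ORDER-NUMBER (#PCDATA)>
    <!ELEMENT PURCHASE-ORDER-DATE   (#PCDATA)>
    <!ELEMENT SUPPLIER-NAME         (#PCDATA)>
]>
<PURCHASE-ORDER>
    <PURCHASE-ORDER-NUMBER>1005</PURCHASE-ORDER-NUMBER>
    <PURCHASE-ORDER-DATE>01/10/2001</PURCHASE-ORDER-DATE>
    <SUPPLIER-NAME>BACON INDUSTRIES</SUPPLIER-NAME>
</PURCHASE-ORDER>
```

Under these declarations, a validating parser would reject a purchase order that omits the date or lists the supplier before the order number. A DTD cannot check that the order number is numeric or that the date follows a particular format, because #PCDATA accepts any text.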

DTDs are one of the mechanisms that make XML suitable for data interchange between organizations. When two systems agree on a DTD, they have agreed on a contract: any document that validates against the DTD is guaranteed to have the structure both systems expect. DTDs will be examined in detail later in this course. The next lesson explores the limitations of HTML and why those limitations motivated the development of XML.

Summary

A metalanguage is a language used to define other languages. SGML, developed through the 1970s and published as an ISO standard in 1986, was the first widely adopted markup metalanguage. XML emerged as a simpler, more practical subset of SGML and inherited its two foundational principles: markup should be declarative, and markup should be rigorous.

SGML, XML, and HTML occupy distinct positions in the metalanguage family. SGML is the complex superset metalanguage. XML is a simplified subset metalanguage that can be used to define new markup languages. HTML is an application of SGML - a markup language with a fixed tag set designed for browser presentation, not a metalanguage. XHTML reformulates HTML using XML syntax, combining HTML's vocabulary with XML's well-formedness rules.

XML's power as a metalanguage is demonstrated by its ability to define domain-specific element sets - such as the purchase order example in this lesson - and validate document instances against those definitions using a DTD. This combination of flexibility and rigor has made XML a widely used foundation for data interchange across many domains of software development.

