| Lesson 2 | Description of Markup Languages |
| Objective | Describe markup languages. |
To understand what XML is and how it works, you must first understand what a markup language is. Markup languages are designed to tell machines — particularly computers — how to process data. The term "markup" derives from early print publishing, where editors would physically mark up manuscripts by hand to indicate to the printer which font size to use, in which weight, and with what form of alignment. The earliest markup languages were dedicated entirely to passing these kinds of formatting instructions.
Markup instructions are generally referred to as tags, and the process of marking up a document is sometimes called tagging. Early word-processing programs required users to perform manual tagging — inserting formatting codes directly into the text. Today, most tagging happens behind the scenes and uses a proprietary system specific to each application.
The divergence of proprietary tagging systems created a significant problem: it became difficult for people and organizations to exchange documents with each other. A document tagged using one system could not be reliably read by an application that expected a different system. With the advent of the Internet, this problem intensified. The ability for authors to interchange documents in a format that was easy to use, yet powerful and aesthetically acceptable, became both more valuable and more urgent.
Markup specifically designed to affect the appearance of a document is called procedural markup. Procedural markup instructs the computer — the browser, in the case of HTML — how to render or display the text. It is concerned with presentation: font weight, size, color, alignment, and similar visual properties.
A straightforward example of procedural markup is the HTML <b> tag.
When a word is wrapped in <b>Summary</b>, the browser receives
an instruction: display this word in bold. The tag says nothing about what "Summary"
means in the context of the document — only how it should look.
Procedural markup works well for simple document formatting, but it breaks down when organizations need to process, search, or exchange large volumes of documents. A government agency managing thousands of policy documents, for example, cares far more about what each section of a document represents than about its font weight.
As a result of the limitations of procedural markup, a different approach emerged: markup that describes the content of a document rather than its appearance. This type of markup is called descriptive, or logical, markup.
Logical markup answers the question what rather than how. Instead of instructing the browser to display text in bold, a logical markup tag identifies the text as a heading, a title, a summary, or an author name. The visual rendering of that element is determined separately — by a stylesheet, a rendering engine, or an application that consumes the document.
Using the same example: <h4>Summary</h4> marks the word
"Summary" as a fourth-level subheading. It makes a structural and semantic claim about
the content. A different stylesheet could render that heading in any font, size, or color
— or not render it visually at all, if the document is being processed by software rather
than displayed to a human reader.
A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text itself. The terminology evolved from the editorial practice of marking up paper manuscripts — revision instructions traditionally written in blue pencil on an author's draft before it went to the printer.
In digital media, blue-pencil instructions were replaced by tags. Instructions are expressed either directly by tags or as instruction text encapsulated within tags. Historical examples include typesetting systems such as troff, TeX, and LaTeX. Modern examples include the structural markers used in XML and HTML. In every case, the markup instructs the software that displays or processes the text to carry out appropriate actions — and the markup itself is omitted from the version of the text that end users see.
Some markup languages, such as HTML, have pre-defined presentation semantics. Their specification prescribes exactly how structured data should be presented. HyperText Markup Language (HTML), one of the document formats of the World Wide Web, is an instance of SGML (Standard Generalized Markup Language) and follows many of the markup conventions established in the publishing industry for communication between authors, editors, and printers.
XML documents use markup to describe the semantics of the data they contain.
Unlike HTML, XML has no predefined tags. Instead, a set of tags is created specifically
for each type of document. A document representing a library catalog uses tags like
<book>, <author>, and <isbn>.
A document representing a purchase order uses tags like <order>,
<item>, and <quantity>.
Because XML provides the rules for defining tags rather than defining the tags themselves, it operates at a higher level than a conventional markup language. In that sense, XML is a metalanguage — a language used to define other languages. The metalanguage aspect of XML is explored in detail in Lesson 3.
To appreciate why markup languages matter, it helps to understand what came before them — and what still exists alongside them. Binary files represent a fundamentally different approach to storing data.
At its simplest, a binary file is a stream of bits — 1s and 0s. It is entirely up to the
application that created the binary file to interpret what those bits mean. Binary files
can only be read and produced by programs specifically written to understand their
internal format. A document saved in Microsoft Word prior to 2003, for example, uses a
binary format with a .doc extension. Opening that file in a plain text
editor such as Notepad reveals garbled output — occasional lines of readable text
surrounded by binary code. The document's prose is present, but unreadable without the
application that knows how to decode the format.
In a binary file, data and metadata are mixed together. Metadata specifies which words should appear in bold, what text belongs in a table, what font is applied to a heading. That metadata is encoded in the binary structure of the file. Interpreting it requires either the originating application or a converter with deep knowledge of the underlying binary format. Without that knowledge, a document created in Word cannot be reliably opened in WordPerfect, and vice versa.
Binary formats do have a genuine advantage: they are concise. Binary encoding can represent the same information in a much smaller file than a text-based format. This means more files can be stored in a given space and, importantly, less bandwidth is consumed when transferring files across a network. In the early days of computing, when storage and bandwidth were scarce, this advantage was decisive.
As networks grew and organizations began exchanging documents across organizational and platform boundaries, the limitations of binary formats became more costly than their size advantages. A file that only one application can read is not portable. A format that mixes data and metadata in an undocumented binary structure is not interoperable. These limitations created the conditions that made markup languages — and ultimately XML — necessary.
Markup languages evolved from the physical practice of marking up paper manuscripts with editorial instructions. In digital systems, markup instructions take the form of tags that annotate a document in a way that is distinguishable from the document's text.
Two types of markup emerged to serve different needs. Procedural markup instructs the display software how to render text — it is concerned with appearance. Logical markup describes what content is — it is concerned with meaning and structure. Organizations processing large volumes of documents found logical markup far more useful, because knowing what data represents matters more than knowing how it looks.
XML extends the concept of logical markup into a metalanguage — a language for defining markup languages. Unlike HTML, which provides a fixed set of tags with predefined meanings, XML allows developers to define their own tags appropriate to each document type. This flexibility, combined with XML's human-readable text format, made it the foundational technology for data interchange in distributed systems. The next lesson explores the metalanguage concept in greater depth.