| Lesson 5 | XML Defined |
| Objective | Define XML Intelligence. |
XML enables users to create documents that contain more specific information about content than HTML ever could, adding a level of intelligence to those documents. An intelligent document is one that describes not only how its content should be displayed, but what that content actually represents. XML was designed from the ground up to serve this purpose.
HTML began with a similar ambition. Tim Berners-Lee's original vision for the Web was built on well-structured documents readable by a universal client. Over time, however, HTML drifted from structural description toward visual presentation. Browser vendors added proprietary formatting tags. Authors used heading tags for their visual effect rather than their structural meaning. The distinction between content and presentation collapsed. Defining document structure alone proved insufficient. To fully exploit the potential of evolving web technologies, documents must define not only their structure but their actual content as well. XML was created to restore and extend that original vision.
The W3C responded to HTML's drift toward presentation with a formal proposal to separate formatting from structure. The result was Cascading Style Sheets. CSS Level 1 (CSS1) was the first version of this effort, published as a W3C Recommendation in 1996. CSS2, which supersedes and extends CSS1, followed as a W3C Recommendation in 1998.
CSS establishes a clear rule: all formatting should be defined externally to the document content, either in a separate stylesheet file, in a STYLE section within an HTML page, or as values for a STYLE attribute on individual tags. Tags such as <CENTER> and <FONT> are deprecated by the W3C in favor of stylesheets. Authors are strongly discouraged from using deprecated presentational elements in new code.
The intended workflow separates responsibilities cleanly. HTML provides structural features - paragraphs, lists, headings, semantic elements - and avoids presentational features such as font changes and layout hints. CSS formats the document based on its structural properties. Well-designed class attributes in the HTML extend the semantics of the structural markup, giving CSS more precise hooks for flexible formatting. Assistive technologies can substitute or extend the CSS to modify presentation for accessibility purposes, or ignore the CSS entirely and interact directly with the structural encoding of the document.
This separation of format from structure is a step toward XML's broader goal: a document whose meaning is independent of its visual presentation. CSS achieves this for HTML by moving formatting to an external stylesheet. XML takes the principle further by removing built-in presentation entirely - an XML document contains only data and structure, leaving presentation to be defined by whatever technology consumes the document.
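The idea that an XML document carries only data and structure, with presentation chosen by the consumer, can be sketched in a few lines of Python. The element names (`film`, `title`, `director`, `year`) and both rendering functions are hypothetical, chosen only for illustration; the sketch relies on the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical document: the element names are illustrative, not a standard vocabulary.
DOC = """<film>
  <title>Metropolis</title>
  <director>Fritz Lang</director>
  <year>1927</year>
</film>"""


def as_plain_text(film):
    """Render the record as one line of text."""
    return (f"{film.findtext('title')} ({film.findtext('year')}), "
            f"directed by {film.findtext('director')}")


def as_html(film):
    """Render the same record as an HTML fragment, escaping the text content."""
    return (f"<h1>{escape(film.findtext('title'))}</h1>"
            f"<p>{escape(film.findtext('director'))}, {escape(film.findtext('year'))}</p>")


film = ET.fromstring(DOC)
print(as_plain_text(film))  # Metropolis (1927), directed by Fritz Lang
print(as_html(film))        # <h1>Metropolis</h1><p>Fritz Lang, 1927</p>
```

Nothing in the XML itself says how the film should look. Each consumer decides that independently, which is the property the paragraph above describes.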
XML is a W3C recommendation developed by the XML Working Group and the XML Special Interest Group. The recommendation defines an XML document as a text document composed of storage units called entities. Entities may contain parsed or unparsed data. XML provides a formal mechanism for imposing constraints on both the storage layout and the logical structure of a document. These constraints take two forms: well-formedness, which every XML document must satisfy, and validity, which applies when a document is checked against a DTD or schema.
These constraints are what make XML documents reliably processable by machines. A parser reading a valid XML document can trust the document's structure completely. Both constraint types are examined in detail later in this course.
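The difference between well-formedness and validity can be shown with a short Python sketch. Python's standard `xml.etree.ElementTree` parser checks well-formedness only; it does not validate against a DTD or schema. The `REQUIRED_CHILDREN` rule below is therefore a hand-written, hypothetical stand-in for a real DTD, and the `address` vocabulary is assumed for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML (tags nest and close properly)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical content rule standing in for a DTD or schema:
# an <address> must contain exactly these children, in this order.
REQUIRED_CHILDREN = ["street", "city", "postcode"]


def is_valid_address(text):
    """Return True if the text is well formed AND matches the assumed rule."""
    if not is_well_formed(text):
        return False
    root = ET.fromstring(text)
    return root.tag == "address" and [child.tag for child in root] == REQUIRED_CHILDREN


broken = "<address><street>1 Main St</address>"  # <street> is never closed
incomplete = "<address><street>1 Main St</street></address>"
complete = ("<address><street>1 Main St</street>"
            "<city>Springfield</city><postcode>12345</postcode></address>")

for name, text in [("broken", broken), ("incomplete", incomplete), ("complete", complete)]:
    print(name, is_well_formed(text), is_valid_address(text))
```

The `incomplete` document is well formed but fails the content rule, which is the situation a validating parser would report as a validity error.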
XML enhances web intelligence by enabling the creation of structured, semantically rich documents that machines can parse and interpret with precision. When a document's content is described by meaningful tags rather than visual formatting instructions, software can extract specific information, validate it against a schema, transform it into other formats, and combine it with data from other sources - all automatically.
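A minimal sketch of the extract, transform, and combine steps might look like the following. The two documents, their element and attribute names (`catalog`, `product`, `sku`, `stock`, `item`, `quantity`), and the `merge_catalog_and_stock` helper are all hypothetical; the point is only that meaningful tags let a program pull out specific values and join them with data from a second source.

```python
import json
import xml.etree.ElementTree as ET

# Two hypothetical sources describing the same products.
CATALOG = """<catalog>
  <product sku="A1"><name>Kettle</name><price currency="USD">24.50</price></product>
  <product sku="B2"><name>Toaster</name><price currency="USD">39.00</price></product>
</catalog>"""

STOCK = """<stock>
  <item sku="A1" quantity="12"/>
  <item sku="B2" quantity="0"/>
</stock>"""


def merge_catalog_and_stock(catalog_xml, stock_xml):
    """Extract fields from both documents and combine them by SKU."""
    quantities = {item.get("sku"): int(item.get("quantity"))
                  for item in ET.fromstring(stock_xml).iter("item")}
    merged = []
    for product in ET.fromstring(catalog_xml).iter("product"):
        sku = product.get("sku")
        merged.append({
            "sku": sku,
            "name": product.findtext("name"),
            "price": float(product.findtext("price")),
            "in_stock": quantities.get(sku, 0) > 0,
        })
    return merged


records = merge_catalog_and_stock(CATALOG, STOCK)
# Transform the combined records into another format (JSON here).
print(json.dumps(records, indent=2))
```

Because each value is labeled by its element or attribute name, the program never has to guess which number is a price and which is a quantity.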
This capability has practical consequences across multiple domains. Search engines can match queries to content with greater precision when the content is semantically tagged. Recommendation systems can identify relationships between products, people, and preferences when those relationships are expressed in structured data. Custom markup languages built on XML can represent domain-specific information in ways that are understood by both human authors and the machines that process their documents.
The Web itself has evolved into a system where human knowledge and machine processing are increasingly intertwined. XML functions behind the scenes in this system, providing machines with structured information about documents that enables automated decision making. A document that tells a machine not just what text to display but what that text represents - a film title, a drug compound, a delivery address, a financial instrument - is a document that a machine can act on intelligently.
To understand the practical scope of XML intelligence, consider the kinds of questions that semantically structured data enables a machine to answer:
Each of these questions requires a system that can locate relevant data, understand what that data represents, and combine information from multiple sources to produce a useful answer. None of these tasks is practical when data is buried in unstructured HTML formatted for visual display. All of them become tractable when data is expressed in XML documents whose elements describe what the content means rather than how it looks.
Systems or computer programs that sense their environment, act on what they sense, and in more sophisticated cases learn from it are called intelligent agents. These agents range from simple rule-based systems - a smoke detector that triggers an alarm when it detects combustion products - to complex autonomous systems that make decisions across large datasets in real time.
Intelligent agents have applications across virtually every domain. An ambient intelligence agent monitors its environment and adjusts conditions automatically. A market analysis agent mines stock price trends and signals trading opportunities. A negotiation bot participates in online auctions on behalf of a user. A virtual purchasing agent buys products based on a user's preferences and purchase history.
The explosion of content on the web presents four specific challenges that intelligent agents are well-positioned to address:
Intelligent agents draw their data from social networks, blogs, transaction histories, and sensor feeds. Their effectiveness depends on the quality and structure of that data. XML-encoded data, with its explicit semantic tags and well-defined structure, is significantly more useful to an intelligent agent than unstructured text or visually formatted HTML. An agent that can parse a structured XML document knows exactly what each piece of data represents and can act on it without ambiguity.
Link analysis emerged as a technique for improving information retrieval in web search. By analyzing the links between pages - which pages link to which, and how those links are distributed - search engines can infer the relative authority and relevance of pages without relying solely on the text they contain. This approach proved effective at improving search quality and spawned significant interest in the mathematical techniques underlying it.
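The lesson does not name a specific algorithm, so the following is only a sketch of the general idea, using a simplified PageRank-style iteration as an assumed technique. The `rank_pages` function, the damping value of 0.85, and the four-page link graph are all illustrative choices rather than a description of how any particular search engine works.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical four-page site: nothing links to "orphan".
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}

scores = rank_pages(LINKS)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

In this graph `home` receives links from every other page and ends up with the highest score, while `orphan`, which nothing links to, keeps only the baseline. No page text is consulted at any point.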
Over time, link analysis became susceptible to manipulation. Search engine optimizers discovered that creating large numbers of inbound links to a page could inflate its apparent authority. The ongoing contest between search engines seeking to preserve the integrity of their results and adversaries seeking to exploit their ranking algorithms continues to drive innovation in information retrieval.
Link analysis is one application of the broader field of social network analysis. The concept traces back to Stanley Milgram's experiment in the 1960s, which suggested that two randomly chosen people in the United States could be connected through a chain of roughly six acquaintances - the basis of the popular "six degrees of separation" idea. Social network analysis has since expanded well beyond human relationships. Studies of phone call networks, email communication patterns, and eCommerce recommendation systems all apply similar mathematical techniques to different types of connected data.
eCommerce recommendation systems are a particularly visible application. By analyzing the connections between customers and the products they purchase or rate, a recommendation system can predict which products a new customer is likely to find interesting - based not on that customer's history alone, but on the collective behavior of all customers with similar profiles. This kind of inference requires structured, queryable data. XML provides a format in which those relationships can be expressed, stored, and processed reliably.
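As a rough sketch of this kind of inference, the example below reads purchase records from XML and recommends products bought by customers with overlapping baskets. The `purchases` vocabulary, the `load_baskets` and `recommend` helpers, and the overlap-count scoring are assumptions made for illustration; production recommendation systems use considerably more sophisticated models.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical purchase history; element names are illustrative.
PURCHASES = """<purchases>
  <customer id="c1"><product>tent</product><product>stove</product></customer>
  <customer id="c2"><product>tent</product><product>stove</product><product>lantern</product></customer>
  <customer id="c3"><product>tent</product><product>lantern</product></customer>
  <customer id="c4"><product>novel</product></customer>
</purchases>"""


def load_baskets(xml_text):
    """Map each customer id to the set of products that customer bought."""
    root = ET.fromstring(xml_text)
    return {customer.get("id"): {p.text for p in customer.iter("product")}
            for customer in root.iter("customer")}


def recommend(baskets, customer_id, top_n=3):
    """Suggest products bought by customers whose baskets overlap with this one."""
    own = baskets[customer_id]
    votes = Counter()
    for other_id, other in baskets.items():
        if other_id == customer_id:
            continue
        overlap = len(own & other)
        if overlap == 0:
            continue  # customers with nothing in common contribute nothing
        for product in other - own:
            votes[product] += overlap
    return [product for product, _ in votes.most_common(top_n)]


baskets = load_baskets(PURCHASES)
print(recommend(baskets, "c1"))  # ['lantern']
print(recommend(baskets, "c4"))  # []
```

Customer `c1` is offered the lantern because two customers with similar baskets bought one. Customer `c4` shares nothing with anyone, so there is no basis for a suggestion.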
The concept of intelligent content extends XML's semantic capabilities into media beyond text. Intelligent content enables the creation and reuse of complex media by creators who need not understand the technical details of the tools they use. When content carries semantic metadata describing what it depicts, retrieval systems can find it based on meaning rather than filename or manual keyword tags.
A straightforward example illustrates the difference. A basic content retrieval system given an image of a panda can find other images tagged "panda." A semantically enabled retrieval system can go further: given the sound of a panda eating bamboo, it can find images of a panda eating bamboo - because the content's associated metadata describes not just the subject but the action and context. The richer the semantic description, the more precise and useful the retrieval.
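The panda example can be approximated in code once each media asset carries a semantic description. In the sketch below, the `library` vocabulary, the three description fields (`subject`, `action`, `object`), and the `find_matching` helper are hypothetical, and the metadata for the audio clip is assumed to exist already; recognizing what a sound depicts is a separate problem that this sketch does not address.

```python
import xml.etree.ElementTree as ET

# Hypothetical media library; each asset carries semantic metadata.
LIBRARY = """<library>
  <asset id="img-001" type="image"><subject>panda</subject><action>eating</action><object>bamboo</object></asset>
  <asset id="img-002" type="image"><subject>panda</subject><action>sleeping</action></asset>
  <asset id="snd-001" type="audio"><subject>panda</subject><action>eating</action><object>bamboo</object></asset>
  <asset id="img-003" type="image"><subject>goat</subject><action>eating</action><object>grass</object></asset>
</library>"""

FIELDS = ("subject", "action", "object")


def describe(asset):
    """Collect the semantic fields that are present on an asset."""
    return {f: asset.findtext(f) for f in FIELDS if asset.findtext(f) is not None}


def find_by_subject(library_xml, subject, wanted_type):
    """Basic retrieval: match on the subject tag alone."""
    root = ET.fromstring(library_xml)
    return [a.get("id") for a in root.iter("asset")
            if a.get("type") == wanted_type and a.findtext("subject") == subject]


def find_matching(library_xml, query_id, wanted_type):
    """Semantic retrieval: match the query asset's full description."""
    root = ET.fromstring(library_xml)
    assets = {a.get("id"): a for a in root.iter("asset")}
    query = describe(assets[query_id])
    return [asset_id for asset_id, a in assets.items()
            if asset_id != query_id
            and a.get("type") == wanted_type
            and describe(a) == query]


print(find_by_subject(LIBRARY, "panda", "image"))   # ['img-001', 'img-002']
print(find_matching(LIBRARY, "snd-001", "image"))   # ['img-001']
```

The subject-only search returns every panda image, while the search driven by the audio clip's full description returns only the image showing the same action and context.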
For production companies managing large media libraries, this capability has significant practical value. Finding existing footage of a specific animal performing a specific action, or identifying scenes from past productions that match a new production's requirements, becomes a structured search problem rather than a manual review task. Intelligent content makes that search tractable.
The next lesson examines the goals that the XML creators had in mind when designing the specification.
XML intelligence refers to the capacity of XML documents to describe not just how content should be displayed, but what that content actually represents. This semantic capability distinguishes XML from HTML, which uses a fixed tag set focused on visual presentation rather than data meaning.
The separation of format from structure, formalized through CSS, was the W3C's response to HTML's drift toward presentation. XML extends this principle by removing built-in presentation entirely. As a W3C recommendation, XML defines formal constraints - well-formedness and validity - that make documents reliably processable by machines.
The applications of XML intelligence span search engines, recommendation systems, intelligent agents, link analysis, and semantically enabled media retrieval. In each case, the underlying requirement is the same: structured data whose meaning is explicit, whose elements are defined by domain-specific vocabularies, and whose constraints are enforced by schemas and DTDs. XML provides the framework that makes all of this possible.