Querying
XML
XQuery, XPath, and SQL/XML
in
Context
The Morgan Kaufmann Series in Data Management Systems
Series Editor: Jim Gray, Microsoft Research
Querying XML: XQuery, XPath, and SQL/XML in
Context
Jim Melton and Stephen Buxton
Data Mining: Concepts and Techniques, Second
Edition
Jiawei Han
and Micheline Kamber
Database Modeling and Design: Logical Design,
Fourth Edition
Toby J. Teorey, Sam S. Lightstone and Thomas E
Nadeau
Foundations of Multidimensional and Metric Data
Structures
Hanan
Samet
Joe Celko's SQL for Smarties: Advanced SQL
Programming, Third Edition
Joe Celko
Moving Objects Databases
Ralf Hartmut Güting and Markus
Schneider
Joe Celko's SQL Programming Style
Joe Celko
Data Mining, Second Edition: Concepts and
Techniques
Ian Witten and Eibe Frank
Fuzzy Modeling and Genetic Algorithms for Data
Mining and Exploration
Earl Cox
Data Modeling Essentials, Third Edition
Graeme C. Simsion and Graham C. Witt
Transactional Information Systems: Theory,
Algorithms, and Practice of Concurrency Control and
Recovery
Gerhard Weikum and Gottfried Vossen
Spatial Databases: With Application to GIS
Philippe Rigaux, Michel Scholl, and Agnes Voisard
Information Modeling and Relational Databases:
From Conceptual Analysis to Logical Design
Terry Halpin
Component Database Systems
Edited by Klaus R. Dittrich and Andreas Geppert
Managing Reference Data in Enterprise Databases:
Binding Corporate Data to the Wider World
Malcolm Chisholm
Understanding SQL and Java Together: A Guide to
SQLJ, JDBC, and Related Technologies
Jim Melton and Andrew Eisenberg
Database: Principles, Programming, and Performance,
Second Edition
Patrick and Elizabeth O'Neil
The Object Data Standard: ODMG 3.0
Edited by R. G. G. Cattell and Douglas K. Barry
Data on the Web: From Relations to Semistructured
Data and XML
Serge
Abiteboul, Peter Buneman, and Dan Suciu
Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations
Ian Witten and Eibe Frank
Understanding SQL's Stored Procedures: A Complete
Guide to SQL/PSM
Jim Melton
Principles of Multimedia Database Systems
V. S. Subrahmanian
Principles of Database Query Processing for Advanced
Applications
Clement T. Yu and Weiyi Meng
Advanced Database Systems
Carlo Zaniolo, Stefano Ceri, Christos Faloutsos,
Richard T. Snodgrass, V. S. Subrahmanian,
and
Roberto Zicari
Principles of Transaction Processing
Philip A. Bernstein and Eric Newcomer
Using the New DB2: IBM's Object-Relational
Database System
Don Chamberlin
Distributed Algorithms
Nancy A. Lynch
Active Database Systems: Triggers and Rules For
Advanced Database Processing
Edited by Jennifer Widom and Stefano Ceri
Migrating Legacy Systems: Gateways, Interfaces, & the
Incremental Approach
Michael L. Brodie and Michael Stonebraker
Atomic Transactions
Nancy Lynch, Michael Merritt, William Weihl, and
Alan Fekete
Location-Based Services
Jochen Schiller and Agnès Voisard
Database Modeling with Microsoft Visio for Enterprise
Architects
Terry Halpin, Ken Evans, Patrick Hallock, Bill
Maclean
Designing Data-Intensive Web Applications
Stefano Ceri, Piero Fraternali, Aldo Bongio,
Marco Brambilla, Sara Comai, and Maristella
Matera
Mining the Web: Discovering Knowledge from
Hypertext Data
Soumen Chakrabarti
Advanced SQL:1999 - Understanding Object-
Relational and Other Advanced Features
Jim Melton
Database Tuning: Principles, Experiments, and
Troubleshooting Techniques
Dennis Shasha and Philippe Bonnet
SQL:1999 - Understanding Relational Language
Components
Jim Melton and Alan R. Simon
Information Visualization in Data Mining and
Knowledge Discovery
Edited by Usama Fayyad, Georges G. Grinstein,
and Andreas
Wierse
Joe Celko's SQL for Smarties: Advanced SQL
Programming, Second Edition
Joe Celko
Joe Celko's Data and Databases: Concepts in Practice
Joe Celko
Developing Time-Oriented Database Applications in
SQL
Richard T. Snodgrass
Web Farming for the Data Warehouse
Richard D. Hackathorn
Management of Heterogeneous and Autonomous
Database Systems
Edited by Ahmed Elmagarmid, Marek
Rusinkiewicz, and Amit Sheth
Object-Relational DBMSs: Tracking the Next Great
Wave, Second Edition
Michael Stonebraker and Paul Brown, with Dorothy
Moore
A Complete Guide to DB2 Universal Database
Don Chamberlin
Universal Database Management: A Guide to Object/
Relational Technology
Cynthia Maro Saracco
Readings in Database Systems, Third Edition
Edited by Michael Stonebraker and Joseph M.
Hellerstein
Query Processing for Advanced Database Systems
Edited by Johann Christoph Freytag, David Maier,
and Gottfried Vossen
Transaction Processing: Concepts and Techniques
Jim Gray and Andreas Reuter
Building an Object-Oriented Database System: The
Story of O2
Edited by François Bancilhon, Claude Delobel, and
Paris
Kanellakis
Database Transaction Models for Advanced
Applications
Edited by Ahmed K. Elmagarmid
A Guide to Developing Client/Server SQL
Applications
Setrag Khoshafian, Arvola Chan, Anna Wong, and
Harry K. T. Wong
The Benchmark Handbook for Database and
Transaction Processing Systems, Second Edition
Edited by Jim Gray
Camelot and Avalon: A Distributed Transaction
Facility
Edited by Jeffrey L. Eppinger, Lily B. Mummert,
and Alfred Z. Spector
Readings in Object-Oriented Database Systems
Edited by Stanley B. Zdonik and David
Maier
Querying XML
XQuery, XPath, and SQL/XML
in Context
Jim Melton
and
Stephen Buxton
ELSEVIER
Amsterdam • Boston
Heidelberg • London
New York • Oxford • Paris
San Diego • San Francisco
Singapore • Sydney • Tokyo
MORGAN KAUFMANN PUBLISHERS
Publisher                      Diane Cerra
Publishing Services Manager    Simon Crump
Editorial Assistant            Asma Stephan
Cover Design                   Ross Carron Design
Cover Image                    © Javier Pierini/Digital Images/Getty Images
Composition                    Multiscience Press
Technical Illustration         Dartmouth Publishing, Inc.
Copyeditor                     Elliot Simon
Proofreader                    Jacqui Brownstein
Indexer                        Northwind Editorial Services
Interior printer               Maple-Vail Book Manufacturing Group
Cover printer                  Phoenix Color
Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
© 2006 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered
trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear
in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more
complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the
publisher.
Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK:
phone: (+44) 1865 843830, fax: (+44) 1865 853333,
e-mail: You may also complete your request on-line via the Elsevier homepage
() by selecting "Customer Support" and then "Obtaining Permissions."
Library of Congress Cataloging-in-Publication Data
Application submitted
ISBN 13: 978-1-55860-711-8
ISBN 10: 1-55860-711-0
For information on all Morgan Kaufmann publications,
visit our Web site at www.mkp.com or www.books.elsevier.com
Printed in the United States of America
06 07 08 09 10 5 4 3 2 1
To rescued Shelties, and Shelties in need of rescue, everywhere. Especially
to senior Shelties who, after years of devotion to their owners, are cruelly
discarded for the most pathetic of reasons: "We're thinking about
moving", "She's just in the way", "He's too old to be fun any more", and
the worst of all - "We're getting a puppy and, you know ". And to the
loving people who welcome these old dogs into their lives, knowing that
older Shelties are calmer, settled, cuddly, and devoted - they selflessly
deal with medical needs, arthritic limitations, and the piddles of old age.
Wonderful karma accrues to these people for giving these seniors love
and respect, allowing them to live out their lives in comfort and
happiness.
Jim
To my Mum and Dad, for their long, long journey.
Stephen
Contents
Foreword
Preface
Why the subject matter is important
Why we wrote this book
Who should read this book
How the book is organized
The example we're using
Syntax Conventions
Additional resources
Type conventions
Acknowledgements
Part I XML: Documents and Data
Chapter 1 XML
1.1 Introduction
1.2 Adding Markup to Data
1.2.1 Raw Data
1.2.2 Separating Fields
1.2.3 Grouping Fields Together
1.2.4 Naming Fields
1.2.5 A Structural Map of the Data
1.2.6 Markup and Meaning
1.2.7 Why XML?
1.3 XML-Based Markup Languages
1.4 XML Data
1.4.1 Structured Data
1.4.2 Unstructured Data
1.4.3 Messages
1.4.4 XML Data - Summary
1.5 Some Other Ways to Represent Data
1.5.1 SQL - Structure Only
1.5.2 Presentation Languages - Presentation Only
1.5.3 SGML
1.5.4 HTML
1.6 Chapter Summary
Chapter 2 Querying
2.1 Introduction
2.1.1 Definitions of Query
2.2 Querying Traditional Data
2.2.1 The Relational Model and SQL
2.2.2 Extensions to SQL
2.2.3 Querying Traditional Data - Summary
2.3 Querying Nontraditional Data
2.3.1 Metadata
2.3.2 Objects
2.3.3 Markup
2.3.4 Querying Content
2.4 Chapter Summary
Chapter 3 Querying XML
3.1 Introduction
3.2 Navigating an XML Document
3.2.1 Walking the XML Tree
3.2.2 Some Additional Wrinkles
3.2.3 Summary - Things to Consider
3.3 What Do You Know about Your Data?
3.4 Some Ways to Query XML Today
3.5 Chapter Summary
Part II Metadata and XML
Chapter 4 Metadata - An Overview
4.1 Introduction
4.2 Structural Metadata
4.3 Semantic Metadata
4.4 Catalog Metadata
4.5 Integration Metadata
4.6 Chapter Summary
Chapter 5 Structural Metadata
5.1 Introduction
5.2 DTDs
5.2.1 SGML Heritage
5.2.2 Relatively Simple, Easy to Write, and Easy to Read
5.2.3 Limited Capabilities, Especially with Respect to Data Types
5.2.4 An Example Document and DTD
5.3 XML Schema
5.3.1 Exploring an XML Schema
5.3.2 Simple Types (Primitive Types and Derived Types)
5.3.3 Complex Types and Structures
5.4 Other Schema Languages for XML
5.4.1 RELAX NG
5.4.2 Schematron
5.4.3 Decisions, Decisions, Decisions
5.5 Deriving an Implied Schema from a DTD
5.6 Chapter Summary
Chapter 6 The XML Information Set (Infoset) and Beyond
6.1 Introduction
6.2 What Is the Infoset?
6.3 The Infoset Information Items and Their Properties
6.4 The Infoset vs. the Document
6.5 The XPath 1.0 Data Model
6.6 The Post-Schema-Validation Infoset (PSVI)
6.6.1 Infoset + Additional Properties and Information Items
6.6.2 Additional Information in the PSVI
6.6.3 Limitations of the PSVI
6.6.4 Visualizing the PSVI
6.7 The Document Object Model (DOM) - An API
6.8 Introducing the XQuery Data Model
6.9 A Note Regarding Data Model Terminology
6.10 Chapter Summary and Further Reading
Part III Managing and Storing XML for Querying
Chapter 7 Managing XML: Transforming and Connecting
7.1 Introduction
7.2 Transforming, Formatting, and Displaying XML
7.2.1 Extensible Stylesheet Language Transformations (XSLT)
7.2.2 Extensible Stylesheet Language: Formatting Objects (XSL FO)
7.3 The Relationships between XML Documents
7.3.1 XML Inclusions (XInclude)
7.3.2 XML Pointer Language (XPointer)
7.3.3 XML Linking Language (XLink)
7.4 Relationship Constraints: Enforcing Consistency
7.5 Chapter Summary
Chapter 8 Storing: XML and Databases
8.1 Introduction
8.2 The Need for Persistence
8.2.1 Databases
8.2.2 Other Persistent Media
8.2.3 Shredding Your Data
8.3 SQL/XML's XML Type
8.4 Accessing Persistent XML Data
8.5 XML on the Fly: Nonpersistent XML Data
8.6 Chapter Summary
Part IV Querying XML
Chapter 9 XPath 1.0 and XPath 2.0
9.1 Introduction
9.2 XPath 1.0
9.2.1 Expressions
9.2.2 Contexts
9.2.3 Paths and Steps
9.2.4 Axes and Shorthand Notations
9.2.5 Node Tests
9.2.6 Predicates
9.2.7 XPath Functions
9.2.8 Putting the Pieces Together
9.3 XPath 2.0 Components
9.3.1 Expressions
9.3.2 The for and return Expressions
9.4 XPath 2.0 and XQuery 1.0
9.5 Chapter Summary
Chapter 10 Introduction to XQuery 1.0
10.1 Introduction
10.2 A Brief History
10.3 Requirements
10.3.1 General Requirements for XQuery
10.3.2 Data Model Requirements
10.3.3 XQuery Functionality Requirements
10.3.4 XPath 2.0 Requirements
10.4 Use Cases
10.5 The XQuery 1.0 Suite of Specifications
10.5.1 XQuery 1.0 Language Specification
10.5.2 XPath 2.0 and XQuery 1.0 Formal Semantics
10.5.3 XPath 2.0 and XQuery 1.0 Functions & Operators
10.5.4 XQuery 1.0 Serialization
10.5.5 XQueryX
10.6 The Data Model
10.6.1 Data Model Instances
10.6.2 What Is an XQuery Data Model Instance?
10.6.3 The Seven Kinds of Nodes
10.6.4 The Data Model as Tree - Representing a Well-Formed Document
10.6.5 The Data Model as Sequence - Representing an Arbitrary Sequence
10.7 The XQuery Type System
10.7.1 What Is a Type System Anyway?
10.7.2 XML Schema Types
10.7.3 From XML Schema to the XQuery Type System
10.7.4 Types and Queries
10.8 XQuery 1.0 Formal Semantics and Static Typing
10.8.1 Notations
10.8.2 Static Typing
10.8.3 Dynamic Semantics
10.9 Functions and Operators
10.9.1 Functions
10.9.2 Operators
10.10 XQuery 1.0 and XSLT 2.0 Serialization
10.10.1 XML Output Method
10.10.2 XHTML Output Method
10.10.3 HTML Output Method
10.10.4 Text Output Method
10.11 Chapter Summary
Chapter 11 XQuery 1.0 Definition
11.1 Introduction
11.2 Overview of XQuery
11.2.1 Concepts
11.3 The XQuery Processing Model
11.3.1 The Static Context
11.3.2 The Dynamic Context
11.4 The XQuery Grammar
11.5 XQuery Expressions
11.5.1 Literal Expressions
11.5.2 Constructor Functions
11.5.3 Sequence Constructors
11.5.4 Variable References
11.5.5 Parenthesized Expressions
11.5.6 Context Item Expression
11.5.7 Function Calls
11.5.8 Filter Expressions
11.5.9 Node Sequence-Combining Expressions
11.5.10 Arithmetic Expressions
11.5.11 Boolean Expressions: Comparisons and Logical Operators
11.5.12 Constructors - Direct and Computed
11.5.13 Ordered and Unordered Expressions
11.5.14 Conditional Expression
11.5.15 Quantified Expressions
11.5.16 Expressions on XQuery Types
11.5.17 Validation Expression
11.6 FLWOR Expressions
11.6.1 The for Clause and the let Clause
11.6.2 The where Clause
11.6.3 The order by Clause
11.6.4 The return Clause
11.7 Error Handling
11.8 Modules and Query Prologs
11.8.1 Prologs
11.8.2 Main Modules
11.8.3 Library Modules
11.9 A Longer Example with Data
11.10 XQuery for SQL Programmers
11.11 Chapter Summary
Chapter 12 XQueryX
12.1 Introduction
12.2 How Far to Go?
12.2.1 Trivial Embedding
12.2.2 Fully-Parsed XQuery
12.2.3 The XQueryX Approach
12.3 The XQueryX Specification
12.4 XQueryX By Example
12.4.1 The Simplest XQueryX Example - 42
12.4.2 Simple XQueryX Example
12.4.3 Useful XQuery Example
12.5 Querying XQueryX
12.5.1 Querying XQueryX for XQuery Tuning
12.5.2 Querying XQueryX for Application Improvement
12.6 Chapter Summary
Chapter 13 What's Missing?
13.1 Introduction
13.2 Full-Text
13.2.1 What Is a Full-Text Query?
13.2.2 Full-Text and XML
13.2.3 Defining XQuery Full-Text
13.2.4 W3C XQuery Full-Text - Grammar Extension
13.2.5 W3C XQuery Full-Text - Some Discussion Topics
13.2.6 XQuery Full-Text - Some Implementations
13.3 Update
13.3.1 Motivation: Where/Why We Need Update
13.3.2 Requirements
13.3.3 Alternatives: Syntax and Semantics
13.3.4 How Products Handle Update Today
13.3.5 What Lies Ahead?
13.4 Chapter Summary
Chapter 14 XQuery APIs
14.1 Introduction
14.2 Alphabet-Soup Review
14.2.1 ODBC and JDBC
14.2.2 DOM, SAX, StAX, JAXP, JAXB
14.2.3 Alphabet-Soup Summary
14.3 XQJ - XQuery for Java
14.3.1 Connecting to a Data Source
14.3.2 Executing a Query
14.3.3 Manipulating XML Data
14.3.4 Static and Dynamic Context
14.3.5 Metadata
14.3.6 Summary
14.4 SQL/XML
14.5 Looking Ahead
Chapter 15 SQL/XML
15.1 Introduction
15.2 SQL/XML Publishing Functions
15.2.1 Examples
15.2.2 XMLAGG
15.2.3 XMLFOREST
15.2.4 XMLCONCAT
15.2.5 Summary
15.3 XML Data Type
15.4 XQuery Functions
15.4.1 XMLQUERY
15.4.2 XMLTABLE
15.4.3 XMLEXISTS
15.5 Managing XML in the Database
15.6 Talking the Same Language - Mappings
15.6.1 Character Sets
15.6.2 Names
15.6.3 Types and Values
15.7 Chapter Summary
Part V Querying and The World Wide Web
Chapter 16 XML-Derived Markup Languages
16.1 Introduction
16.2 Markup Languages
16.2.1 MathML
16.2.2 SMIL
16.2.3 SVG
16.3 Discovery on the World Wide Web
16.4 Customized Query Languages
16.5 Chapter Summary
Chapter 17 Internationalization: Putting the "W" in "WWW"
17.1 Introduction
17.2 What Is Internationalization?
17.3 Internationalization and the World Wide Web
17.3.1 Unicode
17.3.2 W3C Character Model for the World Wide Web
17.4 Internationalization Implications: XPath, XQuery, and SQL/XML
17.5 Chapter Summary
Chapter 18 Finding Stuff
18.1 Introduction
18.2 Finding Structured Data - Databases
18.3 Finding Stuff on the Web - Web Search
18.3.1 The Google Phenomenon
18.3.2 Metadata
18.3.3 The Semantic Web - The Search for Meaning
18.3.4 The Deep Web - Feel the Width
18.4 Finding Stuff at Work - Enterprise Search
18.5 Finding Other People's Stuff - Federated Search
18.6 Finding Services - WSDL, UDDI, WSIL, RDDL
18.7 Finding Stuff in a More Natural Way
18.8 Putting It All Together - The Semantic Web+
Appendix A The Example
A.1 Introduction
A.2 Example Data
A.2.1 Movies We Own
A.3 Some Examples from the Book
A.3.1 XQuery Examples
A.3.2 SQL/XML Examples
A.4 A Simple Web Application
A.5 Summary
Appendix B Standards Processes
B.1 Introduction
B.2 World Wide Web Consortium (W3C)
B.2.1 What Is the W3C?
B.2.2 The W3C Process Document
B.2.3 The W3C Stages of Progression
B.3 Java Community Process (JCP)
B.3.1 What Is the JCP?
B.3.2 JSRs and Expert Groups: Formation and Operation
B.3.3 The JSR Stages of Progression
B.4 De Jure Standards: ANSI and ISO
B.4.1 The De Jure Process and Organizations
B.4.2 The SQL/XML Standardization Environment
B.4.3 Stages of Progression
B.5 Summary
Appendix C Grammars
C.1 Introduction
C.2 XQuery Grammar
C.3 SQL/XML Grammar
C.4 Chapter Summary
Index
About the Authors
Foreword
by Don Chamberlin
IBM Fellow
Almaden Research Center
Companies come and go in the database industry, but one thing
remains constant: Jim Melton remains at the center of the database
standards community. For more years than anyone cares to remem-
ber, Jim has served as editor of the international standard for the SQL
database language. Perhaps more importantly, he has translated this
standard into terminology that ordinary people can understand and
has made it accessible to everyone in a series of successful books.
Now the database world is undergoing its most important transition
since the advent of the relational data model in the 1970s. A
new self-describing data format, XML, is emerging as the standard
format for exchange of semi-structured data on the Web. XML is
fundamentally different from relations because it carries descriptive
metadata with each data instance rather than storing it in a separate
catalog. This new format gives unprecedented flexibility for repre-
senting various types of data but at the same time it requires a new
approach to query.
A collection of query-related standards is emerging around the
XML data format, and as usual Jim Melton is at the center of the
action. Jim is co-chair of the W3C XML Query Working Group, which
is creating an important new language called XQuery and (together
with the XSLT Working Group) is revising the well-known XPath
language. Jim is also co-Spec Lead for XQJ, the Java interface to
XQuery that is being developed under the Java Community Process.
In addition, as editor of the SQL Standard, Jim serves as editor of
SQL/XML, the set of SQL extensions that enable relational databases
to store and query XML data.
Stephen Buxton is also a long-time member of the W3C XML
Query Working Group, and a specialist in full-text search and
retrieval. Stephen's expertise in approximate queries on unstruc-
tured text complements Jim's long experience with exact queries on
structured data.
In short, there is no more authoritative pair of authors on Query-
ing XML than Jim Melton and Stephen Buxton. Best of all, as readers
of Jim's other books know, his informal writing style will teach you
what you need to know about this complex subject without giving
you a headache. If you need a comprehensive and accessible over-
view of Querying XML, this is the book you have been waiting for.
Don Chamberlin
December 2005
Preface
Why the subject matter is important
In a remarkably short period, XML has arguably become the most
important language for marking up documents for the World Wide
Web and for industry in general. Equally important, XML is rapidly
becoming the lingua franca for marking up traditional business data,
for exchanging information between business partners and between
application programs, and for expressing a host of concepts that
improve the usability of computer systems.
While it may be tempting to view XML as a "silver bullet" (a
solution to all of our problems), the truth is a bit more prosaic: XML
is merely a tool (admittedly a very important one) that can help solve
a significant range of problems. Like most tools, XML introduces
tradeoffs and complications. Among the difficulties that XML users
will increasingly encounter are the ones posed by locating and
retrieving information stored in documents marked up using XML.
As you'll learn in this book, there are many approaches to query-
ing XML documents and repositories of such documents. We cannot
claim to have addressed every possible approach, or even every
approach in use at the time we wrote this book. There are simply too
many possibilities and alternatives, too many researchers and practi-
tioners inventing new technologies. Instead, we have focused on the
approaches that have the broadest uses, the largest community of
adherents, and the greatest promise for economic success.
Before going further, we think that a quick explanation is in order
for one key term that crops up repeatedly in this book: document.
Because of XML's origins, sequences of characters that follow the
rules of XML, and are able to stand alone, are properly known as
"XML documents", even when they have nothing to do with books,
articles, or any kind of textual material. When numeric data or even
graphic images are represented in a standalone XML form, that XML
is properly called an XML document. XML that cannot stand by itself
is sometimes called an XML fragment. In general, throughout this
book, we use the word "document" or "fragment" when a specific
sort of XML is being referenced and we need to be clear about the
nature of that XML. Otherwise, we mostly use the raw term "XML"
and depend on the context to disambiguate our usage.
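
The distinction between a document and a fragment can be illustrated with a minimal sketch in Python. It assumes that "able to stand alone" can be approximated by whether a standard XML parser accepts the text as well-formed; the element names (runtime, title, titles) and the helper is_standalone_document are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_standalone_document(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A document: one root element, even though it holds only numeric data.
document = "<runtime unit='minutes'>142</runtime>"

# A fragment: two sibling elements with no single enclosing root element.
fragment = "<title>Film A</title><title>Film B</title>"

# Wrapping the fragment in a (hypothetical) root element yields a document.
wrapped = "<titles>" + fragment + "</titles>"

for label, text in (("document", document), ("fragment", fragment), ("wrapped", wrapped)):
    print(label, is_standalone_document(text))
```

The parser rejects the fragment because content follows the first complete element, which is one concrete sense in which such XML "cannot stand by itself".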
Why we wrote this book
"XML" is an enormous topic for any individual to understand. The
term has come to imply much more than the markup language of the
same name. Due in large part to the versatility of the markup lan-
guage and the enormous utility of the Internet and the World Wide
Web, there are countless computer scientists and software engineers
developing specifications, tools, application programs, and even
hardware that use or depend on some use of XML.
There are many fine books available that can teach you how to
mark up your documents and your data with XML, how to use the
eXtensible Stylesheet Language (XSL) to transform documents into
other documents, how to use the many tools such as XML parsers
and XSL transformation engines, and so forth. There are even several
available books focused exclusively on XQuery, the almost-finalized
W3C XML Query language.
But we have not seen any books that cover a broader subject that
we think is vital: how to locate information in documents that are
marked up using XML and how to find and extract that information
in repositories of such documents. It is certainly important to mark
up your documents and your data to capture the meaning inherent
in them, but tremendous additional value is available when you can
use powerful query facilities that not only find certain documents in
a repository, but also find and extract the fine-grained information
contained in those documents.
In this book, we identify and explore several approaches to query-
ing XML documents, concentrating on those that we believe are most
likely to be important in the near-to-medium future. We also give
you a perspective on some of the other technologies that are closely
related to the subject of querying XML. In doing so, we give you not
only valuable insights about locating and retrieving information in
XML documents, but we put the subject into the contexts in which it
will be used.
Who should read this book
We wrote this book primarily to benefit software engineers who have
to design and build applications that use XML and to access docu-
ments and data presented in an XML form. While the subject is nec-
essarily technical in nature and presentation, we decline to focus
exclusively on production of lines of code. Instead, we approach
mastery of the subject by ensuring that readers understand the rea-
son a particular topic is important, that they know the context in
which the topic is relevant, that the principles of the topic are made
clear, and that the details of writing code appropriate to the topic are
illustrated and exemplified.
The book should be of interest to more than just software develop-
ers, though. Architects of software systems that use XML must know
how search and retrieval issues are to be handled, while managers
and team leaders need an understanding of the relationships
between XML markup and storage and future retrieval of documents
based on the semantics of the information they contain.
How the book is organized
This book is divided into several parts. Part I, "XML: Documents and
Data", starts off with a survey of structured document technology
and examines several languages used to produce and/or represent
such documents. It continues with an exploration of the problems
associated with querying data generally, as well as with searching
XML documents, and includes a comparison of querying XML with
the use of SQL used to query traditional data.
Part II, "Metadata and XML", introduces the subject of metadata
for XML - information that describes XML documents and markup
languages. This part covers Document Type Definitions (DTDs) and
XML Schemas (with some attention given to competing XML
schema definition languages). We discuss the "meaning" of XML
markup and survey its use in a number of different XML-related
markup languages. This part finishes with a presentation of XML's
Information Set (commonly known as the Infoset) and an introduc-
tion to several other data models used to describe XML documents
in a formal manner.
Part III, "Managing XML for Querying", looks at the different
sorts of databases (e.g., relational, object-relational, object-oriented,
and so-called "native XML") in which XML documents are being
stored. It also examines several other W3C specifications that play a
role in XML documents that might be queried. This part of the book
includes some information about a number of current products that
are used to store, manage, query, and retrieve XML documents.
Part IV, "Querying XML", is the technical heart of the book,
describing four ways to query XML. XPath (the XML Path Language)
is already an established language for querying within an XML doc-
ument, so this part begins with a significant discussion of the XPath
and its usage for XML querying. XQuery is a brand new language
designed specifically for querying XML, so we will spend a lot of
time and detail on it, including an analysis of the type system and
data model used by that language, an examination of the formal
semantics of the language, and a discussion (replete with examples)
of the use of XQuery and its companion XQueryX. SQL is the leading
query language for structured data today. We explore the ways that
SQL can be used to query XML, especially if the XML is "shredded"
and stored in an object-relational form. Finally, in this Part we dis-
cuss SQL/XML, a set of extensions to SQL that leverage XPath and
XQuery to overcome some of SQL's limitations in managing semi-
structured data.
Part V, "Querying and the World Wide Web", provides a look at a
number of specific XML-based markup languages and responds to
the question of whether XPath, XQuery, SQL, and/or SQL/XML are
suitable for querying documents that are marked up using such lan-
guages or whether other, more specific, query facilities are needed to
deal with them. It also looks at the ways in which XML is, and is
going to be, used on the Internet, both for casual uses like browsing
and for industrial uses such as data interchange between business
partners. The impacts of internationalization on XML and related
specifications are addressed here as well.
We finish up the book with appendices that give you a glimpse
into the way in which open standards like XML, XQuery, and SQL/
XML are developed, that contain the complete grammar of XQuery,
that list and describe all of the SQL/XML functions, and that provide
a lengthy set of examples and a small sample of data against
which they have been tested.
The example we're using
We are both avid fans of the cinema, which is illustrated by the fact
that, between us, we subscribe to just about every possible movie
channel offered by satellite television providers. Continuing the tra-
dition started in earlier books written by Jim, we've chosen to use the
subject of movies as the basis for our example. We've collected data
on a broad range of films and organized it into a sort of "database"
that is, in fact, a modestly large XML document. This document -
data with XML markup - serves as the foundation for many of our
examples. (Note that we do not pretend that our example document
is marked up in any sort of optimal way, suitable for industrial use;
we chose specific markup styles to illustrate the points we make at
various parts of the book.) When the topic demands something a lit-
tle less data-oriented, we use a smallish textual document that dis-
cusses several film-related topics.
Syntax Conventions
In several places in this book, we define the syntax of various lan-
guage components relevant to XML, XML query languages, and so
forth. While we are not particularly fond of the syntax conventions
that the W3C has adopted (we find them somewhat less readable than
several other conventions), we believe that readers of this book will
be best served by consistency of style accompanied by explanations.
Therefore, we have (with slight reluctance) adopted the same
style used in the W3C specifications that we reference in the book.
You may be familiar with those conventions, but we think that a
quick summary will help some readers.
A variation of Backus-Naur Form (BNF) is used for syntax presen-
tation. More specifically, a syntactic symbol (called a nonterminal sym-
bol to distinguish it from language components that represent only
themselves) is defined using a notation in which the symbol being
defined appears to the left of a special operator (::=) and the definition
of that symbol appears as an expression written following that
operator. For example:
nonterminal-x ::= nonterminal-y ( ',' nonterminal-y )*
That line, called a BNF production, defines a nonterminal symbol
(nonterminal-x) by saying that it is made up of a second nonterminal
symbol (nonterminal-y), optionally followed by zero or more
(that's the meaning of the asterisk, *) repetitions of a sequence made
up of a literal comma (that's a terminal symbol) and another instance
of that second nonterminal symbol (nonterminal-y).
Therefore, if nonterminal-y happens to be defined to be an
identifier (in XML, these are either QNames or NCNames), then an
instance of nonterminal-x might be:
film , cinema , movie
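
To make the production concrete, here is a minimal sketch in Python that recognizes instances of nonterminal-x. It rests on two assumptions that are not part of the production itself: nonterminal-y is approximated by an ASCII-only simplification of an NCName, and optional whitespace is permitted around each comma (as in the instance above). The names NONTERMINAL_Y, NONTERMINAL_X, and matches_nonterminal_x are hypothetical.

```python
import re

# Assumption: an ASCII-only approximation of an NCName stands in for nonterminal-y.
NONTERMINAL_Y = r"[A-Za-z_][A-Za-z0-9_.-]*"

# nonterminal-x ::= nonterminal-y ( ',' nonterminal-y )*
# Assumption: optional whitespace may surround each comma and the whole instance.
NONTERMINAL_X = re.compile(
    rf"\s*{NONTERMINAL_Y}(?:\s*,\s*{NONTERMINAL_Y})*\s*"
)


def matches_nonterminal_x(text: str) -> bool:
    """Return True if the whole text is an instance of nonterminal-x."""
    return NONTERMINAL_X.fullmatch(text) is not None


for candidate in ("film , cinema , movie", "film", "film ,", ", film", ""):
    print(repr(candidate), matches_nonterminal_x(candidate))
```

A single identifier is accepted because the parenthesized group may repeat zero times; a trailing or leading comma is rejected because every comma must be followed (and preceded) by another nonterminal-y.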
One important thing to note is that, in this style of BNF, all
terminal symbols are enclosed in quotation marks, which might be
single quotation marks (' ') or double quotation marks (" ").
Anything, including parentheses, not enclosed in quotation marks is
either a nonterminal symbol or a character used in the BNF to specify
its meaning.
Here is a complete list of the conventions used in this book by this
style of BNF:
• "string" - the literal string given inside the double quotes
• 'string' - the literal string given inside the single quotes
• a b - a single occurrence of a followed by a single occurrence of b
• a | b - a single occurrence of a or a single occurrence of b, but not both
• a ? - a single occurrence of a or nothing at all; optional a
• a+ - one or more occurrences of a
• a* - zero or more occurrences of a
• ( expression ) - expression is treated as a unit, allows subgroups to carry the operators ?, *, or +
• /* ... */ - a comment in the BNF (this is unrelated to comments in languages being defined by the BNF, such as XQuery)
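
Readers familiar with regular expressions may find it helpful to see how these operators correspond to regex syntax. The following Python sketch is only an analogy, under the assumption that a and b are the single literal characters 'a' and 'b'; the table REGEX_FOR and the helper accepts are hypothetical names introduced for illustration.

```python
import re

# BNF form (with a and b read as the literal characters 'a' and 'b')
# mapped to an equivalent regular expression.
REGEX_FOR = {
    "a b": r"ab",            # a followed by b
    "a | b": r"a|b",         # a or b, but not both
    "a ?": r"a?",            # optional a
    "a+": r"a+",             # one or more occurrences of a
    "a*": r"a*",             # zero or more occurrences of a
    "( a b )+": r"(?:ab)+",  # the group is treated as a unit that carries +
}


def accepts(bnf_form: str, text: str) -> bool:
    """Return True if the whole text matches the regex for the given BNF form."""
    return re.fullmatch(REGEX_FOR[bnf_form], text) is not None


print(accepts("a | b", "a"), accepts("a | b", "ab"))
print(accepts("( a b )+", "abab"), accepts("( a b )+", "aba"))
```

The analogy holds only for simple cases like these; BNF productions can refer to one another recursively, which regular expressions of this kind cannot express.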