15
The Open Grid Services
Architecture, and Data Grids
Peter Z. Kunszt and Leanne P. Guy
CERN, Geneva, Switzerland
15.1 INTRODUCTION
Data Grids address computational and data intensive applications that combine very large
datasets and a wide geographical distribution of users and resources [1, 2]. In addition to
computing resource scheduling, Data Grids address the problems of storage and data man-
agement, network-intensive data transfers and data access optimization, while maintaining
high reliability and availability of the data (see References [2, 3] and references therein).
The Open Grid Services Architecture (OGSA) [1, 4] builds upon the anatomy of the
Grid [5], where the authors present an open Grid Architecture, and define the technologies
and infrastructure of the Grid as ‘supporting the sharing and coordinated use of diverse
resources in dynamic distributed Virtual Organizations (VOs)’. OGSA extends and com-
plements the definitions given in Reference [5] by defining the architecture in terms of
Grid services and by aligning it with emerging Web service technologies.
Web services are an emerging paradigm in distributed computing that focuses on simple
standards-based computing models. OGSA picks the Web Services Description Language
(WSDL) [6], the Simple Object Access Protocol (SOAP) [7], and the Web Services
Inspection Language (WSIL) [8] from the set of technologies offered by Web services
and capitalizes especially on WSDL. This is a very natural approach since in the
distributed world of Data Grid computing the same problems arise as on the Internet
concerning the description and discovery of services, and especially the heterogeneity
of data [9].
Grid Computing – Making the Global Infrastructure a Reality. Edited by F. Berman, A. Hey and G. Fox
2003 John Wiley & Sons, Ltd ISBN: 0-470-85319-0
In this article the application of the OGSA is discussed with respect to Data Grids. We
revisit the problem that is being addressed, the vision and functionality of Data Grids and
of OGSA. We then investigate OGSA’s benefits and possible weaknesses in view of the
Data Grid problem. In conclusion, we address what we feel are some of the shortcomings
and open issues that expose potential areas of future development and research.
The European Data Grid project (EU Data Grid project) aims to build a computational
and data-intensive grid to facilitate the analysis of millions of Gigabytes of data generated
by the next generation of scientific exploration. The main goal is to develop an infras-
tructure that will provide all scientists participating in a research project with the best
possible access to data and resources, irrespective of geographical location. All the Data
Grid discussion presented in this chapter draws from existing concepts in the literature,
but is strongly influenced by the design and experiences with data management in the
context of the EU Data Grid project [10]. In particular, data management in the EU Data
Grid project has focused primarily on file-based data management [11, 12]. However, the
usage of the Web services paradigm to build a Grid infrastructure was always considered
in the architecture and has led to the definition of the Web Service Discovery Architecture
(WSDA) [13].
15.1.1 The vision
In this section we summarize the Data Grid problem and the vision of the Grid as presented
in the literature [1, 5, 14].
15.1.2 Desirable features of Grids
In the scientific domain there is a need to share resources to solve common problems
and to facilitate collaboration among many individuals and institutes. Likewise, in com-
mercial enterprise computing there is an increasing need for resource sharing, not just to
solve common problems but to enable enterprise-wide integration and business-to-business
(B2B) partner collaboration.
To be successful on a large scale, the sharing of resources should be
FLEXIBLE, SECURE, COORDINATED, ROBUST, SCALABLE,
UBIQUITOUSLY ACCESSIBLE, MEASURABLE (QOS METRICS),
TRANSPARENT TO THE USERS
The distributed resources themselves should have the following qualities:
INTEROPERABLE, MANAGEABLE, AVAILABLE AND EXTENSIBLE
Grid computing attempts to address this very long list of desirable properties. The
concept of the Virtual Organization (VO) gives us the first necessary semantics by which
to address these issues systematically and to motivate these properties in detail.
15.1.3 Virtual organizations
A Virtual Organization (VO) is defined as a group of individuals or institutes who are
geographically distributed but who appear to function as one single unified organization.
The members of a VO usually have a common focus, goal or vision, be it a scientific
quest or a business venture. They collectively dedicate resources, for which a well-defined
set of rules for sharing and quality of service (QoS) exists, to this end. Work is coordi-
nated through electronic communication. The larger the VO, the more decentralized and
geographically distributed it tends to be in nature.
The concept of the VO is not a new one; VOs have existed for many years in the
fields of business and engineering, in which consortia are assembled to bid for contracts,
or small independent companies temporarily form a VO to share skills, reduce costs, and
expand their market access. In the commercial context, VOs may be set up for anything
from a single enterprise entity to complex B2B relationships.
In scientific communities such as High Energy Physics (HEP), astrophysics or medical
research, scientists form collaborations, either individually or between their respective
institutes, to work on a common set of problems or to share a common pool of data. The
term Virtual Organization has only recently been coined to describe such collaborations
in the scientific domain.
VOs may be defined in a very flexible way so as to address highly specialized needs
and are instantiated through the set of services that they provide to their users/applications.
Ultimately, the organizations, groups and individuals that form a VO also need to pay for
the resources that they access in some form or another – either by explicitly paying money
for services provided by Grid service providers or by trading their own resources. Grid
resource and service providers may choose to which VOs they can offer their resources
and services. The models for paying and trading of resources within a VO will probably
be vastly different in the scientific and business domains.
In the vision of the Grid, end users in a VO can use the shared resources according to
the rules defined by the VO and the resource providers. Different VOs may or may not
share the same resources. VOs can be dynamic, that is, they may change in time both
in scope and in extension by adjusting the services that are accessible by the VO and
the constituents of the VO. The Grid needs to support these changes such that they are
transparent to the users in the VO.
15.1.4 Motivation for the desirable features in Data Grids
In the specific case of Data Grids, the members of a VO also share their collective
data, which is potentially distributed over a wide area (SCALABLE). The data needs to be
accessible from anywhere at any time (FLEXIBLE, UBIQUITOUS ACCESS); however, the user is
not necessarily interested in the exact location of the data (TRANSPARENT). The access to
the data has to be secured, enabling different levels of security according to the needs
of the VO in question, while at the same time maintaining the manageability of the data
and its accessibility (SECURE, COORDINATED, MANAGEABLE). The reliability and robustness
requirements in the existing communities interested in Data Grids are high, since the
computations are driven by the data – if the data are not accessible, no computation is
possible (ROBUST, AVAILABLE).
In many VOs, there is a need to access existing data – usually through interfaces to
legacy systems and data stores. The users do not want to be concerned with issues of
data interoperability and data conversion (INTEROPERABLE, EXTENSIBLE). In addition, some
Data Grids also introduce the concept of Virtual Data [15], in which data needs to be
generated by a series of computations before it is accessible to the user.
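The idea of Virtual Data can be illustrated with a minimal Python sketch. The class `VirtualDataCatalog` and the logical name used below are hypothetical and do not belong to any Grid toolkit; the sketch only assumes that a derived data item is registered together with the recipe that produces it and is generated on first request.

```python
from typing import Callable, Dict


class VirtualDataCatalog:
    """Hypothetical catalog: maps a logical name to a recipe (a callable)."""

    def __init__(self) -> None:
        self._recipes: Dict[str, Callable[[], bytes]] = {}
        self._materialized: Dict[str, bytes] = {}

    def register(self, name: str, recipe: Callable[[], bytes]) -> None:
        # Only the derivation is stored; no data is produced yet.
        self._recipes[name] = recipe

    def get(self, name: str) -> bytes:
        # Generate the data on first access, then reuse the stored copy.
        if name not in self._materialized:
            self._materialized[name] = self._recipes[name]()
        return self._materialized[name]


calls = []


def derive_histogram() -> bytes:
    calls.append("run")          # record that the computation was executed
    return b"histogram-bytes"    # placeholder for real derived data


catalog = VirtualDataCatalog()
catalog.register("vo-example/histogram", derive_histogram)
first = catalog.get("vo-example/histogram")
second = catalog.get("vo-example/histogram")
```

Whether a generated item is kept after its first use, as it is here, is a design choice that a real system would presumably base on storage and regeneration costs.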
15.2 THE OGSA APPROACH
The OGSA views the Grid as an extensible set of Grid services. A Grid service is defined
as ‘a Web service that provides a set of well-defined interfaces and that follows specific
conventions’ [14]. Virtual Organizations are defined by their associated services, since
the set of services ‘instantiates’ a VO. In OGSA each VO builds its infrastructure from
existing Grid service components and has the freedom to add custom components that
have to implement the necessary interfaces to qualify as a Grid service. We summarize the
requirements that motivate the OGSA as stated in References [1, 4, 14] and the properties
it provides to the Grid (see previous section for the discussion of properties) in Table 15.1.
Table 15.1 Requirements mentioned in References [1, 4, 14] for OGSA. We try to map each
requirement to the properties that grid systems are supposed to have on the basis of the requirements
given in Section 15.2. The primary property is highlighted. An analysis of these requirements is
given in the next section.

    Requirement                                              Property
1   Support the creation, maintenance and application of     COORDINATED, TRANSPARENT, SCALABLE,
    ensembles of services maintained by VOs.                 EXTENSIBLE, MANAGEABLE
2   Support local and remote transparency with respect to    TRANSPARENT, FLEXIBLE
    invocation and location.
3   Support multiple protocol bindings (enabling e.g.        FLEXIBLE, TRANSPARENT, INTEROPERABLE,
    localized optimization and protocol negotiation).        EXTENSIBLE, UBIQ. ACCESSIBLE
4   Support virtualization in order to provide a common      INTEROPERABLE, TRANSPARENT, MANAGEABLE
    interface encapsulating the actual implementation.
5   Require mechanisms enabling interoperability.            INTEROPERABLE, UBIQ. ACCESSIBLE
6   Require standard semantics (like same conventions        INTEROPERABLE
    for error notification).
7   Support transience of services.                          SCALABLE, MANAGEABLE, AVAILABLE, FLEXIBLE
8   Support upgradability without disruption of the          MANAGEABLE, FLEXIBLE, AVAILABLE
    services.
Table 15.2 Current elements of OGSA and elements that have been mentioned by OGSA but are
said to be dealt with elsewhere. We also refer to the requirement or to the Grid capability that
the elements address. Please be aware that this is a snapshot and that the current OGSA service
specification might not be reflected correctly.

Responsibility                        Interfaces                                          Requirements addressed
Information & discovery               GridService.FindServiceData,                        1, 2, 4, 5, 6
                                      Registry.(Un)RegisterService,
                                      HandleMap.FindByHandle, Service Data Elements,
                                      Grid Service Handle and Grid Service Reference
Dynamic service creation              Factory.CreateService                               1, 7
Lifetime management                   GridService.SetTerminationTime,                     1, 7
                                      GridService.Destroy
Notification                          NotificationSource.                                 5
                                      (Un)SubscribeToNotificationTopic,
                                      NotificationSink.DeliverNotification
Change management                     CompatibilityAssertion data element                 8
Authentication                        Impose on the hosting environment                   SECURE
Reliable invocation                   Impose on hosting environment                       ROBUST
Authorization and policy management   Defer to later                                      SECURE
Manageability                         Defer to later                                      MANAGEABLE
As mentioned before, OGSA focuses on the nature of the services that make up the Grid.
In OGSA, existing Grid technologies are aligned with Web service technologies in order
to profit from the existing capabilities of Web services, including
• service description and discovery,
• automatic generation of client and server code from service descriptions,
• binding of service descriptions to interoperable network protocols,
• compatibility with higher-level open standards, services and tools, and
• broad industry support.
OGSA thus relies upon emerging Web service technologies to address the requirements
listed in Table 15.1. OGSA defines a preliminary set of Grid Service interfaces and capa-
bilities that can be built upon. All these interfaces are defined using WSDL to ensure a
distributed service system based on standards. They are designed such that they address
the responsibilities dealt with in Table 15.2.
The GridService interface is imposed on all services. The other interfaces are optional,
although there need to be many service instances present in order to be able to run a
VO. On the basis of the services defined, higher-level services are envisaged such as data
management services, workflow management, auditing, instrumentation and monitoring,
problem determination and security protocol mapping services. This is a nonexhaustive
list and is expected to grow in the future.
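The interplay of dynamic service creation and lifetime management can be sketched in a few lines of Python. The classes below are hypothetical in-memory stand-ins whose method names merely echo Factory.CreateService, GridService.SetTerminationTime and GridService.Destroy; they are not the WSDL interfaces themselves, and the expiry rule (an instance is dropped once its termination time has passed) is an assumption made for illustration.

```python
import itertools
import time
from typing import Dict, Optional


class GridServiceInstance:
    """Hypothetical stand-in for a transient service instance."""

    def __init__(self, handle: str, termination_time: float) -> None:
        self.handle = handle
        self.termination_time = termination_time
        self.destroyed = False

    def set_termination_time(self, when: float) -> None:
        # A client extends (or shortens) the lifetime of the instance.
        self.termination_time = when

    def destroy(self) -> None:
        # Explicit destruction requested by a client.
        self.destroyed = True

    def is_alive(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return not self.destroyed and now < self.termination_time


class Factory:
    """Hypothetical factory that creates instances with a bounded lifetime."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self.instances: Dict[str, GridServiceInstance] = {}

    def create_service(self, lifetime_s: float,
                       now: Optional[float] = None) -> GridServiceInstance:
        now = time.time() if now is None else now
        handle = f"instance-{next(self._counter)}"
        inst = GridServiceInstance(handle, now + lifetime_s)
        self.instances[handle] = inst
        return inst

    def reap(self, now: Optional[float] = None) -> None:
        # Drop instances that were destroyed or whose lifetime ran out.
        self.instances = {h: i for h, i in self.instances.items()
                          if i.is_alive(now)}


factory = Factory()
a = factory.create_service(lifetime_s=10, now=0.0)
b = factory.create_service(lifetime_s=10, now=0.0)
a.set_termination_time(100.0)   # keep 'a' alive longer
factory.reap(now=50.0)          # 'b' ran out at t=10 and is removed
```

Passing `now` explicitly keeps the sketch deterministic; a running service would presumably rely on the wall clock instead.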
OGSA introduces the concept of Grid Service Handles (GSHs) that are unique to
a service and by which the service may be uniquely identified and looked up in the
Handle Map. GSHs are immutable entities. The Grid Service Reference (GSR) is then
the description of the service instance in terms of (although not necessarily restricted
to) WSDL and gives the user the possibility to refer to a mutable running instance of
this service. OGSA envisages the possibility of a hierarchical structure of service hosting
environments, as the OGSA authors demonstrate in their examples.
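The relation between an immutable handle and a mutable reference may be pictured with the following sketch. The `HandleMap` class, the `gsh://` syntax and the host names are invented for illustration; only the idea that a fixed GSH resolves to the current GSR is taken from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GridServiceReference:
    """Hypothetical GSR: how to reach the instance right now."""
    endpoint: str
    binding: str


class HandleMap:
    """Hypothetical handle map: immutable GSH string -> current GSR."""

    def __init__(self) -> None:
        self._map: Dict[str, GridServiceReference] = {}

    def bind(self, gsh: str, gsr: GridServiceReference) -> None:
        # The handle stays fixed; only the reference behind it is replaced.
        self._map[gsh] = gsr

    def find_by_handle(self, gsh: str) -> GridServiceReference:
        return self._map[gsh]


handle_map = HandleMap()
gsh = "gsh://vo.example/replica-manager/1"       # made-up handle syntax
handle_map.bind(gsh, GridServiceReference("https://host-a.example/rm", "SOAP/HTTP"))
before = handle_map.find_by_handle(gsh)
# The service instance moves to another host: same handle, new reference.
handle_map.bind(gsh, GridServiceReference("https://host-b.example/rm", "SOAP/HTTP"))
after = handle_map.find_by_handle(gsh)
```

A client that stores only the GSH can therefore always obtain a valid reference, at the price of one extra lookup.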
Another essential building block of OGSA is the standardized representation for service
data, structured as a set of named and typed XML fragments called service data elements
(SDEs). The SDE may contain any kind of information on the service instance allowing
for basic introspection information on the service. The SDE is retrievable through the
GridService port type’s FindServiceData operation implemented by all Grid services.
Through the SDE the Grid services are stateful services, unlike Web services that do not
have a state. This is an essential difference between Grid services and Web services: Grid
services are stateful and provide a standardized mechanism to retrieve their state.
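As a rough illustration of service data, the sketch below keeps the state of a service as named XML fragments and retrieves them by name. The class `StatefulService`, the element names and the by-name lookup are assumptions; the actual FindServiceData operation and its query syntax are defined by the OGSA specification and are not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


class StatefulService:
    """Hypothetical service exposing its state as named XML fragments."""

    def __init__(self) -> None:
        self._sdes: Dict[str, ET.Element] = {}

    def set_sde(self, name: str, xml_fragment: str) -> None:
        # Each service data element is stored as a parsed XML fragment.
        self._sdes[name] = ET.fromstring(xml_fragment)

    def find_service_data(self, name: str) -> str:
        # Simplest possible query: look an element up by its name.
        return ET.tostring(self._sdes[name], encoding="unicode")

    def sde_names(self) -> List[str]:
        return sorted(self._sdes)


svc = StatefulService()
svc.set_sde("terminationTime",
            "<terminationTime>2003-01-01T00:00:00Z</terminationTime>")
svc.set_sde("freeSpace", '<freeSpace unit="GB">120</freeSpace>')
fragment = svc.find_service_data("freeSpace")
value = int(ET.fromstring(fragment).text)
```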
15.3 DATA GRID SERVICES
There has been a lot of effort put into the definition of Data Grid services in the existing
Grid projects. Figure 15.1 shows the Global Grid Forum (GGF) Data Area concept space,
namely, their domains of interest. In this article we address the same concept space.
Examples of Data Grid architectures are given in References [2, 15–21].
[Figure 15.1 (diagram). Labeled concepts: Grid Data Management System; Management metadata;
Collections; Subcollections; Replica Catalog; Global namespace; Logical objects; Replica
Management; Quality of Service; Network transfer; Data transport; Data model; Replica Metadata;
Physical objects; Data; Storage device; Storage properties; Storage Manager; Storage Access
Protocol; Storage Systems; Application experiment (data source); Remote Resources; Application;
Middleware; Caching; ACL/CAS; Authorization; Authentication; Security Services.]
Figure 15.1 The concept space of the GGF data area (from the GGF data area website).
15.3.1 The data
In most of the existing architectures, the data management services are restricted to the
handling of files. However, for many VOs, files represent only an intermediate level of data
granularity. In principle, Data Grids need to be able to handle data elements – from single
bits to complex collections and even virtual data, which must be generated upon request.
All kinds of data need to be identifiable through some mechanism – a logical name or
ID – that in turn can be used to locate and access the data. This concept is identical to the
concept of a GSH in OGSA [14], so in order to be consistent with the OGSA terminology
we will call this logical identifier the Grid Data Handle (GDH) (see below).
The following is a nondefinitive list of the kinds of data that are dealt with in
Data Grids.
• Files: For many of the VOs the only access method to data is file-based I/O; data
is kept only in files and there is no need for other kinds of data granularity. This
simplifies data management in some respects because the semantics of files are well
understood. For example, depending on the QoS requirements on file access, the files
can be secured through one of the many known mechanisms (Unix permissions, Access
Control Lists (ACLs), etc.). Files can be maintained through directory services like the
Globus Replica Catalog [3], which maps Logical File Names (LFNs) to their physical
instances, and other metadata services that are discussed below.
• File collections: Some VOs want to have the possibility of assigning logical names
to file collections that are recognized by the Grid. Semantically there are two different
kinds of file collections: confined collections of files in which all files making up the
collection are always kept together and are effectively treated as one – just like a tar or
zip archive – and free collections that are composed of files as well as other collections
not necessarily available on the same resource – they are effectively a bag of logical
file and collection identifiers. Confined collections assure the user that all the data
are always accessible at a single data source while free collections provide a higher
flexibility to freely add and remove items from the collection, but do not guarantee that
all members are accessible or even valid at all times. Keeping free collections consistent
with the actual files requires additional services.
• Relational databases: The semantics of data stored in a Relational Database Management
System (RDBMS) are also extremely well understood. The data identified by a
GDH may correspond to a database, a table, a view or even to a single row in a table
of the RDBMS. Distributed database access in the Grid is an active research topic and
is treated in a separate article in this volume (see Chapter 14).
• XML databases and semistructured data: Data with loosely defined or irregular structure
is best represented using the semistructured data model. It is essentially a dynamically
typed data model that allows a ‘schema-less’ description format in which the data
is less constrained than in relational databases [22]. The XML format is the standard
in which the semistructured data model is represented. A GDH may correspond to any
XML data object identifiable in an XML Database.
• Data objects: This is the most generic form of a single data instance. The structure
of an object is completely arbitrary, so the Grid needs to provide services to describe
and access specific objects since the semantics may vary from object type to object
type. VOs may choose different technologies to access their objects – ranging from
proprietary techniques to open standards like CORBA/IIOP and SOAP/XML.
• Virtual data: The concept of virtualized data is very attractive to VOs that have large
sets of secondary data that are derived from primary data using a well-defined set of
procedures and parameters. If the secondary data are more expensive to store than
to regenerate, they can be virtualized, that is, only created upon request. Additional
services are required to manage virtual data.
• Data sets: Data sets differ from free file collections only in that they can contain
any kind of data from the list above in addition to files. Such data sets are useful for
archiving, logging and debugging purposes. Again, the necessary services that track
the content of data sets and keep them consistent need to be provided.
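The consistency problem of free collections described in the list above can be made concrete with a small sketch. The identifiers, the `known_files` set and the `resolve` function are hypothetical; the sketch assumes that a free collection is a bag of logical identifiers that may nest other collections and may contain members that are no longer valid.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical catalog of logical file identifiers known to be valid.
known_files: Set[str] = {"lfn:run1.dat", "lfn:run2.dat", "lfn:calib.dat"}

# Free collections: bags of logical file and collection identifiers.
free_collections: Dict[str, List[str]] = {
    "coll:calibration": ["lfn:calib.dat"],
    "coll:analysis": ["lfn:run1.dat", "lfn:run3.dat", "coll:calibration"],
}


def resolve(collection_id: str,
            seen: Optional[Set[str]] = None) -> Tuple[List[str], List[str]]:
    """Return (valid files, dangling identifiers) of a free collection."""
    seen = set() if seen is None else seen
    valid: List[str] = []
    dangling: List[str] = []
    if collection_id in seen:            # guard against cyclic nesting
        return valid, dangling
    seen.add(collection_id)
    for member in free_collections[collection_id]:
        if member in free_collections:   # nested collection: recurse
            v, d = resolve(member, seen)
            valid += v
            dangling += d
        elif member in known_files:
            valid.append(member)
        else:
            dangling.append(member)      # member is no longer valid
    return valid, dangling


valid, dangling = resolve("coll:analysis")
```

A confined collection would not need such a check, since all of its members are kept together and treated as one unit.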
15.3.1.1 The Grid data handle (GDH)
The common requirement for all these kinds of data is that the logical identifier of
the data, the GDH, be globally unique. The GDH can be used to locate the data and
retrieve information about it. It can also be used as a key by which more attributes
can be added to the data through dedicated application-specific catalogs and metadata
services. There are many possibilities to define a GDH, the simplest being the well-known
Universally Unique IDentifier (UUID) scheme [23]. In order to be human-readable,
a uniform resource identifier (URI)-like string might be composed, specifying the kind of
data instead of the protocol, a VO namespace instead of the host and then a VO-internal
unique identifier, which may be a combination of creation time and creation site and a
number (in the spirit of UUID), or another scheme that may even be set up by the VO.
The assignment and the checking of the GDH can be performed by the Data Registry
(DR) (see below).
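One possible reading of the URI-like scheme is sketched below. The exact layout (`<kind>://<vo-namespace>/<time>-<site>-<serial>`), the function names and the example values are assumptions chosen for illustration; no standard GDH syntax is implied.

```python
import uuid
from datetime import datetime, timezone


def make_gdh(kind: str, vo: str, site: str, created: datetime, serial: int) -> str:
    """Compose a readable handle: <kind>://<vo-namespace>/<time>-<site>-<serial>."""
    stamp = created.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{kind}://{vo}/{stamp}-{site}-{serial:06d}"


def make_opaque_gdh() -> str:
    # Simplest alternative: a random UUID, unique but not descriptive.
    return str(uuid.uuid4())


created = datetime(2003, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
gdh = make_gdh("collection", "vo.example.org", "cern", created, 42)
opaque = make_opaque_gdh()
```

The readable form is only unique if the VO guarantees that the combination of time, site and serial number is never reused, which is the kind of check the text assigns to the Data Registry.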
Similar to the GSH, a semantic requirement on the GDH is that it must not change
over time once it has been assigned. The reason for this requirement is that we want
to enable automated Grid services to be able to perform all possible operations on the
data – such as location, tracking, transmission, and so on. This can easily be achieved
if we have a handle by which the data is identified. One of the fundamental differences
between Grids and the Web is that for Web services, ultimately, there always is a human
being at the end of the service chain who can take corrective action if some of the
services are erroneous. On the Grid, this is not the case anymore. If the data identifier
is neither global nor unique, tracking by automated services will be very difficult to
implement – they would need to be as ‘smart’ as humans in identifying the data based
on its content.
15.3.1.2 The Grid data reference
For reasons of performance enhancement and scalability, a single logical data unit
(any kind) identified by a GDH may have several physical instances – replicas – located
throughout the Grid, all of which also need to be identified. Just like the GSH, the GDH
carries no protocol or instance-specific information such as supported protocols to access
the data. All this information may be encapsulated, in analogy to the Grid Service
Reference, in a single abstraction: the Grid Data Reference (GDR). Each physical data
instance has a corresponding GDR that describes how the data can be accessed. This
description may be in WSDL, since a data item also can be viewed as a Grid resource. The
elements of the GDR include physical location, available access protocols, data lifetime,
and possibly other metadata such as size, creation time, last update time, created by, and
so on. The exact list and syntax would need to be agreed upon, but this is an issue for
the standardization efforts in bodies like GGF. The list should be customizable by VOs
because items like update time may be too volatile to be included in the GDR metadata.
In the OGSA context, the metadata may be put as Service Data Elements in the
Data Registry.
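A GDR could be represented, for example, as the record sketched below. The field names, the example values and the `filter_metadata` helper are hypothetical, since the exact list and syntax of GDR elements are left to standardization; the sketch only shows how a VO might exclude metadata items it considers too volatile.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GridDataReference:
    """Hypothetical GDR record for one physical data instance."""
    gdh: str                      # logical handle this instance belongs to
    location: str                 # physical location of the replica
    protocols: List[str]          # access protocols offered for it
    lifetime_s: int               # how long the instance is kept
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_metadata(gdr: GridDataReference,
                    allowed: Tuple[str, ...]) -> GridDataReference:
    # A VO keeps only the metadata items it considers stable enough.
    kept = {k: v for k, v in gdr.metadata.items() if k in allowed}
    return GridDataReference(gdr.gdh, gdr.location, list(gdr.protocols),
                             gdr.lifetime_s, kept)


raw = GridDataReference(
    gdh="file://vo.example.org/0001",
    location="se01.example.org:/data/0001",
    protocols=["gridftp", "http"],
    lifetime_s=86400,
    metadata={"size": "1048576", "created_by": "alice",
              "last_update": "2003-01-15"},
)
# This VO treats 'last_update' as too volatile to publish.
published = filter_metadata(raw, allowed=("size", "created_by"))
```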
15.3.1.3 The data registry
The mapping from GDH to GDR is kept in the Data Registry. This is the most basic
service for Data Grids, which is essential to localize, discover, and describe any kind of
data. The SDEs of this service will hold most of the additional information on the GDH
to GDR mapping, but there may be other more dedicated services for very specific kinds
of GDH or GDR metadata.
To give an example, in Data Grids that only deal with files – as is currently the case
within the EU DataGrid project – the GDH corresponds to the LFN and the GDR is
a combination of the physical file name and the transport protocol needed to access
the file, sometimes also called Transport File Name (TFN) [16–18]. The Data Registry
corresponds to the Replica Catalog [24] in this case.
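For the file-only case just described, the registry reduces to a mapping from one logical file name to a list of transport file names. The following sketch uses a hypothetical `DataRegistry` class and made-up names and hosts; it is not the interface of the Replica Catalog.

```python
from collections import defaultdict
from typing import DefaultDict, List


class DataRegistry:
    """Hypothetical registry: one logical name (GDH) -> many replicas (GDRs)."""

    def __init__(self) -> None:
        self._replicas: DefaultDict[str, List[str]] = defaultdict(list)

    def register(self, lfn: str, tfn: str) -> None:
        # Add a replica unless this exact transport name is already known.
        if tfn not in self._replicas[lfn]:
            self._replicas[lfn].append(tfn)

    def unregister(self, lfn: str, tfn: str) -> None:
        self._replicas[lfn].remove(tfn)

    def lookup(self, lfn: str) -> List[str]:
        # Return a copy so callers cannot alter the registry's state.
        return list(self._replicas.get(lfn, []))


registry = DataRegistry()
registry.register("lfn:run1.dat", "gridftp://se01.example.org/data/run1.dat")
registry.register("lfn:run1.dat", "gridftp://se02.example.org/store/run1.dat")
registry.register("lfn:run1.dat", "gridftp://se01.example.org/data/run1.dat")  # duplicate
replicas = registry.lookup("lfn:run1.dat")
missing = registry.lookup("lfn:unknown.dat")
```

Here the transport protocol is folded into the TFN string itself, which mirrors the description of the TFN as a combination of physical file name and protocol.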
15.3.2 The functionality and the services
In this section we discuss the functionalities of Data Grids that are necessary and/or
desirable in order to achieve the Data Grid vision.
15.3.2.1 VO management
Virtual Organizations can be set up, maintained and dismantled by Grid services. In
particular, there needs to be functionality provided for bootstrapping, user management,
resource management, policy management and budget management. However, these func-
tionalities are generic to all kinds of Grids, and their detailed discussion is beyond the
scope of this article. Nevertheless, it is important to note that since VOs form the organic
organizational unit in Grids, VO management functionalities need to be supported by all
services. Some of these issues are addressed by the services for control and monitoring
(see below).
15.3.2.2 Data transfer
The most basic service in Data Grids is the data transfer service. To ensure reliable data
transfer and replication, data validation services need to be set up. Data validation after