Introduction to iRODS
In addition to this description, see the brief introduction to the integrated Rule-based Data System and datagrids.
iRODS, which stands for i Rule Oriented Data Systems, is a project for building the next generation of data management cyberinfrastructure. There are raging discussions in our group about what the i in iRODS means. There is no consensus, and even an individual's reading of the letter changes from day to day. Is it integrated, intelligent, intuitive, internet, invaluable or possibly incomprehensible? One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture. Hence, we leave it to individual users to hang an explanation on the i based on their intoxicating experience or, more likely, on their irritating frustrations.
The ideas for the iRODS project have existed for quite some time, but became more concrete through an NSF-funded project titled “Constraint-based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives”, which started in the fall of 2004. The development of iRODS is driven by the lessons learned from the deployment and production use of the DICE Storage Resource Broker (SRB) data grid technology, and by the application of theories and concepts from a wide range of well-known paradigms in other fields, such as active databases, program verification, transactional systems, logic programming, business rule systems, constraint-management systems, workflows and service-oriented architecture.
A Quick SRB Overview
DICE SRB provides grid-level middleware for sharing data and metadata distributed across heterogeneous resources using uniform APIs and GUIs. To provide this functionality, SRB abstracts key concepts in data management (data object names, sets of data objects, resources, users and groups) and provides uniform methods for dealing with them. SRB hides the underlying physical infrastructure from users by providing global, logical mappings to the digital entities registered into a shared collection. Hence, the peculiarities of storage systems and their access methods, the locations of data, and user authentication and access across systems are hidden from the users. A user can access files from an online file system, near-line tapes, relational databases, sensor data streams and the web without worrying about where they are located or what protocol to use to connect to and access the system, and without establishing a separate account or password/certificate on each of the underlying computer systems. These virtualization functions are implemented in the SRB system by maintaining mappings and profile metadata in a permanent database system called the MCAT meta catalog, and by providing integrated data and metadata management that links the various sub-systems in a seamless manner.
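The logical-to-physical mapping described above can be pictured with a small sketch. This is an illustration only and not SRB code: the class names, fields, resource names and paths are all hypothetical, and a plain in-memory dictionary stands in for the catalog database.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a data object (all fields are illustrative)."""
    resource: str       # e.g. a disk file system or a tape archive
    protocol: str       # access method the broker would use internally
    physical_path: str  # location that stays hidden from the user


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalog: logical name -> replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        """Record one more physical copy under a logical name."""
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list:
        """Return every physical replica behind a logical name."""
        return list(self.entries.get(logical_name, []))


catalog = Catalog()
catalog.register("/home/alice/survey.dat",
                 Replica("disk-A", "posix", "/vault1/0001/survey.dat"))
catalog.register("/home/alice/survey.dat",
                 Replica("tape-B", "tape", "/archive/x9/survey.dat"))

# The user only ever names the logical path; the catalog supplies the rest.
replicas = catalog.resolve("/home/alice/survey.dat")
```

The user-facing name never changes when a replica is added, moved or removed; only the catalog entries do.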
The SRB system is a one-size-fits-all system. The policies used in managing the data at the server level are hard-coded. Also, if users want to perform complex sets of operations, they need to script or program them at the client level. If a community wants to perform a different type of operation (say, change the way access control for files is implemented), it needs to change the SRB code in the hope that the change does not introduce unintended side effects on other operations. Examples of such customization requirements come from the SRB user community itself. For example, one user wanted a feature whereby files in a particular collection could not be deleted, even by the owner or the data grid administrator, while other collections behaved as before. This kind of collection-level data management is not easily done in the current SRB setting without a lot of work. Another example is a user who wants to apply additional or alternative access control checks to sensitive files. This again requires specialized coding. A third example is a user who would like to asynchronously replicate newly ingested files in a particular collection (or of a particular file type) into an archive resource, or to extract metadata from them or create lower-resolution versions of them. This feature also needs additional coding and asynchronous scheduling mechanisms that are not easily implemented with the SRB.
iRODS software belongs to a class of middleware which we term adaptive middleware. The adaptive middleware architecture (AMA for short) provides a means for adapting the middleware to meet the needs of the end user community with few, if any, programming changes. One can view AMA middleware as a glass box where users can see how the system works and can tweak the controls to meet their demands. Conventional middleware, by contrast, can be viewed as a black box: the flow of operations cannot be adjusted programmatically, and the only changes possible are configuration changes that set the starting conditions of the middleware.
There are multiple ways of achieving an adaptive middleware architecture. In our approach, we use a particular methodology that we call Rule Oriented Programming, or ROP for short. The rule oriented programming concept is discussed in some detail in the section on ROP.
The iRODS architecture provides a means for encoding customizations of data management functionality in an easy and declarative fashion using the ROP paradigm. This is accomplished by coding the processes performed in the iRODS data grid system as rules (see Rules) that explicitly control the operations performed when a rule is invoked by a particular task. These operations are called micro-services (see Micro-Services) in iRODS; they are C functions that are called when the rule body is executed. One can modify the flow of tasks when executing the rules by interposing new micro-services (or rule calls) in a given rule, or by changing and recompiling the micro-service code. Moreover, one can add another rule to the rule base for the same task, but with a higher priority, so that it is chosen before the existing rule. This pre-emptive rule is executed before the original rule. If any part of the new rule fails during execution, the original rule is executed instead.
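The priority and fallback behaviour described above can be sketched in a few lines. This is a minimal model in Python rather than the C used by real micro-services, and it is not the iRODS rule syntax: the rule dictionaries, the numeric priority field, the micro-service names and the "discard partial changes on failure" step are all assumptions made for illustration.

```python
class MicroServiceError(Exception):
    """Raised by a micro-service to signal failure."""


def execute_action(rule_base, action, context):
    """Try the rules registered for an action, highest priority first.

    Each rule runs against a trial copy of the context, so a rule that
    fails part-way leaves no trace before the next rule is tried.
    """
    candidates = sorted(
        (rule for rule in rule_base if rule["action"] == action),
        key=lambda rule: rule["priority"],
        reverse=True,
    )
    for rule in candidates:
        trial = dict(context)
        try:
            for micro_service in rule["body"]:
                micro_service(trial)
        except MicroServiceError:
            continue  # fall back to the next (lower-priority) rule
        context.update(trial)
        return rule["name"]
    raise MicroServiceError(f"no rule succeeded for action {action!r}")


# Hypothetical micro-services, for illustration only.
def store_locally(ctx):
    ctx["stored_on"] = "local-disk"


def store_on_archive(ctx):
    raise MicroServiceError("archive resource is offline")


def record_checksum(ctx):
    ctx["checksum"] = "recorded"


rule_base = [
    {"name": "default_put", "action": "put", "priority": 1,
     "body": [store_locally, record_checksum]},
    # Added later with a higher priority, so it pre-empts default_put.
    {"name": "archive_put", "action": "put", "priority": 2,
     "body": [record_checksum, store_on_archive]},
]

context = {}
chosen = execute_action(rule_base, "put", context)
```

Here the higher-priority rule is tried first, fails in its second micro-service, and the original rule then runs to completion, so `chosen` ends up naming the original rule.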
Major features of the iRODS architecture include the following:
1) Data grid architecture based on a client/server model and distributed storage and compute resources.
2) A database system for maintaining the attributes and states of data and operations.
3) A rule system for enforcing and executing adaptive rules.
The iRODS Rule System
At the core of the iRODS Rule System is the iRODS Rule Engine, which runs on all iRODS servers. The Rule Engine can invoke a number of predefined micro-services based on the interpretation of the rule being executed.
Rules can be invoked on the servers internally to enforce/execute management policies for the system. These are system level rules. Examples include data management policies and automation of system level services.
The Rule Engine can also be invoked externally by clients through the irule command or the rcExecMyRule API. Typically, these are workflow-type rules that allow users to request the iRODS servers to perform a sequence of operations (micro-services) on their behalf.
Some rules require immediate execution, while others may be executed later in the background (see Rule Execution modes). The Delayed Execution Service allows rules and micro-services to be queued and executed at a later time by the Rule Execution server. Examples of micro-services that are suitable for delayed execution are post-processing operations such as checksumming, replication and metadata extraction. The post-processing micro-service msiExtractNaraMetadata was designed specifically to extract and register metadata from nail data objects uploaded by a NARA project.
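The difference between immediate and delayed execution can be illustrated with a toy queue. This sketch is not the Delayed Execution Service itself: the class, its method names and the integer timestamps are hypothetical, and a real server would persist the queue rather than hold it in memory.

```python
import heapq
import itertools


class DelayedQueue:
    """Toy queue of operations waiting for background execution."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, run_at, name, func):
        """Queue func to run once the clock reaches run_at."""
        heapq.heappush(self._heap, (run_at, next(self._order), name, func))

    def run_due(self, now):
        """Run every queued entry whose time has come; return their names."""
        done = []
        while self._heap and self._heap[0][0] <= now:
            _, _, name, func = heapq.heappop(self._heap)
            func()
            done.append(name)
        return done


log = []
queue = DelayedQueue()

# Immediate execution: the ingest step runs right away.
log.append("ingest")

# Post-processing steps are queued for a later pass by the background server.
queue.submit(10, "checksum", lambda: log.append("checksum"))
queue.submit(20, "replicate", lambda: log.append("replicate"))

first_pass = queue.run_due(now=15)   # only the checksum is due
second_pass = queue.run_due(now=30)  # the replication is now due
```

The client that triggered the ingest does not wait for either post-processing step; both are picked up by later passes over the queue.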
The iRODS Rule Engine
The underlying operations that need to be performed are based on C functions which operate on internal C structures. The external view of the execution architecture is based on tasks (we call them actions) that need to be performed and external input parameters (we call them attributes) that are used to guide and perform these actions. The C functions themselves are abstracted externally by giving them logical names (we call the functions "internal micro-services" and the abstractions "external micro-services"). To make the links between the external world and the internal C apparatus, we define mappings from client libraries to rules. Moreover, since the operations that are performed by iRODS need to change the persistent state information in the ICAT catalog, the attributes are mapped to a persistent logical name space for the metadata names that are used in ICAT.
The foundation for the iRODS architecture is based on the following key concepts.
- A Persistent Database [#] that shares data (facts) across time and users.
- A Transient memory [$] that holds data during a session.
- A set of Actions [T] that name and define the tasks that need to be performed.
- A set of internal, well-defined, callable micro-services [P] made of procedures and functions that provide the methods for executing the sub-tasks that need to be performed.
- A set of external Attributes [A] that is used as a logical namespace to externally refer to data and metadata.
- A set of external micro-services [M] (or methods) that is used as a logical namespace to externally refer to methods in the rules.
- A set of mappings [DVM] that defines a relationship from external attributes in A to internal elements in # and $.
- A set of mappings [FNM] that defines a relationship from external micro-services in M and Actions T to procedures and functions in P and other action names in T. In a sense, FNM can be seen as providing aliases. One use is to map different versions of the functions/procedures in P to the actual execution process at run time.
- A set of rules [R] which defines what needs to be done for each action [T] and is based on A and M.
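The two mapping sets in the list above, DVM and FNM, can be sketched with plain dictionaries. Everything named here is hypothetical (the attribute names, the checksum functions and the alias names); the sketch only shows how an external name might be resolved to an internal element or procedure, and how remapping an alias switches versions without touching the caller.

```python
# Persistent store (#) and transient session memory ($), as plain dicts.
persistent = {"data_size": 2048}
session = {"client_user": "alice"}

# DVM: external attribute name -> (store marker, internal key).
DVM = {
    "objSize": ("#", "data_size"),
    "userName": ("$", "client_user"),
}


def get_attribute(name):
    """Look up an external attribute through the DVM mapping."""
    store, key = DVM[name]
    return (persistent if store == "#" else session)[key]


# P: internal procedures, here two versions of the same operation.
def checksum_v1(data):
    return sum(data) % 256


def checksum_v2(data):
    return sum(data) % 65536


P = {"checksum_v1": checksum_v1, "checksum_v2": checksum_v2}

# FNM: external names map to other names or, eventually, to entries in P.
FNM = {"checksum": "checksum_current", "checksum_current": "checksum_v2"}


def resolve(name):
    """Follow FNM aliases until a name that is not an alias is reached."""
    seen = set()
    while name in FNM:
        if name in seen:
            raise ValueError(f"alias cycle at {name!r}")
        seen.add(name)
        name = FNM[name]
    return name


def call_micro_service(name, *args):
    """Call the internal procedure that an external name resolves to."""
    return P[resolve(name)](*args)


payload = bytes([200, 100])
size = get_attribute("objSize")
with_v2 = call_micro_service("checksum", payload)

# Remapping one alias switches versions; callers still say "checksum".
FNM["checksum_current"] = "checksum_v1"
with_v1 = call_micro_service("checksum", payload)
```

Rules written against the external names stay unchanged while the mapping decides which internal version actually runs.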
The architecture is shown in the figure below.
Virtualizations in iRODS
iRODS can be thought of as providing a new abstraction for the data management processes and policies themselves (using the logical rule paradigm) in much the same way that SRB provided abstractions for data objects, collections, resources, users and metadata. The goal is to be able to characterize the management policies that are needed to enforce authenticity, integrity, access restrictions, data placement, and data presentation, and to automate the application of the policies for services such as administration, authentication, authorization, auditing and accounting, as well as data management policies for replication, distribution, pre- and post-processing, and metadata extraction and assignment. The management policies are mapped onto rules that control the execution of all data management operations. iRODS can be seen as supporting four types of virtualization beyond those supported by a data grid such as the SRB.
- Workflow virtualization. This is the ability to manage the execution of a distributed workflow independently of the compute resources where the workflow components are executed. This requires the ability to manage the properties of the executing jobs. iRODS implements the concept of workflows through chaining of micro-services within nested rule sets and using shared logical variables that control the workflow.
- Management policy virtualization. This is the expression of management policies as rules that can be implemented independently of the remote storage system. We characterize management policies in terms of policy attributes that control desired outcomes. For each desired outcome, rules are defined that control the execution of the standard remote operations. For each rule application, persistent state information is maintained to describe the result of the remote operation. Consistency rules can be implemented that verify that the remote operation outcomes comply with the policy attributes. Rule-based data management infrastructure makes it possible to express management policies as rules and define the outcome of the application of each management policy in terms of updates to the persistent state information. iRODS applies the concept of transactional rules (ACID properties at the data management level) using datalog-type Event-Condition-Action rules working with persistent shared metadata.
- Service virtualization. The operations that are performed by the rule-based data management systems can be encapsulated in micro-services. A logical name space can be constructed for the micro-services that makes it possible to name, organize, and upgrade micro-services without having to change the management policies. This is one of the key capabilities needed to manage versions of micro-services, and enable a system to execute correctly while the micro-services are being upgraded. iRODS micro-services are constructed on the concepts of well-defined input-output properties, consistency verification, and roll-back properties for error recovery. The iRODS micro-services provide a compositional framework realized at run-time.
- Rule virtualization. This is a logical name space that allows the rules to be named, organized in sets, and versioned. A logical name space for rules enables the evolution of the rules themselves.
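The chaining of micro-services through shared variables, together with roll-back on error, can be sketched as follows. This is a simplified model and not the iRODS implementation: the (do, undo) pairing, the step names and the "archive_online" variable are assumptions chosen for illustration, and only completed steps are undone here.

```python
def run_chain(steps, variables):
    """Run (do, undo) pairs in order, sharing one variables dict.

    If a step fails, the undo functions of the steps that already
    completed are called in reverse order and False is returned.
    """
    completed = []
    for do, undo in steps:
        try:
            do(variables)
        except Exception:
            for prior_undo in reversed(completed):
                prior_undo(variables)
            return False
        completed.append(undo)
    return True


# Hypothetical micro-services and their recovery counterparts.
def ingest(v):
    v["stored"] = True


def undo_ingest(v):
    v.pop("stored", None)


def extract_metadata(v):
    v["metadata"] = {"type": "image"}


def undo_extract(v):
    v.pop("metadata", None)


def replicate(v):
    # A shared variable controls whether this step can proceed.
    if not v.get("archive_online", False):
        raise RuntimeError("archive resource unavailable")
    v["replicated"] = True


def undo_replicate(v):
    v.pop("replicated", None)


workflow = [
    (ingest, undo_ingest),
    (extract_metadata, undo_extract),
    (replicate, undo_replicate),
]

failed_vars = {"archive_online": False}
failed_ok = run_chain(workflow, failed_vars)

good_vars = {"archive_online": True}
good_ok = run_chain(workflow, good_vars)
```

In the failing run the third step raises, the two earlier steps are undone, and the shared variables return to their starting state; in the second run all three steps complete.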