Overall Architecture and Main Components
Based on the approach outlined in the overview, we have developed an architecture with four main components: Layered Archive Infrastructure, Producer - Archive Workflow Network (PAWN), Management of Preservation Processes (MPP), and Consumer - Archive Network (CAN). The architecture is illustrated in Figure 1.
 Figure 1
We advocate a distributed archive infrastructure whenever the digital assets are of interest to significant communities. We now clarify what we mean by a "distributed archive infrastructure." For our purposes, an archive refers to a consolidation of the data of an enterprise onto storage systems with a centralized metadata management system. The storage systems themselves may be managed under a single administrative domain as is typical for digital libraries, or managed across widely distributed administrative domains, as is the case in a data grid environment. By distributed archive, we refer to the federation of a number of archives across the layers shown in Figure 1. Such federation can range from a loose federation, say implemented through peer-to-peer technologies, to a tightly coupled federation, say implemented through federated data grid technologies. Distributed archives have a number of significant advantages, including better scalability, increased redundancy, higher reliability (since systems and security failures are typically uncorrelated, especially when dealing with loosely coupled, heterogeneous systems), and overall lower cost because of the possibility of resource sharing.
We now briefly describe the components of ADAPT.
- Layered Archive Infrastructure
The main elements are the following:
- Data Management
- This software layer manages the data (bitstreams) stored across the storage systems, and assigns a unique identifier to each digital object (this can be made globally unique by using cryptographic hashing or through a naming service). Our current pilot persistent archive uses the SRB for data management at each site, and a federated SRB version to manage and replicate data across the sites.
- Information Management
- This software will manage descriptive, preservation, and administrative metadata, and will make use of indexing schemes to support fast access to the data. We are currently using the MCAT (Metadata CATalog) of the SRB for the pilot persistent archive and a geospatial metadata database (based on Informix) for the GLCF. A federated version of MCAT is used to manage information across the three sites of our pilot persistent archive.
- Deep Archive
- We assume that each archive has set aside some storage for a deep archive, which can only be accessed by administrators. In this project, we propose to design and build a peer-to-peer distributed deep archive that will achieve a cost-effective, highly reliable, and secure repository, and will operate independently of the rest of the infrastructure. The security and reliability offered by such a design can be quantitatively shown to be substantially superior to alternative designs.
- Security Management
- This component is responsible for setting up and managing the overall security infrastructure of the archive, which includes support for secure authentication and access, secure ingestion, and secure preservation management. Distributed archives will be handled through distributed trust management, assuming an autonomous security infrastructure for each archive. Our current persistent archive prototype is built around the Grid Security Infrastructure (GSI) and uses X.509 security certificates and public-key encryption.
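The Data Management layer above mentions that an identifier can be made globally unique through cryptographic hashing. A minimal sketch of that idea follows, assuming SHA-256 as the hash; the function name `content_identifier` and the `namespace` prefix are hypothetical and are not taken from the SRB or any other ADAPT component.

```python
import hashlib


def content_identifier(bitstream: bytes, namespace: str = "example-archive") -> str:
    """Derive an identifier from the bitstream content itself.

    Assumption: SHA-256 is the hash function; the source does not name one.
    Identical bitstreams always map to the same identifier, and distinct
    bitstreams collide only with negligible probability.
    """
    digest = hashlib.sha256(bitstream).hexdigest()
    # The namespace prefix is illustrative; it only labels where the id was minted.
    return f"{namespace}:sha256:{digest}"
```

Because the identifier depends only on the content, two sites that hold the same bitstream compute the same identifier without consulting each other; a naming service, the other option mentioned above, would instead require coordination.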
- Producer - Archive Workflow Network (PAWN)
- This component, fully developed and tested, captures the interactions between the producer and the archive and enables automated secure ingestion of digital objects into the archive. Long-term preservation begins when the object is created, and hence the details of this step are crucial to the lifecycle management of the digital objects. PAWN uses METS to encapsulate content, structural, context, descriptive, IP and access rights, and preservation metadata. PAWN supports either the push model (producers prepare and push data into the archive) or the pull model (the archive pulls the data from producers), and its architecture is illustrated in Figure 2.
PAWN consists of three major software components: management server at the producer; client at the producer; and receiving server at an archive (distributed management and receiving servers for distributed producers and archives). We assume the most general case in which a number of people at the producer will be engaged in preparing and transferring data into the archive. The management server will act as a central point for the initial organization of the data, and for tracking bitstreams and metadata functionality. More specifically, this server performs the following functions:
- It provides the necessary security infrastructure to allow secure transfer of bitstreams between the producer and the archive.
- It assigns a unique identifier for each bitstream to be archived, which is unique within a collection, but not globally unique.
- It provides an interface for bitstream organization and metadata editing.
- It accepts checksums or digital signatures, system metadata, and other client-supplied descriptive metadata.
- It tracks which bitstreams have been transferred to the archive.
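The bookkeeping functions listed above (collection-scoped identifiers, stored checksums and metadata, transfer tracking) can be sketched as a small in-memory registry. This is only an illustration under stated assumptions: the class names `BitstreamRecord` and `CollectionRegistry`, the serial-number identifier format, and the in-memory storage are all hypothetical and do not describe PAWN's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BitstreamRecord:
    identifier: str
    checksum: str
    metadata: Dict[str, str] = field(default_factory=dict)
    transferred: bool = False


class CollectionRegistry:
    """Hypothetical stand-in for the management server's tracking role."""

    def __init__(self, collection: str) -> None:
        self.collection = collection
        self._next_serial = 1
        self._records: Dict[str, BitstreamRecord] = {}

    def register(self, checksum: str, metadata: Optional[Dict[str, str]] = None) -> str:
        # A per-collection serial number is unique within this collection only,
        # not globally, matching the scope described in the list above.
        identifier = f"{self.collection}-{self._next_serial:06d}"
        self._next_serial += 1
        self._records[identifier] = BitstreamRecord(identifier, checksum, dict(metadata or {}))
        return identifier

    def mark_transferred(self, identifier: str) -> None:
        self._records[identifier].transferred = True

    def pending(self) -> List[str]:
        # Bitstreams registered but not yet confirmed as transferred.
        return [r.identifier for r in self._records.values() if not r.transferred]
```

A caller would register each bitstream as it is prepared, call `mark_transferred` once the archive confirms receipt, and use `pending()` to see what still has to be sent.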
 Figure 2
A client will run on each machine to automatically register preservation information and transfer the corresponding Submission Information Packages (SIPs), as defined by the OAIS model, into the archive. The client will be responsible for:
- Bulk registration of bitstreams, checksums and system metadata
- Assembly of a valid SIP
- Transmission of SIP to the archive either directly or through a third party proxy server
- Automatic harvesting of descriptive metadata (e.g., e-mail headers) as necessary.
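The first two client responsibilities, bulk registration of checksums and system metadata and assembly of a package, can be sketched as follows. The sketch assumes SHA-256 checksums and emits a plain JSON manifest as a stand-in; PAWN itself encapsulates metadata in METS, which this example does not attempt to produce. The function names and manifest fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large bitstreams need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_sip_manifest(root: Path) -> dict:
    """Walk a directory and record path, size, and checksum for every file."""
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "size": path.stat().st_size,  # system metadata
            "sha256": sha256_of_file(path),
        })
    # Illustrative structure only; a real SIP would carry a METS document.
    return {"package_type": "SIP", "files": entries}


def manifest_to_json(manifest: dict) -> str:
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Sorting the file list makes the manifest deterministic, so the same directory always yields the same manifest text, which simplifies later comparison at the archive.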
The archive will have a server set up to receive data transferred from the producer. This server will accept data and initiate verification/validation processes on the bitstream. Some security key negotiation among all three components (management server, client, and receiving server) may be necessary for the producer to securely transfer documents to the archive. The receiving server will need to do the following:
- Securely accept SIPs from clients at a producer site
- Process SIPs and initiate verification/validation processes
- Coordinate authentication with the management server at the producer site
- Verify with the management server that all SIPs have arrived intact
- Provide enough temporary storage for incoming SIPs until they can be replicated into a digital archive and validated.
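The "arrived intact" check above can be illustrated with a checksum comparison between what the producer registered and what the receiving server actually holds. This is a minimal sketch assuming SHA-256 checksums and in-memory data; the function `verify_sip` and its message strings are hypothetical, and the authentication and secure-transport steps are deliberately left out.

```python
import hashlib
from typing import Dict, List


def verify_sip(expected: Dict[str, str], received: Dict[str, bytes]) -> List[str]:
    """Compare received bitstreams against the checksums registered by the producer.

    expected maps a bitstream name to its registered SHA-256 hex digest;
    received maps a bitstream name to the bytes that arrived.
    Returns a list of problems; an empty list means everything matched.
    """
    problems: List[str] = []
    for name, checksum in sorted(expected.items()):
        if name not in received:
            problems.append(f"missing: {name}")
        elif hashlib.sha256(received[name]).hexdigest() != checksum:
            problems.append(f"checksum mismatch: {name}")
    # Anything that arrived without being registered is also reported.
    for name in sorted(set(received) - set(expected)):
        problems.append(f"unexpected: {name}")
    return problems
```

In this sketch a SIP would stay in temporary storage until `verify_sip` returns an empty list, at which point it could be handed on for replication.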
The overall security architecture of PAWN is based on open standards (PKI, X.509, and GSI - Grid Security Infrastructure) and distributed trust management. It enables mutual authentication and confidential communication, and requires little or no user intervention. Since we assume minimal operational trust between an archive and a producer, we allow each party to manage security locally. More details about PAWN can be found in:
UMIACS-TR-2004-49 - PAWN: Producer - Archive Workflow Network in Support of Digital Preservation
PAWN version 1.0 was released in July 2004, and we expect the next version to be released in November 2004.
- Management of Preservation Processes (MPP)
- This software component deals with policy management, monitoring, and preservation services. It is based on automated tools that monitor and detect system, network, and media degradation and failures; track and manage technology evolution and obsolescence; and drive and audit the replication, migration, and recovery of collections when they are at risk. Our policy manager software, built under the Lightweight Preservation Environment (LPE) project in collaboration with Fujitsu Laboratories of America, is the first step toward the development of this component.
- Consumer - Archive Network (CAN)
- This software component enables high-level exploration of and access to the content of the distributed archive, including information discovery across collections using ontology technology, retrieval and display of content, and advanced digital library services.