Metadata Harvesting

EPOS metadata: a first approach to the construction of Services and Metadata Harvesting

General Technical Outline

The EPOS architecture will employ a metadata catalogue system as part of the EPOS Integrated Core Services (ICS), and this metadata catalogue will hold the necessary details of the data and services provided by the EPOS TCS for the ICS to access. The metadata catalogue will enable the EPOS ICS user to perform discovery, visualisation, processing and other functions.

In order to manage all the information needed to satisfy user requests, all metadata describing the Thematic Core Services (TCS) Data, Datasets, Software and Services (DDSS) will be stored in the EPOS ICS internal catalogue. This catalogue is based on the CERIF model, which differs from most metadata standards used by the various scientific communities. For this reason, EPOS ICS has used the EPOS Metadata Baseline Document to communicate to the TCSs the core metadata elements that the ICS requires.

The EPOS baseline presents a minimum set of common metadata elements required to operate the ICS, taking into consideration the heterogeneity of the many TCSs involved in EPOS. The baseline can be extended to accommodate extra metadata elements where those elements are deemed critical in describing and delivering the data services of a given community.

The metadata to be obtained from the EPOS TCS as described in the baseline document (and any other agreed elements) will be mapped to the EPOS ICS catalogue. The conversion of metadata acquired from the EPOS TCS to CERIF will be done by EPOS WP07 and WP06, in consultation with each TCS about what metadata it has available and which harvesting mechanisms it supports. The expectation is that the various TCS nodes will have APIs or other mechanisms to expose the metadata describing the available DDSS in a TCS-specific metadata standard that contains the elements outlined in the EPOS baseline document; these are described in more detail in the following sections. The ICS, in turn, requires APIs (wrappers) or other mechanisms to map this metadata and store it in the CERIF-based ICS metadata catalogue. These TCS APIs and the corresponding ICS converters collectively form the “interoperability layer” in EPOS, which is the link between the TCSs and the ICS.
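
As an illustration of the converter side of the interoperability layer, the following minimal Python sketch maps a TCS record expressed with Dublin Core element names onto a flat ICS-side record. The target field names and the `FIELD_MAP` table are hypothetical placeholders, not actual CERIF entities or EPOS baseline elements; the real mapping is the one agreed with each TCS.

```python
# Hypothetical mapping from Dublin Core element names (TCS side) to
# illustrative ICS-side field names. The targets are placeholders,
# not real CERIF entities or attributes.
FIELD_MAP = {
    "dc:identifier": "identifier",
    "dc:title": "name",
    "dc:description": "description",
    "dc:creator": "responsible_party",
}


def convert_record(tcs_record):
    """Return (converted record, list of source elements that were not mapped)."""
    converted = {}
    unmapped = []
    for element, value in tcs_record.items():
        target = FIELD_MAP.get(element)
        if target is None:
            # Keep track of elements the mapping does not cover yet.
            unmapped.append(element)
        else:
            converted[target] = value
    return converted, unmapped


sample = {
    "dc:identifier": "example-dataset-001",
    "dc:title": "Example seismic waveform dataset",
    "dc:rights": "CC-BY",
}
record, leftovers = convert_record(sample)
```

Reporting the unmapped elements, rather than silently dropping them, gives the ICS and the TCS a concrete list to discuss when extending the mapping.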

NOTA BENE: Metadata here means metadata as defined in standards such as ISO19115/19, DCAT, Dublin Core, INSPIRE etc., that is, metadata describing the DDSS elements, and not the index or the detailed scientific data.


General recommendations for TCS

1. TCS metadata can be expressed in established metadata profiles, for example: ISO 19115, INSPIRE, DCAT, Dublin Core, CKAN.

2. TCS should provide metadata to EPOS covering the metadata elements defined in the EPOS baseline document as a minimum.

3. Each type of DDSS delivered must be associated with descriptive metadata that enables efficient data discovery and contextualisation (which enables a user to determine relevance and quality). The metadata should follow, when possible, an approved (e.g. ISO) or de facto standard.

4. Each type of DDSS delivered (and its associated metadata) should be accessible via web services and/or APIs. The web services for exposing TCS metadata should preferably be defined using a recognised international standard, e.g. OGC Catalogue Service (CSW) [a3], FDSN, OAI-PMH.

5. Each web service or API to access DDSS should be, when possible, based on approved (e.g. ISO) or de facto standards. Before building any new web services, a preliminary communication with the ICS team should be carried out in order to verify that they are suitable.

6. If a TCS needs to develop a metadata standard for its community, a preliminary check should be carried out with the ICS team and other communities to verify which metadata standards (elements) are appropriate. The new standard should include at a minimum the metadata elements defined in the EPOS Baseline, which are based on internationally recognised metadata standards (e.g. ISO, INSPIRE).
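
Recommendation 2 above (covering the baseline elements as a minimum) lends itself to an automated check. The sketch below is only illustrative: the element names in `BASELINE_ELEMENTS` are hypothetical stand-ins, and the authoritative list is the one in the EPOS Metadata Baseline Document.

```python
# Hypothetical stand-in for the minimum element set; the authoritative
# list is the one in the EPOS Metadata Baseline Document.
BASELINE_ELEMENTS = {"identifier", "title", "description", "contact", "access_url"}


def missing_baseline_elements(record):
    """Return, sorted, the baseline elements that are absent or empty in a record."""
    return sorted(
        name
        for name in BASELINE_ELEMENTS
        if record.get(name) in (None, "", [], {})
    )


candidate = {
    "identifier": "example-dataset-001",
    "title": "Example dataset",
    "description": "",          # present but empty, so reported as missing
    "contact": "data-manager@tcs.example.org",
}
gaps = missing_baseline_elements(candidate)
```

A record with an empty `gaps` list would satisfy the minimum; extra, community-specific elements are simply ignored by this check.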

Metadata Protocols

As stated in the previous section, the general recommendation is to use standards, for instance OGC web services (such as CSW) or other relevant types of web services. Metadata could also be extracted from OGC services other than CSW, such as a WMS. TCS communities that are looking for a portal solution and want to expose their metadata via web services may be able to reuse existing software. GeoNetwork, for instance, can harvest metadata from various sources (file system directories, web-accessible folders, OGC CSW, OAI-PMH and others), can be used as a metadata portal, and can expose the metadata through various protocols (OGC CSW, OAI-PMH).
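
To make one of the protocols mentioned above concrete, here is a minimal sketch of an OAI-PMH harvesting step using only the Python standard library. It builds a `ListRecords` request URL and parses a response carrying Dublin Core (`oai_dc`) records. The base URL and the sample response are invented for illustration, and no network call is made; a real harvester would fetch each URL and repeat the request while a resumption token is returned.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text: str):
    """Return (records, resumption token or None) from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Invented sample response, trimmed to the elements used above.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example station inventory</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("https://tcs.example.org/oai")  # hypothetical endpoint
records, token = parse_list_records(SAMPLE)
next_url = list_records_url("https://tcs.example.org/oai", resumption_token=token)
```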

Metadata Templates

As part of the process to facilitate the easy extraction of metadata from the TCS, EPOS is working on a series of metadata templates based on the EPOS Metadata baseline, to help those communities that would find it easier to provide their metadata by populating such templates. This is not the only route, but an option for certain communities to provide their metadata to EPOS.

We aim to provide these templates in formats such as JSON-LD and XML, to help those communities that require them and do not have existing metadata harvesting mechanisms. The approach will differ for every community, so we expect to interact directly with each TCS to understand its metadata capacities and set-up, and to adopt a mechanism that makes it easy to extract its metadata and map it to the EPOS metadata catalogue.
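
To give an idea of what a populated JSON-LD template could look like, the sketch below fills a small dataset description using DCAT and Dublin Core terms. This is an assumption for illustration only: the actual EPOS templates are not reproduced here, and the function name, the choice of properties and all values are hypothetical.

```python
import json


def dataset_template(identifier, title, description, access_url, keywords=()):
    """Build a JSON-LD description of one dataset using DCAT and Dublin Core terms."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:identifier": identifier,
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        "dcat:distribution": [
            {"@type": "dcat:Distribution", "dcat:accessURL": {"@id": access_url}}
        ],
    }


# All values below are placeholders a TCS would replace with its own.
filled = dataset_template(
    identifier="example-dataset-001",
    title="Example GNSS time series",
    description="Placeholder description supplied by the TCS.",
    access_url="https://tcs.example.org/data/example-dataset-001",
    keywords=["GNSS", "time series"],
)
text = json.dumps(filled, indent=2)
print(text)
```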

Metadata Harvesting

Objectives and simple user story

In the framework of the TCS harvesting, the goal of this section is to outline the main general guidelines. These guidelines will be implemented case by case (i.e. with a use-case approach) in collaboration with each TCS when implementing the harvesting technologies.

The above process will be driven by user stories. An example of an extremely simple user story is the following:

"As a Scientist, I want a list of FACILITIES/DATASETS/PERSON in this specific bounding box (and time range)"

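The user story above can be read as a spatial and temporal filter over catalogue entries. The following sketch shows that reading on an in-memory list; the `Entry` fields and the sample entries are hypothetical, and a real catalogue would run the equivalent query against its database rather than in Python.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    kind: str    # "FACILITY", "DATASET" or "PERSON"
    name: str
    lon: float   # longitude in decimal degrees
    lat: float   # latitude in decimal degrees
    start: date  # start of the period the entry covers
    end: date    # end of the period the entry covers


def search(entries, bbox, time_range=None):
    """Return entries inside bbox (west, south, east, north) and, if given,
    overlapping time_range (t0, t1). Boxes crossing the antimeridian are
    not handled in this sketch."""
    west, south, east, north = bbox
    hits = []
    for e in entries:
        inside = west <= e.lon <= east and south <= e.lat <= north
        in_time = time_range is None or (
            e.start <= time_range[1] and e.end >= time_range[0]
        )
        if inside and in_time:
            hits.append(e)
    return hits


entries = [
    Entry("FACILITY", "Station A", 12.5, 41.9, date(2010, 1, 1), date(2020, 12, 31)),
    Entry("DATASET", "Dataset B", 2.3, 48.8, date(2015, 1, 1), date(2016, 1, 1)),
    Entry("DATASET", "Dataset C", 13.0, 42.5, date(2000, 1, 1), date(2005, 1, 1)),
]
box = (10.0, 40.0, 15.0, 45.0)
in_box = search(entries, box)
in_box_and_time = search(entries, box, (date(2012, 1, 1), date(2013, 1, 1)))
```
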
Harvesting Technical Aspects

Types of services TCS should provide

The heterogeneity of the EPOS community implies that TCS have different maturity levels. Some communities may only give access to metadata attached to raw Digital Objects (e.g. an FTP repository containing files with metadata in the header). However, the best option is to provide a software layer which implements services for (a) “Data discovery”; (b) “Data Access and Retrieval” (e.g. APIs, RESTful); (c) “data analysis/visualisation/modelling/mining”, possibly based on existing standards.

1. Data Discovery services: are used to discover, through metadata, the data of interest in remote repositories. EPOS ICS is indifferent with respect to topology (single-sited, distributed or federated repository) and access points (single or multiple) as long as clear definitions of how to access the services are provided.

2. Data Access and Retrieval services: are used to access a digital object (i.e. a file, a dataset, a document) and use it. Access here can also be thought of as a download process. These services may also include IAAA. For both services, the use of existing, international, community-accepted standards is highly recommended. OGC service standards, INSPIRE-based standards, Dublin Core and other international standards can be integrated relatively easily because a number of tools to manage data and metadata already exist. Reusing existing software tools will save resources for software maintenance in the long term on both the TCS and the ICS side.

3. Data analysis/visualisation/modelling/mining services: are used to extract meaning from the data using the software services provided. EPOS ICS will discover appropriate software services for the kinds of data requested by the user and (semi-automatically) compose the software stack to process the dataset(s) as required. Such aspects are still ongoing work in EPOS, in particular in the Computational Earth Science layer.

Supported methods

Web services are the recommended way of exposing the metadata. However, in the process of building their e-infrastructure, the TCS may provide metadata by other methods, for instance by programmatically pushing (e.g. with a daily cron job) XML or JSON files to the ICS-C e-infrastructure (details to be specified). The main methods can be summarised as follows:

1. Web services (PULL)

  -APIs or web services
  -Database dump from local (TCS) databases

2. Programmatic provision (PUSH), which may consist, for instance, of a programmatic (cron job) "sending" of metadata files to an ICS endpoint
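
As a sketch of the PUSH option, the snippet below prepares an HTTP POST that a scheduled job could send to the ICS. The endpoint URL and the `{"records": [...]}` envelope are assumptions made for illustration, since the ICS-C interface is still to be specified; the request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: the ICS-C push interface is still to be specified.
ICS_ENDPOINT = "https://ics.example.org/metadata/push"


def build_push_request(records, endpoint=ICS_ENDPOINT):
    """Prepare (but do not send) an HTTP POST carrying metadata records as JSON."""
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_push_request([{"identifier": "example-dataset-001"}])

# A scheduled job (e.g. run daily from cron) would then send it:
# with urllib.request.urlopen(request, timeout=30) as response:
#     response.read()
```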

Metadata Exchange / Harvesting

With respect to the metadata and related web services, there are two possible strategies to follow in order to achieve the ICS-TCS interoperability:

1. Metadata dump: the metadata from the TCS is fully copied to the ICS metadata catalogue. It guarantees that the metadata is fully managed by the ICS, and it relieves the TCS of the burden of providing a highly efficient and robust system for access to the metadata. However, it requires periodic (e.g. daily) polling/copying procedures, and synchronization mechanisms must be put in place to guarantee consistency.

2. Metadata Runtime Access: the metadata is accessed at runtime by querying web services through the defined APIs. The API specifications must be stored in the ICS Metadata catalogue (to enable the ICS to access the system in an autonomic way). It avoids the error-prone procedure of dumping the metadata. However, it requires that the TCS build very reliable and robust systems, able to manage the high number of concurrent queries generated by users of the ICS.
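
For the metadata dump strategy, the synchronization step can be sketched as a comparison between the current TCS snapshot and what the ICS stored at the last copy. The approach below, which compares per-record SHA-256 fingerprints, is one possible technique and not a mechanism prescribed by EPOS; the function names and sample records are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record: keys are sorted so field order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_sync(tcs_records, ics_fingerprints):
    """Compare the TCS snapshot (id -> record) with the fingerprints the ICS
    stored at the last copy (id -> hash) and list what has to change."""
    return {
        "add": sorted(i for i in tcs_records if i not in ics_fingerprints),
        "update": sorted(
            i for i in tcs_records
            if i in ics_fingerprints
            and fingerprint(tcs_records[i]) != ics_fingerprints[i]
        ),
        "delete": sorted(i for i in ics_fingerprints if i not in tcs_records),
    }


# State stored by the ICS after the previous copy.
stored = {
    "a": fingerprint({"title": "A"}),
    "b": fingerprint({"title": "B"}),
    "c": fingerprint({"title": "C"}),
}
# Current TCS snapshot: "b" changed, "c" was withdrawn, "d" is new.
snapshot = {"a": {"title": "A"}, "b": {"title": "B v2"}, "d": {"title": "D"}}
plan = plan_sync(snapshot, stored)
```

Only the records listed in `plan` need to be touched on each run, which keeps a daily copy cheap even when the catalogue is large.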

Payload and schema

The payload of the metadata should be compliant with the main standards and methods. The suggestion is to use XML or JSON (including JSON-LD), which are the formats used by most of the services. With respect to the schema of the metadata payload, the suggestion is to use a standard schema (to be decided in the framework of collaboration between TCS and ICS).