Suggested schema change
To enable encoding of additional provenance information that is located outside the ANDS Collections Registry and is described using a URI, this proposal recommends adding a new type “provenance” to the relatedInfo element type.
"provenance": any information about how the data was derived or produced, including what events (such as creation, assembly, annotation, and transformation) happened to the data, which instruments or software were used, who was involved, and when and where an event happened.
Problem this suggestion addresses
RIF-CS v1.6.0 has some capacity to encode provenance information for a collection object, for example description:type="lineage" and date:type="dc.created", as well as related service and party objects. However, data providers may have fine-grained provenance information that cannot be transformed into a RIF-CS record, or may not own, and thus may not be able to describe, all resources in a provenance chain. Some data providers may even go a step further by setting up a provenance access and query service and generating provenance reports on the fly. The proposed addition of type="provenance" to the relatedInfo suggested vocabulary is intended to meet these requirements.
RIF-CS schema components affected
relatedInfo Type vocabulary
An entry in a Collection record may appear as:
Use case 1: data provider makes available a provenance access and query service or a provenance report generation service (e.g. Galaxy IReport), which generates a whole trail of provenance information around a data collection:
<title>CSIRO Provenance Access and Query Service</title>
<identifier type="uri"> </identifier> #a current RIF-CS record with a PID 1234
<note>This provenance query results in a trail of provenance information of the collection.</note>
Note: This element is repeatable. Repeating it is useful when a RIF-CS record describes a collection with more than one data item while the data provider’s provenance information is recorded per data item.
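For illustration, the fragments in use case 1 might sit inside a complete relatedInfo element along the following lines. This is a sketch, not text from the schema: the wrapper element and type attribute follow the proposal, the child elements and their order mirror the fragments above, and the URI is a hypothetical placeholder. Exact child element names and ordering should be checked against the published RIF-CS schema.

```xml
<!-- Sketch only. The URI below is a hypothetical placeholder, not a real
     service endpoint; child element names mirror the fragments above. -->
<relatedInfo type="provenance">
  <title>CSIRO Provenance Access and Query Service</title>
  <identifier type="uri">https://example.org/provenance/query?collection=1234</identifier>
  <note>This provenance query results in a trail of provenance information of the collection.</note>
</relatedInfo>
```

Where provenance is recorded per data item, the same element could be repeated within the collection record, once per data item, each with its own URI.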
Use case 2: data provider has captured and saved provenance information in a text file, e.g. an unprocessed log file or a report generated from R code:
<title>Steps taken to produce this collection </title>
<identifier type="uri"> </identifier> #knitr generated report from R markdown
Use case 3: data provider has captured provenance information in a metadata record external to Research Data Australia:
<title>More provenance information in this landing page/metadata record</title>
<note>The URL points to a landing page with more provenance information.</note>
Data providers are strongly encouraged to follow the approach described in use cases 1 and 2. Use case 3 describes a scenario that sets a low entry point for data providers who have provenance information but have not yet implemented the policies, procedures and systems to manage it. In some cases, the URI in use case 3 may be the same as the one in rifcs:location, when the location URI points to a landing page (source metadata managed by the data provider). Even so, it is still worth indicating through relatedInfo:type=provenance that the target page has provenance information not included in the RIF-CS record in RDA.
Impact on content providers
This addition provides more options to data providers; there will be no impact on those organisations that choose not to use it. For those who do wish to use the option, they will need to decide: does the page a provenance URI points to have provenance information additional to that already presented in the current RIF-CS record? If the answer is yes, they will then need to implement the option in their own systems, or accommodate it in their harvest of records to RDA. It is the responsibility of data providers to decide what and how to present provenance information on the linked page.
More and more ANDS providers are considering data provenance issues, especially those involved in data-intensive computing such as virtual labs. Some of them have taken the step of capturing and storing provenance information in log files or in a database, and of describing it in forms ranging from free text to PROV-O (the provenance ontology), which may or may not be encoded directly in a metadata record. The proposed “provenance” option will allow these data providers to link a RIF-CS record to richer provenance information than the RIF-CS record itself provides.
A data provider that chooses to adopt this vocabulary option will need to implement it in its own system (if it captures metadata in native RIF-CS) and/or change its harvest to RIF-CS.
- The new relatedInfo type “provenance” will need to be added to the ORCA registry database table.
- The vocabs.xml file will need to be amended to add the new relatedInfo type and definition.
- The vocabularies.html will need to be regenerated to reflect the addition of the new relatedInfo type.
- Changes will be required to the Content Providers Guide and Metadata Content Requirement document.
RIF-CS description types include the type “lineage”. Clear guidance should be given to data providers on when to use description:type=lineage and when to use relatedInfo:type=provenance. The description:type=lineage option is used in collection records to articulate lineage information for repositories that capture such detail in their native schema. For example, the ISO 19115-2 lineage statement element maps naturally to RIF-CS description:type=lineage. By contrast, relatedInfo:type=provenance is mainly intended to show what other resources are involved in creating or using a collection and how these resources are related to the collection.
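The distinction might look like this within a single collection record. This is a minimal sketch under stated assumptions: the collection type, the lineage text, the title and the URI are all hypothetical, and required elements of a full RIF-CS record (such as key and originatingSource) are omitted for brevity.

```xml
<!-- Sketch only: all values are hypothetical and the record is abbreviated. -->
<collection type="dataset">
  <!-- Lineage statement carried inside the record itself, e.g. mapped from
       an ISO 19115-2 lineage statement in the source metadata. -->
  <description type="lineage">Derived from raw sensor readings by averaging over daily intervals.</description>
  <!-- Link to richer provenance information held outside the registry. -->
  <relatedInfo type="provenance">
    <title>Steps taken to produce this collection</title>
    <identifier type="uri">https://example.org/reports/processing-report.html</identifier>
  </relatedInfo>
</collection>
```

In this arrangement the lineage description stays readable within RDA, while the relatedInfo entry points to external material that cannot be expressed in the record.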