Useful links and materials
Contents
- 1 General
- 2 Standards, iso norms, standardisation
- 3 Software tools
- 4 Community standards for data exchange in collection domain
- 5 Archive (file) formats and archive files
- 6 FAIR data archiving and "distributed" data archiving, visions and concepts
- 7 Archiving and long-term storage organisations in Europe with AIPs from the Science Collection domain
- 8 Further materials for discussion
- 9 2019: Biodiversity_Next Symposium (SI55): "Federated Infrastructures for Sustainable Biodiversity Data Management"
- 10 2020: CETAF Joint ISTC and Digitisation Working Groups Virtual Meeting
General
List of digital preservation initiatives
EOSC Marketplace for Data Storage and Data Archiving, see also Solutions for a sustainable EOSC: A FAIR Lady (olim Iron Lady) report from the EOSC Sustainability Working Group 2020 and PID Architecture for the EOSC - Report from the EOSC Executive Board Working Group (WG) Architecture PID Task Force (TF) and the project OCRE
GAIA-X Technical Infrastructure June 2020
Open consultation for the EOSC Strategic Research and Innovation Agenda with EOSC Open Consultation Booklet July 2020
Principles of Archival of Digital Assets, published by iRODS, 2014 (bit preservation and functional preservation)
Digitale Bestandserhaltung in der Praxis – Entwicklung eines Preservation-Planning-Konzepts zur Langzeitarchivierung von digitalem Kulturgut am Beispiel der Verbundlösung Berlin-Brandenburg by C. Loose, 2016, FH Potsdam
Funktionale Langzeitarchivierung digitaler Objekte – Erfolgsbedingungen des Einsatzes von Emulationsstrategien, Suchodoletz 2009, Universität Freiburg
nestor Handbuch. – Eine kleine Enzyklopädie der digitalen Langzeitarchivierung. Version 2.3, 2010 hrsg. v. H. Neuroth, A. Oßwald, R. Scheffel, S. Strathmann, K. Huth im Rahmen des Projektes: nestor - Kompetenznetzwerk Langzeitarchivierung und Langzeitverfügbarkeit digitaler Ressourcen für Deutschland. urn:nbn:de:0008-2010071949
Best practices for sharing and archiving datasets – Polar data catalogue, 2014
Long-term preservation of biomedical research data, 2018
Scientific collections, 2009; these also comprise artefacts, technical objects and DNA samples
FAIR Data and Services in Biodiversity Science and Geoscience, DiSSCo context, Lannom et al. 2019
Provisional Data Management Plan for DiSSCo infrastructure, 2019: "All data that can be linked to collection objects (specimens) are in scope."
DiSSCo Technical Infrastructure, see also DiSSCo Prepare and DiSSCo Knowledge Base
RDA group Interoperable Data Archiving and Migration Using the RDRI Working Group Recommendations with iRODS and DVUploader, see https://www.rd-alliance.org/sites/default/files/InteroperableDatasetExchange.RDA2020_0.pdf ; the BagIt specification is complemented with BagIt Profiles, and the group recommends including DataCite metadata in each package
RDA group Research Data Repository Interoperability WG Final Recommendations with pdf.
RDA group FAIR Data Maturity Model WG
RDA group Assessment of Data Fitness for Use WG
Wikipedia Digital preservation
Neuroth et al. 2014 Nestor - Langzeitarchivierung von Forschungsdaten - eine Bestandsaufnahme
Data complexity (in the size and intricacy of data): Size, structure, variety, abstraction
Researchers often have difficulty finding appropriate data repositories for published data; see the data repositories recommended by Nature under https://www.nature.com/sdata/policies/repositories and the policies for data preservation there
Bähr, T. 2016 Dienstleistungen für die Digitale Langzeitarchivierung
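The BagIt packaging and the bit preservation mentioned above can be illustrated with a minimal sketch. It assumes BagIt 1.0 (RFC 8493) with a SHA-256 payload manifest; the function names `make_bag` and `verify_bag` and the file name `specimen.txt` are hypothetical and not taken from any of the linked recommendations. A BagIt Profile would typically require further tag files (for example the DataCite metadata recommended by the RDA group), which are not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write a minimal bag: bagit.txt, data/ payload and manifest-sha256.txt."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest entries are "<checksum> <path relative to the bag root>".
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute every payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = make_bag(Path(tmp) / "bag", {"specimen.txt": b"example record"})
        print(verify_bag(bag))
```

Re-running `verify_bag` at intervals is one simple way to detect silent corruption of the stored bits; it says nothing about whether the file formats inside the bag remain usable (functional preservation).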
Standards, iso norms, standardisation
Table on ISO and DIN norms relevant for archiving
Norm | Title | Purpose/ Notes |
---|---|---|
ISO 11506:2017 | Document management applications — Archiving of electronic data — Computer output microform (COM)/Computer output laser disc (COLD) | it applies to different types of electronic data, such as text and two-dimensional graphic data which can be represented as a black-and-white image |
ISO 14641:2018 | Electronic document management – Design and operation of an information system for the preservation of electronic documents – Specifications | Attention: This document is not applicable to information systems in which users have the ability to substitute or alter documents after capture. |
ISO 14721:2012 | Space data and information transfer systems — Open archival information system (OAIS) — Reference model | see Reference Model for an open archival information system (OAIS): OAIS/ISO 14721 Version 2012 online: https://public.ccsds.org/pubs/650x0m2.pdf; see also GFBio Overview on Iso Standards for Digital Archives |
ISO 15948:2004 | Information technology — Computer graphics and image processing — Portable Network Graphics (PNG): Functional specification | specifies a datastream and an associated file format, Portable Network Graphics (PNG) |
ISO 16363/TDR | Space data and information transfer systems — Audit and certification of trustworthy digital repositories | see also GFBio Overview on Iso Standards for Digital Archives |
ISO 16919:2014 | Space data and information transfer systems — Requirements for bodies providing audit and certification of candidate trustworthy digital repositories | see also GFBio Overview on Iso Standards for Digital Archives |
ISO 19005-1:2005 | Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1) | how to use the Portable Document Format (PDF) 1.4 for long-term preservation of electronic documents. It is applicable to documents containing combinations of character, raster and vector data. |
ISO 19566-1:2016 | Information technology — JPEG Systems — Part 1: Packaging of information using codestreams and file formats | describes common elements of a system layer for JPEG standards, referred to as JPEG Systems (example: JPG) |
ISO 20614:2017 | Information and documentation – Data exchange protocol for interoperability and preservation (DEPIP) | DEPIP specifies a standardized framework for the various data (including both data and related metadata) exchange transactions between an archive and its producers and consumers. Interchanges between archives (including archives integrated in organizations, public archives, storage service suppliers) are also considered.... |
DIN 31644:2012-04 | Information und Dokumentation - Kriterien für vertrauenswürdige digitale Langzeitarchive (Information and documentation - Criteria for trustworthy digital archives) | |
DIN 31645:2011-11 | Leitfaden zur Informationsübernahme in digitale Langzeitarchive (Information and documentation - Guide to the transfer of information objects into digital long-term archives) | |
Nestor – Standardisation by DNB
DOA architecture with DONA Specification, 2018
SIARD file format and standard eCH-0165 SIARD Format Specification (SIARD = Software-Independent Archiving of Relational Databases), 2018. This is a normative description of a file format for the long-term preservation of relational databases, see eCH-0165 SIARD Format Specification
Software tools
DBPTK Database Preservation Toolkit
SIARD Suite with SIARD Suite GitHub
KEEP Solutions Portugal, tools for preservation
E-ARK with Deliverables, E-ARK AIP pilot specification and E-ARK SIP Specification for Submission Information Packages
eArchiving project services and tools
LOCKSS technology with LOCKSS software
PERICLES GitHub and publication under file:///C:/Users/Gast/Downloads/PERICLES_AV_Insider_rd_publication.pdf
CORDRA software - Highly configurable software for managing digital objects at scale.
Open Preservation Foundation Products
Rosetta from ExLibris group
Community standards for data exchange in collection domain
Open question: would including schema definitions (as XSD) improve functional long-term preservation?
Data exchange standards, protocols and formats relevant for the collection data domain, overview with emphasis on the GFBio network
Digital Curation Centre DCC: List of disciplinary metadata standards, see also Metadata Guidance and RDA Metadata Standards Directory
Archive (file) formats and archive files
Electronic file formats, see "Archiving"
Public Record Office and Nôm 喃 (PRONOM) is a web-based technical registry to support digital preservation services. It is an operational public file format registry, see PRONOM
Sustainability of Digital Formats: Planning for Library of Congress Collections: Format Descriptions
File format. A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.
Digital File Types - Preservation
Best File Formats for Archiving by Fabian M. Suchanek, 2019
Recommended File Formats for Archiving Research Data
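Media Types (formerly known as MIME types); specification and registration procedures are defined in RFC 6838, see also RFC 6657 on charset handling in textual media types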
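The file format definition in this section says that a format specifies how bits are used to encode information. A minimal sketch of what that means in practice: many formats start with a fixed byte sequence (a "magic number") that tools can use for identification. The table below contains a few well-known signatures only; the function name `identify` is hypothetical, and real identification tools rely on much richer signature sets than this.

```python
# Leading byte signatures of a few common archive-relevant formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(header: bytes) -> str:
    """Return a format name based on the first bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first 16 bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

A matching signature only indicates the container format; it does not show that the file is valid or that it conforms to an archival profile such as PDF/A.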
FAIR data archiving and "distributed" data archiving, visions and concepts
LOCKSS Lessons Learned in Successful Community Collaboration, LOCKSS as digital library program in the digital preservation field
Archiving in a FAIR way, an Overview of Data Archive Costs
Prompting an EOSC in practice: Final report 2018
Safe Archive FEderation SAFE-PLN with MoU
Archiving and long-term storage organisations in Europe with AIPs from the Science Collection domain
The table includes a first selection of trusted data repositories/ data centers with goals in archiving scientific collection and biodiversity research data (last changes, April 2020).
name, country | kind of organisation/ affiliation with respect to archiving services | general mission, scope | science collection metadata standards used for AIPs (see GFBio checklist) | archive formats (see FACILE checklist) | references and pilot studies | AIP-PIDs | contact person in WG4 context (preferably WG4 members) | notes, certification | |
---|---|---|---|---|---|---|---|---|---|
CINES, France | national public institution | national e-infrastructure | DublinCore with extension? | FACILE - list pour un archivage sur la plateforme PAC du CINES | pilot description, ICEDIG document | ePIC... | Nicolas Cazenave | archiving together with EUDAT-CDI? | |
EGI, The Netherlands | a federated (European) e-Infrastructure, publicly funded | European and international e-infrastructure | DublinCore with extension? | various archive formats? | ? | ||||
FinBIF, archiving network, Finland | service infrastructure at one natural history museum, publicly funded | national e-infrastructure | Darwin Core | few selected archive formats?, e.g., XML+XSD?, JPEG 1.0?, tiff? | pilot description, ICEDIG document, Schulman et al. (2021) | HTTP URI... | ? | ||
GBIF data publishers, network of long-term storage and archiving institutions/ organizations | an international federated e-Infrastructure, funded by member states and by single participating archiving institutions | international e-infrastructure + national e-infrastructure + institutional e-infrastructure | archiving done by GBIF data publishers via Darwin Core, see GBIF Darwin Core; alternatively ABCD | few selected archive formats?, e.g. XML+XSD, JPEG 1.0? | | DOI, HTTP URI... | Fabien Caviere | local installation of IPT (GBIF Integrated Publishing Toolkit) or BioCASe provider software generating AIPs for local archiving of published data assets; GBIF nodes may act as data publishers on the national level; GBIF downloads are stored on GBIF servers for 6 months, see https://www.gbif.org/faq?q=DOI | |
GFBio network of data centers and archiving institutions, Germany | service infrastructure at several natural history museums and other archiving institutions, publicly funded | national e-infrastructure + institutional e-infrastructure | ABCD, Darwin Core | few selected archive formats?, e.g., XML+XSD?, JPEG 1.0?, tiff?, wav? and? | data archiving descriptions | HTTP URI, DOI ... | Peter Grobe, Tanja Weibulat | AIPs for archiving published and non-published data assets from the science collection domain; partly together with regional (super)computing centers | |
GWDG, Germany | institute operated and funded by the University of Göttingen and the Max-Planck-Gesellschaft zur Förderung der Wissenschaften e. V. (GmbH) | international e-infrastructure + institutional e-infrastructure | ? | various archive formats? | | ePIC, DOI | Sven Bingert | Offering ePIC service for AIPs for different science domains, with public repositories for scientific data | |
VIAA, now meemoo, Belgium | Belgian/Flemish institute for archives, publicly funded | national and regional e-infrastructure | DublinCore with extension? | various archive formats used in library domain?? | | | Brecht Declercq | Flemish Institute for Archives | |
Zenodo, Switzerland | public services operated by CERN (the latter funded by member states) | international e-infrastructure | OAI-PMH and others, see under Zenodo metadata formats | various archive formats used in library domain? | pilot description, ICEDIG document | DOI... | Donat Agosti? | general-purpose open-access repository, AIPs for archiving published data assets |
Images of scientific collections, of scientific collection objects and parts of them, and of natural science taxa with occurrence and descriptive data are in the focus of scientific collections. Other images and information gained for research studies and published in scientific papers might be linked to scientific collection object data. These data might be stored long-term and even archived, e.g., by the BioImage Archive, see Ellenberg et al. (2018), BioStudies Archive, see Sarkans et al. (2018) and ArrayExpress.
see also TIB Hannover, https://www.tib.eu/de/publizieren-archivieren/digitale-langzeitarchivierung and https://www.tib.eu/fileadmin/Daten/presse/dokumente/baehr-schwab_bub_2018-11.pdf
Further materials for discussion
https://www.gbif.org/data-processing
Data on the Web Best Practices
LERU Roadmap for Research Data
ICEDIG Deliverables: https://icedig.eu/content/deliverables
Digitisation infrastructure design for EUDAT / CINES: Report 2019: specifies the requirements for adapting CINES & EUDAT services for long-term storage of large-scale digitised biodiversity data
Digitisation infrastructure design for Zenodo 2019
Design of a collection digitisation dashboard: Report 2019, MIDS in Table 8
Digitalisation infrastructure for national open science clouds Report 2019, Finland
California Digital Library with CDL Guidelines for Digital Objects (CDL GDO)
What is a digital object (philosophy)
LIBER Fairness Repositories Report
Levels of digital preservation of the National Digital Stewardship Alliance (NDSA)
Digital Preservation Handbook of the Digital Preservation Coalition
CETAF Specimen Preview Profile (SPP) with SourceForge Persistent Collection Objects Identifiers and Best practices for stable URIs and CETAF Specimen URI Tester, see Güntsch et al. (2017) Actionable, long-term stable and semantic web compatible identifiers for access to biological collection objects
UK National Archives: Archive Principles and Practice: an introduction to archives for non-archivists, 2016, see 3.5.6 and 3.5.7
Integrating Institutional Archives with Disciplinary Web Repositories Workshop (iDigBio related, January 2020)
GFBio OAIS standard data pipelines for collection and specimen data
Heuscher, Stephan & Jaermann, Stephan & Keller-Marxer, Peter & Moehle, Frank. (2004). Providing Authentic Long-term Archival Access to Complex Relational Data. Proceedings of the ESA/ESRIN Symposium PV-2004: Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data, Frascati, Italy, October 2004. ESA WPP. 241-261. see https://arxiv.org/abs/cs/0408054 with pdf
Core Trust Seal Certification Glossary, based on OAIS terms
UUID discussion: Triebel, D., Reichert, W., Bosert, S., Feulner, M., Osieko Okach, D., Slimani, A. & Rambold, G. 2018. A generic workflow for effective sampling of environmental vouchers with UUID assignment and image processing. – Database, 2018 (Article ID bax096), 1–10. (doi.org/10.1093/database/bax096), see https://academic.oup.com/database/article-abstract/doi/10.1093/database/bax096/4797113.
Handle system, handle-based system, e.g., DOI system with DOI registration agencies like DataCite
LSID is a URN specification (the namespace is not registered with the Internet Assigned Numbers Authority, IANA); LSID resolver, see LSID in Wikipedia; issuing authorities include ZooBank
CETAF stable identifiers (CSI) for specimens, see Güntsch, A., Hyam, R., Hagedorn, G., Chagnoux, S., Röpert, D., Casino A., Droege, G., Glöckler, F., Gödderz, K., Groom, Q., Hoffmann, J., Holleman, A., Kempa, M., Koivula, H., Marhold, K., Nicolson, N., Smith, V. S. & Triebel, D. 2017. Actionable, long-term stable and semantic web compatible identifiers for access to biological collection objects. – Database, 2017, 1–9. (doi.org/10.1093/database/bax003)
HTTP URI (see RFC3986)
ICEDIG Digital Specimen Repository with Natural Science Identifier (NSId); see https://nsidr.org/#objects/?query=*%3A*&sortFields=/name, issuing authority, e.g., https://www.allianceforbio.org/: "Technically, an NSId is a unique alphanumeric name string registered in the Handle System that acts as an opaque abstract reference to the thing that is identified; in this case, a Digital Specimen. Administration of the Handle System globally is a shared responsibility overseen at its top-level by the DONA Foundation (Geneva). At the sub-global (Europe and other continents) level, DiSSCo is presently (mid-2020) analysing different options for the technical implementation."
DOI; see also Handle Resolver under https://www.handle.net/
multilingual European DOI Registration Agency (mEDRA) with DOI Multiple Resolution - a multiple resolution service, where one DOI points to multiple resources and services associated to the DOI and "Learning objects"
IGSN handle system for geosamples, with IGSN e.V. as registration service and a number of allocating agents of the IGSN
ePIC Persistent Identifiers for eResearch, handle system
DID Decentralized Identifiers W3C proposed recommendation
A Persistent Identifier (PID) policy for the European Open Science Cloud (EOSC), 2020, 10.2777/926037; Wittenberg as co-author
PIDINST - Persistent identifiers for instruments. https://datascience.codata.org/articles/10.5334/dsj-2020-018/; (ePIC related)
ARK IDs, Archival Resource Key with N4T Resolver: Keeps names (identifiers) persistent, forwarding (resolving) them to the best known web addresses. Names --> Things: Any kind of name – ARK, DOI, URN, Handle, PMID, PDB, Taxon, GRID, arxiv, ISSN, ... --> Any kind of thing – web pages, data, physical specimens, vocabulary terms, living beings, groups, ...; for authorities see https://n2t.net/e/ark_ids.html; cooperation with DataCite
REFEDS (Research and Education FEDerations) with eduPersons
Arms, W.Y. 1995. Key Concepts in the Architecture of the Digital Library. https://www.dlib.org/dlib/July95/07arms.html
Kahn R. & Wilensky R. 1995. A Framework for Distributed Digital Object Services. http://www.cnri.reston.va.us/home/cstr/arch/k-w.html with CNRI and DOI
Weigel, T., Kindermann, S. and Lautenschlager, M., 2014. Actionable Persistent Identifier Collections. Data Science Journal, 12, pp.191–206. DOI: http://doi.org/10.2481/dsj.12-058
Klump, J et al 2017 Editorial: 20 Years of Persistent Identifiers – Applications and Future Directions. Data Science Journal, 16: 52, pp. 1–7, DOI: https://doi.org/10.5334/dsj-2017-052
T. Weigel, U. Schwardmann, J. Klump, S. Bendoukha & R. Quick. Making data and workflows findable for machines. Data Intelligence 2(2020), 40–46. doi: 10.1162/dint_a_00026
Schwardmann, U., 2020. Digital Objects – FAIR Digital Objects: Which Services Are Required?. Data Science Journal, 19(1), p.15. DOI: http://doi.org/10.5334/dsj-2020-015
Harjes, J., Link, A., Weibulat, T., Triebel, D. & Rambold, G. 2020. FAIR digital objects in environmental and life sciences should comprise workflow operation design data and method information for repeatability of study setups and reproducibility of results, Database, 2020 (Article ID baaa059), 1–20. (doi.org/10.1093/database/baaa059).
Natural Science Identifiers versus CETAF stable identifiers with discussion on IGSN relation
CatRIS - Catalogue of Research Infrastructure Services with ELViS listed
The research data repository of the Environmental Data Initiative (EDI) and LTER initiative: Gries et al. 2020. Change in Pictures: Creating best practices in archiving ecological imagery for reuse; see also https://dilcis.eu/images/Specifications/AIP/DASBOARD_E-ARK_AIP_1_0.pdf
EUROCRIS with The Common European Research Information Format (CERIF). It is the comprehensive information model for the domain of scientific research. It is intended to support interchange of research information between and with CRISs. It is used by OpenAIRE.
Archiving of BLOBs and similar large binary objects, see https://de.wikipedia.org/wiki/Binary_Large_Object
An Overview of End-to-End Entity Resolution for Big Data: https://dl.acm.org/doi/abs/10.1145/3418896 (relation between linked entities and physical objects)
COSTS of federal scientific collections
https://iwgsc.nal.usda.gov/economic-analyses-federal-scientific-collections
Schindel, D. E. and the Economic Study Group of the Interagency Working Group on Scientific Collections. 2020. “Economic Analyses of Federal Scientific Collections: Methods for Documenting Costs and Benefits.” Report. Washington, DC: Smithsonian Scholarly Press. https://doi.org/10.5479/si.13241612
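Several identifier schemes listed in this section (DOI, Handle, ARK) become actionable by prefixing the identifier with the address of a public resolver. The sketch below shows that mapping under stated assumptions: the input notation with a `doi:`, `hdl:` or `ark:` prefix and the function name `resolver_url` are hypothetical conventions for this example, and the handle and ARK values in the usage lines are placeholders, not real records.

```python
from urllib.parse import quote

# Public resolver entry points for the three schemes covered here.
RESOLVER_PREFIXES = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
    "ark": "https://n2t.net/",
}


def resolver_url(identifier: str) -> str:
    """Turn 'doi:...', 'hdl:...' or 'ark:/...' into a resolvable HTTP URL."""
    scheme, sep, rest = identifier.partition(":")
    scheme = scheme.lower()
    if not sep or not rest or scheme not in RESOLVER_PREFIXES:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    # Percent-encode unusual characters but keep path separators intact.
    path = quote(rest, safe="/")
    if scheme == "ark":
        # ARKs keep their 'ark:' label in the resolver URL.
        return RESOLVER_PREFIXES[scheme] + "ark:" + path
    return RESOLVER_PREFIXES[scheme] + path


if __name__ == "__main__":
    print(resolver_url("doi:10.1093/database/bax003"))
    print(resolver_url("hdl:20.5000.1025/placeholder"))
    print(resolver_url("ark:/12345/placeholder"))
```

The identifier itself stays independent of any one resolver address, which is the point of separating the persistent name from its current web location.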
2019: Biodiversity_Next Symposium (SI55): "Federated Infrastructures for Sustainable Biodiversity Data Management"
SI55 talks:
There were six talks with published abstracts (see hyperlinks and DOIs); four of them were strongly related to WG4. The results gave a good overview of the landscape of federated repositories in the biodiversity domain.
- Gardens4Science: Setting Up a Trusted Network for German Botanic Gardens Using Open Source Technologies (abstract: https://doi.org/10.3897/biss.3.35368)
- The Freshwater Information Platform: An online network supporting freshwater biodiversity research and data publishing (abstract: https://doi.org/10.3897/biss.3.37378)
- SEINet: A Centralized Specimen Resource Managed by a Distributed Network of Researchers (abstract: https://doi.org/10.3897/biss.3.37424, subject strongly related to WG4)
- Long-Term Reusability of Biodiversity and Collection Data using a National Federated Data Infrastructure (abstract: https://doi.org/10.3897/biss.3.37414, subject strongly related to WG4)
- NFDI4BioDiversity: Biodiversity, ecology and environmental data (abstract: https://doi.org/10.3897/biss.3.37282, subject strongly related to WG4)
- Use of European Open Science Cloud and National e-Infrastructures for the Long-Term Storage of Digitised Assets from Natural History Collections (abstract: https://doi.org/10.3897/biss.3.37164, subject strongly related to WG4)
2020: CETAF Joint ISTC and Digitisation Working Groups Virtual Meeting
COST MOBILISE WG4 talk:
- Data archiving strategies in regard to CETAF facilities and planned DiSSCo services – highlighted by COST Mobilise (Dagmar Triebel)
see also Definitions of core terms in the data archiving context