Scientific data collections produced by individuals or science collaborations are more valuable when made available to the broader community in a manner consistent with the FAIR principles. Doing so requires efficient tools for making the data accessible through standard protocols for selection and retrieval. Support is also needed for metadata provision, minting DOIs, etc. Furthermore, infrastructure is needed to create workflows that can be used for vetting such data for publication and subsequent analysis. The availability of these data collections would then enable new cross-experiment/cross-collaboration data sharing, ultimately leading to new scientific discoveries.
Prepare “open data” sets at a sufficiently high abstraction level, together with analysis code and examples. Maintain and register these data sets following international standards. Support small experiments in providing open data.
Multi-Messenger Analysis needs data from many observatories. In particular:
- Data availability: All researchers of the individual experiments or facilities require quick and easy access to the relevant data.
- Analysis: Fast access to the generally distributed data from measurements and simulations is required. Corresponding computing capacities should also be available.
- Simulations and methods development: Researchers need an environment for simulations and the development of new methods (e.g. machine learning).
- Real-time analysis centre: The multi-messenger approach requires a framework to develop and apply methods for joint data-stream analysis.
- Open access: The scientific data must also be made available to the interested public: public data for public money!
- Education in data science: Not only data analysis itself, but also the efficient use of central data and computing infrastructures requires special training.
- Data archive: The valuable scientific data and metadata must be preserved and remain interpretable for later use (data preservation).
Many scientific data collections (from smaller groups or collaborations) lack an efficient way to make their data accessible according to FAIR standards and protocols, both for selection and retrieval and for metadata provision, minting DOIs, etc. We need to provide software packages for this and (possibly also) infrastructure to organise a workflow for vetting data for publication.
Many data sets in astro- and particle physics, in particular those resulting from simulations, are too large to be ingested into a classical research data repository such as an institutional repository, Zenodo, or the like. Making this “dark data” FAIR is the theme of this use case. It involves enrichment with appropriate metadata (covering at least the basic DataCite or Dublin Core fields), the assignment of a persistent identifier (DOI) and a descriptive landing page, and the export of these metadata via appropriate interfaces (OAI-PMH, REST interfaces and more) to the PUNCH4NFDI Science Portal. This effort can also serve as a blueprint for integrating extremely large storage systems with special data (e.g. of a particle-physics experiment) at experimental or observational institutions into PUNCH4NFDI.
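The metadata-enrichment step above can be sketched in a few lines. The following is a minimal, illustrative sketch only: the field names follow the mandatory properties of the DataCite metadata schema, while the DOI, titles and names in the example record are hypothetical placeholders, not real identifiers.

```python
import json

def datacite_record(doi, title, creators, publisher, year, resource_type):
    """Build a minimal metadata record covering the mandatory DataCite
    fields: identifier, creators, title, publisher, publication year
    and resource type."""
    return {
        "identifier": {"identifier": doi, "identifierType": "DOI"},
        "creators": [{"creatorName": name} for name in creators],
        "titles": [{"title": title}],
        "publisher": publisher,
        "publicationYear": str(year),
        "resourceType": {"resourceTypeGeneral": resource_type},
    }

# Hypothetical record for a large simulation data set (placeholder DOI).
record = datacite_record(
    doi="10.xxxx/example-simulation-001",
    title="Example cosmological simulation snapshot",
    creators=["Example Collaboration"],
    publisher="Example Data Centre",
    year=2023,
    resource_type="Dataset",
)
print(json.dumps(record, indent=2))
```

A record like this can then be rendered on a landing page and exposed through an OAI-PMH or REST endpoint for harvesting by the Science Portal.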
Alice is doing her PhD at University X. She wants to measure the angular distribution of B -> K* mu mu decays with data collected by experiment ABC. In a previous analysis of a smaller dataset collected by the same experiment, a tension with the Standard Model prediction was observed. In the internal review of her analysis, Alice is asked to reproduce the previous result, which was the PhD project of Bob at University Y. Alice sends an email to Bob asking about the details of his analysis, but it bounces because Bob has left physics and now works at DeepMind. Alice asks Bob's former supervisor, Carl, who can establish contact with Bob. Bob no longer has access to the resources at the university and asks Doris, an active member of the working group, to search for his code and data. Doris finds three versions of the code, but is unsure which of them, if any, was used to produce the final result. Bob had transferred the ownership of 24 TB of ntuple data on the institute's storage server to Doris before he left; some way has to be found to give Alice access to it. Unfortunately, Bob forgot the ntuple data of the control channels, which were deleted when his account was deactivated. Alice spends a few weeks collecting all available code and data of the previous analysis and then tries to reproduce its result, without success. In the review process, several suggestions are made about what she could try. After three months, Alice and the review committee give up and proceed without reproducing the old result. The question of reproducing the old result comes up again in the collaboration-wide review and the journal review, but cannot be answered successfully.
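A lightweight provenance manifest, written when the original result was approved, would have spared much of this archaeology. The sketch below is illustrative and not tied to any experiment's actual tooling: it records the analysis code version together with checksums of the input ntuple files, so that a later analyst can verify they are looking at the same code and data.

```python
import hashlib
import json
import os

def sha256_of(path, chunk_size=1 << 20):
    """Stream a (potentially very large) file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(code_version, input_files, out_path="manifest.json"):
    """Record which code version and which exact input files produced
    a result, so the analysis can be re-identified years later."""
    manifest = {
        "code_version": code_version,  # e.g. a git commit hash
        "inputs": {os.path.basename(p): sha256_of(p) for p in input_files},
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Had such a manifest been archived alongside the publication, Doris could have identified the correct one of the three code versions, and the missing control-channel ntuples would at least have been known by name and checksum.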
Low-energy particle tracking in electromagnetic fields was developed for KATRIN, but there is also strong demand from other communities.
The data infrastructure of the Einstein Telescope is not yet defined; this is the chance to employ the FAIR and data lake concepts from the beginning.
In astronomy, many collaborations are formed to build instruments and, in recent years, also to provide the data-reduction pipeline that removes the instrument characteristics. Their return on investment is observation time, and, depending on the size of the collaboration, the further analysis of the data is not always organised very efficiently. This led to the notion of collaborative research environments and to more elaborate server-software structures. Some of the astronomy developments are listed here.
Ensuring the long-term executability of time-dependent analysis software for large data volumes through virtualization
The Magneticum Web Portal ( https://c2papcosmosim.uc.lrz.de/ ; arXiv:1612.06380) is a platform for accessing and sharing the output of large cosmological hydrodynamical simulations with a broad scientific community. It has a multi-layer structure: a web portal, a job-control layer, a computing cluster and an HPC storage system. The outer layer enables users to choose an object from the simulations of the Magneticum set ( http://www.magneticum.org/ ) and to run analysis tools on the raw simulation data of that object. The job-control layer is responsible for handling and performing the analysis jobs, which are executed on a computing cluster. The innermost layer is the HPC storage system, which hosts the large raw simulation data.
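The layered flow described above can be sketched in a few lines of Python. All class and method names here are invented for illustration and do not correspond to the portal's actual implementation: a request entered at the web portal becomes a job in the job-control layer, is executed on the cluster, and reads its input from the storage layer.

```python
# Illustrative sketch of a multi-layer portal structure.
# All names are hypothetical, not the portal's real API.

class StorageLayer:
    """Innermost layer: holds the large raw simulation data."""
    def __init__(self, snapshots):
        self.snapshots = snapshots  # e.g. {object_id: raw data}

    def read(self, object_id):
        return self.snapshots[object_id]

class ComputeCluster:
    """Runs an analysis tool on raw data fetched from storage."""
    def __init__(self, storage):
        self.storage = storage

    def run(self, tool, object_id):
        return tool(self.storage.read(object_id))

class JobControl:
    """Middle layer: queues portal requests and dispatches them."""
    def __init__(self, cluster):
        self.cluster = cluster
        self.queue = []

    def submit(self, tool, object_id):
        self.queue.append((tool, object_id))

    def process(self):
        results = [self.cluster.run(t, oid) for t, oid in self.queue]
        self.queue.clear()
        return results

# A user selects an object in the web portal and runs a tool on it.
storage = StorageLayer({"halo_42": [1.0, 2.0, 3.0]})
jobs = JobControl(ComputeCluster(storage))
jobs.submit(lambda data: sum(data) / len(data), "halo_42")
print(jobs.process())  # → [2.0]
```

The point of the layering is that users never touch the raw data directly: the outer layer only forwards well-defined analysis requests, while the storage system stays behind the job-control and cluster layers.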