Analysing distributed or local data sets constitutes a core element of the scientific enterprise,
and this requires a data management system that enables data discovery, efficient data movement,
and the execution of compute jobs. It is also crucial that users can control and monitor these
analyses through a uniform portal interface. Three types of analyses need to be supported:
downloading intermediate data sets from an archive to a local compute platform; moving the data
to a large-scale shared compute resource; and handling extremely large datasets, for which the
so-called code-to-data approach must be adopted. The PUNCH4NFDI science portal will provide a
working solution for PUNCH and other communities.
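The three analysis modes above could be selected by a portal front-end based on data volume. A minimal sketch of such a dispatcher follows; the thresholds and function name are purely illustrative and are not part of any PUNCH4NFDI interface.

```python
# Illustrative only: thresholds and names are hypothetical,
# not part of any PUNCH4NFDI specification.

def choose_analysis_mode(dataset_size_gb: float,
                         local_limit_gb: float = 100.0,
                         shared_limit_gb: float = 100_000.0) -> str:
    """Pick one of the three supported analysis modes by dataset size."""
    if dataset_size_gb <= local_limit_gb:
        return "download-to-local"       # fetch from archive, analyse locally
    if dataset_size_gb <= shared_limit_gb:
        return "move-to-shared-compute"  # stage data to a large shared resource
    return "code-to-data"                # ship the analysis code to the data

print(choose_analysis_mode(10))       # download-to-local
print(choose_analysis_mode(5_000))    # move-to-shared-compute
print(choose_analysis_mode(500_000))  # code-to-data
```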
- (simulation-steered) analysis/processing of large data sets
- organised access to archived data on tape
- porting data workflows to smaller experiments
Many of the large surveys, and increasingly also simulations, require machine-learning techniques to extract the relevant information. In contrast to traditional data analysis, this generally requires not only broadly applicable ML codes but also ML algorithms specific to the envisaged data analysis, together with suitable training sets.
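The split between a generic ML code and a domain-specific training set can be sketched as follows. The nearest-centroid classifier is a stand-in for a generally applicable ML code, and the two-class synthetic catalogue (here labelled "star" and "galaxy") is a stand-in for a survey-specific training set; all names and the toy data are illustrative.

```python
import numpy as np

def fit_centroids(features, labels):
    """Generic ML code: one centroid per class."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(centroids, features):
    """Generic ML code: assign each object to the nearest class centroid."""
    classes = list(centroids)
    dists = np.stack([np.linalg.norm(features - centroids[c], axis=1)
                      for c in classes])
    return np.array(classes)[np.argmin(dists, axis=0)]

# Domain-specific part: a labelled training set (here, synthetic two-feature
# catalogue entries for two source classes).
rng = np.random.default_rng(1)
stars = rng.normal([0.0, 0.0], 0.3, size=(500, 2))
galaxies = rng.normal([1.5, 1.5], 0.3, size=(500, 2))
X = np.vstack([stars, galaxies])
y = np.array(["star"] * 500 + ["galaxy"] * 500)

centroids = fit_centroids(X, y)
accuracy = np.mean(predict(centroids, X) == y)
print(f"training accuracy: {accuracy:.2f}")
```

The same generic code would be reused across communities, while each survey supplies its own features and labels.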
Analyses of large datasets cannot always be carried out by first downloading the data and then processing them; in such cases a code-to-data approach is needed. For a joint analysis of two large datasets (e.g., astronomical datasets from different wavelengths that are each too large for efficient movement), the problem is more complex: it requires a code-to-data analysis of each dataset, followed by an exchange of metadata that allows, e.g., likelihood information or matched-filter information from the two datasets to be combined without moving the full data. One application is the multiwavelength characterisation of an astrophysical source when the datasets are extremely large. As long as the exchanged metadata are small compared to the original data, this approach yields efficiency gains.
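The metadata-exchange idea can be sketched numerically. In this toy example, each of two sites holds a large dataset and evaluates a log-likelihood for a single source parameter on a shared grid; only the small log-likelihood arrays are exchanged and summed to obtain the joint constraint. The Gaussian model, the function names, and the data sizes are illustrative assumptions, not part of any actual pipeline.

```python
import numpy as np

def compute_log_likelihood(data, param_grid, sigma):
    """Site-local log-likelihood of one source parameter on a shared grid.

    This runs where the data live; only its small output is exchanged.
    Toy Gaussian model: each datum is an independent measurement of the
    parameter with known noise level sigma.
    """
    return np.array([-0.5 * np.sum((data - p) ** 2) / sigma**2
                     for p in param_grid])

rng = np.random.default_rng(0)
param_grid = np.linspace(-1.0, 1.0, 201)  # shared, small

# Two "large" datasets held at different sites (e.g. two wavelength bands),
# both generated around a true parameter value of 0.2.
data_site_a = rng.normal(0.2, 1.0, size=100_000)
data_site_b = rng.normal(0.2, 0.5, size=100_000)

# Each site exchanges only a 201-element array, not its 100,000 samples.
logl_a = compute_log_likelihood(data_site_a, param_grid, sigma=1.0)
logl_b = compute_log_likelihood(data_site_b, param_grid, sigma=0.5)

# Joint constraint from combining the exchanged metadata.
logl_joint = logl_a + logl_b
best = param_grid[np.argmax(logl_joint)]
print(f"joint best-fit parameter: {best:.2f}")
```

Here the exchanged metadata (201 numbers per site) are tiny compared to the 100,000-sample datasets, which is exactly the regime in which the code-to-data approach pays off.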
- analysis of large simulated raw data sets on new hardware
- further analysis of primary multi-community observables
This subsection collects use cases contributed by participants in the Astro@NFDI proposal. They represent various domain-specific applications that typically involve the analysis of pixel or catalog data. They have not been fleshed out, but they provide examples from the astro domain that involve data analysis. The workflows would differ in detail but would all benefit from access to statistical and machine-learning libraries. Many of these analyses can be implemented with small or large datasets, so code-to-data techniques would become relevant for most of them as datasets grow, while the more traditional pattern of data download followed by local analysis would also remain useful (and is the typical way these analyses are addressed with current datasets).