Finding and reusing research data


The Leibniz Supercomputing Centre's new FAIR Data Portal makes it possible to find and use research data processed and stored on its computer systems.

Weather data from Bavaria or Italy, simulations of earthquakes or the formation of stars, models of human organs and vessels: the data storage facilities of the Leibniz Supercomputing Centre (LRZ), such as the Data Science Storage (DSS) and the Data Science Archive (DSA), can currently hold more than 250 petabytes of data generated on the high-performance computing (HPC) clusters or the artificial intelligence (AI) systems. "Most of the data sets are too large for the usual research data repositories, such as those of university libraries," says Dr Stephan Hachinger, head of the LRZ research data team responsible for the new portal, describing a fundamental problem. "We are therefore trying to publish them directly from the DSS and DSA and present them in the FAIR Data Portal."


The LRZ FAIR Data Portal is still in the test phase and currently lists only two publications. But these already illustrate how the portal works: it offers an overview of research results from all scientific disciplines produced at the LRZ, which can be searched by keywords, author names and other criteria. "With the portal, we want to make research data findable and accessible," explains Johannes Munke, an engineer from the Research Data Management team. "We still enter the information manually, but we are already working on a stylish input form and ways to record the information automatically." More datasets and research will be added to the portal over time.

Transparency and connectivity in science

Digital research data contain much more information than is needed for a specific project. They should therefore be professionally organised and, where possible, recorded and stored in accordance with the international FAIR principles. This makes them findable, accessible, interoperable and reusable. Simulation results, visualisations or training data for analyses using AI methods can then be verified, but above all they can be further explored, enriched with additional data and recalculated. This makes science transparent, connects researchers and avoids the need to repeat complex experiments or measurements. "In the past, you had to be networked in the research community to get data," says chemist Alex Wellmann, who is also working on the FAIR Data Portal. "Today, a lot of research data is FAIR and can increasingly be searched in portals like ours."

Personal networking is still important in science, but searching for digital research results becomes easier if they are provided with additional data about the content, the authors, the research institutions involved and more. The LRZ records this metadata according to the DataCite standard and at the same time assigns each data record a Digital Object Identifier (DOI). This identifier makes it possible to uniquely identify the digital information and thus fulfil the first rule of the FAIR principles. Each record also contains information about the formats in which it is available, and can be linked to articles and publications produced with it. The FAIR Data Portal also links to the DSS or DSA to make information available via GLOBUS, an international infrastructure for sharing data and computing power for science. In this way, datasets archived on tape in the DSA can be copied back to local hard drives and then edited and processed. The FAIR Data Portal stores edited or new results under a title of their own. "The prerequisite is that scientists want to publish their datasets," explains Munke. "The metadata can be captured and added later to facilitate research. But the research data itself, which is hidden behind a DOI, will always remain the same; we guarantee that."
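What such a metadata record might look like can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions, not the LRZ's actual implementation: the field names are a simplified rendering of the mandatory DataCite properties (identifier, creators, titles, publisher, publication year, resource type), the helper functions are hypothetical, and the DOI, names and title are invented placeholders.

```python
import re

# Simplified names for the mandatory DataCite properties (an assumption made
# for this sketch; the real schema nests several of these fields further).
REQUIRED_FIELDS = (
    "doi", "creators", "titles", "publisher", "publicationYear", "resourceType",
)

# Loose syntactic check only: "10.", a numeric prefix, "/", a non-empty suffix.
# It does not verify that the DOI is actually registered.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def is_plausible_doi(doi):
    """Check that a string has the general shape of a DOI."""
    return bool(DOI_PATTERN.match(doi))


def has_keyword(record, keyword):
    """Case-insensitive keyword lookup, as a portal search might perform it."""
    wanted = keyword.lower()
    return any(wanted == subject.lower() for subject in record.get("subjects", []))


# Invented placeholder record; none of these values refer to a real dataset.
example_record = {
    "doi": "10.1234/example-dataset",
    "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
    "titles": ["Example earthquake simulation output"],
    "publisher": "Example Data Centre",
    "publicationYear": 2024,
    "resourceType": "Dataset",
    "subjects": ["Seismology", "Simulation"],
}

if __name__ == "__main__":
    print("Missing fields:", missing_fields(example_record))
    print("Plausible DOI:", is_plausible_doi(example_record["doi"]))
    print("Tagged 'simulation':", has_keyword(example_record, "simulation"))
```

The split mirrors the division of labour Munke describes: descriptive fields such as keywords can be added or corrected later, while the DOI keeps pointing at the same, unchanged data.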


Entry in the FAIR Data Portal: authors and institutions are named, along with a short description of the datasets and keywords

Open source software for the portal

In addition to the FAIR principles and the DataCite schema, the FAIR Data Portal is based on the open source software InvenioRDM. This was mainly developed at CERN near Geneva and is used in many research data storage systems. The research team is currently testing and improving the functionality with users and is working on a metadata input form. Researchers can use the LRZ servicedesk to get advice on data management according to the FAIR principles and on how to handle their research results. "The importance of data management is now well known in the scientific community, and funders in particular insist that data be FAIR-compliant and therefore discoverable and reusable," notes Munke. His colleague Wellmann adds: "The need for data will increase rapidly, especially through AI methods; conversely, training data should be verifiable and researchable." The LRZ FAIR Data Portal will therefore soon include more data sets: not only from SuperMUC-NG, but from all LRZ resources and various research disciplines. (vs/ssc)