Better search and use of research data

Simulation results, measurement data, interviews, images, social media data, statistics: research data holds more than one answer, so it should be made available to as many users as possible. This is the central goal of several projects of the National Research Data Infrastructure (NFDI), which brings together IT service providers such as the Leibniz Supercomputing Centre (LRZ) with research institutions, colleges and universities in Germany. Since 2018, and initially for up to 10 years, the federal and state governments have been funding the development of open data platforms and portals for various scientific fields with 90 million euros annually. These platforms will be equipped with new, smart analytical tools based on artificial intelligence and statistical methods.


File cabinet - the traditional data storage. Photo: Jan A. Kolar/Unsplash

The LRZ is already directly involved in two consortia and won five more in 2021. The expertise of the Research Data Management, File and Storage Systems, and Big Data and Artificial Intelligence teams is particularly in demand. The aim is to make research data sets easier to find and use internationally, and to do so for as long as possible. The LRZ has committed itself to the FAIR rules, according to which data should be findable, openly accessible, interoperable and reusable. "This simplifies the verifiability as well as the sharing of research results," says Dr Stephan Hachinger in the interview below. He leads the Research Data Management team at the LRZ. "A prerequisite for FAIR data management is the provision of research results with additional information." These standardised metadata are indexed by search engines and describe the content of data sets much as a library index describes books. In addition to author, content and instructions for use, the metadata include a checksum, which makes it possible to verify that a data set has not been manipulated or changed during use.
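How a checksum in the metadata can reveal a changed data set is sketched below in Python. This is only an illustration under stated assumptions: the article does not say which hash algorithm or metadata layout the projects use, so SHA-256 and the field name `checksum_sha256` are hypothetical choices, as are the function names.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large files do not have to fit into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, metadata):
    """Compare the file's current digest with the one stored in its metadata.

    `metadata` is assumed to be a dict with a hypothetical key
    `checksum_sha256` that was filled in when the data set was published.
    """
    return sha256_of_file(path) == metadata["checksum_sha256"]
```

A publisher would store the result of `sha256_of_file` in the metadata once; anyone who later obtains the data set can call `is_unchanged` to check that the copy still matches the published version.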

NFDI projects with LRZ participation

• BERD@NFDI makes data on the economic and labour situation and on social developments accessible and provides artificial intelligence and machine learning methods for evaluating it. BERD@NFDI is led by the University of Mannheim.
• FAIRmat takes care of information and research results from the data-intensive material sciences, physics and chemical physics and is led by the Humboldt University of Berlin.
• NFDI4Earth focuses on the earth sciences. The LRZ is contributing its experience in setting up the terrabyte platform and helping to simplify the use of simulation data. The Technical University of Dresden is leading this project.
• PUNCH4NFDI, on the other hand, is creating a platform on which particle and astroparticle physicists can store their mass data, and is developing integrated data and metadata tools and file formats for this purpose. This project is organised by the German Electron Synchrotron (DESY) in Hamburg and Zeuthen.
• Finally, Text+ addresses text- and language-based data from the humanities, i.e. information such as books, interviews and lectures, and aims to make them accessible for science. The LRZ supports the project with technical know-how.

Since 2019, the LRZ has also been involved in the German Human Genome-Phenome Archive (GHGA), which will provide the technical infrastructure for storing and analysing information on human genetics and medicine under the leadership of the German Cancer Research Center in Heidelberg. The LRZ is also involved in the NFDI Platform for Engineers (NFDI4Ing), which is led by RWTH Aachen. NFDI4Ing is particularly concerned with metadata for high-performance computing on supercomputers such as SuperMUC-NG, a topic that is of great importance for all NFDI projects.

Harmonise Technical Access and Store Metadata

Make research results publicly accessible worldwide and for as long as possible: this task sounds simple, but it poses many challenges, especially since data sets today usually comprise several terabytes or even petabytes. Dr Stephan Hachinger heads the Research Data Management team at the LRZ and describes the challenges of setting up data platforms and storage for science and research of the kind the NFDI is currently promoting.

The NFDI promotes the development of a wide variety of data platforms for the social and material sciences, medicine, astrophysics and other scientific fields. What do these projects have in common despite their different contents? Dr Stephan Hachinger: The NFDI consortia are building open data platforms and portals for various research areas, which are also being equipped with new, smart analysis tools based on artificial intelligence or statistical methods. In doing so, they apply the FAIR principles for research data management and look for technical solutions and tools to make data findable, accessible, interoperable and reusable. This simplifies the verifiability as well as the sharing of research results. A prerequisite for FAIR data management is the provision of research results with additional information. On the one hand, this metadata provides information about the content, such as author, date of creation and subject; on the other hand, it provides instructions for use, such as file format, quality and storage location. The NFDI association brings together all consortia and projects to develop common minimum standards for data use and data storage as well as practical solutions to interdisciplinary, technical requirements.

What challenges does this pose? Hachinger: General problems that certainly arise in all scientific disciplines are, for example, the storage of metadata (filing, format), the completeness of the information, and its efficient publication and export or dissemination in search engines. Harmonising technical access is also not entirely straightforward: researchers at institution A should be able to access the data sets of institution B and, if possible, already evaluate them via cloud services using initial analysis procedures. Last but not least, data volumes in science are growing. Large data sets, for example from simulations in supercomputing, are difficult or impossible to move from their original storage, yet they should still be accessible, interoperable and reusable according to the FAIR principles.

What experience does the Leibniz Supercomputing Centre (LRZ) bring to these projects? Hachinger: Firstly, the LRZ has gained a lot of experience in dealing with big data in a wide variety of areas over the past few years. We have built up our own storage facilities, especially for the largest data sets, and together with researchers we offer and optimise methods of artificial intelligence and machine learning, i.e. smart analysis methods. Together with the German Aerospace Centre, the LRZ has just developed the high-performance data platform terrabyte and is now equipping it with analysis tools. And we have been working for several years on how to equip very large data sets with useful metadata in a discipline-agnostic way, i.e. without focusing on just one scientific discipline. This way, they can be published without having to move them out of the original storage system. This, in turn, is important for giant data sets of several hundred terabytes or even several petabytes, such as those created in supercomputing with SuperMUC-NG. To this end, we are working on possible transmission protocols and procedures to register the information on big data from science in search engines or common indices such as EUDAT-B2FIND, i.e. to make it searchable. To do this, however, metadata must be integrated, and we are working on procedures such as so-called sidecar files or special platforms based on databases. We bring this experience with publishing and disseminating metadata on research results from high-performance computing to the NFDI consortia. It is important for us to be able to optimise and adapt the procedures we have developed according to the ideas of researchers, so that in the end a practicable, technical solution is created that benefits all participants in the NFDI.
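The sidecar-file idea mentioned in the interview can be illustrated with a minimal Python sketch: the metadata is written to a small file next to the data set, so the data itself never has to be moved or modified. The interview does not describe the actual format used at the LRZ, so the JSON encoding, the `.meta.json` suffix, the function names and the field names in the usage note are all assumptions made for illustration.

```python
import json
from pathlib import Path


def sidecar_path(data_path):
    """Return the path of the sidecar file that belongs to a data file.

    The ".meta.json" suffix is a hypothetical naming convention.
    """
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".meta.json")


def write_sidecar(data_path, metadata):
    """Write the metadata as JSON next to the data file and leave the
    data file itself untouched."""
    target = sidecar_path(data_path)
    target.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return target


def read_sidecar(data_path):
    """Load the metadata stored in the sidecar file of a data file."""
    return json.loads(sidecar_path(data_path).read_text(encoding="utf-8"))
```

A call such as `write_sidecar("run42.h5", {"author": "...", "created": "...", "subject": "...", "file_format": "HDF5", "storage_location": "..."})` would place `run42.h5.meta.json` beside the data file. A harvester could then collect only these small files and pass them to a search index, while the large data set stays in its original storage system.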
Specifically for the LRZ, we would like to use this to develop a standard service for research data management, especially for the management of supercomputing data, and supplement this with more services for specific research areas that are also represented in the NFDI consortia. (vs)