Measurement and laboratory values, descriptions of experiments, charts and tables, papers, presentations: as in all research projects, data and files are continuously generated in Transregio PlantMicrobe (TRR356), in which Ludwig-Maximilians-Universität München (LMU), the Technical University of Munich (TUM), and Eberhard Karls University of Tübingen (EKUT) have been collaborating since 2023. The data concern the coexistence of microbes and plants and include data that biologists and geneticists generate themselves, data they collect from others and use, and data they publish. “It is important that researchers structure and manage their data properly from the very beginning,” recommend Alexander Wellmann and Dr. Matthias Krinninger from the Research Data Management Team at the Leibniz Supercomputing Centre (LRZ). “This includes briefly describing files and information, commenting on them when necessary, and organising them clearly, for example through meaningful names or in folders and subfolders.”
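What such a structure might look like varies by project; the following layout is purely illustrative and does not reflect TRR356’s actual conventions:

```
field-trial-2024/
├── README.txt          # what the dataset contains, who is responsible
├── raw/                # unmodified instrument output, never edited
│   └── 2024-03-12_sequencing_run01/
├── processed/          # derived data, named after the producing step
├── scripts/            # analysis code, kept under version control
└── docs/               # protocols, lab notes, figure drafts
```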
Research data management is still a comparatively young task in science. It is rarely taught, and if at all, then most often by libraries; yet it promotes the efficiency and traceability of results, especially in computational work. Through contributions to various projects and to the National Research Data Infrastructure (NFDI), the LRZ has been evaluating data management techniques and methods for around ten years and has also been developing digital tools and platforms for this purpose.
For the TRR356 biology community and its roughly 20 working groups, VERDA, a computing and communication platform, was created in collaboration with the Tübingen Centre for Data Processing. It is based on GitLab, an open-source service that developers typically use to organise everything from planning and version control to testing and deploying software. Equipped with useful project management tools, GitLab is also well suited to research groups: they can collaboratively edit text and image files, organise tasks, and document workflows or datasets. The platform can be opened to additional project partners if needed and expanded with further tools for research and documentation.
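Much of this can also be automated through GitLab’s API. The following sketch uses the python-gitlab library to set up a project space for a working group; the server URL, token, and all names are invented, and VERDA’s actual configuration is not shown here:

```python
import gitlab  # pip install python-gitlab

# Hypothetical server and access token; any VERDA-like GitLab instance works.
gl = gitlab.Gitlab("https://verda.example.org", private_token="glpat-...")

# Create a group for a working group and a project for one experiment.
group = gl.groups.create({"name": "AG Root Microbiome",
                          "path": "ag-root-microbiome"})
project = gl.projects.create({"name": "field-trial-2024",
                              "namespace_id": group.id,
                              "description": "Measurement data, protocols, notes"})

# Track tasks as issues, just as a software team would.
project.issues.create({"title": "Document the sequencing workflow"})
```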
For rapid communication, the EKUT team integrated the freely available Matrix chat system. VERDA now also hosts other web-based services such as an electronic lab notebook as well as virtual storage and computing resources. The ARC Manager can be used to structure research results according to the Annotated Research Context (ARC) schema, which is widely used in biology. ARCpub then supports the online publication of preprints or datasets together with their computations in the repositories or electronic storage systems of the participating universities. “This tool developed by us automatically generates standardised metadata, that is, information that describes a dataset and is required for publication,” explains Wellmann. In addition, a Digital Object Identifier (DOI) is required, which is usually generated by the repositories. While the DOI is a unique and permanent identifier, the metadata provide general information about the content, such as the names of the authors, the time the research results were created, the file formats used, the technologies employed, and more. “This,” says Wellmann, “makes datasets and scholarly articles searchable and accessible online. In addition, content can be reproduced, or measurement, simulation, and other data can be reused in other projects.”
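What such a metadata record contains can be sketched in a few lines of Python. The values below are invented, and the fields are a simplified subset of common DataCite schema properties, not ARCpub’s actual output:

```python
import json

# Invented example values; field names follow common DataCite properties.
record = {
    "identifier": {"identifier": "10.xxxx/placeholder", "identifierType": "DOI"},
    "creators": [{"name": "Doe, Jane"}, {"name": "Roe, Richard"}],
    "titles": [{"title": "Root colonisation assay, field trial 2024"}],
    "publisher": "TRR356 PlantMicrobe",
    "publicationYear": 2024,
    "types": {"resourceTypeGeneral": "Dataset"},
    "formats": ["text/csv", "application/x-hdf5"],        # file formats used
    "dates": [{"date": "2024-03-12", "dateType": "Created"}],
}

print(json.dumps(record, indent=2))
```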
These FAIR principles for data stewardship, which call for data to be findable, accessible, interoperable, and reusable, increase efficiency and transparency in science. Structured data help reduce duplicated work or the repetition of costly experiments; they also make it possible to retrace and verify scientific contributions. This task is becoming increasingly important, as publication pressure in academic careers and the use of artificial intelligence (AI) are currently causing the number of scientific publications worldwide to rise sharply. However, this growth also includes a great deal of junk: plagiarism or fictitious studies produced by so-called paper mills and submitted to academic publishers. It is therefore no coincidence that not only the German Research Foundation (DFG) but also European funding bodies are increasingly demanding professional research data management in their projects.
“Because research data management is still young, there are no uniform rules yet, but rather many discipline-specific requirements and expectations,” says Krinninger, outlining a fundamental problem. The result is a wide variety of standards: the ARC format from biology differs from other metadata schemas, for example that of DataCite, an international consortium working to improve access to scientific data. In addition, the repositories of libraries and research institutions are based on different storage technologies and open-source programs, which also influences file formats and thus standards.
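The divergence is easy to see when the same dataset is described twice. Both records below are heavily simplified sketches with invented values: the first follows the ISA model (Investigation, Study, Assay) on which ARC builds, the second uses DataCite-style fields:

```python
# The same (invented) dataset in two simplified schema styles.

# ARC builds on the ISA model: Investigation > Study > Assay.
arc_style = {
    "investigation": "Coexistence of microbes and plants",
    "study": "Field trial 2024",
    "assay": "16S rRNA amplicon sequencing",
    "contacts": ["Doe, Jane"],
}

# DataCite describes the dataset as a flat publication record.
datacite_style = {
    "titles": [{"title": "Field trial 2024: 16S rRNA amplicon sequencing"}],
    "creators": [{"name": "Doe, Jane"}],
    "types": {"resourceTypeGeneral": "Dataset"},
}
```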
This diversity can be illustrated by the example of the Gauss Centre for Supercomputing, an association of Germany’s three national high-performance computing centres, including the LRZ. Within this network, the InHPC-DE project aims to harmonise the different infrastructures and HPC systems as well as their support services, and thus also the offerings for research data management. The goal is to allow academia and industry to choose between and switch among the centres’ services and resources more easily. Online publication of simulation data or computational results from the various data stores is also intended to become more convenient. If researchers consent, they should be able to make their results accessible to others for further processing.
“Operators of data centres and storage facilities each use their own metadata formats in their data repositories,” observe Krinninger and Wellmann. “So metadata have to be converted, ideally in an automated way.” The LRZ team therefore developed HOMER Fork from the open-source tool HOMER. The program systematically searches datasets and, for example, extracts information from log files generated during HPC computations in order to create metadata. The HOMER Converter in turn supports various metadata formats and converts them as required by the target repository. This also works with metadata that have been entered manually according to the DataCite schema.
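The idea of harvesting metadata from job logs can be illustrated with a short sketch. The log format and field names here are invented; this is not HOMER’s actual implementation:

```python
import re

# Invented, SLURM-like job log; real logs and parsers will differ.
LOG = """\
JobId=123456 JobName=rnaseq_align
UserId=biouser(4711) GroupId=trr356(815)
StartTime=2024-03-12T09:15:02 EndTime=2024-03-12T11:42:55
Partition=cm2_std NumNodes=4 NumCPUs=224
"""

PATTERNS = {
    "job_name":  r"JobName=(\S+)",
    "user":      r"UserId=(\w+)",
    "started":   r"StartTime=(\S+)",
    "finished":  r"EndTime=(\S+)",
    "num_nodes": r"NumNodes=(\d+)",
}

def extract_metadata(log_text: str) -> dict:
    """Pull descriptive fields out of a job log to seed a metadata record."""
    meta = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            meta[key] = match.group(1)
    return meta

print(extract_metadata(LOG))
# {'job_name': 'rnaseq_align', 'user': 'biouser', ...}
```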
Using the HOMER Converter, metadata in ARC format can be transformed into metadata compliant with the DataCite schema. TRR356 could thus also transfer its research data to the LRZ FAIR Data Portal and publish them there. The portal has existed for two years and primarily addresses the HPC community; it could become a permanent LRZ service if long-term funding is secured. “The portal is already available and can be used by people all over the world,” says Wellmann. The team is working to raise awareness of the portal among LRZ users as well as at conferences such as the EGU General Assembly and through presentations. This enables researchers to make large datasets, which are impractical to move, available for further use. Although these datasets remain archived in LRZ data stores, they can be accessed remotely and processed at other computing centres.
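A much-reduced sketch of what such a conversion involves, with invented field names on the ARC side and a subset of DataCite properties on the other (this is not the HOMER Converter’s actual code):

```python
def arc_to_datacite(arc: dict) -> dict:
    """Map a simplified ARC/ISA-style record onto DataCite-style fields."""
    return {
        "titles": [{"title": f"{arc['investigation']}: {arc['assay']}"}],
        "creators": [{"name": person} for person in arc.get("contacts", [])],
        "publisher": arc.get("publisher", "unknown"),
        "publicationYear": arc.get("year"),
        "types": {"resourceTypeGeneral": "Dataset"},
    }

arc_record = {
    "investigation": "Coexistence of microbes and plants",
    "assay": "16S rRNA amplicon sequencing",
    "contacts": ["Doe, Jane", "Roe, Richard"],
    "publisher": "TRR356 PlantMicrobe",
    "year": 2024,
}
print(arc_to_datacite(arc_record))
```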
Working with data, particularly under the FAIR principles and the expectation that data should be reusable, has become deeply ingrained in the LRZ team’s approach. As a result, open-source tools such as Ansible are preferred when building platforms or developing tools. With this automation toolkit, for example, the design of VERDA can be easily reproduced. “The platform is modular, built from open-source tools, and includes strategies and functions that we can also use for other projects,” says Krinninger. VERDA thus serves as a blueprint for further services and technical requirements. “Good data management,” adds his colleague Wellmann, “means that it becomes transparent to outsiders and that others can benefit from the experience and knowledge gained.” (vs | LRZ)