A new distributed data analysis framework for better scientific collaborations


Philipp S. Sommer4*, Viktoria Wichert4, Daniel Eggert3, Tilman Dinter1, Klaus Getzlaff2, Andreas Lehmann2, Christian Werner6, Brenner Silva1, Lennart Schmidt5, Angela Schäfer1
  * Presenting author.
  1. Alfred-Wegener-Institut Helmholtz-Zentrum für Polar- und Meeresforschung (AWI), Germany.
  2. GEOMAR Helmholtz Centre for Ocean Research Kiel, Germany.
  3. German Research Centre for Geosciences (GFZ), Germany.
  4. Helmholtz Zentrum Geesthacht (HZG), Institute for Coastal Research, Germany.
  5. Helmholtz-Zentrum für Umweltforschung GmbH - UFZ, Germany.
  6. Karlsruhe Institute of Technology, Institute of Meteorology and Climate Research - Atmospheric Environmental Research (IMK-IFU), Germany.
Date: 2021-03-03, 10:30–10:45 UTC
Event page: https://sorse.github.io//programme/talks/event-048/
Contact: SORSE.enquiries@gmail.com

A common challenge for projects involving multiple research institutes is a well-defined and productive collaboration. All parties measure and analyze different aspects, depend on each other, share common methods, and exchange the latest results, findings, and data. Today this exchange is often impeded by a lack of ready access to shared computing and storage resources. In this talk, we present a new remote procedure call (RPC) framework. We focus on a distributed setup, where project partners do not necessarily work at the same institute and do not have access to each other's resources.

We present an application programming interface (API), developed in Python, that enables scientists to collaboratively explore and analyze sets of distributed data. It offers the functionality to request remote data through a convenient interface, and to share and invoke single computational methods or even entire analytical workflows and their results. The prototype enables researchers to make their methods accessible as a backend module running on their own infrastructure, so that researchers from other institutes can apply the available methods through a lightweight Python or JavaScript API. In the end, the overhead for both the backend developer and the remote user is very low: implementing the necessary workflow and using the API takes about the same effort as writing code in a non-distributed setup. Moreover, data do not have to be downloaded locally; the analysis can be executed "close to the data", using the institutional infrastructure where the eligible data set is stored.

With our prototype, we demonstrate distributed data access and analysis workflows across institutional borders to enable effective scientific collaboration. This framework has been developed in a joint effort of the DataHub and Digital Earth initiatives within the Helmholtz Association of German Research Centres (HGF).
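The abstract does not show the framework's actual interface, but the pattern it describes — a backend developer registering a computational method on their own infrastructure, and a remote user invoking it by name through a lightweight API — can be sketched in plain Python. All names below (`registry`, `expose`, `dispatch`) are hypothetical illustrations, not the framework's real API:

```python
import json

# Hypothetical registry mapping method names to callables. In the setup
# described above, such methods would run as a backend module on the
# data provider's own infrastructure, "close to the data".
registry = {}

def expose(func):
    """Register a function so remote callers can invoke it by name."""
    registry[func.__name__] = func
    return func

@expose
def mean_temperature(values):
    """Example analysis method executed where the data set is stored."""
    return sum(values) / len(values)

def dispatch(message: str) -> str:
    """Decode a JSON-encoded RPC request, run the method, encode the result."""
    request = json.loads(message)
    func = registry[request["method"]]
    result = func(*request.get("args", []))
    return json.dumps({"result": result})

# A remote client would send a serialized request like this over the wire;
# only the small result travels back, never the full data set:
reply = dispatch(json.dumps({"method": "mean_temperature",
                             "args": [[2.0, 4.0, 6.0]]}))
print(json.loads(reply)["result"])  # 4.0
```

The point of the sketch is the division of labor: the backend author only decorates an ordinary function, and the remote caller only names it in a message, which is roughly why the overhead on both sides stays close to that of non-distributed code.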

Language: English
Prerequisites: A bit of background in Python would be helpful, but not mandatory.

