Rationale

The astronomical landscape is undergoing a seismic shift. The advent of current large surveys such as Kepler, Gaia and LAMOST, and of those planned for the near future, e.g., the Vera Rubin Observatory and the Square Kilometre Array (SKA), has led to explosive growth in the volume of data, so-called Big Data, that we need to collect, store and analyse. At the same time, ever more powerful high-performance computing facilities and improved algorithms have paved the way for increasingly high-resolution, complex and detailed simulations, e.g., of cosmological large-scale structure formation, magnetohydrodynamic turbulence in the interstellar medium, or cosmic relic neutrinos, with data cubes that can no longer be downloaded and stored on single computers. Although less mature than their observational counterparts, simulations are now firmly on the fast track to producing Big Data.

The parallel growth of Big Data in simulations and observations, however, is only one part of the changing paradigm. Traditionally, observations and simulations operated largely independently of one another (albeit with cross-pollination of, for example, input initial conditions, or the extraction of synthetic observables for forward modelling to `compare notes’). Nowadays, simulations themselves have become a critical part of the observational process: for example, 100~terabyte-scale simulation libraries were created to interpret the black hole images reconstructed by the Event Horizon Telescope (EHT) project.

Just as the convergence of experiment/observation and theory in the pre-computational era catapulted forward our understanding of modern physics, the convergence of data science and simulations in the exascale computing era will undoubtedly bring new advances and perspectives in astronomy.

In this Focus Meeting (FM) we aim to explore these new horizons at the interface between computational astrophysics and Big Data. We have identified four keystone areas, and we will discuss the challenges and opportunities that they bring for science, the research community and society more broadly.

– Synergies between simulations and observations

The first is the immense potential for synergies between the two fields. Using simulations to interpret Big Data, as was done for the EHT results, is a prime example. State-of-the-art simulations and observations across a variety of science topics, e.g., 21cm cosmology, imaging black holes, gravitational wave detections, solar and heliospheric physics, star formation and protoplanetary disks, can be used both to interpret current data and in forward modelling for future missions, e.g., the SKA. The potential for new discoveries is great; however, challenging questions remain as to how, in this new space, we can avoid introducing biases that would, for example, preclude us from discovering the `unknown unknowns’. Furthermore, simulations are often used to determine the uncertainties in observational data, but that approach becomes less certain when the simulations themselves are interwoven with the observations. A question that each community faces separately may thus become compounded at the interface: how do we quantify the uncertainties in the values we measure or predict?
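As a purely illustrative sketch of the forward-modelling and uncertainty question (all quantities, priors and the `toy_simulation' function below are hypothetical stand-ins, not any real pipeline), one could propagate parameter uncertainties through a mock simulation by Monte Carlo sampling and report the spread of the predicted observable:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)

def toy_simulation(amplitude, slope, freq):
    # Stand-in for an expensive simulation: returns a synthetic observable.
    return amplitude * (freq / 100.0) ** slope

# Toy priors on the model parameters.
amplitudes = rng.normal(1.0, 0.1, size=10000)
slopes = rng.normal(-2.7, 0.05, size=10000)
freq = 150.0  # a single toy frequency channel (MHz)

# Push every prior draw through the mock simulation and summarise the spread.
predictions = toy_simulation(amplitudes, slopes, freq)
print(f"predicted observable: {predictions.mean():.3f} +/- {predictions.std():.3f}")
\end{verbatim}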

– Tomorrow’s computational paradigm: programming and infrastructure

The infrastructure, both hardware and software, needed to bring these two fields together also represents a challenge, and at the same time a tremendous opportunity for innovation. The requirements of Big Data and of simulations are not necessarily aligned in terms of memory, compute speed, etc., and bringing them together will require a shift in platforms and architecture. We will highlight the great strides being made in storing, processing and analysing data using containers and Kubernetes on research clouds (e.g., with MeerKAT and ilifu, the African data-intensive research cloud) and in data centres. Hardware accelerators such as GPUs and FPGAs are already in increasingly widespread use, but much work remains to be done to port existing codes and pipelines. While pure Python is not efficient for simulations, the more traditional languages C and Fortran are not ideal for working with Big Data either. As we shift to exascale computing, what programming languages will we develop in?
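One possible hybrid pattern, sketched below under the assumption that a CUDA-capable GPU and the CuPy library are available, keeps the high-level orchestration in Python while offloading the heavy numerics (here, a Fourier-space analysis of a toy density cube) to the accelerator:

\begin{verbatim}
import numpy as np
import cupy as cp

# Toy density cube standing in for simulation output.
rho = np.random.standard_normal((256, 256, 256)).astype(np.float32)

rho_gpu = cp.asarray(rho)          # host -> device transfer
delta_k = cp.fft.rfftn(rho_gpu)    # 3D FFT executed on the GPU
power = cp.abs(delta_k) ** 2       # power per Fourier mode
mean_power = float(power.mean())   # reduce on device, copy scalar back to host

print(f"mean power per mode: {mean_power:.3e}")
\end{verbatim}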

– Data-driven astronomy and Artificial Intelligence (AI)

Along with the necessary computational infrastructure, including data processing centres and research clouds, Machine Learning (ML) and Artificial Intelligence (AI) are expected to play a prominent role in the near future, and their usefulness for astronomical research is growing rapidly (e.g., the ALeRCE broker for transient classification). These tools provide a means of efficiently processing large data sets and are already used extensively, both for classification in observations and for speeding up algorithms in simulations (e.g., in radiation transport problems). To prepare for the growing volume and complexity of the data, we will have to develop new techniques for data mining, for disseminating information, and for keeping the science tractable.
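As a minimal, entirely synthetic illustration of this kind of supervised classification (not the ALeRCE pipeline itself; the features and labels below are invented for the example), a random forest can be trained on toy light-curve features to separate two transient classes:

\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy feature table (amplitude, rise time, colour) and toy labels;
# real brokers use many more light-curve features.
X = rng.standard_normal((5000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 0 = variable star, 1 = supernova (toy)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
\end{verbatim}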

– Building sustainable software, cyberinfrastructure, and communities

The challenges above also bring opportunities for innovation and, importantly, for the growth of the computational astrophysics community. We will highlight the work being done to build sustainable software, cyberinfrastructure and communities, from developing standards and good practices for recycling and reusing code, to providing public data and online servers and platforms with resources (e.g., Google Colab, IDIA Hack4Dev, the IAU CB1 webpage, the AfAS science portal). Repositories and portals have increasingly become the norm; they can be used to host online training (e.g., MOOCs and Software/Data Carpentry) and workshops/hackathons (e.g., DARA Big Data). Acknowledging the considerable carbon footprint of doing astronomy, we will also explore ways to minimise it, ensuring true sustainability.

Through panel discussions, the FM will draw up a roadmap outlining the best ways to support this objective, building on the work already happening in many developing countries, on existing and planned access to computational infrastructure, and on training and the sharing of information and resources.