|
---
license: mit
pipeline_tag: image-feature-extraction
---
|
|
|
[![SDO-FM_Banner.png](https://cdn-uploads.huggingface.co/production/uploads/66aa4018951180b79f1c6574/k-6Zzqp78Ed_0tu8W2pQJ.png)](https://sdofm.org)

<h2 align="center">SDO-FM: A foundation model for the Sun</h2>
|
|
|
# 1. Introduction
|
SDO-FM is a foundation model using data from NASA's Solar Dynamics Observatory (SDO) spacecraft, integrating three separate instruments to encapsulate the Sun's complex physical interactions into a multi-modal embedding space. This model can be used to streamline scientific investigations involving SDO by making its enormous datasets more computationally accessible for heliophysics research, and to enable investigations that require instrument fusion.
|
|
|
The overall process for building SDO-FM is composed of four stages: (1) data preparation, (2) large foundation model (FM) training, (3) embedding extraction, and (4) fine-tuning or direct embedding usage for scientific validation cases. Collectively, we denote the data preparation as the effort completed under SDOML [4], a machine-learning dataset of SDO.
|
|
|
|
|
|
|
Our models are based upon autoencoders, trained with an image-reconstruction objective over the period from satellite launch in 2010 to 2023. Once these models are trained, a compressed representation dataset is created from the embeddings by a full pass over the encoder. These compressed representations, which we call direct embeddings, provide a readily usable set of SDO features at around two-thousandths (0.002×) the original size. Lastly, the direct embeddings, as well as standard model fine-tuning, are used to conduct scientific validation through a validation harness that checks our results against past ML-based heliophysics approaches and compares their computational expense.
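As a rough sketch, extracting direct embeddings amounts to a single frozen-encoder pass over the dataset. The `encoder` and `dataset` below are hypothetical stand-ins for a trained SDO-FM backbone encoder and an SDOML-style dataset, not the repository's actual API:

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def extract_direct_embeddings(encoder, dataset, batch_size=64, device="cuda"):
    """Full pass over a trained encoder, stacking latents into one array."""
    encoder.eval().to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    chunks = []
    for batch in loader:  # batch: (B, C, H, W) multi-channel SDO frames
        z = encoder(batch.to(device))
        chunks.append(z.flatten(start_dim=1).cpu())
    return torch.cat(chunks)  # (N, D): the "direct embeddings" dataset
```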
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66aa4018951180b79f1c6574/vB4pPe_x_x-9Zkdj7yWD1.png)
|
|
|
# 2. Quick Start

To run inference with any of the provided models, you can pull a Docker image from [Docker Hub](https://hub.docker.com/repository/docker/spaceml/sdo-fm/general) or follow the installation instructions below. To run any of the scientific tasks:

```bash
python scripts/main.py --config-name=embeddings_nvae_virtualeve
```
|
|
|
## 2.1 Installation

SDO-FM can be installed locally by installing the package from [this GitHub repository](https://github.com/spaceml-org/SDO-FM). Using the Docker image is advised; however, dependencies are listed in the usual [requirements.txt](https://github.com/spaceml-org/SDO-FM/blob/main/requirements.txt).

```bash
pip install -e .
```
|
|
|
## 2.2 Usage

To run any task, we assume execution inside a container built from the image described in the [Dockerfile](https://github.com/spaceml-org/SDO-FM/blob/main/Dockerfile), with Hydra configurations kept in the [experiments](https://github.com/spaceml-org/SDO-FM/tree/main/experiments) directory. The entry point is [main.py](scripts/main.py); its arguments select a configuration:

```bash
python scripts/main.py --config-name=default
```

CLI overrides are still possible with this selection, but be aware that some shells do not escape quotes or square brackets:

```bash
python scripts/main.py --config-name=default experiment.seed=37
```
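For example, wrapping an override in single quotes keeps most shells from interpreting the brackets (the `experiment.wavelengths` key here is purely illustrative):

```bash
python scripts/main.py --config-name=default 'experiment.wavelengths=[171, 193]'
```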
|
|
|
## 2.3 Pre-training

```bash
python scripts/main.py --config-name=pretrain_32.2M_samae_HP
```
|
|
|
## 2.4 Notebooks

A series of [notebooks](https://github.com/spaceml-org/SDO-FM/tree/main/notebooks) are available to explore each of the four downstream tasks described later in this document.
|
|
|
# 3. Method

SDO-FM is composed of a backbone, an optional neck, and a head. We define the backbone as the model initially trained on the reconstruction task, and the neck as the converter between the backbone and the head, with the head selected for the downstream application (or validation task). We implement two model families as backbones, one stemming from the Nouveau Variational Autoencoder (NVAE) [11], the other from a MAE [12]. Both are adapted to better accommodate our scientific dataset and to allow intermediate export of their latent spaces in the form of "embeddings." We additionally evaluated various feature-engineering options for managing the solar disk; the most effective included a simple lookup from Stonyhurst coordinates, a heliographic coordinate system for a fixed observer on Earth (suitable given the geosynchronous orbit), to pixel space.
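The transform itself is standard; a minimal sketch with `sunpy` (assuming an AIA map is already loaded, and noting that the actual pipeline would precompute such lookups as tables) might look like:

```python
import astropy.units as u
from astropy.coordinates import SkyCoord
from sunpy.coordinates import HeliographicStonyhurst
import sunpy.map

def stonyhurst_to_pixel(aia_map: sunpy.map.GenericMap, lon_deg, lat_deg):
    """Look up the pixel coordinates of an on-disk Stonyhurst (lon, lat) point."""
    point = SkyCoord(lon_deg * u.deg, lat_deg * u.deg,
                     frame=HeliographicStonyhurst(obstime=aia_map.date))
    return aia_map.world_to_pixel(point)  # (x, y) in pixel units
```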
|
|
|
|
|
## 3.1 Model choice

Model selection was initially determined by the ability to capture solar phenomena, guided by applicability to SDO imagery and the ease of access to the embeddings in the latent space. The autoencoder architecture was selected for the backbone for ease of embedding construction and extraction: by design, autoencoders create a lower-dimensional representation during the encoding process. Other requirements included engineering efficiencies, such as the ability to mask the solar limb for on-disk experiments and to cheaply bias training via importance sampling towards areas of interest (e.g. active regions).
|
|
|
### 3.1.1 Solar-aware Masked Autoencoder

Masked Autoencoders (MAEs) learn to reconstruct images from which random components have been removed [13]. The approach follows the standard ViT patchification common to transformer computer-vision models, decomposing the image into patches between which the attention mechanism can learn relationships. The source of this "powerful expressivity" is attributed to a "rich hidden representation" [14]. This is of particular interest in our scenario, as we seek to learn which components of solar imagery are of value for our scientific validation cases. This model has previously been extended to handle temporal information for remote-sensing tasks [15]. We have continued to iterate, adding "solar-awareness": the ability to process the nine wavelengths of interest to us from the Atmospheric Imaging Assembly, efficiencies for processing the solar disk, and the ability to optionally bias the model towards learning active regions of scientific interest.
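A minimal sketch of that biasing idea: instead of masking patches uniformly at random as in a vanilla MAE, the kept patches are drawn with probability proportional to a per-patch importance weight. The `importance` vector here is a hypothetical input, e.g. mean EUV intensity or unsigned magnetic flux per patch:

```python
import torch

def biased_patch_mask(importance, mask_ratio=0.75):
    """Sample which ViT patches to keep, biased toward high-importance regions.

    importance: (num_patches,) non-negative weight per patch.
    Returns indices of kept patches and a boolean mask (True = masked out).
    """
    num_keep = int(importance.numel() * (1 - mask_ratio))
    probs = importance / importance.sum()
    keep_idx = torch.multinomial(probs, num_keep, replacement=False)
    mask = torch.ones_like(importance, dtype=torch.bool)
    mask[keep_idx] = False
    return keep_idx, mask
```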
|
|
|
### 3.1.2 Nouveau-VAE

The Nouveau Variational Autoencoder (NVAE) is a deep hierarchical VAE created for image generation. Like the MAE, it is able to create a rich latent space, using depth-wise separable convolutions and batch normalization. The NVIDIA team's codebase was modified to expose the hierarchical structure so that embeddings can be extracted.
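One generic way to expose such latents without rewriting the model, sketched here with plain PyTorch forward hooks (the choice of which NVAE encoder blocks to tap is an assumption, not the codebase's actual mechanism):

```python
import torch

def capture_hierarchical_latents(model, blocks, x):
    """Collect the outputs of selected encoder blocks during one forward pass."""
    latents, handles = [], []
    for block in blocks:
        handles.append(block.register_forward_hook(
            lambda _mod, _inp, out: latents.append(out.detach().cpu())))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return latents  # one tensor per level of the hierarchy
```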
|
|
|
# 4. Scientific Validation Cases

**Predict F10.7** This index is a proxy for solar irradiance that can be measured from the ground, as this frequency is not absorbed by the atmosphere. Can we achieve good agreement with ground measurements? There is limited scientific value in predicting a proxy measure such as F10.7; however, this simple task clearly indicates learned capacity in a single result.
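Since F10.7 is a single scalar per timestamp, a linear probe on the direct embeddings is enough to test for learned capacity; a minimal sketch (array names hypothetical):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def f107_probe(z_train, f107_train, z_test, f107_test):
    """Fit a ridge regression from embeddings to F10.7 and score agreement."""
    probe = Ridge(alpha=1.0).fit(z_train, f107_train)
    return r2_score(f107_test, probe.predict(z_test))
```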
|
|
|
**Virtual EVE** In 2014, an instrument malfunction resulted in the loss of the MEGS-A module of SDO/EVE. With four years of overlapping data, [16, 17] used a hybrid CNN/linear-regression model to demonstrate the capability of machine-learning methods to estimate the missing EUV irradiance measurements from MEGS-A (and the degraded MEGS-B components of the EVE instrument). This validation task employs the embeddings constructed from AIA to understand the contributions of solar features to the EUV spectra; a mapping between instruments exists because the narrow-band images (SDO/AIA) and the sun-as-a-star spectra (SDO/EVE) observe the same plasma distribution. A linear model accounts for a large portion of the relationship, while a CNN is used to correct for outlier events such as solar flares. There are known concerns regarding the model's performance post-2020, as AIA instrument performance deviates further from the 2014 baseline. Some of these issues can be addressed by incorporating other sources of irradiance, such as data from sounding rockets, for training over longer periods, although these are sparse. Importantly, this approach outperforms a physics-based inversion approach [18].
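A sketch of the hybrid idea, transplanted onto embeddings: a linear head carries the bulk of the mapping while a small nonlinear residual corrects outliers. Note that [16, 17] used a CNN on AIA images; the MLP residual and all sizes below are illustrative assumptions:

```python
import torch.nn as nn

class HybridIrradianceHead(nn.Module):
    """Linear map from embeddings to EUV line irradiances, plus a residual."""
    def __init__(self, embed_dim=1024, n_lines=38):
        super().__init__()
        self.linear = nn.Linear(embed_dim, n_lines)  # bulk of the relationship
        self.residual = nn.Sequential(               # corrects e.g. flare outliers
            nn.Linear(embed_dim, 256), nn.GELU(), nn.Linear(256, n_lines))

    def forward(self, z):
        return self.linear(z) + self.residual(z)
```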
|
|
|
**Missing Channel Reconstruction** The reconstruction of missing extreme ultraviolet (EUV) images from other wavelength images is a crucial task, given the often low or unusable quality of image data frames from the Solar Dynamics Observatory (SDO). Currently, there is no effective method to recover these missing steps. However, the foundation model is capable of reconstructing individual frames by leveraging contextual information available in the other wavelength channels. This approach allows for interpolation to provide a best-guess estimate of the missing data at any arbitrary time step.
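At inference time the idea reduces to masking the bad channel and reading back the model's reconstruction; a hypothetical sketch assuming a backbone that maps a 9-channel AIA stack to its reconstruction:

```python
import torch

@torch.no_grad()
def reconstruct_missing_channel(backbone, frames, missing_idx):
    """Fill one unusable AIA channel from the context in the others.

    frames: (B, 9, H, W) AIA stack; missing_idx: channel to recover.
    """
    x = frames.clone()
    x[:, missing_idx] = 0.0          # mark the channel as missing/masked
    recon = backbone(x)              # reconstruction of the full stack
    return recon[:, missing_idx]     # best-guess estimate of the gap
```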
|
As with the Virtual EVE project and differential emission measure analysis [18], the overlapping temperature ranges covered by different SDO/AIA wavelength channels allow the temperature distribution of the underlying plasma to be reconstructed, and may enable the inference of properties of different temperature ranges.

This overlap can be used within a machine learning model to produce an estimate that replaces data that is missing, corrupted, or otherwise unusable. Our objective is to develop a more robust model that operates with higher computational efficiency while producing results comparable to the current SOTA. Special attention is given to the model's ability to capture non-linear relationships and rare events, such as intensity values in flaring regions.
|
There are several uncertainties inherent in this process. Some channels may be more readily recreated than others, on the physical assumption that channels in the middle of the temperature/wavelength range have the most overlap with other channels and thus potentially yield better results. However, this overlap might not always correspond to the actual missing data in the SDO record. Addressing these uncertainties requires an understanding of the shortfalls, to determine the appropriateness of this reconstruction technique in different scenarios.
|
|
|
**Autocalibration** The SDO/AIA EUV channels exhibit degradation due to exposure to the same emissions they are intended to measure. This degradation results in apparent dimming over time across multiple EUV channels with unique characteristics. This poses challenges for long-term studies, as degradation trends within the dataset need to be corrected. Until 2014, SDO utilized EVE to correct this degradation. As discussed, a malfunction of SDO/EVE resulted in the loss of the MEGS-A component, and calibration is currently performed by sounding rocket flights. In response to this, [19] used a CNN to reconstruct the Atmospheric Imaging Assembly (AIA) multi-channel degradation curves.
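Applying such a correction is cheap once the degradation curves are learned; a sketch under the assumption that a small model emits one multiplicative dimming factor per channel (names and shapes illustrative):

```python
import torch

@torch.no_grad()
def undim(degradation_model, frames):
    """Undo per-channel dimming predicted from the images themselves.

    frames: (B, C, H, W); degradation_model returns (B, C) factors in (0, 1].
    """
    factors = degradation_model(frames)
    return frames / factors[..., None, None]  # broadcast over H and W
```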
|
|
|
Data requirements for this study include the SDOML data from AIA as well as older correction tables. The sampling requirement is minimal, with data being required once per day or even less frequently. Traditional SOTA methods, such as those performed by the Lockheed Martin Solar and Astrophysics Laboratory (LMSAL), involve calibration using sounding rocket flights. These methods, while accurate, are expensive and technically demanding. Our goal is to reproduce the results from [19] with greater efficiency in terms of data required and computational resources. This efficiency is evaluated through an examination of the resultant images compared to those produced by SOTA calibration pipelines, alongside intensity histograms, data spike analysis, and other metrics.
|
|
|
# 5. Results

Overall, our model families were evaluated on their backbone reconstruction task and against our four scientific validation cases. In all but the autocalibration task, they reached the same level of accuracy as their classical counterparts or surpassed it, in a fraction of the required time. In the autocalibration case, the direct embedding approach was able to match the accuracy of its classical counterpart but took additional training time.
|
|
|
## 5.1 Reconstruction

Loss for the reconstruction task is measured by pixel RMSE within the solar disk. SAMAE results presented in fig. 5 indicate a clear ability to reconstruct most wavelengths with a small embedding dimension (128) and within a small number of training epochs (10). Interestingly, this model struggles to reconstruct 131 & 171 Å, which is likely due to a normalization error we're still investigating. The Nouveau-VAE model on raw pixel intensity performs better, even when including the solar limb.
|
|
|
## 5.2 Direct Embeddings

Training each scientific validation case on the embeddings directly generally led to much faster training times while matching or surpassing accuracy. An effort was made to evaluate the embeddings outside of the scientific cases via embedding-to-embedding comparison. The common t-SNE approach was applied to a small one-year sample, and the projection showed separation by solar activity. This approach is still fairly opaque, however, so the validation approaches are considered more appropriate.
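The qualitative check itself is a few lines; a sketch assuming an `(N, D)` embedding array and any per-sample activity proxy (e.g. F10.7) for colouring:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_tsne(embeddings, activity):
    """Project embeddings to 2-D with t-SNE and colour by solar activity."""
    xy = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
    plt.scatter(xy[:, 0], xy[:, 1], c=activity, s=4, cmap="viridis")
    plt.colorbar(label="solar activity proxy")
    plt.show()
```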
|
|
|
# Acknowledgements

This work is the research product of SDO-FM: A Multi-Modal Foundation Model POC for SDO, funded and supported by NASA under **Grant award No 80NSSC24K0701**. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Aeronautics and Space Administration (NASA). The research and its outputs have been designed, managed and delivered by Trillium Technologies Inc (https://trillium.tech). Trillium is a research and development company with a focus on intelligent systems and collaborative communities for planetary stewardship, space exploration and human health. Trillium aspires to ensure that the latest tools and techniques in Artificial Intelligence (AI) and Machine Learning (ML) are applied to developing open science for all Humankind.
|
|
|
**Authors**
|
|
|
James Walsh, University of Cambridge

Daniel Gass, University of Central Lancashire

Raul Ramos Pollan, Universidad Industrial de Santander

Richard Galvez, Pure Storage

Paul Wright, Dublin Institute for Advanced Studies

Atılım Güneş Baydin, University of Oxford

Noah Kasmanoff, AE Studio

Jason Naradowsky, University of Tokyo
|
|
|
PI: Anne Spalding, Trillium Technologies Inc

Co-I: James Parr, Trillium Technologies Inc
|
|
|
|