|
---
license: mit
pipeline_tag: image-feature-extraction
---
|
|
|
[![SDO-FM_Banner.png](https://cdn-uploads.huggingface.co/production/uploads/66aa4018951180b79f1c6574/k-6Zzqp78Ed_0tu8W2pQJ.png)](https://sdofm.org)

<h2 align="center">SDO-FM: A foundation model for the Sun</h2>
|
|
|
# 1. Introduction
|
SDO-FM is a foundation model using data from NASA's Solar Dynamics Observatory (SDO) spacecraft, integrating three separate instruments to encapsulate the Sun's complex physical interactions into a multi-modal embedding space. This model can be used to streamline scientific investigations involving SDO by making its enormous datasets more computationally accessible for heliophysics research, and to enable investigations that require instrument fusion.
|
|
|
The overall process for building SDO-FM is composed of four stages: (1) data preparation, (2) large foundation model (FM) training, (3) embedding extraction, and (4) fine-tuning or direct embedding usage for scientific validation cases. Collectively, we denote the data preparation as the effort completed under SDOML [4], a machine-learning dataset of SDO.
|
|
|
|
|
|
|
Our models are based upon autoencoders, trained with an image-reconstruction objective over the period from satellite launch in 2010 to 2023. Once these models are trained, a compressed representation dataset is created from the embeddings by a full pass over the encoder. These compressed representations, which we call direct embeddings, provide a readily usable set of SDO features at around two-thousandths (0.002×) the original size. Lastly, the direct embeddings, as well as standard model fine-tuning, are used to conduct scientific validation through a validation harness that checks our results against past ML-based heliophysics approaches and compares their computational expense.
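As a rough sketch, extracting direct embeddings amounts to a single frozen-encoder pass over the dataset. The `encoder` and `dataset` below are hypothetical stand-ins for a trained SDO-FM backbone encoder and an SDOML-style dataset, not the repository's actual API:

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def extract_direct_embeddings(encoder, dataset, batch_size=64, device="cuda"):
    """Full pass over a trained encoder, stacking latents into one array."""
    encoder.eval().to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    chunks = []
    for batch in loader:  # batch: (B, C, H, W) multi-channel SDO frames
        z = encoder(batch.to(device))
        chunks.append(z.flatten(start_dim=1).cpu())
    return torch.cat(chunks)  # (N, D): the "direct embeddings" dataset
```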
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66aa4018951180b79f1c6574/vB4pPe_x_x-9Zkdj7yWD1.png)
|
|
|
# 2. Quick Start

To run inference with any of the provided models, you can pull a Docker image from [Docker Hub](https://hub.docker.com/repository/docker/spaceml/sdo-fm/general) or follow the installation instructions below. To run any of the scientific tasks:

```bash
python scripts/main.py --config-name=embeddings_nvae_virtualeve
```
|
|
|
## 2.1 Installation

SDO-FM can be installed locally by installing the package from [this GitHub repository](https://github.com/spaceml-org/SDO-FM). Using the Docker image is advised; however, dependencies are listed in the usual [requirements.txt](https://github.com/spaceml-org/SDO-FM/blob/main/requirements.txt).

```bash
pip install -e .
```
|
|
|
## 2.2 Usage

To run any task, we assume execution inside a container built from the image described in the [Dockerfile](https://github.com/spaceml-org/SDO-FM/blob/main/Dockerfile), with Hydra configurations kept in the [experiments](https://github.com/spaceml-org/SDO-FM/tree/main/experiments) directory. The entry point is [main.py](scripts/main.py); its arguments select a configuration:

```bash
python scripts/main.py --config-name=default
```

CLI overrides are still possible with this selection, but be aware that some shells do not escape quotes or square brackets:

```bash
python scripts/main.py --config-name=default experiment.seed=37
```
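For example, wrapping an override in single quotes keeps most shells from interpreting the brackets (the `experiment.wavelengths` key here is purely illustrative):

```bash
python scripts/main.py --config-name=default 'experiment.wavelengths=[171, 193]'
```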
|
|
|
## 2.3 Pre-training

```bash
python scripts/main.py --config-name=pretrain_32.2M_samae_HP
```
|
|
|
## 2.4 Notebooks

A series of [notebooks](https://github.com/spaceml-org/SDO-FM/tree/main/notebooks) are available to explore each of the four downstream tasks described later in this document.
|
|
|
# 3. Method

SDO-FM is composed of a backbone, an optional neck, and a head. We define the backbone as the model initially trained on the reconstruction task, and the neck as the converter between the backbone and the head, with the head selected for the downstream application (or validation task). We implement two model families as backbones, one stemming from the Nouveau Variational Autoencoder (NVAE) [11], the other from a MAE [12]. Both are adapted to better accommodate our scientific dataset and to allow intermediate export of their latent spaces in the form of "embeddings." We additionally evaluated various feature-engineering options for managing the solar disk; the most effective included a simple lookup from Stonyhurst coordinates, a heliographic coordinate system for a fixed observer on Earth (suitable given the geosynchronous orbit), to pixel space.
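The transform itself is standard; a minimal sketch with `sunpy` (assuming an AIA map is already loaded, and noting that the actual pipeline would precompute such lookups as tables) might look like:

```python
import astropy.units as u
from astropy.coordinates import SkyCoord
from sunpy.coordinates import HeliographicStonyhurst
import sunpy.map

def stonyhurst_to_pixel(aia_map: sunpy.map.GenericMap, lon_deg, lat_deg):
    """Look up the pixel coordinates of an on-disk Stonyhurst (lon, lat) point."""
    point = SkyCoord(lon_deg * u.deg, lat_deg * u.deg,
                     frame=HeliographicStonyhurst(obstime=aia_map.date))
    return aia_map.world_to_pixel(point)  # (x, y) in pixel units
```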
|
|
|
|
|
## 3.1 Model choice

Model selection was initially determined by the ability to capture solar phenomena, guided by applicability to SDO imagery and the ease of access to the embeddings in the latent space. The autoencoder architecture was selected for the backbone for ease of embedding construction and extraction: by design, autoencoders create a lower-dimensional representation during the encoding process. Other requirements included engineering efficiencies, such as the ability to mask the solar limb for on-disk experiments and to cheaply bias training via importance sampling towards areas of interest (e.g. active regions).
|
|
|
### 3.1.1 Solar-aware Masked Autoencoder

Masked Autoencoders (MAEs) learn to reconstruct images from which random components have been removed [13]. The approach follows the standard ViT patchification common to transformer computer-vision models, decomposing the image into patches between which the attention mechanism can learn relationships. The source of this "powerful expressivity" is attributed to a "rich hidden representation" [14]. This is of particular interest in our scenario, as we seek to learn which components of solar imagery are of value for our scientific validation cases. This model has previously been extended to handle temporal information for remote-sensing tasks [15]. We have continued to iterate, adding "solar-awareness": the ability to process the nine wavelengths of interest to us from the Atmospheric Imaging Assembly, efficiencies for processing the solar disk, and the ability to optionally bias the model towards learning active regions of scientific interest.
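A minimal sketch of that biasing idea: instead of masking patches uniformly at random as in a vanilla MAE, the kept patches are drawn with probability proportional to a per-patch importance weight. The `importance` vector here is a hypothetical input, e.g. mean EUV intensity or unsigned magnetic flux per patch:

```python
import torch

def biased_patch_mask(importance, mask_ratio=0.75):
    """Sample which ViT patches to keep, biased toward high-importance regions.

    importance: (num_patches,) non-negative weight per patch.
    Returns indices of kept patches and a boolean mask (True = masked out).
    """
    num_keep = int(importance.numel() * (1 - mask_ratio))
    probs = importance / importance.sum()
    keep_idx = torch.multinomial(probs, num_keep, replacement=False)
    mask = torch.ones_like(importance, dtype=torch.bool)
    mask[keep_idx] = False
    return keep_idx, mask
```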
|
|
|
### 3.1.2 Nouveau-VAE

The Nouveau Variational Autoencoder (NVAE) is a deep hierarchical VAE created for image generation. Like the MAE, it is able to create a rich latent space, using depth-wise separable convolutions and batch normalization. The NVIDIA team's codebase was modified to expose the hierarchical structure so that embeddings can be extracted.
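One generic way to expose such latents without rewriting the model, sketched here with plain PyTorch forward hooks (the choice of which NVAE encoder blocks to tap is an assumption, not the codebase's actual mechanism):

```python
import torch

def capture_hierarchical_latents(model, blocks, x):
    """Collect the outputs of selected encoder blocks during one forward pass."""
    latents, handles = [], []
    for block in blocks:
        handles.append(block.register_forward_hook(
            lambda _mod, _inp, out: latents.append(out.detach().cpu())))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return latents  # one tensor per level of the hierarchy
```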
|
|
|
# 4. Scientific Validation Cases

**Predict F10.7** This index is a proxy for solar irradiance that can be measured from the ground, as this frequency is not absorbed by the atmosphere. Can we achieve good agreement with ground measurements? There is limited scientific value in predicting a proxy measure such as F10.7; however, this simple task clearly indicates learned capacity in a single result.
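Since F10.7 is a single scalar per timestamp, a linear probe on the direct embeddings is enough to test for learned capacity; a minimal sketch (array names hypothetical):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def f107_probe(z_train, f107_train, z_test, f107_test):
    """Fit a ridge regression from embeddings to F10.7 and score agreement."""
    probe = Ridge(alpha=1.0).fit(z_train, f107_train)
    return r2_score(f107_test, probe.predict(z_test))
```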
|
|
|
**Virtual EVE** In 2014, an instrument malfunction resulted in the loss of the MEGS-A module of SDO/EVE. With four years of overlapping data, [16, 17] used a hybrid CNN/linear-regression model to demonstrate the capability of machine-learning methods to estimate the missing EUV irradiance measurements from MEGS-A (and the degraded MEGS-B components of the EVE instrument). This validation task employs the embeddings constructed from AIA to understand the contributions of solar features to the EUV spectra; a mapping between instruments exists because the narrow-band images (SDO/AIA) and the sun-as-a-star spectra (SDO/EVE) observe the same plasma distribution. A linear model accounts for a large portion of the relationship, while a CNN is used to correct for outlier events such as solar flares. There are known concerns regarding the model's performance post-2020, as AIA instrument performance deviates further from the 2014 baseline. Some of these issues can be addressed by incorporating other sources of irradiance, such as data from sounding rockets, for training over longer periods, although these are sparse. Importantly, this approach outperforms a physics-based inversion approach [18].
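A sketch of the hybrid idea, transplanted onto embeddings: a linear head carries the bulk of the mapping while a small nonlinear residual corrects outliers. Note that [16, 17] used a CNN on AIA images; the MLP residual and all sizes below are illustrative assumptions:

```python
import torch.nn as nn

class HybridIrradianceHead(nn.Module):
    """Linear map from embeddings to EUV line irradiances, plus a residual."""
    def __init__(self, embed_dim=1024, n_lines=38):
        super().__init__()
        self.linear = nn.Linear(embed_dim, n_lines)  # bulk of the relationship
        self.residual = nn.Sequential(               # corrects e.g. flare outliers
            nn.Linear(embed_dim, 256), nn.GELU(), nn.Linear(256, n_lines))

    def forward(self, z):
        return self.linear(z) + self.residual(z)
```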
|
|
|
**Missing Channel Reconstruction** The reconstruction of missing extreme ultraviolet (EUV) images from other wavelength images is a crucial task, given the often low or unusable quality of image data frames from the Solar Dynamics Observatory (SDO). Currently, there is no effective method to recover these missing steps. However, the foundation model is capable of reconstructing individual frames by leveraging contextual information available in the other wavelength channels. This approach allows for interpolation to provide a best-guess estimate of the missing data at any arbitrary time step.
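At inference time the idea reduces to masking the bad channel and reading back the model's reconstruction; a hypothetical sketch assuming a backbone that maps a 9-channel AIA stack to its reconstruction:

```python
import torch

@torch.no_grad()
def reconstruct_missing_channel(backbone, frames, missing_idx):
    """Fill one unusable AIA channel from the context in the others.

    frames: (B, 9, H, W) AIA stack; missing_idx: channel to recover.
    """
    x = frames.clone()
    x[:, missing_idx] = 0.0          # mark the channel as missing/masked
    recon = backbone(x)              # reconstruction of the full stack
    return recon[:, missing_idx]     # best-guess estimate of the gap
```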
|
As with the Virtual EVE project and differential emission measure analysis [18], the overlapping temperature ranges covered by different SDO/AIA wavelength channels allow the temperature distribution of the underlying plasma to be reconstructed, and may enable the inference of properties of different temperature ranges.

This overlap can be used within a machine learning model to produce an estimate that replaces data that is missing, corrupted, or otherwise unusable. Our objective is to develop a more robust model that operates with higher computational efficiency while producing results comparable to the current SOTA. Special attention is given to the model's ability to capture non-linear relationships and rare events, such as intensity values in flaring regions.
|
There are several uncertainties inherent in this process. Some channels may be more readily recreated than others, on the physical assumption that channels in the middle of the temperature/wavelength range have the most overlap with other channels and thus potentially yield better results. However, this overlap might not always correspond to the actual missing data in the SDO record. Addressing these uncertainties requires an understanding of the shortfalls, to determine the appropriateness of this reconstruction technique in different scenarios.
|
|
|
**Autocalibration** The SDO/AIA EUV channels exhibit degradation due to exposure to the same emissions they are intended to measure. This degradation results in apparent dimming over time across multiple EUV channels with unique characteristics. This poses challenges for long-term studies, as degradation trends within the dataset need to be corrected. Until 2014, SDO utilized EVE to correct this degradation. As discussed, a malfunction of SDO/EVE resulted in the loss of the MEGS-A component, and calibration is currently performed by sounding rocket flights. In response to this, [19] used a CNN to reconstruct the Atmospheric Imaging Assembly (AIA) multi-channel degradation curves.
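Applying such a correction is cheap once the degradation curves are learned; a sketch under the assumption that a small model emits one multiplicative dimming factor per channel (names and shapes illustrative):

```python
import torch

@torch.no_grad()
def undim(degradation_model, frames):
    """Undo per-channel dimming predicted from the images themselves.

    frames: (B, C, H, W); degradation_model returns (B, C) factors in (0, 1].
    """
    factors = degradation_model(frames)
    return frames / factors[..., None, None]  # broadcast over H and W
```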
|
|
|
Data requirements for this study include the SDOML data from AIA as well as older correction tables. The sampling requirement is minimal, with data being required once per day or even less frequently. Traditional SOTA methods, such as those performed by the Lockheed Martin Solar and Astrophysics Laboratory (LMSAL), involve calibration using sounding rocket flights. These methods, while accurate, are expensive and technically demanding. Our goal is to reproduce the results from [19] with greater efficiency in terms of data required and computational resources. This efficiency is evaluated through an examination of the resultant images compared to those produced by SOTA calibration pipelines, alongside intensity histograms, data spike analysis, and other metrics.
|
|
|
# 5. Results

Overall, our model families were evaluated on their backbone reconstruction task and against our four scientific validation cases. In all but the autocalibration task, they reached the same level of accuracy as their classical counterparts or surpassed it, in a fraction of the required time. In the autocalibration case, the direct embedding approach was able to match the accuracy of its classical counterpart but took additional training time.
|
|
|
## 5.1 Reconstruction

Loss for the reconstruction task is measured by pixel RMSE within the solar disk. SAMAE results presented in fig. 5 indicate a clear ability to reconstruct most wavelengths with a small embedding dimension (128) and within a small number of training epochs (10). Interestingly, this model struggles to reconstruct 131 & 171 Å, which is likely due to a normalization error we're still investigating. The Nouveau-VAE model on raw pixel intensity performs better, even when including the solar limb.
|
|
|
## 5.2 Direct Embeddings

Training each scientific validation case on the embeddings directly generally led to much faster training times while matching or surpassing accuracy. An effort was made to evaluate the embeddings outside of the scientific cases via embedding-to-embedding comparison. The common t-SNE approach was applied to a small one-year sample, and the projection showed separation by solar activity. This approach is still fairly opaque, however, so the validation approaches are considered more appropriate.
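The qualitative check itself is a few lines; a sketch assuming an `(N, D)` embedding array and any per-sample activity proxy (e.g. F10.7) for colouring:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_tsne(embeddings, activity):
    """Project embeddings to 2-D with t-SNE and colour by solar activity."""
    xy = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
    plt.scatter(xy[:, 0], xy[:, 1], c=activity, s=4, cmap="viridis")
    plt.colorbar(label="solar activity proxy")
    plt.show()
```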
|
|
|
# Acknowledgements

This work is the research product of SDO-FM: A Multi-Modal Foundation Model POC for SDO, funded and supported by NASA under **Grant award No 80NSSC24K0701**. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Aeronautics and Space Administration (NASA). The research and its outputs have been designed, managed and delivered by Trillium Technologies Inc (https://trillium.tech). Trillium is a research and development company with a focus on intelligent systems and collaborative communities for planetary stewardship, space exploration and human health. Trillium aspires to ensure that the latest tools and techniques in Artificial Intelligence (AI) and Machine Learning (ML) are applied to developing open science for all Humankind.
|
|
|
**Authors**
|
|
|
James Walsh, University of Cambridge

Daniel Gass, University of Central Lancashire

Raul Ramos Pollan, Universidad Industrial de Santander

Richard Galvez, Pure Storage

Paul Wright, Dublin Institute for Advanced Studies

Atılım Güneş Baydin, University of Oxford

Noah Kasmanoff, AE Studio

Jason Naradowsky, University of Tokyo
|
|
|
PI: Anne Spalding, Trillium Technologies Inc

Co-I: James Parr, Trillium Technologies Inc
|
|
|
|