SkySense: Multimodal Remote Sensing Foundation Model
3 main points
✔️ We proposed a large-scale remote sensing Foundation model called SkySense that can handle a variety of tasks and multimodal data.
✔️ SkySense consists of a Factorized Multi-Modal Spatiotemporal Encoder that processes multimodal time-series data, Multi-Granularity Contrastive Learning that learns features at various granularities, and Geo-Context Prototype Learning that extracts geo-context information.
✔️ SkySense sets a new state of the art in comparisons against 18 existing remote sensing Foundation models.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
written by Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, Yansheng Li
(Submitted on 15 Dec 2023)
Comments: Accepted by CVPR2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Interpretation of earth observation remote sensing images is very important for various tasks such as crop monitoring and disaster management. However, models have conventionally had to be built separately for each task. Recently, pre-trained foundation models that can be reused for various downstream tasks have attracted attention, and Remote Sensing Foundation Models (RSFMs) have been studied. Unlike ordinary images, remote sensing images are multimodal (optical and SAR sensors), come at different resolutions, and carry time-series and geographic information, so an RSFM needs to learn such geo-contextual information. In this paper, we improve on existing RSFMs and train a model called SkySense with 2 billion parameters on a multimodal remote sensing dataset of about 21.5 million images.
Dataset
As pre-training data, we created multimodal data from various sensors including WorldView-3/4, Sentinel-1, and Sentinel-2. The total number of images is 21.5 million, and the input to SkySense is $\{x_{HR}, x_{Ms}, x_{SAR}\}$, where $x_{HR}$ is WorldView imagery, $x_{Ms}$ is Sentinel-2 imagery, and $x_{SAR}$ is Sentinel-1 imagery.
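To make the input structure concrete, here is a minimal sketch (in PyTorch) of how one multimodal sample could be organized. The shapes, band counts, and sequence lengths are illustrative assumptions, not values from the paper.

```python
import torch

# Hypothetical layout of one multimodal training sample.
# Shapes, band counts, and sequence lengths are illustrative only.
sample = {
    "HR":  torch.randn(1, 3, 256, 256),   # x_HR: single WorldView-3/4 image (T, bands, H, W)
    "Ms":  torch.randn(20, 10, 64, 64),   # x_Ms: Sentinel-2 multispectral time series
    "SAR": torch.randn(10, 2, 64, 64),    # x_SAR: Sentinel-1 SAR time series
}
# Acquisition dates (e.g. day of year) accompany each series; they are used
# later by the date-aware temporal positional encoding.
dates = {m: torch.randint(0, 366, (sample[m].shape[0],)) for m in sample}
```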
Architecture
The architecture is shown in the figure below.
Factorized Multi-Modal Spatiotemporal Encoder
Spatial features are first extracted from each modality independently and then fused. Let $g_i$ be the spatial encoder for modality $i$.
$$F_i=g_i(x_i),\ i\in \{HR, Ms, SAR\},$$
$$F_T=Concat[F_{HR}, F_{Ms}, F_{SAR}]$$
Next, the positional encoding $P_{DTPE}[:,{\bf t},:]$, which encodes the acquisition date information, is added, and the result is concatenated with an extra token $F_{\bf e}$.
$$F_T^{date}=F_T+P_{DTPE}[:,{\bf t},:],$$
$$F_T^{cat}=Concat[F_{\bf{e}}, F_T^{date}]$$
where ${\bf t}$ is a vector containing all acquisition dates in the batch. $F_T^{cat}$ is then fed into multiple Transformer encoder layers, yielding the multimodal spatiotemporal feature $F_{\bf fus}^{mm}$.
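Below is a minimal PyTorch sketch of this fusion step. It assumes each modality's spatial encoder $g_i$ has already produced one feature token per acquisition, that dates are day-of-year indices into a learnable table, and that the fused feature $F_{\bf fus}^{mm}$ is read from the extra-token position; these choices and all dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FactorizedFusionSketch(nn.Module):
    """Sketch of the multimodal spatiotemporal fusion (not the paper's exact design)."""
    def __init__(self, dim=256, num_layers=2, num_dates=366):
        super().__init__()
        self.date_pos_emb = nn.Embedding(num_dates, dim)          # P_DTPE, indexed by dates t
        self.extra_token = nn.Parameter(torch.zeros(1, 1, dim))   # F_e
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats, dates):
        # feats: per-modality spatial features g_i(x_i), each (B, T_i, D)
        # dates: per-modality acquisition dates, each (B, T_i), integer day-of-year
        f_t = torch.cat([feats[m] for m in ("HR", "Ms", "SAR")], dim=1)   # F_T
        t = torch.cat([dates[m] for m in ("HR", "Ms", "SAR")], dim=1)
        f_t_date = f_t + self.date_pos_emb(t)                             # F_T^date
        extra = self.extra_token.expand(f_t.size(0), -1, -1)
        f_t_cat = torch.cat([extra, f_t_date], dim=1)                     # F_T^cat
        fused = self.temporal_encoder(f_t_cat)
        return fused[:, 0]                                                # F_fus^mm (extra-token output)

# Usage with random stand-in features (batch of 2):
enc = FactorizedFusionSketch()
feats = {"HR": torch.randn(2, 1, 256), "Ms": torch.randn(2, 20, 256), "SAR": torch.randn(2, 10, 256)}
dates = {"HR": torch.randint(0, 366, (2, 1)), "Ms": torch.randint(0, 366, (2, 20)), "SAR": torch.randint(0, 366, (2, 10))}
f_fus_mm = enc(feats, dates)   # shape (2, 256)
```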
Attentional Geo-Context Integration
Because the geographic location of a remote sensing image carries important geo-context, the fused feature is combined via attention with prototype features defined per region, drawn from a region-specific prototype set $\mathcal{P}$.
$$F_{\bf fus}=Concat\left[F_{\bf fus}^{mm}, Softmax\left(\frac{QK^T}{\sqrt d}\right)V\right], Q=F_{\bf fus}^{mm}, K=V=\mathcal P_r$$
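As a sketch, this integration is a single cross-attention between the fused feature (query) and the region's prototypes (keys/values), followed by concatenation; the single-head form without learned projections is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def geo_context_integration(f_fus_mm, prototypes_r):
    """Attentional geo-context integration sketch (single head, no learned projections).

    f_fus_mm:     (B, D) fused multimodal feature, used as the query Q
    prototypes_r: (N_p, D) region-specific prototype subset P_r, used as K and V
    """
    d = f_fus_mm.size(-1)
    attn = F.softmax(f_fus_mm @ prototypes_r.T / d ** 0.5, dim=-1)   # softmax(QK^T / sqrt(d))
    context = attn @ prototypes_r                                    # attention-weighted prototypes
    return torch.cat([f_fus_mm, context], dim=-1)                    # F_fus = [F_fus^mm, context]
```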
Pre-training
Multi-Granularity Contrastive Learning
Two types of data augmentation are applied to the inputs $\{x_{HR}, x_{Ms}, x_{SAR}\}$, producing two views $\{u_i\}$ and $\{v_i\}$. Let $g_i$ and $g'_i$ denote the student and teacher spatial encoders, respectively.
$$F_i=g_i(u_i), F'_i=g'_i(v_i)\ i\in\{HR,Ms,SAR\}$$
Pixel-level, object-level, and image-level contrastive learning are introduced to handle tasks at a variety of granularities and resolutions. The pixel-level loss is:
$${\mathcal L}_{\bf pix}(F_i, F_i')=\frac{1}{N_ST_i}\sum_s \sum_t {\mathcal L}_{CL}(f_i^{\bf pix}, f_i^{\bf pix'})$$
where $N_S$ is the spatial feature size, $T_i$ is the sequence length, ${\mathcal L}_{CL}$ is the contrastive loss, $f_i^{\bf pix}$ is a pixel-level feature extracted at a given spatial position and time step from $F_i$, and $f_i^{\bf pix'}$ is the corresponding feature of the same area from $F_i'$.
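As a sketch of this term, a plain InfoNCE loss stands in below for ${\mathcal L}_{CL}$ (the paper's exact contrastive objective is not reproduced), and the two views' features are assumed to be arranged as $(T_i, N_S, D)$ tensors with matched positions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """Stand-in for L_CL: InfoNCE between row-wise matched features a and b."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature                  # similarity of every pair
    targets = torch.arange(a.size(0))               # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

def pixel_level_loss(f_student, f_teacher):
    """L_pix: average the contrastive loss over all spatial positions and time steps.
    f_student, f_teacher: (T_i, N_S, D) pixel-level features from the two views."""
    t_i, n_s, d = f_student.shape
    return info_nce(f_student.reshape(t_i * n_s, d),
                    f_teacher.reshape(t_i * n_s, d))
```

The object-level and image-level losses below follow the same pattern at coarser granularity.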
$${\mathcal L}_{\bf obj}(F_i, F_i')=\frac{1}{N_CT_i}\sum_s \sum_t {\mathcal L}_{CL}(f_i^{\bf obj}, f_i^{\bf obj'})$$
where $f_i^{\bf obj}$ is a cluster center obtained by unsupervised clustering of the features $f_i^{\bf pix}$, and $N_C$ is the number of clusters.
$${\mathcal L}_{\bf img}(F_i, F_i')=\frac{1}{T_i}\sum_t {\mathcal L}_{CL}(F_i^{\bf img}, F_i^{\bf img'})$$
where $F_i^{\bf img}$ is obtained by average pooling of the pixel-level features.
Finally, the above pixel-level, object-level, and image-level contrastive losses are summed to obtain the fine-grained contrastive learning loss ${\mathcal L}_{FGCL}$, and the Multi-Granularity Contrastive Learning loss ${\mathcal L}_{MGCL}$ is as follows:
$${\mathcal L}_{MGCL}=\sum_{i\in \{HR,Ms,SAR\}}{\mathcal L}_{FGCL}(F_i,F_i')+{\mathcal L}_{FGCL}(F_{\bf fus}, F'_{\bf fus})$$
This enables learning spatial information at multiple granularities, for both single-modal and fused multimodal features.
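Continuing the sketch (and reusing `info_nce` and `pixel_level_loss` from above), the object-level and image-level terms and their combination could look as follows. Using k-means for the unsupervised clustering and reusing the student's cluster assignment to pool the teacher's features are illustrative assumptions, not details confirmed by the paper.

```python
import torch
from sklearn.cluster import KMeans

def object_level_loss(f_student, f_teacher, num_clusters=8):
    """L_obj sketch: contrast cluster centers of the two views' pixel features."""
    t_i, n_s, d = f_student.shape
    losses = []
    for t in range(t_i):
        s, te = f_student[t], f_teacher[t]                       # (N_S, D) at time step t
        labels = KMeans(n_clusters=num_clusters, n_init=10).fit(
            s.detach().cpu().numpy()).labels_
        labels = torch.as_tensor(labels)
        centers_s, centers_t = [], []
        for c in range(num_clusters):
            mask = labels == c
            if mask.any():                                        # skip empty clusters
                centers_s.append(s[mask].mean(0))
                centers_t.append(te[mask].mean(0))
        losses.append(info_nce(torch.stack(centers_s), torch.stack(centers_t)))
    return torch.stack(losses).mean()

def image_level_loss(f_student, f_teacher):
    """L_img sketch: average-pool pixel features per time step, then contrast."""
    return info_nce(f_student.mean(dim=1), f_teacher.mean(dim=1))

def fgcl(f_student, f_teacher):
    """Fine-grained contrastive loss: sum of the pixel-, object-, and image-level terms."""
    return (pixel_level_loss(f_student, f_teacher)
            + object_level_loss(f_student, f_teacher)
            + image_level_loss(f_student, f_teacher))

# L_MGCL would then sum fgcl over each single-modal feature pair (HR, Ms, SAR)
# plus the fused multimodal pair, per the equation above.
```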
Unsupervised Geo-Context Prototype Learning
Because geo-context is important information, it is introduced into the student model. The earth's surface is divided into $N_R$ regions, and a prototype subset ${\mathcal P}_r$ is defined for each region. A cosine similarity matrix $\bf M$ is then computed between $F_{\bf fus}^{mm}$ and ${\mathcal P}_r$. Combining the Sinkhorn-Knopp algorithm with an EMA update (https://arxiv.org/abs/1911.05722), the prototypes are updated as follows:
$$\bar{{\mathcal P}_r}={\bf S}^TF_{\bf fus}^{mm},\quad {\mathcal P}_r \leftarrow m{\mathcal P}_r+(1-m)\bar{{\mathcal P}_r}$$
where $\bf S$ is the optimal assignment matrix between $F_{\bf fus}^{mm}$ and the prototypes, and $m\in [0,1)$ is the momentum coefficient. This allows region-aware features to be learned.
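The update can be sketched as below, using a standard SwAV-style Sinkhorn-Knopp normalization as the assignment step; the iteration count, temperature, and momentum values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp normalization of a similarity matrix into a balanced
    soft assignment matrix S (SwAV-style; hyperparameters are illustrative)."""
    q = torch.exp(scores / eps).T          # (N_p, B)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= k    # balance over prototypes
        q /= q.sum(dim=0, keepdim=True); q /= b    # balance over samples
    return (q * b).T                        # (B, N_p)

@torch.no_grad()
def update_prototypes(prototypes_r, f_fus_mm, momentum=0.99):
    """EMA prototype update for region r, following the equation above.
    prototypes_r: (N_p, D); f_fus_mm: (B, D) fused features from that region."""
    m_sim = F.normalize(f_fus_mm, dim=-1) @ F.normalize(prototypes_r, dim=-1).T  # cosine matrix M
    s = sinkhorn(m_sim)                            # optimal assignment S
    new_protos = s.T @ f_fus_mm                    # \bar P_r = S^T F_fus^mm
    return momentum * prototypes_r + (1 - momentum) * new_protos
```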
Result
Performance on various datasets and tasks is shown in the figure below. SkySense outperforms existing models and sets a new state of the art on nearly all benchmarks, across both single-modal and multi-modal tasks.
Summary
In this paper, we proposed a large-scale multimodal remote sensing Foundation model called SkySense. By introducing modules that learn across different scenarios, the model demonstrates generalization performance that improves accuracy on a variety of tasks. Future work includes incorporating a language modality to broaden its applications.