Using reinforcement learning to improve drone-based inference of greenhouse gas fluxes

Accurate mapping of greenhouse gas fluxes at the Earth's surface is essential for the validation and calibration of climate models. In this study, we present a framework for surface flux estimation with drones. Our approach uses data assimilation (DA) to infer fluxes from drone-based observations, and reinforcement learning (RL) to optimize the drone's sampling strategy. Herein, we demonstrate that a RL-trained drone can quantify a CO2 hotspot more accurately than a drone sampling along a predefined flight path that traverses the emission plume. We find that information-based reward functions can match the performance of an error-based reward function that quantifies the difference between the estimated surface flux and the true value. Reward functions based on information gain and information entropy can motivate actions that increase the drone's confidence in its updated belief, without requiring knowledge of the true surface flux. These findings provide valuable insights for further development of the framework for the mapping of more complex surface flux fields.


Introduction
To date, the inability of climate models to capture the large spatio-temporal variability of land-atmosphere interactions remains a significant source of uncertainty in climate change projections [1,2].Model validation and calibration are complicated by the scale mismatch between land-atmosphere processes captured by site observations (≤ 1 km), and the grid scales at which climate models are run (1 − 100 km).We develop a framework for the mapping of greenhouse gas fluxes at the grid scale of climate models using drone observations.Our aim is to facilitate model calibration and validation, and ultimately contribute to reducing uncertainties in climate projections.
In the past decade, there has been increased interest in using meteorological sensors on drones to estimate fluxes.For example, [3], [4] and [5] applied a mass balance approach to quantify the strength of a methane hotspot from drone observations of gas concentrations downwind and upwind of a source.Moreover, [6] used drone observations in combination with the inversion of a Gaussian plume model to approximate fluxes.Recently, [7] applied data assimilation (DA) techniques to fuse drone observations with an atmospheric model to estimate surface heat fluxes.We build upon the latter work by taking a similar approach for the source term estimation of greenhouse gas fluxes.
Optimal sampling strategies for inferring surface fluxes from drone observations are not necessarily evident a priori.On the one hand, the sample duration should be long enough and the number of observations sufficiently large, to capture the time-averaged emission plume as well as possible.On the other hand, sampling for too long a period of time leads to potential errors caused by changes in atmospheric stability, source strength, or prevailing weather during the sampling period.This issue becomes more pressing when sampling larger areas, which is the intention of this framework.This study represents an initial step in the development of a comprehensive framework capable of mapping surface fluxes in diverse environments and under varying atmospheric conditions.We focus on estimating the magnitude of a greenhouse gas surface flux at a known location, assuming static weather conditions.
The strength of reinforcement learning (RL) [8] is its ability to learn a sequence of actions to complete a task in the most optimal way.RL combined with deep learning has been successful in source localization tasks using synthetic observations of mobile sensors [9,10,11].Furthermore, [12] developed a deep RL-approach to localize and estimate a gas leak with a mobile sensor in a simulated environment.These studies aim to minimize the number of sensor observations to reach the source or achieve a predefined confidence.In contrast, the objective of this study is to develop a sampling strategy with RL that gives the most accurate and precise estimation of a strong CO 2 flux while exploiting the full battery life of the drone to collect informative observations.Herein, we use tabular RL to investigate which reward function could be best for this task.

Materials and methods
Figure 1 illustrates the workflow of our framework.Its components are discussed in the following sections.

Bayesian data assimilation
Greenhouse gas surface fluxes are inferred from sparse and uncertain drone-based observations of gas concentration.Solving this inverse problem requires a forward (data generating) model that maps from the latent surface fluxes to the drone observations.This is done through c = F(φ) + ϵ, where c is an observation of gas concentration, F is the forward model, φ is the surface flux, and ϵ is the observation error.We apply Bayesian inference to obtain an estimate of the surface flux.
Following Bayes' theorem: p(φ|c) = p(c|φ)p(φ)/p(c), the prior beliefs about surface flux p(φ) are updated by a likelihood function p(c|φ) which measures the likelihood that a flux has generated the observed concentration [13].This update results in posterior belief p(φ|c), expressing the probability of a surface flux given the observation.In this context, the model evidence p(c) is just a normalizing constant.In the geosciences, we refer to this Bayesian inference exercise as DA [14].
For the DA scheme, our framework uses an iterative Ensemble Kalman Filter (EnKF) [15,16] with 100 ensemble members and 4 iterations.In our synthetic experiments, we use the EnKF to estimate surface fluxes with true values between 200 and 300 mg CO 2 m −2 s −1 , corresponding to a strong emission hotspot from e.g.industrial facilities.To ensure strictly positive values, the prior distribution p(φ) of the surface flux is set to a lognormal distribution with median 100 mg CO 2 m −2 s −1 and mode 30 mg CO 2 m −2 s −1 (shown in Figure 3).

Gaussian plume model
A Gaussian plume model [17] is used as the forward model.This time-independent dispersion model assumes that the horizontal and vertical gas concentrations in the emission plume follow a Gaussian spatial distribution.The model's low computational cost enables real-time decision-making, which will be essential for the future application of our framework in the field.
Given this plume model, the steady-state solution for the gas concentration at location (x, y , z) is a function of the form: c = F(φ, U, D, x, y , z , x s , y s , z s , σ y , σ z ), where U is horizontal wind velocity, D is wind direction, and (x s , y s , z s ) defines the source location.The standard deviations σ y and σ z of the Gaussian plume are defined in this model by atmospheric stability classes ranging from 1 "very unstable" to 6 "very stable" [18].Herein, all input parameters to the Gaussian plume model, except for surface flux φ, are considered known parameters.
In our experiments, the grid of our two-dimensional horizontal domain is 10 by 10 grid cells with an equal grid spacing of 100 m.The location of the surface flux is (x s , y s , z s ) = (150, 850, 0) m.The unperturbed concentration field at z = 10 m for a flux of 250 mg CO 2 m −2 s −1 is shown in Figure 3.The known parameters are chosen to match a typical field case.Wind speed is U = 4 m/s and wind direction is D = 320 • (north-westerly).The dispersion coefficients σ y and σ z follow from stability class 2 "moderately unstable".Furthermore, a background concentration of 400 ppm is assumed.

Synthetic observations
The measurement uncertainty of CO 2 sensors is typically ±30 ppm + 3% in the measurement range 400 − 10, 000 ppm [19].In the synthetic experiments, drone observations are simulated through c = F(φ) + ϵ with Gaussian error ϵ ∼ N (0, σ obs ), where σ obs is the observation error standard deviation.Assuming a battery life of 32 min and sampling frequency of 0.1 Hz, we consider a drone taking 12 measurements at each of 16 locations.This is simulated by generating a 2-minute-averaged synthetic observation at each measurement location and for each time step with σ obs = 30/ √ 12 ppm.

Reinforcement learning
In RL, an agent -the drone -receives a reward r associated with its state s t at time t in response to performing an action a t (see Figure 1).In our experimental setup, the action space is defined by 5 possible actions: moving one grid cell in positive x-direction, negative x-direction, positive y -direction, negative y -direction, or staying in the current grid cell.
Actions that would take the drone off the grid are unavailable.The state space of the drone is defined by its location and the current time step: (x, y , t).Including time in the state space enables the drone to stay in a grid cell for a few steps without getting stuck, and allows for different actions when revisiting grid cells.An episodeflight -ends when the drone reaches one of the terminal states: (x, y , t = 15).
The agent conducts a trial-and-error search to learn how to map states to actions so as to maximize expected return G t .This is the expected sum of future rewards from time t (assuming discount factor γ = 1 as is common in episodic tasks).State-action values describe the expected return that follows from taking action a in state s: q(s, a a where T is the final time step of an episode.The optimal policy -sampling strategy -follows from selecting the highest state-action values in all possible states.
The reward function is the incentive mechanism to teach the drone desirable behavior.We aim to reward action-selection that updates the initial prior to the episode's final posterior in the DA algorithm as much as possible towards the true, but in practice unknown, flux.The agent receives a reward after each observation.We compare reward functions based on: (a) The negative of the continuous ranked probability score (r = −CRPS) [20].The CRPS generalizes the deterministic mean absolute error to probabilistic estimates.In our case, it measures the distance between the estimated posterior and the true surface flux.The best possible score of zero only occurs for a degenerate ensemble centered on the truth.This reward function is an incentive for what we want to accomplish: an accurate (low bias) and precise (low variance) posterior flux estimate.As a result, it acts as a benchmark against which we compare the effectiveness of other reward functions.Its computation requires knowledge of the true surface flux of the training data.This is not a restriction for offline learning with synthetic data but forms a limitation for further online learning in the field during which the drone could fine-tune its policy based on real-time feedback.
(b) The Kullback-Leibler divergence (r = D KL ) [13,21].The forward D KL (posterior||prior) is a non-negative measure of the amount of information gained when updating the initial prior distribution to the dynamic posterior distribution as a result of the observed data.Knowledge of the true surface flux is not required for this reward function.
(c) The negative differential entropy of the lognormal posterior (r = −H) [22].The entropy, H, is related to the information content of a probability distribution in that it is a measure of uncertainty: It is maximum if the distribution is uniform, and it is zero if the distribution is degenerate.H can be seen as a shifted and scaled version of D KL between the estimated dynamic posterior and a uniform distribution [21].Like D KL , computing H does not require knowledge of the true flux.
We estimate the state-action values q(s, a) using a tabular Q-learning algorithm [23].The learning process is performed over 1.5 × 10 6 episodes by epsilon-greedy action selection with ϵ max = 1.0 and ϵ min = 0.01.The learning rate is set to α = 0.1.During the training phase, the environment is reset after each episode of 16 observations.A true surface flux is randomly selected in the range 200 to 300 mg CO 2 m −2 s −1 such that the resulting state-action value model is trained to find the optimal sampling strategy across a range of fluxes.For each episode, the initial position of the drone is randomly selected among one of the edge grid cells.

RL training convergence
Figure 2 shows the learning curves of the models trained with different reward functions.A moving average over 1,000 episodes is applied to smooth out fluctuations caused by noisy observations and the drone's alternating initial positions.During training, the drone learns to accumulate more rewards per episode.As the exploration rate ϵ decreases, the learning curves of all models converge.

Sampling strategy
The left panels of Figure 3 show the sampling strategies of drones trained with different reward functions.Independent from the drone's initial position, the developed optimal sampling strategy is a shortest path to the highest concentration cell.The drone trained with (c) r = −H stays here for the remaining flight time, and hence takes the maximum possible number of observations in the highest concentration cell.The drones trained with (a) r = −CRPS and (b) r = D KL mainly sample the highest concentration cell but also observe the adjacent grid cell(s) if the drone started upwind from the hotspot (orange paths) or crosswind from the hotspot (green paths): The drone trained with r = −CRPS takes one observation in an adjacent grid cell, and the drone trained with r = D KL takes three observations in adjacent grid cells.In general, the drones starting downwind from the hotspot (purple paths), travel through the emission plume to the highest concentration cell.The drones take the maximum possible number of two samples in this grid cell.

Estimated posterior and final CRPS
The prior and estimated posteriors after 16 observations are shown in the right panels of Figure 3.For comparison, the results of a drone flying a grid path across the emission plume are added.For all RL-trained cases, the prior belief is updated toward the true flux.The estimated posteriors of RL-trained drones are narrower than the estimated posterior of the drone flying a grid path.To evaluate the performance of the synthetic experiments, we use the CRPS of the estimated posterior after the flight.The results of the different models are presented in Table 1.Considering the three initial positions investigated in this study, the RL-trained drones have an average CRPS below 6 mg CO 2 m −2 s −1 .The drone sampling along a predefined grid path across the plume has an average CRPS of approximately 20 mg CO 2 m −2 s −1 .upwind downwind crosswind r = −CRPS 2.0 ± 1.4 5.5 ± 4.0 2.7 ± 1.9 r = D KL 2.1 ± 1.5 5.6 ± 4.0 3.1 ± 2.2 r = −H 1.9 ± 1.4 5.5 ± 3.9 2.5 ± 1.8 grid path 20.3 ± 13.4 Table 1: Final posterior CRPS for drones trained with different reward functions starting in locations upwind, downwind and crosswind from a flux of 250 mg CO 2 m −2 s −1 , and a drone following a grid path.The average and standard deviation over 5,000 sampling paths is presented.

Discussion
In this study, we empirically demonstrate the application of RL to develop sampling strategies for drone observations to infer the strength of a greenhouse gas hotspot.
Our findings show that RL-trained drones can outperform drones following a predefined grid sampling path across the emission plume.The optimal RL policies identified in this study remain consistent across various true surface flux values and initial drone positions.The learned sampling strategies select a shortest path to the highest concentration cell and sample that grid cell multiple times.This approach is effective because the DA update in this cell may be very informative due to its lower relative measurement uncertainty in comparison to cells with a lower concentration.As a result, the drone can considerably increase its confidence in the updated belief, which is the objective defined in the design of this RL task.
The results of this study offer valuable insights into the use of error-based and information-based reward functions for improving drone-based inference of greenhouse gas fluxes.Using accurate but noisy data, we find that drones trained with information-based reward functions that do not rely on knowledge of the underlying true field of interest can still be effective in finding an optimal sampling policy.When comparing reward functions based on information gain or information entropy, we find that the choice of reference distribution, whether it is the prior at the start of the flight (r = D KL ) or a uniform probability distribution (r = −H), has minor impact on the final estimate of the surface flux strength.Both approaches learn a sampling strategy where the highest concentration cell is sampled multiple times.
The RL-developed sampling strategies in this study may align closely with human intuition for this source term estimation task where many variables are known.However, in more general tasks with numerous unknown parameters, it is much less likely that our intuition approaches the optimal sampling strategy.In future work, we aim to adapt the framework to deal with such general settings, enabling its operation in diverse atmospheric conditions and environments.Additionally, our focus will extend to the mapping of more complex surface flux fields, such as hotspots at unknown locations, multiple hotspots, and jointly inferring different types of greenhouse gas fluxes.These future challenges will markedly increase the dimensionality and complexity of the RL task, surpassing the computational capabilities of tabular RL within a reasonable timeframe.Consequently, future studies will use function approximation algorithms, such as neural networks, to fit the state-action model.

Figure 1 :
Figure 1: Overview of the framework.Drone observations of gas concentration are fused with a Gaussian plume model.Through data assimilation (in orange) a more accurate estimate of the unobserved surface flux is inferred.A reinforcement learning algorithm (in green) is used to learn an optimal sampling policy for the positions of drone observations to reduce the model uncertainty as much as possible, and consequently increase the accuracy of the estimated surface flux.

Figure 2 :
Figure 2: The range normalized sum of rewards per episode for models trained with different reward functions.A moving average over 1,000 episodes is shown.

Figure 3 :
Figure 3: Comparison of flux strength estimation by drones trained with different reward functions: (a) r = −CRPS, (b) r = D KL and (c) r = −H, and a drone flying a grid path.Left: The unperturbed concentration field at z = 10 m for a flux of 250 mg CO 2 m −2 s −1 with 400 + 30/ √ 12 ppm isoline (dark grey), including the sampling paths of RL-trained drones starting upwind (orange), crosswind (green) and downwind (purple) from the hotspot at ×, and a baseline grid path (light grey) with data collection at 16 locations (dots).Right: The prior, true flux and estimated posterior of 10 random individual flights.Differences between flights are a result of noisy observations.