
Physics-Informed Gaussian Process Regression
Generalizes Linear PDE Solvers

\name Marvin Pförtner¹ \email marvin.pfoertner@uni-tuebingen.de
\name Ingo Steinwart² \email ingo.steinwart@mathematik.uni-stuttgart.de
\name Philipp Hennig¹ \email philipp.hennig@uni-tuebingen.de
\name Jonathan Wenger¹ \email jonathan.wenger@uni-tuebingen.de
\addr ¹University of Tübingen, Tübingen AI Center
\addr ²University of Stuttgart

Abstract

Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models with a downstream application and thus error quantification plays a key role. However, by ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of the Gaussian process inference theorem to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows us to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models by blurring the boundaries between numerical analysis and Bayesian inference.

Keywords: physics-informed machine learning, probabilistic numerics, partial differential equations, method of weighted residuals, Galerkin methods, Gaussian processes, bounded linear operators

1 Introduction

Partial differential equations (PDEs) are powerful mechanistic models of static and dynamic systems with continuous spatial interactions (Borthwick, 2018). They are widely used in the natural sciences, especially in physics, and in applied fields like engineering, medicine and finance. Linear PDEs form a subclass describing physical phenomena such as heat diffusion (Fourier, 1822), electromagnetism (Maxwell, 1865), and continuum mechanics (Lautrup, 2005). Additionally, they are used in applications as diverse as computer graphics (Kazhdan et al., 2006), medical imaging (Holder, 2005), or option pricing (Black and Scholes, 1973).

Scientific inference with PDEs

Given a mechanistic model of a (physical) system in the form of a linear PDE $\mathcal{D}[{\bm{u}}]=f$ , where $\mathcal{D}$ is a linear differential operator mapping between vector spaces of functions, the system can be simulated by solving the PDE subject to a set of linear boundary conditions (BC), given by a linear operator $\mathcal{B}$ and a function $g$ defined on the boundary of the domain, s.t. $\mathcal{B}[{\bm{u}}]=g$ (Evans, 2010). For instance, given all material parameters and heat sources involved, a PDE can describe the temperature distribution in an electronic component, while the boundary conditions describe the heat flux out of the component at the surface. Since hardly any practically relevant PDE can be solved analytically (Borthwick, 2018), in practice, specialized numerical methods relying on discretization are employed. Often such solvers are embedded into larger scientific models, where model parameters are inferred from measurement and downstream analyses depend on the resulting simulation. For example, we would like to model whether said electronic component hits critical temperature thresholds during operation to assess its longevity.

Challenges when solving PDEs

When performing scientific inference with PDEs via numerical simulation, one is faced with three fundamental challenges.

(C1)

Limited computation. Any numerically computed solution $\hat{{\bm{u}}}\approx{\bm{u}}$ suffers from approximation error. In practice, a sufficiently accurate simulation often requires vast amounts of computational resources.
(C2)

Partially-known physics. While the underlying physical mechanism is encoded in the formulation of the PDE, in practice, its exact parameters and boundary conditions are often unknown. For example, the position and strength of heat sources $f$ within the aforementioned electronic component are only approximately known. Similarly, material parameters like thermal conductivity, which define $\mathcal{D}$, can often only be estimated. Finally, the initial or boundary conditions $\mathcal{B}[{\bm{u}}]=g$ are also only partially known, for example how much heat an electronic component dissipates via its surface.
(C3)

Error propagation. Limited computation and partially-known physics inevitably introduce error into the simulation. This resulting bias can fundamentally alter conclusions drawn from downstream analysis steps, in particular if these are sensitive to input variability. For example, an electronic component may be deemed safe based on the simulation, although its true internal temperature hits safety-critical levels repeatedly.

Solving PDEs as a learning problem

The challenges of scientific inference with PDEs are fundamentally issues of partial information. Here, we interpret solving a PDE as a learning problem, specifically as physics-informed regression, in the spirit of probabilistic numerics (Hennig et al., 2015; Cockayne et al., 2019b; Oates and Sullivan, 2019; Owhadi et al., 2019; Hennig et al., 2022). By leveraging the tools of Bayesian inference, we can tackle the challenges (C1), (C2) and (C3). As illustrated in figure 1(a), we model the solution of the PDE with a Gaussian process, which we condition on observations of the boundary conditions, the PDE itself and any physical measurements:

•

Encoding prior knowledge. We can efficiently leverage any available computation by encoding inductive bias about the solution of the PDE. For example, we can identify the solution space by “partial derivative counting”. Moreover, since PDEs typically model physical systems, expert knowledge is often available. This includes known physical properties of the system such as symmetries, as well as more subjective estimates from previous experience with similar systems or computationally cheap approximations.
•

Conditioning on the boundary conditions. The linear boundary conditions can be interpreted as measurements of the solution of the PDE on the boundary. By conditioning on (some of) these measurements, we are not limited to satisfying the boundary conditions exactly, but can directly model uncertain constraints without having to resort to point estimates. Instead, we propagate the uncertainty to the solution estimate. This also allows us to handle cases where we do not have a functional form $g$ of the constraints, but only a discrete set of constraints at boundary points.
•

Conditioning on the PDE. Conditioning a probability measure over the solution on the analytic “observation” that the PDE holds is generally intractable. In the spirit of classic approaches for solving PDEs, we relax the PDE-constraint by requiring only a finite number of projections of the associated PDE residual onto carefully chosen test functions to be zero. This choice of projections defines the discretization and allows for control over the amount of expended computation. The resulting posterior quantifies the algorithm’s uncertainty within a whole set of solution candidates.
•

Conditioning on measurements. Finally, we can also condition on direct measurements of the solution itself. This is especially useful if parameters of the differential operator or boundary conditions are uncertain, or if the computational budget is restrictive.

The resulting posterior belief quantifies the uncertainty about the true solution induced by limited computation and partially-known physics (see figure 1(b)). By quantifying this error probabilistically, we can propagate it to any downstream analysis or decision. For example, to project the longevity of a newly designed electronic component, we want to simulate how likely it is to hit a critical temperature threshold during operation. Given our posterior belief, we can simply compute the marginal probability instead of performing Monte Carlo sampling, which would require repeated PDE solves at significant computational expense.
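As a minimal sketch of such a downstream computation, assume the posterior marginal of the temperature at a single point of interest is Gaussian with known mean and standard deviation. The function name and the numbers below are hypothetical, and the sketch only covers a pointwise marginal, not the probability of exceeding the threshold anywhere in the domain.

```python
import math


def exceedance_probability(mean, std, threshold):
    """P(u > threshold) for a Gaussian marginal u ~ N(mean, std^2)."""
    if std <= 0.0:
        # Degenerate marginal: all probability mass sits at the mean.
        return 1.0 if mean > threshold else 0.0
    z = (threshold - mean) / (std * math.sqrt(2.0))
    # Gaussian tail probability via the complementary error function.
    return 0.5 * math.erfc(z)


# Hypothetical numbers: posterior mean 80, standard deviation 5, threshold 90.
p_critical = exceedance_probability(mean=80.0, std=5.0, threshold=90.0)
```

The tail probability is available in closed form here, so no sampling and no additional PDE solves are needed for this particular query.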

Figure 1(a): Learning to solve the Poisson equation. A problem-specific Gaussian process prior ${\mathrm{u}}$ is conditioned on partially-known physics, given by uncertain boundary conditions (BC) and a linear PDE, as well as on noisy physical measurements from experiment. The boundary conditions and the right-hand side of the PDE are not known but inferred from a small set of noise-corrupted measurements. The plots juxtapose the belief ${\mathrm{u}}\mid\cdots$ with the true solution $u^{\star}$ of the latent boundary value problem.

Contribution

We introduce a probabilistic learning framework for the solution of (systems of) linear PDEs. Our framework can be viewed as physics-informed Gaussian process regression. It is based on a crucial generalization of a popular result on conditioning GPs on linear observations to observations made via an arbitrary bounded linear operator with values in $\mathbb{R}^{n}$ (theorem 1). This enables combined quantification of uncertainty from the inherent discretization error, uncertain initial or boundary conditions, as well as noisy measurements of the solution. While connections between GP inference and the solution of PDEs were made in the past (see Section 3.5), corresponding methods have largely focused on estimating strong solutions by leveraging finite difference or collocation schemes. In contrast, our framework applies to both weak and strong formulations and generalizes a significantly broader class of existing numerical methods. Our approach is a strict probabilistic generalization of methods of weighted residuals (corollary 4), including collocation, finite volume, (pseudo)spectral, and (generalized) Galerkin methods such as finite element methods. The resulting probabilistic methods thus have the same convergence properties as their classic counterparts, while providing a structured error estimate. Moreover, the probabilistic viewpoint allows us to incorporate partially-known physics and (noisy) experimental measurements.

2 Background

2.1 Linear Partial Differential Equations

A linear partial differential equation (PDE) is an equation of the form

\mathcal{D}[{\bm{u}}]=f,

(2.1)

where $\mathcal{D}\colon{\mathbb{U}}\to{\mathbb{V}}$ is a linear differential operator (see definition 23) between a Banach space ${\mathbb{U}}$ of $\mathbb{R}^{d^{\prime}}$ -valued functions and a Banach space ${\mathbb{V}}$ of real-valued functions on a common open and bounded domain ${\mathbb{D}}\subset\mathbb{R}^{d}$ , and $f\in{\mathbb{V}}$ is the right-hand side function. For simplicity of exposition, we will often focus on the case $d^{\prime}=1$ , in which case we write $u$ instead of ${\bm{u}}$ . Systems modeled by linear PDEs are often further constrained by linear boundary conditions (BCs) $\mathcal{B}[{\bm{u}}]=g$ describing the behavior of the system on the boundary $\partial{\mathbb{D}}$ of the domain, where $\mathcal{B}$ is a linear operator mapping functions ${\bm{u}}\in{\mathbb{U}}$ onto functions $\mathcal{B}[{\bm{u}}]\colon\partial{\mathbb{D}}\to\mathbb{R}$ defined on the boundary and $g\colon\partial{\mathbb{D}}\to\mathbb{R}$ . Common types of boundary conditions for $d^{\prime}=1$ are:

•

Dirichlet: Specify the values of the solution on the boundary, i.e. $\mathcal{B}[u]=u|_{\partial{\mathbb{D}}}$ .
•

Neumann: Specify the exterior normal derivative on the boundary, i.e. $\mathcal{B}[u]({\bm{x}})\coloneqq\partial_{{\bm{\eta}}({\bm{x}})}u\left({\bm{x}}\right)$, where ${\bm{\eta}}({\bm{x}})$ is the exterior normal vector at each point of the boundary.

A PDE and a set of boundary conditions is referred to as a boundary value problem (BVP). A prototypical example of a linear PDE, used in thermodynamics, electrostatics and Newtonian gravity, is the Poisson equation $-\Delta u=f$ , where $\Delta u=\sum_{i=1}^{d}\frac{\partial^{2}u}{\partial{x}_{i}^{2}}$ is the Laplacian.

2.1.1 Weak Formulation

Many models of physical phenomena are expressed as functions ${\bm{u}}$, which are not (continuously) differentiable or even continuous (Evans, 2010; Borthwick, 2018; von Harrach, 2021). In other words, they are not strong solutions to any PDE. There are also PDEs derived from established physical principles, which do not admit strong solutions at all. To address this, one can weaken the notion of differentiability leading to the concept of weak solutions. Many of the aforementioned physical phenomena are in fact weak solutions. As an example\footnote{Our exposition is a strongly abbreviated version of Evans (2010, Section 6.1.2).}, consider the weak formulation of the stationary heat equation for non-homogeneous media

-\operatorname{div}\left(\kappa\nabla u\right)=\dot{q}_{V}.

(2.2)

Assume that $u\in C^{2}({\mathbb{D}})$ , $\kappa\in C^{1}({\mathbb{D}})$ , and $\dot{q}_{V}\in C^{0}({\mathbb{D}})$ . If $u$ is a solution to equation 2.2, then we can integrate both sides of the equation against a test function $v\in C_{c}^{\infty}\left({\mathbb{D}}\right)$ , i.e. an infinitely smooth function with compact support (see definition 24), which results in

-\int_{{\mathbb{D}}}\operatorname{div}\left(\kappa\nabla u\right)\left({\bm{x}}\right)v({\bm{x}})\,\mathrm{d}{\bm{x}}=\int_{{\mathbb{D}}}\dot{q}_{V}({\bm{x}})v({\bm{x}})\,\mathrm{d}{\bm{x}}.

Since both $u$ and $v$ are sufficiently differentiable, we can apply integration by parts (Green’s first identity) to the first integral to obtain

\underbrace{\int_{{\mathbb{D}}}\langle\kappa({\bm{x}})\nabla u\left({\bm{x}}\right),\nabla v\left({\bm{x}}\right)\rangle\,\mathrm{d}{\bm{x}}}_{\eqqcolon B[u,v]}=\int_{{\mathbb{D}}}\dot{q}_{V}({\bm{x}})v({\bm{x}})\,\mathrm{d}{\bm{x}},

(2.3)

since $v|_{\partial{\mathbb{D}}}=0$ . This expression does not require $u$ to be twice differentiable. Rather, $u$ only needs to be once weakly differentiable (see Evans 2010, Section 5.2.1) with $(\nabla u)_{i}\in L_{2}\left({\mathbb{D}}\right)$ . Intuitively speaking, a weak derivative of a (classically non-differentiable) function “behaves like a derivative” when integrated against a smooth test function. These relaxed requirements on $u$ are exactly the defining properties of the Sobolev space $H^{1}\left({\mathbb{D}}\right)$ , i.e. it suffices that $u\in H^{1}\left({\mathbb{D}}\right)$ . Similarly, we can weaken all other assumptions to $v\in H^{1}_{0}\left({\mathbb{D}}\right)$ , $\dot{q}_{V}\in L_{2}\left({\mathbb{D}}\right)$ and $\kappa\in L_{\infty}\left({\mathbb{D}}\right)$ . Then, for $u\in H^{1}\left({\mathbb{D}}\right)$ and $v\in H^{1}_{0}\left({\mathbb{D}}\right)$ , equation 2.3 is equivalent to

\mathcal{D}^{w}[u]=f^{w},

(2.4)

where $\mathcal{D}^{w}\colon H^{1}\left({\mathbb{D}}\right)\to H^{1}_{0}\left({\mathbb{D}}\right)^{\prime},u\mapsto B[u,\cdot]$ and $f^{w}=\langle\dot{q}_{V},\cdot\rangle_{L_{2}\left({\mathbb{D}}\right)}\in H^{1}_{0}\left({\mathbb{D}}\right)^{\prime}$. Here, $H^{1}_{0}\left({\mathbb{D}}\right)^{\prime}$ denotes the continuous dual space of $H^{1}_{0}\left({\mathbb{D}}\right)$. We define a weak solution of equation 2.2 as $u\in H^{1}\left({\mathbb{D}}\right)$ such that equation 2.4, known as the weak or variational formulation, holds.

Definition 1.

A weak formulation of a linear PDE $\mathcal{D}[{\bm{u}}]=f$ is an equation of the form

\mathcal{D}^{w}[{\bm{u}}]=f^{w},

(2.5)

where $\mathcal{D}^{w}\colon{\mathbb{U}}\to{\mathbb{V}}^{\prime}$ is a linear operator induced by the differential operator $\mathcal{D}$ and $f^{w}\in{\mathbb{V}}^{\prime}$ is a linear functional induced by the right-hand side $f$ . A solution to equation 2.5 is called a weak solution of the PDE. In this context, $\mathcal{D}[{\bm{u}}]=f$ is called the strong formulation of the PDE and any solution to it is called a strong or classical solution.


2.1.2 Methods of Weighted Residuals

Unfortunately, linear PDEs both in weak and strong formulation are in general not analytically solvable, so approximate solutions are sought instead. Methods of weighted residuals (MWR) constitute a large family of popular numerical approximation schemes for linear PDEs, including collocation, finite volume, (pseudo)spectral, and (generalized) Galerkin methods such as finite-element methods (Fletcher, 1984). Loosely speaking, MWRs interpret a linear PDE as a root-finding problem for the associated PDE residual, i.e. $\mathcal{D}[{\bm{u}}]-f=0.$ Finding the solution of such a system of an uncountably infinite number of equations with infinitely many unknowns is generally intractable. To render the problem tractable, we reduce the number of equations by “projecting” onto $\mathbb{R}^{n}$ using a finite number of continuous linear test functionals $l^{(1)},\dotsc,l^{(n)}\in{\mathbb{V}}^{\prime}$ , i.e. we use that the residual being zero implies

l^{(i)}[\mathcal{D}[{\bm{u}}]-f]=(l^{(i)}\circ\mathcal{D})[{\bm{u}}]-l^{(i)}[f]=0

(2.6)

for all $i=1,\dotsc,n$. This is a relaxation of the original problem, since the above is not an equivalence but only an implication.\footnote{This means that equation 2.6 will generally have infinitely many solutions and needs regularization to have a unique solution.} A common choice for the test functionals appearing in a large class of MWRs is the integral $l^{(i)}[v]\coloneqq\int_{{\mathbb{D}}}\psi^{(i)}(x)v(x)\,\mathrm{d}x,$ where $\psi^{(i)}\in{\mathbb{V}}$ is a so-called test function. In this case, the test functionals define a weighted average of the current residual, giving rise to the name of the method.

To reduce the number of unknowns, MWRs also often approximate the unknown solution function ${\bm{u}}$ via finite linear combinations of trial functions ${\bm{\phi}}^{(1)},\dotsc,{\bm{\phi}}^{(m)}\in{\mathbb{U}},$ i.e.

{\bm{u}}\approx\hat{{\bm{u}}}\coloneqq\sum_{i=1}^{m}{c}_{i}{\bm{\phi}}^{(i)},

(2.7)

where ${\bm{c}}\in\mathbb{R}^{m}$ is the coordinate vector of $\hat{{\bm{u}}}$ in the finite-dimensional subspace $\hat{{\mathbb{U}}}\coloneqq\operatorname{span}\left({\bm{\phi}}^{(1)},\dotsc,{\bm{\phi}}^{(m)}\right)\subset{\mathbb{U}}$. By substituting equation 2.7 into equation 2.6, we arrive at a linear system $\hat{{\bm{D}}}{\bm{c}}=\hat{{\bm{f}}},$ where $\hat{{D}}_{ij}\coloneqq l^{(i)}[\mathcal{D}[{\bm{\phi}}^{(j)}]]$ and $\hat{{f}}_{i}\coloneqq l^{(i)}[f].$ Hence, the approximate solution function obtained from this method is given by

{\bm{u}}^{\mathrm{MWR}}=\sum_{i=1}^{m}{c}^{\mathrm{MWR}}_{i}{\bm{\phi}}^{(i)},\qquad\text{where}\qquad{\bm{c}}^{\mathrm{MWR}}=\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}

(2.8)

assuming that $\hat{{\bm{D}}}$ is invertible. Above, we implicitly assume that the trial functions ${\bm{\phi}}^{(i)}$ satisfy the boundary conditions, i.e. we describe so-called interior methods.\footnote{By stacking the residuals corresponding to the PDE and the boundary conditions, the approach outlined here can be used to realize mixed methods, which solve the boundary value problem without requiring that $\hat{{\bm{u}}}$ fulfills the boundary conditions by construction.}

The procedure outlined above can also be applied to approximate weak solutions to linear PDEs by simply substituting $\mathcal{D}\leftarrow\mathcal{D}^{w}$, $f\leftarrow f^{w}$, and ${\mathbb{V}}\leftarrow{\mathbb{V}}^{\prime}$. In this case, it is customary to employ test functionals $l^{(i)}\in{\mathbb{V}}^{\prime\prime}$ induced by test functions $\psi^{(i)}\in{\mathbb{V}}$ such that $l^{(i)}[\mathcal{D}^{w}[{\bm{u}}]]=\mathcal{D}^{w}[{\bm{u}}][\psi^{(i)}]$ and $l^{(i)}[f^{w}]=f^{w}[\psi^{(i)}].$\footnote{This uses the fact that there is an isometric embedding $\iota\colon{\mathbb{V}}\to{\mathbb{V}}^{\prime\prime},v\mapsto(l\mapsto l[v]),$ where ${\mathbb{V}}^{\prime\prime}$ denotes the strong bidual of ${\mathbb{V}}$ (Yosida, 1995, Section IV.8).} In particular, in the example from section 2.1.1, this implies $l^{(i)}[\mathcal{D}^{w}[{\bm{u}}]]=B[{\bm{u}},\psi^{(i)}]$ and $l^{(i)}[f^{w}]=\langle\dot{q}_{V},\psi^{(i)}\rangle_{L_{2}\left({\mathbb{D}}\right)}.$ Following Fletcher (1984), we will also refer to these methods as methods of weighted residuals.

Table 1 lists the aforementioned examples of MWRs together with the corresponding trial and test function(al)s that induce the method.
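As a minimal sketch of equations 2.6 to 2.8, the following Python snippet applies a Galerkin method to the weak form of the one-dimensional Poisson problem $-u''=f$ on $(0,1)$ with homogeneous Dirichlet boundary conditions. The sine trial and test functions, the trapezoidal quadrature, and all function names are illustrative assumptions rather than part of the exposition above.

```python
import numpy as np


def galerkin_poisson_1d(f, m=8, n_quad=2001):
    """Galerkin sketch for -u'' = f on (0, 1) with u(0) = u(1) = 0.

    Trial and test functions are phi_j(x) = sin(j * pi * x), j = 1..m
    (an illustrative choice; they satisfy the boundary conditions).
    """
    x = np.linspace(0.0, 1.0, n_quad)
    h = x[1] - x[0]
    w = np.full(n_quad, h)  # trapezoidal quadrature weights
    w[0] = w[-1] = h / 2.0

    j = np.arange(1, m + 1)[:, None]
    phi = np.sin(np.pi * j * x[None, :])               # phi_j(x)
    dphi = np.pi * j * np.cos(np.pi * j * x[None, :])  # phi_j'(x)

    # D_hat[i, j] = B[phi_j, psi_i] = int phi_j'(x) psi_i'(x) dx
    D_hat = (dphi * w) @ dphi.T
    # f_hat[i] = int f(x) psi_i(x) dx
    f_hat = (phi * w) @ f(x)

    c = np.linalg.solve(D_hat, f_hat)  # c = D_hat^{-1} f_hat, cf. equation 2.8

    def u_hat(xs):
        xs = np.atleast_1d(np.asarray(xs, dtype=float))
        return c @ np.sin(np.pi * j * xs[None, :])

    return c, u_hat


# Manufactured problem with known solution u(x) = sin(pi * x).
c, u_hat = galerkin_poisson_1d(lambda x: np.pi**2 * np.sin(np.pi * x))
```

Because the trial functions vanish on the boundary, this is an interior method in the sense described above. For this manufactured right-hand side, the coefficient of the first trial function should be close to one and the others close to zero.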

2.2 Gaussian Processes

A Gaussian process (GP) ${\mathrm{f}}$ with index set ${\mathbb{X}}$ is a family $\{{\mathrm{f}}_{\bm{x}}\}_{{\bm{x}}\in{\mathbb{X}}}$ of real-valued random variables on a common probability space $(\Omega,\mathcal{F},\mathrm{P})$, such that, for each finite set of indices ${\bm{x}}_{1},\dotsc,{\bm{x}}_{n}\in{\mathbb{X}}$, the joint distribution of ${\mathrm{f}}_{{\bm{x}}_{1}},\dotsc,{\mathrm{f}}_{{\bm{x}}_{n}}$ is Gaussian. We also write ${\mathrm{f}}({\bm{x}})\coloneqq{\mathrm{f}}_{\bm{x}}$ and ${\mathrm{f}}({\bm{x}},\omega)\coloneqq{\mathrm{f}}_{\bm{x}}(\omega)$. The function ${\bm{x}}\mapsto\operatorname{\mathbb{E}}_{\mathrm{P}}\left[{\mathrm{f}}({\bm{x}})\right]$ is called the mean (function) of ${\mathrm{f}}$ and the function $({\bm{x}}_{1},{\bm{x}}_{2})\mapsto\operatorname{Cov}_{\mathrm{P}}\left[{\mathrm{f}}({\bm{x}}_{1}),{\mathrm{f}}({\bm{x}}_{2})\right]$ is called the covariance function or kernel of ${\mathrm{f}}$. We write ${\mathrm{f}}\sim{\operatorname{\mathcal{GP}}\left(m,k\right)}$ to indicate that ${\mathrm{f}}$ is a Gaussian process with mean function $m$ and covariance function $k$. For each $\omega\in\Omega$, the function ${\mathrm{f}}(\cdot,\omega)\colon{\mathbb{X}}\to\mathbb{R},{\bm{x}}\mapsto{\mathrm{f}}({\bm{x}},\omega)$ is called a sample or (sample) path of the Gaussian process. We denote the set of all sample paths of ${\mathrm{f}}$ by $\operatorname{paths}\left({\mathrm{f}}\right)\coloneqq\{{\mathrm{f}}(\cdot,\omega)\colon\omega\in\Omega\}\subset\mathbb{R}^{{\mathbb{X}}}.$

The sample paths of Gaussian processes are always real-valued. However, especially in the context of PDEs, vector-valued functions are ubiquitous, e.g. when dealing with vector fields such as the electric field. Fortunately, the index set of a Gaussian process can be chosen freely, which means that we can “emulate” vector-valued GPs. More precisely, a function ${\bm{f}}\colon{\mathbb{X}}\to\mathbb{R}^{d^{\prime}}$ can be equivalently viewed as a function $\tilde{f}\colon\{1,\dotsc,d^{\prime}\}\times{\mathbb{X}}\to\mathbb{R}$ with $\tilde{f}(i,{\bm{x}})\coloneqq{f}_{i}({\bm{x}})$. Applying this construction to a Gaussian process leads to the notion of a multi-output Gaussian process: A $d^{\prime}$-output Gaussian process ${\bm{\mathrm{f}}}$ with index set ${\mathbb{X}}$ is a family $\{{\bm{\mathrm{f}}}_{\bm{x}}\}_{{\bm{x}}\in{\mathbb{X}}}$ of $\mathbb{R}^{d^{\prime}}$-valued random variables on $(\Omega,\mathcal{F},\mathrm{P})$ such that $\tilde{{\mathrm{f}}}\coloneqq\{({\bm{\mathrm{f}}}_{\bm{x}})_{i}\}_{(i,{\bm{x}})\in\{1,\dotsc,d^{\prime}\}\times{\mathbb{X}}}$ is a Gaussian process. As before, we define ${\bm{\mathrm{f}}}({\bm{x}})\coloneqq{\bm{\mathrm{f}}}_{\bm{x}}$ and ${\bm{\mathrm{f}}}({\bm{x}},\omega)\coloneqq{\bm{\mathrm{f}}}_{\bm{x}}(\omega)$. The mean function ${\bm{m}}\colon{\mathbb{X}}\to\mathbb{R}^{d^{\prime}}$ and covariance function ${\bm{k}}\colon{\mathbb{X}}\times{\mathbb{X}}\to\mathbb{R}^{d^{\prime}\times d^{\prime}}$ of ${\bm{\mathrm{f}}}$ are defined by

{\bm{m}}({\bm{x}})=\begin{pmatrix}\tilde{m}(1,{\bm{x}})\\ \vdots\\ \tilde{m}(d^{\prime},{\bm{x}})\end{pmatrix}\quad\text{and}\quad{\bm{k}}({\bm{x}}_{1},{\bm{x}}_{2})=\begin{pmatrix}\tilde{k}((1,{\bm{x}}_{1}),(1,{\bm{x}}_{2}))&\ldots&\tilde{k}((1,{\bm{x}}_{1}),(d^{\prime},{\bm{x}}_{2}))\\ \vdots&\ddots&\vdots\\ \tilde{k}((d^{\prime},{\bm{x}}_{1}),(1,{\bm{x}}_{2}))&\ldots&\tilde{k}((d^{\prime},{\bm{x}}_{1}),(d^{\prime},{\bm{x}}_{2}))\end{pmatrix},

where $\tilde{{\mathrm{f}}}\sim{\operatorname{\mathcal{GP}}\left(\tilde{m},\tilde{k}\right)}$, and we write ${\bm{\mathrm{f}}}\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$.
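The index-set construction can be illustrated with a short Python sketch. It assumes, purely for illustration, a separable scalar kernel $\tilde{k}((i,{\bm{x}}_{1}),(j,{\bm{x}}_{2}))=A_{ij}\,k_{\mathrm{RBF}}({\bm{x}}_{1},{\bm{x}}_{2})$ with a symmetric positive semi-definite matrix $A$; this particular kernel and the names `k_tilde`, `k_matrix`, and `gram` are hypothetical choices. Output indices are zero-based in the code.

```python
import numpy as np


def k_tilde(i, x1, j, x2, A, lengthscale=1.0):
    """Scalar kernel on the augmented index set {0, ..., d'-1} x X."""
    diff = np.atleast_1d(x1).astype(float) - np.atleast_1d(x2).astype(float)
    return A[i, j] * np.exp(-0.5 * np.dot(diff, diff) / lengthscale**2)


def k_matrix(x1, x2, A, lengthscale=1.0):
    """Matrix-valued kernel k(x1, x2) assembled entrywise from k_tilde."""
    d_out = A.shape[0]
    return np.array(
        [[k_tilde(i, x1, j, x2, A, lengthscale) for j in range(d_out)]
         for i in range(d_out)]
    )


def gram(xs, A, lengthscale=1.0):
    """Joint covariance of all outputs at all points in xs."""
    return np.block([[k_matrix(x1, x2, A, lengthscale) for x2 in xs] for x1 in xs])


A = np.array([[1.0, 0.5], [0.5, 2.0]])  # assumed output covariance, d' = 2
G = gram([0.0, 0.3, 1.0], A)            # 6 x 6 joint covariance matrix
```

The matrix `G` is the covariance of the scalar process $\tilde{{\mathrm{f}}}$ evaluated at all pairs $(i,{\bm{x}})$, which is why it has to be symmetric and positive semi-definite.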

3 Learning the Solution to a Linear PDE

Consider a linear partial differential equation $\mathcal{D}[{\bm{u}}]=f$ subject to linear boundary conditions $\mathcal{B}[{\bm{u}}]=g$ as in section 2.1. Our goal is to find a solution ${\bm{u}}\in{\mathbb{U}}$ satisfying the PDE for (partially) known $(\mathcal{D},f)$ and $(\mathcal{B},g)$ . In general, one cannot find a closed-form expression for the solution ${\bm{u}}$ (Borthwick, 2018). Therefore, we aim to compute an accurate approximation $\hat{{\bm{u}}}\approx{\bm{u}}$ instead. Motivated by the challenges (C1), (C2) and (C3) of partial information inherent to numerically solving PDEs, we approach the problem from a statistical inference perspective. In other words, we will learn the solution of the PDE from multiple heterogeneous sources of information. This way we can quantify the epistemic uncertainty about the solution at any time during the computation, as figure 1(a) illustrates.

Indirectly Observing the Solution of a PDE

Typically, we think of observations as a finite number of direct measurements ${\bm{u}}({\bm{x}}_{i})={\bm{y}}_{i}$ of the latent function ${\bm{u}}$ . As it turns out, we can generalize this notion of a measurement and even interpret the PDE itself as an (indirect) observation of ${\bm{u}}$ . As an example, consider the important case where ${\bm{u}}$ models the state of a physical system. The laws of physics governing such a system are often formulated as conservation laws in the language of PDEs. For example, they may require physical quantities like mass, momentum, charge or energy to be conserved over time.

Example 1 (Thermal Conduction and the Heat Equation).

Say we want to simulate heat conduction in a solid object with shape ${\mathbb{D}}\subset\mathbb{R}^{3}$ , i.e. we want to find the time-varying temperature distribution $u\colon[0,T]\times{\mathbb{D}}\to\mathbb{R}$ . Neglecting radiation and convection, $u(t,{\bm{x}})$ is described by a linear PDE known as the heat equation (Lienhard and Lienhard, 2020). Assuming spatially and temporally uniform material parameters $c_{p},\rho,\kappa\in\mathbb{R}$ , it reduces to

\left(c_{p}\rho\frac{\partial}{\partial t}-\kappa\Delta\right)u-\dot{q}_{V}=0.

(3.1)

Thermal conduction is described by $-\kappa\Delta u$ , while $\dot{q}_{V}$ are local heat sources, e.g. from electric currents. Any energy flowing into a region due to conduction or a heat source is balanced by an increase in energy of the material. The net-zero balance shows that energy is conserved.

Notice how a conservation law is an observation of the behavior of the physical system! To formalize this, we begin by rephrasing the classical notion of an observation at a point ${\bm{x}}_{i}$ as measuring the result of a specific linear operator applied to the solution ${\bm{u}}$ :

{\bm{u}}({\bm{x}}_{i})={\bm{y}}_{i}\iff\delta_{{\bm{x}}_{i}}[{\bm{u}}]={\bm{y}}_{i}

where $\delta_{{\bm{x}}_{i}}$ is the evaluation functional. Now, the key idea is to generalize the notion of a direct observation to collecting information about the solution via an arbitrary linear operator ${\bm{\mathcal{L}}}$ with values in $\mathbb{R}^{n}$ applied to the solution ${\bm{u}}$, such that ${\bm{\mathcal{L}}}[{\bm{u}}]={\bm{y}}\iff{\bm{\mathcal{L}}}[{\bm{u}}]-{\bm{y}}={\bm{0}}.$ The affine operator

{\bm{\mathcal{I}}}[{\bm{u}}]\coloneqq{\bm{\mathcal{L}}}[{\bm{u}}]-{\bm{y}}

(3.2)

is a specific kind of information operator (Cockayne et al., 2019b). In this setting, the information operator may describe a conservation law as in equation 3.1, a general linear PDE of the form (2.1), or an arbitrary affine operator mapping a function space into $\mathbb{R}^{n}$. This generalized notion of an observation turns out to be a powerful way to incorporate different kinds of mathematical, physical, or experimental properties of the solution. Since PDEs and conservation laws are often assumed to hold exactly, we focused on noise-free observations above. However, we are generally not limited to this case and can also model ${\bm{y}}$ as a random variable, in which case the information operator $\mathcal{I}[({\bm{u}},{\bm{y}})]$ is a (jointly) linear functional of the solution ${\bm{u}}$ and the right-hand side ${\bm{y}}$.
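To make the notion of an information operator concrete, the following sketch stacks two linear functionals, a point evaluation and a weighted integral on $(0,1)$, into an operator with values in $\mathbb{R}^{2}$ and subtracts the observed values as in equation 3.2. Representing functions as Python callables, the trapezoidal quadrature, and all names are assumptions made for illustration only.

```python
import numpy as np


def point_evaluation(x):
    """Evaluation functional delta_x: u -> u(x)."""
    return lambda u: float(u(np.asarray(x, dtype=float)))


def weighted_integral(psi, n_quad=2001):
    """Functional u -> int_0^1 psi(x) u(x) dx (trapezoidal rule)."""
    x = np.linspace(0.0, 1.0, n_quad)
    h = x[1] - x[0]
    w = np.full(n_quad, h)
    w[0] = w[-1] = h / 2.0
    return lambda u: float(np.sum(w * psi(x) * u(x)))


def information_operator(functionals, y):
    """Affine operator I[u] = L[u] - y with L stacking the functionals."""
    y = np.asarray(y, dtype=float)
    return lambda u: np.array([l(u) for l in functionals]) - y


u_true = lambda x: np.sin(np.pi * x)
L = [point_evaluation(0.5), weighted_integral(lambda x: np.ones_like(x))]
# Observed values: u(0.5) = 1 and the integral of sin(pi x) over (0, 1), 2 / pi.
I = information_operator(L, y=[1.0, 2.0 / np.pi])
residual = I(u_true)  # close to the zero vector, up to quadrature error
```

A function that is consistent with all observations is mapped to (approximately) zero, which is the sense in which the PDE, the boundary conditions, and direct measurements can be treated uniformly.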

3.1 Solving PDEs as a Bayesian Inference Problem

One of the main challenges (C1), (C2) and (C3) outlined in the beginning is the limited computational budget available to us to approximate the solution. Fortunately, in practice, the solution ${\bm{u}}$ is not hopelessly unconstrained, since we usually have a-priori information about it. At the very least, we know the space of functions ${\mathbb{U}}$ in which to search for the solution. Additionally, we might have expert knowledge about its rough shape and value range, or solutions to related PDEs at our disposal. Now, the question becomes: How do we combine this prior knowledge with indirect observations of the solution through the information operator ${\bm{\mathcal{I}}}$ from equation 3.2? To do so, we turn to the Bayesian inference framework. This provides a different perspective on the numerical problem of solving a linear PDE as a learning task.

Gaussian Process Inference

We represent our belief about the solution of the linear PDE via a (multi-output) Gaussian process ${\bm{\mathrm{u}}}\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$ with mean function ${\bm{m}}\colon{\mathbb{D}}\to\mathbb{R}^{d^{\prime}}$ and kernel ${\bm{k}}\colon{\mathbb{D}}\times{\mathbb{D}}\to\mathbb{R}^{d^{\prime}\times d^{\prime}}$. Gaussian processes are well-suited for this purpose since:

(i)

For an appropriate choice of kernel, the Gaussian process defines a probability measure over the function space in which the PDE’s solution is sought.
(ii)

Kernels provide a powerful modeling toolkit to incorporate prior information (e.g. variability, periodicity, multi-scale effects, in- / equivariances, …) in a modular fashion.
(iii)

Measurement noise often follows a Gaussian distribution.
(iv)

Conditioning a Gaussian process on observations made via a linear map again results in a Gaussian process.

While the result in (iv) is used ubiquitously in the literature, its general form where observations are made via arbitrary linear operators with values in $\mathbb{R}^{n}$ as opposed to finite-dimensional linear maps, has only been rigorously demonstrated for Gaussian measures on separable Hilbert spaces, not for the Gaussian process perspective, to the best of our knowledge. The two perspectives are closely related, but there are thorny technical difficulties to consider. We intentionally frame the problem from the Gaussian process perspective to make use of the expressive modeling capabilities provided by the kernel. Our framework at its core relies on this result, which we explain in detail in section 4 and prove in section B.3.
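To illustrate property (iv) in the PDE setting, the following sketch conditions a zero-mean GP prior with an RBF kernel on the one-dimensional Poisson problem $-u''=f$ on $(0,1)$: the boundary values are observed through evaluation functionals and the PDE through $-\mathrm{d}^{2}/\mathrm{d}x^{2}$ applied at collocation points. The RBF kernel, its length scale, the collocation grid, and the jitter term are illustrative assumptions. The update uses the familiar Gaussian conditioning formulas with the operator applied to the kernel arguments, and it is a sketch of the idea rather than the general statement proved in section B.3.

```python
import numpy as np

ELL = 0.2  # RBF length scale (illustrative choice)


def _s(x, xp):
    return (x[:, None] - xp[None, :]) / ELL


def k_uu(x, xp):
    """Cov[u(x), u(x')] for the RBF kernel."""
    return np.exp(-0.5 * _s(x, xp) ** 2)


def k_uL(x, xp):
    """Cov[u(x), -u''(x')]: minus the second derivative of k in x'."""
    s = _s(x, xp)
    return (1.0 - s**2) / ELL**2 * np.exp(-0.5 * s**2)


def k_LL(x, xp):
    """Cov[-u''(x), -u''(x')]: fourth mixed derivative of k."""
    s = _s(x, xp)
    return (s**4 - 6.0 * s**2 + 3.0) / ELL**4 * np.exp(-0.5 * s**2)


def solve_poisson_gp(f, x_bc, g_bc, x_col, x_test, jitter=1e-10):
    """Posterior mean and covariance of u at x_test."""
    y = np.concatenate([g_bc, f(x_col)])  # stacked observations
    # Gram matrix of the stacked linear functionals applied to the kernel.
    G = np.block([
        [k_uu(x_bc, x_bc), k_uL(x_bc, x_col)],
        [k_uL(x_bc, x_col).T, k_LL(x_col, x_col)],
    ])
    G = G + jitter * np.eye(len(y))  # small jitter for numerical stability
    # Cross-covariance between u(x_test) and the observations.
    K_x = np.hstack([k_uu(x_test, x_bc), k_uL(x_test, x_col)])
    mean = K_x @ np.linalg.solve(G, y)
    cov = k_uu(x_test, x_test) - K_x @ np.linalg.solve(G, K_x.T)
    return mean, cov


# Manufactured problem with known solution u(x) = sin(pi * x).
mean, cov = solve_poisson_gp(
    f=lambda x: np.pi**2 * np.sin(np.pi * x),
    x_bc=np.array([0.0, 1.0]),
    g_bc=np.zeros(2),
    x_col=np.linspace(0.0, 1.0, 11),
    x_test=np.array([0.0, 0.5, 1.0]),
)
```

The posterior mean plays the role of the numerical solution, while the diagonal of `cov` gives a pointwise uncertainty estimate for the discretization error under the assumed prior.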

3.1.1 Encoding Prior Knowledge about the Solution

We can infer the solution of a linear PDE more quickly by specifying inductive biases in the prior, which can encode both provable and approximately known properties of the solution.\footnote{In the special case of GP regression, if the prior smoothness matches the smoothness of the target function ${\bm{u}}$, the convergence rate is optimal in the number of observations (Kanagawa et al., 2018, Thm. 5.1).}

Function Space of the Solution

The most basic known property derived from the PDE is an appropriate choice of function space for the solution. For strong solutions, this can be done by inspecting the differential operator $\mathcal{D}$ and keeping track of the partial derivatives. In fact, in an implementation this can be derived automatically from the problem definition alone, e.g. by compositionally defining differential operators and storing information on the necessary differentiability. Let ${\beta}_{i}\in\mathbb{N}_{0}$ be the maximal number of times any partial derivative in the differential operator $\mathcal{D}$ differentiates w.r.t. the variable ${x}_{i}$.\footnote{Formally, ${\bm{\beta}}\in\mathbb{N}_{0}^{d}$ is the “smallest” multi-index such that ${\bm{\alpha}}\leq{\bm{\beta}}$ for every multi-index ${\bm{\alpha}}$ of a partial derivative occurring in $\mathcal{D}$ (see definitions 22 and 23).} Then a sensible choice of solution space is the space ${\mathbb{U}}=C^{{\bm{\beta}}}(\overline{{\mathbb{D}}})$ (see section B.2). To define a prior with paths in this solution space, a common choice of prior covariance function is the tensor product of one-dimensional half-integer Matérn kernels $k_{{\nu}_{i}}$ with ${\nu}_{i}={\beta}_{i}+\frac{1}{2}$ (see section B.4). For weak solutions, the Sobolev spaces ${\mathbb{U}}=H^{m}\left({\mathbb{D}}\right)$ are prototypical choices of solution spaces. In this case, a (multivariate) Matérn kernel with smoothness parameter $\nu=m+\frac{1}{2}$ is a useful default prior covariance function. In both cases, a parametric kernel $k({\bm{x}}_{0},{\bm{x}}_{1})={\bm{\phi}}({\bm{x}}_{0})^{\top}{\bm{\Sigma}}{\bm{\phi}}({\bm{x}}_{1})$ is also a valid choice if ${\phi}_{i}\in{\mathbb{U}}$. See section B.4.2 for a detailed account of how to choose priors for physics-informed GP regression.

Symmetries, In- and Equivariances

Many solutions of PDEs exhibit a-priori known symmetries. For example, to calculate the field of a magnet rotated by ${\bm{R}}\colon\mathbb{R}^{3}\to\mathbb{R}^{3}$, one can equivalently compute the field ${\bm{B}}$ of the magnet in its original position and rotate the field, i.e. ${\bm{B}}({\bm{R}}{\bm{x}})={\bm{R}}{\bm{B}}({\bm{x}})$. Inductive biases reflecting symmetries can be encoded via kernels that are invariant, ${\bm{k}}({\bm{\rho}}_{g}{\bm{x}}_{0},{\bm{\rho}}_{g}{\bm{x}}_{1})={\bm{k}}({\bm{x}}_{0},{\bm{x}}_{1})$, or equivariant, ${\bm{k}}({\bm{\rho}}_{g}{\bm{x}}_{0},{\bm{\rho}}_{g}{\bm{x}}_{1})={\bm{\rho}}_{g}{\bm{k}}({\bm{x}}_{0},{\bm{x}}_{1}){\bm{\rho}}_{g}^{*}$, where ${\bm{\rho}}_{g}$ is a unitary group representation. The most commonly used kernels are stationary, i.e. translation invariant, but one can also construct invariant (Haasdonk and Burkhardt, 2007; Azangulov et al., 2022), as well as equivariant kernels (Reisert and Burkhardt, 2007; Holderrieth et al., 2021) for many other group actions.
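The invariance and equivariance conditions above can be checked numerically for simple kernels. The following is a minimal sketch, assuming a squared-exponential kernel and a rotation about the $z$-axis as the group action; the function names `rbf_kernel`, `rotation_z`, and `equivariant_kernel` are illustrative and not taken from the paper.

```python
import numpy as np


def rbf_kernel(x0, x1, lengthscale=1.0):
    """Squared-exponential kernel; depends on the inputs only via ||x0 - x1||."""
    return np.exp(-0.5 * np.sum((x0 - x1) ** 2) / lengthscale**2)


def rotation_z(angle):
    """Rotation about the z-axis, an orthogonal (hence unitary) matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def equivariant_kernel(x0, x1):
    """Matrix-valued kernel rbf(x0, x1) * I, a simple rotation-equivariant example."""
    return rbf_kernel(x0, x1) * np.eye(3)


rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=3), rng.normal(size=3)
R = rotation_z(0.7)
shift = rng.normal(size=3)

k_original = rbf_kernel(x0, x1)
k_rotated = rbf_kernel(R @ x0, R @ x1)            # rotation invariance
k_shifted = rbf_kernel(x0 + shift, x1 + shift)    # stationarity

# Equivariance: k(R x0, R x1) = R k(x0, x1) R^T
lhs = equivariant_kernel(R @ x0, R @ x1)
rhs = R @ equivariant_kernel(x0, x1) @ R.T
```

The scalar kernel is invariant because rotations and translations preserve Euclidean distances; the matrix-valued example is equivariant only in a trivial way, and richer constructions are given in the references cited above.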

Domain Expertise

Domain experts often have approximate knowledge of what solutions can be expected, either from experience, previous experiments, or familiarity with the physical interpretation of the solution ${\bm{u}}$. For example, an engineer who designs electrical components is likely able to give realistic temperature ranges for a component for which we aim to simulate the temperature distribution. This knowledge can be included by choosing the (initial) kernel hyperparameters, such as the output scale and the lengthscales, based on this expertise.

3.1.2 (Indirectly) Observing the Solution

From a computational perspective, the most important reason for choosing Gaussian processes is that when conditioning on linear observations, the resulting posterior is again a Gaussian process with closed-form mean and covariance function (Bishop, 2006). We extend this classic result from observations via a finite-dimensional linear map to general $\mathbb{R}^{n}$-valued linear operators in theorem 1. This is crucial to condition on the different types of observations, most importantly the PDE itself, made via the information operator in (3.2). Given an affine observation defined via a linear operator ${\bm{\mathcal{L}}}\colon{\mathbb{U}}\to\mathbb{R}^{n}$ and an independent Gaussian random variable ${\bm{\mathrm{\epsilon}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mu}},{\bm{\Sigma}}\right)}$, we can use theorem 1 to condition our prior belief on the observation and obtain a posterior of the form ${\bm{\mathrm{u}}}\mid({\bm{\mathcal{L}}}[{\bm{\mathrm{u}}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}})\sim{\operatorname{\mathcal{GP}}\left({\bm{m}}^{{\bm{\mathrm{u}}}\mid{\bm{y}}},{\bm{k}}^{{\bm{\mathrm{u}}}\mid{\bm{y}}}\right)}$ with mean and covariance function given by

{m}^{{\bm{\mathrm{u}}}\mid{\bm{y}}}_{i}({\bm{x}})={m}_{i}({\bm{x}})+{\bm{\mathcal{L}}}[{k}_{:,i}(\cdot,{\bm{x}})]^{\top}\left({\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}}\right)^{\dagger}\left({\bm{y}}-({\bm{\mathcal{L}}}[{\bm{m}}]+{\bm{\mu}})\right),

(3.3)

{k}^{{\bm{\mathrm{u}}}\mid{\bm{y}}}_{i,j}({\bm{x}}_{1},{\bm{x}}_{2})={k}_{i,j}({\bm{x}}_{1},{\bm{x}}_{2})-{\bm{\mathcal{L}}}[{k}_{:,i}(\cdot,{\bm{x}}_{1})]^{\top}\left({\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}}\right)^{\dagger}{\bm{\mathcal{L}}}[{k}_{:,j}(\cdot,{\bm{x}}_{2})].

(3.4)
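To make the conditioning step concrete, the following is a minimal finite-dimensional sketch in which functions are represented by their values on a grid, so that the linear operator becomes a matrix. The function name `condition_gaussian`, the squared-exponential prior, and the toy observation operator are assumptions made for illustration; they are not part of the paper's implementation.

```python
import numpy as np


def condition_gaussian(m, K, L, y, mu=None, Sigma=None):
    """Condition u ~ N(m, K) on L @ u + eps = y with eps ~ N(mu, Sigma).

    Finite-dimensional analogue of conditioning a GP on a linear observation:
    the operator L is a matrix acting on grid values of the function.
    """
    n = L.shape[0]
    mu = np.zeros(n) if mu is None else mu
    Sigma = np.zeros((n, n)) if Sigma is None else Sigma

    cross = K @ L.T                       # plays the role of L[k(., x)]
    gram = L @ K @ L.T + Sigma            # L k L' + Sigma
    gram_pinv = np.linalg.pinv(gram)      # pseudoinverse of the Gram matrix

    post_mean = m + cross @ gram_pinv @ (y - (L @ m + mu))
    post_cov = K - cross @ gram_pinv @ cross.T
    return post_mean, post_cov


# Hypothetical toy problem: squared-exponential prior on a 1D grid, observed
# through two point evaluations and one grid average (all noise-free).
x = np.linspace(0.0, 1.0, 21)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2)
m = np.zeros_like(x)

L = np.zeros((3, x.size))
L[0, 0] = 1.0               # evaluation at x = 0
L[1, -1] = 1.0              # evaluation at x = 1
L[2, :] = 1.0 / x.size      # average over the grid
y = np.array([1.0, -1.0, 0.2])

post_mean, post_cov = condition_gaussian(m, K, L, y)
```

In this noise-free setting the posterior mean reproduces the observations exactly, and the marginal variances never exceed those of the prior, since the correction term subtracted from the prior covariance is positive semi-definite.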

We will now look more closely at how we can condition on the boundary conditions, the PDE itself and direct measurements of the solution.

Observing the Solution via the PDE

The differential operator $\mathcal{D}$ in equation 2.1 is linear and therefore it is tempting to define the information operator $\mathcal{I}[{\bm{\mathrm{u}}}]=\mathcal{D}[{\bm{\mathrm{u}}}]-f$ and attempt to condition on $\mathcal{I}[{\bm{\mathrm{u}}}]=0$. Under some assumptions on ${\mathbb{U}}$, $\mathcal{D}$, and ${\bm{\mathrm{u}}}$, one can even show that this is well-defined. Unfortunately, it turns out that computing the posterior moments is then at least as hard as solving the PDE directly and thus typically intractable in practice. Loosely speaking, this is because $f$ is a function and hence $\mathcal{D}[{\bm{\mathrm{u}}}]=f$ corresponds to an infinite number of observations. However, by only enforcing the PDE at a finite number of points in the domain, we can immediately give a canonical example of an approximation to this intractable information operator. Concretely, we can condition ${\bm{\mathrm{u}}}$ on the fact that the PDE holds at a finite sequence of well-chosen domain points ${\bm{X}}_{\text{PDE}}=({\bm{x}}_{i})_{i=1}^{n}\in{\mathbb{D}}^{n}$, i.e. we compute ${\bm{\mathrm{u}}}\mid(\mathcal{D}[{\bm{\mathrm{u}}}]({\bm{X}}_{\text{PDE}})-f({\bm{X}}_{\text{PDE}})=0)$ by choosing ${\bm{\mathcal{L}}}=\delta_{{\bm{X}}_{\text{PDE}}}\circ\mathcal{D}$ and ${\bm{y}}=f({\bm{X}}_{\text{PDE}})$. If the set ${\bm{X}}_{\text{PDE}}$ of domain points is dense enough, we obtain a good approximation to the exact conditional process. This approach, known as the probabilistic meshless method (Cockayne et al., 2017), is analogous to existing non-probabilistic approaches to solving PDEs, commonly referred to as collocation methods, wherein the points ${\bm{X}}_{\text{PDE}}$ are called collocation points. Satisfying the PDE at a set of collocation points is far from the only choice within our general framework.
For example, we can choose a set of test functions $l^{(i)}\in{\mathbb{V}}^{\prime}$ with which we observe the PDE, such that ${\mathcal{L}}_{i}[{\bm{\mathrm{u}}}]=l^{(i)}[\mathcal{D}[{\bm{\mathrm{u}}}]]$ and ${\bm{y}}_{i}=l^{(i)}[f]$. For efficient evaluation of the differential operator, we can further represent the solution in a basis of trial functions from a subspace $\hat{{\mathbb{U}}}$, resulting in ${\mathcal{L}}_{i}[{\bm{\mathrm{u}}}]=l^{(i)}[\mathcal{D}[\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{\mathrm{u}}}]]]$. This turns out to be very powerful and is analogous to some of the most successful classical PDE solvers. In fact, for certain priors and choices of subspaces, our framework recovers several important classic solvers in the posterior mean (see section 3.3.4). The above can be applied to both time-dependent and time-independent PDEs, regardless of the type of linear PDE (e.g. elliptic, parabolic, hyperbolic). Moreover, an extension to systems of linear PDEs is straightforward.
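The collocation idea can be illustrated with a small numerical sketch for the one-dimensional problem $-u''=f$. It assumes a zero-mean prior with a squared-exponential kernel (chosen here only because its derivatives have simple closed forms, not because the paper uses it), a lengthscale `ELL`, and six equispaced collocation points; all names are hypothetical.

```python
import numpy as np

ELL = 0.2  # lengthscale of the squared-exponential prior (an assumption)


def k(x, y):
    """Squared-exponential kernel k(x, y) = exp(-(x - y)^2 / (2 ELL^2))."""
    r = x[:, None] - y[None, :]
    return np.exp(-0.5 * r**2 / ELL**2)


def d2_k(x, y):
    """Second derivative of k w.r.t. one argument (identical for either one)."""
    r = x[:, None] - y[None, :]
    return (r**2 / ELL**4 - 1.0 / ELL**2) * k(x, y)


def d2x_d2y_k(x, y):
    """Mixed fourth derivative d^2/dx^2 d^2/dy^2 of k."""
    r = x[:, None] - y[None, :]
    return (r**4 / ELL**8 - 6.0 * r**2 / ELL**6 + 3.0 / ELL**4) * k(x, y)


def f(x):
    """Right-hand side of the PDE D[u] = -u'' = f on (0, 1)."""
    return np.pi**2 * np.sin(np.pi * x)


X_pde = np.linspace(0.0, 1.0, 6)      # collocation points
y = f(X_pde)

# L = delta_X o D with D = -d^2/dx^2; the two minus signs cancel in L k L'.
gram = d2x_d2y_k(X_pde, X_pde)
weights = np.linalg.solve(gram, y)    # (L k L')^{-1} (y - L[m]) with m = 0


def posterior_mean(x):
    """Posterior mean of u at x: L[k(., x)]^T (L k L')^{-1} y."""
    return (-d2_k(X_pde, x)).T @ weights


def pde_residual_of_mean(x):
    """D[posterior mean](x) - f(x), obtained by applying -d^2/dx^2 to the mean."""
    return d2x_d2y_k(x, X_pde) @ weights - f(x)
```

By construction, the posterior mean satisfies the PDE at the collocation points up to round-off. Without boundary conditions it is not the unique solution of the PDE, since any affine function can be added without changing the residual.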

Observing the Solution at the Boundary

As for the PDE, we could attempt to directly condition on the boundary conditions by choosing ${\bm{\mathcal{L}}}=\mathcal{B}$ and ${\bm{y}}=g$. However, we are faced with the same intractability issues that we discussed above. Instead, we observe that the boundary conditions hold at a finite set of points ${\bm{X}}_{\text{BC}}\subset\partial{\mathbb{D}}$, i.e. ${\bm{\mathcal{L}}}=\delta_{{\bm{X}}_{\text{BC}}}\circ\mathcal{B}$ and ${\bm{y}}=g({\bm{X}}_{\text{BC}})$. In practice, the boundary conditions are sometimes only known at a finite set of points, making this a natural choice.

Observing the Solution Directly

Finally, as in standard GP regression, we can directly condition on (noisy) measurements of the solution, for example from a real-world experiment, by choosing ${\bm{\mathcal{L}}}=\delta_{{\bm{X}}_{\text{MEAS}}}$ and ${\bm{y}}={\bm{u}}^{\star}({\bm{X}}_{\text{MEAS}})$.

In summary, the probabilistic viewpoint allows us to

•

encode prior information about the solution,
•

condition on various kinds of (partial) information, such as the boundary condition, the PDE itself, or direct measurements, and
•

output a structured error estimate, reflecting all obtained information and performed computation.

We will now give concrete examples for some of the possible modeling choices described above in a case study.

3.2 Case Study: Modeling the Temperature Distribution in a CPU

Central processing units (CPUs) are pieces of computing hardware that are constrained by the vast amounts of heat they dissipate under computational load. Surpassing the maximum temperature threshold of a CPU for a prolonged period of time can result in reduced longevity or even permanent hardware damage (Michaud, 2019). To counteract overheating, cooling systems are attached to the CPU, which are controlled by digital thermal sensors (DTS). For simplicity, assume that the CPU is under sustained computational load and that the cooling device is controlled in a way such that the die reaches thermal equilibrium.

Example 2 (Stationary Heat Equation).

The temperature distribution of a solid at thermal equilibrium, i.e. $\frac{\partial u}{\partial t}=0$ in Example 1, is described by the linear PDE

-\kappa\Delta u-\dot{q}_{V}=0,

(3.5)

known as the stationary heat equation (Lienhard and Lienhard, 2020). For our choice of material parameters equation 3.5 is equivalent to the Poisson equation with $f=\frac{\dot{q}_{V}}{\kappa}$ .

While the sensors control cooling, they only provide local, limited-precision measurements of the CPU temperature. This is problematic, since the chip may reach critical temperature thresholds in unmonitored regions. Therefore, our goal will be to infer the temperature in the entire CPU. We will use our framework to integrate the physics of heat flow, the controlled cooling at the boundary, and the noisy temperature measurements from the sensors. See figure 2(b) for an illustration of the result. During manufacturing, the resulting belief over the temperature distribution could then help decide whether the CPU design needs to be changed to avoid premature failure. From here on out, we focus on a 1D slice across the CPU surface, as shown in figure 2(a) (top), to easily visualize uncertainty.

Encoding Prior Knowledge

By inspecting the PDE’s differential operator $\mathcal{D}=-\kappa\Delta=-\kappa\sum_{i=1}^{d}\frac{\partial^{2}}{\partial{x}_{i}^{2}}$, we can deduce that the paths of our Gaussian process need to be twice-differentiable in every input variable ${x}_{i}$. The construction in section B.4.1 tells us that a GP prior whose covariance function is a tensor product $k_{{\bm{\nu}}}$ of one-dimensional Matérn kernels $k_{{\nu}_{i}}$ with ${\nu}_{i}=2+\frac{1}{2}=\frac{5}{2}$ fulfills the desired path properties. Assume we also know what temperature ranges are plausible from similar CPU architectures, which we encode by setting the kernel output scale to $\sigma_{\text{out}}^{2}=9$. Figure 3 shows the prior process ${\mathrm{u}}$ along with its image $\mathcal{D}[{\mathrm{u}}]\sim{\operatorname{\mathcal{GP}}\left(\mathcal{D}[m],\sigma_{\text{out}}^{2}\mathcal{D}k\mathcal{D}^{\prime}\right)}$ under the differential operator. A draw from $\mathcal{D}[{\mathrm{u}}]$ can be interpreted as the heat sources and sinks that generated the corresponding temperature distribution draw from ${\mathrm{u}}$.
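A prior of this kind is straightforward to write down. The sketch below implements the standard one-dimensional Matérn kernel with $\nu=\frac{5}{2}$ and its tensor product; the output scale $9$ follows the text, whereas the lengthscales, grid, and function names are assumptions made for illustration.

```python
import numpy as np


def matern52(x0, x1, lengthscale=1.0, output_scale_sq=9.0):
    """One-dimensional Matern kernel with nu = 5/2 and variance output_scale_sq."""
    r = np.abs(x0[:, None] - x1[None, :]) / lengthscale
    s = np.sqrt(5.0) * r
    return output_scale_sq * (1.0 + s + s**2 / 3.0) * np.exp(-s)


def matern52_tensor_product(X0, X1, lengthscales, output_scale_sq=9.0):
    """Tensor product of 1D Matern-5/2 kernels over the columns of X0 and X1."""
    K = np.full((X0.shape[0], X1.shape[0]), output_scale_sq)
    for i, ell in enumerate(lengthscales):
        K = K * matern52(X0[:, i], X1[:, i], lengthscale=ell, output_scale_sq=1.0)
    return K


# Prior covariance on a 1D slice (the lengthscale 0.25 is a made-up value).
x = np.linspace(0.0, 1.0, 50)
K = matern52(x, x, lengthscale=0.25)

# Draw three prior samples; the jitter only stabilizes the Cholesky factorization.
rng = np.random.default_rng(0)
chol = np.linalg.cholesky(K + 1e-8 * np.eye(x.size))
samples = chol @ rng.normal(size=(x.size, 3))

# Two-dimensional example of the tensor-product construction.
X2d = rng.uniform(size=(4, 2))
K2d = matern52_tensor_product(X2d, X2d, lengthscales=[0.3, 0.5])
```

The marginal prior variance equals the output scale $\sigma_{\text{out}}^{2}=9$ everywhere, i.e. a marginal standard deviation of $3$ around the prior mean.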

Conditioning on the PDE

We can now inform our belief about the physics of heat conduction using the mechanistic model defined by the stationary heat equation. We choose a set of collocation points ${\bm{X}}_{\text{PDE}}\in{\mathbb{D}}^{n}$ and then condition on the observation that the PDE holds (exactly) at these points. In other words, we compute the physics-informed Gaussian process ${\mathrm{u}}\mid\text{PDE}\coloneqq{\mathrm{u}}\mid({\bm{\mathcal{I}}}^{\text{PDE}}[{\mathrm{u}}]={\bm{0}})$ with ${\bm{\mathcal{I}}}^{\text{PDE}}[{\mathrm{u}}]\coloneqq-\kappa\Delta{\mathrm{u}}\left({\bm{X}}_{\text{PDE}}\right)-\dot{q}_{V}({\bm{X}}_{\text{PDE}})$, visualized in figure 4.

We can see that the resulting conditional process indeed satisfies the PDE exactly at the collocation points (see figure 4(b)). The remaining uncertainty in figure 4(b) is due to the approximation error introduced by only conditioning on a finite number of collocation points. However, while the samples from our belief about the solution in figure 4(a) exhibit much more similarity to the mean function and less spatial variation, the marginal uncertainty hardly decreases. The latter is explained by the PDE not identifying a unique solution, since adding any affine function to $u^{\star}$ does not alter its image under the differential operator, i.e. $\Delta({\bm{a}}^{\top}{\bm{x}}+b)=0$. Hence, there is an at least two-dimensional subspace of functions which cannot be observed. This ambiguity can be resolved by introducing boundary conditions.

Conditioning on the Boundary Conditions

We assume that the CPU cooler extracts heat uniformly from all exposed parts of the CPU, in particular also from the sides, rather than just from the top. Instead of directly specifying the value of the temperature distribution at the edge points of the CPU slice, we only know the density $\dot{q}_{A}$ of heat flowing out of each point on the CPU’s boundary based on the cooler specification. We can use another thermodynamical law to turn this assumption into information about the temperature distribution $u$ .

Example 3 (Fourier’s Law, continuing the thermal conduction example).

Fourier’s law states that the local density of heat $\dot{q}_{A}$ flowing through a surface with normal vector ${\bm{\eta}}$ is proportional to the inner product of the negative temperature gradient and the surface normal ${\bm{\eta}}$, i.e. $\dot{q}_{A}=-\kappa\langle{\bm{\eta}},\nabla u\rangle$, where $\kappa$ is the material’s thermal conductivity in $\mathrm{W}\,\mathrm{m}^{-1}\,\mathrm{K}^{-1}$ (Lienhard and Lienhard, 2020).

Assuming sufficient differentiability of $u$, the inner product above is equal to the directional derivative $\partial_{\eta}u$ of $u$ in direction $\eta$. We can assign an outward-pointing normal vector $\eta(x)$ (almost) everywhere on the boundary of the domain. Since the boundary of the CPU domain is its surface, we can summarize the above in a Neumann boundary condition $-\kappa\partial_{\eta(x)}u\left(x\right)=\dot{q}_{A}(x)$ for $x\in\partial{\mathbb{D}}$. Applying corollary 2 once more, we can inform our estimate of the solution about the boundary conditions by computing ${\mathrm{u}}\mid\text{PDE},\text{NBC}\coloneqq\left({\mathrm{u}}\mid\text{PDE}\right)\mid\left({\bm{\mathcal{I}}}^{\text{NBC}}[({\mathrm{u}},\dot{{\mathrm{q}}}_{A})]={\bm{0}}\right)$, where ${\bm{\mathcal{I}}}^{\text{NBC}}[({\mathrm{u}},\dot{{\mathrm{q}}}_{A})]\coloneqq-\kappa\partial_{\eta({\bm{X}}_{\text{NBC}})}{\mathrm{u}}\left({\bm{X}}_{\text{NBC}}\right)-\dot{{\mathrm{q}}}_{A}({\bm{X}}_{\text{NBC}})$ with ${\bm{X}}_{\text{NBC}}=\{0,w_{\text{CPU}}\}$ is the information operator induced by the boundary conditions. The result is visualized in figure 5(a). The structure of the samples illustrates that most of the remaining uncertainty about the solution lies in a one-dimensional subspace of ${\mathbb{U}}$ corresponding to constant functions. This is due to the fact that Neumann boundary conditions on both sides of the domain only determine the solution of the PDE up to an additive constant. We need an additional source of information to address the remaining degree of freedom.
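The remaining degree of freedom can also be seen in a classical discretization. The sketch below is not the GP construction from the text: it swaps in a standard finite-difference matrix for $-u''$ with zero-flux conditions at both ends (using ghost-point reflection) and checks that its null space is one-dimensional and spanned by the constants. The function name and grid size are hypothetical.

```python
import numpy as np


def neumann_laplacian_1d(n, width=1.0):
    """Finite-difference matrix for -u'' on n grid points with homogeneous
    Neumann (zero-flux) conditions at both ends via ghost-point reflection."""
    h = width / (n - 1)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 2.0
        if i > 0:
            A[i, i - 1] = -1.0
        if i < n - 1:
            A[i, i + 1] = -1.0
    # Reflecting the ghost points (u_{-1} = u_1, u_n = u_{n-2}) doubles the
    # off-diagonal entries in the first and last rows.
    A[0, 1] = -2.0
    A[-1, -2] = -2.0
    return A / h**2


A = neumann_laplacian_1d(n=20)
singular_values = np.linalg.svd(A, compute_uv=False)
null_dim = int(np.sum(singular_values < 1e-8 * singular_values.max()))
constant = np.ones(20)
```

Exactly one singular value vanishes, so the discretized PDE with two Neumann conditions fixes the solution only up to an additive constant, in line with the structure of the posterior samples described above.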

Conditioning on Direct Measurements

Fortunately, CPUs are equipped with digital thermal sensors (DTS) located close to each of the cores, which provide (noisy) local measurements of the core temperatures (Michaud, 2019). These measurements can be straightforwardly accounted for in our model by performing standard GP regression using ${\mathrm{u}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\text{PDE},\text{NBC}$ from figure 5(a) as a prior. The resulting belief about the temperature distribution is visualized in figure 5(b).

We can see that integrating the interior measurements effectively reduces the uncertainty due to the remaining degree of freedom, albeit not completely. The remaining uncertainty is due to the model’s consistent accounting for noise in the thermal sensor readings, the uncertainty about the cooling, i.e. the boundary conditions, and the discretization error incurred by only choosing a small set of collocation points.

Uncertainty in the Right-hand Side and the Boundary Function

Above, we assumed the true heat source term $\dot{q}_{V}$, i.e. the right-hand side of the PDE, and the boundary heat flux $\dot{q}_{A}$ to be known exactly. However, in practice, this is rarely the case. Fortunately, our probabilistic viewpoint admits a straightforward relaxation of this assumption. Namely, we can replace $\dot{q}_{V}$ and $\dot{q}_{A}$ by a joint Gaussian process prior $(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})$, whose means are given by estimates of $\dot{q}_{V}$ and $\dot{q}_{A}$.\footnote{Technically speaking, if the right-hand side of the PDE is given as a Gaussian process, the PDE turns into a stochastic partial differential equation (SPDE).} Above, we assumed that the cooler is controlled in such a way that the temperature distribution in the CPU does not change over time. However, a naive prior $(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})$ may break this assumption. We need to encode that the amount of heat entering the CPU is equal to the amount of heat leaving the CPU via its boundary, i.e.

\mathcal{I}^{\text{STAT}}[(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})]\coloneqq\int_{{\mathbb{D}}}\dot{{\mathrm{q}}}_{V}({\bm{x}})\,\mathrm{d}{\bm{x}}-\int_{\partial{\mathbb{D}}}\dot{{\mathrm{q}}}_{A}({\bm{x}})\,\mathrm{d}A=0.

(3.6)

The (jointly) linear information operator $\mathcal{I}^{\text{STAT}}$ computes the net amount of thermal energy that the CPU gains per unit time. Using theorem 1 we can construct a multi-output GP prior $({\mathrm{u}},\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})$ , which is consistent with the assumption of thermal stationarity by conditioning on $\mathcal{I}^{\text{STAT}}[(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})]=0$ . Here, we assume a-priori that ${\mathrm{u}}$ , $\dot{{\mathrm{q}}}_{V}$ , and $\dot{{\mathrm{q}}}_{A}$ are pairwise independent. In the one-dimensional model, we can simplify equation 3.6 by assuming that heat is drawn uniformly from the sides of the CPU. By encoding this information in the prior $\dot{{\mathrm{q}}}_{A}$ , the information operator corresponding to thermal stationarity resolves to

\mathcal{I}^{\text{STAT}}[(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})]=h_{\text{CPU}}\int_{0}^{w_{\text{CPU}}}\dot{{\mathrm{q}}}_{V}(x)\,\mathrm{d}x-h_{\text{CPU}}\left(\dot{{\mathrm{q}}}_{A}(0)+\dot{{\mathrm{q}}}_{A}(w_{\text{CPU}})\right).

(3.7)

As above, we can now use corollary 2 to condition our physically consistent GP prior $({\mathrm{u}},\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})\mid\text{STAT}\coloneqq({\mathrm{u}},\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})\mid\left(\mathcal{I}^{\text{STAT}}[(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})]=0\right)$ on ${\bm{\mathcal{I}}}^{\text{PDE}}[({\mathrm{u}},\dot{{\mathrm{q}}}_{V})]={\bm{0}}$, as well as ${\bm{\mathcal{I}}}^{\text{NBC}}[({\mathrm{u}},\dot{{\mathrm{q}}}_{A})]={\bm{0}}$ and the noisy measurements of the temperature distribution. Here, it is important to keep track of the cross-covariances in $({\mathrm{u}},\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})$, since the outputs in $(\dot{{\mathrm{q}}}_{V},\dot{{\mathrm{q}}}_{A})\mid\text{STAT}$ become correlated. The resulting process ${\mathrm{u}}\mid\text{PDE},\text{NBC},\text{STAT},\text{DTS}$ (or rather its marginals) is shown in figure 6.

Comparing figures 6(a) and 5(b), we can see that, due to the uncertainty in the right-hand side $\dot{{\mathrm{q}}}_{V}$ of the PDE, the samples of ${\mathrm{u}}\mid\text{PDE},\text{NBC},\text{STAT},\text{DTS}$ exhibit much more spatial variation. Moreover, the samples of the GP posterior over $\dot{{\mathrm{q}}}_{V}$ fulfill the stationarity constraint we imposed.
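The effect of the stationarity constraint can be sketched in a discretized setting. The example below assumes independent Gaussian priors for $\dot{{\mathrm{q}}}_{V}$ on a grid and for $\dot{{\mathrm{q}}}_{A}$ at the two boundary points, approximates the integral in (3.7) by the trapezoidal rule, and conditions on a zero net heat gain. The geometry values, kernels, and prior means are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D setup: grid over the CPU slice and illustrative geometry values.
w_cpu, h_cpu = 1.0, 0.1
x = np.linspace(0.0, w_cpu, 41)

# Trapezoidal quadrature weights approximating the integral of q_V over [0, w_cpu].
quad = np.full(x.size, x[1] - x[0])
quad[[0, -1]] *= 0.5

# Independent priors: q_V on the grid (squared-exponential), q_A at the two ends.
K_qV = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2**2)
K_qA = 0.5 * np.eye(2)
K = np.block([
    [K_qV, np.zeros((x.size, 2))],
    [np.zeros((2, x.size)), K_qA],
])
mean = np.concatenate([np.ones(x.size), 0.5 * np.ones(2)])

# Discretized I_STAT: h * int q_V dx - h * (q_A(0) + q_A(w)), a 1 x N matrix.
L = h_cpu * np.concatenate([quad, [-1.0, -1.0]])[None, :]

gram = L @ K @ L.T                    # 1 x 1: prior variance of the net heat gain
gain = K @ L.T / gram                 # K L' (L K L')^{-1}

post_mean = mean - (gain @ (L @ mean)).ravel()
post_cov = K - gain @ L @ K

# Sample the conditioned process by correcting prior samples (Matheron's rule).
eigval, eigvec = np.linalg.eigh(K)
sqrt_K = eigvec * np.sqrt(np.clip(eigval, 0.0, None))
prior_samples = mean[:, None] + sqrt_K @ rng.normal(size=(K.shape[0], 5))
post_samples = prior_samples - gain @ (L @ prior_samples)

# Cross-covariance between q_V and q_A after conditioning; it is no longer zero.
cross_cov = post_cov[: x.size, x.size:]
```

Every corrected sample satisfies the discretized constraint up to round-off, and the initially independent components become correlated, which is why the cross-covariances have to be tracked in subsequent conditioning steps.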

Figure 7: Representation of the CPU model as a directed graphical model. The inference procedure described in section 3.2 is equivalent to the junction tree algorithm (Bishop, 2006, Section 8.4.6) applied to the graphical model above. This example shows that the language of information operators is a powerful tool for aggregating heterogeneous sources of partial information in a joint probabilistic model.

Summary

Stepping back, we can view the problem of modeling the CPU under computational load as a scientific inference problem, where we need to aggregate heterogeneous sources of information in a joint probabilistic model. This inference task is illustrated as a directed graphical model in figure 7. Our physics-informed regression framework is a local computation in the global inference procedure on the graph. Importantly, its implementation does not change based on what happens to the solution estimate and the input data in either upstream or downstream computations. All this information is already handily encoded in the structured uncertainties of the Gaussian processes.

3.3 A General Class of Tractable Information Operators for Linear PDEs

Recall that conditioning on the linear PDE directly via the information operator $\mathcal{I}[{\bm{\mathrm{u}}}]=\mathcal{D}[{\bm{\mathrm{u}}}]-f$ is usually intractable. Instead, in section 3.1.2 we approximated this information operator by ${\bm{\mathcal{I}}}\colon{\mathbb{U}}\to\mathbb{R}^{n}$ with ${\mathcal{I}}_{i}[{\bm{\mathrm{u}}}]\coloneqq\mathcal{D}[{\bm{\mathrm{u}}}]({% \bm{x}}_{i})-f({\bm{x}}_{i})$ where ${\bm{x}}_{i}\in{\mathbb{D}}$ . This implicitly assumes that point evaluation on both $\mathcal{D}[{\bm{\mathrm{u}}}]$ and $f$ is well-defined, which crucially means that this approach applies only to strong solutions of PDEs. In this section, we extend this approximation scheme to a general class of tractable information operators aimed at approximating both weak and strong solutions to linear PDEs. Our framework is inspired by the method of weighted residuals (MWR) (see section 2.1.2). In fact, in section 3.3.4 we will show that GP inference with information operators in this class reproduces any weighted residual method in the posterior mean while additionally providing an estimate of the inherent approximation error.


In the following, we will consider both weak and strong formulations of linear PDEs, which is why we introduce the unifying notation $\mathcal{D}^{(w)}[{\bm{u}}]=f^{(w)}$. For a strong formulation, $\mathcal{D}^{(w)}\coloneqq\mathcal{D}$, where $\mathcal{D}\colon{\mathbb{U}}\to{\mathbb{V}}$ is a linear differential operator (see definition 23), and $f^{(w)}\coloneqq f\in{\mathbb{V}}$ is the right-hand side function. In the context of a weak formulation, $\mathcal{D}^{(w)}\coloneqq\mathcal{D}^{w}$, where $\mathcal{D}^{w}\colon{\mathbb{U}}\to{\mathbb{V}}^{\prime}$ is the weak differential operator, and $f^{(w)}\coloneqq f^{w}\in{\mathbb{V}}^{\prime}$ is the right-hand side functional (see section 2.1.1). Following section 2.1.2, we will apply linear functionals to the PDE residual. To facilitate notation, we define the shorthand ${\mathbb{L}}^{(w)}$ for the space of continuous linear functionals on the image space of $\mathcal{D}^{(w)}$, i.e. ${\mathbb{L}}^{(w)}\coloneqq{\mathbb{V}}^{\prime}$ in the context of a strong formulation and ${\mathbb{L}}^{(w)}\coloneqq{\mathbb{V}}^{\prime\prime}$ in the context of a weak formulation. We additionally require that $l\circ\mathcal{D}^{(w)}$ is continuous for every $l\in{\mathbb{L}}^{(w)}$.

Let ${\bm{\mathrm{u}}}\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$ be a Gaussian process prior over the solution ${\bm{\mathrm{u}}}$ of the PDE, whose path space can be continuously embedded into the solution space ${\mathbb{U}}$ (see section B.4 for more details on the latter assumption). It is intractable to condition the GP prior on the full information provided by the PDE via the family $\{\mathcal{I}_{l}\}_{l\in{\mathbb{L}}^{(w)}}$ of affine information operators $\mathcal{I}_{l}[{\bm{\mathrm{u}}}]\coloneqq(l\circ\mathcal{D}^{(w)})[{\bm{\mathrm{u}}}]-l[f^{(w)}]$, since ${\mathbb{L}}^{(w)}$ is typically infinite-dimensional. To identify tractable families of information operators, we take inspiration from the method of weighted residuals.

3.3.1 Infinite-Dimensional Trial Function Spaces

Using theorem 1, we can tractably condition on a finite subfamily $\{\mathcal{I}_{l^{(i)}}\}_{i=1}^{n}\subset\{\mathcal{I}_{l}\}_{l\in{\mathbb{L}}^{(w)}}$ of information operators, where $\{l^{(i)}\}_{i=1}^{n}\subset{\mathbb{L}}^{(w)}$ is a finite subset of test functionals, as long as we can compute $\mathcal{I}_{l^{(i)}}[{\bm{m}}]$, $\mathcal{L}_{l^{(i)}}[{\bm{k}}_{:,j}(\cdot,{\bm{x}})]$, and $\mathcal{L}_{l^{(i)}}{\bm{k}}\mathcal{L}_{l^{(i)}}^{\prime}$, where $\mathcal{L}_{l^{(i)}}=l^{(i)}\circ\mathcal{D}^{(w)}$. This might not always be possible in closed form, since $\mathcal{L}_{l^{(i)}}$ often involves computing integrals. However, in these cases one can fall back to an efficient numerical quadrature method, since the integrals are often low-dimensional (typically at most four-dimensional). A prominent example of this approach is the probabilistic meshless method used in section 3.

Example 4 (Symmetric Collocation).

If the PDE is in strong formulation, then $l^{(i)}=\delta_{{\bm{x}}_{i}}\in{\mathbb{V}}^{\prime}$ with ${\bm{x}}_{i}\in{\mathbb{D}}$ is a valid test functional, which induces the information operator

\mathcal{I}_{l^{(i)}}[{\bm{\mathrm{u}}}]=\mathcal{D}[{\bm{\mathrm{u}}}]({\bm{x}}_{i})-f({\bm{x}}_{i}),

i.e. we recover the probabilistic meshless method by Cockayne et al. (2017). They show that the conditional mean of this approach reproduces symmetric collocation (Fasshauer, 1997, 1999), a well-known method to approximate strong solutions of PDEs.

The probabilistic meshless method can only be used to approximate strong solutions of linear PDEs, since point evaluation functionals are not well-defined on the image space ${\mathbb{V}}^{\prime}$ of $\mathcal{D}^{w}$ . However, other choices of the $l^{(i)}$ lead to approximation schemes for weak solutions.

Example 5 (Weak Formulations).

Consider a linear PDE in weak formulation. As mentioned in section 2.1.2, it is customary to use test functionals $l^{(i)}$ , which are induced by test functions $\psi^{(i)}\in{\mathbb{V}}$ , i.e.

\mathcal{I}_{l^{(i)}}[{\bm{\mathrm{u}}}]=\mathcal{D}^{w}[{\bm{\mathrm{u}}}][\psi^{(i)}]-f^{w}[\psi^{(i)}].

(3.8)

For instance, if ${\mathbb{D}}=]l,r[\subset\mathbb{R}$ , then a valid set of test functions for the weak formulation from section 2.1.1 is given by

\psi^{(i)}(x)=\begin{cases}\frac{x-x_{i-1}}{x_{i}-x_{i-1}}&\text{if }x_{i-1}\leq x\leq x_{i},\\ \frac{x_{i+1}-x}{x_{i+1}-x_{i}}&\text{if }x_{i}\leq x\leq x_{i+1},\\ 0&\text{otherwise},\end{cases}\qquad\psi^{(i)}\in H^{1}_{0}\left(]l,r[\right),

(3.9)

where $l=x_{0}<\dotsb<x_{n+1}=r$. The test functions are visualized in figure 8(a). For the weak formulation in section 2.1.1, the information operator from equation 3.8 is equivalent to $\mathcal{I}_{l^{(i)}}[{\mathrm{u}}]=B[{\mathrm{u}},\psi^{(i)}]-\langle f,\psi^{(i)}\rangle_{L_{2}}$.
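As a sanity check, the weak-form information operator can be evaluated numerically for a known solution. The sketch below assumes that the bilinear form from section 2.1.1 is $B[u,v]=\int u^{\prime}v^{\prime}\,\mathrm{d}x$, i.e. the weak form of $-u^{\prime\prime}=f$ with homogeneous Dirichlet conditions; this is an assumption, since that section is not reproduced here. The helper names are hypothetical.

```python
import numpy as np


def hat(x, left, center, right):
    """Piecewise-linear test function with nodes left < center < right."""
    up = (x - left) / (center - left)
    down = (right - x) / (right - center)
    return np.clip(np.minimum(up, down), 0.0, None)


def trapezoid(values, x):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(x))


def weak_residual(u, f, nodes, i, n_quad=2001):
    """I_i[u] = B[u, psi_i] - <f, psi_i>_{L2}, assuming B[u, v] = int u' v' dx."""
    left, center, right = nodes[i - 1], nodes[i], nodes[i + 1]
    # psi_i' is piecewise constant, so int u' psi_i' dx reduces to differences of u.
    bilinear = ((u(center) - u(left)) / (center - left)
                - (u(right) - u(center)) / (right - center))
    # For equispaced nodes, an odd number of quadrature points puts one at the kink.
    xq = np.linspace(left, right, n_quad)
    load = trapezoid(f(xq) * hat(xq, left, center, right), xq)
    return bilinear - load


def u_true(x):
    """Solves -u'' = f on (0, 1) with u(0) = u(1) = 0."""
    return np.sin(np.pi * x)


def f(x):
    return np.pi**2 * np.sin(np.pi * x)


nodes = np.linspace(0.0, 1.0, 6)      # l = x_0 < ... < x_{n+1} = r with n = 4
residuals = np.array([weak_residual(u_true, f, nodes, i) for i in range(1, 5)])
```

For the true solution, all four residuals vanish up to quadrature error, which is the information that conditioning on $\mathcal{I}_{l^{(i)}}[{\mathrm{u}}]=0$ injects into the GP.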

3.3.2 Finite-Dimensional Trial Function Spaces

As opposed to the methods outlined in section 2.1.2, we did not need to choose a finite-dimensional subspace of trial functions to arrive at tractable information operators in section 3.3.1. Nevertheless, in practice, it might still be desirable to specify a finite-dimensional trial function basis ${\bm{\phi}}^{(1)},\dotsc,{\bm{\phi}}^{(m)}$ , e.g. because

•

we want to reproduce the output of a classical method in the posterior mean to use the GP solver as an uncertainty-aware drop-in replacement (see corollary 4);
•

the trial basis encompasses problem-specific knowledge, which is difficult to encode in the prior; or
•

we want to solve the problem in a coarse-to-fine scheme, allowing for mesh refinement strategies, which are informed by the GP’s uncertainty estimate.

Naively, one might achieve this goal by defining the prior over ${\bm{\mathrm{u}}}$ as a parametric Gaussian process with features ${\bm{\phi}}^{(i)}$. However, this means the posterior cannot quantify the inherent approximation error, since the GP has no support outside of the finite-dimensional subspace of ${\mathbb{U}}$ spanned by the trial functions. Consequently, we need to take a different approach. Starting from a general, potentially nonparametric prior over ${\bm{\mathrm{u}}}$, we consider a bounded (potentially oblique) projection $\mathcal{P}_{\hat{{\mathbb{U}}}}\colon{\mathbb{U}}\to\hat{{\mathbb{U}}}$ onto a subspace $\hat{{\mathbb{U}}}\subset{\mathbb{U}}$, i.e. $\mathcal{P}_{\hat{{\mathbb{U}}}}^{2}=\mathcal{P}_{\hat{{\mathbb{U}}}}$, $\lVert\mathcal{P}_{\hat{{\mathbb{U}}}}\rVert<\infty$, and $\operatorname{ran}(\mathcal{P}_{\hat{{\mathbb{U}}}})=\hat{{\mathbb{U}}}$. In general, this subspace need not be finite-dimensional. We apply $\mathcal{P}_{\hat{{\mathbb{U}}}}$ to our GP prior over ${\bm{\mathrm{u}}}$, which, by corollary 2, results in another GP

\hat{{\bm{\mathrm{u}}}}\coloneqq\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{\mathrm{u}}}]\sim{\operatorname{\mathcal{GP}}\left(\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{m}}],\mathcal{P}_{\hat{{\mathbb{U}}}}{\bm{k}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{\prime}\right)},

with sample paths in $\hat{{\mathbb{U}}}$ . This discards prior information about $\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$ . Hence, especially in case $\dim{\hat{{\mathbb{U}}}}<\infty$ , applying the information operators $\mathcal{I}_{l^{(i)}}$ from section 3.3.1 directly to $\hat{{\bm{\mathrm{u}}}}$ would suffer from similar problems as choosing a parametric prior. However,

\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}\coloneqq\mathcal{I}_{l^{(i)}}\circ\mathcal{P}_{\hat{{\mathbb{U}}}}=(l^{(i)}\circ\mathcal{D}^{(w)}\circ\mathcal{P}_{\hat{{\mathbb{U}}}})[\cdot]-l^{(i)}[f^{(w)}]

is a valid information operator for ${\bm{\mathrm{u}}}$ , which leads to a probabilistic generalization of the method of weighted residuals. This is why we refer to $\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}$ as an MWR information operator.

The similarity to the method of weighted residuals is particularly prominent if we choose a finite-dimensional subspace $\hat{{\mathbb{U}}}=\operatorname{span}\left({\bm{\phi}}^{(1)},\dotsc,{\bm{\phi}}^{(m)}\right)$ as in section 2.1.2. In this case, there is a bounded linear operator ${\bm{\mathcal{P}}}_{\mathbb{R}^{m}}\colon{\mathbb{U}}\to\mathbb{R}^{m}$ such that

\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{\mathrm{u}}}]=\sum_{i=1}^{m}{\mathrm{c}}_{i}{\bm{\phi}}^{(i)}\eqqcolon\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}}[{\bm{\mathrm{c}}}],

where ${\bm{\mathrm{c}}}\coloneqq{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{u}}}]\in\mathbb{R}^{m}$ are the coordinates of $\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{\mathrm{u}}}]$ in $\hat{{\mathbb{U}}}$ and $\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}}\colon\mathbb{R}^{m}\to\hat{{\mathbb{U}}}$ is the canonical isomorphism between $\mathbb{R}^{m}$ and $\hat{{\mathbb{U}}}$. Hence, we get the factorization

\mathcal{P}_{\hat{{\mathbb{U}}}}=\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}},

(3.10)

which implies that $\hat{{\bm{\mathrm{u}}}}$ is a parametric Gaussian process. Moreover, $l^{(i)}[f^{(w)}]=\hat{{f}}_{i}$ and

(l^{(i)}\circ\mathcal{D}^{(w)}\circ\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}})[{\bm{\mathrm{c}}}]=\sum_{j=1}^{m}{\mathrm{c}}_{j}(l^{(i)}\circ\mathcal{D}^{(w)})[{\bm{\phi}}^{(j)}]=(\hat{{\bm{D}}}{\bm{\mathrm{c}}})_{i},

where $\hat{{\bm{D}}}$ and $\hat{{\bm{f}}}$ are defined as in section 2.1.2. Consequently, the MWR information operator is given by $\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}[{\bm{\mathrm{u}}}]=({\bm{\mathcal{I}}}_{\mathbb{R}^{m}}\circ{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})[{\bm{\mathrm{u}}}]_{i},$ where ${\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]\coloneqq\hat{{\bm{D}}}{\bm{\mathrm{c}}}-\hat{{\bm{f}}}.$ This illustrates that we are dealing with the hierarchical model

\begin{aligned}{\bm{\mathrm{u}}}&\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}\\ {\bm{\mathrm{c}}}\mid{\bm{\mathrm{u}}}&\sim\delta_{{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{u}}}]}\end{aligned}

with observations ${\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]={\bm{0}}$, where ${\bm{\mathrm{c}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}],{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}\right)}$. Inference in this model can be broken down into two steps. First, we update our belief about the solution’s coordinates in $\hat{{\mathbb{U}}}$ by computing the conditional random variable ${\bm{\mathrm{c}}}\mid{\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]={\bm{0}}$, which is also Gaussian. If $\hat{{\bm{D}}}$ is invertible and ${\bm{\mathrm{c}}}$ has full support on $\mathbb{R}^{m}$, then the law of ${\bm{\mathrm{c}}}\mid{\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]={\bm{0}}$ is a Dirac measure whose mean is given by the coordinates of the MWR approximation ${\bm{c}}^{\mathrm{MWR}}=\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}$ from equation 2.8. Next, we can reuse precomputed quantities from the conditional moments of ${\bm{\mathrm{c}}}\mid{\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]={\bm{0}}$, such as the representer weights ${\bm{w}}=(\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}\hat{{\bm{D}}}^{\top})^{\dagger}(\hat{{\bm{f}}}-\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}])$, to efficiently compute the conditional random process

({\bm{\mathrm{u}}}\mid({\bm{\mathcal{I}}}_{\mathbb{R}^{m}}\circ{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})[{\bm{\mathrm{u}}}]={\bm{0}})=({\bm{\mathrm{u}}}\mid\{\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}[{\bm{\mathrm{u}}}]=0\}_{i=1}^{n}),

i.e. the main quantity of interest. Assuming once more that $\hat{{\bm{D}}}$ is invertible and ${\bm{\mathrm{c}}}$ has full support on $\mathbb{R}^{m}$, the remaining uncertainty of the conditional process lies in the kernel of $\mathcal{P}_{\hat{{\mathbb{U}}}}$, since the law of ${\bm{\mathrm{c}}}\mid{\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]={\bm{0}}$ is a Dirac measure and

(\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{\mathrm{u}}}]\mid\{\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}[{\bm{\mathrm{u}}}]=0\}_{i=1}^{n})=(\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}}[{\bm{\mathrm{c}}}]\mid{\bm{\mathcal{I}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{c}}}]={\bm{0}}).

Thus, all remaining uncertainty must be due to $(\operatorname{id}_{{\mathbb{U}}}-\mathcal{P}_{\hat{{\mathbb{U}}}})[{\bm{\mathrm{u}}}]\mid\{\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}[{\bm{\mathrm{u}}}]=0\}_{i=1}^{n}.$ Note the striking similarity of this property to the notion of Galerkin orthogonality (Logg et al., 2012, Equation 2.63).
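The first inference step can be checked numerically in a small example. The following is a minimal sketch, assuming NumPy and hypothetical stand-ins for $\hat{{\bm{D}}}$, $\hat{{\bm{f}}}$, and the prior moments of ${\bm{\mathrm{c}}}$ (none of these values come from the paper). Because $\hat{{\bm{D}}}$ is invertible here, a linear solve is used in place of the pseudoinverse.

```python
import numpy as np

# Hypothetical stand-ins (not from the paper): a small invertible D_hat,
# a right-hand side f_hat, and a full-rank Gaussian prior on the coordinates c.
m = 4
D_hat = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
f_hat = np.ones(m)
mu_c = np.linspace(-1.0, 1.0, m)
idx = np.arange(m)
Sigma_c = 0.5 ** np.abs(idx[:, None] - idx[None, :])

# Condition c ~ N(mu_c, Sigma_c) on the noise-free observation D_hat c - f_hat = 0.
gram = D_hat @ Sigma_c @ D_hat.T
w = np.linalg.solve(gram, f_hat - D_hat @ mu_c)  # representer weights
post_mean = mu_c + Sigma_c @ D_hat.T @ w
post_cov = Sigma_c - Sigma_c @ D_hat.T @ np.linalg.solve(gram, D_hat @ Sigma_c)

# Coordinates of the classical MWR approximation.
c_mwr = np.linalg.solve(D_hat, f_hat)
```

In this setting, `post_mean` agrees with `c_mwr` and `post_cov` vanishes up to rounding error, which is the finite-dimensional counterpart of the Dirac measure described above.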

A canonical choice for $\mathcal{P}_{\hat{{\mathbb{U}}}}$ would arguably be an orthogonal projection w.r.t. the RKHS inner product of the sample space of ${\bm{\mathrm{u}}}$ (see e.g. Kanagawa et al. 2018). However, this inner product is generally difficult to compute. Fortunately, we can use the $L_{2}$ inner products or Sobolev inner products on the samples to induce a (usually non-orthogonal) projection $\mathcal{P}_{\hat{{\mathbb{U}}}}$ .

Example 6.

If the elements of ${\mathbb{U}}$ are square-integrable, then the linear operator

{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{u}}}]\coloneqq{\bm{P}}^{-1}\left(\int_{{\mathbb{D}}}\langle{\bm{\phi}}^{(i)}({\bm{x}}),{\bm{\mathrm{u}}}({\bm{x}})\rangle_{\mathbb{R}^{d}}\,\mathrm{d}{\bm{x}}\right)_{i=1}^{m},

where

{P}_{ij}\coloneqq\int_{{\mathbb{D}}}\langle{\bm{\phi}}^{(i)}({\bm{x}}),{\bm{\phi}}^{(j)}({\bm{x}})\rangle_{\mathbb{R}^{d}}\,\mathrm{d}{\bm{x}},

induces a projection $\mathcal{P}_{\hat{{\mathbb{U}}}}=\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}$ onto $\hat{{\mathbb{U}}}\subset{\mathbb{U}}$, even if ${\mathbb{U}}$ is not a Hilbert space with inner product $\langle\cdot,\cdot\rangle_{L_{2}\left({\mathbb{D}}\right)}$.
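A discretized version of this construction can be sketched as follows. The setup is an assumption made for illustration: a one-dimensional domain $]-1,1[$, scalar-valued functions, piecewise linear hat functions as trial basis, and a trapezoidal rule in place of the exact integrals. The helper name `project_coords` is hypothetical.

```python
import numpy as np

# Assumed setup: domain ]-1, 1[, scalar-valued functions, hat-function trial basis.
nodes = np.linspace(-1.0, 1.0, 6)      # x_0, ..., x_5, i.e. m = 4 interior hats
x = np.linspace(-1.0, 1.0, 2001)       # quadrature grid
h = x[1] - x[0]
weights = np.full(x.size, h)
weights[[0, -1]] = h / 2.0             # trapezoidal rule weights

# Row i holds the (i+1)-th interior hat function evaluated on the grid.
Phi = np.stack([np.interp(x, nodes, np.eye(nodes.size)[i])
                for i in range(1, nodes.size - 1)])

# Gram matrix P_ij, approximating the L2 inner products of the basis functions.
P = (Phi * weights) @ Phi.T

def project_coords(u_values):
    """Coordinates P^{-1} (<phi_i, u>)_i of the (discretized) L2 projection of u."""
    return np.linalg.solve(P, (Phi * weights) @ u_values)

u_values = np.cos(np.pi * x / 2.0)     # a function outside the span of the basis
c = project_coords(u_values)
residual = u_values - c @ Phi          # component removed by the projection
```

Applying `project_coords` to `c @ Phi` returns `c` again, reflecting idempotence of the induced projection, and `residual` is orthogonal to every basis function with respect to the quadrature inner product.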

At first glance, information operators restricting $\hat{{\mathbb{U}}}$ to be finite-dimensional might seem fundamentally inferior to the information operators from section 3.3.1. However, the conditional mean of a Gaussian process prior conditioned on $\{\mathcal{I}_{l^{(i)}}[{\bm{\mathrm{u}}}]=0\}_{i=1}^{n}$ is updated by a linear combination of $n$ functions, while the covariance function receives an at most rank $n$ downdate. This means that, implicitly, conditioning a Gaussian process on an information operator with $\mathcal{P}_{\hat{{\mathbb{U}}}}=\operatorname{id}_{{\mathbb{U}}}$ also constructs a finite-dimensional trial function space, which depends on the test function basis, the bilinear form $B$ and the prior covariance function ${\bm{k}}$ .

MWR information operators with finite-dimensional trial function bases can be used to realize a GP-based analogue of the finite element method.

Example 7 (A 1D Finite Element Method).

Finite element methods are (generalized) Galerkin methods, where the functions in the test and trial bases have compact support, i.e. they are nonzero only in a highly localized region of the domain. The archetype of a finite element method for the weak formulation from section 2.1.1 uses linear Lagrange elements (Logg et al., 2012, Section 3.3.1) as test and trial functions, i.e. $\phi^{(i)}(x)=\psi^{(i)}(x)$ and $m=n$ . Linear Lagrange elements are piecewise linear on a triangulation of the domain. For instance, on a one-dimensional domain ${\mathbb{D}}=]-1,1[$ , the linear Lagrange elements are given by equation 3.9 from example 5. Multiplying a coordinate vector ${\bm{c}}\in\mathbb{R}^{m}$ with these basis functions leads to a piecewise linear interpolation between the points

(x_{0},0),(x_{1},{c}_{1}),\dotsc,(x_{n},{c}_{n}),(x_{n+1},0),

since, for $x\in[x_{i},x_{i+1}]$,

\sum_{j=1}^{m}{c}_{j}\phi^{(j)}(x)={c}_{i}\frac{x_{i+1}-x}{x_{i+1}-x_{i}}+{c}_{i+1}\frac{x-x_{i}}{x_{i+1}-x_{i}}=\left(1-\frac{x-x_{i}}{x_{i+1}-x_{i}}\right){c}_{i}+\left(\frac{x-x_{i}}{x_{i+1}-x_{i}}\right){c}_{i+1}.

The basis functions and an element in their span are visualized in figure 8. The Lagrange elements at the boundary of the domain can also be easily modified such that arbitrary piecewise linear boundary conditions are fulfilled by construction. The effect of MWR information operators based on this set of test and trial functions is visualized in figure 9(a).
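The interpolation property can be verified directly. The sketch below uses only the Python standard library; the function names `hat`, `expand`, and `interpolate` as well as the node positions and coefficients are illustrative assumptions.

```python
def hat(i, x, nodes):
    """Linear Lagrange element phi^(i) centred at nodes[i], for 1 <= i <= n."""
    left, mid, right = nodes[i - 1], nodes[i], nodes[i + 1]
    if left <= x <= mid:
        return (x - left) / (mid - left)
    if mid < x <= right:
        return (right - x) / (right - mid)
    return 0.0


def expand(c, x, nodes):
    """Evaluate sum_j c_j phi^(j)(x), where c[j - 1] is the coefficient of phi^(j)."""
    return sum(c_j * hat(j, x, nodes) for j, c_j in enumerate(c, start=1))


def interpolate(c, x, nodes):
    """Piecewise linear interpolant through (x_0, 0), (x_j, c_j), ..., (x_{n+1}, 0)."""
    values = [0.0, *c, 0.0]
    for i in range(len(nodes) - 1):
        if nodes[i] <= x <= nodes[i + 1]:
            t = (x - nodes[i]) / (nodes[i + 1] - nodes[i])
            return (1.0 - t) * values[i] + t * values[i + 1]
    raise ValueError("x lies outside the domain")


nodes = [-1.0, -0.5, 0.0, 0.5, 1.0]   # x_0, ..., x_4, i.e. n = 3 interior nodes
c = [2.0, -1.0, 0.5]                  # illustrative coordinate vector
```

For any `x` in the domain, `expand(c, x, nodes)` and `interpolate(c, x, nodes)` agree up to rounding error.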

3.3.3 MWR Information Operators

Even though the class of information operators introduced above is constructed for linear PDEs, it can naturally be applied to the weak form of an arbitrary operator equation. In particular, we can use MWR information operators for the boundary conditions in an (I)BVP. Moreover, it is straightforward to extend $\mathcal{I}_{l,\mathcal{P}_{\hat{{\mathbb{U}}}}}$ to a joint GP prior over $({\bm{\mathrm{u}}},{\mathrm{f}}^{(w)})$ if the right-hand side $f^{(w)}$ of the operator equation is unknown as in section 2.1. In this case, $\mathcal{I}_{l,\mathcal{P}_{\hat{{\mathbb{U}}}}}$ is jointly linear in $({\bm{\mathrm{u}}},{\mathrm{f}}^{(w)})$ . Summarizing sections 3.3.1 and 3.3.2 and incorporating the extensions discussed here, we define an MWR information operator as follows:

Definition 2 (MWR Information Operator).

Let $\mathcal{D}^{(w)}[{\bm{\mathrm{u}}}]=f^{(w)}$ be an operator equation in strong or weak formulation. An MWR information operator for said operator equation is a continuous affine functional

\mathcal{I}_{l,\mathcal{P}_{\hat{{\mathbb{U}}}}}\coloneqq(l\circ\mathcal{D}^{(w)}\circ\mathcal{P}_{\hat{{\mathbb{U}}}})[\cdot]-l[f^{(w)}]

parameterized by a test functional $l\in{\mathbb{L}}^{(w)}$ and a bounded (potentially oblique) projection $\mathcal{P}_{\hat{{\mathbb{U}}}}$ onto a subspace $\hat{{\mathbb{U}}}\subset{\mathbb{U}}$. We also write $\mathcal{I}_{l}\coloneqq\mathcal{I}_{l,\operatorname{id}_{{\mathbb{U}}}}$. The input of $\mathcal{I}_{l,\mathcal{P}_{\hat{{\mathbb{U}}}}}$ can be extended to the right-hand side ${\mathrm{f}}^{(w)}$ of the operator equation, i.e.

\mathcal{I}_{l,\mathcal{P}_{\hat{{\mathbb{U}}}}}[({\bm{\mathrm{u}}},{\mathrm{f}}^{(w)})]\coloneqq(l\circ\mathcal{D}^{(w)}\circ\mathcal{P}_{\hat{{\mathbb{U}}}})[{\bm{\mathrm{u}}}]-l[{\mathrm{f}}^{(w)}],

which is jointly linear in $({\bm{\mathrm{u}}},{\mathrm{f}}^{(w)})$ .

	Method	Trial Functions ${\bm{\phi}}^{(i)}$	Test Functionals $l^{(i)}$ / Functions $\psi^{(i)}$
Strong Solutions	Collocation	arbitrary	$l^{(i)}=\delta_{{\bm{x}}_{i}}$ for ${\bm{x}}_{i}\in{\mathbb{D}}$ $\Rightarrow(l^{(i)}\circ\mathcal{D})[{\bm{\mathrm{u}}}]=\mathcal{D}[{\bm{\mathrm{u}}}]({\bm{x}}_{i})$
	Subdomain (Finite Volume)	arbitrary	$\psi^{(i)}=\chi_{{\mathbb{D}}_{i}}$ for ${\mathbb{D}}_{i}\subset{\mathbb{D}}$ $\Rightarrow(l^{(i)}\circ\mathcal{D})[{\bm{\mathrm{u}}}]=\int_{{\mathbb{D}}_{i}}\mathcal{D}[{\bm{\mathrm{u}}}]({\bm{x}})\,\mathrm{d}{\bm{x}}$
	Pseudospectral	orthogonal and globally supported (e.g. Fourier basis or Chebychev polynomials)	$l^{(i)}=\delta_{{\bm{x}}_{i}}$ for ${\bm{x}}_{i}\in{\mathbb{D}}$ $\Rightarrow(l^{(i)}\circ\mathcal{D})[{\bm{\mathrm{u}}}]=\mathcal{D}[{\bm{\mathrm{u}}}]({\bm{x}}_{i})$
Weak & Strong Solutions	Generalized Galerkin	arbitrary	arbitrary, but in general $\psi^{(i)}\neq\phi^{(i)}$
	Finite Element	locally supported (e.g. piecewise polynomial)	same class as trial functions, but in general $\psi^{(i)}\neq\phi^{(i)}$
	Spectral (Galerkin)	orthogonal and globally supported (e.g. Fourier basis or Chebychev polynomials)	same class as trial functions, but in general $\psi^{(i)}\neq\phi^{(i)}$
	(Ritz-)Galerkin	arbitrary	$\psi^{(i)}=\phi^{(i)}$

Table 1: Trial and test function(al)s defining commonly used methods of weighted residuals. If used as part of an MWR information operator, the GP posterior mean recovers the corresponding classic method (see corollary 4).

3.3.4 Recovery of Classical Methods

In this section we will show that, under certain assumptions, the posterior mean of a GP prior conditioned on a set of MWR information operators is identical to the approximation generated by the corresponding traditional method of weighted residuals, examples of which are given in Table 1. More precisely, we will show that there is a flexible family of GP priors ${\bm{\mathrm{u}}}\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$ whose posterior means after conditioning on $\{\mathcal{I}_{l^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}}\}_{i=1}^{n}$ are identical to the corresponding classical MWR approximation ${\bm{u}}^{\mathrm{MWR}}$ to the solution of the same weak form linear PDE, where we use the same trial functions ${\bm{\phi}}^{(1)},\dotsc,{\bm{\phi}}^{(m)}$ and test functionals $l^{(1)},\dotsc,l^{(n)}$ in both cases, i.e. $\hat{{\mathbb{U}}}=\operatorname{span}\left({\bm{\phi}}^{(1)},\dotsc,{\bm{\phi}}^{(m)}\right)$. As in section 2.1.2, we assume that the trial functions are already constructed in such a way that the boundary conditions are fulfilled. However, it is possible to extend the results below to the general case by adding MWR information operators corresponding to the boundary conditions and using

{\bm{c}}^{\mathrm{MWR}}=\begin{pmatrix}\hat{{\bm{D}}}_{\text{PDE}}\\ \hat{{\bm{D}}}_{\text{BC}}\end{pmatrix}^{-1}\begin{pmatrix}\hat{{\bm{f}}}_{\text{PDE}}\\ \hat{{\bm{f}}}_{\text{BC}}\end{pmatrix}

as coordinates for the reference solution generated by the traditional MWR.

Proposition 3.

If $\hat{{\bm{D}}}\in\mathbb{R}^{n\times m}$ and ${\bm{\Sigma}}_{\bm{\mathrm{c}}}\coloneqq{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}\in\mathbb{R}^{m\times m}$ are invertible, then

{\bm{\mathrm{c}}}\mid\hat{{\bm{D}}}{\bm{\mathrm{c}}}-\hat{{\bm{f}}}={\bm{0}}\sim\delta_{{\bm{c}}^{\mathrm{MWR}}}

and the conditional mean ${\bm{m}}^{{\bm{\mathrm{u}}}\mid\hat{{\bm{D}}},\hat{{\bm{f}}}}$ of ${\bm{\mathrm{u}}}\mid\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{\mathrm{u}}}]-\hat{{\bm{f}}}={\bm{0}}$ admits a unique additive decomposition

{\bm{m}}^{{\bm{\mathrm{u}}}\mid\hat{{\bm{D}}},\hat{{\bm{f}}}}={\bm{u}}^{\mathrm{MWR}}+{\bm{u}}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}

(3.11)

with ${\bm{u}}^{\mathrm{MWR}}\in\hat{{\mathbb{U}}}$ and ${\bm{u}}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\in\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$.

Corollary 4 (MWR Generalization).

If, additionally, ${\bm{m}}\in\hat{{\mathbb{U}}}$ and $\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}={\bm{0}}$, then the conditional mean function ${\bm{m}}^{{\bm{\mathrm{u}}}\mid\hat{{\bm{D}}},\hat{{\bm{f}}}}$ is equal to the MWR approximation, i.e.

{\bm{m}}^{{\bm{\mathrm{u}}}\mid\hat{{\bm{D}}},\hat{{\bm{f}}}}={\bm{u}}^{\mathrm{MWR}}.

It turns out that it is possible to transform any admissible GP prior over the (weak) solution of the PDE into a prior that fulfills the assumptions of corollary 4.

Proposition 5 (MWR Recovery Prior).

Let $\tilde{{\bm{\mathrm{u}}}}\sim{\operatorname{\mathcal{GP}}\left(\tilde{{\bm{m}}},\tilde{{\bm{k}}}\right)}$ with mean and sample paths in ${\mathbb{U}}$. Then ${\bm{\mathrm{u}}}\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$ with ${\bm{m}}\coloneqq\mathcal{P}_{\hat{{\mathbb{U}}}}[\tilde{{\bm{m}}}]$ and

\begin{aligned}{\bm{k}}&\coloneqq\mathcal{P}_{\hat{{\mathbb{U}}}}\tilde{{\bm{k}}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{\prime}+\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\tilde{{\bm{k}}}\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}^{\prime}\\ &=\mathcal{P}_{\hat{{\mathbb{U}}}}\tilde{{\bm{k}}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{\prime}+(\operatorname{id}_{{\mathbb{U}}}-\mathcal{P}_{\hat{{\mathbb{U}}}})\tilde{{\bm{k}}}(\operatorname{id}_{{\mathbb{U}}}-\mathcal{P}_{\hat{{\mathbb{U}}}})^{\prime}\\ &=\tilde{{\bm{k}}}-\mathcal{P}_{\hat{{\mathbb{U}}}}\tilde{{\bm{k}}}-\tilde{{\bm{k}}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{\prime}+2\mathcal{P}_{\hat{{\mathbb{U}}}}\tilde{{\bm{k}}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{\prime}\end{aligned}

has sample paths in ${\mathbb{U}}$, ${\bm{m}}\in\hat{{\mathbb{U}}}$, and $\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}={\bm{0}}$.

Figure 9(b) visualizes how a prior of this form reproduces a 1D finite element method in the posterior mean, and figure 9 as a whole illustrates the difference between $\tilde{{\bm{\mathrm{u}}}}$ and ${\bm{\mathrm{u}}}$. Intuitively speaking, the construction for the covariance from proposition 5 enforces statistical independence between the subspaces $\hat{{\mathbb{U}}}$ and $\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$ of the GP’s path space. This way, an observation of the GP prior in the subspace $\hat{{\mathbb{U}}}$ carries no information about $\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$, which means that the posterior process will not be updated along $\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$. Since ${\bm{m}}\in\hat{{\mathbb{U}}}$, i.e. $\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}[{\bm{m}}]={\bm{0}}$, it follows that the posterior mean will also lie in $\hat{{\mathbb{U}}}$. Even though this choice of prior is somewhat restrictive, there are good reasons to use it in practice, arguably the most important of which is that the uncertainty quantification provided by the GP can be added on top of traditional MWR solvers in existing pipelines in a plug-and-play fashion. This is because, given the MWR recovery prior, the mean estimate of the probabilistic numerical method agrees with the point estimate produced by the classical solver.
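The decoupling of $\hat{{\mathbb{U}}}$ and $\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$ can be illustrated with a finite-dimensional analogue in which functions are replaced by vectors in $\mathbb{R}^{5}$ and operators by matrices. This is a sketch under that assumption only; the matrices `Phi`, `Psi`, and `K_tilde` are hypothetical and chosen for illustration.

```python
import numpy as np

d = 5
idx = np.arange(d)
K_tilde = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # an arbitrary SPD covariance

# Oblique projection onto span(Phi) along the orthogonal complement of span(Psi).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
Psi = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
P = Phi @ np.linalg.solve(Psi.T @ Phi, Psi.T)
Q = np.eye(d) - P                                     # projection onto ker(P)

# Matrix analogue of the covariance construction and of its expanded form.
K = P @ K_tilde @ P.T + Q @ K_tilde @ Q.T
K_expanded = K_tilde - P @ K_tilde - K_tilde @ P.T + 2.0 * P @ K_tilde @ P.T

# Cross-covariance between the ker(P) component and the projected component.
cross_cov = Q @ K @ P.T
```

Here `cross_cov` vanishes because `Q @ P` and `P @ Q` are both zero, so observations of the projected component leave the component in the kernel unchanged.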

3.4 Algorithm

Algorithm 1 summarizes our framework from an algorithmic standpoint. It outlines how a GP prior can be conditioned on heterogeneous sources of information such as mechanistic knowledge given in the form of a linear boundary value problem, and noisy measurement data by leveraging the notion of a linear information operator. All GP posteriors in this article were computed by this algorithm with different choices of prior, PDE, boundary conditions and policy.

Algorithm 1 Solving PDEs via Gaussian Process Inference

Input: Joint GP prior $({\bm{\mathrm{u}}},{\mathrm{f}}^{(w)},{\mathrm{g}}^{(w)},{\bm{\mathrm{\epsilon}}})\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$, linear PDE $(\mathcal{D}^{(w)},{\mathrm{f}}^{(w)})$, boundary conditions $(\mathcal{B}^{(w)},{\mathrm{g}}^{(w)})$, (noisy) measurements $({\bm{X}}_{\text{MEAS}},{\bm{y}}_{\text{MEAS}})$, $\dotsc$
Output: GP posterior ${\operatorname{\mathcal{GP}}\left({\bm{m}}^{(i)},{\bm{k}}^{(i)}\right)}$

procedure LinPDE-GP(${\bm{m}},{\bm{k}},\mathcal{I}_{\cdot,\cdot}^{\text{PDE}},\mathcal{I}_{\cdot,\cdot}^{\text{BC}},{\bm{X}}_{\text{MEAS}},{\bm{y}}_{\text{MEAS}}$)
    $i\leftarrow 0$
    $({\bm{m}}^{(0)},{\bm{k}}^{(0)})\leftarrow({\bm{m}},{\bm{k}})$
    ${\bm{w}}^{(0)}\leftarrow()$
    ${\bm{G}}^{(0)}\leftarrow()$
    while not StoppingCriterion() do
        $i\leftarrow i+1$
        $(l_{\text{PDE}}^{(i)},l_{\text{BC}}^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}^{(i)},\dotsc,{\bm{v}}_{\text{MEAS}}^{(i)})\leftarrow\textsc{Policy}({\bm{m}}^{(i-1)},{\bm{k}}^{(i-1)})$    $\vartriangleright$ Action
        ${\bm{\mathcal{I}}}^{(i)}\leftarrow({\bm{u}},f^{(w)},g^{(w)},{\bm{\epsilon}})\mapsto\begin{pmatrix}{\bm{\mathcal{I}}}_{l_{\text{PDE}}^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}^{(i)}}^{\text{PDE}}[({\bm{u}},f^{(w)})]\\ {\bm{\mathcal{I}}}_{l_{\text{BC}}^{(i)},\mathcal{P}_{\hat{{\mathbb{U}}}}^{(i)}}^{\text{BC}}[({\bm{u}},g^{(w)})]\\ \vdots\\ \langle{\bm{v}}_{\text{MEAS}}^{(i)},{\bm{u}}({\bm{X}}_{\text{MEAS}})+{\bm{\epsilon}}\rangle\end{pmatrix}$    $\vartriangleright$ Information operator
        ${\bm{y}}^{(i)}\leftarrow\begin{pmatrix}0&0&\ldots&\langle{\bm{v}}_{\text{MEAS}}^{(i)},{\bm{y}}_{\text{MEAS}}\rangle\end{pmatrix}^{\top}$    $\vartriangleright$ Observations
        ${\bm{G}}^{(i)}\leftarrow\begin{pmatrix}{\bm{G}}^{(i-1)}&{\bm{\mathcal{I}}}^{(1:i-1)}{\bm{k}}({\bm{\mathcal{I}}}^{(i)})^{\prime}\\ {\bm{\mathcal{I}}}^{(i)}{\bm{k}}({\bm{\mathcal{I}}}^{(1:i-1)})^{\prime}&{\bm{\mathcal{I}}}^{(i)}{\bm{k}}({\bm{\mathcal{I}}}^{(i)})^{\prime}\end{pmatrix}$    $\vartriangleright$ Update Gram matrix
        ${\bm{w}}^{(i)}\leftarrow({\bm{G}}^{(i)})^{\dagger}({\bm{y}}^{(1:i)}-{\bm{\mathcal{I}}}^{(1:i)}[{\bm{m}}])$    $\vartriangleright$ Update representer weights
        ${m}^{(i)}_{j}\leftarrow{\bm{x}}\mapsto{m}_{j}({\bm{x}})+{\bm{\mathcal{I}}}^{(1:i)}[{\bm{k}}_{:,j}(\cdot,{\bm{x}})]^{\top}{\bm{w}}^{(i)}$    $\vartriangleright$ Belief update
        ${k}^{(i)}_{j_{1},j_{2}}\leftarrow({\bm{x}}_{1},{\bm{x}}_{2})\mapsto{k}_{j_{1},j_{2}}({\bm{x}}_{1},{\bm{x}}_{2})-{\bm{\mathcal{I}}}^{(1:i)}[{\bm{k}}_{:,j_{1}}(\cdot,{\bm{x}}_{1})]^{\top}({\bm{G}}^{(i)})^{\dagger}{\bm{\mathcal{I}}}^{(1:i)}[{\bm{k}}_{:,j_{2}}(\cdot,{\bm{x}}_{2})]$
    return ${\operatorname{\mathcal{GP}}\left({\bm{m}}^{(i)},{\bm{k}}^{(i)}\right)}$

Modeling uncertainty over the right-hand side $f^{(w)}$ of the PDE and the boundary function(al) $g^{(w)}$ is achieved by specifying a joint prior $({\bm{\mathrm{u}}},{\mathrm{f}}^{(w)},{\mathrm{g}}^{(w)},{\bm{\mathrm{\epsilon}}})$. Therefore, Algorithm 1 also returns a multi-output Gaussian process posterior over the same objects. This means that our method can be used to solve PDE-constrained Bayesian inverse problems for the right-hand side $f^{(w)}$ and the boundary function $g^{(w)}$, while computing a consistent distributional estimate for the corresponding solution ${\bm{u}}$ of the forward problem. This is a generalization of a linear latent force model (Alvarez et al., 2009). If $f^{(w)}$ and $g^{(w)}$ are not uncertain, the corresponding covariance functions in the joint prior can simply be set to zero, which (in the absence of measurements) reduces the joint prior to a simple prior over the solution ${\bm{\mathrm{u}}}$. To condition the GP on the PDE and the boundary conditions, we make use of MWR information operators (see definition 2), where the test functions and projections are chosen by an arbitrary policy in each iteration of the method. For example, the following policy reproduces figure 1(c): in every iteration, $\mathcal{P}_{\hat{{\mathbb{U}}}}$ is chosen as the $L_{2}$ projection onto the basis from example 7; in the first two iterations, the test functionals are $l_{\text{BC}}\in\{\delta_{-1},\delta_{1}\}$ and $l_{\text{PDE}}=0$; from iteration 3 onward, $l_{\text{PDE}}$ is induced by the test function $\psi^{(i-2)}=\phi^{(i-2)}$ and $l_{\text{BC}}=0$. The ellipses in the information operator ${\bm{\mathcal{I}}}^{(i)}$ and the observations ${\bm{y}}^{(i)}$ indicate that further information operators can be added in the same fashion. For instance, adding further PDE information operators enables the solution of systems of linear PDEs.
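The structure of the main loop can be sketched in a finite-dimensional setting, where the GP prior is replaced by a Gaussian vector on a grid and each information operator by a matrix row. This is an illustrative assumption rather than the implementation accompanying the paper; the function name `linpde_gp_finite` and all numerical values are hypothetical.

```python
import numpy as np


def linpde_gp_finite(m, K, operators, observations):
    """Condition N(m, K) on A_i u = y_i, adding one observation block per iteration."""
    A = np.zeros((0, m.size))   # stacked information operators I^(1:i)
    y = np.zeros(0)             # stacked observations y^(1:i)
    G = np.zeros((0, 0))        # Gram matrix G^(i)
    mean, cov = m, K
    for A_i, y_i in zip(operators, observations):
        A_i, y_i = np.atleast_2d(A_i), np.atleast_1d(y_i)
        cross = A @ K @ A_i.T                       # I^(1:i-1) k (I^(i))'
        top = np.concatenate([G, cross], axis=1)
        bottom = np.concatenate([cross.T, A_i @ K @ A_i.T], axis=1)
        G = np.concatenate([top, bottom], axis=0)   # extend the Gram matrix
        A = np.concatenate([A, A_i], axis=0)
        y = np.concatenate([y, y_i])
        G_pinv = np.linalg.pinv(G)
        w = G_pinv @ (y - A @ m)                    # representer weights
        KA = K @ A.T
        mean = m + KA @ w                           # update computed from the prior
        cov = K - KA @ G_pinv @ KA.T
    return mean, cov


# Illustrative problem: two boundary values and one second-difference observation.
d = 5
idx = np.arange(d)
K = 0.5 ** np.abs(idx[:, None] - idx[None, :])
m = np.zeros(d)
operators = [np.eye(d)[0], np.eye(d)[-1], np.array([0.0, -1.0, 2.0, -1.0, 0.0])]
observations = [0.0, 0.0, 1.0]
mean, cov = linpde_gp_finite(m, K, operators, observations)
```

Note that every iteration conditions the prior moments `m` and `K` on all observations accumulated so far, and that only the Gram matrix is extended incrementally.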

Performance Considerations

Instead of naively conditioning the previous conditional process on the new observation in each iteration, algorithm 1 always conditions the prior on the accumulated observations. This is because the naive expressions for the conditional moments become more and more complex over time. While, in principle, it is possible to use automatic differentiation (AD) to compute ${\bm{\mathcal{I}}}^{(i)}[{\bm{m}}^{(i-1)}]$, ${\bm{\mathcal{I}}}^{(i)}[{\bm{k}}^{(i-1)}_{:,j}(\cdot,{\bm{x}})]$, and ${\bm{\mathcal{I}}}^{(i)}{\bm{k}}^{(i-1)}({\bm{\mathcal{I}}}^{(i)})^{\prime}$ in each iteration and then evaluate equations 4.10 and 4.11 naively, we found that this is detrimental to the performance of the algorithm. In algorithm 1, we only need to compute ${\bm{\mathcal{I}}}^{(i)}[{\bm{m}}]$, ${\bm{\mathcal{I}}}^{(i)}[{\bm{k}}_{:,j}(\cdot,{\bm{x}})]$, and ${\bm{\mathcal{I}}}^{(i)}{\bm{k}}({\bm{\mathcal{I}}}^{(i)})^{\prime}$ on the prior moments, which are much less complex and cheaper to evaluate. For maximum efficiency, one can compute optimized closed-form expressions for these terms for many combinations of information operator and kernel, which removes the need for automatic differentiation or quadrature. We can avoid unnecessary recomputation of the representer weights at every iteration of the method by means of block-matrix inversion. For instance, if a Cholesky decomposition is used to invert the Gramian ${\bm{G}}^{(i)}$, we can use a variant of the block Cholesky decomposition (Golub and Van Loan, 2013) to update the Cholesky factor of ${\bm{G}}^{(i-1)}$.
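The block Cholesky update can be sketched as follows, assuming that the Gram matrix is symmetric positive definite (the pseudoinverse in algorithm 1 also covers the singular case, which this sketch does not). The function name `extend_cholesky` and the example matrix are hypothetical.

```python
import numpy as np


def extend_cholesky(L_prev, cross, new_block):
    """Cholesky factor of [[G_prev, cross], [cross.T, new_block]],
    given a lower-triangular L_prev with L_prev @ L_prev.T == G_prev."""
    # A dedicated triangular solver would exploit the structure of L_prev;
    # np.linalg.solve is used here only to keep the sketch dependency-free.
    L21 = np.linalg.solve(L_prev, cross).T
    L22 = np.linalg.cholesky(new_block - L21 @ L21.T)  # factor of the Schur complement
    zeros = np.zeros((L_prev.shape[0], L22.shape[1]))
    return np.block([[L_prev, zeros], [L21, L22]])


# Illustrative SPD Gram matrix, split into an old 3x3 block and two new rows.
idx = np.arange(5)
G = 0.5 ** np.abs(idx[:, None] - idx[None, :])
L_prev = np.linalg.cholesky(G[:3, :3])
L_new = extend_cholesky(L_prev, G[:3, 3:], G[3:, 3:])
```

Only the new off-diagonal block and the Schur complement are factorized, so the factor of the previous Gram matrix is reused unchanged.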

Code

A Python implementation of algorithm 1 based on ProbNum (Wenger et al., 2021) and JAX (Bradbury et al., 2018) is available at:

https://github.com/marvinpfoertner/linpde-gp

3.5 Related Work

The area of physics-informed machine learning (Karniadakis et al., 2021) aims at augmenting machine learning models with mechanistic knowledge about physical phenomena, mostly in the form of ordinary and partial differential equations. Recently, there has been growing interest in deep learning–based approaches (Raissi et al., 2019; Li et al., 2020, 2021). However, this model choice makes it inherently difficult to quantify the uncertainty about the solution induced by noise-corrupted input data and inevitable approximation error. Instead, we approach the problem through the lens of probabilistic numerics (Hennig et al., 2015; Cockayne et al., 2019b; Oates and Sullivan, 2019; Owhadi et al., 2019; Hennig et al., 2022), which frames numerical problems as statistical estimation tasks. Probabilistic numerical methods for the solution of PDEs are predominantly based on Gaussian process priors. Our work builds upon and extends these works. Many existing methods aim to find a strong solution to a linear PDE using a collocation scheme (e.g. Graepel 2003; Cockayne et al. 2017; Raissi et al. 2017). Unfortunately, many practically relevant (linear) PDEs only admit weak solutions. Our framework extends existing collocation approaches to weak formulations. Probabilistic numerical methods approximating weak formulations are primarily based on discretization. For example, Cockayne et al. (2019a); Wenger and Hennig (2020) apply a probabilistic linear solver to the linear system arising from discretization. Girolami et al. (2021) propose a statistical version of the finite element method (statFEM), which uses a specific parametric GP prior. However, these approaches do not quantify the inherent discretization error – often the largest source of uncertainty about the solution. In contrast, our framework models this error and additionally admits a broader class of discretizations. Wang et al. (2021); Krämer et al. 
(2022) propose GP-based solvers for strong formulations of time-dependent nonlinear PDEs by leveraging finite-difference approximations to the differential operator and linearization-based approximate inference. While it is possible to apply such methods to linear PDEs, the finite difference approximation of the differential operator introduces additional estimation error. In contrast, the evaluation of the differential operator in our method is exact. Cockayne et al. (2017); Raissi et al. (2017); Girolami et al. (2021) also apply their methods to solve PDE-constrained (Bayesian) inverse problems. Särkkä (2011) directly infers the right-hand side of a linear PDE in strong formulation by observing measurements of the solution through the associated Green’s function. Our approach also builds a belief over an unknown right-hand side without requiring access to a Green’s function. The aforementioned methods use the closure of Gaussian processes under conditioning on observations of the sample paths through a linear operator without proof. Owhadi and Scovel (2018) show how to condition Gaussian measures on an orthogonal direct sum of separable Hilbert spaces on observations of one of the summands. However, this result does not apply to separable Banach spaces such as Hölder spaces, which are ubiquitous in the study of strong solutions of linear PDEs. Furthermore, when it can be applied, it does not translate to Gaussian processes without significant effort. [Footnote 8: The theoretical results of an earlier version of this work were based on the result by Owhadi and Scovel (2018). In order to generalize our framework to Banach spaces, we have adopted a different proof strategy.] Our work therefore provides the theoretical basis for conditioning Gaussian processes on observations of their sample paths made through an arbitrary bounded linear operator with values in $\mathbb{R}^{n}$.
Recent results about the sample spaces of GPs (Steinwart, 2019; Kanagawa et al., 2018) ensure the applicability of our work to practical GP regression problems. From a practitioner’s perspective this allows the modeling flexibility of Gaussian processes via the kernel, while ensuring that conditioning on observations of the sample paths through a linear operator is possible. To our knowledge this is the first complete proof of this widely used property of GPs. Thus, theorem 1 provides the theoretical basis for physics-informed GP regression, including the aforementioned methods for the solution of PDEs. In our work, it enables conditioning on information operators constructed from e.g. PDEs, boundary conditions and general integral equations.

4 Gaussian Process Inference with Linear Operator Observations

Our framework fundamentally relies on the fact that when a Gaussian process prior is conditioned on linear observations of its paths, one obtains a closed-form posterior. This section provides the theoretical foundation for this result. While this property is used widely in the literature (see e.g. Graepel (2003); Rasmussen and Williams (2006); Särkkä (2011); Särkkä et al. (2013); Cockayne et al. (2017); Raissi et al. (2017); Agrell (2019); Albert (2019); Krämer et al. (2022)), to the best of our knowledge there is no proof of its general form, in which observations are made via bounded linear operators mapping a separable Banach function space into $\mathbb{R}^{n}$ rather than via finite-dimensional linear maps acting on a finite number of point evaluations. Owhadi and Scovel (2018) give a proof of a related property for Gaussian measures on separable Hilbert spaces. Here, we extend their results to the case of Gaussian processes. While these perspectives are closely related, significant technical attention needs to be paid for this result to transfer to the GP case. This is essential for our framework, since it allows us to leverage the modeling capabilities provided by specifying a kernel as described in section 3.1.1.

To state the result, let ${\bm{\mathrm{f}}}\sim{\operatorname{\mathcal{GP}}\left({\bm{m}},{\bm{k}}\right)}$ be a (multi-output) GP prior with index set ${\mathbb{X}}$ , ${\bm{\mathcal{L}}}\colon\operatorname{paths}\left({\bm{\mathrm{f}}}\right)\to% \mathbb{R}^{n}$ a linear operator acting on the paths of ${\bm{\mathrm{f}}}$ , and ${\bm{\mathrm{\epsilon}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mu}},{\bm{% \Sigma}}\right)}$ a Gaussian random vector in $\mathbb{R}^{n}$ with ${\bm{\mathrm{\epsilon}}}\perp\!\!\!\!\perp{\bm{\mathrm{f}}}$ . We need to compute the conditional random process

{\bm{\mathrm{f}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{\mathcal{% L}}}[{\bm{\mathrm{f}}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}}

for some ${\bm{y}}\in\mathbb{R}^{n}$ . This object is defined as the family $\left(\left.{\bm{\mathrm{f}}}\nonscript\>\middle|\allowbreak\nonscript\>\mathopen{}{\bm{\mathcal{L}}}[{\bm{\mathrm{f}}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}}\right.\right)\coloneqq\{{\bm{\mathrm{f}}}(x,\cdot)\nonscript\>|\allowbreak\nonscript\>\mathopen{}E\}_{x\in{\mathbb{X}}}$ of conditional random variables (here, we need to work with regular conditional probability measures (Klenke, 2014), since the event $E$ typically has probability 0), where $(\Omega,\mathcal{B}\left(\Omega\right),\mathrm{P})$ is the probability space on which both ${\bm{\mathrm{f}}}$ and ${\bm{\mathrm{\epsilon}}}$ are defined, $E$ is the event $E\coloneqq{\bm{\mathrm{h}}}^{-1}(\{{\bm{y}}\})\in\mathcal{B}\left(\Omega\right)$ , and ${\bm{\mathrm{h}}}$ is the random variable

{\bm{\mathrm{h}}}\colon\Omega\to\mathbb{R}^{n},\omega\mapsto{\bm{\mathcal{L}}}% [{\bm{\mathrm{f}}}(\cdot,\omega)]+{\bm{\mathrm{\epsilon}}}(\omega).

We refer to section B for definitions of the objects mentioned above. For instance, in section 3, we use ${\bm{\mathcal{L}}}\coloneqq\left(\mathcal{D}[\cdot]({\bm{x}}_{i})\right)_{i=1}% ^{n}$ , where $\mathcal{D}$ is a linear differential operator, as well as ${\bm{\mathcal{L}}}[{\bm{\mathrm{f}}}]\coloneqq({\bm{\mathrm{f}}}({\bm{x}}_{i})% )_{i=1}^{n}$ , and, in section 3.2, we additionally use

{\bm{\mathcal{L}}}[{\bm{\mathrm{f}}}]=\int_{{\mathbb{X}}}{\bm{\mathrm{f}}}({% \bm{x}})\,\mathrm{d}{\bm{x}}.

It is well-known that ${\bm{\mathrm{h}}}$ is a Gaussian random vector ${\bm{\mathrm{h}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mathcal{L}}}[{\bm{m}}]+{\bm{\mu}},{\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}}\right)},$ where ${\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}\in\mathbb{R}^{n\times n}$ with $\left({\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}\right)_{i_{1},i_{2}}={\mathcal{L}}_{i_{1}}[{\bm{x}}\mapsto({\mathcal{L}}_{i_{2}}[{\bm{k}}_{j_{1},:}({\bm{x}},\cdot)])_{j_{1}=1}^{n}],$ and that the conditional random process is a (multi-output) Gaussian process

\left.{\bm{\mathrm{f}}}\nonscript\>\middle|\allowbreak\nonscript\>\mathopen{}{% \bm{\mathcal{L}}}[{\bm{\mathrm{f}}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}}\right.% \sim{\operatorname{\mathcal{GP}}\left({\bm{m}}^{{\bm{\mathrm{f}}}\nonscript\>|% \allowbreak\nonscript\>\mathopen{}{\bm{y}}},{\bm{k}}^{{\bm{\mathrm{f}}}% \nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}\right)}

with conditional moments given by

	$\displaystyle{m}^{{\bm{\mathrm{f}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}_{i}({\bm{x}})$	$\displaystyle={m}_{i}({\bm{x}})+{\bm{\mathcal{L}}}[{\bm{k}}_{:,i}(\cdot,{\bm{x}})]^{\top}\left({\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}}\right)^{-1}\left({\bm{y}}-({\bm{\mathcal{L}}}[{\bm{m}}]+{\bm{\mu}})\right),\qquad\text{and}$
	$\displaystyle{k}^{{\bm{\mathrm{f}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}_{i_{1},i_{2}}({\bm{x}}_{1},{\bm{x}}_{2})$	$\displaystyle={k}_{i_{1},i_{2}}({\bm{x}}_{1},{\bm{x}}_{2})-{\bm{\mathcal{L}}}[{\bm{k}}_{:,i_{1}}(\cdot,{\bm{x}}_{1})]^{\top}\left({\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}}\right)^{-1}{\bm{\mathcal{L}}}[{\bm{k}}_{:,i_{2}}(\cdot,{\bm{x}}_{2})].$

Since the above are nontrivial claims about potentially ill-behaved infinite-dimensional objects, a proof is important, if only to identify a precise set of assumptions under which the result holds. For instance, the statement that ${\bm{\mathrm{h}}}$ is a random vector, i.e. a measurable function, is highly nontrivial. To remedy this situation, a major contribution of this work is theorem 1 and corollary 2, together with their proofs in section B, which establish the claims above under realistic assumptions. Hence, besides being the theoretical basis for this work, theorem 1 and corollary 2 also provide theoretical backing for many of the publications cited above. Our results identify a set of mild assumptions, which are easy to verify and widely applicable in practice. Assumption 1 constitutes the common set of assumptions shared by theorem 1 and corollary 2.

Assumption 1.

Let ${\mathrm{f}}\sim{\operatorname{\mathcal{GP}}\left(m,k\right)}$ be a Gaussian process prior with index set ${\mathbb{X}}$ on the probability space $(\Omega,\mathcal{F},\mathrm{P})$ , whose paths lie in a real separable reproducing kernel Banach space (RKBS) ${\mathbb{B}}\subset\mathbb{R}^{{\mathbb{X}}}$ such that $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is a ${\mathbb{B}}$ -valued Gaussian random variable.

For instance, for a 1D domain ${\mathbb{D}}\subset\mathbb{R}$ , a GP prior with half-integer Matérn kernel with smoothness parameter $\nu=p+\frac{1}{2}$ fulfills assumption 1 with ${\mathbb{B}}=C^{p}(\overline{{\mathbb{D}}})$ , i.e. the space of $p$ -times differentiable functions with bounded and uniformly continuous derivatives. Similar results hold in multiple dimensions and for other kernels. See section B.4 for more information on prior selection.

Table 2: theorem 1 provides the theoretical basis to condition on (affine) observations of a Gaussian process. While results like conditioning on derivative evaluations are used ubiquitously (e.g. for monotonic GPs, Bayesian optimization, probabilistic numerical PDE solvers, …) a complete proof does not exist in the literature, to the best of our knowledge.

Observation	Information operator	Reference
Point evaluation	${\bm{\mathrm{f}}}({\bm{x}})$	Bishop (2006)
Finite-dim. affine map	${\bm{A}}{\bm{\mathrm{f}}}({\bm{X}})+{\bm{b}}$	Bishop (2006)
Point evaluation of derivative	$\left.\frac{\partial^{\lvert{\bm{\alpha}}\rvert}}{\partial{\bm{x}}^{{\bm{\alpha}}}}{\mathrm{f}}_{i}\right|_{{\bm{x}}={\bm{x}}^{\prime}}$	corollary 2
Integral	$\int_{{\mathbb{X}}}\langle{\bm{\psi}}({\bm{x}}),{\bm{\mathrm{f}}}({\bm{x}})% \rangle\,\mathrm{d}\mu\left({\bm{x}}\right)$	theorem 1
General affine functionals	${\bm{\mathcal{L}}}[{\bm{\mathrm{f}}}]+{\bm{b}}$	theorem 1

Theorem 1 enables affine observations, in which the GP sample paths enter through one or multiple continuous linear functionals. For example, we used theorem 1 in section 3.2 to condition on observations of an integral of a GP’s paths and in section 3.3 to condition on projections of the paths. To state the result conveniently, we introduce some notation.

Notation 1.

Let assumption 1 hold and let ${\bm{\mathcal{L}}}\colon{\mathbb{B}}\to\mathbb{R}^{n}$ and $\tilde{{\bm{\mathcal{L}}}}\colon{\mathbb{B}}\to\mathbb{R}^{\tilde{n}}$ be bounded linear operators. By ${\bm{\mathcal{L}}}k\tilde{{\bm{\mathcal{L}}}}^{\prime}\in\mathbb{R}^{n\times\tilde{n}}$ we denote the matrix with entries

({\bm{\mathcal{L}}}k\tilde{{\bm{\mathcal{L}}}}^{\prime})_{ij}\coloneqq{\bm{% \mathcal{L}}}[{\bm{x}}\mapsto\tilde{{\bm{\mathcal{L}}}}[k({\bm{x}},\cdot)]_{j}% ]_{i}.

The order in which the operators ${\bm{\mathcal{L}}}$ , $\tilde{{\bm{\mathcal{L}}}}$ are applied to the arguments of $k$ does not matter, i.e. $({\bm{\mathcal{L}}}k\tilde{{\bm{\mathcal{L}}}}^{\prime})_{ij}={\bm{\mathcal{L}% }}[{\bm{x}}\mapsto\tilde{{\bm{\mathcal{L}}}}[k({\bm{x}},\cdot)]_{j}]_{i}=% \tilde{{\bm{\mathcal{L}}}}[{\bm{x}}\mapsto{\bm{\mathcal{L}}}[k(\cdot,{\bm{x}})% ]_{i}]_{j}.$ This motivates the parenthesis-free notation ${\bm{\mathcal{L}}}k\tilde{{\bm{\mathcal{L}}}}^{\prime}$ introduced above.

Theorem 1.

Let assumption 1 hold and let ${\bm{\mathcal{L}}}\colon{\mathbb{B}}\to\mathbb{R}^{n}$ be a bounded linear operator. Then

{\bm{\mathcal{L}}}[{\mathrm{f}}]\sim{\operatorname{\mathcal{N}}\left({\bm{% \mathcal{L}}}[m],{\bm{\mathcal{L}}}k{\bm{\mathcal{L}}}^{\prime}\right)}.

(4.1)

Let ${\bm{\mathrm{\epsilon}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mu}},{\bm{% \Sigma}}\right)}$ be an $\mathbb{R}^{n}$ -valued Gaussian random vector with ${\bm{\mathrm{\epsilon}}}\perp\!\!\!\!\perp{\mathrm{f}}$ . Then, for any ${\bm{y}}\in\mathbb{R}^{n}$ ,

{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{\mathcal{L}}}[% {\mathrm{f}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}}\sim{\operatorname{\mathcal{GP}% }\left(m^{{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}% ,k^{{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}\right% )},

(4.2)

with conditional mean and covariance function given by

m^{{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}({\bm{x% }})=m({\bm{x}})+{\bm{\mathcal{L}}}[k({\bm{x}},\cdot)]^{\top}\left({\bm{% \mathcal{L}}}k{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}}\right)^{\dagger}\left(% {\bm{y}}-\left({\bm{\mathcal{L}}}[m]+{\bm{\mu}}\right)\right),

(4.3)

and

k^{{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}({\bm{x% }}_{1},{\bm{x}}_{2})=k({\bm{x}}_{1},{\bm{x}}_{2})-{\bm{\mathcal{L}}}[k({\bm{x}% }_{1},\cdot)]^{\top}\left({\bm{\mathcal{L}}}k{\bm{\mathcal{L}}}^{\prime}+{\bm{% \Sigma}}\right)^{\dagger}{\bm{\mathcal{L}}}[k(\cdot,{\bm{x}}_{2})].

(4.4)
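As a hedged illustration of equations (4.3) and (4.4), the following minimal sketch conditions a GP on a single integral observation. The setup is an assumption made for this example and is not taken from the case studies: the prior is a zero-mean Wiener process on $[0,1]$ with $k(x_{1},x_{2})=\min(x_{1},x_{2})$, which we take to satisfy assumption 1 with ${\mathbb{B}}=C([0,1])$, and the information operator is $\mathcal{L}[f]=\int_{0}^{1}f(x)\,\mathrm{d}x$, so that $\mathcal{L}[k(x,\cdot)]=x-x^{2}/2$ and $\mathcal{L}k\mathcal{L}^{\prime}=1/3$ are available in closed form. All function names are hypothetical.

```python
# Assumed toy setup (not from the paper): zero-mean Wiener process on [0, 1],
# observed through the single functional L[f] = int_0^1 f(x) dx.

def k(x1, x2):
    """Prior covariance function k(x1, x2) = min(x1, x2)."""
    return min(x1, x2)


def L_k(x):
    """L[k(x, .)] = int_0^1 min(x, t) dt = x - x^2 / 2 for x in [0, 1]."""
    return x - 0.5 * x * x


# L k L' = int_0^1 (x - x^2 / 2) dx = 1/3
L_k_L = 1.0 / 3.0


def condition_on_integral(y, noise_var=0.0):
    """Posterior mean (4.3) and covariance (4.4) for n = 1, m = 0, mu = 0."""
    gram = L_k_L + noise_var
    # Scalar Moore-Penrose pseudoinverse of the 1x1 Gram matrix.
    gram_pinv = 1.0 / gram if gram > 0.0 else 0.0

    def mean(x):
        return L_k(x) * gram_pinv * y

    def cov(x1, x2):
        return k(x1, x2) - L_k(x1) * gram_pinv * L_k(x2)

    return mean, cov


def midpoint_integral(fn, num_cells=2000):
    """Midpoint rule on [0, 1], used only to check the result numerically."""
    h = 1.0 / num_cells
    return h * sum(fn((i + 0.5) * h) for i in range(num_cells))


post_mean, post_cov = condition_on_integral(y=0.7)
integral_of_mean = midpoint_integral(post_mean)
```

With a noise-free observation, the integral of the posterior mean reproduces the observed value $y$ up to quadrature error, which is the behavior one expects from conditioning on $\mathcal{L}[\mathrm{f}]={\bm{y}}$.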

Finally, we turn to corollary 2, which is the result most widely used throughout the literature (Graepel, 2003; Särkkä, 2011; Särkkä et al., 2013; Cockayne et al., 2017; Raissi et al., 2017; Agrell, 2019; Albert, 2019; Krämer et al., 2022). It shows how Gaussian processes can be conditioned on point evaluations of the image of their paths under a linear operator, provided that the linear operator is bounded and maps into a separable Banach function space on which point evaluation is continuous. Moreover, it shows that, under these conditions, the image of the GP under the linear operator is itself a Gaussian process. Again, we introduce some notation to facilitate stating the result.

Notation 2.

Let assumption 1 hold and let $\mathcal{L}_{i}\colon{\mathbb{B}}\to{\mathbb{B}}_{i}$ for $i=1,2$ be bounded linear operators mapping into real separable RKBSs ${\mathbb{B}}_{i}\subset\mathbb{R}^{{\mathbb{X}}_{i}}$ , respectively. In analogy to notation 1, we define the bivariate functions

$\displaystyle k\mathcal{L}_{2}^{\prime}\colon\mathbb{X}\times\mathbb{X}_{2}\to\mathbb{R},\quad({\bm{x}},{\bm{x}}_{2})\mapsto\mathcal{L}_{2}[k({\bm{x}},\cdot)]({\bm{x}}_{2}),$	(4.5)
$\displaystyle\mathcal{L}_{1}k\colon\mathbb{X}_{1}\times\mathbb{X}\to\mathbb{R},\quad({\bm{x}}_{1},{\bm{x}})\mapsto\mathcal{L}_{1}[k(\cdot,{\bm{x}})]({\bm{x}}_{1}),\qquad\text{and}$	(4.6)
$\displaystyle\mathcal{L}_{1}k\mathcal{L}_{2}^{\prime}\colon\mathbb{X}_{1}\times\mathbb{X}_{2}\to\mathbb{R},\quad({\bm{x}}_{1},{\bm{x}}_{2})\mapsto\mathcal{L}_{1}[(k\mathcal{L}_{2}^{\prime})(\cdot,{\bm{x}}_{2})]({\bm{x}}_{1}).$	(4.7)

Corollary 2.

Let assumption 1 hold and let $\mathcal{L}\colon{\mathbb{B}}\to\tilde{{\mathbb{B}}}$ be a linear operator mapping into a real vector space $\tilde{{\mathbb{B}}}\subset\mathbb{R}^{\tilde{{\mathbb{X}}}}$ such that $\delta_{\tilde{{\bm{x}}}}\circ\mathcal{L}$ is bounded for all $\tilde{{\bm{x}}}\in\tilde{{\mathbb{X}}}$ . Then

\mathcal{L}[{\mathrm{f}}]\sim{\operatorname{\mathcal{GP}}\left(\mathcal{L}[m],% \mathcal{L}k\mathcal{L}^{\prime}\right)}.

(4.8)

Let ${\bm{\mathrm{\epsilon}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mu}},{\bm{% \Sigma}}\right)}$ with values in $\mathbb{R}^{n}$ and ${\bm{\mathrm{\epsilon}}}\perp\!\!\!\!\perp{\mathrm{f}}$ . Then, for $\tilde{{\bm{X}}}=(\tilde{{\bm{x}}}_{i})_{i=1}^{n}\in\tilde{{\mathbb{X}}}^{n}$ and ${\bm{y}}\in\mathbb{R}^{n}$ ,

{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\mathcal{L}[{% \mathrm{f}}](\tilde{{\bm{X}}})+{\bm{\mathrm{\epsilon}}}={\bm{y}}\sim{% \operatorname{\mathcal{GP}}\left(m^{{\mathrm{f}}\nonscript\>|\allowbreak% \nonscript\>\mathopen{}{\bm{y}}},k^{{\mathrm{f}}\nonscript\>|\allowbreak% \nonscript\>\mathopen{}{\bm{y}}}\right)}

(4.9)

with

m^{{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}({\bm{x}})\coloneqq m({\bm{x}})+(k\mathcal{L}^{\prime})({\bm{x}},\tilde{{\bm{X}}})^{\top}\left((\mathcal{L}k\mathcal{L}^{\prime})(\tilde{{\bm{X}}},\tilde{{\bm{X}}})+{\bm{\Sigma}}\right)^{\dagger}\left({\bm{y}}-\left(\mathcal{L}[m](\tilde{{\bm{X}}})+{\bm{\mu}}\right)\right)

(4.10)

and

k^{{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}({\bm{x% }}_{1},{\bm{x}}_{2})\coloneqq k({\bm{x}}_{1},{\bm{x}}_{2})-(k\mathcal{L}^{% \prime})({\bm{x}}_{1},\tilde{{\bm{X}}})^{\top}\left((\mathcal{L}k\mathcal{L}^{% \prime})(\tilde{{\bm{X}}},\tilde{{\bm{X}}})+{\bm{\Sigma}}\right)^{\dagger}(% \mathcal{L}k)(\tilde{{\bm{X}}},{\bm{x}}_{2})

(4.11)

If additionally $\tilde{{\mathbb{X}}}={\mathbb{X}}$ , then

\begin{pmatrix}{\mathrm{f}}\\ \mathcal{L}[{\mathrm{f}}]\end{pmatrix}\sim{\operatorname{\mathcal{GP}}\left(% \begin{pmatrix}m\\ \mathcal{L}[m]\end{pmatrix},\begin{pmatrix}k&k\mathcal{L}^{\prime}\\ \mathcal{L}k&\mathcal{L}k\mathcal{L}^{\prime}\end{pmatrix}\right)}.

(4.12)

Remark 6.

The assumptions about $\mathcal{L}$ from corollary 2 are fulfilled if $\tilde{{\mathbb{B}}}$ is an RKBS and $\mathcal{L}$ is bounded. However, these conditions are not necessary.

Corollary 2 is the theoretical basis for most of section 3.2. For $\mathcal{L}=\operatorname{id}_{{\mathbb{B}}}$ , we recover standard GP regression as a special case. Finally, both Theorem 1 and Corollary 2 apply also to vector-valued Gaussian processes.
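To make the formulas of corollary 2 concrete, here is a minimal sketch for the special case $\mathcal{L}=\frac{\mathrm{d}}{\mathrm{d}x}$ with two noise-free derivative observations. The squared-exponential kernel, its lengthscale, the observation locations, and all function names are assumptions chosen for illustration; we also assume, without verifying it here, that this prior satisfies the conditions of the corollary. For this kernel, $k\mathcal{L}^{\prime}$ and $\mathcal{L}k\mathcal{L}^{\prime}$ from notation 2 are first and mixed second partial derivatives of $k$.

```python
import math

ELL = 1.0  # assumed lengthscale of the squared-exponential kernel


def k(x1, x2):
    """Squared-exponential prior covariance function."""
    return math.exp(-0.5 * (x1 - x2) ** 2 / ELL**2)


def k_Lp(x, xt):
    """(k L')(x, xt): derivative of k(x, .) with respect to its argument at xt."""
    return k(x, xt) * (x - xt) / ELL**2


def L_k_Lp(xt1, xt2):
    """(L k L')(xt1, xt2): mixed second derivative of k."""
    r = xt1 - xt2
    return k(xt1, xt2) * (1.0 / ELL**2 - r**2 / ELL**4)


def posterior_mean(xts, ys, noise_var=0.0):
    """Posterior mean (4.10) for two derivative observations, m = 0, mu = 0."""
    a = L_k_Lp(xts[0], xts[0]) + noise_var
    b = L_k_Lp(xts[0], xts[1])
    d = L_k_Lp(xts[1], xts[1]) + noise_var
    det = a * d - b * b  # symmetric 2x2 Gram matrix, assumed nonsingular here
    w0 = (d * ys[0] - b * ys[1]) / det
    w1 = (a * ys[1] - b * ys[0]) / det
    return lambda x: k_Lp(x, xts[0]) * w0 + k_Lp(x, xts[1]) * w1


def slope(fn, x, h=1e-5):
    """Central finite difference, used only to check the result."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


xts = (0.0, 0.5)
ys = (1.0, -0.5)
post_mean = posterior_mean(xts, ys)
slopes = [slope(post_mean, xt) for xt in xts]
```

In this noise-free setting, the derivative of the posterior mean at each observation location matches the corresponding entry of ${\bm{y}}$ up to finite-difference error. The explicit $2\times 2$ inverse stands in for the pseudoinverse in (4.10) and is only valid because the Gram matrix is nonsingular for the chosen locations.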

Remark 7 (Multi-Output Gaussian Processes).

Theorem 1 and corollary 2 also apply to multi-output GPs ${\bm{\mathrm{f}}}$ . In this case, we interpret the sample paths ${\bm{\mathrm{f}}}(\cdot,\omega)\colon{\mathbb{X}}\to\mathbb{R}^{d^{\prime}}$ of the multi-output GP as sample paths $\tilde{{\mathrm{f}}}(\cdot,\omega)\colon I\times{\mathbb{X}}\to\mathbb{R},\ \tilde{{\mathrm{f}}}((i,{\bm{x}}),\omega)\coloneqq{\mathrm{f}}_{i}({\bm{x}},\omega)$ of a regular GP with index set $I\times{\mathbb{X}}$ , where $I=\{1,\dotsc,d^{\prime}\}$ (see section 2.2). We also generalize notation like ${\bm{\mathcal{L}}}{\bm{k}}{\bm{\mathcal{L}}}^{\prime}$ accordingly.

5 Conclusion

In this work, we developed a probabilistic framework for the solution of (systems of) linear partial differential equations, which can be interpreted as physics-informed Gaussian process regression. It enables the seamless fusion of (1) a-priori known, provable properties of the system of interest, (2) exact and partial mechanistic information, (3) subjective domain expertise, as well as (4) noisy empirical measurements into a unified scientific model. This model outputs a consistent uncertainty estimate, which quantifies the inherent approximation error in addition to the uncertainty arising from partially-known physics, as well as limited-precision measurements. Our framework fundamentally relies on the closure of Gaussian processes under conditioning on observations of their sample paths through an arbitrary bounded linear operator. While this result has been used ubiquitously in the literature, a rigorous proof for linear operator observations, as needed in the PDE setting, did not exist prior to this work to the best of our knowledge. Our work generalizes and unifies several related formulations of GP-PDE inference. Importantly, our formulation extends these ideas to virtually all popular methods for PDE simulation, revealing them to be a form of Gaussian process inference and in turn clarifying the underlying (probabilistic) assumptions. More specifically, with a specific choice of prior and information operator, our framework recovers methods of weighted residuals, a popular family of numerical methods for the solution of (linear) PDEs, which includes generalized Galerkin methods such as finite element and spectral methods. This demonstrates that classical linear PDE solvers can be generalized in their functionality to include approximate input data and can be equipped with a structured uncertainty estimate. Our work outlines a general framework for the integration of mechanistic building blocks in the form of information operators derived from e.g. linear PDEs into probabilistic models. Our case study shows that the language of information operators is a powerful toolkit for aggregating heterogeneous sources of partial information in a joint probabilistic model, especially in the context of physics-informed machine learning. This opens up several interesting lines of research. For example, the choice of prior and information operator is not fixed and can be tailored to the problem at hand. The design of adaptive information operators, which actively collect information based on the current belief about the solution, could prove to be a promising research direction. Further, the uncertainty estimate about the solution could be used to inform experimental design choices. For example, in the case study from Section 3.2, the posterior belief can be used to optimize the locations of the digital thermal sensors in future CPU designs. Finally, it remains an open question whether this framework can be adapted to nonlinear partial differential equations in a similar manner to how many classic methods solve a sequence of linearized problems to approximate the solution of a nonlinear PDE.

Acknowledgments

MP, PH and JW gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence “Machine Learning - New Perspectives for Science”, EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting MP and JW. IS thanks the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for supporting this work by funding EXC 2075-390740016 under Germany’s Excellence Strategy. IS also acknowledges support by the Stuttgart Center for Simulation Science (SimTech).

Finally, the authors are grateful to Nathaël Da Costa and Filip Tronarp for many invaluable discussions concerning the theoretical part of this work.

A Proofs for Section 3.3

Proof of proposition 3

By theorem 1, we have

	$\displaystyle{m}^{{\bm{\mathrm{u}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}_{i}({\bm{x}})$	$\displaystyle={m}_{i}({\bm{x}})+(\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})[{\bm{k}}_{:,i}(\cdot,{\bm{x}})]^{\top}\left((\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}){\bm{k}}(\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})^{\prime}\right)^{-1}\left(\hat{{\bm{f}}}-\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]\right)$
		$\displaystyle={m}_{i}({\bm{x}})+{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{k}}_{:,i}(\cdot,{\bm{x}})]^{\top}\hat{{\bm{D}}}^{\top}\left(\hat{{\bm{D}}}{\bm{\Sigma}}_{\bm{\mathrm{c}}}\hat{{\bm{D}}}^{\top}\right)^{-1}\hat{{\bm{D}}}\left(\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}-{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]\right)$
		$\displaystyle={m}_{i}({\bm{x}})+{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{k}}_{:,i}(\cdot,{\bm{x}})]^{\top}{\bm{\Sigma}}_{\bm{\mathrm{c}}}^{-1}\left(\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}-{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]\right).$

Since $\mathcal{P}_{\hat{{\mathbb{U}}}}$ is a bounded projection, we have ${\mathbb{U}}=\operatorname{ran}(\mathcal{P}_{\hat{{\mathbb{U}}}})\oplus% \operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})=\hat{{\mathbb{U}}}\oplus% \operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$ (see Rudin, 1991, Section 5.16), where each ${\bm{u}}\in{\mathbb{U}}$ decomposes uniquely into ${\bm{u}}={\bm{u}}_{\hat{{\mathbb{U}}}}+{\bm{u}}_{\operatorname{ker}(\mathcal{P% }_{\hat{{\mathbb{U}}}})}$ with ${\bm{u}}_{\hat{{\mathbb{U}}}}\in\hat{{\mathbb{U}}}$ and ${\bm{u}}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\in% \operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})$ . It is clear that ${\bm{u}}_{\hat{{\mathbb{U}}}}=\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{u}}],$ and ${\bm{u}}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}=\left(% \operatorname{id}-\mathcal{P}_{\hat{{\mathbb{U}}}}\right)[{\bm{u}}]=\mathcal{P% }_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}[{\bm{u}}].$ This implies

	$\displaystyle{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}^{{\bm{\mathrm{u}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}]$	$\displaystyle={\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]+\underbrace{{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}}_{={\bm{\Sigma}}_{\bm{\mathrm{c}}}}{\bm{\Sigma}}_{\bm{\mathrm{c}}}^{-1}\left(\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}-{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]\right)$
		$\displaystyle={\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]+\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}-{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]$
		$\displaystyle=\hat{{\bm{D}}}^{-1}\hat{{\bm{f}}}={\bm{c}}^{\mathrm{MWR}}.$

Hence, we have

\displaystyle\mathcal{P}_{\hat{{\mathbb{U}}}}[{\bm{m}}^{{\bm{\mathrm{u}}}% \nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}]% =\sum_{i=1}^{m}\left(\mathcal{P}_{\mathbb{R}^{n}}[{\bm{m}}^{{\bm{\mathrm{u}}}% \nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}]% \right)_{i}{\bm{\phi}}^{(i)}=\sum_{i=1}^{m}{c}^{\mathrm{MWR}}_{i}{\bm{\phi}}^{% (i)}={\bm{u}}^{\mathrm{MWR}}\in\hat{{\mathbb{U}}}

and since ${\mathbb{U}}=\hat{{\mathbb{U}}}\oplus\operatorname{ker}(\mathcal{P}_{\hat{{% \mathbb{U}}}})$ , the statement follows. Moreover, $\mathcal{P}_{\mathbb{R}^{n}}[{\bm{m}}^{{\bm{\mathrm{u}}}\nonscript\>|% \allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}]$ is the mean of ${\bm{\mathrm{c}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}}% {\bm{\mathrm{c}}}-\hat{{\bm{f}}}={\bm{0}}$ and its covariance matrix is given by

	$\displaystyle{\bm{\Sigma}}^{{\bm{\mathrm{c}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}$	$\displaystyle={\bm{\Sigma}}_{\bm{\mathrm{c}}}-{\bm{\Sigma}}_{\bm{\mathrm{c}}}\hat{{\bm{D}}}^{\top}\left(\hat{{\bm{D}}}{\bm{\Sigma}}_{\bm{\mathrm{c}}}\hat{{\bm{D}}}^{\top}\right)^{-1}\hat{{\bm{D}}}{\bm{\Sigma}}_{\bm{\mathrm{c}}}$
		$\displaystyle={\bm{\Sigma}}_{\bm{\mathrm{c}}}-{\bm{\Sigma}}_{\bm{\mathrm{c}}}\hat{{\bm{D}}}^{\top}(\hat{{\bm{D}}}^{\top})^{-1}{\bm{\Sigma}}_{\bm{\mathrm{c}}}^{-1}\hat{{\bm{D}}}^{-1}\hat{{\bm{D}}}{\bm{\Sigma}}_{\bm{\mathrm{c}}}$
		$\displaystyle={\bm{\Sigma}}_{\bm{\mathrm{c}}}-{\bm{\Sigma}}_{\bm{\mathrm{c}}}{\bm{\Sigma}}_{\bm{\mathrm{c}}}^{-1}{\bm{\Sigma}}_{\bm{\mathrm{c}}}={\bm{0}}.$

Consequently, ${\bm{\mathrm{c}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}}% {\bm{\mathrm{c}}}-\hat{{\bm{f}}}={\bm{0}}\sim\delta_{{\bm{c}}^{\mathrm{MWR}}}$ . ∎
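The collapse of the coefficient posterior onto ${\bm{c}}^{\mathrm{MWR}}$ can be checked numerically. The sketch below uses small hypothetical $2\times 2$ matrices for ${\bm{\Sigma}}_{\bm{\mathrm{c}}}$ and $\hat{{\bm{D}}}$ (assumed symmetric positive definite and invertible, respectively) and a hypothetical prior mean for ${\bm{\mathrm{c}}}$; none of these values come from the paper, and the helper functions are written out only to keep the example self-contained.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical inputs: prior covariance and mean of c, invertible D_hat, data f_hat.
sigma_c = [[2.0, 0.5], [0.5, 1.0]]
mean_c = [[0.3], [-0.2]]
d_hat = [[1.0, 2.0], [0.0, 3.0]]
f_hat = [[1.0], [2.0]]

# Gain Sigma_c D^T (D Sigma_c D^T)^{-1} of the linear-Gaussian update.
gain = matmul(matmul(sigma_c, transpose(d_hat)),
              inverse_2x2(matmul(matmul(d_hat, sigma_c), transpose(d_hat))))

# Posterior mean and covariance of c given D_hat c - f_hat = 0.
post_mean = add(mean_c, matmul(gain, subtract(f_hat, matmul(d_hat, mean_c))))
post_cov = subtract(sigma_c, matmul(matmul(gain, d_hat), sigma_c))

# Classical solution of the square linear system D_hat c = f_hat.
c_mwr = matmul(inverse_2x2(d_hat), f_hat)
```

Up to floating-point rounding, `post_mean` coincides with `c_mwr` and every entry of `post_cov` vanishes, independently of the assumed prior mean, in line with the Dirac posterior derived above.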

Proof of corollary 4

		$\displaystyle\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}[{\bm{m}}^{{\bm{\mathrm{u}}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}\hat{{\bm{D}}},\hat{{\bm{f}}}}]({\bm{x}})$
	$\displaystyle=$	$\displaystyle\underbrace{\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}[{\bm{m}}]}_{=0}({\bm{x}})+(\delta_{\bm{x}}\circ\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}){\bm{k}}(\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})^{\prime}\left((\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}){\bm{k}}(\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})^{\prime}\right)^{-1}\left(\hat{{\bm{f}}}-\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]\right)$
	$\displaystyle=$	$\displaystyle\delta_{\bm{x}}[\underbrace{\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}{\bm{k}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}^{\prime}}_{={\bm{0}}}]\hat{{\bm{D}}}^{\top}\left((\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}){\bm{k}}(\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}})^{\prime}\right)^{-1}\left(\hat{{\bm{f}}}-\hat{{\bm{D}}}{\bm{\mathcal{P}}}_{\mathbb{R}^{m}}[{\bm{m}}]\right)={\bm{0}}$

∎

Proof of proposition 5

The process ${\bm{\mathrm{u}}}$ can be constructed as the sum of independent samples from the processes $\mathcal{P}_{\hat{{\mathbb{U}}}}[\tilde{{\bm{\mathrm{u}}}}]$ and $\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}[\tilde{{\bm{\mathrm{u}}}}-\tilde{{\bm{m}}}]$ . This proves that the sample paths lie in ${\mathbb{U}}$ . Since $\mathcal{P}_{\hat{{\mathbb{U}}}}$ is idempotent, we have $\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\mathcal{P}_{\hat{{\mathbb{U}}}}=\mathcal{P}_{\hat{{\mathbb{U}}}}-\mathcal{P}_{\hat{{\mathbb{U}}}}^{2}=\mathcal{P}_{\hat{{\mathbb{U}}}}-\mathcal{P}_{\hat{{\mathbb{U}}}}=0$ and $\mathcal{P}_{\mathbb{R}^{n}}\mathcal{P}_{\hat{{\mathbb{U}}}}=(\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}})^{-1}\mathcal{P}_{\hat{{\mathbb{U}}}}^{2}=(\mathcal{I}_{\mathbb{R}^{m}}^{\hat{{\mathbb{U}}}})^{-1}\mathcal{P}_{\hat{{\mathbb{U}}}}=\mathcal{P}_{\mathbb{R}^{n}}.$ It follows that

	$\displaystyle\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}{\bm{k}}\mathcal{P}_{\mathbb{R}^{n}}^{*}$	$\displaystyle=\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\tilde{{\bm{k}}}\mathcal{P}_{\mathbb{R}^{n}}^{*}-\underbrace{\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\mathcal{P}_{\hat{{\mathbb{U}}}}}_{=0}\tilde{{\bm{k}}}$
		$\displaystyle\qquad-\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\tilde{{\bm{k}}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{*}\mathcal{P}_{\mathbb{R}^{n}}^{*}+2\underbrace{\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\mathcal{P}_{\hat{{\mathbb{U}}}}}_{=0}\tilde{{\bm{k}}}\mathcal{P}_{\hat{{\mathbb{U}}}}^{*}\mathcal{P}_{\mathbb{R}^{n}}^{*}$
		$\displaystyle=\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\tilde{{\bm{k}}}\mathcal{P}_{\mathbb{R}^{n}}^{*}-\mathcal{P}_{\operatorname{ker}(\mathcal{P}_{\hat{{\mathbb{U}}}})}\tilde{{\bm{k}}}\underbrace{\left(\mathcal{P}_{\mathbb{R}^{n}}\mathcal{P}_{\hat{{\mathbb{U}}}}\right)^{*}}_{=\mathcal{P}_{\mathbb{R}^{n}}^{*}}=0.$

∎

B Proofs for Section 4

Using the rules of linear-Gaussian inference (Bishop, 2006), we can easily see that

	$\displaystyle{\mathrm{f}}$	$\displaystyle\sim{\operatorname{\mathcal{GP}}\left(m,k\right)}$
	$\displaystyle{\bm{A}}{\mathrm{f}}({\bm{X}})$	$\displaystyle\sim{\operatorname{\mathcal{N}}\left({\bm{A}}m({\bm{X}}),{\bm{A}}% k({\bm{X}},{\bm{X}}){\bm{A}}^{\top}\right)}$
	$\displaystyle{\mathrm{f}}\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{A}}{\mathrm{f}}({\bm{X}})+{\bm{\mathrm{b}}}={\bm{y}}$	$\displaystyle\sim{\operatorname{\mathcal{GP}}\left(m^{{\mathrm{f}}\mid{\bm{y}}},k^{{\mathrm{f}}\mid{\bm{y}}}\right)},$

where ${\bm{A}}\in\mathbb{R}^{m\times n}$ , ${\bm{X}}=({\bm{x}}_{i})_{i=1}^{n}\in{\mathbb{X}}^{n}$ , ${\bm{\mathrm{b}}}\sim{\operatorname{\mathcal{N}}\left({\bm{\mu}},{\bm{\Sigma}}\right)}$ with ${\bm{\mathrm{b}}}\perp\!\!\!\!\perp{\mathrm{f}}$ and

	$\displaystyle m^{{\mathrm{f}}\mid{\bm{y}}}({\bm{x}})$	$\displaystyle\coloneqq m({\bm{x}})+k({\bm{x}},{\bm{X}}){\bm{A}}^{\top}({\bm{A}}k({\bm{X}},{\bm{X}}){\bm{A}}^{\top}+{\bm{\Sigma}})^{\dagger}({\bm{y}}-({\bm{A}}m({\bm{X}})+{\bm{\mu}}))$
	$\displaystyle k^{{\mathrm{f}}\mid{\bm{y}}}({\bm{x}}_{1},{\bm{x}}_{2})$	$\displaystyle\coloneqq k({\bm{x}}_{1},{\bm{x}}_{2})-k({\bm{x}}_{1},{\bm{X}}){% \bm{A}}^{\top}({\bm{A}}k({\bm{X}},{\bm{X}}){\bm{A}}^{\top}+{\bm{\Sigma}})^{% \dagger}{\bm{A}}k({\bm{X}},{\bm{x}}_{2}).$
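The finite-dimensional update can be sketched in a few lines. As an assumed toy instance (not taken from the paper), we condition a zero-mean GP with squared-exponential kernel on an observed difference $\mathrm{f}(x_{a})-\mathrm{f}(x_{b})=y$, i.e. ${\bm{A}}=(1,-1)$ and ${\bm{X}}=(x_{a},x_{b})$, so that the Gram matrix is a scalar. The function names and numerical values are hypothetical.

```python
import math


def k(x1, x2, ell=1.0):
    """Squared-exponential covariance function with assumed lengthscale ell."""
    return math.exp(-0.5 * (x1 - x2) ** 2 / ell**2)


def condition_on_difference(xa, xb, y, noise_var=0.0):
    """Condition a zero-mean GP on A f(X) + b = y with A = [1, -1], X = (xa, xb)."""
    # A k(X, X) A^T + Sigma for A = [1, -1] is a scalar.
    gram = k(xa, xa) - 2.0 * k(xa, xb) + k(xb, xb) + noise_var
    # Scalar Moore-Penrose pseudoinverse.
    gram_pinv = 1.0 / gram if gram > 0.0 else 0.0

    def k_At(x):
        """k(x, X) A^T for A = [1, -1]."""
        return k(x, xa) - k(x, xb)

    def mean(x):
        return k_At(x) * gram_pinv * y  # prior mean m = 0 and noise mean mu = 0

    def cov(x1, x2):
        return k(x1, x2) - k_At(x1) * gram_pinv * k_At(x2)

    return mean, cov


xa, xb, y = 0.2, 0.9, 1.5
post_mean, post_cov = condition_on_difference(xa, xb, y)
mean_difference = post_mean(xa) - post_mean(xb)
variance_of_difference = post_cov(xa, xa) - 2.0 * post_cov(xa, xb) + post_cov(xb, xb)
```

Without noise, the posterior mean reproduces the observed difference and the posterior variance of that difference is zero up to rounding. Note that the computation only ever touches evaluations of the GP at the two points in ${\bm{X}}$.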

It is tempting to think that the above also extends to more general linear transformations of ${\mathrm{f}}$ such as differentiation at a point ${\bm{x}}\in{\mathbb{X}}$ and integration. Unfortunately, this is not the case, since the result from Bishop (2006) heavily uses the fact that, by definition, evaluations of the Gaussian process at a finite set of points follow a joint Gaussian distribution. However, differentiation at a point and integration are examples of linear functionals, i.e. linear maps from a vector space of functions to the real numbers, which operate on an (uncountably) infinite subset of random variables.

To generalize the result above to general linear operators ${\bm{\mathcal{L}}}$ mapping the paths of ${\mathrm{f}}$ into $\mathbb{R}^{n}$ , we will take the following route:

1.

In section B.2, we will show that, under certain conditions on ${\mathrm{f}}$ and ${\mathbb{X}}$ , the function $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is a Gaussian random variable with values in a separable Banach space ${\mathbb{B}}$ of real-valued functions on ${\mathbb{X}}$ . We introduce Gaussian random variables on separable Banach spaces and their essential properties in section B.1.

2.

Under the assumption that ${\bm{\mathcal{L}}}$ is continuous, we can use the transformation properties of Gaussian random variables on separable Banach spaces (see lemma 11) to show that ${\bm{\mathcal{L}}}[{\mathrm{f}}]$ and, for ${\bm{X}}\in{\mathbb{X}}^{m}$ , also

\begin{pmatrix}{\mathrm{f}}({\bm{X}})\\ {\bm{\mathcal{L}}}[{\mathrm{f}}]\end{pmatrix}

are Gaussian random variables on $\mathbb{R}^{n}$ and $\mathbb{R}^{m+n}$ , respectively.

3.

Finally, in section B.3, we can then use the well-known linear-Gaussian inference theorem (Bishop, 2006) to show that $\left.{\mathrm{f}}\nonscript\>\middle|\allowbreak\nonscript\>\mathopen{}{\bm{% \mathcal{L}}}[{\mathrm{f}}]={\bm{y}}\right.$ is again a Gaussian process.

In the following, $\mathcal{B}\left({\mathbb{B}}\right)$ denotes the Borel $\sigma$ -algebra generated by the norm topology on a Banach space ${\mathbb{B}}$ .

B.1 Gaussian Measures on Separable Banach Spaces

As stated before, the function $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ will often turn out to be a Gaussian random variable with values in an infinite-dimensional separable Banach space ${\mathbb{B}}\supseteq\operatorname{paths}\left({\mathrm{f}}\right)$ of real-valued functions on ${\mathbb{X}}$ (see proposition 20).

Definition 8.

Let ${\mathbb{B}}$ be a real separable Banach space. A Borel probability measure $\mu$ on $({\mathbb{B}},\mathcal{B}\left({\mathbb{B}}\right))$ is called Gaussian if every continuous linear functional $l\in{\mathbb{B}}^{\prime}$ is a univariate Gaussian random variable. A ${\mathbb{B}}$ -valued random variable is called Gaussian if its law is Gaussian.

Just as for Gaussian random variables on the Euclidean vector space $\mathbb{R}^{n}$ , we can define a mean and covariance (operator) for their counterparts on general separable Banach spaces.

Proposition 9.

Let ${\mathrm{b}}$ be a Gaussian random variable on $(\Omega,\mathcal{F},\mathrm{P})$ with values in a real separable Banach space ${\mathbb{B}}$ . Then there is a unique $m_{\mathrm{b}}\in{\mathbb{B}}$ such that $l[m_{\mathrm{b}}]=\operatorname{\mathbb{E}}_{{\mathrm{b}}}\left[l[{\mathrm{b}}% ]\right]$ for any $l\in{\mathbb{B}}^{\prime}$ . We refer to $m_{\mathrm{b}}$ as the mean (vector) of ${\mathrm{b}}$ . Moreover, there is a unique bounded linear operator $\mathcal{C}_{\mathrm{b}}\colon{\mathbb{B}}^{\prime}\to{\mathbb{B}}$ such that $l_{2}[\mathcal{C}_{\mathrm{b}}[l_{1}]]=\operatorname{Cov}_{{\mathrm{b}}}\left[% l_{1}[{\mathrm{b}}],l_{2}[{\mathrm{b}}]\right]$ for any $l_{1},l_{2}\in{\mathbb{B}}^{\prime}$ , the so-called covariance operator of ${\mathrm{b}}$ .

Proof

Fernique’s theorem (Da Prato and Zabczyk, 1992, Theorem 2.7) implies that $\lVert{\mathrm{b}}\rVert_{{\mathbb{B}}}\in L_{p}(\Omega,\mathrm{P})$ for all $p\in\mathbb{N}_{\geq 1}$ . By assumption, ${\mathrm{b}}$ is measurable and ${\mathbb{B}}\supset\operatorname{ran}({\mathrm{b}})$ is separable, which means that ${\mathrm{b}}$ is strongly measurable (Yosida, 1995, Section V.4, Pettis’ Theorem). Since $\lVert{\mathrm{b}}\rVert_{{\mathbb{B}}}\in L_{1}(\Omega,\mathrm{P})$ , it follows that ${\mathrm{b}}$ is Bochner integrable (Yosida, 1995, Section V.5, Theorem 1). Let $l\in{\mathbb{B}}^{\prime}$ . By Corollary 2 in Yosida (1995, Section V.5) we then have

\displaystyle\operatorname{\mathbb{E}}_{{\mathrm{b}}}\left[l[{\mathrm{b}}]% \right]=\int_{\Omega}l\left[{\mathrm{b}}(\omega)\right]\,\mathrm{d}\mathrm{P}% \left(\omega\right)=l\bigg{[}\underbrace{\int_{\Omega}{\mathrm{b}}(\omega)\,% \mathrm{d}\mathrm{P}\left(\omega\right)}_{\eqqcolon m_{\mathrm{b}}}\bigg{]}.

Now assume that there is another $\tilde{m}_{\mathrm{b}}\in{\mathbb{B}}$ with $l[\tilde{m}_{\mathrm{b}}]=\operatorname{\mathbb{E}}_{{\mathrm{b}}}\left[l[{\mathrm{b}}]\right]$ for all $l\in{\mathbb{B}}^{\prime}$. Then

0=\operatorname{\mathbb{E}}_{{\mathrm{b}}}\left[l[{\mathrm{b}}]\right]-% \operatorname{\mathbb{E}}_{{\mathrm{b}}}\left[l[{\mathrm{b}}]\right]=l[\tilde{% m}_{\mathrm{b}}]-l[m_{\mathrm{b}}]=l[\tilde{m}_{\mathrm{b}}-m_{\mathrm{b}}]

and since this holds for all $l\in{\mathbb{B}}^{\prime}$ , it follows that $\tilde{m}_{\mathrm{b}}=m_{\mathrm{b}}$ .

Let $l_{1}\in{\mathbb{B}}^{\prime}$ . Then $\omega\mapsto l_{1}[{\mathrm{b}}(\omega)-m_{\mathrm{b}}]({\mathrm{b}}(\omega)-% m_{\mathrm{b}})$ is clearly weakly measurable and, since ${\mathbb{B}}$ is separable, also strongly measurable (Yosida, 1995, Section V.4, Pettis’ Theorem). As above, Fernique’s theorem shows that $\lVert{\mathrm{b}}\rVert_{{\mathbb{B}}}\in L_{2}(\Omega,\mathrm{P})$ . By the triangle inequality in ${\mathbb{B}}$ and the fact that $\mathrm{P}$ is a probability measure, we also have $\lVert{\mathrm{b}}(\cdot)-m_{\mathrm{b}}\rVert_{{\mathbb{B}}}\in L_{2}(\Omega,% \mathrm{P})$ . It follows that

	$\displaystyle\int_{\Omega}\lVert l_{1}[{\mathrm{b}}(\omega)-m_{\mathrm{b}}]({% \mathrm{b}}(\omega)-m_{\mathrm{b}})\rVert_{{\mathbb{B}}}\,\mathrm{d}\mathrm{P}% \left(\omega\right)$	$\displaystyle=\int_{\Omega}\left\lvert l_{1}[{\mathrm{b}}(\omega)-m_{\mathrm{b% }}]\right\rvert\lVert{\mathrm{b}}(\omega)-m_{\mathrm{b}}\rVert_{{\mathbb{B}}}% \,\mathrm{d}\mathrm{P}\left(\omega\right)$
		$\displaystyle\leq\lVert l_{1}\rVert_{{\mathbb{B}}^{\prime}}\int_{\Omega}\lVert% {\mathrm{b}}(\omega)-m_{\mathrm{b}}\rVert_{{\mathbb{B}}}\lVert{\mathrm{b}}(% \omega)-m_{\mathrm{b}}\rVert_{{\mathbb{B}}}\,\mathrm{d}\mathrm{P}\left(\omega\right)$
		$\displaystyle=\lVert l_{1}\rVert_{{\mathbb{B}}^{\prime}}\big{\lVert}\omega% \mapsto\lVert{\mathrm{b}}(\omega)-m_{\mathrm{b}}\rVert_{{\mathbb{B}}}\rVert_{L% _{2}(\Omega,\mathrm{P})}^{2}<\infty,$

where $\lVert l_{1}\rVert_{{\mathbb{B}}^{\prime}}<\infty$, since $l_{1}$ is continuous. Let $l_{2}\in{\mathbb{B}}^{\prime}$. Again by Corollary 2 in Yosida (1995, Section V.5), we find that

	$\displaystyle\operatorname{Cov}\left[l_{1}[{\mathrm{b}}],l_{2}[{\mathrm{b}}]\right]$	$\displaystyle=\int_{\Omega}l_{1}[{\mathrm{b}}(\omega)-m_{\mathrm{b}}]l_{2}[{\mathrm{b}}(\omega)-m_{\mathrm{b}}]\,\mathrm{d}\mathrm{P}\left(\omega\right)$
		$\displaystyle=l_{2}\bigg{[}\underbrace{\int_{\Omega}l_{1}[{\mathrm{b}}(\omega)-m_{\mathrm{b}}]({\mathrm{b}}(\omega)-m_{\mathrm{b}})\,\mathrm{d}\mathrm{P}\left(\omega\right)}_{\eqqcolon\mathcal{C}_{\mathrm{b}}[l_{1}]}\bigg{]}.$

$\mathcal{C}_{\mathrm{b}}$ is bounded, since

	$\displaystyle\lVert\mathcal{C}_{\mathrm{b}}[l_{1}]\rVert_{{\mathbb{B}}}$	$\displaystyle\leq\int_{\Omega}\lVert l_{1}[{\mathrm{b}}(\omega)-m_{\mathrm{b}}% ]({\mathrm{b}}(\omega)-m_{\mathrm{b}})\rVert_{{\mathbb{B}}}\,\mathrm{d}\mathrm% {P}\left(\omega\right)$
		$\displaystyle\leq\lVert l_{1}\rVert_{{\mathbb{B}}^{\prime}}\underbrace{\big{% \lVert}\omega\mapsto\lVert{\mathrm{b}}(\omega)-m_{\mathrm{b}}\rVert_{{\mathbb{% B}}}\rVert_{L_{2}(\Omega,\mathrm{P})}^{2}}_{\eqqcolon\lVert\mathcal{C}_{% \mathrm{b}}\rVert},$

where $\lVert\mathcal{C}_{\mathrm{b}}\rVert<\infty$ because $\lVert{\mathrm{b}}(\cdot)-m_{\mathrm{b}}\rVert_{{\mathbb{B}}}\in L_{2}(\Omega,% \mathrm{P})$ . Uniqueness follows from an argument analogous to the one used to prove uniqueness of the mean. ∎

Remark 10.

One can show that the mean and the covariance operator of a Gaussian random variable with values in a separable Banach space identify its law uniquely. Hence, we often write ${\operatorname{\mathcal{N}}\left(m,\mathcal{C}\right)}$ to denote Gaussian measures on separable Banach spaces.

B.1.1 Continuous Affine Transformations

Just like their finite-dimensional counterparts, Gaussian random variables with values in separable Banach spaces are closed under continuous linear transformations. Moreover, the expressions for the transformed mean and covariance operator are analogous to the finite-dimensional case. For instance, we will use this result to show that ${\bm{\mathcal{L}}}[{\mathrm{f}}]$ is an $\mathbb{R}^{n}$-valued Gaussian random variable.

Lemma 11.

Let ${\mathrm{b}}\sim{\operatorname{\mathcal{N}}\left(m,\mathcal{C}\right)}$ be a Gaussian random variable on $(\Omega,\mathcal{F},\mathrm{P})$ with values in a real separable Banach space ${\mathbb{B}}$. Let $\mathcal{L}\colon{\mathbb{B}}\to\tilde{{\mathbb{B}}}$ be a bounded linear operator mapping into another real separable Banach space $\tilde{{\mathbb{B}}}$. Then $\mathcal{L}[{\mathrm{b}}]\sim{\operatorname{\mathcal{N}}\left(\mathcal{L}[m],\mathcal{L}\mathcal{C}\mathcal{L}^{\prime}\right)}$.

Proof

$\mathcal{L}$ is continuous and hence Borel measurable, which means that $\mathcal{L}[{\mathrm{b}}]$ is a $\tilde{{\mathbb{B}}}$-valued random variable. Moreover, for $\tilde{l}\in\tilde{{\mathbb{B}}}^{\prime}$, we have $l\coloneqq\tilde{l}\circ\mathcal{L}\in{\mathbb{B}}^{\prime}$ and hence $l[{\mathrm{b}}]=\tilde{l}[\mathcal{L}[{\mathrm{b}}]]$ is Gaussian. This implies that $\mathcal{L}[{\mathrm{b}}]$ is a $\tilde{{\mathbb{B}}}$-valued Gaussian random variable. Moreover, we have $\operatorname{\mathbb{E}}_{{\mathrm{b}}}\left[l[{\mathrm{b}}]\right]=l[m]=\tilde{l}[\mathcal{L}[m]]$, i.e. $\mathcal{L}[m]$ is the mean of $\mathcal{L}[{\mathrm{b}}]$. Let $\tilde{l}_{1},\tilde{l}_{2}\in\tilde{{\mathbb{B}}}^{\prime}$ and define $l_{i}\coloneqq\tilde{l}_{i}\circ\mathcal{L}\in{\mathbb{B}}^{\prime}$ for $i=1,2$. Then

\operatorname{Cov}_{{\mathrm{b}}}\left[l_{1}[{\mathrm{b}}],l_{2}[{\mathrm{b}}]\right]=l_{2}[\mathcal{C}[l_{1}]]=(\tilde{l}_{2}\circ\mathcal{L})[\mathcal{C}[\tilde{l}_{1}\circ\mathcal{L}]]=\tilde{l}_{2}[(\mathcal{L}\mathcal{C}\mathcal{L}^{\prime})[\tilde{l}_{1}]],

where we used $\mathcal{L}^{\prime}[\tilde{l}_{1}]=\tilde{l}_{1}\circ\mathcal{L}$, i.e. $\mathcal{L}\mathcal{C}\mathcal{L}^{\prime}$ is the covariance operator of $\mathcal{L}[{\mathrm{b}}]$. ∎
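In finite dimensions, lemma 11 reduces to the familiar rule that a matrix $A$ maps $\mathcal{N}(m,C)$ to $\mathcal{N}(Am,ACA^{\top})$. The following minimal Python sketch illustrates only this special case. The helper names (`matmul`, `transpose`, `push_forward`) and all numbers are hypothetical and chosen purely for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*A)]


def push_forward(A, mean, cov):
    """Mean and covariance of A b for b ~ N(mean, cov) (finite-dimensional case)."""
    new_mean = [sum(a * m for a, m in zip(row, mean)) for row in A]
    new_cov = matmul(matmul(A, cov), transpose(A))
    return new_mean, new_cov


# Illustrative numbers (assumptions, not taken from the paper).
mean = [1.0, 2.0]
cov = [[2.0, 0.5], [0.5, 1.0]]
A = [[1.0, 1.0]]  # the linear functional b -> b_1 + b_2

new_mean, new_cov = push_forward(A, mean, cov)
print(new_mean, new_cov)
```

Here the variance of $b_{1}+b_{2}$ is obtained as the $1\times 1$ matrix $ACA^{\top}$, which matches the elementary identity $\operatorname{Var}[b_{1}]+\operatorname{Var}[b_{2}]+2\operatorname{Cov}[b_{1},b_{2}]$.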

B.2 Gaussian Processes as Gaussian Random Functions

We have now introduced all necessary preliminaries to show that, under certain assumptions on ${\mathrm{f}}$ and ${\mathbb{X}}$, the function $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is a Gaussian random variable with values in a special kind of separable Banach space, namely a reproducing kernel Banach space (RKBS):

Definition 12 (Lin et al. 2022, Definition 2.1).

A reproducing kernel Banach space (RKBS) $({\mathbb{B}},\lVert\cdot\rVert_{{\mathbb{B}}})$ is a Banach space of real-valued functions on a nonempty set ${\mathbb{X}}$ , on which the point evaluation functionals $\delta_{\bm{x}}$ for ${\bm{x}}\in{\mathbb{X}}$ are continuous.

We formulate theorems 1 and 2 under the set of assumptions stated in assumption 1.

Generalizing an observation from Rajput and Cambanis (1972, Remark 1), we find that assumption 1 is often not about the GP ${\mathrm{f}}$ itself, but rather about the topology of the space ${\mathbb{B}}$. Denote by $\operatorname{scl}_{w*}(\mathbb{L})\coloneqq\{\ell\in{\mathbb{B}}^{\prime}\mid\exists\{\ell_{k}\}_{k\in\mathbb{N}}\subset\mathbb{L}\colon\ell_{k}\to_{w*}\ell\}$ the weak-* sequential closure\footnote{The weak-* sequential closure of $\mathbb{L}$ is not to be confused with the weak-* closure of $\mathbb{L}$. The two notions coincide if ${\mathbb{B}}^{\prime}$ equipped with the weak-* topology is a sequential space, but for many of the dual spaces considered in this work this does not hold.} of a subset $\mathbb{L}\subset{\mathbb{B}}^{\prime}$ of the continuous dual space of a Banach space ${\mathbb{B}}$. Moreover, we define $\operatorname{scl}_{w*}^{n}(\mathbb{L})\coloneqq\operatorname{scl}_{w*}(\operatorname{scl}_{w*}^{n-1}(\mathbb{L}))$ for $n\in\mathbb{N}$ and $\operatorname{scl}_{w*}^{0}(\mathbb{L})\coloneqq\mathbb{L}$.

Theorem 13.

Let ${\mathbb{B}}\subset\mathbb{R}^{{\mathbb{X}}}$ be a real separable RKBS and define $\mathbb{L}_{\delta}\coloneqq\operatorname{span}\{\delta_{\bm{x}}\}_{{\bm{x}}% \in{\mathbb{X}}}\subset{\mathbb{B}}^{\prime}$ . If there is an $n\in\mathbb{N}_{0}$ such that ${\mathbb{B}}^{\prime}=\operatorname{scl}_{w*}^{n}(\mathbb{L}_{\delta})$ , then assumption 1 holds for any GP ${\mathrm{f}}$ with paths in ${\mathbb{B}}$ .

Proof

First, we show by induction on $n\in\mathbb{N}_{0}$ that $\omega\mapsto l[{\mathrm{f}}(\cdot,\omega)]$ is a Gaussian random variable for every $l\in\operatorname{scl}_{w*}^{n}(\mathbb{L}_{\delta})$ . For $n=0$ , we have $\operatorname{scl}_{w*}^{n}(\mathbb{L}_{\delta})=\mathbb{L}_{\delta}$ . Hence, for every $l\in\operatorname{scl}_{w*}^{n}(\mathbb{L}_{\delta})=\mathbb{L}_{\delta}$ , there are $m\in\mathbb{N}$ , $\{\alpha_{k}\}_{k=1}^{m}\subset\mathbb{R}$ , and $\{{\bm{x}}_{k}\}_{k=1}^{m}\subset{\mathbb{X}}$ such that

l[{\mathrm{f}}(\cdot,\omega)]=\sum_{k=1}^{m}\alpha_{k}{\mathrm{f}}({\bm{x}}_{k% },\omega).

Since ${\mathrm{f}}$ is a GP, ${\mathrm{f}}({\bm{x}}_{1},\cdot),\dotsc,{\mathrm{f}}({\bm{x}}_{m},\cdot)$ is jointly Gaussian, which implies that $\omega\mapsto l[{\mathrm{f}}(\cdot,\omega)]$ is a Gaussian random variable. Now let $n>0$ and fix $l\in\operatorname{scl}_{w*}^{n}(\mathbb{L}_{\delta})$ . Then there is a sequence $\{l_{k}\}_{k\in\mathbb{N}}\subset\operatorname{scl}_{w*}^{n-1}(\mathbb{L}_{% \delta})$ such that $l_{k}[f]\to l[f]$ as $k\to\infty$ for every $f\in{\mathbb{B}}$ . By the inductive hypothesis, we know that $\omega\mapsto l_{k}[{\mathrm{f}}(\cdot,\omega)]$ is Gaussian for every $k\in\mathbb{N}$ . It follows that the function $\omega\mapsto l[{\mathrm{f}}(\cdot,\omega)]=\lim_{k\to\infty}l_{k}[{\mathrm{f}% }(\cdot,\omega)]$ is measurable (Klenke, 2014, Theorem 1.92). Moreover, as the pointwise limit of Gaussian random variables is again a Gaussian random variable, we conclude that $\omega\mapsto l[{\mathrm{f}}(\cdot,\omega)]$ is Gaussian.

Under the assumption that ${\mathbb{B}}^{\prime}=\operatorname{scl}_{w*}^{n}(\mathbb{L}_{\delta})$ for some $n\in\mathbb{N}_{0}$ , the above shows that $\omega\mapsto l[{\mathrm{f}}(\cdot,\omega)]$ is a Gaussian random variable for every $l\in{\mathbb{B}}^{\prime}$ . In particular, the map $\omega\mapsto l[{\mathrm{f}}(\cdot,\omega)]$ is $\mathcal{F}$ - $\mathcal{B}\left(\mathbb{R}\right)$ -measurable for all $l\in{\mathbb{B}}^{\prime}$ , i.e. $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is weakly measurable. By the separability of ${\mathbb{B}}$ , it follows that $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is $\mathcal{F}$ - $\mathcal{B}\left({\mathbb{B}}\right)$ -measurable (Bogachev, 1998, Theorem A.3.7). This shows that $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is a ${\mathbb{B}}$ -valued Gaussian random variable. ∎

Corollary 14.

Let ${\mathbb{B}}\subset\mathbb{R}^{{\mathbb{X}}}$ be a real separable RKBS. If $\mathbb{L}_{\delta}\coloneqq\operatorname{span}\{\delta_{\bm{x}}\}_{{\bm{x}}\in{\mathbb{X}}}$ is weak-* sequentially dense\footnote{As before, a weak-* sequentially dense set is not to be confused with a weak-* dense set, since the dual spaces considered in this work are not necessarily sequential w.r.t. the weak-* topology.} in ${\mathbb{B}}^{\prime}$, then assumption 1 holds for any GP ${\mathrm{f}}$ with paths in ${\mathbb{B}}$.

In the following, we show that theorem 13 applies to three important classes of Banach spaces, which often arise in the study of Gaussian processes and PDEs.

Proposition 15.

The set $\mathbb{L}_{\delta}\coloneqq\operatorname{span}\{\delta_{\bm{x}}\}_{{\bm{x}}% \in{\mathbb{X}}}$ is weak-* sequentially dense in the dual ${\mathbb{H}}_{k}^{\prime}$ of any separable RKHS ${\mathbb{H}}_{k}\subset\mathbb{R}^{{\mathbb{X}}}$ .

Proof

Let $l\in{\mathbb{H}}_{k}^{\prime}$. By the Riesz representation theorem, there is $h\in{\mathbb{H}}_{k}$ such that $l=\langle h,\cdot\rangle_{{\mathbb{H}}_{k}}$. Since $\operatorname{span}\{k(\cdot,{\bm{x}})\}_{{\bm{x}}\in{\mathbb{X}}}$ is dense in ${\mathbb{H}}_{k}$ (Steinwart and Christmann, 2008, Theorem 4.21), there is a sequence $\{h_{i}\}_{i\in\mathbb{N}}\subset\operatorname{span}\{k(\cdot,{\bm{x}})\}_{{\bm{x}}\in{\mathbb{X}}}$ such that $h_{i}\to h$ in ${\mathbb{H}}_{k}$. For every $i\in\mathbb{N}$, define $l_{i}\coloneqq\langle h_{i},\cdot\rangle_{{\mathbb{H}}_{k}}$ and note that $\{l_{i}\}_{i\in\mathbb{N}}\subset\mathbb{L}_{\delta}$ by the reproducing property. By the continuity of the inner product it follows that

l[f]=\langle h,f\rangle_{{\mathbb{H}}_{k}}=\left\langle\lim_{i\to\infty}h_{i},% f\right\rangle_{{\mathbb{H}}_{k}}=\lim_{i\to\infty}\langle h_{i},f\rangle_{{% \mathbb{H}}_{k}}=\lim_{i\to\infty}l_{i}[f]

for every $f\in{\mathbb{H}}_{k}$ . ∎

Proposition 16.

The set $\mathbb{L}_{\delta}\coloneqq\operatorname{span}\{\delta_{\bm{x}}\}_{{\bm{x}}% \in{\mathbb{X}}}$ is weak-* sequentially dense in the dual ${\mathbb{B}}^{\prime}$ of the space ${\mathbb{B}}=C({\mathbb{X}})$ of real-valued continuous functions on a compact metric space $({\mathbb{X}},d_{\mathbb{X}})$ .

Proof

Let $l\in C({\mathbb{X}})^{\prime}$. By the Riesz-Markov theorem (Aliprantis and Border, 2006, Corollary 14.15), there is a signed Borel measure $\lambda$ on ${\mathbb{X}}$ such that $l[f]=\int_{{\mathbb{X}}}f({\bm{x}})\,\mathrm{d}\lambda\left({\bm{x}}\right)$. We need to show that there are $\{l_{k}\}_{k\in\mathbb{N}}\subset\mathbb{L}_{\delta}$ such that $l_{k}[f]\to l[f]$ for every $f\in C({\mathbb{X}})$. To do so, we modify a construction from Alt (2012, Paragraph 4.22). For $S\subset{\mathbb{X}}$, define $\operatorname{diam}(S)\coloneqq\sup_{{\bm{x}},{\bm{x}}_{0}\in S}d_{\mathbb{X}}({\bm{x}},{\bm{x}}_{0})$. Since ${\mathbb{X}}$ is compact, for every $k\in\mathbb{N}$ there is a finite open cover $\{\tilde{S}^{(k)}_{i}\}_{i=1}^{n^{(k)}}$ of ${\mathbb{X}}$ with $\operatorname{diam}(\tilde{S}^{(k)}_{i})<\frac{1}{k}$ for all $i=1,\dotsc,n^{(k)}$. Then $\{S^{(k)}_{i}\}_{i=1}^{n^{(k)}}\subset{\mathbb{X}}$ with

S^{(k)}_{i}\coloneqq\tilde{S}^{(k)}_{i}\setminus\bigcup_{j<i}\tilde{S}^{(k)}_{j}

is also a cover of ${\mathbb{X}}$ and $S^{(k)}_{i}\in\mathcal{B}\left({\mathbb{X}}\right)$ . Now choose ${\bm{x}}^{(k)}_{i}\in S^{(k)}_{i}$ for all $i=1,\dotsc,n^{(k)}$ (w.l.o.g. $S^{(k)}_{i}\neq\emptyset$ ). For any $f\in C({\mathbb{X}})$ , we define

f_{k}\coloneqq\sum_{i=1}^{n^{(k)}}f({\bm{x}}^{(k)}_{i})\chi_{S^{(k)}_{i}}.

Note that, since the $f_{k}$ are simple, we have

l[f_{k}]=\int_{{\mathbb{X}}}f_{k}({\bm{x}})\,\mathrm{d}\lambda\left({\bm{x}}% \right)=\sum_{i=1}^{n^{(k)}}f({\bm{x}}^{(k)}_{i})\lambda(S^{(k)}_{i})\eqqcolon l% _{k}[f],

where $l_{k}\in\mathbb{L}_{\delta}$ . We will now show that $f_{k}\to f$ uniformly, since this implies $l_{k}[f]=l[f_{k}]\to l[f]$ by the dominated convergence theorem.

By the Heine-Cantor theorem, $f\in C({\mathbb{X}})$ is uniformly continuous. Thus, for every $\epsilon>0$ , there is $\delta(\epsilon)>0$ such that $\lvert f({\bm{x}})-f({\bm{x}}_{0})\rvert<\epsilon$ for ${\bm{x}},{\bm{x}}_{0}\in{\mathbb{X}}$ with $d_{\mathbb{X}}({\bm{x}},{\bm{x}}_{0})<\delta(\epsilon)$ . Now fix $\epsilon>0$ . Then for $k>\frac{1}{\delta(\epsilon)}$ and any ${\bm{x}}\in S^{(k)}_{i}$ , we have

d_{\mathbb{X}}({\bm{x}},{\bm{x}}^{(k)}_{i})\leq\operatorname{diam}(S^{(k)}_{i})\leq\operatorname{diam}(\tilde{S}^{(k)}_{i})<\frac{1}{k}<\delta(\epsilon)

and thus $\lvert f({\bm{x}})-f({\bm{x}}^{(k)}_{i})\rvert<\epsilon$ . Consequently,

\lVert f-f_{k}\rVert_{\infty}=\max_{i=1,\dotsc,n^{(k)}}\sup_{{\bm{x}}\in S^{(k)}_{i}}\lvert f({\bm{x}})-f({\bm{x}}^{(k)}_{i})\rvert\leq\max_{i=1,\dotsc,n^{(k)}}\epsilon=\epsilon.

∎
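The construction in the proof above can be illustrated numerically. The following Python sketch assumes ${\mathbb{X}}=[0,1]$, takes $\lambda$ to be the Lebesgue measure, and uses $f=\cos$ as a test function, so that $l[f]=\sin(1)$. The partition into $k+1$ half-open cells with left endpoints as evaluation points is one hypothetical choice among many, and the function names are made up for this example.

```python
import math


def l_exact():
    """l[cos] = integral of cos over [0, 1] w.r.t. Lebesgue measure, i.e. sin(1)."""
    return math.sin(1.0)


def l_k(f, k):
    """Weighted sum of point evaluations: sum_i f(x_i) * lambda(S_i).

    The cells S_i = [i / n, (i + 1) / n) with n = k + 1 have diameter < 1 / k,
    and x_i is the left endpoint of S_i.
    """
    n = k + 1
    return sum(f(i / n) * (1.0 / n) for i in range(n))


# Approximation error of the point-evaluation functionals l_k for growing k.
errors = [abs(l_k(math.cos, k) - l_exact()) for k in (10, 100, 1000)]
print(errors)
```

Each `l_k` is a finite linear combination of point evaluations, i.e. an element of $\mathbb{L}_{\delta}$, and the printed errors shrink as $k$ grows, in line with the weak-* convergence $l_{k}[f]\to l[f]$.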

The Banach spaces $C^{k}(\overline{{\mathbb{X}}})$ of $k$ -times differentiable functions on an open and bounded domain ${\mathbb{X}}\subset\mathbb{R}^{d}$ with bounded and uniformly continuous partial derivatives and their subspaces (in particular the Hölder spaces) appear naturally in the study of strong solutions to linear PDEs. However, to allow for a flexible prior construction, we define a generalization of these spaces.

Definition 17.

Let ${\mathbb{X}}\subset\mathbb{R}^{d}$ be open and bounded and let $A\subset\mathbb{N}_{0}^{d}$ be a non-empty downward closed set of multi-indices, i.e. ${\bm{\beta}}\in A$ implies ${\bm{\alpha}}\in A$ for every ${\bm{\alpha}}\in\mathbb{N}_{0}^{d}$ with ${\bm{\alpha}}\leq{\bm{\beta}}$ . We define $C^{A}(\overline{{\mathbb{X}}})$ as the space of real-valued functions $f$ on ${\mathbb{X}}$ , for which all partial derivatives $\mathrm{D}^{{\bm{\alpha}}}f$ with ${\bm{\alpha}}\in A$ are bounded and uniformly continuous.

Remark 18.

One can show that $C^{A}(\overline{{\mathbb{X}}})$ equipped with the norm

\lVert f\rVert_{C^{A}(\overline{{\mathbb{X}}})}\coloneqq\max_{{\bm{\alpha}}\in A% }\sup_{{\bm{x}}\in{\mathbb{X}}}\lvert\mathrm{D}^{{\bm{\alpha}}}f\left({\bm{x}}% \right)\rvert

is a separable Banach space. Since every partial derivative $\mathrm{D}^{{\bm{\alpha}}}f$ with ${\bm{\alpha}}\in A$ is bounded and uniformly continuous, it has a unique, bounded, continuous extension to the closure $\overline{{\mathbb{X}}}$ of ${\mathbb{X}}$ (Adams and Fournier, 2003).

Proposition 19.

Let $C^{A}(\overline{{\mathbb{X}}})$ be the separable Banach space of real-valued functions on an open and bounded domain ${\mathbb{X}}\subset\mathbb{R}^{d}$ with bounded and uniformly continuous partial derivatives $\mathrm{D}^{{\bm{\alpha}}}f$ for all $f\in C^{A}(\overline{{\mathbb{X}}})$ and ${\bm{\alpha}}\in A$ . Then $C^{A}(\overline{{\mathbb{X}}})^{\prime}=\operatorname{scl}_{w*}^{m+1}(\mathbb{% L}_{\delta})$ , where $\mathbb{L}_{\delta}\coloneqq\operatorname{span}\{\delta_{\bm{x}}\}_{{\bm{x}}% \in{\mathbb{X}}}\subset C^{A}(\overline{{\mathbb{X}}})^{\prime}$ and $m=\max_{{\bm{\alpha}}\in A}\lvert{\bm{\alpha}}\rvert$ .

Proof

In the following, we adapt the proof of Theorem 3.9 in Adams and Fournier (2003). We choose an arbitrary ordering ${\bm{\alpha}}_{1},\dotsc,{\bm{\alpha}}_{n}$ of the multi-indices in $A$, i.e. $A=\{{\bm{\alpha}}_{1},\dotsc,{\bm{\alpha}}_{n}\}$, where $n=\lvert A\rvert$. Let

{\mathbb{W}}\coloneqq\{(\mathrm{D}^{{\bm{\alpha}}_{1}}f,\dotsc,\mathrm{D}^{{% \bm{\alpha}}_{n}}f)\colon f\in C^{A}(\overline{{\mathbb{X}}})\}\subset C(% \overline{{\mathbb{X}}})^{n},

where we interpret $\mathrm{D}^{{\bm{\alpha}}_{i}}f$ as a function defined on the closure $\overline{{\mathbb{X}}}$ of ${\mathbb{X}}$ by the unique continuous extension mentioned in remark 18. We equip $C(\overline{{\mathbb{X}}})^{n}$ and ${\mathbb{W}}\subset C(\overline{{\mathbb{X}}})^{n}$ with the norm

\lVert{\bm{f}}\rVert_{C(\overline{{\mathbb{X}}})^{n}}\coloneqq\max_{i=1,\dotsc% ,n}\lVert{f}_{i}\rVert_{C(\overline{{\mathbb{X}}})}.

Then $(C(\overline{{\mathbb{X}}})^{n},\lVert\cdot\rVert_{C(\overline{{\mathbb{X}}})^% {n}})$ is a separable Banach space (Adams and Fournier, 2003, Theorem 1.23). Let $\mathcal{I}\colon C^{A}(\overline{{\mathbb{X}}})\to{\mathbb{W}}$ be the linear operator defined by $\mathcal{I}[f]_{i}=\mathrm{D}^{{\bm{\alpha}}_{i}}f$ . The operator $\mathcal{I}$ is surjective and norm-preserving, and hence an isometric isomorphism. It follows that ${\mathbb{W}}$ is a closed subspace of $C(\overline{{\mathbb{X}}})^{n}$ .

Fix $l\in C^{A}(\overline{{\mathbb{X}}})^{\prime}$ . Then $l\circ\mathcal{I}^{-1}\in{\mathbb{W}}^{\prime}$ . By the Hahn-Banach theorem, there is a continuous extension $\tilde{l}\in(C(\overline{{\mathbb{X}}})^{n})^{\prime}$ of $l\circ\mathcal{I}^{-1}$ to $C(\overline{{\mathbb{X}}})^{n}$ . This means that there are $\tilde{l}_{1},\dotsc,\tilde{l}_{n}\in C(\overline{{\mathbb{X}}})^{\prime}$ , such that

l=(l\circ\mathcal{I}^{-1})\circ\mathcal{I}=\tilde{l}\circ\mathcal{I}=\sum_{i=1% }^{n}\tilde{l}_{i}\circ\mathrm{D}^{{\bm{\alpha}}_{i}}.

By proposition 16, there are $\{\tilde{l}_{ij}\}_{j\in\mathbb{N}}\subset\mathbb{L}_{\delta}$ such that $\tilde{l}_{ij}[f]\to\tilde{l}_{i}[f]$ as $j\to\infty$ for all $f\in C(\overline{{\mathbb{X}}})$ . Consequently, there are $n_{ij}\in\mathbb{N}$ , $\{c_{ijk}\}_{k=1}^{n_{ij}}\subset\mathbb{R}$ , and $\{{\bm{x}}_{ijk}\}_{k=1}^{n_{ij}}\subset\overline{{\mathbb{X}}}$ such that

l[f]=\lim_{j\to\infty}\sum_{i=1}^{n}\tilde{l}_{ij}[\mathrm{D}^{{\bm{\alpha}}_{i}}f]=\lim_{j\to\infty}\sum_{i=1}^{n}\sum_{k=1}^{n_{ij}}c_{ijk}\mathrm{D}^{{\bm{\alpha}}_{i}}f({\bm{x}}_{ijk})

for all $f\in C^{A}(\overline{{\mathbb{X}}})$ . We will detail the remainder of the proof only for $\lvert{\bm{\alpha}}\rvert\leq 1$ for all ${\bm{\alpha}}\in A$ , since the proof of the general statement is a straightforward yet laborious extension of this special case. Assume without loss of generality that ${\bm{\alpha}}_{1}={\bm{0}}$ and ${\bm{\alpha}}_{i}={\bm{e}}_{i-1}$ for $2\leq i\leq n\leq d+1$ . Then,

	$\displaystyle l[f]$	$\displaystyle=\lim_{j_{0}\to\infty}\sum_{i=1}^{n}\sum_{k=1}^{n_{ij_{0}}}c_{ij_{0}k}\mathrm{D}^{{\bm{\alpha}}_{i}}f({\bm{x}}_{ij_{0}k})$
		$\displaystyle=\lim_{j_{0}\to\infty}\left(\sum_{k=1}^{n_{1j_{0}}}c_{1j_{0}k}f({\bm{x}}_{1j_{0}k})+\sum_{i=2}^{n}\sum_{k=1}^{n_{ij_{0}}}c_{ij_{0}k}\mathrm{D}^{{\bm{e}}_{i-1}}f({\bm{x}}_{ij_{0}k})\right)$
		$\displaystyle=\lim_{j_{0}\to\infty}\lim_{j_{1}\to\infty}\underbrace{\left(\sum_{k=1}^{n_{1j_{0}}}c_{1j_{0}k}f({\bm{x}}_{1j_{0}k})+\sum_{i=2}^{n}\sum_{k=1}^{n_{ij_{0}}}\frac{c_{ij_{0}k}}{h_{j_{1}}}\left(f({\bm{x}}_{ij_{0}k}+h_{j_{1}}{\bm{e}}_{i-1})-f({\bm{x}}_{ij_{0}k})\right)\right)}_{\eqqcolon l_{j_{0}j_{1}}[f]}$

for any null sequence $\{h_{j}\}_{j\in\mathbb{N}}\subset\mathbb{R}\setminus\{0\}$ and all $f\in C^{A}(\overline{{\mathbb{X}}})$. Since $l_{j_{0}j_{1}}\in\mathbb{L}_{\delta}$, it follows that $l\in\operatorname{scl}_{w*}^{2}(\mathbb{L}_{\delta})$. The general case, i.e. $l\in\operatorname{scl}_{w*}^{m+1}(\mathbb{L}_{\delta})$ with $m\coloneqq\max_{i=1,\dotsc,n}\lvert{\bm{\alpha}}_{i}\rvert$, can be shown by induction on $m\in\mathbb{N}_{0}$. Hence, $C^{A}(\overline{{\mathbb{X}}})^{\prime}=\operatorname{scl}_{w*}^{m+1}(\mathbb{L}_{\delta})$. ∎
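The key step of the proof is that a derivative evaluation is a pointwise limit of difference quotients, each of which is a linear combination of two point evaluations. A minimal Python sketch of this step in one dimension follows. The test function $\sin$, the point $x=0.3$, and the null sequence $h_{j}=2^{-j}$ are assumptions made only for this illustration.

```python
import math


def difference_quotient(f, x, h):
    """Apply (delta_{x+h} - delta_x) / h to f, a combination of two point evaluations."""
    return (f(x + h) - f(x)) / h


x = 0.3
exact = math.cos(x)  # derivative of sin at x, i.e. (delta_x o D)[sin]

# Errors of the difference quotients along the null sequence h_j = 2^{-j}.
errors = [abs(difference_quotient(math.sin, x, 2.0 ** -j) - exact) for j in (2, 6, 10)]
print(errors)
```

The errors decrease with $h_{j}$, which mirrors the inner limit $j_{1}\to\infty$ in the display above. Higher-order derivatives would require nested difference quotients, matching the additional sequential closures in the general case.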

Having investigated conditions under which $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is a Gaussian random variable, it remains to show what its mean and covariance operator are. Perhaps unsurprisingly, it turns out that they are strongly related to the mean and covariance function of the GP.

Proposition 20.

Let assumption 1 hold. Then $m\in{\mathbb{B}}$ , $k({\bm{x}},\cdot)\in{\mathbb{B}}$ for all ${\bm{x}}\in{\mathbb{X}}$ , and the mean and covariance operator of $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ are given by $m$ and

\mathcal{C}_{k}\colon{\mathbb{B}}^{\prime}\to{\mathbb{B}},\quad l\mapsto% \mathcal{C}_{k}[l]({\bm{x}})=l[k({\bm{x}},\cdot)],

(B.1)

respectively.

Proof

Since $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is Gaussian, its mean and covariance operator $m_{\mathrm{f}}$ and $\mathcal{C}_{\mathrm{f}}$ exist by proposition 9 and we have $m({\bm{x}})=\operatorname{\mathbb{E}}_{\mathrm{P}}\left[{\mathrm{f}}({\bm{x}})% \right]=\operatorname{\mathbb{E}}_{{\mathrm{f}}}\left[\delta_{\bm{x}}[{\mathrm% {f}}]\right]=\delta_{\bm{x}}[m_{\mathrm{f}}]$ for all ${\bm{x}}\in{\mathbb{X}}$ and

k({\bm{x}}_{1},{\bm{x}}_{2})=\operatorname{Cov}_{\mathrm{P}}\left[{\mathrm{f}}% ({\bm{x}}_{1}),{\mathrm{f}}({\bm{x}}_{2})\right]=\operatorname{Cov}_{{\mathrm{% f}}}\left[\delta_{{\bm{x}}_{1}}[{\mathrm{f}}],\delta_{{\bm{x}}_{2}}[{\mathrm{f% }}]\right]=\mathcal{C}_{\mathrm{f}}[\delta_{{\bm{x}}_{1}}]({\bm{x}}_{2})

for all ${\bm{x}}_{1},{\bm{x}}_{2}\in{\mathbb{X}}$ , since all point evaluation functionals are continuous on ${\mathbb{B}}$ . Hence, $m=m_{\mathrm{f}}\in{\mathbb{B}}$ and $k({\bm{x}},\cdot)=\mathcal{C}_{\mathrm{f}}[\delta_{\bm{x}}]\in{\mathbb{B}}$ for all ${\bm{x}}\in{\mathbb{X}}$ . Additionally, for all $l\in{\mathbb{B}}^{\prime}$ and ${\bm{x}}\in{\mathbb{X}}$ ,

\displaystyle\mathcal{C}_{\mathrm{f}}[l]({\bm{x}})=\operatorname{Cov}_{{% \mathrm{f}}}\left[l[{\mathrm{f}}],\delta_{\bm{x}}[{\mathrm{f}}]\right]=% \operatorname{Cov}_{{\mathrm{f}}}\left[\delta_{\bm{x}}[{\mathrm{f}}],l[{% \mathrm{f}}]\right]=l[\mathcal{C}_{\mathrm{f}}[\delta_{\bm{x}}]]=l[k({\bm{x}},% \cdot)]=\mathcal{C}_{k}[l]({\bm{x}}).

This shows that $\mathcal{C}_{\mathrm{f}}=\mathcal{C}_{k}$ . ∎

B.3 Proof of Theorem 1

Using the results from sections B.1 and B.2, particularly lemma 11 and proposition 20, we can now prove theorems 1 and 2 as outlined at the beginning of section B.

The main theorem deals with the case in which we observe the GP through a finite number of linear functionals. This happens when conditioning on integral observations or on (Galerkin) projections as in sections 3.2 and 3.3. See theorem 1.

Proof

By lemma 11, ${\bm{\mathcal{L}}}[{\mathrm{f}}]$ is a Gaussian random variable with mean ${\bm{\mathcal{L}}}[m]$ and covariance matrix ${\bm{\Sigma}}$ with

{\Sigma}_{ij}={\mathcal{L}}_{i}[\mathcal{C}[{\mathcal{L}}_{j}]]={\mathcal{L}}_% {i}[x\mapsto{\mathcal{L}}_{j}[k(x,\cdot)]]=({\bm{\mathcal{L}}}k{\bm{\mathcal{L% }}}^{\prime})_{ij},

where we used propositions 9 and 20. This proves equation 4.1. Now let ${\bm{X}}=({\bm{x}}_{i})_{i=1}^{m}\in{\mathbb{X}}^{m}$ and consider

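\tilde{{\bm{\mathcal{L}}}}\colon{\mathbb{B}}\to\mathbb{R}^{m+n},f\mapsto\begin{pmatrix}f({\bm{X}})\\ {\bm{\mathcal{L}}}[f]\end{pmatrix}.

$\tilde{{\bm{\mathcal{L}}}}$ is linear and bounded, since point evaluations are continuous on the RKBS ${\mathbb{B}}$ and ${\bm{\mathcal{L}}}$ is bounded. Hence, by lemma 11 and the stability properties of independent Gaussian random variables on $\mathbb{R}^{m+n}$,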

\begin{pmatrix}{\mathrm{f}}({\bm{X}})\\ {\bm{\mathcal{L}}}[{\mathrm{f}}]+{\bm{\mathrm{\epsilon}}}\end{pmatrix}=\tilde{% {\bm{\mathcal{L}}}}[{\mathrm{f}}]+\begin{pmatrix}{\bm{0}}_{m\times n}\\ {\bm{I}}_{n}\end{pmatrix}{\bm{\mathrm{\epsilon}}}\sim{\operatorname{\mathcal{N% }}\left(\begin{pmatrix}m({\bm{X}})\\ {\bm{\mathcal{L}}}[m]+{\bm{\mu}}\end{pmatrix},\begin{pmatrix}k({\bm{X}},{\bm{X% }})&{\bm{\Sigma}}^{{\bm{X}},{\bm{\mathcal{L}}}}\\ {\bm{\Sigma}}^{{\bm{\mathcal{L}}},{\bm{X}}}&{\bm{\mathcal{L}}}k{\bm{\mathcal{L% }}}^{\prime}+{\bm{\Sigma}}\end{pmatrix}\right)},

where

{\Sigma}^{{\bm{X}},{\bm{\mathcal{L}}}}_{i,j}=\delta_{{\bm{x}}_{i}}[\mathcal{C}% [{\mathcal{L}}_{j}]]=\mathcal{C}[{\mathcal{L}}_{j}]({\bm{x}}_{i})={\mathcal{L}% }_{j}[k({\bm{x}}_{i},\cdot)]=(\delta_{{\bm{X}}}k{\bm{\mathcal{L}}}^{\prime})_{% i,j}

and ${\bm{\Sigma}}^{{\bm{\mathcal{L}}},{\bm{X}}}=({\bm{\Sigma}}^{{\bm{X}},{\bm{% \mathcal{L}}}})^{\top}$ . By the well-known conditioning theorem for Gaussian random variables in $\mathbb{R}^{m+n}$ , we arrive at

{\mathrm{f}}({\bm{X}})\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{% \mathcal{L}}}[{\mathrm{f}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}}\sim{% \operatorname{\mathcal{N}}\left({\bm{\mu}}^{{\mathrm{f}}({\bm{X}})\nonscript\>% |\allowbreak\nonscript\>\mathopen{}{\bm{y}}},{\bm{\Sigma}}^{{\mathrm{f}}({\bm{% X}})\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}\right)},

with

	$\displaystyle{\bm{\mu}}^{{\mathrm{f}}({\bm{X}})\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}$	$\displaystyle=m({\bm{X}})+(\delta_{{\bm{X}}}k{\bm{\mathcal{L}}}^{\prime})({\bm{\mathcal{L}}}k{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}})^{\dagger}({\bm{y}}-({\bm{\mathcal{L}}}[m]+{\bm{\mu}}))$
and
	$\displaystyle{\bm{\Sigma}}^{{\mathrm{f}}({\bm{X}})\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}$	$\displaystyle=k({\bm{X}},{\bm{X}})-(\delta_{{\bm{X}}}k{\bm{\mathcal{L}}}^{\prime})({\bm{\mathcal{L}}}k{\bm{\mathcal{L}}}^{\prime}+{\bm{\Sigma}})^{\dagger}({\bm{\mathcal{L}}}k\delta_{{\bm{X}}}^{\prime}).$

This shows that ${\mathrm{f}}=\{\omega\mapsto{\mathrm{f}}({\bm{x}},\omega)\}_{{\bm{x}}\in{% \mathbb{X}}}$ is a Gaussian process on the probability space

(\Omega,\mathcal{F},\mathrm{P}\left(\cdot\nonscript\>\middle|\allowbreak% \nonscript\>\mathopen{}{\bm{\mathcal{L}}}[{\mathrm{f}}]+{\bm{\mathrm{\epsilon}% }}={\bm{y}}\right))

where $\mathrm{P}\left(\cdot\nonscript\>\middle|\allowbreak\nonscript\>\mathopen{}{\bm{\mathcal{L}}}[{\mathrm{f}}]+{\bm{\mathrm{\epsilon}}}={\bm{y}}\right)$ is a regular conditional probability whose existence is guaranteed, since $\mathbb{R}^{n}$ is Polish. The mean function of the conditional process evaluated at ${\bm{x}}_{i}$ and its covariance function evaluated at $({\bm{x}}_{i},{\bm{x}}_{j})$ are given by ${\mu}^{{\mathrm{f}}({\bm{X}})\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}_{i}$ and ${\Sigma}^{{\mathrm{f}}({\bm{X}})\nonscript\>|\allowbreak\nonscript\>\mathopen{}{\bm{y}}}_{i,j}$, respectively. Since the points ${\bm{X}}$ were chosen arbitrarily, this holds for any ${\bm{x}}_{i},{\bm{x}}_{j}\in{\mathbb{X}}$, which proves equations 4.3 and 4.4. ∎
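To make the conditional mean and covariance formulas from the proof above concrete, the following Python sketch works through a single integral observation. It assumes a zero-mean prior on $[0,1]$ with covariance function $k(s,t)=\min(s,t)$ (Brownian motion, whose paths are continuous), the functional $\mathcal{L}[f]=\int_{0}^{1}f(t)\,\mathrm{d}t$, and zero-mean observation noise. These choices, the closed-form expressions for $\mathcal{L}[k(x,\cdot)]$ and $\mathcal{L}k\mathcal{L}^{\prime}$, and all function names are assumptions made for illustration, not the paper's implementation.

```python
def k(s, t):
    """Prior covariance function of Brownian motion on [0, 1] (illustrative choice)."""
    return min(s, t)


def k_L(x):
    """L[k(x, .)] = integral_0^1 min(x, t) dt = x - x^2 / 2, in closed form."""
    return x - 0.5 * x * x


L_k_L = 1.0 / 3.0  # integral_0^1 integral_0^1 min(s, t) ds dt


def posterior_mean(x, y, noise_var=0.0):
    """Conditional mean of f(x) given L[f] + eps = y for a zero-mean prior."""
    return k_L(x) * y / (L_k_L + noise_var)


def posterior_cov(x1, x2, noise_var=0.0):
    """Conditional covariance of f(x1) and f(x2) given L[f] + eps = y."""
    return k(x1, x2) - k_L(x1) * k_L(x2) / (L_k_L + noise_var)


def integrate(g, n=2000):
    """Midpoint rule on [0, 1]."""
    return sum(g((i + 0.5) / n) for i in range(n)) / n


y = 0.7  # hypothetical observed value of the integral
mean_integral = integrate(lambda x: posterior_mean(x, y))
var_of_integral = integrate(
    lambda s: integrate(lambda t: posterior_cov(s, t), n=200), n=200
)
print(mean_integral, var_of_integral)
```

With noise-free observations, the conditional mean integrates to the observed value $y$ and the conditional variance of the integral vanishes up to quadrature error, as one would expect after conditioning on exactly that functional.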

Finally, we address the archetypical case, in which both the prior ${\mathrm{f}}$ and the prior predictive $\mathcal{L}[{\mathrm{f}}]+{\bm{\mathrm{\epsilon}}}$ are Gaussian processes. This happens if the linear operator maps into a function space, in which point evaluation is continuous. In this article, this case occurred in sections 3.1 and 3.2, where we inferred the strong solution of a PDE from observations of the PDE residual at a finite number of domain points. See theorem 2.

Proof

The linear operator $\mathcal{L}[\cdot](\tilde{{\bm{X}}})\colon{\mathbb{B}}\to\mathbb{R}^{n}$ is bounded, since $\mathcal{L}[\cdot](\tilde{{\bm{X}}})_{i}=\delta_{\tilde{{\bm{x}}}_{i}}\circ% \mathcal{L}$ is bounded by assumption. Hence, equations 4.8, 4.9, 4.10 and 4.11 follow directly from theorem 1. Now let ${\bm{X}}=({\bm{x}}_{i})_{i=1}^{m+m^{\prime}}\in{\mathbb{X}}^{m+m^{\prime}}$ . Then the linear operator

{\bm{\mathcal{L}}}_{\bm{X}}\colon{\mathbb{B}}\to\mathbb{R}^{m+m^{\prime}},f% \mapsto\begin{pmatrix}f({\bm{x}}_{1})&\ldots&f({\bm{x}}_{m})&\mathcal{L}[f]({% \bm{x}}_{m+1})&\ldots&\mathcal{L}[f]({\bm{x}}_{m+m^{\prime}})\end{pmatrix}^{\top}

is bounded and ${\bm{\mathcal{L}}}_{\bm{X}}[{\mathrm{f}}]$ is Gaussian by theorem 1. This implies that $\{\omega\mapsto{\bm{\mathrm{f}}}^{\mathcal{L}}({\bm{x}},\omega)\}_{{\bm{x}}\in% {\mathbb{X}}}$ with

{\bm{\mathrm{f}}}^{\mathcal{L}}({\bm{x}},\omega)\coloneqq\begin{pmatrix}{\mathrm{f}}({\bm{x}},\omega)\\ \mathcal{L}[{\mathrm{f}}(\cdot,\omega)]({\bm{x}})\end{pmatrix}

is a 2-output Gaussian process. By lemma 11 and propositions 9 and 20, its mean function is given by

{\bm{m}}^{\mathcal{L}}({\bm{x}})=\begin{pmatrix}\operatorname{\mathbb{E}}_{% \mathrm{P}}\left[\delta_{\bm{x}}[{\mathrm{f}}]\right]\\ \operatorname{\mathbb{E}}_{\mathrm{P}}\left[(\delta_{\bm{x}}\circ\mathcal{L})[% {\mathrm{f}}]\right]\end{pmatrix}=\begin{pmatrix}m({\bm{x}})\\ \mathcal{L}[m]({\bm{x}})\end{pmatrix}

and its covariance function is given by

\begin{aligned}{\bm{k}}^{\mathcal{L}}({\bm{x}}_{1},{\bm{x}}_{2})&=\begin{pmatrix}\operatorname{Cov}_{\mathrm{P}}\left[\delta_{{\bm{x}}_{1}}[{\mathrm{f}}],\delta_{{\bm{x}}_{2}}[{\mathrm{f}}]\right]&\operatorname{Cov}_{\mathrm{P}}\left[\delta_{{\bm{x}}_{1}}[{\mathrm{f}}],(\delta_{{\bm{x}}_{2}}\circ\mathcal{L})[{\mathrm{f}}]\right]\\ \operatorname{Cov}_{\mathrm{P}}\left[(\delta_{{\bm{x}}_{1}}\circ\mathcal{L})[{\mathrm{f}}],\delta_{{\bm{x}}_{2}}[{\mathrm{f}}]\right]&\operatorname{Cov}_{\mathrm{P}}\left[(\delta_{{\bm{x}}_{1}}\circ\mathcal{L})[{\mathrm{f}}],(\delta_{{\bm{x}}_{2}}\circ\mathcal{L})[{\mathrm{f}}]\right]\end{pmatrix}\\
&=\begin{pmatrix}\delta_{{\bm{x}}_{1}}[\mathcal{C}_{k}[\delta_{{\bm{x}}_{2}}]]&\delta_{{\bm{x}}_{1}}[\mathcal{C}_{k}[\delta_{{\bm{x}}_{2}}\circ\mathcal{L}]]\\ (\delta_{{\bm{x}}_{1}}\circ\mathcal{L})[\mathcal{C}_{k}[\delta_{{\bm{x}}_{2}}]]&(\delta_{{\bm{x}}_{1}}\circ\mathcal{L})[\mathcal{C}_{k}[\delta_{{\bm{x}}_{2}}\circ\mathcal{L}]]\end{pmatrix}\\
&=\begin{pmatrix}k({\bm{x}}_{1},{\bm{x}}_{2})&(k\mathcal{L}^{\prime})({\bm{x}}_{1},{\bm{x}}_{2})\\ (\mathcal{L}k)({\bm{x}}_{1},{\bm{x}}_{2})&(\mathcal{L}k\mathcal{L}^{\prime})({\bm{x}}_{1},{\bm{x}}_{2})\end{pmatrix}.\end{aligned}

This proves equation 4.12. ∎
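To make the block structure of the covariance function above concrete, the following minimal sketch assumes a one-dimensional Gaussian kernel with lengthscale `ELL` and the operator $\mathcal{L}=\frac{\mathrm{d}}{\mathrm{d}x}$. Both choices, as well as all function names, are illustrative assumptions and are not taken from the experiments in this paper. The four blocks $k$, $k\mathcal{L}^{\prime}$, $\mathcal{L}k$ and $\mathcal{L}k\mathcal{L}^{\prime}$ are written out in closed form.

```python
import numpy as np

ELL = 0.7  # lengthscale of the Gaussian kernel (illustrative assumption)


def k(x1, x2):
    """Gaussian kernel k(x1, x2) = exp(-(x1 - x2)^2 / (2 ELL^2))."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / ELL**2)


def Lk(x1, x2):
    """(L k)(x1, x2): L = d/dx applied to the first argument of k."""
    return -(x1 - x2) / ELL**2 * k(x1, x2)


def kL(x1, x2):
    """(k L')(x1, x2): L = d/dx applied to the second argument of k."""
    return (x1 - x2) / ELL**2 * k(x1, x2)


def LkL(x1, x2):
    """(L k L')(x1, x2): L applied to both arguments of k."""
    return (1.0 / ELL**2 - (x1 - x2) ** 2 / ELL**4) * k(x1, x2)


def joint_cov(x1, x2):
    """2x2 covariance between (f(x1), L[f](x1)) and (f(x2), L[f](x2))."""
    return np.array([[k(x1, x2), kL(x1, x2)],
                     [Lk(x1, x2), LkL(x1, x2)]])
```

A quick sanity check of such closed-form blocks is to compare `Lk`, `kL` and `LkL` against central finite differences of `k`; for other operators one would typically obtain the blocks by automatic differentiation instead of by hand.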

B.4 On Prior Selection

In order to apply theorem 1 in practice, we need to construct our GP prior ${\mathrm{f}}$ such that

1. $\omega\mapsto{\mathrm{f}}(\cdot,\omega)$ is a Gaussian random variable on some suitably chosen RKBS ${\mathbb{B}}$, and

2. ${\bm{\mathcal{L}}}\colon{\mathbb{B}}\to\mathbb{R}^{n}$ is bounded.

Luckily, we can use existing results about the path spaces of Gaussian processes together with theorems 13, 15, 16 and 19 to verify these assumptions. It is tempting to choose ${\mathbb{B}}={\mathbb{H}}_{k}$ , i.e. the RKHS of the GP’s kernel $k$ . However, this is only valid if ${\mathbb{H}}_{k}$ is finite-dimensional.

Remark 21.

Let ${\mathrm{f}}\sim{\operatorname{\mathcal{GP}}\left(m,k\right)}$ be a Gaussian process with index set ${\mathbb{X}}$ and let ${\mathbb{H}}_{k}$ be the RKHS of the covariance function $k$ . If $\dim{\mathbb{H}}_{k}=\infty$ , then the sample paths of ${\mathrm{f}}$ almost surely do not lie in ${\mathbb{H}}_{k}$ . We refer to (Kanagawa et al., 2018, Section 4) and Steinwart (2019) for more details on RKHS sample spaces.

In the following, we will give example constructions of appropriate priors for GP regression tasks with linear operator observations.

B.4.1 Priors for GP Regression with Linear Operator Observations

The spaces $C^{A}(\overline{{\mathbb{X}}})$ from definition 17, particularly $C^{{\bm{\beta}}}(\overline{{\mathbb{X}}})$ with $A\coloneqq\{{\bm{\alpha}}\in\mathbb{N}_{0}^{d}\mid{\bm{\alpha}}\leq{\bm{\beta}}\}$ and $C^{k}(\overline{{\mathbb{X}}})$ with $A\coloneqq\{{\bm{\alpha}}\in\mathbb{N}_{0}^{d}\mid\lvert{\bm{\alpha}}\rvert\leq k\}$, are useful sample spaces for many GP regression tasks, since a large number of practically relevant observation functionals, including point evaluations of the paths and their partial derivatives, as well as integrals of the paths, are bounded on these spaces. Even though the functions in $C^{A}(\overline{{\mathbb{X}}})$ are, technically speaking, only defined on the open and bounded set ${\mathbb{X}}\subset\mathbb{R}^{d}$, we can treat them as functions on the closure $\overline{{\mathbb{X}}}$ of ${\mathbb{X}}$ by continuous extension (see remark 18 for more details). In other words, we can evaluate functions in $C^{A}(\overline{{\mathbb{X}}})$ on the boundary $\partial{\mathbb{X}}$ of ${\mathbb{X}}$.

To fulfill assumption 1, it remains to verify that the sample paths of a given GP prior (almost surely) lie in $C^{A}(\overline{{\mathbb{X}}})$ . Under the assumption that $m\in C^{A}(\overline{{\mathbb{X}}})$ , Da Costa et al. (2023) show that this can be done by studying the regularity of the covariance function $k$ . They also provide readily applicable results for a wide variety of covariance functions used in practice.

Example 8 (Tensor Products of Matérn Covariances).

Tensor products of 1D Matérn covariance functions

k_{{\bm{\nu}}}({\bm{x}}_{1},{\bm{x}}_{2})=\prod_{i=1}^{d}k_{{\nu}_{i}}({x}_{1,i},{x}_{2,i})

are a particularly convenient choice of prior covariance function, since their hyperparameters directly control the differentiability of the sample paths independently for each input dimension. For an open and bounded domain ${\mathbb{X}}\subset\mathbb{R}^{d}$ , Propositions 10 and 21 in Da Costa et al. (2023) imply that samples from a Gaussian process with mean function $m$ and covariance function $k_{{\bm{\nu}}}$ lie in $C^{{\bm{\beta}}}(\overline{{\mathbb{X}}})$ (with probability 1) if ${\nu}_{i}>{\beta}_{i}$ and $m\in C^{{\bm{\beta}}}(\overline{{\mathbb{X}}})$ . Any point-evaluated partial derivative $f\mapsto\mathrm{D}^{{\bm{\alpha}}}f\left({\bm{x}}\right)$ with ${\bm{\alpha}}\leq{\bm{\beta}}$ and ${\bm{x}}\in\overline{{\mathbb{X}}}$ is continuous on $C^{{\bm{\beta}}}(\overline{{\mathbb{X}}})$ .
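The construction in this example can be sketched in a few lines. The snippet below assumes the standard closed-form expressions of the 1D Matérn covariance for $\nu\in\{\frac{1}{2},\frac{3}{2},\frac{5}{2}\}$ with unit output scale; the function names and the helper `max_beta`, which returns the largest integer multi-index with ${\beta}_{i}<{\nu}_{i}$, are hypothetical and only serve as an illustration.

```python
import numpy as np


def matern_1d(r, nu, ell=1.0):
    """1D Matérn covariance as a function of r = x1 - x2, for nu in {1/2, 3/2, 5/2}."""
    s = np.abs(r) / ell
    if nu == 0.5:
        return np.exp(-s)
    if nu == 1.5:
        return (1.0 + np.sqrt(3.0) * s) * np.exp(-np.sqrt(3.0) * s)
    if nu == 2.5:
        return (1.0 + np.sqrt(5.0) * s + 5.0 * s**2 / 3.0) * np.exp(-np.sqrt(5.0) * s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")


def tensor_matern(x1, x2, nus, ells):
    """Tensor product kernel: product over input dimensions of 1D Matérn kernels."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    factors = [matern_1d(a - b, nu, ell) for a, b, nu, ell in zip(x1, x2, nus, ells)]
    return float(np.prod(factors))


def max_beta(nus):
    """Largest integer multi-index beta with beta_i < nu_i in every dimension."""
    return tuple(int(np.ceil(nu)) - 1 for nu in nus)
```

For instance, `max_beta((2.5, 2.5))` evaluates to `(2, 2)`, which matches the choice ${\nu}_{i}=\frac{5}{2}$ with sample space $C^{(2,2)}(\overline{{\mathbb{X}}})$.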

In section 3.2, we use tensor products of Matérn covariance functions to construct the GP priors. In particular, we choose ${\nu}_{i}=\frac{5}{2}=2+\frac{1}{2}$, which implies that the sample paths of the prior lie in $C^{(2,2)}(\overline{{\mathbb{X}}})$ and that all point-evaluated differential operators of order $\leq 2$ are continuous linear functionals on the sample space. Hence, the assumptions of corollary 2 are fulfilled, which means that the inference procedure used in that section is supported by our theoretical results above.

The sample paths of Gaussian processes with multivariate Matérn covariance functions $k_{\nu}$ (almost surely) lie in the Banach space ${\mathbb{B}}=C^{p}(\overline{{\mathbb{X}}})$ if $\nu>p$ (Da Costa et al., 2023, Proposition 10).

For Gaussian processes with smooth covariance functions like the Gaussian/exponentiated quadratic or the rational quadratic covariance functions, assumption 1 holds for ${\mathbb{B}}=C^{p}(\overline{{\mathbb{X}}})$ for all $p\in\mathbb{N}_{0}$ (Da Costa et al., 2023, Corollary 13). Informally speaking, for the Gaussian covariance function, this can be seen as a limit of the argument above, since a Matérn covariance function approaches the Gaussian covariance function for $\nu\to\infty$ .

Gaussian processes with parametric covariance function $k({\bm{x}}_{1},{\bm{x}}_{2})={\bm{\phi}}({\bm{x}}_{1})^{\top}{\bm{\Sigma}}{\bm{\phi}}({\bm{x}}_{2})$ with features ${\bm{\phi}}\colon{\mathbb{X}}\to\mathbb{R}^{m}$ and ${\bm{\Sigma}}\in\mathbb{R}^{m\times m}$ positive-(semi)definite have paths in ${\mathbb{B}}$ if ${\phi}_{i}\in{\mathbb{B}}$ for all $i=1,\dotsc,m$. In this case, assumption 1 is also satisfied, since the Gaussian measure can be explicitly constructed as the law of the random function ${\bm{\mathrm{w}}}^{\top}{\bm{\phi}}$, where ${\bm{\mathrm{w}}}\sim{\operatorname{\mathcal{N}}\left({\bm{0}},{\bm{\Sigma}}\right)}$.
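The explicit construction of a parametric GP as the law of ${\bm{\mathrm{w}}}^{\top}{\bm{\phi}}$ can be illustrated as follows. The sketch assumes polynomial features ${\bm{\phi}}(x)=(1,x,x^{2})^{\top}$ on a one-dimensional domain; the features, the matrix ${\bm{\Sigma}}$ used in any call, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def features(x):
    """Polynomial features phi(x) = (1, x, x^2); every phi_i is smooth."""
    x = np.asarray(x, float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)


def parametric_kernel(x1, x2, Sigma):
    """k(x1, x2) = phi(x1)^T Sigma phi(x2), evaluated for all pairs of inputs."""
    return features(x1) @ Sigma @ features(x2).T


def sample_paths(x, Sigma, n_samples, rng):
    """Evaluate sample paths w^T phi at the points x, with w ~ N(0, Sigma)."""
    w = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n_samples)
    return w @ features(x).T  # shape (n_samples, len(x))
```

Since every sample path is a linear combination of the ${\phi}_{i}$, it inherits their regularity, and the empirical covariance of many sampled paths should approach `parametric_kernel(x, x, Sigma)` up to Monte Carlo error.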

B.4.2 Priors for Inferring Weak Solutions of Linear PDEs

Sobolev spaces (Adams and Fournier, 2003) are a typical choice for the solution spaces ${\mathbb{U}}$ of linear PDEs in weak formulation (see section 2.1.1). Unfortunately, it is impossible to construct a Gaussian process prior ${\mathrm{u}}$ whose paths are elements of a Sobolev space ${\mathbb{U}}$. This is because Sobolev spaces are, technically speaking, not function spaces, but rather spaces of equivalence classes $[u]_{\sim}$ of functions that are equal almost everywhere (Adams and Fournier, 2003). By contrast, the path spaces of Gaussian processes are proper function spaces, which means that, in this setting, $\operatorname{paths}\left({\mathrm{u}}\right)\subseteq{\mathbb{U}}$ is impossible.

Fortunately, if the path space ${\mathbb{B}}\supset\operatorname{paths}\left({\mathrm{u}}\right)$ of ${\mathrm{u}}$ can be continuously embedded in ${\mathbb{U}}$, i.e. there is a continuous and injective linear operator $\iota\colon{\mathbb{B}}\to{\mathbb{U}}$, commonly referred to as an embedding, then the inference procedure above can still be applied. If such an embedding exists, we can interpret the paths of the GP as elements of ${\mathbb{U}}$ by applying $\iota$ implicitly. For instance, $B[{\mathrm{u}},v]$ is then a shorthand notation for $B[\iota[{\mathrm{u}}],v]$. Since the embedding is assumed to be continuous, the conditions for GP inference with linear operator observations are still met when applying $\iota$ implicitly. The canonical choice for the embedding in the case of Sobolev spaces is $\iota[u]=[u]_{\sim}$.

Example 9 (Matérn covariances and Sobolev spaces).

Kanagawa et al. (2018) show that, under certain assumptions, RKHS sample spaces of GP priors with Matérn covariance functions are continuously embedded in Sobolev spaces whose smoothness depends on the parameter $\nu$ of the covariance function. To be precise, let ${\mathbb{D}}\subset\mathbb{R}^{d}$ be open and bounded with Lipschitz boundary such that the cone condition (Adams and Fournier, 2003, Definition 4.6) holds. Denote by $k_{\nu,l}$ the Matérn kernel with smoothness parameter $\nu>0$ and lengthscale $l>0$ . Then, with probability 1, the sample paths of a Gaussian process ${\mathrm{u}}$ with covariance function $k_{\nu,l}$ are contained in any RKHS ${\mathbb{H}}_{k_{\nu^{\prime},l^{\prime}}}$ with $l^{\prime}>0$ and

0<\underbrace{\nu^{\prime}+\frac{d}{2}}_{\eqqcolon m^{\prime}}<\nu

(B.2)

(Kanagawa et al., 2018, Corollary 4.15 and Remark 4.15), i.e. $\operatorname{paths}\left({\mathrm{u}}\right)\subset{\mathbb{H}}_{k_{\nu^{\prime},l^{\prime}}}$. Moreover, if $m^{\prime}\in\mathbb{N}$, then the RKHS ${\mathbb{H}}_{k_{\nu^{\prime},l^{\prime}}}$ is norm-equivalent to the Sobolev space $H^{m^{\prime}}\left({\mathbb{D}}\right)$ (Kanagawa et al., 2018, Example 2.6). This implies that the canonical embedding

\iota\colon{\mathbb{H}}_{k_{\nu^{\prime},l^{\prime}}}\to H^{m^{\prime}}\left({\mathbb{D}}\right),\quad{\mathrm{f}}(\cdot,\omega)\mapsto[{\mathrm{f}}(\cdot,\omega)]_{\sim_{H^{m^{\prime}}\left({\mathbb{D}}\right)}}

(B.3)

is continuous.

For ${\mathbb{U}}=H^{m^{\prime}}\left({\mathbb{D}}\right)$ , the example above shows that the Matérn covariance function $k_{\nu,l}$ with $\nu=m^{\prime}+\epsilon$ for any $\epsilon>0$ leads to an admissible GP prior. The choice $\epsilon=\frac{1}{2}$ makes evaluating the covariance function particularly efficient (Rasmussen and Williams, 2006). For instance, in section 3.3, we used $\nu=\frac{3}{2}=1+\frac{1}{2}$ for a weak form linear PDE with solution space ${\mathbb{U}}=H^{1}\left({\mathbb{D}}\right)$ . However, the elements of the Sobolev space $H^{m}\left({\mathbb{D}}\right)$ are only $m$ -times weakly differentiable, which means that $H^{2}\left({\mathbb{D}}\right)$ is not an admissible choice in section 3.2.

C Linear Partial Differential Equations

Definition 22 (Multi-index).

Using a $d$ -dimensional multi-index ${\bm{\alpha}}\in\mathbb{N}_{0}^{d}$ , we can represent (mixed) partial derivatives of arbitrary order as

\frac{\partial^{\lvert{\bm{\alpha}}\rvert}}{\partial{\bm{x}}^{{\bm{\alpha}}}}\coloneqq\frac{\partial^{\lvert{\bm{\alpha}}\rvert}}{\partial{x}_{1}^{({\alpha}_{1})}\cdots\partial{x}_{d}^{({\alpha}_{d})}},

where $\lvert{\bm{\alpha}}\rvert\coloneqq\sum_{i=1}^{d}{\alpha}_{i}$ . If the variables w.r.t. which we differentiate are clear from the context, we also denote this (mixed) partial derivative by $\mathrm{D}^{{\bm{\alpha}}}$ . For two multi-indices ${\bm{\alpha}},{\bm{\alpha}}^{\prime}\in\mathbb{N}_{0}^{d}$ , we write ${\bm{\alpha}}\leq{\bm{\alpha}}^{\prime}$ iff ${\alpha}_{i}\leq{\alpha}^{\prime}_{i}$ for all $i=1,\dotsc,d$ .
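The two index sets that appear throughout this appendix, namely all multi-indices with $\lvert{\bm{\alpha}}\rvert\leq k$ and all multi-indices with ${\bm{\alpha}}\leq{\bm{\beta}}$, can be enumerated directly. The following sketch uses only the Python standard library; the function names are hypothetical.

```python
from itertools import product


def multi_indices_total_order(d, k):
    """All alpha in N_0^d with |alpha| = alpha_1 + ... + alpha_d <= k."""
    return [a for a in product(range(k + 1), repeat=d) if sum(a) <= k]


def multi_indices_below(beta):
    """All alpha in N_0^d with alpha <= beta componentwise."""
    return list(product(*(range(b + 1) for b in beta)))


def leq(alpha, alpha_prime):
    """Partial order on multi-indices: alpha_i <= alpha'_i for all i."""
    return all(a <= b for a, b in zip(alpha, alpha_prime))
```

For $d=2$, the set with $\lvert{\bm{\alpha}}\rvert\leq 2$ has 6 elements, whereas the set with ${\bm{\alpha}}\leq(2,2)$ has 9 elements, since the latter also contains mixed derivatives such as ${\bm{\alpha}}=(2,2)$.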

Definition 23 (Linear differential operator).

A linear differential operator $\mathcal{D}\colon{\mathbb{U}}\to{\mathbb{V}}$ of order $k$ between a space ${\mathbb{U}}$ of $\mathbb{R}^{d^{\prime}}$ -valued functions and a space ${\mathbb{V}}$ of real-valued functions defined on some common open domain ${\mathbb{D}}\subset\mathbb{R}^{d}$ is a linear operator that linearly combines partial derivatives up to $k$ -th order of its input function, i.e.

\mathcal{D}[{\bm{u}}]\coloneqq\sum_{i=1}^{d^{\prime}}\sum_{{\bm{\alpha}}\in\mathbb{N}_{0}^{d},\lvert{\bm{\alpha}}\rvert\leq k}A_{i,{\bm{\alpha}}}\mathrm{D}^{{\bm{\alpha}}}{\bm{u}}_{i},

where $A_{i,{\bm{\alpha}}}\in\mathbb{R}$ for every $i\in\{1,\dotsc,d^{\prime}\}$ and every multi-index ${\bm{\alpha}}\in\mathbb{N}_{0}^{d}$ with $\lvert{\bm{\alpha}}\rvert\leq k$ .

C.1 Weak Derivatives and Sobolev Spaces

Definition 24 (Test Function).

Let ${\mathbb{D}}\subset\mathbb{R}^{d}$ be open and let

C_{c}^{\infty}\left({\mathbb{D}}\right)\coloneqq\{\phi\in C^{\infty}({\mathbb{D}},\mathbb{R})\mid\operatorname{supp}\left(\phi\right)\subset{\mathbb{D}}\text{ is compact}\}

be the space of smooth functions with compact support in ${\mathbb{D}}$ . A function $\phi\in C_{c}^{\infty}\left({\mathbb{D}}\right)$ is dubbed test function and we refer to $C_{c}^{\infty}\left({\mathbb{D}}\right)$ as the space of test functions.

Theorem 25 (Sobolev Spaces; this theorem is a summary of Adams and Fournier, 2003, Definitions 3.1 and 3.2 and Theorems 3.3 and 3.6).

Let ${\mathbb{D}}\subset\mathbb{R}^{d}$ be open, $k\in\mathbb{N}_{>0}$ , and $p\in[1,\infty)\cup\{\infty\}$ . The functional

\lVert u\rVert_{k,p,{\mathbb{D}}}\coloneqq\begin{cases}\left(\sum_{\lvert\alpha\rvert\leq k}\lVert\mathrm{D}^{\alpha}u\rVert_{L_{p}\left({\mathbb{D}}\right)}^{p}\right)^{\nicefrac{{1}}{{p}}}&\text{if }p<\infty,\\ \max_{\lvert\alpha\rvert\leq k}\lVert\mathrm{D}^{\alpha}u\rVert_{L_{\infty}\left({\mathbb{D}}\right)}&\text{if }p=\infty,\end{cases}

(C.1)

where the $\mathrm{D}^{\alpha}$ are weak partial derivatives, is called a Sobolev norm. A Sobolev norm $\lVert u\rVert_{k,p,{\mathbb{D}}}$ is a norm on subspaces of $L_{p}\left({\mathbb{D}}\right)$ , on which the right-hand side is well-defined and finite. A Sobolev space of order $k$ is defined as the subspace

W^{k,p}\left({\mathbb{D}}\right)\coloneqq\{u\in L_{p}\left({\mathbb{D}}\right)\mid\mathrm{D}^{\alpha}u\in L_{p}\left({\mathbb{D}}\right)\ \text{for}\ \lvert\alpha\rvert\leq k\}

of $L_{p}\left({\mathbb{D}}\right)$. Sobolev spaces $W^{k,p}\left({\mathbb{D}}\right)$ are Banach spaces under the Sobolev norm $\lVert\cdot\rVert_{k,p,{\mathbb{D}}}$. The Sobolev space $H^{k}\left({\mathbb{D}}\right)\coloneqq W^{k,2}\left({\mathbb{D}}\right)$ is a separable Hilbert space with inner product

\langle u_{1},u_{2}\rangle_{k,{\mathbb{D}}}\coloneqq\sum_{\lvert\alpha\rvert\leq k}\langle\mathrm{D}^{\alpha}u_{1},\mathrm{D}^{\alpha}u_{2}\rangle_{L_{2}\left({\mathbb{D}}\right)}

(C.2)

and norm $\lVert\cdot\rVert_{k,{\mathbb{D}}}\coloneqq\sqrt{\langle\cdot,\cdot\rangle_{k,{\mathbb{D}}}}=\lVert\cdot\rVert_{k,2,{\mathbb{D}}}$.
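As a small worked example of the norm $\lVert\cdot\rVert_{1,{\mathbb{D}}}$, the sketch below approximates the $H^{1}$ norm of a smooth function on ${\mathbb{D}}=(0,1)$ from grid values. It assumes that the function is smooth, so that its weak derivative coincides with the classical derivative, and it replaces the derivative and the $L_{2}$ integrals by finite differences and the trapezoidal rule. The function names are hypothetical and the discretization is an illustrative choice only.

```python
import numpy as np


def trapezoid(y, x):
    """Composite trapezoidal rule for samples y on the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def sobolev_h1_norm(u_vals, x):
    """Approximate ||u||_{1,D} = sqrt(||u||_{L2}^2 + ||u'||_{L2}^2) on a 1D grid."""
    du = np.gradient(u_vals, x)  # finite-difference approximation of u'
    return float(np.sqrt(trapezoid(u_vals**2, x) + trapezoid(du**2, x)))
```

For $u(x)=\sin(\pi x)$, the exact value is $\sqrt{(1+\pi^{2})/2}$, because $\lVert u\rVert_{L_{2}}^{2}=\frac{1}{2}$ and $\lVert u^{\prime}\rVert_{L_{2}}^{2}=\frac{\pi^{2}}{2}$ on $(0,1)$, which gives a simple reference value for the approximation.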

References

Adams and Fournier (2003) Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140 of Pure and Applied Mathematics. Elsevier, second edition, 2003.
Agrell (2019) Christian Agrell. Gaussian processes with linear operator inequality constraints. Journal of Machine Learning Research, 20(135):1–36, 2019.
Albert (2019) Christopher G. Albert. Gaussian processes for data fulfilling linear differential equations. Proceedings of the 39th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 33(1), 2019. doi:10.3390/proceedings2019033005.
Aliprantis and Border (2006) Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, Berlin, Heidelberg, third edition, 2006. doi:10.1007/3-540-29587-9.
Alt (2012) Hans Wilhelm Alt. Lineare Funktionalanalysis: Eine anwendungsorientierte Einführung. Springer, Berlin, Heidelberg, 2012. doi:10.1007/978-3-642-22261-0.
Alvarez et al. (2009) Mauricio Alvarez, David Luengo, and Neil D. Lawrence. Latent force models. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5, pages 9–16, Clearwater Beach, Florida, USA, 2009.
Azangulov et al. (2022) Iskander Azangulov, Andrei Smolensky, Alexander Terenin, and Viacheslav Borovitskiy. Stationary kernels and Gaussian processes on Lie groups and their homogeneous spaces I: the compact case. arXiv preprint arXiv:2208.14960, 2022.
Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, first edition, 2006.
Black and Scholes (1973) Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973. doi:10.1086/260062.
Bogachev (1998) Vladimir Igorevich Bogachev. Gaussian Measures, volume 62 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, Rhode Island, 1998.
Borthwick (2018) David Borthwick. Introduction to Partial Differential Equations. Universitext. Springer, first edition, 2018. doi:10.1007/978-3-319-48936-0.
Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Cockayne et al. (2017) Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Probabilistic numerical methods for PDE-constrained Bayesian inverse problems. In Geert Verdoolaege, editor, Proceedings of the 36th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, volume 1853 of AIP Conference Proceedings, pages 060001–1 – 060001–8, 2017. doi:10.1063/1.4985359.
Cockayne et al. (2019a) Jon Cockayne, Chris J. Oates, Ilse C.F. Ipsen, and Mark Girolami. A Bayesian conjugate gradient method (with discussion). Bayesian Analysis, 14(3):937–1012, 2019a. doi:10.1214/19-BA1145.
Cockayne et al. (2019b) Jon Cockayne, Chris J. Oates, T. J. Sullivan, and Mark Girolami. Bayesian probabilistic numerical methods. SIAM Review, 61(4):756–789, 2019b. doi:10.1137/17M1139357.
Da Costa et al. (2023) Nathaël Da Costa, Marvin Pförtner, Lancelot Da Costa, and Philipp Hennig. Sample path regularity of Gaussian processes from the covariance kernel, 2023.
Da Prato and Zabczyk (1992) Giuseppe Da Prato and Jerzy Zabczyk. Stochastic Equations in Infinite Dimensions. Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1992. doi:10.1017/CBO9780511666223.
Evans (2010) Lawrence C. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, Providence, Rhode Island, second edition, 2010.
Fasshauer (1997) Gregory E. Fasshauer. Solving partial differential equations by collocation with radial basis functions. In Alain Le Méhauté, Christophe Rabut, and Larry L. Schumaker, editors, Surface Fitting and Multiresolution Methods, pages 131–138. Vanderbilt University Press, Nashville, TN, 1997.
Fasshauer (1999) Gregory E. Fasshauer. Solving differential equations with radial basis functions: multilevel methods and smoothing. Advances in Computational Mathematics, 11:139–159, November 1999. doi:10.1023/A:1018919824891.
Fletcher (1984) C. A. J. Fletcher. Computational Galerkin Methods. Scientific Computation. Springer, Berlin, Heidelberg, first edition, 1984. doi:10.1007/978-3-642-85949-6.
Fourier (1822) Jean Baptiste Joseph Fourier. Théorie analytique de la chaleur. Firmin Didot, 1822. doi:10.1017/CBO9780511693229.
Girolami et al. (2021) Mark Girolami, Eky Febrianto, Yin Ge, and Fehmi Cirak. The statistical finite element method (statFEM) for coherent synthesis of observation data and model predictions. Computer Methods in Applied Mechanics and Engineering, 275:113533, 2021. doi:10.1016/j.cma.2020.113533.
Golub and Van Loan (2013) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, fourth edition, 2013.
Graepel (2003) Thore Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. In Proceedings of the 20th International Conference on Machine Learning, pages 234–241. AAAI Press, 2003.
Haasdonk and Burkhardt (2007) Bernard Haasdonk and Hans Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine learning, 68(1):35–61, 2007.
Hennig et al. (2015) Philipp Hennig, Michael A. Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A, 471(2179), 2015. doi:10.1098/rspa.2015.0142.
Hennig et al. (2022) Philipp Hennig, Michael A. Osborne, and Hans P. Kersting. Probabilistic Numerics: Computation as Machine Learning. Cambridge University Press, June 2022. doi:10.1017/9781316681411.
Holder (2005) David S. Holder, editor. Electrical Impedance Tomography: Methods, History and Applications. Institute of Physics Medical Physics Series. Institute of Physics Publishing, Bristol, 2005.
Holderrieth et al. (2021) Peter Holderrieth, Michael J Hutchinson, and Yee Whye Teh. Equivariant learning of stochastic fields: Gaussian processes and steerable conditional neural processes. In International Conference on Machine Learning, pages 4297–4307. PMLR, 2021.
Kanagawa et al. (2018) Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
Karniadakis et al. (2021) George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021. doi:10.1038/s42254-021-00314-5.
Kazhdan et al. (2006) Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the 4th Eurographics Symposium on Geometry Processing, volume 7, 2006.
Klenke (2014) Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer, London, second edition, 2014. doi:10.1007/978-1-4471-5361-0.
Krämer et al. (2022) Nicholas Krämer, Jonathan Schmidt, and Philipp Hennig. Probabilistic numerical method of lines for time-dependent partial differential equations. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151, pages 625–639. PMLR, 2022.
Lautrup (2005) Benny Lautrup. The PDE’s of continuum physics. In Proceedings of the Workshop on PDE methods in Computer Graphics, 2005.
Li et al. (2020) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020. doi:10.48550/arXiv.2003.03485.
Li et al. (2021) Zongyi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. doi:10.48550/arXiv.2010.08895.
Lienhard and Lienhard (2020) John H. Lienhard, IV and John H. Lienhard, V. A Heat Transfer Textbook. Phlogiston Press, Cambridge, MA, fifth edition, 2020.
Lin et al. (2022) Rong Rong Lin, Hai Zhang Zhang, and Jun Zhang. On reproducing kernel Banach spaces: Generic definitions and unified framework of constructions. Acta Mathematica Sinica, English Series, 38(8):1459–1483, August 2022. doi:10.1007/s10114-022-1397-7.
Logg et al. (2012) Anders Logg, Kent-Andre Mardal, and Garth Wells, editors. Automated Solution of Differential Equations by the Finite Element Method, volume 84 of Lecture Notes in Computational Science and Engineering. Springer, Berlin, Heidelberg, 2012. doi:10.1007/978-3-642-23099-8.
Maxwell (1865) James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical transactions of the Royal Society of London, 155:459–512, 1865.
Michaud (2019) Pierre Michaud. A simple model of processor temperature for deterministic turbo clock frequency. Research Report RR-9308, Inria Rennes, 2019. URL https://hal.inria.fr/hal-02391970.
Oates and Sullivan (2019) Chris J. Oates and Tim J. Sullivan. A modern retrospective on probabilistic numerics. Statistics and Computing, 29:1335–1351, 2019. doi:10.1007/s11222-019-09902-z.
Owhadi and Scovel (2018) Houman Owhadi and Clint Scovel. Conditioning Gaussian measure on Hilbert space. Journal of Mathematical and Statistical Analysis, 1(109), 2018.
Owhadi et al. (2019) Houman Owhadi, Clint Scovel, and Florian Schäfer. Statistical numerical approximation. Notices of the American Mathematical Society, 66(10):1608–1617, 2019. doi:10.1090/noti1963.
Raissi et al. (2017) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Machine learning of linear differential equations using Gaussian processes. Journal of Computational Physics, 348:683–693, 2017. doi:10.1016/j.jcp.2017.07.050.
Raissi et al. (2019) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. doi:10.1016/j.jcp.2018.10.045.
Rajput and Cambanis (1972) Balram S. Rajput and Stamatis Cambanis. Gaussian processes and Gaussian measures. The Annals of Mathematical Statistics, 43(6):1944–1952, 1972. doi:10.1214/aoms/1177690865.
Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, London, England, 2006.
Reisert and Burkhardt (2007) Marco Reisert and Hans Burkhardt. Learning equivariant functions with matrix valued kernels. Journal of Machine Learning Research, 8(3), 2007.
Rudin (1991) Walter Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, New York, second edition, 1991.
Särkkä (2011) Simo Särkkä. Linear operators and stochastic partial differential equations in Gaussian process regression. In Artificial Neural Networks and Machine Learning – ICANN 2011, pages 151–158, Berlin, Heidelberg, 2011. doi:10.1007/978-3-642-21738-8_20.
Särkkä et al. (2013) Simo Särkkä, Arno Solin, and Jouni Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61, 2013. doi:10.1109/MSP.2013.2246292.
Steinwart (2019) Ingo Steinwart. Convergence types and rates in generic Karhunen-Loève expansions with applications to sample path properties. Potential Analysis, 51:361–395, 2019. doi:10.1007/s11118-018-9715-5.
Steinwart and Christmann (2008) Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, first edition, 2008. doi:10.1007/978-0-387-77242-4.
von Harrach (2021) Bastian von Harrach. Numerik partieller Differentialgleichungen. Lecture Notes, 2021. URL https://www.math.uni-frankfurt.de/~harrach/lehre/Numerik_PDGL.pdf.
Wang et al. (2021) Junyang Wang, Jon Cockayne, Oksana Chkrebtii, Tim J. Sullivan, and Chris J. Oates. Bayesian numerical methods for nonlinear partial differential equations. Statistics and Computing, 31(55), 2021. doi:10.1007/s11222-021-10030-w.
Wenger and Hennig (2020) Jonathan Wenger and Philipp Hennig. Probabilistic linear solvers for machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Wenger et al. (2021) Jonathan Wenger, Nicholas Krämer, Marvin Pförtner, Jonathan Schmidt, Nathanael Bosch, Nina Effenberger, Johannes Zenn, Alexandra Gessner, Toni Karvonen, François-Xavier Briol, Maren Mahsereci, and Philipp Hennig. ProbNum: Probabilistic numerics in python, 2021.
Yosida (1995) Kôsaku Yosida. Functional Analysis, volume 123 of Classics in Mathematics. Springer, sixth edition, 1995. doi:10.1007/978-3-642-61859-8.
