Application of Hierarchical Matrices to Linear Inverse Problems in Geostatistics

Application of Hierarchical Matrices to Linear Inverse Problems in Geostatistics — Characterizing the uncertainty in the subsurface is an important step for exploration and extraction of natural resources, the storage of nuclear material and gasses such as natural gas or CO2. Imaging the subsurface can be posed as an inverse problem and can be solved using the geostatistical approach [Kitanidis P.K. (2007) Geophys. Monogr. Ser. 171, 19-30, doi:10.1029/171GM04; Kitanidis (2011) doi: 10.1002/9780470685853. ch4, pp. 71-85] which is one of the many prevalent approaches. We briefly describe the geostatistical approach in the context of linear inverse problems and discuss some of the challenges in the large-scale implementation of this approach. Using the hierarchical matrix approach, we show how to reduce matrix vector products involving the dense covariance matrix from O(m2) to O(m log m), where m is the number of unknowns. Combined with a matrix-free Krylov subspace solver, this results in a much faster algorithm for solving the system of equations that arise from the geostatistical approach. We illustrate the performance of our algorithm on an application, for monitoring CO2 concentrations using crosswell seismic tomography. 2 Oil & Gas Science and Technology – Rev. IFP Energies nouvelles


INTRODUCTION
Inverse problems arise frequently in the context of earth sciences, such as hydraulic tomography [1][2][3][4], crosswell seismic traveltime tomography [5][6][7], electrical resistivity tomography [8,9], contaminant source identification [10][11][12][13], etc.A common feature amongst inverse problems is that the parameters we are interested in estimating are hard to measure directly and a crucial component of inverse modeling is using sparse data to evaluate model parameters, i.e., the solution to the inverse problems.Inverse problems are typically ill-posed for three reasons: -the functional form of the relation between the model parameters and the data may be over-sensitive; -there is a lack of data due to insufficient measurements; -the presence of noise in the available data and limitations of the model.These imply that there is no unique solution to such inverse problem.Instead, there is a set of solutions consistent under the given model and data.Hence, one should not look for a unique solution but a statistical ensemble of solutions.The classical approach to deal with ill-posedness is to place additional restrictions on the solution, by the use of some kind of regularization such as Tikhonov regularization.This approach, while sufficient to produce the "best estimate" does not eliminate the uncertainty.On the other hand, statistical approaches to inverse problems model all the variables as random variables, which represents the degree of information concerning their realizations, which are encoded in terms of probability distributions.The solution to the inverse problem is the posterior probability distribution.
Geostatistics is a general method for solving such inverse problems, see for example [14].The approach is based on the idea of combining data with information about the structure of the function that needs to be estimated.In Bayesian and geostatistical approaches (for example, see discussion in [15,16]), the structure of the function is represented through the prior probability density function, which in practical applications is often parameterized through variograms and generalized covariance functions.The method has found several applications because it is generally practical and quantifies uncertainty.The method can generate best estimates, which can be determined in a Bayesian framework as a posteriori mean values or most probable values, measures of uncertainty as posterior variances or credibility intervals and conditional realizations, which are sample functions from the ensemble of the posterior probability distribution.
In recent years, there have been vast improvements in the measurement technology that allow the collection of even more measurements, tremendous improvements in computational power and advances in developing state-ofthe-art computational techniques to solve forward problems, in other words the governing partial differential equations, that scale well on several processors.Moreover, there is a demand for more accurate prediction of parameters that govern flow and transport in complex geological processes.There are several challenges in large-scale implementations of inverse problems.As the number of unknowns increase, the storage and computational costs involving the dense covariance matrix that represents the Gaussian random field are overwhelming, especially on unstructured grids.Recently, we have developed methods [17,18] for large-scale implementation of the geostatistical approach that scales to unknowns of the order of O (10 6 ) with O(10 3 ) measurements that uses of the hierarchical matrix approach [19][20][21][22][23][24] in order to reduce the costs involved in computing matrix-vector products from O(m 2 ) to O(m log m) [18] or O(m) [17], where m is the number of unknowns.
The purpose of this paper is to introduce and review the geostatistical approach to solving linear inverse problems and describe in detail the computational techniques involved in large-scale implementation of these algorithms using the hierarchical matrix approach and to illustrate its utility on some real applications.In particular, we describe how hierarchical matrices can be used to accelerate matrix operations involving dense covariance matrices.We emphasize here that this article is more of a tutorial nature, rather than development of new ideas or techniques.The remainder of the article is organized as follows.In Section 1, we review the geostatistical approach and the approaches in the literature for fast algorithms to deal with large covariance matrix.The next section, Section 2, describes the essential steps in the implementation of the hierarchical matrix approach, starting from simple observations about the structure of covariance matrices to stating the full algorithm along with complexity estimates.Subsequently in Section 3, we provide numerical evidence of the log-linear scaling of hierarchical matrices for various kernels.Then, we show how to combine the hierarchical matrix formulation with direct solvers and appropriately chosen matrix-free Krylov subspace iterative techniques to solve the system of equations that result from the geostatistical approach.Finally, we illustrate the performance of our algorithm on applications from CO 2 monitoring.
We briefly mention some other approaches to deal with large spatial datasets, in the context of kriging.One approach is to use covariance tapering [25,26], in which the correct covariance matrix is tapered using an appropriately chosen compactly supported radial basis function which results in a sparse approximation of the covariance matrix that can be solved using sparse matrix algorithms.Another approach is to choose classes of covariance functions for which kriging can be done exactly using a multiresolution spatial process [27][28][29].Other approaches include fixed rank kriging [30] Before we proceed, we would like to clarify the notations.Lower case bold face denotes vectors, upper case bold face denotes matrices and Greek alphabets denote scalars.In general, m denotes the number of unknowns and n stands for the number of measurements.The rest of the notations should be clear from the context.

General Formulation
Suppose that we want to estimate a spatially variable parameter s(x), for example the log hydraulic conductivity, in the domain D. The prevalent approach is to write it as a sum of a deterministic function with adjustable coefficients and a small scale variability term, or random-walk type of variability, which is typically modeled using covariance functions or variograms (for example, see Fig. 1).After appropriate discretization, the parameters to be determined, s ∈ R m×1 can be written as: where X ∈ R m×p is a known matrix, β ∈ R p×1 is a vector of drift coefficients and E is the expectation.The entries of the covariance matrix Q, are given by Q i j = κ(x i , x j ), where κ(•, •) is a generalized covariance function.The generalized covariance function [31,32] must be conditionally positive definite.The forward problem relates the parameters/unknowns to be determined, s, to the set of the measurements, y ∈ R n×1 , by the measurement equation: where, we model the noise in the measurements v as a normal distribution, with zero mean and covariance matrix R. Using Bayes' theorem, assuming a uniform prior for β, i.e., p(β) ∝ 1, we have p(s, β), the prior probability distribution of the unknowns and the probability distribution of the error can be combined to give the posterior probability distribution of the unknown parameters, s and β: Plugging in the expression for the respective terms, and using an alternative convenient notation, we find that: where x M is the norm induced by the positive definite matrix M and is defined as x 2 M = x T Mx.The resulting posterior distribution ( 4) is also Gaussian.The posterior mean values, ŝ and β is given by maximizing the a posteriori estimate, arg max The above Equation ( 5) is equivalent to the minimization problem given below in Equation ( 6): For n < m, it is more convenient to compute the solution to this optimization problem (5) by first obtaining the solution of the following linear system of equations: and then computing the resulting unknown field from the solution of the system of Equations ( 7) by the following transformation: We also denote by = HX and we also define the matrix A as: The covariance matrix of the posterior pdf of the parameters s and β is simply given by the inverse of the Hessian of the objective function that we minimized, i.e.
A direct implementation of this algorithm can be done in O(m 2 n + mn 2 + mnp).However, for realistic problem sizes, the number of unknowns, m, is quite large, O(10 5 − 10 6 ).
The covariance matrix Q is dense, and the construction and storage costs are high.One possibility is to form QH T directly without forming Q.Even with this possibility, the time required to form QH T is quite high.In the next section, we discuss several approaches to deal with this issue.

Covariance Functions
Consider the random Gaussian field s(x), and let x ∈ D ⊂ R d be a bounded domain, with a covariance Kernel κ(x, y).Three realizations of a Gaussian random field with mean zero exponential covariance function.
In terms of applications, it is useful to consider three kinds of covariance kernels [33]: 1. isotropic and translation invariant, i.e., κ(x, y) = κ(|x−y|); 2. stationary and anisotropic, i.e., κ(x, y) = κ(x − y); 3. non-stationary.Some possible choices for the covariance kernel κ(•, •) arise from the Matérn family of covariance kernels [34], corresponding to a stationary and isotropic stochastic process.They are defined as K α,ν (x, y) = C α,ν (r), r = x − y with: where, K ν is the modified Bessel function of second kind of order ν and γ is the gamma-function.Equation ( 11) takes special forms for certain parameters ν.For example, when ν = 1/2, C α,ν corresponds to the exponential covariance function, ν = 1/2 + n where n is an integer, C α,ν is the product of an exponential covariance and a polynomial of order n.In the limit as ν → ∞ and for appropriate scaling of α, C α,ν converges to the Gaussian covariance kernel.For a more detailed discussion of permissible covariance kernels, we refer the reader to the following references [31,32,35].After appropriate discretization of the domain, the covariance matrix takes the form A i j = κ(x i , x j ) for a set of points {x i }.Covariance matrices, although dense, have special structure which can be exploited.They are similar to dense matrices that arise from the discretization of integral equations.Various fast summation schemes have been devised to provide matrix-vector products in O(N log β N) for such problems, where N is the number of unknowns and β ≥ 0 an integer depending on the method.They broadly fall under three different categories: -FFT (Fast Fourier Transform) based methods, -fast multipole methods [36][37][38][60][61][62], -hierarchical matrices [21,22,24].

FFT Based Methods
Covariance kernels of type (1) or (2), discretized on a uniformly spaced, regular grid, result in Toeplitz (or Block-Toeplitz) matrices.Toeplitz matrices can be embedded inside circulant matrices, for which operations such as matrix-vector products, matrix-matrix products can be efficiently computed using FFT.However, their primary deficiency is that these algorithms don't extend very easily to other types of grids that are predominant in realistic problems.In particular, Nowak and Cirpka [39] utilized this method to estimate hydraulic conductivity and dispersivities in a large-scale problem.However, they cannot easily be extended to handle non-uniform grids.In [40], they extended the FFT based algorithm to deal with irregularly spaced measurements but they did not show how to extend their algorithm for general measurement operators or the case when the underlying unknowns are not on a regular grid.

Fast Multipole Methods
Fast multipole methods were first proposed by Greengard and Rokhlin [37] in the context of simulating several particles with coulombic or gravitational potential, which required computing sums of the form: where, {x i } and {y i } are a set of N points and {φ i } are a set of weights.Direct implementation of this summation can be done in O(N 2 ).Fast multipole methods require O(N) to compute q, given a prescribed relative error and assumptions on the kernel κ.It was first proposed for the Laplacian kernel in 2-D, κ(x, y) = − log x − y but it has been subsequently extended to several other kernels, including radial basis functions.The original version of the fast multipole method relied on analytic expressions for the multipole expansion of the kernels but several kernel independent versions have been proposed [36,38].For more details, the reader is referred to [60][61][62].

Hierarchical Matrices
Hierarchical matrices [19,21,22,24] (or H-matrices, for short) are efficient data-sparse representations of certain densely populated matrices.The main idea that is used repeatedly in these kind of techniques is to split a given matrix into a hierarchy of rectangular blocks and approximate each of the blocks by a low-rank matrix.Hierarchical matrices have been use successfully in the approximate representation of matrices arising in the boundary element method or for the approximation of the inverse of a finite element discretization of an elliptic partial differential operator.Fast algorithms have been developed for this class of matrices, including matrix-vector products, matrix addition, multiplication and factorization in almost linear complexity.
A variant of the H-matrix approach is the H 2 -matrix approach.The H 2 -matrix approach is an algebraic generalization of the fast multipole method [19].The H 2 -matrix approach has also been applied to such linear inverse problems [17].The H 2 -matrix approach enables us to further reduce the cost to O(m).In this article though, we will focus our attention and explain in detail only H-matrices.The H-matrix approach is slightly easier to introduce to a new audience than the H 2 -matrix approach.We will, however, present some results obtained using the H 2 -matrix approach.A more detailed introduction can be found in [20].

Low-Rank Matrices
Let A be any m × n matrix.The column rank of the matrix is the maximum number of linearly independent column vectors of A, whereas the row rank is the maximum number of linearly dependent row vectors of A. A fundamental theorem of linear algebra states that the row rank and column rank of a matrix must be equal.This number is called the "rank" of a matrix.
Let A = UσV T be the Singular Value Decomposition (SVD) of the matrix, with n} ≥ 0 and zero non-diagonal entries.The columns of U form a basis for the row-space of A, whereas the columns of V form a basis for the column-space of A. For readers unfamiliar with the SVD, we recommend the following reference [41].The SVD is a very useful idea for both theoretical analysis and numerical computations.It will also be convenient to define the following ideas.An unitarily invariant norm • , is one for which UAV = A for all unitary (i.e., for real matrices, same as orthogonal) matrices U and V. Examples of unitarily invariant norms are the spectral or 2-norm A 2 = σ 1 , where σ 1 is the largest singular value of A and the Frobenius norm We now state a theorem mentioned in [19, Section 1.1.3,page 18] that gives the best approximation to a matrix, for a prescribed rank k.
Theorem.Let A be a real valued m × n matrix and let • be a unitarily invariant matrix norm, then for k ≤ min{m, n}, it holds that: where, and σ k = diag(σ 1 , σ 2 , . . ., σ k , 0, . . ., 0).Also, u i , v i , i = 1, . . ., k are the first k columns of U and V respectively.In other words, the best approximation of a matrix, of rank at most k is simply obtained by retaining the first k singular values and vectors of the matrix.In particular, for A k as defined before.
In terms of a relative error of approximation, an alternate notion of rank is introduced [19,42] based on a relative approximation error ε.The ε-rank of a matrix A in the norm • is defined as: where, A r is the SVD of A truncated to the r largest singular values.
A convenient representation of low rank matrices is the so-called outer-product form.Suppose there exist two matrices M ∈ R m×k and N ∈ R n×k , with columns m i , n i , i = 1, . . ., k such that: One advantage of this form of representation is that the storage requirements are k(m+n), as opposed to the usual mn entries of A. This leads us to a formal definition of low-rank matrices.It should be emphasized that this definition is not the same with the usual mathematical definition of deficient rank.A matrix A ∈ R m×n of rank k is called low-rank if: Moreover, low-rank matrices will always be represented in outer-product form, while the entry-wise representation will be used for all other matrices.Consider as before, A = MN T .Computing y = Ax and y = A T x on appropriate sized vectors x can be done in the following steps: -compute w ← N T x, which costs O(kn), by counting number of multiplications; -compute y ← Mw, similarly, which costs O(km).Thus, the total cost is O(k(m + n)).This is to be compared with O(mn) for the direct multiplication of Ax.The same results apply when the matrix-vector multiplication involves the transpose of A, i.e., y = A T x.

Motivation and Key Ideas
We will focus our attention on a 1−D example that will help illustrate several of the features of H−matrices.
Notice that this is an appropriate covariance function, even and positive definite (which one can verify by taking the Fourier transform).Let A be an m × n matrix with entries A i j = κ α (x i , y j ) for points x i , y j uniformly discretized in the domain [0, 1] × [0, 1] with i = 1, . . ., m and j = 1, . . ., n and are given by the following expressions: Clearly, because this matrix is dense, storing this matrix and computing matrix-vector products are of O(mn) complexity.Figures 2 and 3 provide key insight to the structure of the matrix that we hope to exploit.Again, for the sake of illustration, we consider m = n = 256 (even though in actual applications where H-matrix techniques are useful, m and n are much larger) and compute the ε-ranks of various subblocks of the matrix, for ε = 10 −6 and α = 10 −6 .The details of these computations will be discussed later, while here we focus on the results.From the first diagram in the figure, we see that the matrix has full rank, which is not surprising.
In the second diagram, the block ranks of the 2 × 2 blocks are provided.We see that the off-diagonal blocks have low ranks whereas, the (1, 1) and (2, 2) blocks still have full rank.We subdivide these blocks further.We see that this kind of hierarchical separation of blocks and approximation of sub-blocks chosen by some criteria (in these case, simply off-diagonal sub-blocks) results in reduction in storage requirements and, as we shall see, significant reduction in computational requirements.The amount of memory (in kilo bytes) required for storing each level of approximation is as follows: 512, 296, 204 and 122.Clearly, even just using 2 levels of compression, we can reduce the storage to less than half.This represents a significant reduction in storage for problems of size ∼ 10 6 .As a second example, we consider the exponential covariance kernel, which is a part of the Matérn covariance family and is defined as: As before, we plot the ε-ranks of various sub-blocks of the matrix, A with entries A i j = κ(x i , y j ) for ε = 10 −6 .The points x i , i = 1, . . ., m and y j , j = 1, . . ., n, for m = n = 256 are the same as before.However, the situation is much more dramatic.The off-diagonal blocks have ranks at most 2. memory (in kilo bytes) required for storing each level of approximation is as follows: 512, 260, 136 and 76.At the finest level, the storage amount is less than 15% of the coarsest level of refinement.We are now in a position to summarize the key features of hierarchical matrices: -a hierarchical separation of space; -an acceptable tolerance ε that is specified; -low-rank approximation of carefully chosen sub-blocks.
In the above example, the situation was rather simple because clustering of points can be easily accomplished by grouping them by intervals.In a realistic 3-D scenario, the clustering requires some slightly more complicated data structures such as a tree.The form of recursive low-rank representation of off-diagonal blocks is a special case of a class of matrices known as hierarchically semiseparable matrices (HSS) [43][44][45].In the general case, submatrices corresponding to clusters that are well-separated lead to lowrank matrices.This will be discussed in detail shortly.
When we plot the singular values of off-diagonal subblocks of matrix corresponding to non-overlapping segments, we observe an exponential decay in the singular values, see Figure 4.The exponentially decreasing singular values, can be explained by appealing to the Taylor series representation of the kernel κ α .Consider two sets of indices t, s, and let X t = {x i |i ∈ t} and X s = {y j | j ∈ s} that satisfy the following condition: where we define: Now, using the Taylor series expansion for κ α (x, y) where, R k (x, y) is the remainder term in the Taylor series expansion which can be bounded as follows: provided that diam(X s ) ≤ η dist(X t , X s ) and α small.Repeating the same argument interchanging the roles of x and y, a similar result can be obtained with diam(X t ) ≤ η dist(X t , X s ).
A bound of the type Equation ( 25) implies that with matrices U and V defined by: and therefore using the Frobenius norm M 2 F = i, j M 2 i j , we have the following global error bound: 1 1 2 Simple cluster tree with 3 levels.
Therefore, the global relative approximation error decreases exponentially with increasing k.From this it can easily be shown that: where, σ k+1 is the k + 1-th singular value of A. This implies that the singular values of such blocks decay exponentially.

Construction of Cluster Tree and Block-Cluster Tree
Consider the index set I = {1, 2, . . ., N}, corresponding the points {x i } N i=1 .A cluster τ ⊂ I, is a set of indices corresponding to points that are "close" together in some sense.We also define X τ = {x i |i ∈ τ}.We have previously seen that low-rank matrices are an efficient means to represent matrices with exponentially decaying singular values.Representing the entire matrix by a low-rank matrix is not feasible in most applications.Typically, in dense covariance matrices that arise in practice, only the sub-blocks corresponding to admissible clusters (a notion that we will expand on shortly) can be approximated well by low-rank matrices.In order to partition the set of matrix indices I × I hierarchically into sub-blocks, we first recursively subdivide the index set I, which leads to the so-called cluster tree.For a formal definition, please refer to [20]. Figure 5 describes a very simple cluster tree with 3 levels.Also, the notation I (l) k implies, the index set of node at level l and having index k (at level l).
A simple algorithm to construct the cluster tree is based on geometric bisection and is described in Figure 6.Briefly, this algorithm is initialized with the cluster corresponding to the index set I. The eigenvector v 1 corresponding to the largest eigenvalue of the covariance matrix C is computed, Center of mass of the cluster: Covariance matrix of the cluster: Eigenvalues and eigenvectors: Build Cluster Tree -Split τ.
and it corresponds to the direction of the longest expanse of the cluster.Then, we compute the separation plane that goes through the center of mass of the cluster X and is orthogonal to the eigenvector v 1 .Based on which side of the separation plane the points lie in, the cluster is split into two, more or less equal, sons.Recursive subdivision gives rise to a binary cluster tree.Theoretically, each node of the tree is split into two sons until the cardinality of the node is 1.In practice, this subdivision is stopped if the cardinality of the node is less than equal to some threshold parameter n min ∼ 32.It can be shown that for uniformly distributed clusters, the depth of the cluster tree, in other words the number of levels in the tree, is log 2 (|I|).Other algorithms to construct a tree are geometric bisection [20] or the FMM cluster tree which is visualized in Figure 7.
Next, well separated cluster pairs, called admissible cluster pairs are identified.A cluster pair (τ, σ) is considered admissible if: where, the diameter of a cluster τ is the maximal distance between any pair of points in τ and the distance between two clusters τ and σ, is the minimum distance between any pair of points in τ × σ: x − y (31) A kernel is called asymptotically smooth, if there exist constants c as 1 , c as 2 and a real number g ≥ 0 such that for all multi-indices α ∈ N d 0 it holds that:

Leaf nodes Direct
No far away interactions at the first two levels Fast multipole method tree.
The condition (30) ensures that the kernel κ(• , • ) is asymptotically smooth over the domain D τ × D σ , where D σ is the convex hull of X σ .Asymptotically smooth kernel functions can be shown to admit degenerate approximations on the kernel functions, on pairs of domains satisfying the admissibility condition (30).The implication of the kernel being asymptotically smooth on admissible clusters is that the submatrix corresponding to the blocks τ × σ have exponentially decaying singular values and are well approximated by lowrank matrices.The rank of the resulting matrix is O(k d ) [42], where k is the number of terms retained in the Taylor series expansion.
The cluster tree can then be used to define a block cluster tree, by forming pairs of clusters recursively.Given a cluster tree and an admissibility condition (Eq.30), the block cluster tree can be constructed using Figure 8 (see for example, Fig. 9 and 10).By calling the algorithm with τ = σ = I, we create a block cluster tree with root I × I.The leaves of the block cluster tree form a partition of I × I.
For the leaves of the block-cluster tree, that are nodes which do not have any sons, if the clusters are not admissible then we store the matrix corresponding to this sub-blocks in a component-wise fashion.Else, if they are admissible, then the admissibility condition (30) guarantees that the subblocks will be of low-rank which can be computed in one of the methods described in the subsequent Section 2.4, and if τ × σ is not admissible and |τ| > n min and |σ| > n min then: the corresponding low-rank matrices will be stored in the outer-product form described in Section 2.1.

Low Rank Approximation
We have seen in previous sections that sub-blocks of the matrices corresponding to admissible clusters can be well approximated using low-rank matrices.However, to illustrate this, we have made the assumption that it is possible to compute the Taylor series expansion of the kernel.Computing the Taylor series (or for that matter, other analytical expressions such as multipole expansions) in 3D of kernels can be a tedious task.Moreover, every time one needs to use a new covariance kernel, it might be necessary to re-derive the Taylor series for that particular kernel.Therefore, in general, we would like an approach that is, in some sense, kernel independent.By this, we mean that as long as the kernel satisfies the admissibility condition (30), the method should only require as input a numerical evaluation of the kernel κ(x, y) and a given pair of points (x, y).Moreoever, the number of terms required for an accurate representation up to an error of ε, is O(k d ), where k = O(1/ε).Another approach is to use tensor-product interpolation, which constructs a low-rank approximation using tensor-products of interpolating polynomials such as Legendre or Chebyshev polynomials [20,36].This approach has the advantage that it is quite general since it only uses kernel evaluations but like the Taylor series it also requires O(k d ) evaluations.We now discuss a way to compute low-rank representations using adaptive cross approximation.

Adaptive Cross Approximation
The idea behind the cross approximation is based on the result described in [42], which states that supposing a matrix A is well approximated by a low-rank matrix, by a clever choice of k columns C ∈ R m×k and k rows R of the matrix A, we can approximate A of the form: It relies on a result from [47], which states that if there is a sufficiently good low rank approximation to a matrix, Hierarchical matrix arising out of a one-dimensional problem at different levels in the tree.The above figures were generated using HFIGS [46].
Full rank matrix sub-blocks Low rank matrix sub-blocks Hierarchical matrix arising out of a two dimensional problem at different levels in the tree.The above figures were generated using HFIGS [46].
then there exists a cross-approximation with almost the same approximation quality.Figure 11 describes a simple heuris-tic to compute such a cross approximation that is based on successive approximations by rank-1 matrices (visualized in Initialize: R 0 = A, S = 0 for all k = 0, 1, 2, . . .do if γ k+1 0 then: Compute column u k+1 := γ k+1 R k e j k+1 and row v k+1 := R T k e i k+1 New residue and approximation: Terminate algorithm with exact rank k − 1 end if end for.

Figure 11
Cross approximation using full pivoting.Fig. 12).It has the property that if the matrix A ∈ R m×n has an exact rank k < min{m, n}, this algorithm will terminate in k steps and defining: (34) we have that S k = A in exact arithmetic.Furthermore, it exactly reproduces the k pivot rows and columns of A. Of course, the principal disadvantage of this algorithm is that, to generate a rank-k approximation, it requires O(kmn) steps, which is not feasible for large matrices.The bottleneck arises from calculating the pivot indices (i * k , j * k ) which requires generating all the entries of the matrix A.
Several heuristic strategies have been proposed to reduce the complexity of the fully pivoting cross approximation algorithm.In particular, one such algorithm is called Partially Pivoted Adaptive Cross Approximation algorithm, which maximizes |A i j | for only one of the pairs of indices and also avoids the update of A. It has a complexity O(k 2 (m + n)) and is very easy to implement.Figure A.1 in the appendix, lists a practical version of the algorithm, which includes a termination criteria based on an heuristic approximation to the relative approximation in the Frobenius norm.The proofs of convergence of this algorithm can be found in [48], and they rely on approximation bounds of Lagrange interpolation and a geometric assumptions on the distribution of points, which may not be very practical.[20] lists some contrived counterexamples that show that this algorithm can produce bad pivots.To fix this issue, several other variants have been proposed such as Improved ACA (ACA+) and Hybrid Cross Approximation (HCA).

Further Compressing Low-Rank Representations
From Theorem 2.1, clearly the best rank k approximation to a matrix is the SVD of the matrix truncated to retain only the k largest singular values and their corresponding singular vectors.However, computing the SVD of a matrix is expensive.Indeed, for a m × n matrix A, with m > n, the cost of SVD would be O(mn 2 ).Clearly, this is an unacceptably high cost, considering that we intend to design algorithms that work in almost-linear complexity.It is for this reason that we use one of the methods described above -Adaptive Cross Approximation (ACA) or polynomial interpolation which can be computed in linear complexity.However, from Theorem 2.1, these low-rank representations are not optimal and there is further scope for compression.
The compression is performed as follows: given the lowrank outer product form for the matrix A of rank r, we compute the SVD of the matrix and then retain only k(ε) (as defined in ( 16)) singular values for a given tolerance ε.We now show how to do this in an efficient manner.Given A = UV T , U ∈ R m×r , V ∈ R n×r , the truncated matrix Ã can be computed as follows: -compute the truncated QR factorizations, and now, Ã = Ũ ṼT .This truncation can be computed in O(r 2 (m + n) + r 3 ).

Matrix-Vector Product
We are then in a position to define the set of hierarchical matrices with block-wise rank k.Definition.Let T I be a cluster tree and T I×I be the block cluster tree corresponding to the index set I. We define the set of hierarchical matrices as: Note that H(T I×I , k) is not a linear space.For the subblock A τ×σ ∈ R τ×σ which is the restriction of the matrix A to the sub-block τ × σ, and supposing that this cluster pair is admissible, i.e., it satisfies condition (30), it is possible to generate a low-rank approximation for this subblock when the kernel is asymptotically smooth (for definition see [19]).The low-rank representation for this subblock is computed using the partially pivoted adaptive cross approximation algorithm [19,24].
To compute the matrix-vector product involving the H−matrix A, with a vector x, we use Figure 13.For a rigorous derivation of the computational complexity of storage and matrix-vector product, the reader is referred to the following sources [20,22].Since the hierarchical matrix framework is an algebraic approach, it is possible to define a matrix arithmetics (addition, multiplication and inversion) for the hierarchical matrices.These algorithms can be found in the aforementioned references.
x x i * j * Illustration of the steps involved in adaptive cross approximation.

APPLICATIONS
We now describe how to use the hierarchical matrix approach for the solution of the system of equations that result from the geostatistical approach.We then highlight the performance of the hierarchical matrices in terms of setup and computation time.The performance of our algorithm is demonstrated with an application to realtime monitoring at a CO 2 sequestration site.

Problem Specification
In this section, we describe the specific problem, which we solve by large-scale linear inversion.The problem deals with large scale crosswell tomography to monitor the CO 2 plume in CO 2 sequestration sites.The problem configuration is shown in Figure 14.The motivation for this capture geometry stems from [49].We will compare the results we obtain using our fast large scale linear inversion algorithm with the results in [49], which were obtained by running the forward model using TOUGH2, a multiphase-flow reservoir model [50] and patchy rock physical model [51].
A cross-well tomography is setup with n s sources along the vertical line AB and n r receivers along the vertical line CD.The sources emit a seismic wave and the receivers measure the time taken by the seismic wave from a source to hit the receiver.This is done for each source-receiver pair.
Our goal is to image the slowness of the medium inside the rectangular domain.Slowness is defined as the reciprocal of the speed of the seismic wave in the medium.In the context of CO 2 sequestration for example, the seismic wave travels considerably slower through CO 2 saturated rock [49,52,53] as opposed to the rest of the medium.Hence by measuring the slowness in the medium, we can estimate the CO 2 concentration at any point in the domain (with some uncertainty) and thereby the location of the CO 2 plume.
The above configuration is typical for most of the continuous crosswell seismic monitoring sites.We go about modeling the problem as follows.As a first order approximation, the seismic wave is modeled as traveling along a straight line from the sources to receivers with no reflections/refractions.The time taken by the seismic wave to travel from the source to the target is measured.Each source-receiver pair gives us a measurement and hence there are a total of n = n s × n r measurements.To obtain the slowness in the domain ABDC, the domain is discretized into m grid points (i.e., an m x × m y grid such that m = m x m y ) and within each cell the slowness is assumed to be constant.Let t i j denote the time taken by the seismic wave to travel from source i to receiver j.Let s k be the slowness of the kth cell.For every source-receiver (i, j) pair, we then have that: where l k i j denotes the length traveled by the ray from source i to the source j through the kth cell and v i j denotes the measurement error i ∈ {1, 2, . . ., n s }, j ∈ {1, 2, . . ., n r }, k ∈ {1, 2, . . ., m}.Equation ( 36) can be written in a matrixvector form: Typically, the measurement error is modeled as a Gaussian white-noise.For most problems of practical interest, the number of measurements n is much smaller than the number of unknowns m, i.e., n m.This under-determined linear system constitutes our inverse problem.Conventional technique to solve large scale linear inversion problem for the cross-well tomography problem where H is sparse.
We will now analyze the matrix H to figure out the structure of the linear system.Each row of the matrix H corresponds to a source-receiver pair.Consider the source i and receiver j where i ∈ {1, 2, . . ., n s } and j ∈ {1, 2, . . ., n r }.This corresponds to the ((i − 1)n r + j)th row of the matrix H. Since we have modeled the wave traveling from a source to a receiver as a straight line without reflections/refractions, the non-zero entries along each row correspond to the cells hit by the ray from a source to the receiver.Since the wave from the source to receiver travels along a straight line, only the cells that lie on this straight line contribute to the non-zero entries.Hence, every row of H has only O( √ m) non-zeros.Hence, the matrix H is sparse since it has only O(n √ m) entries as opposed to O(nm) entries.We would hence like to take advantage of this sparsity to accelerate our computations.This underdetermined system is solved using the Bayesian geostatistical approach discussed in Section 3.3.Since, we would also like to characterize the finescale features, m can be much larger than n.The convention algorithm to solve this system is described in Figure 15.A pictorial representation of the capture geometry is shown in Figure 16.Section 3.3 discusses the fast algorithm to solve the above problem using Bayesian geostatistical approach.

H-matrix Approximation of Covariance Matrices
We verify the asymptotic complexity of the H-matrix approach with regard to setup costs and computation of matrix-vector products.We focus on the following test case: from the domain [−1, 1] 2 , we randomly sample m points which are distributed uniform randomly and for these set of points, we build the corresponding covariance matrix and compute matrix-vector products.The kernel is chosen to be a member of the Matérn covariance family, the exponential covariance kernel κ(x, y) = exp(− x − y /l), where l is the integral length scale and is chosen to be 1.It can be shown that this kernel satisfies the requirements of asymptotic smoothness [33] as stated in Section 2. We also make the following choices for the parameters: we pick η = 0.75 and n min = 32.We compare the time required to perform matrix-vector products for three different tolerances ε = 10 −3 , 10 −6 , 10 −9 .The block wise rank k is computed so that each sub-block of the matrix is approximated relatively in Frobenius norm to ε.We observe a log-linear scaling in the computation of matrix-vector products.We observe similar performance for other covariance kernels as well.The accuracy of the matrix-vector products involving an appropriate sized vector x can be established using the following result: Various schemes to estimate the Frobenius norm of the matrix are described in [54].For the rest of the examples, we use the exponential covariance kernel, with appropriately chosen length scales depending on the problem.We also The crosswell geometry.pick ε = 10 −9 and η = 0.75.The choice of the approximation ε depends on the application at hand.Typically, a choice of ε = 10 −9 is a good choice and ensures that the matrix-vector product are sufficiently accurate (see Fig. 17).

Algorithm
In this section, we discuss in detail the fast algorithm for the large scale linear inverse problem using Bayesian geostatistical approach and how this algorithm is applied to the cross-well tomography problem.We would like to point out that the algorithm is far more general and can be used for other linear inverse problems and not just the cross-well tomography case.
The main bottleneck in the entire computation is constructing the matrix product Qz, where Q ∈ R m×m is the covariance matrix.As mentioned earlier, the covariance matrix Q has a hierarchical matrix structure since Q i, j = κ(x i , x j ), where x i and x j are the grid points, and the interaction between the well-separated clusters can be efficiently approximated by low-rank interactions.We also take into account the sparsity of the measurement operator H, to accelerate matrix-vector product computations.
The system of Equations ( 7) is solved iteratively using a Krylov subspace method such as restarted GMRES (Generalized Minimum RESidual).The key advantage of using Krylov subspace methods, is that they never require the explicit entries of the matrix but only rely on matrix-vector products involving the matrix or its transpose.The matrix A is never explicitly constructed, where: This is so since Q is never constructed explicitly.At each step in the iterative technique, we need the matrix vector product Ax.If we set: The bottleneck in computing Ax is in computing HQH T x 1 .
Hence, the goal is to accelerate this computation using the hierarchical matrix approach.The algorithm to accelerate the computation of HQH T x 1 is described in Figure 19.
If the iterative solver converges in κ iterations, then the total computational cost of the proposed fast algorithm is O κm log m + κn √ m .While implementing the iterative solver, we observed that in certain instances, such as when the number of measurements are high, the number of iterations required for convergence of our system were quite large and thus, we devise a pre-conditioner to reduce the number of iterations.To deal with this, we proposed a pre-conditioner that serves to cluster the eigenvalues of the preconditioned operator near 1 thus resulting in fewer iterations.We will not describe this algorithm here.For the applications that we will consider, the number of measurements is modestly high and the algorithm converges in fewer iterations than the number of measurements, even without the use of a pre-conditioner.
An alternate strategy would be to compute QH T using fast matrix-vector products using the H-matrix approach.Subsequently, we form the dense matrix Ψ and solve the system of Equations ( 7) using a direct solver such as Gaussian elimination.This strategy has an advantage when the entries of the matrix H are available explicitly, and the number of measurements are small.This approach has been discussed in [17].
The algorithm is implemented in C++ with the aid of PETSc [55][56][57], which has a collection of data structures and linear solvers that are extremely useful for large scale implementations in scientific computing.

Numerical Benchmark
In Figure 18, we show the reconstruction of the CO 2 concentration, for which the measurements are taken 5 days after the injection.The number of measurements were 288 and the number of unknowns varied up to a million.The algorithm converged in fewer than 200 iterations in all cases, for the residual to satisfy r k 2 / r 0 2 ≤ 10 −6 .
For a = 20 m and b = 1 665 m, we plot the image of the estimated slowness.This is shown in Figure 20.There seems to be a good agreement with the true data shown in Figure 20.The uncertainty in the image reconstructed is shown in the right panel of Figure 20.

CONCLUSIONS AND DISCUSSIONS
We have described an efficient numerical method to compute the best estimate for a linear under determined set of equations using the Stochastic Bayesian approach for geostatistical applications.We emphasize here, the generality of our formulation, in the sense that it is capable of handling a wide variety of generalized covariance functions κ(•, •) with a minimum amount of recoding and the fact that this is applicable to situations in which the unknowns can be distributed on an irregularly spaced grid.One important issue that has not been discussed is the quantification of uncertainty associated with the solution of the inverse problem.A commonly used strategy [14,59] to quantify the uncertainty associated with the estimate of the solution, is via computing conditional realizations.This method avoids the computation of the posterior covariance matrix, which is expensive for large-scale problems [18].An extension to solve quasi-linear problems and for solving 3D problems is underway.

Figure 19 y
Figure 18 a) Comparison of the time taken by the fast algorithm vs the conventional direct algorithm; b) comparison of the storage cost.
Figure 20 a) True slowness using TOUGH2/ECO2N [50, 58] simulations, 120 hours after injection; b) reconstructed slowness using our proposed algorithm for the optimal choice of a and b; c) uncertainty in the estimated solution.All these plots are for a 237 × 217 grid.Each unit of slowness corresponds to 10 4 s/m.