*Oil & Gas Science and Technology – Rev. IFP Energies nouvelles*(2016)

**71**, 66

##
*S*-Step BiCGStab Algorithms for Geoscience Dynamic Simulations

### Méthodes *s*‐step BiCGStab appliquées en Géosciences

^{1}
IFP Energies nouvelles, 1-4 avenue de Bois-Préau, 92852
Rueil-Malmaison Cedex – France

^{2}
INRIA Paris, Alpines, 2 Rue Simone IFF, 75012
Paris – France

^{3}
UPMC - Univ Paris 6, CNRS UMR 7598, Laboratoire Jacques-Louis Lions, 4 Place Jussieu, 75005
Paris – France

e-mail: ani.anciaux-sedrakian@ifpen.fr – laura.grigori@inria.fr – sm101@aub.edu.lb – soleiman.yousef@ifpen.fr

^{*} Corresponding author

Received:
15
December
2015

Accepted:
11
October
2016

In basin and reservoir simulations, the most expensive and time consuming phase is solving systems of linear equations using Krylov subspace methods such as BiCGStab. For this reason, we explore the possibility of using communication avoiding Krylov subspace methods (*s*-step BiCGStab), that speedup of the convergence time on modern-day architectures, by restructuring the algorithms to reduce communication. We introduce some variants of *s*-step BiCGStab with better numerical stability for the targeted systems.

Résumé

Dans les simulateurs d’écoulement en milieu poreux, comme les simulateurs de réservoir et de bassin, la résolution de système linéaire constitue l’étape la plus consommatrice en temps de calcul et peut même représenter jusqu’à 80 % du temps de la simulation. Ceci montre que la performance de ces simulateurs dépend fortement de l’efficacité des solveurs linéaires. En même temps, les machines parallèles modernes disposent d’un grand nombre de processeurs et d’unités de calcul massivement parallèle. Dans cet article, nous proposons de nouveaux algorithmes BiCGStab, basés sur l’algorithme à moindre communication nommé *s*-step, permettant d’éviter un certain nombre de communication afin d’exploiter pleinement les architectures hautement parallèles.

*© A. Anciaux-Sedrakian et al., published by IFP Energies nouvelles, 2016*

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Introduction

Many scientific problems require the solution of systems of linear equations of the form *Ax* = *b*, where the input matrix *A* is very large and sparse. These systems arise mainly from the discretization of Partial Differential Equations (PDE), and are usually solved using Krylov subspace methods, such as Generalized Minimal RESidual (GMRES) [1], Conjugate Gradient (CG) [2] and Bi-Conjugated Gradient Stabilized (BiCGStab) [3].

In the case of basin modeling or reservoir simulations with highly heterogeneous data and complex geometries, complex non-linear systems of PDE are solved. These PDE are discretized with a cell-centered finite volume scheme in space, leading to a non-linear system which is solved with an iterative Newton solver. At each Newton step, the system is linearized. Then, the generated large, sparse and unstructured linear system is solved using preconditioned GMRES, BiCGStab, CG, Orthomin or other preconditioned iterative methods. Some of the most commonly used preconditioners are ILU(*k*), ILUT, AMG and CPR-AMG. This resolution phase constitutes the most expensive part of the simulation. Thus we focus on linear solvers, since their efficiency is a key point for the simulator’s performance.

Furthermore, modern parallel computing resources are based on complex hardware architecture. They are composed of several multi-core processors and massively parallel processing units such as many-cores or General-Purpose GPU (GPGPU) cards. Most of the current algorithms are not able to fully exploit the highly parallel architectures. In fact, a severe degradation of performance is detected when the number of processing units is increased. This is due to the difference between the required time to perform floating point operations (flops) by processing units and the time to communicate the obtained results, where flops have become much cheaper than data communication.

Thus, recent research has focused on reformulating dense and sparse linear algebra algorithms with the aim of reducing and avoiding communication. These methods are referred to as communication avoiding methods, whereby communication refers to data movement between different processing units in parallel, and different levels of memory hierarchy. In the case of Krylov subspace methods, the introduced communication avoiding Krylov subspace methods [4-8] are based on *s*-step methods [9-11]. The goal is to restructure the algorithms to perform *s* iterations at a time by using kernels that avoid or reduce communication, such as the matrix powers kernel [12], Tall and Skinny QR (TSQR) [13], and Block Gram Schmidt (BGS) methods.

Our aim is to reduce the overall cost of the linear solver resolution phase in geoscience simulations, specifically basin and reservoir simulations, using a parallel implementation of BiCGStab that avoids communication on multi-core hardware (CA-BiCGStab) [5] and has a similar convergence behavior as the classical BiCGStab method. Communication Avoiding BiCGStab (CA-BiCGStab), which was introduced in [5], is a reformulation of BiCGStab into *s*-step BiCGStab that avoids communication. Thus, in this paper we study the convergence behavior of a sequential version of the unpreconditioned *s*-step BiCGStab [5], on matrices obtained from reservoir simulations, with different *s* values. The obtained results show that, for most of the tested matrices, *s*-step BiCGStab requires more iterations to converge than BiCGStab. Thus, we design new variants of *s*-step BiCGStab, that have the same convergence rate as BiCGStab for *s* values between 2 and 6, and reduce communication similarly to *s*-step BiCGStab.

In Section 1, we introduce BiCGStab, its reformulation to *s*-step BiCGStab [5], and we discuss the performance of *s*-step BiCGStab in geoscience applications, specifically reservoir simulations. Then, in Section 2, we introduce the new *s*-step BiCGStab variants that we call orthonormalized *s*-step BiCGStab, split orthonormalized *s*-step BiCGStab, and modified split orthonormalized *s*-step BiCGStab. In Section 3, we present the convergence results of the newly introduced *s*-step BiCGStab variants and compare them to that of *s*-step BiCGStab. Finally, we conclude.

##
1 From Bicgstab to *S*-Step Bicgstab

In this section we briefly introduce BiCGStab (Sect. 1.1) and *s*-step BiCGStab (Sect. 1.2). We show the relation between both methods and their convergence in reservoir simulations (Sect. 1.3).

### 1.1 BiCGStab

The Bi-Conjugate Gradient Stabilized method (BiCGStab), introduced by van der Vorst in 1992 [3], is an iterative Krylov subspace method that solves the general systems *Ax* = *b*. It is a variant of the Bi-Conjugate Gradient (BiCG) method that aims at smoothing BiCG’s erratic convergence. At each iteration *m* ≥ 0, *r*_{m+1} = *P*_{m+1}(*A*)*r*_{0} is replaced by *r*_{m+1} = *Q*_{m+1}(*A*)*P*_{m+1}(*A*)*r*_{0} where and are polynomials of degree *m* + 1. *Q*_{m+1}(*z*) is chosen to bewhere *ω*_{m} minimizes the norm of *r*_{m+1}.

BiCGStab, being a variant of BiCG, has a similar form. But the recurrence relations of *x*_{m+1}, *r*_{m+1}, *p*_{m+1}, *α*_{m} and *β*_{m} are different.(1)
(2)
(3)where *r*_{0} = *b* − *Ax*_{0}, and *p*_{0} = *r*_{0}.

The scalars *α*_{m} and *β*_{m} are defined as follows:(4)where is chosen such that as shown in Algorithm 1. In general is set equal to *r*_{0}. As for *ω*_{m}, it is defined by minimizing the norm of the residual *r*_{m+1}, *i.e*. , where(5)

In addition, we have that for *m* > 0(6)and more generally for *m* ≥ 0 and *j* > 0(7)

**Algorithm 1**: BiCGStab

**Input:***A*, *b*, *x*_{0}, *m*_{max}: the maximum allowed iterations

**Output:***x*_{m}: the *m*th approximate solution satisfying the stopping criteria

1: Let *r*_{0} = *b* − *Ax*_{0}, *p*_{0} = *r*_{0}, *ρ*_{0} = 〈 *r*_{0}, *r*_{0} 〉, and *m* = 0

2: Choose such that .

3: **While** ( and *m* < *m*_{max}) **Do**

4:

5: *s* = *r*_{m} − *α*_{m}*Ap*_{m}

6: *t* = *As*

7: *ω*_{m} = Ǡ〈 *t*, *s* 〉 / 〈 *t*, *t* 〉

8: *x*_{m+1} = *x*_{m} + *α*_{m}*p*_{m} + *ω*_{m}*s*

9: *r*_{m+1} = (*I* − *ω*_{m}*A*)*s*

10:

11: *β*_{m} = (*δ*_{m+1}/*δ*_{m})(*α*_{m}/*ω*_{m})

12: *p*_{m+1} = *r*_{m+1} + *β*_{m}(*I* − *ω*_{m}*A*)*p*_{m}

13: *ρ*_{m+1} = 〈 *r*_{m+1}, *r*_{m+1} 〉, *m* = *m* + 1

14: **end for**

At each iteration of Algorithm 1, two sparse matrix-vector multiplications, six saxpy’s, and five dot products are computed. Given that each processor has the scalars and its corresponding part of the vectors, then the saxpy’s can be parallelized without communication. However, this is not the case for the sparse matrix-vector multiplications and the dot products, which require communication to obtain the desired results. Such operations cause a severe performance degradation, especially when using modern computing resources.

###
1.2
*S*-Step BiCGStab

To reduce the communication in parallel and sequential implementations of BiCGStab, Carson *et al*. [5] introduced the *s*-step version of BiCGStab. The reformulation is based on the computation of *s* BiCGStab iterations at once, and on the fact that for *m* ≥ 0 and 1 ≤ *j* ≤ *s*
(8)since for any *z* ≠ 0.

The goal is to perform more flops per communication, by computing 2*s* matrix-vector products at the beginning of each iteration of the *s*-step BiCGStab. This would reduce the communication cost, specifically the number of messages, by *O*(*s*) times in parallel [5]. However, this is not possible using the same formulation as BiCGStab. Therefore, at the beginning of each *s*-step iteration, one computes *P*_{2s+1} and *R*_{2s}, the Krylov matrices corresponding to the and bases respectively, where *m* = 0, *s*, 2*s*, 3*s*,…. Then, by Equation (8), *p*_{m+j}, *r*_{m+j}, and *x*_{m+j} − *x*_{m} can be defined as the product of the basis vectors and a span vector, for *j* = 0, …, *s*,(9)
(10)
(11)where [*P*_{2s+1}, *R*_{2s}] is an *n* × (4*s* +1) matrix containing the basis vectors of and *K*_{2s}(*A*, *r*_{m}), and *a*_{j}, *c*_{j,} and *e*_{j} are span vectors of size 4*s* + 1. Note that *e*_{0} = 0. As for *a*_{0} and *c*_{0}, their definition depends on the type of computed basis. One can compute a basis defined by a recurrence relation with three or less terms, such as a monomial, scaled monomial, Newton, or Chebyshev basis. Then, we have that(12)
(13)
(14)where *T*_{2s+1} and *T*_{2s} are change of basis matrices of size (2*s* + 1) × (2*s*) and (2*s*) × (2*s* − 1) respectively, andis a (4*s* + 1) × (4*s* + 1) matrix.

The definition of *T*_{2s+1}, *T*_{2s} and eventually *T*'depends on the chosen type of basisFor example, in the monomial basis case, where , , , and for *i* > 0, the matrices *T*_{2s+1} and *T*_{2s} are all zeros except the lower diagonal which is ones, *i.e*. *T*_{2s+1}(*i* + 1, *i*) = 1 for *i* = 1, …, 2*s*. In the case of the scaled monomial basis, where and , the matrices are defined as for *i* = 1, …, 2*s*, and for *i* = 1, …, 2*s* − 1, and zero elsewhere.

The reformulation of BiCGStab into *s*-step BiCGStab starts by replacing the definitions (9)-(11) in Equations (1)-(3), and taking into consideration that for *j* = 0 to *s* − 1.(15)
(16)
(17)
(18)

Then we get the following,(19)
(20)
(21)with *e*_{0} = 0, *a*_{0} = [1, 0_{4s}]^{T}, and *c*_{0} = [0_{2s+1}, 1, 0_{2s−1}] in the case of monomial and Newton basis. Then, the span vectors, *a*_{j}, *c*_{j}, and *e*_{j} are updated for *j* = 1, …, *s*, rather than *p*_{m+j}, *r*_{m+j}, and *x*_{m+j} which are of size *n* ≫ 4*s* + 1.

As for *α*_{m+j}, *δ*_{m+j+1}, and *ω*_{m+j}, it is sufficient to replace *r*_{m+j} and *p*_{m+j} by definitions (10) and (9) to obtain
where *G* = [*P*_{2s+1}, *R*_{2s}]^{T}[*P*_{2s+1}, *R*_{2s}] is a Gram-like matrix, . Then, .

**Algorithm 2:***s*-step BiCGStab

**Input:***A*, *b*, *x*_{0}, *m*_{max}, *s*, *Type of Basis*

**Output:***x*_{m}: the *m*th approximate solution satisfying the stopping criteria

1: Let *r*_{0} = *b* − *Ax*_{0}, *p*_{0} = *r*_{0}, *ρ*_{0} = 〈 *r*_{0}, *r*_{0} 〉, and *k* = 0

2: Choose such that .

3: **While**
**Do**

4: Compute *P*_{2s+1} and *R*_{2s} depending on Type of Basis, and output the diagonals of *T*′

5:

6: *G* = [*P*_{2s+1}, *R*_{2s}]^{T}[*P*_{2s+1}, *R*_{2s}] and

7: Initialize *a*_{0}, *e*_{0}, *c*_{0} and set *m* = *k* * *s*

8: **for** (*j* = 0 to *s* − 1) **Do**

9: *t*_{a} = *T*′*a*_{j}

10: *α*_{m+j} = *δ*_{m+j} / 〈 *g*, *t*_{a} 〉

11: *d* = *c*_{j} − *α*_{m+j}*t*_{a}

12: *t*_{d} = *T*′*d*

13: *g*_{d} = *Gd*

14: *g*_{t} = *Gt*_{d}

15: *ω*_{m+j} = 〈 *t*_{d}, *g*_{d} 〉 / 〈 *t*_{d}, *g*_{t} 〉

16: *e*_{j+1} = *e*_{j} + *α*_{m+j}*a*_{j} + *ω*_{m+j}*d*

17: *c*_{j+1} = *c*_{j} − *ω*_{m+j}*t*_{d} − *α*_{m+j}*t*_{a}

18: *δ*_{m+j+1} = 〈 *g*, *c*_{j+1} 〉

19: *β*_{m+j} = (*δ*_{m+j+1}/*δ*_{m+j})(*α*_{m+j}/*ω*_{m+j})

20: *a*_{j+1} = *c*_{j+1} + *β*_{m+j}*a*_{j} − *β*_{m+j}*ω*_{m+j}*t*_{a}

21: **end for**

22: *p*_{m+s} = [*P*_{2s+1}, *R*_{2s}]*a*_{s}, *r*_{m+s} = [*P*_{2s+1}, *R*_{2s}]*z*_{s}

23: *x*_{m+s} = *x*_{m} + [*P*_{2s+1}, *R*_{2s}]*e*_{s}

24: *ρ*_{m+s} = 〈 *r*_{m+s}, *r*_{m+s} 〉, *k* = *k* + 1

25: **end While**

Algorithm 2 is a reformulation of BiCGStab that reduces communication where the matrix-vector multiplications are grouped at the beginning of the outer iteration, and the Gram-like matrix *G* is computed once per outer iteration. Then, in the inner iterations, the vector operations of size *n* are replaced by vector operations of size 4*s* + 1, where 4*s* + 1 ≪ *n*. However, this reformulation alone is not sufficient to reduce communication.

For example, in the sequential case the basis computation should be done using the matrix powers kernel [12], where is computed by parts that fit into cache memory, for *j* = 0, …, 2*s* − 1. This reduces communication in the memory hierarchy of the processor and increases cache hits. In the parallel case, each processor fetches, at the beginning, the needed data from neighboring processors to compute its assigned part of the 2*s* vectors without any communication, for *j* = c 0, …, 2*s* − 1. Similarly, we can compute the 2*s* − 1 vectors for *j* = 0, …, 2*s* − 2. Note that it is possible to compute the two bases simultaneously using a block version of the matrix powers kernel that computes a block of vectors without communication.

###
1.3
*S*-Step BiCGStab for Geoscience Applications

In geoscience applications, specifically in reservoir simulations, at each time step a new linear system of the form *Ax* = *b* has to be solved. The difficulty and the ill-conditioning of the systems may vary throughout the simulation. However, in most cases, an iterative method and a preconditioner are chosen at the beginning of the simulation and are used for solving all the obtained linear systems. Since the obtained systems are not symmetric, Krylov subspace methods such as BiCGStab are used. Our aim is to implement a numerically stable version of *s*-step BiCGStab that has a similar convergence rate as BiCGStab for the reservoir simulations systems. The stability of *s*-step BiCGStab is related to the chosen *s* value and to the type of the basis.

Thus, we study the convergence of the *s*-step BiCGStab (Algorithm 2) method for different *s* values, using the monomial and Newton basis [7, 14, 15] and compare it to BiCGStab’s convergence. We do not consider the scaled versions of the monomial and Newton basis, since this requires the computation of the norm of each vector at a time, which annihilates the possibility of avoiding communication in the matrix powers kernel. The test matrices, described in Section 3.1, are obtained from different reservoir simulations. A sample of the obtained results is described in Section 3.2 (Tab. 3). For the well-conditioned matrices, the *s*-step BiCGStab with the monomial basis converges in fewer *s*-step iterations as *s* increases from 2 to 6. But for the ill-conditioned matrices, the convergence of the *s*-step BiCGStab with monomial basis is chaotic with respect to *s*. Note that, we focus on the consistency of the convergence behavior as *s* increases for the following reasons. First, as *s* increases, the communication cost per *s*-step BiCGStab iteration decreases, since more flops are performed per communication. Moreover, if the convergence of *s*-step BiCGStab improves as *s* increases, then the overall communication cost is decreased which should lead to a speedup in parallel implementations. Second, although we test the convergence of *s*-step BiCGStab on matrices obtained from reservoir simulations, our goal is the speedup obtained in the full simulation. As mentioned earlier, the obtained linear systems could be well-conditioned or ill-conditioned. However, at the beginning of the simulation a single *s* value is chosen without having any information on the systems to be solved. Thus, we would like to implement an *s*-step BiCGStab version for which the number of iterations needed for convergence decreases as *s* increases to some upper limit.

An alternative to the monomial basis is the Newton basis, which is known to be more stable [7] in the case of GMRES. However, in the case of *s*-step BiCGStab, computing a Newton basis is expensive. First, the 2*s* largest eigenvalues have to be computed once per matrix using some library such as ARPACK [16]. In general, the computed eigenvalues could be almost equal, which does not improve the stability of the basis. Thus, the eigenvalues are reordered in the Leja ordering as discussed in [7]. For the well-conditioned matrices obtained in geoscience simulations, using the Newton basis did not improve the convergence of *s*-step BiCGStab. Moreover, for the ill-conditioned matrices, finding the desired number of eigenvalues is time consuming. In addition, in geoscience simulations, we seek a relatively “cheap” and stable basis, since several linear systems are solved during the simulation (at least one per time step). For all these reasons, we will use the monomial basis and improve its numerical stability, as discussed in the next section.

##
2 Orthonormalized *S*-Step Bicgstab

The *s*-step BiCGStab method with the monomial basis, has an irregular convergence with respect to the *s* values, and converges slower than BiCGStab for some of the tested systems. This irregular and slow convergence might be due to the fact that the estimated residual used for stopping criterion is not equal to the exact residual [4]. However, in our case, the slow convergence is caused by the basis vectors which become numerically linearly dependent. One way to improve the stability of the basis is by orthonormalizing it. We propose a new variant of *s*-step BiCGStab that orthonormalizes the basis vectors. This new version is referred to as orthonormalized *s*-step BiCGStab (Algorithm 3).

There are several ways of constructing the basis and orthonormalizing it. However, we derive the algorithm irrespective of the method used. We replace the 4*s* + 1 basis vectors [*P*_{2s+1}, *R*_{2s}] by an orthonormal basis *Q*_{4s+1}. Then, the vectors *p*_{m+j}, *r*_{m+j}, and *x*_{m+j} can be defined as follows,(22)
(23)
(24)

The *n* × (4*s* + 1) orthonormal matrix *Q*_{4s+1} should satisfy *AQ*_{4s+1}*v* = *Q*_{4s+1}*H*_{4s+1}*v*, where and *H*_{4s+1} is a (4*s* + 1) × (4*s* + 1) upper Hessenberg matrix. Then, for *j* = 0 to *s* − 1 we get(25)
(26)
(27)

By replacing the definitions (22)-(27) in Equations (1)-(3), we get thatwith *e*_{0} = 0. As for *a*_{0} and *c*_{0}, their definitions depend on the orthonormalization technique used. We will discuss this in Section 2.1.

As for *α*_{m+j} and *δ*_{m+j+1}, it is sufficient to replace *r*_{m+j} and *p*_{m+j} by definitions (22) and (23) to obtainwhere . Similarly for *ω*_{m+j}, we get(28)since . Finally, .

Algorithm 3 describes the orthonormalized *s*-step BiCGStab method except for the orthonormal basis construction phase, which we discuss in Section 2.1.

**Algorithm 3:** Orthonormalized *s*-step BiCGStab

**Input:***A*, *b*, *x*_{0}, *m*_{max}, *s*, Type of Basis

**Output:***x*_{m}: the *m*th approximate solution satisfying the stopping criteria

1: Let *r*_{0} = *b* − *Ax*_{0}, *p*_{0} = *r*_{0}, *ρ*_{0} = 〈 *r*_{0}, *r*_{0} 〉, and *k* = 0

2: Choose such that .

3: **While**
**Do**

4: Compute the orthonormal basis *Q*_{4s+1} and

5: the upper Hessenberg matrix *H*_{4s+1}

6: Compute

7: Initialize *a*_{0}, *e*_{0}, *c*_{0} and set *m* = *k* * *s*

8: **for** (*j* = 0 to *s* − 1) **Do**

9: *h*_{a} = *H*_{4s+1}*a*_{j}

10: *α*_{m+j} = *δ*_{m+j} / 〈 *g*, *h*_{a} 〉

11: *d* = *c*_{j} − *α*_{m+j}*h*_{a}

12: *h*_{d} = *H*_{4s+1}*d*

13: *ω*_{m+j} = 〈 *d*, *h*_{d} 〉 / 〈 *h*_{d}, *h*_{d} 〉

14: *e*_{j+1} = *e*_{j} + *α*_{m+j}*a*_{j} + *ω*_{m+j}*d*

15: *c*_{j+1} = *c*_{j} − *ω*_{m+j}*h*_{d} − *α*_{m+j}*h*_{a}

16: *δ*_{m+j+1} = 〈 *g*, *c*_{j+1} 〉

17: *β*_{m+j} = (*δ*_{m+j+1}/*δ*_{m+j})(*α*_{m+j}/*ω*_{m+j})

18: *a*_{j+1} = *c*_{j+1} + *β*_{m+j}*a*_{j} − *β*_{m+j}*ω*_{m+j}*h*_{a}

19: **end for**

20: *p*_{m+s} = *Q*_{4s+1}*a*_{s}, *r*_{m+s} = *Q*_{4s+1}*z*_{s}, *x*_{m+s} = *x*_{m} + *Q*_{4s+1}*e*_{s}

*x*_{m+s} = *x*_{m} + *Q*_{4s+1}*e*_{s}

21: *ρ*_{m+s} = 〈 *r*_{m+s}, *r*_{m+s} 〉, *k* = *k* + 1

22: **end While**

### 2.1 Construction of the Orthonormal Basis

The simplest parallelizable way to compute the orthonormal basis *Q*_{4s+1} is to compute first [*P*_{2s+1}, *R*_{2s}] using the matrix powers kernel. Then, orthonormalize it using a QR algorithm, such as the Tall and Skinny QR (TSQR) algorithm [13] that requires log (*p*) messages in parallel, where *p* is the number of processors. In this case, and where *U*_{4s+1} is the (4*s* + 1) × (4*s* + 1) upper triangular matrix obtained from the QR factorization of [*P*_{2s+1}, *R*_{2s}]. In addition, where *T*′ is the change of basis matrix. By replacing [*P*_{2s+1}, *R*_{2s}] by *Q*_{4s+1}*U*_{4s+1}, obtained from the QR factorization, and by assuming that *U*_{4s+1} is invertible, we get:
*Q*_{4s+1} is an *n* × (4*s* + 1) orthonormal matrix, is a (4*s* + 1) × (4*s* + 1) upper Hessenberg matrix, andNote that the matrix *H*_{4s+1} is never constructed and multiplying *H*_{4s+1} by a vector is equivalent to solving an upper triangular system and multiplying *T*′ and *U*_{4s+1} by a vector.

However, there are two issues to take into consideration in the construction of the orthonormal basis. First, at iteration *k* ≥ 0, we compute two bases *P*_{2s+1} and *R*_{2s}, for two different subspaces and . There is no guarantee that the two bases are linearly independent with respect to each others. In other words, the 4*s* + 1 vectors obtained are not necessarily linearly independent. Moreover, the upper triangular matrix obtained from the orthonormalization of a linearly dependent set of vectors, is not invertible. Second, at iteration *k* = 0, we compute two bases of and subspaces, since *p*_{0} is initialized to *r*_{0}, as in the BiCGStab and *s*-step BiCGStab algorithms.

A solution to the first problem, is to perform a split orthonormalization, where *P*_{2s+1} = *Q*_{2s+1}*U*_{2s+1} and *R*_{2s} = *Q*_{2s}*U*_{2s} are orthonormalized separately. Then, and still satisfy the relationwhere . But *Q*_{4s+1} is not orthonormal, only *Q*_{2s+1} and *Q*_{2s} are orthonormal. Note that in the derivation of the orthonormalized *s*-step BiCGStab in Section 2, we only need that *Q*_{4s+1} is orthonormal for the definition of *ω*_{m+j}. Moreover, ω_{m+j} is obtained by minimizing the *L*2 norm of *r*_{m+j+1}. If instead we minimize the *B* norm of *r*_{m+j+1}, where *B* is an *n* × *n* matrix that satisfies , then *ω*_{m+j} would be defined as in Equation (28). We call this version split orthonormalized *s*-step BiCGStab (Algorithm 3).

**Algorithm 4:** Split Orthonormalized *s*-step BiCGStab

**Input:***A*, *b*, *x*_{0}, *m*_{max}, *s*, Type of Basis

**Output:***x*_{m}: the *m*th approximate solution satisfying the stopping criteria

1: Let *r*_{0} = *b* − *Ax*_{0}, *p*_{0} = *r*_{0}, *ρ*_{0} = 〈*r*_{0}, *r*_{0}〉, and *k* = 0

2: Choose such that .

3: **While**
**Do**

4: Compute *P*_{2s+1} and *R*_{2s} depending on Type of Basis, and output the diagonals of *T′*

5: Perform the QR factorization of *P*_{2s+1} = Q_{2s+1}*U*_{2s+1}

6: and *R*_{2s} = *Q*_{2s}*U*_{2s}.

7: Let *Q*_{4s+1} = *Q*_{2s+1} , *Q*_{2s}, ,

8: and

9: Compute
10: Initialize *a*_{0}, *e*_{0}, *c*_{0} and set *m* = *k* * *s*

11: **for** (*j* = 0 to *s* − 1) **Do**

12: *h*_{a} = *H*_{4s+1}*a*_{j}

13: *α*_{m+j} = *δ*_{m+j} / 〈 *g*, *h*_{a} 〉

14: *d* = *c*_{j} *− α*_{m+j}*h*_{a}

15: *h*_{d} = *H*_{4s+1}*d*

16:

17: *e*_{j+1} = *e*_{j} + *α*_{m+j}*a*_{j} + *ω*_{m+j}*d*

18: *c*_{j+1} = *c*_{j} − *ω*_{m+j}*h*_{d} − *α*_{m+j}*h*_{a}

19: δ_{m+j+1} = 〈 *g*, c_{j+1} 〉

20:* β*_{m+j} = (*δ*_{m+j+1}/*δ*_{m+j})(*α*_{m+j}/*ω*_{m+j})

21:* a*_{j+1} = *c*_{j+1} + *β*_{m+j}*a*_{j} *− β*_{m+j}*ω*_{m+j}*h*_{a}

22:* ***end for**

23: *p*_{m+s} = *Q*_{4s+1}*a*_{s}, *r*_{m+s} = *Q*_{4s+1}*z*_{s},

*x*_{m+s} = *x*_{m} + *Q*_{4s+1}*e*_{s}

24: *ρ*_{m+s} = 〈 *r*_{m+s}, *r*_{m+s} 〉, *k* = *k* + 1

25: **end While**

For the second problem, starting with a *p*_{0} ≠ *r*_{0}, might improve the convergence of all the previously discussed *s*-step BiCGStab versions. One might pick a random *p*_{0}. However, its effect on the convergence of the method is unknown. Thus, to be consistent with the previously introduced BiCGStab versions, we choose to perform one iteration of BiCGStab before constructing the first 4*s* + 1 basis vectors. The advantage is that all the information obtained in the BiCGStab iteration are used afterwards. Algorithm 5 could be considered as a “preprocessing” step, for all the *s*-step algorithms, where it is performed before the while loop and replacing the first two lines in Algorithm 2, Algorithm 3, or Algorithm 4. We refer to these versions as modified *s*-step BiCGStab, modified orthonormalized *s*-step BiCGStab, and modified split orthonormalized *s*-step BiCGStab.

**Algorithm 5:** Choosing *p*_{0} not equal to *r*_{0}

1: Let *r*_{0} = *b* − *Ax*_{0}, *p*_{0} = *r*_{0}, *ρ*_{0} = 〈 *r*_{0}, *r*_{0} 〉, and *k* = 0

2: Choose such that .

3:

4: *s* = *r*_{0} − *α*_{0}*Ap*_{0}, *t* = *As*

5: *ω*_{0} = 〈 *t*, *s* 〉 / 〈 *t*, *t* 〉

*6:* *x*_{0} = *x*_{0} + *α*_{0}*p*_{0} + *ω*_{0}*s*

7: *r*_{0} = (*I* − *ω*_{0}*A*)*s*

8: *β*_{0} = *α*_{0}/(*ω*_{0}*δ*_{0})

9: , *β*_{0} = *β*_{0} * *δ*_{0}

10: *p*_{0} = *r*_{0} + *β*_{0}(*I* − *ω*_{0}*A*)*p*_{0}

11: *ρ*_{0} = 〈 *r*_{0}, *r*_{0} 〉

## 3 Results and Expected Performance

In this Section, we show the convergence behavior of the newly introduced split orthonormalized *s*-step BiCGStab and modified split orthonormalized *s*-step BiCGStab, on the matrices defined in Section 3.1, and compare them to that of BiCGStab, *s*-step BiCGStab, and modified *s*-step BiCGStab in Section 3.2. Then we discuss the split orthonormalized *s*-step BiCGStab’s computation and communication cost in Section 3.3 and compare it to that of *s*-step BiCGStab and BiCGStab.

### 3.1 Test Matrices

The study cases presented in this paper are obtained from different representative models for reservoir simulations at different time steps. Table 1 illustrates the characteristics of the models from which the square test matrices, GCS2K, CantaF3, SPE10 [17], and HIS, are generated.

The test matrices.

Consequently, the obtained matrices have different profiles and varying degrees of difficulty. Table 2 describes the test matrices.

The reservoir models.

### 3.2 Convergence Results

We have implemented the *s*-step BiCGStab, and the split orthonormalized *s*-step BiCGStab (Algorithm 4) whereby the monomial bases *P*_{2s+1} and *R*_{2s} are first built and then orthonormalized separately using MKL’s (Math Kernel Library) QR factorization. We have also implemented the corresponding modified versions, whereby one iteration of BiCGStab is performed before building the first 4*s*+1 bases vectors. These algorithms are developed in MCG Solver [18, 19], a C++ based software package developed by *IFPEN* to provide linear solver algorithms for its industrial simulators in reservoir simulation, basin simulation or in engine combustion simulation.

Table 3 shows the convergence results for the *s*-step BiCGStab versions on the matrices introduced previously with tolerance *tol* = 10^{−8} except for the SPE10 matrix (*tol* = 10^{−4}). The number of iterations needed for the convergence of *s*-step BiCGStab versions, referred to as *s*-Step Iterations (SI), is shown. For comparison reasons, we also show the Total number of Iterations (TI) of the *s*-step BiCGStab versions, which is equal to the *s*-step iterations times *s*. Note that in exact arithmetics, each *s*-step BiCGStab iteration is equivalent to *s* iterations of the classical BiCGStab method. However, in terms of computation and communication cost, they are not equivalent, as discussed in the next section.

The convergence of the *s*-step BiCGStab, split orthonormalized *s*-step BiCGStab and their modified versions with monomial basis, for different *s* values, compared to BiCGStab. The *s*-Step Iterations (SI) and the Total Iterations (TI) are shown. The number of total iterations is equal to the number of *s*-step iterations times *s.*

For the well-conditioned matrices such as GCS2K, *s*-step BiCGStab with monomial basis converges faster as *s* increases from 2 to 6, and the corresponding total iterations are in the same range as that of BiCGStab. However, this is not the case for the other ill-conditioned matrices. The *s*-step BiCGStab’s convergence is chaotic with respect to *s* and for some *s* values it requires more total iterations to converge than BiCGStab. As mentioned earlier, using a more stable basis such as the Newton basis could improve the convergence of the *s*-step BiCGStab for the ill-conditioned matrices. However, for the ill-conditioned matrices ARPACK was not able to find the requested number of eigenvalues in 3000 iterations, which is time consuming. Thus, we orthonormalize the bases for better numerical stability.

In the case of well-conditioned matrices such GCS2K, the computed monomial basis is already numerically stable. Thus, it is expected that the convergence will not improve much by orthonormalizing the basis or by starting with a *p*_{0} ≠ *r*_{0}. This is clear in Table 3, where the convergence of split orthonormalized *s*-step BiCGStab, modified *s*-step BiCGStab, and modified split orthonormalized *s*-step BiCGStab is in the same range as that of *s*-step BiCGStab.

On the other hand, the convergence behavior of the *s*-step BiCGStab methods for the ill-conditioned matrices varies. For SPE10, *s*-step BiCGStab converges in fewer *s*-step iterations, as *s* increases from 2 to 6. Then, orthonormalizing the basis and/or starting with *p*_{0} ≠ *r*_{0} improves the numerical stability of the basis, leading to a better convergence. Note that orthonormalizing the basis (split ortho *s*-step BiCGStab) has a larger effect on convergence than starting with a *p*_{0} ≠ *r*_{0} (modified *s*-step BiCGStab).

For the HIS matrix, *s*-step BiCGStab’s convergence fluctuates as *s* increases. Whereas, the other *s*-step BiCGStab methods have a strictly decreasing convergence with respect to *s*. Starting with *p*_{0} ≠ *r*_{0} (modified *s*-step BiCGStab) improved the convergence of *s*-step BiCGStab (except for *s* = 3). Moreover, the convergence results of modified *s*-step BiCGStab and modified split orthonormalized *s*-step BiCGStab are very similar for *s* ≥ 4.

CantaF3 is a special case where the modified split orthonormalized *s*-step BiCGStab method was the only method that had a stable convergence as *s* increased from 2 to 5. All the other *s*-step BiCGStab methods had a chaotic convergence with respect to *s*.

Note that in some cases the total iterations of the *s*-step BiCGStab variants is more than that of BiCGStab. However, the communication cost of each *s*-step iteration is much less than that of *s* iterations of BiCGStab. Thus the *s*-step variants should still converge faster, in terms of runtime, in parallel implementations.

In general, we can say that for well-conditioned matrices, the *s*-step BiCGStab methods have a similar rate of convergence. However, for the ill-conditioned matrices, orthonormalizing the bases separately and starting with a *p*_{0} ≠ *r*_{0} have positive effects on the stability of the basis which speeds up and stabilizes the convergence with respect to *s*.

As mentioned earlier, several linear system with different degrees of difficulty are solved throughout a basin or reservoir simulation using the same method (BiCGStab). Our goal is to speedup the convergence of BiCGStab by replacing it with an *s*-step version that reduces communication. Based on the presented convergence results, the modified split orthonormalized *s*-step BiCGStab method seems to be the most stable version with respect to *s* for both, well-conditioned and ill-conditioned matrices. The reason we focus on the convergence behavior as *s* increases, rather than the convergence behavior for a given *s* value, is that the *s* value is fixed throughout the simulation. And the convergence effect of the chosen *s* value on the different linear systems is not known beforehand. Thus, we seek a robust method that will converge faster as *s* increases to some upper limit (5 or 6).

### 3.3 Computation and Communication Cost

In Table 4, the number of flops performed in one *s*-step iteration of orthonormalized *s*-step BiCGStab and *s*-step BiCGStab is presented, along with the number of flops performed in *s* iterations of BiCGStab.

The number of flops performed in one iteration of the sequential (modified) split orthonormalized *s*-step BiCGStab and *s*-step BiCGStab with monomial basis, and the corresponding flops for computing s iterations of BiCGStab.

In the (modified) split orthonormalized *s*-step BiCGStab there is a need for performing two QR factorizations of the matrices *P*_{2s+1} and *R*_{2s}, however Gram-like matrix *G* is not computed. Thus, the computed flops in the (modified) split orthonormalized *s*-step BiCGStab is slightly less than the computed flops in the *s*-step BiCGStab as shown in Table 4. As discussed in [5], the only communication that occurs in the parallel implementation of *s*-step BiCGStab is in the construction of the basis and the matrix *G*. Similarly for the parallel implementation of the (modified) split orthonormalized *s*-step BiCGStab, only the construction of the basis and its orthonormalization using TSQR require communication. Note that both TSQR and the computation of *G* require log (*p*) messages and sending words where *p* is the number of processors. Thus in terms of computed flops, sent messages and sent words, the (modified) split orthonormalized *s*-step BiCGStab and the *s*-step BiCGStab are equivalent. Note that the only difference between the modified split orthonormalized *s*-step BiCGStab and the split orthonormalized *s*-step BiCGStab is that Algorithm 5 is called once before the while loop. Hence, we may assume that the communication and computation cost of both methods is bounded by the same value.

On the other hand, the performed flops in one iteration of *s*-step BiCGStab and orthonormalized *s*-step BiCGStab is at least twice the flops performed in *s* iterations of BiCGStab. However, the number of sent messages in the *s*-step versions is reduced by a factor of *O*(*s*), at the expense of increasing the number of sent words. Therefore, it is expected to obtain speedup in the parallel implementations of the introduced *s*-step BiCGStab variants.

## Conclusion

In this paper, we have introduced the split orthonormalized *s*-step BiCGStab and the modified split orthonormalized *s*-step BiCGStab, variants of *s*-step BiCGStab where the basis vectors are orthonormalized. In addition, in the modified split orthonormalized *s*-step BiCGStab, we perform one iteration of BiCGStab to define a *p*_{0} not equal to *r*_{0}.

We have studied the convergence behavior of the introduced methods with monomial basis, and compared it to that of the *s*-step BiCGStab for the matrices obtained from reservoir simulations. For almost all the tested matrices, the modified split orthonormalized *s*-step BiCGStab with monomial basis, converged faster than the *s*-step BiCGStab for *s* = 2, …, 6. Moreover, for ill-conditioned matrices, the modified split orthonormalized *s*-step BiCGStab has a similar convergence behavior as the BiCGStab method for *s* = 2, …, 6, unlike the *s*-step BiCGStab.

All the *s*-step BiCGStab versions send *O*(*s*) times less messages than *s* iterations of BiCGStab. Moreover, the computation cost of the introduced variants is slightly less than that of the *s*-step BiCGStab. Hence, it is expected that the introduced *s*-step BiCGStab methods, specifically modified split orthonormalized *s*-step BiCGStab, will perform well in parallel on multi-core architectures.

As a future work, we would like to implement the split orthonormalized *s*-step BiCGStab and the modified split orthonormalized *s*-step BiCGStab, in parallel and compare its runtime to that of the parallel BiCGStab and *s*-step BiCGStab.

## References

- Saad Y., Schultz M.H. (1986) Gmres: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput. 7, 3, 856–869. [Google Scholar]
- Hestenes M.R., Stiefel E. (1952) Methods of conjugate gradients for solving linear systems, J. Res. Natl. Bur. Stand. 49, 409–436. [CrossRef] [MathSciNet] [Google Scholar]
- van der Vorst H.A. (1992) Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Statist. Comput. 13, 2, 631–644. [Google Scholar]
- Carson E., Knight N., Demmel J. (2011) Avoiding communication in two-sided Krylov subspace methods. Technical Report UCB/EECS-2011-93, EECS Department, University of California, Berkeley, August. [Google Scholar]
- Carson E., Knight N., Demmel J. (2013) Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods, SIAM J. Sci. Comput. 35, 5, S42–S61. [CrossRef] [Google Scholar]
- Grigori L, Moufawad S. (2013) Communication avoiding ILU0 preconditioner, Technical Report, ALPINES - INRIA Paris-Rocquencourt, March. [Google Scholar]
- Hoemmen M. (2010) Communication-avoiding Krylov subspace methods, PhD Thesis, EECS Department, University of California, Berkeley. [Google Scholar]
- Mohiyuddin M., Hoemmen M., Demmel J., Yelick K. (2009) Minimizing communication in sparse matrix solvers, In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC’09, New York, NY, USA, ACM, pp.1–12. [Google Scholar]
- Chronopoulos A.T., Gear W. (1989) s-Step iterative methods for symmetric linear systems, J. Comput. Appl. Math. 25, 2, 153–168. [CrossRef] [Google Scholar]
- Erhel J. (1995) A parallel GMRES version for general sparse matrices, Electron. Trans. Numer. Anal. 3, 160–176. [MathSciNet] [Google Scholar]
- Walker H.F. (1988) Implementation of the GMRES method using householder transformations, SIAM J. Sci. Statist. Comput. 9, 1, 152–163. [CrossRef] [MathSciNet] [Google Scholar]
- Demmel J., Hoemmen M., Mohiyuddin M., Yelick K. (2008) Avoiding communication in sparse matrix computations, In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium, 14–18 April 2008, Held in Hyatt Regency Hotel in Miami, Florida, USA, pp. 1–12. [CrossRef] [Google Scholar]
- Demmel J., Grigori L., Hoemmen M., Langou J. (2012) Communication-avoiding parallel and sequential QR factorizations, SIAM J. Sci. Comput. 34, 206–239. [CrossRef] [Google Scholar]
- Bai Z., Hu D., Reichel L. (1994) A Newton basis GMRES implementation, IMA J. Numer. Anal. 14, 563–581. [CrossRef] [MathSciNet] [Google Scholar]
- Reichel L. (1990) Newton interpolation at Leja points, BIT Numerical Mathematics 30, 332–346. [CrossRef] [MathSciNet] [Google Scholar]
- Lehoucq R., Sorensen D., Yang C. (1998) ARPACK Users’ Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA. [Google Scholar]
- The 10th SPE Comparative Solution Project (2000) Retrieved from http://www.spe.org/web/csp/datasets/set02.htm. [Google Scholar]
- Anciaux-Sedrakian A., Eaton J., Gratien J., Guignon T., Havé P., Preux C., Ricois O. (2015) Will GPGPUs be finally a credible solution for industrial reservoir simulators, SPE Reservoir Simulation Symposium, 23-25 February, Houston, Texas, USA, SPE-173223-MS. DOI: 10.2118/173223-MS. [Google Scholar]
- Anciaux-Sedrakian A., Gottschling P., Gratien J., Guignon T. (2014) Survey on efficient linear solvers for porous media flow models on recent hardware architectures, Oil Gas Sci. Technol. - Rev. IFP 69, 4, 753–766. [CrossRef] [EDP Sciences] [Google Scholar]

**Cite this article as**: A. Anciaux-Sedrakian, L. Grigori, S. Moufawad and S. Yousef (2016). *S*-Step BiCGStab Algorithms for Geoscience Dynamic Simulations, *Oil Gas Sci. Technol* **71**, 66.

## All Tables

The convergence of the *s*-step BiCGStab, split orthonormalized *s*-step BiCGStab and their modified versions with monomial basis, for different *s* values, compared to BiCGStab. The *s*-Step Iterations (SI) and the Total Iterations (TI) are shown. The number of total iterations is equal to the number of *s*-step iterations times *s.*

The number of flops performed in one iteration of the sequential (modified) split orthonormalized *s*-step BiCGStab and *s*-step BiCGStab with monomial basis, and the corresponding flops for computing s iterations of BiCGStab.