S-Step BiCGStab Algorithms for Geoscience Dynamic Simulations

Ani Anciaux-Sedrakian; Laura Grigori; Sophie Moufawad; Soleiman Yousef

doi:10.2516/ogst/2016021

Dossier: SimRace 2015: Numerical Methods and High Performance Computing for Industrial Fluid Flows

Open Access

Issue: Oil Gas Sci. Technol. – Rev. IFP Energies nouvelles, Volume 71, Number 6, November–December 2016

Article Number: 66
Number of page(s): 11
DOI: https://doi.org/10.2516/ogst/2016021
Published online: 20 December 2016

Oil & Gas Science and Technology – Rev. IFP Energies nouvelles (2016) 71, 66

Méthodes s‐step BiCGStab appliquées en Géosciences

Ani Anciaux-Sedrakian¹, Laura Grigori²,³, Sophie Moufawad¹,* and Soleiman Yousef¹

¹ IFP Energies nouvelles, 1-4 avenue de Bois-Préau, 92852 Rueil-Malmaison Cedex – France
² INRIA Paris, Alpines, 2 Rue Simone IFF, 75012 Paris – France
³ UPMC - Univ Paris 6, CNRS UMR 7598, Laboratoire Jacques-Louis Lions, 4 Place Jussieu, 75005 Paris – France
e-mail: ani.anciaux-sedrakian@ifpen.fr – laura.grigori@inria.fr – sm101@aub.edu.lb – soleiman.yousef@ifpen.fr

* Corresponding author

Received: 15 December 2015
Accepted: 11 October 2016

Abstract

In basin and reservoir simulations, the most expensive and time-consuming phase is solving systems of linear equations using Krylov subspace methods such as BiCGStab. For this reason, we explore the possibility of using communication avoiding Krylov subspace methods (s-step BiCGStab), which speed up the solution time on modern architectures by restructuring the algorithms to reduce communication. We introduce some variants of s-step BiCGStab with better numerical stability for the targeted systems.

Résumé

In porous-media flow simulators, such as reservoir and basin simulators, solving linear systems is the most time-consuming step and can account for up to 80% of the simulation time. The performance of these simulators therefore depends strongly on the efficiency of the linear solvers. At the same time, modern parallel machines provide a large number of processors and massively parallel computing units. In this article, we propose new BiCGStab algorithms, based on the communication-reducing algorithm known as s-step, that avoid a certain amount of communication in order to fully exploit highly parallel architectures.

© A. Anciaux-Sedrakian et al., published by IFP Energies nouvelles, 2016

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Many scientific problems require the solution of systems of linear equations of the form Ax = b, where the input matrix A is very large and sparse. These systems arise mainly from the discretization of Partial Differential Equations (PDE), and are usually solved using Krylov subspace methods, such as Generalized Minimal RESidual (GMRES) [1], Conjugate Gradient (CG) [2] and Bi-Conjugate Gradient Stabilized (BiCGStab) [3].

In the case of basin modeling or reservoir simulations with highly heterogeneous data and complex geometries, complex non-linear systems of PDE are solved. These PDE are discretized with a cell-centered finite volume scheme in space, leading to a non-linear system which is solved with an iterative Newton solver. At each Newton step, the system is linearized. Then, the generated large, sparse and unstructured linear system is solved using preconditioned GMRES, BiCGStab, CG, Orthomin or other preconditioned iterative methods. Some of the most commonly used preconditioners are ILU(k), ILUT, AMG and CPR-AMG. This resolution phase constitutes the most expensive part of the simulation. Thus we focus on linear solvers, since their efficiency is a key point for the simulator’s performance.

Furthermore, modern parallel computing resources are based on complex hardware architecture. They are composed of several multi-core processors and massively parallel processing units such as many-cores or General-Purpose GPU (GPGPU) cards. Most of the current algorithms are not able to fully exploit the highly parallel architectures. In fact, a severe degradation of performance is detected when the number of processing units is increased. This is due to the difference between the required time to perform floating point operations (flops) by processing units and the time to communicate the obtained results, where flops have become much cheaper than data communication.

Thus, recent research has focused on reformulating dense and sparse linear algebra algorithms with the aim of reducing and avoiding communication. These methods are referred to as communication avoiding methods, whereby communication refers to data movement between different processing units in parallel, and different levels of memory hierarchy. In the case of Krylov subspace methods, the introduced communication avoiding Krylov subspace methods [4-8] are based on s-step methods [9-11]. The goal is to restructure the algorithms to perform s iterations at a time by using kernels that avoid or reduce communication, such as the matrix powers kernel [12], Tall and Skinny QR (TSQR) [13], and Block Gram Schmidt (BGS) methods.

Our aim is to reduce the overall cost of the linear solver resolution phase in geoscience simulations, specifically basin and reservoir simulations, using a parallel implementation of BiCGStab that avoids communication on multi-core hardware (CA-BiCGStab) [5] and has a similar convergence behavior as the classical BiCGStab method. Communication Avoiding BiCGStab (CA-BiCGStab), which was introduced in [5], is a reformulation of BiCGStab into s-step BiCGStab that avoids communication. Thus, in this paper we study the convergence behavior of a sequential version of the unpreconditioned s-step BiCGStab [5], on matrices obtained from reservoir simulations, with different s values. The obtained results show that, for most of the tested matrices, s-step BiCGStab requires more iterations to converge than BiCGStab. Thus, we design new variants of s-step BiCGStab, that have the same convergence rate as BiCGStab for s values between 2 and 6, and reduce communication similarly to s-step BiCGStab.

In Section 1, we introduce BiCGStab, its reformulation to s-step BiCGStab [5], and we discuss the performance of s-step BiCGStab in geoscience applications, specifically reservoir simulations. Then, in Section 2, we introduce the new s-step BiCGStab variants that we call orthonormalized s-step BiCGStab, split orthonormalized s-step BiCGStab, and modified split orthonormalized s-step BiCGStab. In Section 3, we present the convergence results of the newly introduced s-step BiCGStab variants and compare them to that of s-step BiCGStab. Finally, we conclude.

1 From BiCGStab to S-Step BiCGStab

In this section we briefly introduce BiCGStab (Sect. 1.1) and s-step BiCGStab (Sect. 1.2). We show the relation between both methods and their convergence in reservoir simulations (Sect. 1.3).

1.1 BiCGStab

The Bi-Conjugate Gradient Stabilized method (BiCGStab), introduced by van der Vorst in 1992 [3], is an iterative Krylov subspace method that solves general systems $Ax = b$. It is a variant of the Bi-Conjugate Gradient (BiCG) method that aims at smoothing BiCG’s erratic convergence. At each iteration $m \geq 0$, $r_{m+1} = P_{m+1}(A) r_0$ is replaced by $r_{m+1} = Q_{m+1}(A) P_{m+1}(A) r_0$, where $Q_{m+1}(z) \in \mathcal{P}_{m+1}$ and $P_{m+1}(z) \in \mathcal{P}_{m+1}$ are polynomials of degree $m + 1$. $Q_{m+1}(z)$ is chosen to be

$Q_{m+1}(z) = \prod_{j=1}^{m+1} (1 - \omega_{j-1} z) = (1 - \omega_m z) Q_m(z)$

where $\omega_m$ minimizes the norm of $r_{m+1}$.

BiCGStab, being a variant of BiCG, has a similar form, but the recurrence relations of $x_{m+1}$, $r_{m+1}$, $p_{m+1}$, $\alpha_m$ and $\beta_m$ are different:

$x_{m+1} = x_m + \alpha_m p_m + \omega_m [r_m - \alpha_m A p_m]$ (1)

$r_{m+1} = (I - \omega_m A) [r_m - \alpha_m A p_m]$ (2)

$p_{m+1} = r_{m+1} + \beta_m (I - \omega_m A) p_m$ (3)

where $r_0 = b - A x_0$ and $p_0 = r_0$.

The scalars $\alpha_m$ and $\beta_m$ are defined as follows:

$\alpha_m = \frac{\langle \tilde{r}_0, r_m \rangle}{\langle \tilde{r}_0, A p_m \rangle}, \quad \beta_m = \frac{\alpha_m}{\omega_m} \frac{\langle \tilde{r}_0, r_{m+1} \rangle}{\langle \tilde{r}_0, r_m \rangle}$ (4)

where $\tilde{r}_0$ is chosen such that $\langle \tilde{r}_0, r_0 \rangle \neq 0$, as shown in Algorithm 1. In general $\tilde{r}_0$ is set equal to $r_0$. As for $\omega_m$, it is defined by minimizing the norm of the residual $r_{m+1}$, i.e. $\|(I - \omega_m A)(r_m - \alpha_m A p_m)\| = \min_{\omega \in \mathbb{R}} \|(I - \omega A)(r_m - \alpha_m A p_m)\|$, which gives

$\omega_m = \frac{\langle A r_m - \alpha_m A^2 p_m, r_m - \alpha_m A p_m \rangle}{\langle A r_m - \alpha_m A^2 p_m, A r_m - \alpha_m A^2 p_m \rangle}$ (5)
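The minimization behind Equation (5) is a one-dimensional least-squares problem, which can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `omega_min` and `residual_norm`, the test matrix, and the vector `s` (standing for $r_m - \alpha_m A p_m$) are illustrative assumptions.

```python
import numpy as np

def omega_min(A, s):
    """Return the omega minimizing ||(I - omega*A) s||_2, i.e. <t, s>/<t, t> with t = A s."""
    t = A @ s
    return (t @ s) / (t @ t)

def residual_norm(A, s, omega):
    """Norm of the stabilized residual (I - omega*A) s."""
    return np.linalg.norm(s - omega * (A @ s))

# Hypothetical small test problem (not from the paper).
rng = np.random.default_rng(0)
A = 4.0 * np.eye(5) + rng.standard_normal((5, 5))
s = rng.standard_normal(5)           # plays the role of r_m - alpha_m * A p_m
omega = omega_min(A, s)
```

At the minimizer, the new residual is orthogonal to $t = As$, which is the normal equation of the least-squares problem.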

In addition, we have that for $m > 0$

$\begin{cases} p_m, r_m \in \mathcal{K}_{2m+1}(A, p_0) + \mathcal{K}_{2m}(A, r_0) \\ x_m - x_0 \in \mathcal{K}_{2m}(A, p_0) + \mathcal{K}_{2m-1}(A, r_0) \end{cases}$ (6)

and more generally for $m \geq 0$ and $j > 0$

$\begin{cases} p_{m+j}, r_{m+j} \in \mathcal{K}_{2j+1}(A, p_m) + \mathcal{K}_{2j}(A, r_m) \\ x_{m+j} - x_m \in \mathcal{K}_{2j}(A, p_m) + \mathcal{K}_{2j-1}(A, r_m) \end{cases}$ (7)

Algorithm 1: BiCGStab

Input: $A$, $b$, $x_0$, $m_{max}$: the maximum allowed iterations

Output: $x_m$: the $m$th approximate solution satisfying the stopping criteria

1: Let $r_0 = b - A x_0$, $p_0 = r_0$, $\rho_0 = \langle r_0, r_0 \rangle$, and $m = 0$

2: Choose $\tilde{r}_0$ such that $\delta_0 = \langle \tilde{r}_0, r_0 \rangle \neq 0$

3: While ($\sqrt{\rho_m} > \epsilon \|b\|_2$ and $m < m_{max}$) Do

4: $\alpha_m = \delta_m / \langle \tilde{r}_0, A p_m \rangle$

5: $s = r_m - \alpha_m A p_m$

6: $t = A s$

7: $\omega_m = \langle t, s \rangle / \langle t, t \rangle$

8: $x_{m+1} = x_m + \alpha_m p_m + \omega_m s$

9: $r_{m+1} = (I - \omega_m A) s$

10: $\delta_{m+1} = \langle \tilde{r}_0, r_{m+1} \rangle$

11: $\beta_m = (\delta_{m+1} / \delta_m)(\alpha_m / \omega_m)$

12: $p_{m+1} = r_{m+1} + \beta_m (I - \omega_m A) p_m$

13: $\rho_{m+1} = \langle r_{m+1}, r_{m+1} \rangle$, $m = m + 1$

14: end While
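The steps of Algorithm 1 translate almost line by line into NumPy. The sketch below is illustrative only: it assumes a dense matrix, takes $\tilde{r}_0 = r_0$, has no breakdown handling, and the function name `bicgstab`, the default tolerance, and the small tridiagonal test matrix are assumptions rather than anything taken from the paper.

```python
import numpy as np

def bicgstab(A, b, x0, m_max=200, eps=1e-10):
    """Unpreconditioned BiCGStab following Algorithm 1 (no breakdown checks)."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    r_tilde = r.copy()                 # common choice: r~_0 = r_0
    delta = r_tilde @ r                # delta_0 = <r~_0, r_0>
    rho = r @ r
    b_norm = np.linalg.norm(b)
    m = 0
    while np.sqrt(rho) > eps * b_norm and m < m_max:
        Ap = A @ p                     # first matrix-vector product
        alpha = delta / (r_tilde @ Ap)
        s = r - alpha * Ap
        t = A @ s                      # second matrix-vector product
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t              # (I - omega*A) s
        delta_new = r_tilde @ r
        beta = (delta_new / delta) * (alpha / omega)
        p = r + beta * (p - omega * Ap)
        delta = delta_new
        rho = r @ r
        m += 1
    return x, m

# Hypothetical nonsymmetric, well-conditioned test system.
n = 30
A = np.eye(n) - 0.2 * np.diag(np.ones(n - 1), 1) - 0.3 * np.diag(np.ones(n - 1), -1)
b = np.random.default_rng(0).standard_normal(n)
x, iters = bicgstab(A, b, np.zeros(n))
```

Each pass through the loop performs the two matrix-vector products and the dot products that the s-step reformulation later groups together.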

At each iteration of Algorithm 1, two sparse matrix-vector multiplications, six saxpy’s, and five dot products are computed. Provided that each processor holds the scalars and its corresponding part of the vectors, the saxpy’s can be parallelized without communication. However, this is not the case for the sparse matrix-vector multiplications and the dot products, which require communication to obtain the desired results. Such operations cause a severe performance degradation, especially when using modern computing resources.

1.2 S-Step BiCGStab

To reduce the communication in parallel and sequential implementations of BiCGStab, Carson et al. [5] introduced the s-step version of BiCGStab. The reformulation is based on the computation of $s$ BiCGStab iterations at once, and on the fact that for $m \geq 0$ and $1 \leq j \leq s$

$\begin{cases} p_{m+j}, r_{m+j} \in \mathcal{K}_{2s+1}(A, p_m) + \mathcal{K}_{2s}(A, r_m) \\ x_{m+j} - x_m \in \mathcal{K}_{2s}(A, p_m) + \mathcal{K}_{2s-1}(A, r_m) \end{cases}$ (8)

since $\mathcal{K}_{2j+1}(A, z) \subseteq \mathcal{K}_{2s+1}(A, z)$ for any $z \neq 0$.

The goal is to perform more flops per communication, by computing $2s$ matrix-vector products at the beginning of each iteration of the s-step BiCGStab. This would reduce the communication cost, specifically the number of messages, by a factor of $O(s)$ in parallel [5]. However, this is not possible using the same formulation as BiCGStab. Therefore, at the beginning of each s-step iteration, one computes $P_{2s+1}$ and $R_{2s}$, the Krylov matrices corresponding to the $\mathcal{K}_{2s+1}(A, p_m)$ and $\mathcal{K}_{2s}(A, r_m)$ bases respectively, where $m = 0, s, 2s, 3s, \dots$. Then, by Equation (8), $p_{m+j}$, $r_{m+j}$, and $x_{m+j} - x_m$ can be defined as the product of the basis vectors and a span vector, for $j = 0, \dots, s$,

$p_{m+j} = [P_{2s+1}, R_{2s}] a_j$ (9)

$r_{m+j} = [P_{2s+1}, R_{2s}] c_j$ (10)

$x_{m+j} = x_m + [P_{2s+1}, R_{2s}] e_j$ (11)

where $[P_{2s+1}, R_{2s}]$ is an $n \times (4s + 1)$ matrix containing the basis vectors of $\mathcal{K}_{2s+1}(A, p_m)$ and $\mathcal{K}_{2s}(A, r_m)$, and $a_j$, $c_j$, and $e_j$ are span vectors of size $4s + 1$. Note that $e_0 = 0$. As for $a_0$ and $c_0$, their definition depends on the type of computed basis. One can compute a basis defined by a recurrence relation with three or fewer terms, such as a monomial, scaled monomial, Newton, or Chebyshev basis. Then, we have that

$A P_{2s} = P_{2s+1} T_{2s+1}$ (12)

$A R_{2s-1} = R_{2s} T_{2s}$ (13)

$A [P_{2s}, 0, R_{2s-1}, 0] = [P_{2s+1}, R_{2s}] T'$ (14)

where $T_{2s+1}$ and $T_{2s}$ are change of basis matrices of size $(2s + 1) \times (2s)$ and $(2s) \times (2s - 1)$ respectively, and $T' = \begin{bmatrix} [T_{2s+1}, 0] & 0 \\ 0 & [T_{2s}, 0] \end{bmatrix}$ is a $(4s + 1) \times (4s + 1)$ matrix.

The definition of $T_{2s+1}$, $T_{2s}$ and consequently $T'$ depends on the chosen type of basis

$[P_{2s+1}, R_{2s}] = [\bar{p}_m, \bar{p}_{m+1}, \dots, \bar{p}_{m+2s}, \bar{r}_m, \bar{r}_{m+1}, \dots, \bar{r}_{m+2s-1}]$

For example, in the monomial basis case, where $\bar{p}_m = p_m$, $\bar{r}_m = r_m$, $\bar{p}_{m+i} = A \bar{p}_{m+i-1}$, and $\bar{r}_{m+i} = A \bar{r}_{m+i-1}$ for $i > 0$, the matrices $T_{2s+1}$ and $T_{2s}$ are all zeros except the lower diagonal which is ones, i.e. $T_{2s+1}(i + 1, i) = 1$ for $i = 1, \dots, 2s$. In the case of the scaled monomial basis, where $\bar{p}_m = \frac{p_m}{\|p_m\|}$ and $\bar{p}_{m+i} = \frac{A \bar{p}_{m+i-1}}{\|A \bar{p}_{m+i-1}\|}$, the matrices are defined as $T_{2s+1}(i + 1, i) = \|A \bar{p}_{m+i-1}\|$ for $i = 1, \dots, 2s$, and $T_{2s}(i + 1, i) = \|A \bar{r}_{m+i-1}\|$ for $i = 1, \dots, 2s - 1$, and zero elsewhere.
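For the monomial basis, relation (14) can be verified directly on a small example. The sketch below is illustrative only: the function names `monomial_basis` and `change_of_basis`, the random test matrix, and the dense storage are assumptions, and the basis is built by plain repeated products rather than by a matrix powers kernel.

```python
import numpy as np

def monomial_basis(A, p, r, s):
    """Return [P_{2s+1}, R_{2s}], an n x (4s+1) matrix, for the monomial basis."""
    P = [p]
    for _ in range(2 * s):              # p, Ap, ..., A^{2s} p
        P.append(A @ P[-1])
    R = [r]
    for _ in range(2 * s - 1):          # r, Ar, ..., A^{2s-1} r
        R.append(A @ R[-1])
    return np.column_stack(P + R)

def change_of_basis(s):
    """Return T', (4s+1) x (4s+1), with ones on the subdiagonal of each block."""
    T = np.zeros((4 * s + 1, 4 * s + 1))
    for i in range(2 * s):              # block [T_{2s+1}, 0]
        T[i + 1, i] = 1.0
    off = 2 * s + 1
    for i in range(2 * s - 1):          # block [T_{2s}, 0]
        T[off + i + 1, off + i] = 1.0
    return T

# Hypothetical small example.
rng = np.random.default_rng(0)
n, s = 12, 2
A = rng.standard_normal((n, n)) / np.sqrt(n)
p, r = rng.standard_normal(n), rng.standard_normal(n)
V = monomial_basis(A, p, r, s)
T = change_of_basis(s)
V0 = V.copy()
V0[:, 2 * s] = 0.0                      # [P_{2s}, 0, ...]: drop last P column
V0[:, 4 * s] = 0.0                      # [..., R_{2s-1}, 0]: drop last R column
```

With these definitions, `A @ V0` and `V @ T` agree up to rounding, which is exactly Equation (14).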

The reformulation of BiCGStab into s-step BiCGStab starts by substituting the definitions (9)-(11) into Equations (1)-(3), and taking into consideration that for $j = 0$ to $s - 1$,

$A p_{m+j} = A [P_{2s+1}, R_{2s}] a_j = A [P_{2s}, 0, R_{2s-1}, 0] a_j$ (15)

$= [P_{2s+1}, R_{2s}] T' a_j$ (16)

$A r_{m+j} = [P_{2s+1}, R_{2s}] T' c_j$ (17)

$A^2 p_{m+j} = [P_{2s+1}, R_{2s}] (T')^2 a_j$ (18)

Then we get the following,

$e_{j+1} = e_j + \alpha_{m+j} a_j + \omega_{m+j} c_j - \omega_{m+j} \alpha_{m+j} T' a_j$ (19)

$a_{j+1} = c_{j+1} + \beta_{m+j} a_j - \beta_{m+j} \omega_{m+j} T' a_j$ (20)

$c_{j+1} = c_j - \alpha_{m+j} T' a_j - \omega_{m+j} T' (c_j - \alpha_{m+j} T' a_j)$ (21)

with $e_0 = 0$, $a_0 = [1, 0_{4s}]^T$, and $c_0 = [0_{2s+1}, 1, 0_{2s-1}]^T$ in the case of the monomial and Newton bases. Then, the span vectors $a_j$, $c_j$, and $e_j$ are updated for $j = 1, \dots, s$, rather than $p_{m+j}$, $r_{m+j}$, and $x_{m+j}$, which are of size $n \gg 4s + 1$.

As for $\alpha_{m+j}$, $\delta_{m+j+1}$, and $\omega_{m+j}$, it is sufficient to replace $r_{m+j}$ and $p_{m+j}$ by definitions (10) and (9) to obtain

$\delta_{m+j+1} = \langle \tilde{r}_0, r_{m+j+1} \rangle = \langle g, c_{j+1} \rangle$

$\alpha_{m+j} = \delta_{m+j} / \langle \tilde{r}_0, A p_{m+j} \rangle = \delta_{m+j} / \langle g, T' a_j \rangle$

$\omega_{m+j} = \frac{\langle T' c_j - \alpha_{m+j} (T')^2 a_j, G c_j - \alpha_{m+j} G T' a_j \rangle}{\langle T' c_j - \alpha_{m+j} (T')^2 a_j, G T' c_j - \alpha_{m+j} G (T')^2 a_j \rangle}$

where $G = [P_{2s+1}, R_{2s}]^T [P_{2s+1}, R_{2s}]$ is a Gram-like matrix and $g = [P_{2s+1}, R_{2s}]^T \tilde{r}_0$. Then, $\beta_{m+j} = \frac{\delta_{m+j+1}}{\delta_{m+j}} \frac{\alpha_{m+j}}{\omega_{m+j}}$.

Algorithm 2: s-step BiCGStab

Input: $A$, $b$, $x_0$, $m_{max}$, $s$, Type of Basis

Output: $x_m$: the $m$th approximate solution satisfying the stopping criteria

1: Let $r_0 = b - A x_0$, $p_0 = r_0$, $\rho_0 = \langle r_0, r_0 \rangle$, and $k = 0$

2: Choose $\tilde{r}_0$ such that $\delta_0 = \langle \tilde{r}_0, r_0 \rangle \neq 0$

3: While ($\sqrt{\rho_k} > \epsilon \|b\|_2$ and $k < \lfloor \frac{m_{max}}{s} \rfloor$) Do

4: Compute $P_{2s+1}$ and $R_{2s}$ depending on Type of Basis, and output

5: the diagonals of $T'$

6: $G = [P_{2s+1}, R_{2s}]^T [P_{2s+1}, R_{2s}]$ and $g = [P_{2s+1}, R_{2s}]^T \tilde{r}_0$

7: Initialize $a_0$, $e_0$, $c_0$ and set $m = k * s$

8: for ($j = 0$ to $s - 1$) Do

9: $t_a = T' a_j$

10: $\alpha_{m+j} = \delta_{m+j} / \langle g, t_a \rangle$

11: $d = c_j - \alpha_{m+j} t_a$

12: $t_d = T' d$

13: $g_d = G d$

14: $g_t = G t_d$

15: $\omega_{m+j} = \langle t_d, g_d \rangle / \langle t_d, g_t \rangle$

16: $e_{j+1} = e_j + \alpha_{m+j} a_j + \omega_{m+j} d$

17: $c_{j+1} = c_j - \omega_{m+j} t_d - \alpha_{m+j} t_a$

18: $\delta_{m+j+1} = \langle g, c_{j+1} \rangle$

19: $\beta_{m+j} = (\delta_{m+j+1} / \delta_{m+j})(\alpha_{m+j} / \omega_{m+j})$

20: $a_{j+1} = c_{j+1} + \beta_{m+j} a_j - \beta_{m+j} \omega_{m+j} t_a$

21: end for

22: $p_{m+s} = [P_{2s+1}, R_{2s}] a_s$, $r_{m+s} = [P_{2s+1}, R_{2s}] c_s$

23: $x_{m+s} = x_m + [P_{2s+1}, R_{2s}] e_s$

24: $\rho_{m+s} = \langle r_{m+s}, r_{m+s} \rangle$, $k = k + 1$

25: end While
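A sequential sketch of Algorithm 2 with the monomial basis can help make the coefficient-space updates concrete. It is a minimal illustration under several assumptions: dense NumPy storage, $\tilde{r}_0 = r_0$, a basis built by plain repeated products instead of a matrix powers kernel, no breakdown handling, and a hypothetical function name `s_step_bicgstab` and test system that do not come from the paper.

```python
import numpy as np

def s_step_bicgstab(A, b, x0, s=2, m_max=200, eps=1e-8):
    """Sequential s-step BiCGStab (Algorithm 2) with the monomial basis."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    r_tilde = r.copy()                         # assumed choice: r~_0 = r_0
    delta, rho = r_tilde @ r, r @ r
    b_norm = np.linalg.norm(b)
    dim = 4 * s + 1
    T = np.zeros((dim, dim))                   # T' for the monomial basis
    for i in range(4 * s):
        if i != 2 * s:                         # column 2s (last of P block) is zero
            T[i + 1, i] = 1.0
    k = 0
    while np.sqrt(rho) > eps * b_norm and k < m_max // s:
        cols = [p]
        for _ in range(2 * s):                 # P_{2s+1}
            cols.append(A @ cols[-1])
        cols.append(r)
        for _ in range(2 * s - 1):             # R_{2s}
            cols.append(A @ cols[-1])
        V = np.column_stack(cols)
        G = V.T @ V                            # Gram-like matrix, once per outer step
        g = V.T @ r_tilde
        a = np.zeros(dim); a[0] = 1.0          # a_0 = [1, 0_{4s}]^T
        c = np.zeros(dim); c[2 * s + 1] = 1.0  # c_0 = [0_{2s+1}, 1, 0_{2s-1}]^T
        e = np.zeros(dim)
        for _ in range(s):                     # inner steps use only size-(4s+1) data
            t_a = T @ a
            alpha = delta / (g @ t_a)
            d = c - alpha * t_a
            t_d = T @ d
            omega = (t_d @ (G @ d)) / (t_d @ (G @ t_d))
            e = e + alpha * a + omega * d
            c = d - omega * t_d
            delta_new = g @ c
            beta = (delta_new / delta) * (alpha / omega)
            a = c + beta * (a - omega * t_a)
            delta = delta_new
        p, r, x = V @ a, V @ c, x + V @ e      # recover the length-n vectors
        rho = r @ r
        k += 1
    return x, k

# Hypothetical nonsymmetric, well-conditioned test system.
n = 30
A = np.eye(n) - 0.2 * np.diag(np.ones(n - 1), 1) - 0.3 * np.diag(np.ones(n - 1), -1)
b = np.random.default_rng(0).standard_normal(n)
x, outer = s_step_bicgstab(A, b, np.zeros(n), s=2)
```

Only the basis construction, the Gram-like matrix, and the final recovery touch vectors of length $n$; everything inside the inner loop works on vectors of length $4s + 1$.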

Algorithm 2 is a reformulation of BiCGStab that reduces communication where the matrix-vector multiplications are grouped at the beginning of the outer iteration, and the Gram-like matrix G is computed once per outer iteration. Then, in the inner iterations, the vector operations of size n are replaced by vector operations of size 4s + 1, where 4s + 1 ≪ n. However, this reformulation alone is not sufficient to reduce communication.

For example, in the sequential case the basis computation should be done using the matrix powers kernel [12], where ${\bar{p}}_{m+j+1}$ is computed by parts that fit into cache memory, for j = 0, …, 2s − 1. This reduces communication in the memory hierarchy of the processor and increases cache hits. In the parallel case, each processor first fetches the needed data from neighboring processors, and then computes its assigned part of the 2s vectors ${\bar{p}}_{m+j+1}$ without any further communication, for j = 0, …, 2s − 1. Similarly, we can compute the 2s − 1 vectors ${\bar{r}}_{m+j+1}$ for j = 0, …, 2s − 2. Note that it is possible to compute the two bases simultaneously using a block version of the matrix powers kernel that computes a block of vectors without communication.

1.3 S-Step BiCGStab for Geoscience Applications

In geoscience applications, specifically in reservoir simulations, at each time step a new linear system of the form Ax = b has to be solved. The difficulty and the ill-conditioning of the systems may vary throughout the simulation. However, in most cases, an iterative method and a preconditioner are chosen at the beginning of the simulation and are used for solving all the obtained linear systems. Since the obtained systems are not symmetric, Krylov subspace methods such as BiCGStab are used. Our aim is to implement a numerically stable version of s-step BiCGStab that has a convergence rate similar to that of BiCGStab for the systems arising in reservoir simulations. The stability of s-step BiCGStab is related to the chosen s value and to the type of the basis.

Thus, we study the convergence of the s-step BiCGStab method (Algorithm 2) for different s values, using the monomial and Newton bases [7, 14, 15], and compare it to BiCGStab’s convergence. We do not consider the scaled versions of the monomial and Newton bases, since they require computing the norm of each basis vector as it is generated, which eliminates the possibility of avoiding communication in the matrix powers kernel. The test matrices, described in Section 3.1, are obtained from different reservoir simulations. A sample of the obtained results is described in Section 3.2 (Tab. 3). For the well-conditioned matrices, the s-step BiCGStab with the monomial basis converges in fewer s-step iterations as s increases from 2 to 6. But for the ill-conditioned matrices, the convergence of the s-step BiCGStab with the monomial basis is chaotic with respect to s. Note that we focus on the consistency of the convergence behavior as s increases for the following reasons. First, as s increases, the communication cost per s-step BiCGStab iteration decreases, since more flops are performed per communication. Moreover, if the convergence of s-step BiCGStab improves as s increases, then the overall communication cost decreases, which should lead to a speedup in parallel implementations. Second, although we test the convergence of s-step BiCGStab on matrices obtained from reservoir simulations, our goal is the speedup obtained in the full simulation. As mentioned earlier, the obtained linear systems could be well-conditioned or ill-conditioned. However, at the beginning of the simulation a single s value is chosen without having any information on the systems to be solved. Thus, we would like to implement an s-step BiCGStab version for which the number of iterations needed for convergence decreases as s increases up to some upper limit.

An alternative to the monomial basis is the Newton basis, which is known to be more stable [7] in the case of GMRES. However, in the case of s-step BiCGStab, computing a Newton basis is expensive. First, the 2s largest eigenvalues have to be computed once per matrix using some library such as ARPACK [16]. In general, the computed eigenvalues could be almost equal, which does not improve the stability of the basis. Thus, the eigenvalues are reordered in the Leja ordering as discussed in [7]. For the well-conditioned matrices obtained in geoscience simulations, using the Newton basis did not improve the convergence of s-step BiCGStab. Moreover, for the ill-conditioned matrices, finding the desired number of eigenvalues is time consuming. In addition, in geoscience simulations, we seek a relatively “cheap” and stable basis, since several linear systems are solved during the simulation (at least one per time step). For all these reasons, we will use the monomial basis and improve its numerical stability, as discussed in the next section.
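To illustrate the reordering step mentioned above, the sketch below implements the textbook Leja ordering: start from the point of largest modulus, then repeatedly pick the point that maximizes the product of distances to the points already chosen. This is an assumption about the general technique, not necessarily the exact (possibly modified) variant used in [7]; the function name `leja_order` and the sample values are hypothetical.

```python
import numpy as np

def leja_order(points):
    """Return the points reordered in (plain) Leja order."""
    pts = np.asarray(points, dtype=complex)
    remaining = list(range(len(pts)))
    # First point: largest modulus.
    first = max(remaining, key=lambda i: abs(pts[i]))
    order = [first]
    remaining.remove(first)
    while remaining:
        # Next point: maximize the product of distances to the chosen points.
        nxt = max(remaining, key=lambda i: np.prod(np.abs(pts[i] - pts[order])))
        order.append(nxt)
        remaining.remove(nxt)
    return pts[order]

# Hypothetical eigenvalue estimates (shifts for a Newton basis).
shifts = leja_order([1.0, 2.0, 4.0, 8.0])
```

The ordering spreads consecutive shifts apart, which is why nearly equal eigenvalue estimates do not end up next to each other in the Newton basis recurrence.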

2 Orthonormalized S-Step BiCGStab

The s-step BiCGStab method with the monomial basis has an irregular convergence with respect to the s values, and converges more slowly than BiCGStab for some of the tested systems. This irregular and slow convergence might be due to the fact that the estimated residual used in the stopping criterion is not equal to the exact residual [4]. However, in our case, the slow convergence is caused by the basis vectors, which become numerically linearly dependent. One way to improve the stability of the basis is by orthonormalizing it. We propose a new variant of s-step BiCGStab that orthonormalizes the basis vectors. This new version is referred to as orthonormalized s-step BiCGStab (Algorithm 3).

There are several ways of constructing the basis and orthonormalizing it. However, we derive the algorithm irrespective of the method used. We replace the $4s + 1$ basis vectors $[P_{2s+1}, R_{2s}]$ by an orthonormal basis $Q_{4s+1}$. Then, the vectors $p_{m+j}$, $r_{m+j}$, and $x_{m+j}$ can be defined as follows,

$p_{m+j} = Q_{4s+1} a_j$ (22)

$r_{m+j} = Q_{4s+1} c_j$ (23)

$x_{m+j} - x_m = Q_{4s+1} e_j$ (24)

The $n \times (4s + 1)$ orthonormal matrix $Q_{4s+1}$ should satisfy $A Q_{4s+1} v = Q_{4s+1} H_{4s+1} v$, where $Q_{4s+1} v \in \mathcal{K}_{2s}(A, p_m) + \mathcal{K}_{2s-1}(A, r_m)$ and $H_{4s+1}$ is a $(4s + 1) \times (4s + 1)$ upper Hessenberg matrix. Then, for $j = 0$ to $s - 1$ we get

$A p_{m+j} = Q_{4s+1} H_{4s+1} a_j$ (25)

$A r_{m+j} = Q_{4s+1} H_{4s+1} c_j$ (26)

$A^2 p_{m+j} = Q_{4s+1} (H_{4s+1})^2 a_j$ (27)

By substituting the definitions (22)-(27) into Equations (1)-(3), we get that

$\begin{array}{c} e_{j+1} = e_j + \alpha_{m+j} a_j + \omega_{m+j} c_j - \omega_{m+j} \alpha_{m+j} H_{4s+1} a_j \\ a_{j+1} = c_{j+1} + \beta_{m+j} a_j - \beta_{m+j} \omega_{m+j} H_{4s+1} a_j \\ c_{j+1} = c_j - \alpha_{m+j} H_{4s+1} a_j - \omega_{m+j} H_{4s+1} (c_j - \alpha_{m+j} H_{4s+1} a_j) \end{array}$

with $e_0 = 0$. As for $a_0$ and $c_0$, their definitions depend on the orthonormalization technique used. We will discuss this in Section 2.1.

As for $\alpha_{m+j}$ and $\delta_{m+j+1}$, it is sufficient to replace $r_{m+j}$ and $p_{m+j}$ by definitions (23) and (22) to obtain

$\begin{array}{c} \delta_{m+j+1} = \langle \tilde{r}_0, r_{m+j+1} \rangle = \langle g, c_{j+1} \rangle \\ \alpha_{m+j} = \frac{\delta_{m+j}}{\langle \tilde{r}_0, A p_{m+j} \rangle} = \frac{\delta_{m+j}}{\langle g, H_{4s+1} a_j \rangle} \end{array}$

where $g = Q_{4s+1}^T \tilde{r}_0$. Similarly for $\omega_{m+j}$, we get

$\omega_{m+j} = \frac{\langle c_j - \alpha_{m+j} H_{4s+1} a_j, H_{4s+1} c_j - \alpha_{m+j} (H_{4s+1})^2 a_j \rangle}{\langle H_{4s+1} c_j - \alpha_{m+j} (H_{4s+1})^2 a_j, H_{4s+1} c_j - \alpha_{m+j} (H_{4s+1})^2 a_j \rangle}$ (28)

since $Q_{4s+1}^T Q_{4s+1} = I$. Finally, $\beta_{m+j} = \frac{\delta_{m+j+1}}{\delta_{m+j}} \frac{\alpha_{m+j}}{\omega_{m+j}}$.

Algorithm 3 describes the orthonormalized s-step BiCGStab method except for the orthonormal basis construction phase, which we discuss in Section 2.1.

Algorithm 3: Orthonormalized s-step BiCGStab

Input: $A$, $b$, $x_0$, $m_{max}$, $s$, Type of Basis

Output: $x_m$: the $m$th approximate solution satisfying the stopping criteria

1: Let $r_0 = b - A x_0$, $p_0 = r_0$, $\rho_0 = \langle r_0, r_0 \rangle$, and $k = 0$

2: Choose $\tilde{r}_0$ such that $\delta_0 = \langle \tilde{r}_0, r_0 \rangle \neq 0$

3: While ($\sqrt{\rho_k} > \epsilon \|b\|_2$ and $k < \lfloor \frac{m_{max}}{s} \rfloor$) Do

4: Compute the orthonormal basis $Q_{4s+1}$ and

5: the upper Hessenberg matrix $H_{4s+1}$

6: Compute $g = Q_{4s+1}^T \tilde{r}_0$

7: Initialize $a_0$, $e_0$, $c_0$ and set $m = k * s$

8: for ($j = 0$ to $s - 1$) Do

9: $h_a = H_{4s+1} a_j$

10: $\alpha_{m+j} = \delta_{m+j} / \langle g, h_a \rangle$

11: $d = c_j - \alpha_{m+j} h_a$

12: $h_d = H_{4s+1} d$

13: $\omega_{m+j} = \langle d, h_d \rangle / \langle h_d, h_d \rangle$

14: $e_{j+1} = e_j + \alpha_{m+j} a_j + \omega_{m+j} d$

15: $c_{j+1} = c_j - \omega_{m+j} h_d - \alpha_{m+j} h_a$

16: $\delta_{m+j+1} = \langle g, c_{j+1} \rangle$

17: $\beta_{m+j} = (\delta_{m+j+1} / \delta_{m+j})(\alpha_{m+j} / \omega_{m+j})$

18: $a_{j+1} = c_{j+1} + \beta_{m+j} a_j - \beta_{m+j} \omega_{m+j} h_a$

19: end for

20: $p_{m+s} = Q_{4s+1} a_s$, $r_{m+s} = Q_{4s+1} c_s$, $x_{m+s} = x_m + Q_{4s+1} e_s$

21: $\rho_{m+s} = \langle r_{m+s}, r_{m+s} \rangle$, $k = k + 1$

22: end While

2.1 Construction of the Orthonormal Basis

The simplest parallelizable way to compute the orthonormal basis Q_4s+1 is to first compute [P_2s+1, R_2s] using the matrix powers kernel, and then orthonormalize it using a QR algorithm, such as the Tall and Skinny QR (TSQR) algorithm [13], which requires log(p) messages in parallel, where p is the number of processors. In this case, $a_0 = U_{4s+1} [1, 0_{4s}]^T$ and $c_0 = U_{4s+1} [0_{2s+1}, 1, 0_{2s-1}]^T$, where U_4s+1 is the (4s + 1) × (4s + 1) upper triangular matrix obtained from the QR factorization of [P_2s+1, R_2s]. In addition, $A [P_{2s}, 0, R_{2s-1}, 0] = [P_{2s+1}, R_{2s}] T'$, where T′ is the change of basis matrix. By replacing [P_2s+1, R_2s] by Q_4s+1U_4s+1, obtained from the QR factorization, and by assuming that U_4s+1 is invertible, we get:

$\begin{array}{lll} A Q_{4s+1} v & = & A [P_{2s+1}, R_{2s}] U_{4s+1}^{-1} v \\ & = & A [P_{2s}, 0, R_{2s-1}, 0] U_{4s+1}^{-1} v \\ & = & [P_{2s+1}, R_{2s}] T' U_{4s+1}^{-1} v \\ & = & Q_{4s+1} U_{4s+1} T' U_{4s+1}^{-1} v \\ & = & Q_{4s+1} H_{4s+1} v \end{array}$

where Q_4s+1 is an n × (4s + 1) orthonormal matrix, $H_{4s+1} = U_{4s+1} T' U_{4s+1}^{-1}$ is a (4s + 1) × (4s + 1) upper Hessenberg matrix, and $Q_{4s+1} v \in \mathcal{K}_{2s}(A, p_m) + \mathcal{K}_{2s-1}(A, r_m)$. Note that the matrix H_4s+1 is never constructed: multiplying H_4s+1 by a vector is equivalent to solving an upper triangular system and multiplying T′ and U_4s+1 by a vector.
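The relation above can be checked numerically on a small example. The following is a minimal sketch, assuming a monomial basis and NumPy; the helper names `shift_matrix` and `apply_H`, the test matrix and the random vectors are illustrative choices, not part of the original implementation. A generic dense solve stands in for the upper triangular solve.

```python
import numpy as np

def shift_matrix(s):
    """Change of basis matrix T' for the monomial bases P_{2s+1} and R_{2s}."""
    m = 4 * s + 1
    T = np.zeros((m, m))
    for i in range(2 * s):          # A * (A^i p) = A^(i+1) p
        T[i + 1, i] = 1.0
    for i in range(2 * s - 1):      # A * (A^i r) = A^(i+1) r
        T[2 * s + 2 + i, 2 * s + 1 + i] = 1.0
    return T

def apply_H(U, T, v):
    """Return H v = U T' U^{-1} v without ever forming H."""
    w = np.linalg.solve(U, v)       # generic solve; a triangular solver would exploit U's structure
    return U @ (T @ w)

# Illustrative data (assumption): a small nonsymmetric matrix and two random vectors.
n, s = 30, 2
rng = np.random.default_rng(0)
A = np.diag(np.linspace(1.0, 2.0, n)) + 0.01 * rng.standard_normal((n, n))
p, r = rng.standard_normal(n), rng.standard_normal(n)

P = np.column_stack([np.linalg.matrix_power(A, i) @ p for i in range(2 * s + 1)])
R = np.column_stack([np.linalg.matrix_power(A, i) @ r for i in range(2 * s)])
Q, U = np.linalg.qr(np.hstack([P, R]))
T = shift_matrix(s)

# v must represent a vector of K_{2s}(A, p) + K_{2s-1}(A, r):
# the coefficients of A^{2s} p and A^{2s-1} r are set to zero.
w = rng.standard_normal(4 * s + 1)
w[2 * s] = 0.0
w[4 * s] = 0.0
v = U @ w

lhs = A @ (Q @ v)
rhs = Q @ apply_H(U, T, v)
err = np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs)
print("relative error of A Q v = Q H v:", err)
```

In this sketch p and r are independent random vectors, so U_4s+1 is invertible; the case where this assumption fails is the subject of the next paragraph.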

However, there are two issues to take into consideration in the construction of the orthonormal basis. First, at iteration k ≥ 0, we compute two bases P_2s+1 and R_2s for two different subspaces, $\mathcal{K}_{2s+1}(A, p_k)$ and $\mathcal{K}_{2s}(A, r_k)$. There is no guarantee that the two bases are linearly independent of each other; in other words, the 4s + 1 vectors obtained are not necessarily linearly independent. Moreover, the upper triangular matrix obtained from the orthonormalization of a linearly dependent set of vectors is not invertible. Second, at iteration k = 0, we compute two bases of the subspaces $\mathcal{K}_{2s+1}(A, r_0)$ and $\mathcal{K}_{2s}(A, r_0)$, since p₀ is initialized to r₀, as in the BiCGStab and s-step BiCGStab algorithms. The second subspace is then contained in the first, so the 4s + 1 vectors are necessarily linearly dependent.

A solution to the first problem is to perform a split orthonormalization, where P_2s+1 = Q_2s+1U_2s+1 and R_2s = Q_2sU_2s are orthonormalized separately. Then, $Q_{4s+1} = [Q_{2s+1}, Q_{2s}]$ and $U_{4s+1} = \left(\begin{array}{ll} U_{2s+1} & 0 \\ 0 & U_{2s} \end{array}\right)$ still satisfy the relation $A Q_{4s+1} v = Q_{4s+1} H_{4s+1} v$, where $H_{4s+1} = U_{4s+1} T' U_{4s+1}^{-1}$. However, Q_4s+1 is no longer orthonormal; only Q_2s+1 and Q_2s are. Note that in the derivation of the orthonormalized s-step BiCGStab in Section 2, the orthonormality of Q_4s+1 is needed only for the definition of ω_m+j, which is obtained by minimizing the L2 norm of r_m+j+1. If instead we minimize the B norm of r_m+j+1, where B is an n × n matrix that satisfies $Q_{4s+1}^T B Q_{4s+1} = I_{4s+1}$, then ω_m+j is still defined as in Equation (28). We call this version split orthonormalized s-step BiCGStab (Algorithm 4).
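The difference between a joint and a split orthonormalization can be illustrated on the k = 0 case, where p₀ = r₀. The sketch below assumes NumPy and a monomial basis; the helpers `krylov_basis` and `split_qr` and the test matrix are hypothetical names and data chosen for illustration only.

```python
import numpy as np

def krylov_basis(A, v, m):
    """Monomial basis [v, A v, ..., A^(m-1) v]."""
    K = np.empty((v.size, m))
    K[:, 0] = v
    for i in range(1, m):
        K[:, i] = A @ K[:, i - 1]
    return K

def split_qr(P, R):
    """Orthonormalize P and R separately; return Q = [Q_P, Q_R] and block diagonal U."""
    Qp, Up = np.linalg.qr(P)
    Qr, Ur = np.linalg.qr(R)
    k1, k2 = Up.shape[0], Ur.shape[0]
    U = np.zeros((k1 + k2, k1 + k2))
    U[:k1, :k1] = Up
    U[k1:, k1:] = Ur
    return np.hstack([Qp, Qr]), U

# Illustrative data (assumption): small upper bidiagonal matrix, p0 = r0.
n, s = 40, 2
A = np.diag(np.linspace(1.0, 3.0, n)) + np.diag(np.full(n - 1, 0.1), 1)
r0 = np.cos(np.arange(n))
P = krylov_basis(A, r0, 2 * s + 1)
R = krylov_basis(A, r0, 2 * s)

# Joint QR of [P, R]: the columns are linearly dependent, so U is numerically singular.
_, U_joint = np.linalg.qr(np.hstack([P, R]))
joint_min_diag = np.min(np.abs(np.diag(U_joint)))

# Split QR: both diagonal blocks of U stay invertible, but Q is not orthonormal.
Q, U = split_qr(P, R)
split_min_diag = np.min(np.abs(np.diag(U)))
gram_defect = np.linalg.norm(Q.T @ Q - np.eye(4 * s + 1))

print("smallest |diag(U)|, joint QR:", joint_min_diag)
print("smallest |diag(U)|, split QR:", split_min_diag)
print("||Q^T Q - I||_F, split QR   :", gram_defect)
```

The nonzero Gram defect reflects the statement above that only the two blocks of Q_4s+1 are orthonormal, which is why ω_m+j is interpreted through the B norm rather than the L2 norm.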

Algorithm 4: Split Orthonormalized s-step BiCGStab

Input:A, b, x₀, m_max, s, Type of Basis

Output:x_m: the mth approximate solution satisfying the stopping criteria

1: Let r₀ = b − Ax₀, p₀ = r₀, ρ₀ = 〈r₀, r₀〉, and k = 0

2: Choose $\tilde{r}_0$ such that $\delta_0 = \langle \tilde{r}_0, r_0 \rangle \ne 0$.

3: While $\left(\sqrt{\rho_k} > \epsilon \|b\|_2 \ \mathrm{and}\ k < \lfloor \frac{m_{\max}}{s} \rfloor\right)$ Do

4: Compute P_2s+1 and R_2s depending on Type of Basis, and output the diagonals of T′

5: Perform the QR factorization of P_2s+1 = Q_2s+1U_2s+1

6: and R_2s = Q_2sU_2s.

7: Let Q_4s+1 = [Q_2s+1, Q_2s], $U_{4s+1} = \left(\begin{array}{ll} U_{2s+1} & 0 \\ 0 & U_{2s} \end{array}\right)$,

8: and ${H}_{4s+1}={U}_{4s+1}T\mathrm{\prime}{U}_{4s+1}^{-1}$

9: Compute $g = Q_{4s+1}^T \tilde{r}_0$

10: Initialize a₀, e₀, c₀ and set m = k * s

11: for (j = 0 to s − 1) Do

12: h_a = H_4s+1a_j

13: α_m+j = δ_m+j / 〈 g, h_a 〉

14: d = c_j − α_m+jh_a

15: h_d = H_4s+1d

16: $\omega_{m+j} = \frac{\langle d, h_d \rangle}{\langle h_d, h_d \rangle}$

17: e_j+1 = e_j + α_m+ja_j + ω_m+jd

18: c_j+1 = c_j − ω_m+jh_d − α_m+jh_a

19: δ_m+j+1 = 〈 g, c_j+1 〉

20: β_m+j = (δ_m+j+1/δ_m+j)(α_m+j/ω_m+j)

21: a_j+1 = c_j+1 + β_m+ja_j − β_m+jω_m+jh_a

22: end for

23: p_m+s = Q_4s+1a_s, r_m+s = Q_4s+1c_s, x_m+s = x_m + Q_4s+1e_s

24: ρ_m+s = 〈 r_m+s, r_m+s 〉, k = k + 1

25: end While
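The steps of Algorithm 4 can be sketched in a few lines of sequential NumPy. This is a minimal illustration under stated assumptions, not the MCG Solver implementation: the monomial basis is used, the shadow residual is chosen as r₀, e₀ is set to zero, a generic dense solve replaces the triangular solve, and the function name `split_ortho_sstep_bicgstab` and the test problem are hypothetical.

```python
import numpy as np

def split_ortho_sstep_bicgstab(A, b, x0, s, m_max, eps=1e-8):
    """Sequential sketch of Algorithm 4 with a monomial basis."""
    n, m = b.size, 4 * s + 1
    T = np.zeros((m, m))                       # change of basis matrix T'
    for i in range(2 * s):
        T[i + 1, i] = 1.0
    for i in range(2 * s - 1):
        T[2 * s + 2 + i, 2 * s + 1 + i] = 1.0

    x = np.array(x0, dtype=float)
    r = b - A @ x
    p = r.copy()
    r_tilde = r.copy()                         # assumption: shadow residual = r0
    delta, rho, k = r_tilde @ r, r @ r, 0
    b_norm = np.linalg.norm(b)

    while np.sqrt(rho) > eps * b_norm and k < m_max // s:
        # Step 4: monomial bases P_{2s+1} and R_{2s}
        P, R = np.empty((n, 2 * s + 1)), np.empty((n, 2 * s))
        P[:, 0], R[:, 0] = p, r
        for i in range(1, 2 * s + 1):
            P[:, i] = A @ P[:, i - 1]
        for i in range(1, 2 * s):
            R[:, i] = A @ R[:, i - 1]
        # Steps 5-7: split orthonormalization
        Qp, Up = np.linalg.qr(P)
        Qr, Ur = np.linalg.qr(R)
        Q = np.hstack([Qp, Qr])
        U = np.zeros((m, m))
        U[:2 * s + 1, :2 * s + 1] = Up
        U[2 * s + 1:, 2 * s + 1:] = Ur

        def apply_H(v):                        # Step 8: H v = U T' U^{-1} v, H is never formed
            return U @ (T @ np.linalg.solve(U, v))

        g = Q.T @ r_tilde                      # Step 9
        a, c, e = U[:, 0].copy(), U[:, 2 * s + 1].copy(), np.zeros(m)   # Step 10
        for j in range(s):                     # Steps 11-22, on short coefficient vectors only
            h_a = apply_H(a)
            alpha = delta / (g @ h_a)
            d = c - alpha * h_a
            h_d = apply_H(d)
            omega = (d @ h_d) / (h_d @ h_d)
            e = e + alpha * a + omega * d
            c = d - omega * h_d
            delta_new = g @ c
            beta = (delta_new / delta) * (alpha / omega)
            a = c + beta * a - beta * omega * h_a
            delta = delta_new
        p, r, x = Q @ a, Q @ c, x + Q @ e      # Step 23
        rho = r @ r                            # Step 24
        k += 1
    return x, k

# Illustrative well-conditioned test problem (assumption): 1D Laplacian-like matrix.
n = 100
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.random.default_rng(0).standard_normal(n)
x, outer_its = split_ortho_sstep_bicgstab(A, b, np.zeros(n), s=2, m_max=200)
rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
print("s-step iterations:", outer_its, " relative residual:", rel_res)
```

Note that the inner loop touches only vectors of length 4s + 1; the long vectors of length n appear only in the basis construction, the two QR factorizations, and the final updates of step 23.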

For the second problem, starting with p₀ ≠ r₀ might improve the convergence of all the previously discussed s-step BiCGStab versions. One might pick a random p₀; however, its effect on the convergence of the method is unknown. Thus, to be consistent with the previously introduced BiCGStab versions, we choose to perform one iteration of BiCGStab before constructing the first 4s + 1 basis vectors. The advantage is that all the information obtained in the BiCGStab iteration is used afterwards. Algorithm 5 can be considered as a “preprocessing” step for all the s-step algorithms: it is performed before the while loop and replaces the first two lines of Algorithm 2, Algorithm 3, or Algorithm 4. We refer to these versions as modified s-step BiCGStab, modified orthonormalized s-step BiCGStab, and modified split orthonormalized s-step BiCGStab.

Algorithm 5: Choosing p₀ not equal to r₀

1: Let r₀ = b − Ax₀, p₀ = r₀, ρ₀ = 〈 r₀, r₀ 〉, and k = 0

2: Choose $\tilde{r}_0$ such that $\delta_0 = \langle \tilde{r}_0, r_0 \rangle \ne 0$.

3: $\alpha_0 = \delta_0 / \langle \tilde{r}_0, A p_0 \rangle$

4: s = r₀ − α₀Ap₀, t = As

5: ω₀ = 〈 t, s 〉 / 〈 t, t 〉

6: x₀ = x₀ + α₀p₀ + ω₀s

7: r₀ = (I − ω₀A)s

8: β₀ = α₀/(ω₀δ₀)

9: $\delta_0 = \langle \tilde{r}_0, r_0 \rangle$, β₀ = β₀ * δ₀

10: p₀ = r₀ + β₀(I − ω₀A)p₀

11: ρ₀ = 〈 r₀, r₀ 〉
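Algorithm 5 is a single classical BiCGStab iteration whose outputs overwrite x₀, r₀, p₀, δ₀ and ρ₀. A minimal NumPy sketch follows, assuming the shadow residual is chosen as the initial residual; the function name `bicgstab_preprocessing_step` and the test problem are hypothetical, and the intermediate vector s of step 4 is named `s_vec` to avoid confusion with the s-step parameter.

```python
import numpy as np

def bicgstab_preprocessing_step(A, b, x0):
    """One BiCGStab iteration (Algorithm 5); returns the updated starting quantities."""
    r = b - A @ x0                       # step 1
    p = r.copy()
    r_tilde = r.copy()                   # step 2 (assumption: shadow residual = r0)
    delta = r_tilde @ r
    Ap = A @ p
    alpha = delta / (r_tilde @ Ap)       # step 3
    s_vec = r - alpha * Ap               # step 4
    t = A @ s_vec
    omega = (t @ s_vec) / (t @ t)        # step 5
    x = x0 + alpha * p + omega * s_vec   # step 6
    r = s_vec - omega * t                # step 7: (I - omega A) s
    beta = alpha / (omega * delta)       # step 8, uses the old delta
    delta = r_tilde @ r                  # step 9
    beta *= delta
    p = r + beta * (p - omega * Ap)      # step 10: the new p0 differs from the new r0
    rho = r @ r                          # step 11
    return x, r, p, r_tilde, delta, rho

# Illustrative test problem (assumption).
n = 100
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.random.default_rng(0).standard_normal(n)
x1, r1, p1, r_tilde, delta1, rho1 = bicgstab_preprocessing_step(A, b, np.zeros(n))
print("p0 differs from r0:", not np.allclose(p1, r1))
```

The returned r_tilde is the shadow residual that the subsequent s-step iterations keep using, so no information from this iteration is discarded.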

3 Results and Expected Performance

In this section, we show the convergence behavior of the newly introduced split orthonormalized s-step BiCGStab and modified split orthonormalized s-step BiCGStab on the matrices defined in Section 3.1, and compare it to that of BiCGStab, s-step BiCGStab, and modified s-step BiCGStab in Section 3.2. Then, in Section 3.3, we discuss the computation and communication cost of the split orthonormalized s-step BiCGStab and compare it to that of s-step BiCGStab and BiCGStab.

3.1 Test Matrices

The study cases presented in this paper are obtained from different representative models for reservoir simulations at different time steps. Table 1 illustrates the characteristics of the models from which the square test matrices, GCS2K, CantaF3, SPE10 [17], and HIS, are generated.

Table 1

The test matrices.

Consequently, the obtained matrices have different profiles and varying degrees of difficulty. Table 2 describes the test matrices.

Table 2

The reservoir models.

3.2 Convergence Results

We have implemented s-step BiCGStab and split orthonormalized s-step BiCGStab (Algorithm 4), whereby the monomial bases P_2s+1 and R_2s are first built and then orthonormalized separately using the QR factorization from MKL (Math Kernel Library). We have also implemented the corresponding modified versions, whereby one iteration of BiCGStab is performed before building the first 4s + 1 basis vectors. These algorithms are developed in MCG Solver [18, 19], a C++ based software package developed by IFPEN to provide linear solver algorithms for its industrial simulators in reservoir simulation, basin simulation, or engine combustion simulation.

Table 3 shows the convergence results for the s-step BiCGStab versions on the matrices introduced previously, with tolerance tol = 10⁻⁸ except for the SPE10 matrix (tol = 10⁻⁴). The number of iterations needed for the convergence of the s-step BiCGStab versions, referred to as s-Step Iterations (SI), is shown. For comparison, we also show the Total number of Iterations (TI) of the s-step BiCGStab versions, which is equal to the number of s-step iterations times s. Note that in exact arithmetic, each s-step BiCGStab iteration is equivalent to s iterations of the classical BiCGStab method. However, in terms of computation and communication cost, they are not equivalent, as discussed in the next section.

Table 3

The convergence of the s-step BiCGStab, split orthonormalized s-step BiCGStab and their modified versions with monomial basis, for different s values, compared to BiCGStab. The s-Step Iterations (SI) and the Total Iterations (TI) are shown. The number of total iterations is equal to the number of s-step iterations times s.

For well-conditioned matrices such as GCS2K, s-step BiCGStab with monomial basis converges faster as s increases from 2 to 6, and the corresponding total iterations are in the same range as those of BiCGStab. However, this is not the case for the other, ill-conditioned matrices. The convergence of s-step BiCGStab is chaotic with respect to s, and for some s values it requires more total iterations to converge than BiCGStab. As mentioned earlier, using a more stable basis such as the Newton basis could improve the convergence of s-step BiCGStab for the ill-conditioned matrices. However, for the ill-conditioned matrices, ARPACK was not able to find the requested number of eigenvalues in 3000 iterations, which is time consuming. Thus, we orthonormalize the bases for better numerical stability.

In the case of well-conditioned matrices such as GCS2K, the computed monomial basis is already numerically stable. Thus, it is expected that the convergence will not improve much by orthonormalizing the basis or by starting with p₀ ≠ r₀. This is clear in Table 3, where the convergence of split orthonormalized s-step BiCGStab, modified s-step BiCGStab, and modified split orthonormalized s-step BiCGStab is in the same range as that of s-step BiCGStab.

On the other hand, the convergence behavior of the s-step BiCGStab methods for the ill-conditioned matrices varies. For SPE10, s-step BiCGStab converges in fewer s-step iterations, as s increases from 2 to 6. Then, orthonormalizing the basis and/or starting with p₀ ≠ r₀ improves the numerical stability of the basis, leading to a better convergence. Note that orthonormalizing the basis (split ortho s-step BiCGStab) has a larger effect on convergence than starting with a p₀ ≠ r₀ (modified s-step BiCGStab).

For the HIS matrix, the convergence of s-step BiCGStab fluctuates as s increases, whereas the other s-step BiCGStab methods have a strictly decreasing convergence with respect to s. Starting with p₀ ≠ r₀ (modified s-step BiCGStab) improved the convergence of s-step BiCGStab (except for s = 3). Moreover, the convergence results of modified s-step BiCGStab and modified split orthonormalized s-step BiCGStab are very similar for s ≥ 4.

CantaF3 is a special case where the modified split orthonormalized s-step BiCGStab method was the only method that had a stable convergence as s increased from 2 to 5. All the other s-step BiCGStab methods had a chaotic convergence with respect to s.

Note that in some cases the number of total iterations of the s-step BiCGStab variants is greater than that of BiCGStab. However, the communication cost of each s-step iteration is much less than that of s iterations of BiCGStab. Thus, the s-step variants are still expected to converge faster, in terms of runtime, in parallel implementations.

In general, we can say that for well-conditioned matrices, the s-step BiCGStab methods have a similar rate of convergence. However, for the ill-conditioned matrices, orthonormalizing the bases separately and starting with a p₀ ≠ r₀ have positive effects on the stability of the basis which speeds up and stabilizes the convergence with respect to s.

As mentioned earlier, several linear systems with different degrees of difficulty are solved throughout a basin or reservoir simulation using the same method (BiCGStab). Our goal is to speed up the convergence of BiCGStab by replacing it with an s-step version that reduces communication. Based on the presented convergence results, the modified split orthonormalized s-step BiCGStab method seems to be the most stable version with respect to s for both well-conditioned and ill-conditioned matrices. The reason we focus on the convergence behavior as s increases, rather than on the convergence behavior for a given s value, is that the s value is fixed throughout the simulation, and the effect of the chosen s value on the convergence for the different linear systems is not known beforehand. Thus, we seek a robust method that will converge faster as s increases to some upper limit (5 or 6).

3.3 Computation and Communication Cost

Table 4 presents the number of flops performed in one s-step iteration of split orthonormalized s-step BiCGStab and of s-step BiCGStab, along with the number of flops performed in s iterations of BiCGStab.

Table 4

The number of flops performed in one iteration of the sequential (modified) split orthonormalized s-step BiCGStab and s-step BiCGStab with monomial basis, and the corresponding flops for computing s iterations of BiCGStab.

In the (modified) split orthonormalized s-step BiCGStab, two QR factorizations of the matrices P_2s+1 and R_2s need to be performed; however, the Gram-like matrix G is not computed. Thus, the number of flops in the (modified) split orthonormalized s-step BiCGStab is slightly less than the number of flops in the s-step BiCGStab, as shown in Table 4. As discussed in [5], the only communication that occurs in the parallel implementation of s-step BiCGStab is in the construction of the basis and of the matrix G. Similarly, for the parallel implementation of the (modified) split orthonormalized s-step BiCGStab, only the construction of the basis and its orthonormalization using TSQR require communication. Note that both TSQR and the computation of G require log(p) messages and sending $O((4s+1)^2 \log(p))$ words, where p is the number of processors. Thus, in terms of computed flops, sent messages and sent words, the (modified) split orthonormalized s-step BiCGStab and the s-step BiCGStab are equivalent. Note that the only difference between the modified split orthonormalized s-step BiCGStab and the split orthonormalized s-step BiCGStab is that Algorithm 5 is called once before the while loop. Hence, we may assume that the communication and computation costs of both methods are bounded by the same value.

On the other hand, the number of flops performed in one iteration of s-step BiCGStab and of split orthonormalized s-step BiCGStab is at least twice the number of flops performed in s iterations of BiCGStab. However, the number of sent messages in the s-step versions is reduced by a factor of O(s), at the expense of increasing the number of sent words. Therefore, a speedup is expected in the parallel implementations of the introduced s-step BiCGStab variants.
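The O(s) reduction in messages can be made concrete with a rough count. The sketch below is only a back-of-the-envelope model under loud assumptions: each global reduction is taken to cost ⌈log₂(p)⌉ messages on the critical path, the number of reductions per classical BiCGStab iteration and per s-step iteration are free, hypothetical inputs (they are not taken from the paper), and the neighbor communication of the matrix powers kernel is ignored. The function name `compare_messages` is illustrative.

```python
import math

def compare_messages(p, s, reductions_per_bicgstab_iter, reductions_per_sstep_iter):
    """Rough reduction-message counts for s BiCGStab iterations vs one s-step iteration.

    Both `reductions_per_*` arguments are hypothetical inputs, not values from the paper.
    Returns (classical_messages, sstep_messages, sstep_words_order).
    """
    log_p = math.ceil(math.log2(p))
    classical = s * reductions_per_bicgstab_iter * log_p
    sstep = reductions_per_sstep_iter * log_p
    # Order of magnitude of the words sent by TSQR or by the computation of G.
    words = (4 * s + 1) ** 2 * log_p
    return classical, sstep, words

# Example with assumed inputs: 1024 processes, s = 4, 3 reductions per classical
# iteration, 2 reductions per s-step iteration (one per QR factorization).
classical, sstep, words = compare_messages(1024, 4, 3, 2)
print("messages for s BiCGStab iterations:", classical)
print("messages for one s-step iteration :", sstep)
print("words sent per s-step iteration (order):", words)
```

Under these assumed inputs the ratio of message counts grows linearly with s, while the word count grows like (4s + 1)², which is the trade-off described above.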

Conclusion

In this paper, we have introduced the split orthonormalized s-step BiCGStab and the modified split orthonormalized s-step BiCGStab, variants of s-step BiCGStab where the basis vectors are orthonormalized. In addition, in the modified split orthonormalized s-step BiCGStab, we perform one iteration of BiCGStab to define a p₀ not equal to r₀.

We have studied the convergence behavior of the introduced methods with monomial basis, and compared it to that of the s-step BiCGStab for the matrices obtained from reservoir simulations. For almost all the tested matrices, the modified split orthonormalized s-step BiCGStab with monomial basis converged faster than the s-step BiCGStab for s = 2, …, 6. Moreover, for ill-conditioned matrices, the convergence behavior of the modified split orthonormalized s-step BiCGStab is similar to that of the BiCGStab method for s = 2, …, 6, unlike that of the s-step BiCGStab.

All the s-step BiCGStab versions send O(s) times fewer messages than s iterations of BiCGStab. Moreover, the computation cost of the introduced variants is slightly less than that of the s-step BiCGStab. Hence, it is expected that the introduced s-step BiCGStab methods, specifically the modified split orthonormalized s-step BiCGStab, will perform well in parallel on multi-core architectures.

As future work, we would like to implement the split orthonormalized s-step BiCGStab and the modified split orthonormalized s-step BiCGStab in parallel, and compare their runtimes to those of the parallel BiCGStab and s-step BiCGStab.

References

1. Saad Y., Schultz M.H. (1986) GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput. 7, 3, 856–869.
2. Hestenes M.R., Stiefel E. (1952) Methods of conjugate gradients for solving linear systems, J. Res. Natl. Bur. Stand. 49, 409–436.
3. van der Vorst H.A. (1992) Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Statist. Comput. 13, 2, 631–644.
4. Carson E., Knight N., Demmel J. (2011) Avoiding communication in two-sided Krylov subspace methods, Technical Report UCB/EECS-2011-93, EECS Department, University of California, Berkeley, August.
5. Carson E., Knight N., Demmel J. (2013) Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods, SIAM J. Sci. Comput. 35, 5, S42–S61.
6. Grigori L., Moufawad S. (2013) Communication avoiding ILU0 preconditioner, Technical Report, ALPINES - INRIA Paris-Rocquencourt, March.
7. Hoemmen M. (2010) Communication-avoiding Krylov subspace methods, PhD Thesis, EECS Department, University of California, Berkeley.
8. Mohiyuddin M., Hoemmen M., Demmel J., Yelick K. (2009) Minimizing communication in sparse matrix solvers, in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC’09, New York, NY, USA, ACM, pp. 1–12.
9. Chronopoulos A.T., Gear W. (1989) s-Step iterative methods for symmetric linear systems, J. Comput. Appl. Math. 25, 2, 153–168.
10. Erhel J. (1995) A parallel GMRES version for general sparse matrices, Electron. Trans. Numer. Anal. 3, 160–176.
11. Walker H.F. (1988) Implementation of the GMRES method using Householder transformations, SIAM J. Sci. Statist. Comput. 9, 1, 152–163.
12. Demmel J., Hoemmen M., Mohiyuddin M., Yelick K. (2008) Avoiding communication in sparse matrix computations, in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium, 14–18 April 2008, Miami, Florida, USA, pp. 1–12.
13. Demmel J., Grigori L., Hoemmen M., Langou J. (2012) Communication-avoiding parallel and sequential QR factorizations, SIAM J. Sci. Comput. 34, 206–239.
14. Bai Z., Hu D., Reichel L. (1994) A Newton basis GMRES implementation, IMA J. Numer. Anal. 14, 563–581.
15. Reichel L. (1990) Newton interpolation at Leja points, BIT Numerical Mathematics 30, 332–346.
16. Lehoucq R., Sorensen D., Yang C. (1998) ARPACK Users’ Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA.
17. The 10th SPE Comparative Solution Project (2000) Retrieved from http://www.spe.org/web/csp/datasets/set02.htm.
18. Anciaux-Sedrakian A., Eaton J., Gratien J., Guignon T., Havé P., Preux C., Ricois O. (2015) Will GPGPUs be finally a credible solution for industrial reservoir simulators, SPE Reservoir Simulation Symposium, 23-25 February, Houston, Texas, USA, SPE-173223-MS. DOI: 10.2118/173223-MS.
19. Anciaux-Sedrakian A., Gottschling P., Gratien J., Guignon T. (2014) Survey on efficient linear solvers for porous media flow models on recent hardware architectures, Oil Gas Sci. Technol. - Rev. IFP 69, 4, 753–766.

Cite this article as: A. Anciaux-Sedrakian, L. Grigori, S. Moufawad and S. Yousef (2016). S-Step BiCGStab Algorithms for Geoscience Dynamic Simulations, Oil Gas Sci. Technol 71, 66.
