S-Step BiCGStab Algorithms for Geoscience Dynamic Simulations

— In basin and reservoir simulations, the most expensive and time consuming phase is solving systems of linear equations using Krylov subspace methods such as BiCGStab. For this reason, we explore the possibility of using communication avoiding Krylov subspace methods ( s-step BiCGStab), that speedup of the convergence time on modern-day architectures, by restructuring the algorithms to reduce communication. We introduce some variants of s-step BiCGStab with better numerical stability for the targeted systems. Résumé


INTRODUCTION
Many scientific problems require the solution of systems of linear equations of the form Ax = b, where the input matrix A is very large and sparse.These systems arise mainly from the discretization of Partial Differential Equations (PDE), and are usually solved using Krylov subspace methods, such as Generalized Minimal RESidual (GMRES) [1], Conjugate Gradient (CG) [2] and Bi-Conjugated Gradient Stabilized (BiCGStab) [3].
In the case of basin modeling or reservoir simulations with highly heterogeneous data and complex geometries, complex non-linear systems of PDE are solved.These PDE are discretized with a cell-centered finite volume scheme in space, leading to a non-linear system which is solved with an iterative Newton solver.At each Newton step, the system is linearized.Then, the generated large, sparse and unstructured linear system is solved using preconditioned GMRES, BiCG-Stab, CG, Orthomin or other preconditioned iterative methods.Some of the most commonly used preconditioners are ILU(k), ILUT, AMG and CPR-AMG.This resolution phase constitutes the most expensive part of the simulation.Thus we focus on linear solvers, since their efficiency is a key point for the simulator's performance.Furthermore, modern parallel computing resources are based on complex hardware architecture.They are composed of several multi-core processors and massively parallel processing units such as many-cores or General-Purpose GPU (GPGPU) cards.Most of the current algorithms are not able to fully exploit the highly parallel architectures.In fact, a severe degradation of performance is detected when the number of processing units is increased.This is due to the difference between the required time to perform floating point operations (flops) by processing units and the time to communicate the obtained results, where flops have become much cheaper than data communication.
Thus, recent research has focused on reformulating dense and sparse linear algebra algorithms with the aim of reducing and avoiding communication.These methods are referred to as communication avoiding methods, whereby communication refers to data movement between different processing units in parallel, and different levels of memory hierarchy.In the case of Krylov subspace methods, the introduced communication avoiding Krylov subspace methods [4][5][6][7][8] are based on s-step methods [9][10][11].The goal is to restructure the algorithms to perform s iterations at a time by using kernels that avoid or reduce communication, such as the matrix powers kernel [12], Tall and Skinny QR (TSQR) [13], and Block Gram Schmidt (BGS) methods.
Our aim is to reduce the overall cost of the linear solver resolution phase in geoscience simulations, specifically basin and reservoir simulations, using a parallel implementation of BiCGStab that avoids communication on multi-core hardware (CA-BiCGStab) [5] and has a similar convergence behavior as the classical BiCGStab method.Communication Avoiding BiCGStab (CA-BiCGStab), which was introduced in [5], is a reformulation of BiCGStab into s-step BiCGStab that avoids communication.Thus, in this paper we study the convergence behavior of a sequential version of the unpreconditioned s-step BiCGStab [5], on matrices obtained from reservoir simulations, with different s values.The obtained results show that, for most of the tested matrices, s-step BiCGStab requires more iterations to converge than BiCGStab.Thus, we design new variants of s-step BiCGStab, that have the same convergence rate as BiCGStab for s values between 2 and 6, and reduce communication similarly to s-step BiCGStab.
In Section 1, we introduce BiCGStab, its reformulation to s-step BiCGStab [5], and we discuss the performance of s-step BiCGStab in geoscience applications, specifically reservoir simulations.Then, in Section 2, we introduce the new s-step BiCGStab variants that we call orthonormalized s-step BiCGStab, split orthonormalized s-step BiCGStab, and modified split orthonormalized s-step BiCGStab.In Section 3, we present the convergence results of the newly introduced s-step BiCGStab variants and compare them to that of s-step BiCGStab.Finally, we conclude.

BiCGStab
The Bi-Conjugate Gradient Stabilized method (BiCGStab), introduced by van der Vorst in 1992 [3], is an iterative Krylov subspace method that solves the general systems Ax = b.It is a variant of the Bi-Conjugate Gradient (BiCG) method that aims at smoothing BiCG's erratic convergence.At each iteration m !0, r m+1 = P m+1 (A)r 0 is replaced by r m+1 = Q m+1 (A)P m+1 (A)r 0 where Q m+1 (z) 2 P m+1 and P m+1 (z) 2 P m+1 are polynomials of degree m + 1. Q m+1 (z) is chosen to be where x m minimizes the norm of r m+1 .BiCGStab, being a variant of BiCG, has a similar form.But the recurrence relations of x m+1 , r m+1 , p m+1 , a m and b m are different.
The scalars a m and b m are defined as follows: where r0 is chosen such that < r0 ; r 0 >6 ¼ 0 as shown in Algorithm 1.In general r0 is set equal to r 0 .As for x m , it is defined by minimizing the norm of the residual r m+1 , i.e. where In addition, we have that for m > 0 and more generally for m !0 and j > 0 Algorithm 1: BiCGStab Input: A, b, x 0 , m max : the maximum allowed iterations Output: x m : the mth approximate solution satisfying the stopping criteria 1: Let r 0 = b À Ax 0 , p 0 = r 0 , q 0 = h r 0 , r 0 i, and m = 0 2: Choose r0 such that 0 ¼ r0 ; r 0 h i6 ¼ 0. x m = h t, s i / h t, t i 8: x m+1 = x m + a m p m + x m s 9: q m+1 = h r m+1 , r m+1 i, m = m + 1 14: end for At each iteration of Algorithm 1, two sparse matrix-vector multiplications, six saxpy's, and five dot products are computed.Given that each processor has the scalars and its corresponding part of the vectors, then the saxpy's can be parallelized without communication.However, this is not the case for the sparse matrix-vector multiplications and the dot products, which require communication to obtain the desired results.Such operations cause a severe performance degradation, especially when using modern computing resources.

S-Step BiCGStab
To reduce the communication in parallel and sequential implementations of BiCGStab, Carson et al. [5] introduced the s-step version of BiCGStab.The reformulation is based on the computation of s BiCGStab iterations at once, and on the fact that for m !0 and 1 j s since K 2jþ1 ðA; zÞ K 2sþ1 ðA; zÞ for any z 6 ¼ 0. The goal is to perform more flops per communication, by computing 2s matrix-vector products at the beginning of each iteration of the s-step BiCGStab.This would reduce the communication cost, specifically the number of messages, by O(s) times in parallel [5].However, this is not possible using the same formulation as BiCGStab.Therefore, at the beginning of each s-step iteration, one computes P 2s+1 and R 2s , the Krylov matrices corresponding to the K 2sþ1 ðA; p m Þ and K 2s ðA; r m Þ bases respectively, where m = 0, s, 2s, 3s, . ...Then, by Equation ( 8), p m+j , r m+j , and x m+j À x m can be defined as the product of the basis vectors and a span vector, for j = 0, . .., s, where [P 2s+1 , R 2s ] is an n 9 (4s +1) matrix containing the basis vectors of K 2sþ1 ðA; p m Þ and K 2s (A, r m ), and a j , c j, and e j are span vectors of size 4s + 1.Note that e 0 = 0.As for a 0 and c 0 , their definition depends on the type of computed basis.One can compute a basis defined by a recurrence relation with three or less terms, such as a monomial, scaled monomial, Newton, or Chebyshev basis.Then, we have that where T 2s+1 and T 2s are change of basis matrices of size (2s + 1) 9 (2s) and (2s) 9 (2s À 1) respectively, and ! is a (4s + 1) 9 (4s + 1) matrix.
Then we get the following, with e 0 = 0, a 0 = [1, 0 4s ] T , and c 0 = [0 2s+1 , 1, 0 2sÀ1 ] in the case of monomial and Newton basis.Then, the span vectors, a j , c j , and e j are updated for j = 1, . .., s, rather than p m+j , r m+j , and x m+j which are of size n 4s + 1.
As for a m+j , d m+j+1 , and x m+j , it is sufficient to replace r m+j and p m+j by definitions ( 10) and ( 9) to obtain

Algorithm 2: s-step BiCGStab
Input: A, b, x 0 , m max , s, Type of Basis Output: x m : the mth approximate solution satisfying the stopping criteria 1: Let r 0 = b À Ax 0 , p 0 = r 0 , q 0 = h r 0 , r 0 i, and k = 0 2: Choose Compute P 2s+1 and R 2s depending on Type of 5: Basis, and output the diagonals of T 0 6: Initialize a 0 , e 0 , c 0 and set m = k * s 8: for (j = 0 to s À 1) Do 9: t a = T 0 a j 10: g d = Gd 14: g t = Gt d 15: x m+j = h t d , g d i / h t d , g t i 16: e j+1 = e j + a m+j a j + x m+j d 17: c j+1 = c j À x m+j t d À a m+j t a 18: end for 22: p m+s = [P 2s+1 , R 2s ]a s , r m+s = [P 2s+1 , R 2s ]z s 23: x m+s = x m + [P 2s+1 , R 2s ]e s 24: q m+s = h r m+s , r m+s i, k = k + 1 25: end While Algorithm 2 is a reformulation of BiCGStab that reduces communication where the matrix-vector multiplications are grouped at the beginning of the outer iteration, and the Gram-like matrix G is computed once per outer iteration.Then, in the inner iterations, the vector operations of size n are replaced by vector operations of size 4s + 1, where 4s + 1 ( n.However, this reformulation alone is not sufficient to reduce communication.
For example, in the sequential case the basis computation should be done using the matrix powers kernel [12], where p mþjþ1 is computed by parts that fit into cache memory, for j = 0, . .., 2s À 1.This reduces communication in the memory hierarchy of the processor and increases cache hits.In the parallel case, each processor fetches, at the beginning, the needed data from neighboring processors to compute its assigned part of the 2s vectors p mþjþ1 without any communication, for j = 0, . .., 2s À 1.Similarly, we can compute the 2s À 1 vectors r mþjþ1 for j = 0, . .., 2s À 2. Note that it is possible to compute the two bases simultaneously using a block version of the matrix powers kernel that computes a block of vectors without communication.

S-Step BiCGStab for Geoscience Applications
In geoscience applications, specifically in reservoir simulations, at each time step a new linear system of the form Ax = b has to be solved.The difficulty and the ill-conditioning of the systems may vary throughout the simulation.However, in most cases, an iterative method and a preconditioner are chosen at the beginning of the simulation and are used for solving all the obtained linear systems.Since the obtained systems are not symmetric, Krylov subspace methods such as BiCGStab are used.Our aim is to implement a numerically stable version of s-step BiCGStab that has a similar convergence rate as BiCGStab for the reservoir simulations systems.The stability of s-step BiCGStab is related to the chosen s value and to the type of the basis.
Thus, we study the convergence of the s-step BiCGStab (Algorithm 2) method for different s values, using the monomial and Newton basis [7,14,15] and compare it to BiCG-Stab's convergence.We do not consider the scaled versions of the monomial and Newton basis, since this requires the computation of the norm of each vector at a time, which annihilates the possibility of avoiding communication in the matrix powers kernel.The test matrices, described in Section 3.1, are obtained from different reservoir simulations.
A sample of the obtained results is described in Section 3.2 (Tab.3).For the well-conditioned matrices, the s-step BiCGStab with the monomial basis converges in fewer s-step iterations as s increases from 2 to 6.But for the ill-conditioned matrices, the convergence of the s-step BiCGStab with monomial basis is chaotic with respect to s.Note that, we focus on the consistency of the convergence behavior as s increases for the following reasons.First, as s increases, the communication cost per s-step BiCGStab iteration decreases, since more flops are performed per communication.Moreover, if the convergence of s-step BiCGStab improves as s increases, then the overall communication cost is decreased which should lead to a speedup in parallel implementations.Second, although we test the convergence of s-step BiCGStab on matrices obtained from reservoir simulations, our goal is the speedup obtained in the full simulation.As mentioned earlier, the obtained linear systems could be well-conditioned or ill-conditioned.However, at the beginning of the simulation a single s value is chosen without having any information on the systems to be solved.Thus, we would like to implement an s-step BiCGStab version for which the number of iterations needed for convergence decreases as s increases to some upper limit.
An alternative to the monomial basis is the Newton basis, which is known to be more stable [7] in the case of GMRES.However, in the case of s-step BiCGStab, computing a Newton basis is expensive.First, the 2s largest eigenvalues have to be computed once per matrix using some library such as ARPACK [16].In general, the computed eigenvalues could be almost equal, which does not improve the stability of the basis.Thus, the eigenvalues are reordered in the Leja ordering as discussed in [7].For the wellconditioned matrices obtained in geoscience simulations, using the Newton basis did not improve the convergence of s-step BiCGStab.Moreover, for the ill-conditioned matrices, finding the desired number of eigenvalues is time consuming.In addition, in geoscience simulations, we seek a relatively "cheap" and stable basis, since several linear systems are solved during the simulation (at least one per time step).For all these reasons, we will use the monomial basis and improve its numerical stability, as discussed in the next section.

ORTHONORMALIZED S-STEP BICGSTAB
The s-step BiCGStab method with the monomial basis, has an irregular convergence with respect to the s values, and converges slower than BiCGStab for some of the tested systems.This irregular and slow convergence might be due to the fact that the estimated residual used for stopping criterion is not equal to the exact residual [4].However, in our case, the slow convergence is caused by the basis vectors which become numerically linearly dependent.One way to improve the stability of the basis is by orthonormalizing it.We propose a new variant of s-step BiCGStab that orthonormalizes the basis vectors.This new version is referred to as orthonormalized s-step BiCGStab (Algorithm 3).
There are several ways of constructing the basis and orthonormalizing it.However, we derive the algorithm irrespective of the method used.We replace the 4s + 1 basis vectors [P 2s+1 , R 2s ] by an orthonormal basis Q 4s+1 .Then, the vectors p m+j , r m+j , and x m+j can be defined as follows, The n 9 (4s + 1) orthonormal matrix and H 4s+1 is a (4s + 1) 9 (4s + 1) upper Hessenberg matrix.Then, for j = 0 to s À 1 we get By replacing the definitions ( 22)-(27) in Equations ( 1)-(3), we get that with e 0 = 0.As for a 0 and c 0 , their definitions depend on the orthonormalization technique used.We will discuss this in Section 2.1.
As for a m+j and d m+j+1 , it is sufficient to replace r m+j and p m+j by definitions ( 22) and ( 23) to obtain where g ¼ Q T 4sþ1 r0 .Similarly for x m+j , we get Algorithm 3 describes the orthonormalized s-step BiCG-Stab method except for the orthonormal basis construction phase, which we discuss in Section 2.1.
Compute the orthonormal basis Q 4s+1 and 5: the upper Hessenberg matrix H 4s+1 6: Compute Initialize a 0 , e 0 , c 0 and set m = k * s 8: for (j = 0 to s À 1) Do 9: h a = H 4s+1 a j 10: e j+1 = e j + a m+j a j + x m+j d 15: c j+1 = c j À x m+j h d À a m+j h a 16: end for 20: p m+s = Q 4s+1 a s , r m+s = Q 4s+1 z s , x m+s = x m + Q 4s+1 e s 21: q m+s = h r m+s , r m+s i, k = k + 1 22: end While

Construction of the Orthonormal Basis
The simplest parallelizable way to compute the orthonormal basis Q 4s+1 is to compute first [P 2s+1 , R 2s ] using the matrix powers kernel.Then, orthonormalize it using a QR algorithm, such as the Tall and Skinny QR (TSQR) algorithm [13] that requires log (p) messages in parallel, where p is the number of processors.In this case, where U 4s+1 is the (4s + 1) 9 (4s + 1) upper triangular matrix obtained from the QR factorization of [P 2s+1 , R 2s ].In addition, where T 0 is the change of basis matrix.By replacing [P 2s+1 , R 2s ] by Q 4s+1 U 4s+1 , obtained from the QR factorization, and by assuming that U 4s+1 is invertible, we get: 4sþ1 is a (4s + 1) 9 (4s + 1) upper Hessenberg matrix, and Note that the matrix H 4s+1 is never constructed and multiplying H 4s+1 by a vector is equivalent to solving an upper triangular system and multiplying T 0 and U 4s+1 by a vector.
However, there are two issues to take into consideration in the construction of the orthonormal basis.First, at iteration k !0, we compute two bases P 2s+1 and R 2s , for two different subspaces K 2sþ1 ðA; p k Þ and K 2s ðA; r k Þ.There is no guarantee that the two bases are linearly independent with respect to each others.In other words, the 4s + 1 vectors obtained are not necessarily linearly independent.Moreover, the upper triangular matrix obtained from the orthonormalization of a linearly dependent set of vectors, is not invertible.Second, at iteration k = 0, we compute two bases of K 2sþ1 ðA; r 0 Þ and K 2s ðA; r 0 Þ subspaces, since p 0 is initialized to r 0 , as in the BiCGStab and s-step BiCGStab algorithms.
A solution to the first problem, is to perform a split orthonormalization, where P 2s+1 = Q 2s+1 U 2s+1 and R 2s = Q 2s U 2s are orthonormalized separately.Then, where H 4sþ1 ¼ U 4sþ1 T 0 U À1 4sþ1 .But Q 4s+1 is not orthonormal, only Q 2s+1 and Q 2s are orthonormal.Note that in the derivation of the orthonormalized s-step BiCGStab in Section 2, we only need that Q 4s+1 is orthonormal for the definition of x m+j .Moreover, x m+j is obtained by minimizing the L2 norm of r m+j+1 .If instead we minimize the B norm of r m+j+1 , where B is an n 9 n matrix that satisfies Q T 4sþ1 BQ 4sþ1 ¼ I 4sþ1 , then x m+j would be defined as in Equation ( 28).We call this version split orthonormalized s-step BiCGStab (Algorithm 3).

3: While
À ffiffiffiffi ffi k p > jjbjj 2 and k < m max s Ä ÅÁ Do 4: Compute P 2s+1 and R 2s depending on Type of Basis, and output the diagonals of T 0 5: Perform the QR factorization of P 2s+1 = Q 2s+1 U 2s+1 6: and and Initialize a 0 , e 0 , c 0 and set m = k * s 11: for (j = 0 to s À 1) Do 12: h a = H 4s+1 a j 13: 17: e j+1 = e j + a m+j a j + x m+j d 18: end for 23: q m+s = h r m+s , r m+s i, k = k + 1 25: end While For the second problem, starting with a p 0 6 ¼ r 0 , might improve the convergence of all the previously discussed s-step BiCGStab versions.One might pick a random p 0 .However, its effect on the convergence of the method is unknown.Thus, to be consistent with the previously introduced BiCGStab versions, we choose to perform one iteration of BiCGStab before constructing the first 4s + 1 basis vectors.The advantage is that all the information obtained in the BiCGStab iteration are used afterwards.Algorithm 5 could be considered as a "preprocessing" step, for all the s-step algorithms, where it is performed before the while loop and replacing the first two lines in Algorithm 2, Algorithm 3, or Algorithm 4. We refer to these versions as modified s-step BiCGStab, modified orthonormalized s-step BiCGStab, and modified split orthonormalized s-step BiCGStab.

RESULTS AND EXPECTED PERFORMANCE
In this Section, we show the convergence behavior of the newly introduced split orthonormalized s-step BiCGStab and modified split orthonormalized s-step BiCGStab, on the matrices defined in Section 3.1, and compare them to that of BiCGStab, s-step BiCGStab, and modified s-step BiCG-Stab in Section 3.2.Then we discuss the split orthonormalized s-step BiCGStab's computation and communication cost in Section 3.3 and compare it to that of s-step BiCGStab and BiCGStab.

Test Matrices
The study cases presented in this paper are obtained from different representative models for reservoir simulations at different time steps.Table 1 illustrates the characteristics of the models from which the square test matrices, GCS2K, CantaF3, SPE10 [17], and HIS, are generated.
Consequently, the obtained matrices have different profiles and varying degrees of difficulty.Table 2 describes the test matrices.

Convergence Results
We have implemented the s-step BiCGStab, and the split orthonormalized s-step BiCGStab (Algorithm 4) whereby the monomial bases P 2s+1 and R 2s are first built and then orthonormalized separately using MKL's (Math Kernel Library) QR factorization.We have also implemented the corresponding modified versions, whereby one iteration of BiCGStab is performed before building the first 4s+1 bases vectors.These algorithms are developed in MCG Solver [18,19], a C++ based software package developed by IFPEN to provide linear solver algorithms for its industrial simulators in reservoir simulation, basin simulation or in engine combustion simulation.
Table 3 shows the convergence results for the s-step BiCG-Stab versions on the matrices introduced previously with tolerance tol = 10 À8 except for the SPE10 matrix (tol = 10 À4 ).The number of iterations needed for the convergence of s-step BiCGStab versions, referred to as s-Step Iterations (SI), is shown.For comparison reasons, we also show the Total number of Iterations (TI) of the s-step BiCGStab versions, which is equal to the s-step iterations times s.Note that in exact arithmetics, each s-step BiCGStab iteration is equivalent to s iterations of the classical BiCGStab method.However, in terms of computation and communication cost, they are not equivalent, as discussed in the next section.
For the well-conditioned matrices such as GCS2K, s-step BiCGStab with monomial basis converges faster as s increases from 2 to 6, and the corresponding total iterations are in the same range as that of BiCGStab.However, this is not the case for the other ill-conditioned matrices.The s-step BiCGStab's convergence is chaotic with respect to s and for some s values it requires more total iterations to converge than BiCGStab.As mentioned earlier, using a more stable basis such as the Newton basis could improve the convergence of the s-step BiCGStab for the ill-conditioned matrices.However, for the ill-conditioned matrices ARPACK was not able to find the requested number of eigenvalues in 3000 iterations, which is time consuming.Thus, we orthonormalize the bases for better numerical stability.
In the case of well-conditioned matrices such GCS2K, the computed monomial basis is already numerically stable.Thus, it is expected that the convergence will not improve much by orthonormalizing the basis or by starting with a p 0 6 ¼ r 0 .This is clear in Table 3, where the convergence of split orthonormalized s-step BiCGStab, modified s-step BiCGStab, and modified split orthonormalized s-step BiCG-Stab is in the same range as that of s-step BiCGStab.
On the other hand, the convergence behavior of the s-step BiCGStab methods for the ill-conditioned matrices varies.For SPE10, s-step BiCGStab converges in fewer s-step iterations, as s increases from 2 to 6.Then, orthonormalizing the basis and/or starting with p 0 6 ¼ r 0 improves the numerical stability of the basis, leading to a better convergence.Note that orthonormalizing the basis (split ortho s-step BiCGStab) has a larger effect on convergence than starting with a p 0 6 ¼ r 0 (modified s-step BiCGStab).
For the HIS matrix, s-step BiCGStab's convergence fluctuates as s increases.Whereas, the other s-step BiCGStab methods have a strictly decreasing convergence with respect to s. Starting with p 0 6 ¼ r 0 (modified s-step BiCGStab) improved the convergence of s-step BiCGStab (except for s = 3).Moreover, the convergence results of modified s-step BiCGStab and modified split orthonormalized s-step BiCG-Stab are very similar for s ! 4.
CantaF3 is a special case where the modified split orthonormalized s-step BiCGStab method was the only method that had a stable convergence as s increased from 2 to 5. All the other s-step BiCGStab methods had a chaotic convergence with respect to s.
Note that in some cases the total iterations of the s-step BiCGStab variants is more than that of BiCGStab.However, the communication cost of each s-step iteration is much less than that of s iterations of BiCGStab.Thus the s-step variants should still converge faster, in terms of runtime, in parallel implementations.
In general, we can say that for well-conditioned matrices, the s-step BiCGStab methods have a similar rate of convergence.However, for the ill-conditioned matrices, orthonormalizing the bases separately and starting with a p 0 6 ¼ r 0 have positive effects on the stability of the basis which speeds up and stabilizes the convergence with respect to s.
As mentioned earlier, several linear system with different degrees of difficulty are solved throughout a basin or reservoir simulation using the same method (BiCGStab).Our goal is to speedup the convergence of BiCGStab by replacing it with an s-step version that reduces communication.Based on the presented convergence results, the modified split orthonormalized s-step BiCGStab method seems to be the most stable version with respect to s for both, well-conditioned and ill-conditioned matrices.The reason we focus on the convergence behavior as s increases, rather than the convergence behavior for a given s value, is that the s value is fixed throughout the simulation.And the convergence effect of the chosen s value on the different linear systems is not known beforehand.Thus, we seek a robust method that will converge faster as s increases to some upper limit (5 or 6).

Computation and Communication Cost
In Table 4, the number of flops performed in one s-step iteration of orthonormalized s-step BiCGStab and s-step BiCGStab is presented, along with the number of flops performed in s iterations of BiCGStab.
In the (modified) split orthonormalized s-step BiCGStab there is a need for performing two QR factorizations of the matrices P 2s+1 and R 2s , however Gram-like matrix G is not computed.Thus, the computed flops in the (modified) split orthonormalized s-step BiCGStab is slightly less than the computed flops in the s-step BiCGStab as shown in Table 4.As discussed in [5], the only communication that occurs in the parallel implementation of s-step BiCGStab is in the construction of the basis and the matrix G. Similarly for the parallel implementation of the (modified) split orthonormalized s-step BiCGStab, only the construction of the basis and its orthonormalization using TSQR require communication.Note that both TSQR and the computation of G require log (p) messages and sending Oðð4s þ 1Þ 2 logðpÞÞ words where p is the number of processors.Thus in terms of computed flops, sent messages and sent words, the (modified) split orthonormalized s-step BiCGStab and the s-step BiCGStab are equivalent.Note that the only difference between the modified split orthonormalized s-step BiCG-Stab and the split orthonormalized s-step BiCGStab is that Algorithm 5 is called once before the while loop.Hence, we may assume that the communication and computation cost of both methods is bounded by the same value.
On the other hand, the performed flops in one iteration of s-step BiCGStab and orthonormalized s-step BiCGStab is at least twice the flops performed in s iterations of BiCGStab.However, the number of sent messages in the s-step versions is reduced by a factor of O(s), at the expense of increasing the number of sent words.Therefore, it is expected to obtain speedup in the parallel implementations of the introduced s-step BiCGStab variants.

CONCLUSION
In this paper, we have introduced the split orthonormalized s-step BiCGStab and the modified split orthonormalized s-step BiCGStab, variants of s-step BiCGStab where the basis vectors are orthonormalized.In addition, in the modified split orthonormalized s-step BiCGStab, we perform one iteration of BiCGStab to define a p 0 not equal to r 0 .
We have studied the convergence behavior of the introduced methods with monomial basis, and compared it to that of the s-step BiCGStab for the matrices obtained from reservoir simulations.For almost all the tested matrices, the modified split orthonormalized s-step BiCGStab with monomial basis, converged faster than the s-step BiCGStab for s = 2, . .., 6.Moreover, for ill-conditioned matrices, the modified split orthonormalized s-step BiCGStab has a similar convergence behavior as the BiCGStab method for s = 2, . .., 6, unlike the s-step BiCGStab.
All the s-step BiCGStab versions send O(s) times less messages than s iterations of BiCGStab.Moreover, the computation cost of the introduced variants is slightly less than that of the s-step BiCGStab.Hence, it is expected that the introduced s-step BiCGStab methods, specifically modified split orthonormalized s-step BiCGStab, will perform well in parallel on multi-core architectures.
As a future work, we would like to implement the split orthonormalized s-step BiCGStab and the modified split orthonormalized s-step BiCGStab, in parallel and compare its runtime to that of the parallel BiCGStab and s-step BiCGStab.

TABLE 2
The reservoir models.

TABLE 3
The convergence of the s-step BiCGStab, split orthonormalized s-step BiCGStab and their modified versions with monomial basis, for different s values, compared to BiCGStab.The s-Step Iterations (SI) and the Total Iterations (TI) are shown.The number of total iterations is equal to the number of s-step iterations times s.