Dossier: SimRace 2015: Numerical Methods and High Performance Computing for Industrial Fluid Flows
Open Access
Issue
Oil Gas Sci. Technol. – Rev. IFP Energies nouvelles
Volume 71, Number 5, September–October 2016
Dossier: SimRace 2015: Numerical Methods and High Performance Computing for Industrial Fluid Flows
Article Number 63
Number of page(s) 9
DOI https://doi.org/10.2516/ogst/2016018
Published online 21 October 2016

© V. Seignole et al., published by IFP Energies nouvelles, 2016

Licence Creative Commons
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The industrial problem we deal with is the management of geological natural gas storage utilities, composed of underground storage reservoir and its associated surface facilities. Efficiently managing and operating such utilities requires a thorough modeling of both underground two-phases flows and surface systems.

Regarding the reservoir simulation, many concerns are shared with the oil and gas industry, but several ones are emphasized in the context of natural gas storage:

  • underground flows are coupled with highly-varying boundary conditions at well heads,

  • long historical production data must be taken into account for the history matching phase (usually 50 years, with e.g weekly production data variability).

These points lead to long unitary computation times of the simulator: despite the use of moderate size meshes (considering 100 000 ~ 200 000 cells as order of magnitude), time steps must be maintained small enough in accordance with timed-data needed to be matched. This often yields, with a non-parallel version of our simulator, simulations spanning several days.

To enable accurate reservoir simulator predictions, finer quantified knowledge of reservoir attributes, known originally with a given level of uncertainties, must be obtained. This is achieved by history matching of reservoir models, allowing to refine our knowledge regarding distribution of porosities, permeabilities, position of geological faults. This is a very iterative task (manually or automatically realized), but in any case this is costly in simulation time.

The paper is organized as follows: we first provide some figures on the physical and numerical modeling implemented in the Multi reservoir simulator, as well as figures regarding typical simulations performed through sequentially resolution methods. Then, we describe how the solver component, main bottleneck in the runtime performance of simulator, could be considered for improvement, by considering improved parallel methods, in the continuity of the existing solvers in our simulator: the IDR(s) family of methods for solving large sparse linear systems. The next sections will focus on comparing and analyzing implementations of the solver implemented for CPU (Central Processing Unit), GPU (Graphical Processing Units), and cluster platform. We will emphasize the relevance of GPU for our main concerns: industrial 3D-field case reservoir simulation running time reduction. We will then conclude with some perspectives.

1 Physical and Numerical Modeling in MULTI

Multi is a specialized reservoir simulator, taking into account aspects relevant to natural gas storage activities in deep porous aquifers. It supports two-phase, multi component miscible gas compositional, reactive flows in porous media (water and natural gas).

We present in this section a short highlight of Multi modeling features.

1.1 Problem Modeling

The problem we consider is modeled first by its geometry: the reservoir geometry, and the wells locations. The reservoir is characterized by: its top and bottom, sides, not necessary with a uniform elevation and including geological faults. The wells are characterized by their position at surface, and their intersection with the reservoir.

The other reservoir attributes are porosities, permeabilities, relative permeabilities (to water and gas) and capillary pressure (P c = P g − P w).

For the sake of simplification, we do not detail the compositional modeling hereafter, and concentrate on the basic two-phase flow formulation. The porous transport of each phase is taken into account by one transport equation for each phase:andwhere ϕ denotes the medium porosity, ρ w , ρ g the densities for the water and gas phases, S g, S w the saturations of the respective phases, which obey S g + S w = 1. U g and U w correspond to the phase velocities, and q g and q w are source terms. These velocities are in turn closed by Darcy’s law for each phase, and the densities are closed via real thermodynamic state laws: ρ g = ρ g (Ρ g , T), ρ w = ρ w (Ρ w , T), where Ρ g, Ρ w denote each phase pressure (related by the capillary pressure), and T the temperature (supposed constant and common to the two phases).

This yields two primary unknowns: the gas saturation and its pressure (S g, P g).

1.2 Space/Time Discretisation

The reservoir is discretised by a mesh built from layers of elementary regular hexaedral volumes, which can be locally refined. Generally, our meshes are built using a 3D Geomodeler.

The space discretisation is based on a robust cell-centered finite-volume scheme.

The time discretisation is implicit, and handled by an implicit Euler time-marching scheme, and each time step is solved by a Newton-Raphson iterative process, involving a sparse linear system. The linear systems are (originally) solved by an Induced Dimension Reduction (IDR) [1] preconditioned by an incomplete LU factorization, which is ILU(k), k corresponding to a L and U filling level [2].

1.3 Sparse Matrix Characteristics and Profile

During the iterations described above, sparse linear systems arise, according to the problem discretisation: the matrices are obtained by integrating derivatives of fluxes between cells, source terms contributions. Cells are initially re-numbered, and the matrix lines are normalized, yielding diagonal values constant equal to 1.

In Figure 1, which presents two matrices via their profiles (corresponding to two actual 3D-field case reservoir models).

thumbnail Figure 1.

Example sparse matrices profiles.

The first one (issued from a standard two-phase model) has 136 710 lines, 918 840 non-zero coefficients, and the second one (issued from a compositional model simulation) has 730 493 lines, 18 million non-zero coefficients.

In the following, we will consider two leading linear systems based on these matrices described and we will denote them as system 1 (using the first matrix), and system 2 (using the second matrix).

These matrices are non-symmetric and not diagonally dominant. For typical simulations, the condition number of the matrices can be up to an order of magnitude of 105. This condition number depends upon several parameters: the iteration time-steps, the geometry, as well as the flow configuration.

In practice, we consider few reference meshes for reservoirs but coming from actual industrial 3D field-case data sets geologically consistent with more than hundreds km2 global areal extension. Geological consistency of the reservoir gridding is a key point to ensure the simulation runs will present consistent and predictive results provided a good fit of the reservoir history production. Practically we did seek to improve computational performance in this context: hence the size of the problems we consider is directly the one defined and daily used by the practice of geologists and reservoir engineers. As the linear systems we are confronted to along our reservoir simulations are of moderate size, it is highly challenging to efficiently parallelize in this range of moderate data volumes and obtain significant running time reduction of simulations when comparing to sequential solver running time of the same reservoir data set conditions and constraints. Indeed, when decomposing the work to handle resolution in parallel, we easily reach low-granularity local computations, whose execution time reach the same order of magnitude than the communication and synchronization costs.

It is to be noted that having reached a certain ratio of performance on these industrial 3D field-case data sets, one may seek for the model size limit that shows no more running time performance reduction but still giving accurate and predictive results. However the main drawback of these downsizing scale of the reservoir dataset is the geological likelihood of the refined datasets. As petrophysical reservoir properties are obtained from interpolated well data measurements with the care of preserving the geostatistical coherence of the reservoir properties (geological facies correlation length). Highly refined grids are in general rather conceptual datasets far from a reservoir geological consistency.

2 Selected Family of Solvers

In order to improve our sequential linear solver component performance, we evaluated, as an evolution to the existing one, the family of IDR(s) solvers [3] to take over the sparse linear systems resolution in Multi, and realize them with parallel implementations.

More precisely, we considered both the IDR(s) [3] and IDR(s)-biortho [4] / IDR(s)-minsync [5] method for implementation.

Let us consider the following notations for the linear systems:where A is a squared sparse matrix of size n × n, x is the unknown (vector of size n), and b the right hand side.

2.1 IDR(s) Base Method

The IDR(s) method is an extension of the IDR method. We direct the reader to the original paper, but still provide a synopsis of the IDR(s) method hereafter.

It consists, in the scope of Krylov methods, to identify nested sub-spaces, with decreasing dimensions. Such sub-spaces, called Sonneveld spaces, are defined in the following manner:

Let G 0 be the total Krylov space generated by the matrix A, and the residual r 0 = bAx 0, x 0 being the initial start point of the algorithm. And let S be a vector subspace of R n , such that S ∩ G 0 does not contain a sub-space being invariant by A. Let also define G j  = (I − ω j A)(G j−1 ∩ S), where I is the identity matrix, and the ω j ’s are non-zero coefficients (these coefficients are defined in [3]).

Then, these spaces have the following property:

  • G j  ⊂ G j−1 unless G j−1 = {0},

  • G j  = {0} for some j ≤ n.

Then, the IDR(s) method consists in establishing a Krylov-type method such that the sequence of residuals reside in the G j subspaces.

2.2 IDR(s)-Biortho / IDR(s)-Minsync

The concept of IDR(s)-biortho is, by applying the same framework as IDR(s), to make this method more robust by applying a bi-orthogonalization of the basis vectors for Sonneveld spaces.

We direct the reader to the reference papers for the details.

A version of IDR(s)-biortho optimised to reduce the number of communications in case of parallel implementation is denoted by IDR(s)-minsync, as described in [5].

2.3 First Comparative Analysis (Robustness and Convergence Rate)

Before going further, we want to stress out that we have considered the use of these solvers without the conjunction of additional pre-conditioners (i.e in additional to the renumbering and re-normalization of the matrix described in Sect. 1.3). This means that no initial ILU preconditioner has been used, compared to the initial solver we sought to improve which was used in conjunction with such a preconditioner.

All the results we provide in the paper regarding the results of applying IDR(s) methods do not involve more preconditioning steps than renumbering and normalization of the linear system.

The IDR(s) basic method was described in the initial paper by Sonneveld and Van Gijzen [3] as sometimes suffering from residual stagnation. We experienced this, despite the application of the patches suggested in that paper to cure these issues, and this is illustrated in Figure 2.

thumbnail Figure 2.

Comparison of residuals of IDR(4) and IDR(4)-biortho on system 1.

The residual stagnation case of IDR(s, s = 4) is found here, concretized by a very large amount of iterations to achieve the prescribed convergence criterion (10−5). In contrast, the bi-orthogonalization technique allows to the algorithm to converge in a much shorter number of iterations.

Also, on the system 2 described in Section 1.3, we observed (and also consistently on other linear systems in our simulations) that the IDR(s)-biortho/IDR(s)-minsync outperforms the initial IDR(s) algorithm.

In Figure 2, we observe that the IDR(s)/biortho-minsync residual convergence rate is far greater than the one of IDR(s), which in this case does not suffer from residual stagnation.

For the sake of additional information, we have also added in Figure 3 observed residuals of two other kinds of solvers: TFQMR and MR_STEEPEST [2] (a non-Krylov solver). The results regarding these solvers have been provided to position various types of solvers.

thumbnail Figure 3.

Comparison of residuals on system 2 (IDR(s) with s = 4.)

The IDR(s) authors compared IDR(s) with GMRES in the paper [3]. In terms of number of matrix-vectors product to reach convergence, GMRES performs better than IDR(s). But on one hand, GMRES is not a limited-memory algorithm, and on the other hand, is known to have time-costly iterations.

3 Analysis of the Results Using Various Parallelized Implementations

In order to speed up the computations, we have evaluated four programming paradigms, and some of their variants. First, a short synopsis of these techniques is provided in Table 1.

Table 1.

A short synopsis of programming paradigms

We have implemented exactly the same algorithm with each of these techniques. This allowed us to target tangible comparisons between the parallel programming strategies. We also did not use black-box solver libraries, in order to precisely understand the performance of the solver in the considered various setups.

After giving some technical details on these implementation techniques, we will jump to the performance results on our two leading test cases, and provide some analyses.

3.1 Description of Implementation Techniques

3.1.1 Multi-Threaded Programming

Since almost 10 years, multi-core general purpose processors (multi-core CPU) have become ubiquitous, and offer some general-purpose parallelism with shared memory. Hence, our first implementation technique targeted the use of threads to parallelize the considered algorithms. Threads are concurrent flows of executions as part of a process, and each can use a different processor core (this can be imposed by the programmer, or we can let operating system scheduler handle it). Threads share the process data memory, which is handy for parallel programming.

Our algorithm parallelization work has been done by parallelizing elementary linear algebra operations on intermediate vectors of size n and on sparse matrix times vector product.

We dealt with a C-language based implementation of IDR(s) and IDR(s)-minsync, and used the Linux POSIX threads to activate several threads of execution in the scope of the solver.

In order to implement an optimized management of threads, we use on one hand a collection of pre-activated threads, and on the other hand a lock-free synchronization technique to get rid of the associated synchronizations costs implemented at the operating-system level. These lock-free techniques we have adopted rely on Linux spinlocks [6] and on full memory barriers [7] (as provided by the GCC compiler with the __sync_synchronize() operation).

3.1.2 GPU Programming

In the last few years, we have also seen the emergence of end-user programmable GPU, with two global providers: NVIDIA and AMD. GPU have a specific architecture, and are mainly used as accelerators that can be coupled with CPU-based software. They embed many processing units, and several Giga bytes internal shared memory. Their memory throughput is of a larger order of magnitude than current CPU.

We refer the reader to references [8, 9] for already work done these last years to build and evaluate GPU based solutions for implementing linear solvers [8, 9].

We implemented the IDR(s)-minsync algorithm as a combination of CPU code and calls to necessary linear algebra primitives on a GPU (thanks for NVIDIA Cuda [10], NVIDIA cuBlas [11], and cuSparse [12] libraries).

In Figure 4, we illustrate the “master-slave” programming model we have adopted. C-language based code implementing the solver logic is executed on the CPU. Vectors and Sparse matrix are allocated in the GPU internal (global) memory, and copied from the host central memory. The linear algebra computations primitives are delegated to implemented functions (on the GPU), thanks to the cuBlas and cuSparse primitives.

thumbnail Figure 4.

GPU acceleration of linear primitives.

3.1.3 Cluster Programming

The traditional parallel programming setup consists of implementing communicating processes over a network of servers, with a distribution of input/intermediate/output data.

We selected the OpenSHMEM standard middleware [13] for data distribution, and in practice used the OpenSHMEM implementation part of OpenMPI [14]. OpenSHMEM implements a logically shared-distributed memory paradigm (Fig. 5).

thumbnail Figure 5.

OpenSHMEM paradigm illustrated.

We have used the METIS [15] library in order to decompose the sparse matrix, re-index it, and allocate the data onto the various processes distributed on a cluster.

Our implementation allows to use also multithreading for local parallel processing on each process data. This yields an hybrid distributed/multithreaded approach.

3.2 Discussion of Time and Scalability Results Obtained with these Techniques

3.2.1 Report on the Scalability Results Obtained with a Multi-Thread Approach

The reference hardware for these tests is made up of two 6-cores CPU Intel E5-2643 v2 (with 59 Go/s theoretical peak memory throughput), managed by Linux Centos 6.3.

In Figure 6, we considered the resolution of linear system system 1 and system 2 by our IDR(s)-minsync multi-threaded implementation, and increased the number of threads. We represent there: the duration time we would ideally expect for these computations, and the actually observed duration times.

thumbnail Figure 6.

Duration versus number of threads for system 1 and system 2.

Our main statement is that there is a strong departure from ideal speedup, starting at 3 concurrent threads. In the case of system 1, the time decreases until 7 threads. In the case of system 2 (of larger size than system 1), the behavior appears even worse.

Best stated speedups in these cases are respectively 2.7 and 2.6.

For the sake of additional information, we also investigated multithreaded linear primitives available in Intel MKL library, as an alternative to our own multi-threading mechanisms implementation, which globally demonstrates the same behavior.

We observed that for our particular cases, we got a slight performance advantage with our multithreading technique over Intel-MKL ones. MKL relies on Intel compiler OpenMP support, which targets general setups, and is optimized globally for best performance, whereas our implementation is dedicated to our needs, and was slightly superior in the scope of our specific range of vectors sizes.

To understand these results, we devised a synthetic multi-threaded benchmark implementing series of linear primitives on vectors, parameterized by a reference size n of vectors. We measured the computation times d(k, n) of this synthetic benchmark by varying the number of threads k from 1 to 8, and n from 1 to 1 million. We identified that the ratio when k is fixed, degrades with (except for k = 1) and with n large, it quickly reaches an upper limit far as k increases. In particular, we found out that r(8 × 106) ~ 2.25, which is an order a magnitude consistent with the results presented in Figure 6.

As a consequence of this synthetic test, we deduce that the large departure from ideal speedup found here has its origin in memory throughput bottleneck to reach (shared) L3 cache and central memory.

When analyzing the implemented solver algorithm, we find that it is mainly memory-bound.

3.2.2 We Now Report on the Results Obtained by the GPU-Based Approach

The reference GPU used for this benchmark is a Fermi Quadro 6000 NVIDIA GPU with a core clock at 574 MHz, 6 Go internal memory, and having a peak memory bandwith of 144 Go/s.

The results of applying the IDR(s)-minsync GPU implementation, compared to the sequential implementation of IDR(s) are provided in Table 2.

Table 2.

Results obtained by comparing GPU implementation of IDR(s)-minsync and sequential implementation of IDR(s)

These test cases reveal that the GPU implementation is shorter in term of execution delay: on one hand, there are more parallel computational resources on the GPU, and on the other hand, we benefit from a more integrated memory system architecture with a significantly higher throughput.

We cannot, with the implementation we have setup, provide scalability numbers regarding to the usage of GPU ressources per se, since the cuSparse and cuBlas NVIDIA libraries handle themselves an optimized allocation of its functions on the GPU: the programmer cannot configure these characteristics.

3.2.3 Results Obtained by the Cluster-Based Approach

Here, our experiments were performed on a cluster of machines, interconnected with an Infiniband QDR network.

We now deal with system 2 linear system case.

First of all, we show in the Figure 7 graphs regarding the obtained computation times, by increasing the number of processes in the parallel implementation of the solver. To avoid memory concurrency issues as for the multi-thread implementation, and in order to facilitate the understanding of the obtained performance, we allocated for this test only one process per cluster server, meaning that a single worker runs on each server.

thumbnail Figure 7.

Resolution time with regard to number of cluster nodes used, 1 process per node.

Our first statement is that we manage to get better speedup than in the multi-threaded case, even when using the same number of logical resources, that is, the same number of computational cores (but allocated on one node versus several nodes). For example, for 8 processes, the obtained speedup here is 4.46, whereas in the best case in the multi-threaded setup, we managed a maximum speedup of 2.6.

Our implementation in the cluster case is not fully optimal yet. By an analysis of the communications graph, we identified that further work is needed in order to minimize the number of communications.

3.3 Comparison with Original MULTI Solver (IDR+ILU)

We provide here some comparison figures about our IDR(s)-minsync implementation for the GPU, and the initial IDR+ILU sequential implementation.

To do that, we consider the first 64 linear systems encountered in an real-world reservoir simulation, from which our leading test case system1 is extracted.

We represent in Table 3 the cumulated time spent in these respective solvers, and present the repartition of individual execution times ratio in Figure 8: we present there, for the 64 considered linear systems, the population in different performance improvement ratio classes. In this figure: [0:0.5] and [0.5:1] are two classes in which our IDR(s)-minsync on GPU implementation does not outperform the original solver. But the majority of performance improvement ratio sits in the [2:10] interval.

thumbnail Figure 8.

Repartition of individual speedups on 63 first linear systems.

Table 3.

Cumulated time spent in the solvers

Conclusion

In this paper, we have demonstrated the relevance of the IDR(s)-biortho (that we implemented in the IDR(s)-minsync version) solver for reservoir simulations.

Our study leads us to the following main conclusions.

First, we managed to use the IDR(s)-biortho (implemented as IDR(s)-minsync) algorithm on several of our actual industrial cases without the use of complex preconditioners. This demonstrates the high robustness of this method.

Second, with a GPU implementation of the solver, we managed to outperform the sequential IDR+ILU traditional preconditioner/solver combination of our reservoir Simulator Multi, and this allowed for overall reservoir simulation times reductions by a factor of about 2 or 3, the remaining (i.e external to the solver) sections of the software still being non-parallel.

Third, the GPU implementation reveals in practice a quite adequate tradeoff (for the problems sizes we currently consider) between obtained simulation times and economics of hardware.

Our perspectives lie in: a) the hybridization of the solver, to combine GPU approach with the cluster approach, b) the consideration of larger use cases with the same techniques, and in c) the end-to-end parallelization of the reservoir simulator.

Acknowledgments

The authors thank Patrick Egermann for fruitful comments and suggestions.

References

Cite this article as: V. Seignole, J. Thebault and R. Nabil (2016). Analysis of IDR(s) Family of Solvers for Reservoir Simulations on Different Parallel Architectures, Oil Gas Sci. Technol 71, 63.

All Tables

Table 1.

A short synopsis of programming paradigms

Table 2.

Results obtained by comparing GPU implementation of IDR(s)-minsync and sequential implementation of IDR(s)

Table 3.

Cumulated time spent in the solvers

All Figures

thumbnail Figure 1.

Example sparse matrices profiles.

In the text
thumbnail Figure 2.

Comparison of residuals of IDR(4) and IDR(4)-biortho on system 1.

In the text
thumbnail Figure 3.

Comparison of residuals on system 2 (IDR(s) with s = 4.)

In the text
thumbnail Figure 4.

GPU acceleration of linear primitives.

In the text
thumbnail Figure 5.

OpenSHMEM paradigm illustrated.

In the text
thumbnail Figure 6.

Duration versus number of threads for system 1 and system 2.

In the text
thumbnail Figure 7.

Resolution time with regard to number of cluster nodes used, 1 process per node.

In the text
thumbnail Figure 8.

Repartition of individual speedups on 63 first linear systems.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.