Dossier: SimRace 2015: Numerical Methods and High Performance Computing for Industrial Fluid Flows
Open Access
Issue
Oil Gas Sci. Technol. – Rev. IFP Energies nouvelles
Volume 72, Number 2, March–April 2017
Dossier: SimRace 2015: Numerical Methods and High Performance Computing for Industrial Fluid Flows
Article Number 12
Number of page(s) 11
DOI https://doi.org/10.2516/ogst/2017007
Published online 05 April 2017

© J.-M. Gratien, published by IFP Energies nouvelles, 2017

Licence Creative CommonsThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Industrial simulation softwares have to manage: (i) the complexity of the underlying physical models, usually expressed in terms of a Partial Differential Equation (PDE) system completed with algebraic closure laws, (ii) the complexity of the numerical methods used to solve the PDE system, and finally (iii) the complexity of the low level computer science services required to have efficient software on modern hardware. Robust and effective Finite Volume (FV) methods as well as advanced programming techniques need to be combined in order to fully benefit from massively parallel architectures (implementation of parallelism, memory handling, design of connections). Moreover, the above methodologies and technologies have become more and more sophisticated and too complex to be handled by physicists alone. Today, this complexity management becomes a key issue for the development of scientific software.

A number of existing frameworks already offer advanced tools to deal with the complexity related to parallelism. They hide hardware complexity and provide low level algorithms dealing directly with hardware specificities for performance reasons. They often offer services to manage mesh data services and linear algebra services which are key elements for efficient parallel software. However, all these frameworks often provide only partial answers to the problem as they only deal with hardware complexity and low level numerical complexity like linear algebra. The complexity related to discretization methods and physical models lacks tools to help physicists develop complex applications. New paradigms for scientific software must be developed to help them seamlessly handle the different levels of complexity so that they can focus on their specific domain.

Generative programming, component engineering and Domain Specific Languages (DSL) are key technologies to make the development of complex applications easier to physicists, hiding the complexity of numerical methods and low level computer science services. These paradigms allow to write code with a high level expressive language and take advantage of the efficiency of generated code for low level services close to hardware specificities. Their application to scientific computing has been limited so far to Finite Element (FE) methods, for which a unified mathematical framework has been existing for a long time. Such kinds of DSL have been developed for FE or Galerkin methods in projects like Freefem, Getdp, Getfem++, Sundance, Feel++ and Fenics. In these projects, they are embedded in host languages like Python or C++ and are named Domain Specific Embedded Languages (DSEL).

Over the last few years, we have extended this kind of approach to lowest-order methods to solve the PDE systems resulting from geo modeling applications. Indeed, a recent consistent unified mathematical framework which allows a unified description of a large family of these methods has emerged. It enables then, as for FE methods, the design of a high level language inspired from the mathematical notation. Such languages help then physicist to implement their application writing the mathematical formulation at a high level, hiding the complexity of numerical methods and low level computer science services guaranty of high performance. We have developed such a language, that we have embedded in the C++ language, on top of Arcane plateform [1]. We have used the Boost Proto library [2], a powerful framework providing tools to design this DSEL. In [3] we have presented ArcFVDSL, our DSEL aiming to implement various lowest-order methods (Finite-Volume, Mimetic Finite Difference, Mixed Hybrid Finite Volume, etc.) for diffusive problems on general meshes. In [4] we have presented technical details on how this language has been designed with the Boost Proto library. In this paper, we focus on the capability of the language combined to runtime system tools like Heterogeneous Abstract RunTime System (HARTS) [5], to handle seamlessly new heterogeneous architectures with multi-core processors enhanced by General Purpose computing on Graphics Processing Units (GP-GPU). We present the performance results of various implementations of different academic problems on different kinds of heterogeneous hardware architecture.

In Section 1, we briefly present the mathematical framework, our DSEL relies upon. We present the C++ concepts used to define the back end and the front end of the language and explain how to parse expressions representing bilinear and linear forms. We show how our framework enables to generate useful algorithms to solve the discrete problems. In Section 2, we present how by introducing HARTS, a runtime layer to handle new heterogeneous hardware architecture, we have improved the generated algorithms to take advantage of multi-core architectures.

In Section 4, we present some performance results on various academic test cases that we have implemented with our framework.

1 Arcfvdsl, a Dsel in C++ to Solve Diffusive Problems with Lowest-Order Methods

As for FE/DG (Finite Element/Discontinuous Galerkin) methods, a unified mathematical framework which allows a unified description of a large family of lowest-order methods has recently emerged [6]. The key idea is to reformulate the method at hand as a (Petrov)-Galerkin scheme based on a possibly incomplete, broken affine space. This is done by introducing a piecewise constant gradient reconstruction, which is used to recover a piecewise affine function starting from cell (and possibly face) centered unknowns.

For example, considering the following heterogeneous diffusion model problem:(1)with source term f ∈ L2(Ω), κ piecewise constant.

The continuous weak formulation reads: find such thatwith

In this framework, for a given partition of Ω, a specific lowest-order method is defined by (i) selecting a trial function space and a test function space , (ii) defining for all (uh, vh) ∈ Uh × Vh a bilinear form ah(uh, vh) and a linear form bh(vh). Solving the discrete problem consists then in finding uh ∈ Uh such that:

The definition of a discrete function space Uh is based on four main ingredients:

  • the mesh representing Ω, a submesh of where ;

  • the space of vector of degrees of freedom with components indexed by the mesh entities (cells, faces or nodes);

  • a linear gradient operator that defines for each vector vh ∈ Vh a constant gradient on each element of ; and

  • the broken gradient operator.

Using the above ingredients, we can define for all a piecewise affine function such that: ,(2)

Usually three kinds of submesh are considered: the mesh itself, the submesh with pyramidal subcells built regarding the faces of and the submesh with subcells built regarding the nodes of . We denote the space of degrees of freedom with components indexed by cells and the space of degrees of freedom with components indexed only by faces. Usually the following choices are considered: or .

With this framework, the model problem (1) can be solved with various methods:

  • the cell centered Galerkin (ccG) method and the G-method with cell unknowns only;

  • the hybrid finite volume method with both cell and face unknowns that recover the Mimetic Finite Difference (MFD) and Mixed Hybrid Finite Volume (MHFV) family.

For example, the hybrid finite volume method recovers the SUSHI scheme [7-10]. The discrete space with hybrid unknowns is then obtained with: (i) , (ii) , (iii) with such that, for all , all and all ,(3)where the linear residual operator is defined as follows: for all and all ,

This method with hybrid unknowns reads:with(4)and ∇h broken gradient on .

This unified framework allows the design of a high level language close to the mathematical notation. Such a language enables to express the variational discretized formulation of PDE problem with various methods, each of them defining specific bilinear and linear forms. Algorithms are then generated to solve the problems, evaluating the forms representing the discrete problem. The front end of the language is based on concepts (mesh, function space, test trial functions, differential operators) close to their mathematical counterpart. They are linked to low level structures, the back end of the language, representing meshes, scalar arrays indexed by mesh entities, algebraic objects (vectors, matrices, linear operators). For theses structures we use frameworks like Arcane [6] or ALIEN a framework to handle various linear solver packages. Linear and bilinear forms are represented by expressions built with the terminals of the language linked with unary, binary operators (+, −, *, /, dot(.,.)) and with free functions like grad (.), div (.), integrate (.,.). The purpose of theses expressions is (i) to express the variational discretized formulation of the problem, (ii) to generate algorithms which consist in evaluating these expressions to build global linear systems which are solved to find the solution of the problem. The generative mechanism of our framework is based on the Boost Proto framework [2] and is described in detail [4].

For our diffusion model problem (1), such a DSEL will for instance achieve to express the variational discretized formulation (9) with the programming counterpart shown in Listing 1.

Listing 1

Diffusion problem implementation

MeshType Th;

Real K;

auto Vh = newSUSHISpace(Th);

auto u = Vh−> trial(“U”);

auto v = Vh−> test(“V”);

BilinearForm a =

integrate(allCells(Th), dot(K*grad(u), grad(v)));

LinearForm b =

integrate(allCells(Th), f*v);

// Bilinear and linear form evaluation

Matrix M(/*…*/);

Vector rhs(/*…*/);

Vector sol(/*…*/);

LinearEvalContext ctx(M, rhs);

fvdsl :: eval(a, ctx);

fvdsl :: eval(b, ctx);

// Linear system resolution

Alien :: solve(M, rhs, sol);

The bilinear form defined by (9) has the programming counterpart given in Listing 1 and the corresponding expression tree is detailed in Figure 1.

thumbnail Figure 1.

Expression tree for the bilinear form defined in Listing 1. Expressions are in light gray, language terminals in dark gray.

Listing 2 is a generic assembly algorithm consisting in iterating on each entity of the mesh group and in evaluating the test and trial expression on each entity. For such evaluation, different kinds of context objects are defined. The structure EvalContext<ItemT> enables to compute the linear combination objects, some generalized stencils or local vectors indexed by DOF. These objects are returned by the evaluation of test or trial expressions. When they are associated to a binary operator tag, they lead to a bilinear contribution, a local matrix contributing to the global linear system.

Listing 2

Integration assembly algorithm

template<typename ItemT,

     typename TestExprT,

     typename TrialExprT,

     typename tag_op,

     typename BilinearContextT>

void integrate(Mesh const& mesh,

       GroupT<ItemT> const& group,

       TrialExprT const& trial,

       TestExprT const& test,

       BilinearContextT& ctx)

{

 static const Context :: ePhaseType phase =

 BilinearContextT :: phase_type;

 auto matrix = ctx.getMatrix();

 for(auto cell : group)

 {

  //! eval context on mesh item

  EvalContext<Item> ctx(cell);

  //! trial linear combination

  auto lu = proto :: eval(trial, ctx);

  //! test linear combination

   auto lv = proto :: eval(test, ctx));

  BilinearContribution<tag_op> uv(lu, lv);

  assemble<phase>(matrix,

           measure(mesh, cell),

           uv);

 }

}

In the same way the evaluation of a linear form expression with a linear context leads to the construction of the right hand side of a global linear system.

Once the global linear system is built, it can be solved with a linear system solver provided by the linear algebra layer ALIEN.

2 Harts, an Abstract Object Oriented Runtime System Layer for Heterogeneous Architecture

In the previous section we have presented a generative framework, based on a DSEL that enables to describe numerical methods at a high level, and to generate C++ codes with back-end objects with a generative mechanism. In this section, we show how we can introduce various kind of parallelism in the generated codes with the runtime system layer HARTS presented in details [5]. This layer is aimed to handle, in a unified way, different levels of parallelism. As in most existing runtime system frameworks, it is based on: (i) an abstract architecture model that enables us to describe in a unified way most of present and future heterogeneous architectures with static and runtime information on the memory, network and computational units; (ii) an unified parallel programming model based on tasks that enables us to implement parallel algorithms for different architectures; (iii) an abstract data management model to describe the processed data, its placement in each memory level and the various ways to access it from the each computation unit. The use of HARTS as illustrated in Figure 2 has several advantages:

  1. it enables to clearly separate the implementation of the numerical layer from the implementation of the runtime system layer;

  2. it enables to take into account the evolution of hardware architecture with new extensions and new concepts implementation, limiting in that way the impact on the numerical layer based on the DSEL generative layer.

thumbnail Figure 2.

Layer architecture.

Most of the algorithms that could be parallelized rely on this layer that bridges the gap between our DSEL and the low level Application Programming Interface (API) used to execute algorithms on various computational units. For example, this layer has been used:

  1. to create an abstract API to manipulate algebraic objects like matrices and vectors with standard Basic Linear Algebra Subprograms (BLAS) 1 and 2 operations;

  2. to enhance the assembly part of linear systems, while linear and bilinear expressions are evaluated on collections of mesh items.

Numerical low level algorithms can often been described as sequences of matrices and vectors algebraic operations. Such sequences with a great number of floating points operations can be expensive, and the way to optimize them are diverse with respect to the hardware specificities. It is important for the generative framework to have a high level unified way to express such sequences independently of low level optimisations. We have for this purpose defined an abstract algebraic API detailed in Listing 3 aiming: (i) to hide hardware specificities, (ii) to manage memory allocation and locality, (iii) to manage parallel loops, (iv) to provide most BLAS 1 and 2 functionalities, (v) to provide tools to split vectors and manage vector views and range iterators.

class AlgebraKernel

{

  void allocate(Vector & v,

         std :: size_t alloc_size);

  void assign(Vector & v, LambdaT op);

  void axpy(ValueT const &a,

      Vector const& x, Vector const& y);

  double dot(Vector const& x, Vector const& y);

  void mult(Matrix const& A,

       Vector const& x,

       Vector const& y);

  void exec(Precond const& P,

        Vector const& x,

        Vector const& y);

};

We have implemented this API with HARTS and with various other runtime system layers (XKaapi, StarSS, …). With this API we can compare the following classical BiCGStab sequence to its programming counterpart in Listing 3.

Algorithm 1: BiCGStab Algorithm

Matrix A;

Vector b, p, pp, r, v

Scalar a;

do

 pp = inv(P).p;

 v = A.p;

 r += v;

 a = dot(p,r);

 if(a==0) break;

 …;

while(|r|<tol*|b|);

Listing 3

BiCGStab sequence

AlgebraKernelType alg;

Matrix A; Vector p, pp, r, v;

double alpha;

// Declare the algorithm sequence

SequenceType seq = alg.newSequence();

alg.exec(precond, p, pp, seq);

alg.mult(A, pp, v, seq);

alg.axpy(1., r, v, seq);

alg.dot(p, r, alpha, seq);

alg.assertNull(alpha, seq);

while(!iter.stop())

{

 // execute the sequence

 alg.process(seq);

}

We have seen in Section 1 that the evaluation of bilinear and linear expressions with linear context objects leads to algorithms as in Listing 2. These algorithms consist in iterating on mesh entities, in computing local matrices and vectors which are assembled in a global matrix and right hand side vector. With new hardware architectures, such algorithms can be parallelized in a number of different ways as long as the concurrency on global data like the global linear system is managed. To handle the variety of low level systems, HARTS provides functionalities like HARTS::parallelForeach() (see Listing 3) to iterate on collection of mesh entities (nodes, faces, cells) and to apply lambda functions in parallel.

ItemGroupT<Item> items = …;

parallelForeach(

 items,

 [&](ItemVectorView<Item> const& items)

{

 // Lambda function

 std :: for_each(items.begin(),

        items.end(),

        [](Item const item)

        {

        …

        });

});

Using graph coloring techniques, we can manage concurrency on shared data without locks. For that we partition the collections of mesh entities in sub collections of mesh entities with disjoined connectivities, then we apply on each sub collection any lambda functions, avoiding in that way any concurrent access to shared data (Shared degrees of freedom during the assembly phase). Listing 4 shows how the generic assembly algorithm of Listing 2 can be written delegating the parallelism to the runtime system layer.

Listing 4

Parallel integration assembly algorithm

template<typename ItemT,

    typename TestExprT,

    typename TrialExprT,

    typename tag_op,

    typename BilinearContextT>

void integrate(Mesh const& mesh,

       GroupT<ItemT> const& group,

       TrialExprT const& trial,

       TestExprT const& test,

       BilinearContextT& ctx)

{

 static const Context :: ePhaseType phase =

 BilinearContextT :: phase_type;

 auto matrix = ctx.getMatrix();

 ColorPartitioner<ItemT, ItemT> part(mesh, group);

 for(std :: size_t color = 0;

  part.getNbColor();

  ++color)

{

 Harts :: parallelForeach(

  part.getPartition(color),

  [&](ItemVectorView<ItemT> const& items)

{

  for(auto cell : items)

  {

    //! eval context on mesh item

    EvalContext<Item> ctx(cell);

    //! trial linear combination

    auto lu = proto :: eval(trial, ctx);

    //! test linear combination

    auto lv = proto :: eval(test, ctx));

    //! bilinear contribution

    BilinearContribution<tag_op> uv(lu, lv);

    //! matrix assemble phase

    assemble<phase>(matrix,

            measure(mesh, cell),

            uv);

   }

  });

 }

}

3 Example of Applications

In this section we present first three academic model problems, the heterogeneous diffusive problem, a linear elasticity problem and the Stokes problem. For each of them, we compare different mathematical formulations to their programming counterpart.

3.1 Diffusive Problem

The countinuous strong formulation reads:

The countinuous variational formulation reads:

Find such thatwith

This problem can be solved with various lowest-order methods, the hybrid method presented in Section 1, the G-method [11] or the ccG method [12, 13].

In the G-method

The trial space for is obtained with (i) , (ii) , and (iii) a gradient operator, piecewise constant on the elements , base on the L construction, detailed in [11].

The method reads then:where with.

We can compare to the following programming counterpart:

MeshType Th(/*…*/);

VariableCellReal 3×3 K(/*…*/);

VariableCellReal f(/*…*/);

// FORMS DECLARATION

auto Uh = newGSpace(Th);

auto u = Uh−> trial();

auto v = Uh−> test();

Bilinear Form ah_g =

integrate(

  internalFaces(Th),

  dot(N(Th), avr(K*grad(u)))*jump(v)))

 + integrate(

  boundaryFaces(Th),

  dot(N(Th), avr(K*grad(u)))*jump(v));

LinearForm bh_g =

  integrate(cells(Th), −f*v);

// FORMS EVALUATION

/*…*/

In the ccG method

We introduce the linear gradient operator such that, for all and all ,(5)

The discrete space for the ccG method is obtained with: (i) , (ii) , and (iii) with such that(6)where is a linear trace reconstruction operator on the faces of .

Let for all ,(7)

The method reads:(8)

We can compare to the following programming counterpart:

MeshType Th(/*…*/);

VariableCellReal 3×3 K(/*…*/);

VariableCellReal f(/*…*/);

auto Uh = newCCGSpace(Th);

auto u = Uh−> trial();

auto v = Uh−> test();

// FORMS DECLARATION

BilinearForm ah_ccg =

integrate(all Cells(Th),

     dot(K*grad(u), grad(v)))

 + integrate(allFaces(Th),

  −K*jump(u)*dot(N(Th), avr(grad(v)))

  −K*dot(N(Th), avr(grad(u)))*jump(v))

 + integrate(internalFaces(Th),

  eta /H(Th)*jump(u)*jump(v));

LinearForm bh_ccg =

  integrate(cells(Th), −f*v);

// FORMS EVALUATION

/*…*/

3.2 Linear Elasticity

The countinuous strong formulation reads:where is the vector-valued displacement field.

The countinuous variational formulation reads:

Find such thatwith

The discrete variational hybrid method reads [14]:with(9)

We can compare to the following programming counterpart:

MeshType Th(/*…*/);

auto Uh = newHybridSpace(Th);

auto u = Uh−> trialArray(“U”, Th :: dim);

auto v = Uh−> testArray(“V”, Th :: dim);

// BILINEAR AND LINEAR FORMS

BilinearForm ah =

integrate(allCells(Th),

      m_2mu*ddot(eps(u), eps(v)))

 + integrate(allCells(Th),

      m_lambda*(div(u)*div(v)));

LinearForm bh =

integrate(allCells(Th),

      dot(m_f, v));

// Boundary conditions DIRICHLET + NEUNMANN

ah +=

on(boundary(Th, “Dirichlet”,

  trace(u) = g);

ah +=

integrate(boundary(Th, “Neunmann”),

   alpha*dot(SigmaN(u), v));

bh +=

integrate(boundary(Th, “Neunmann”),

      alpha*dot(t, v)););

3.3 Stokes Problem

The countinuous strong formulation reads: where is the vector-valued velocity field, is the pressure, and is the forcing term.

The countinuous variational formulation reads:

Following [15], we consider a discretization based on the spaces(10)

The momentum diffusion is discretized by the bilinear form such that(11)where, for all wh ∈ Uh, the Cartesian components of wh are denoted by (wh,i)i∈{1,…,d}. The velocity-pressure coupling hinges on the bilinear form ):(12)

The discrete divergence operator associated to bh is not surjective with choice of spaces (10). The stability of the velocity-pressure coupling can be recovered by penalizing pressure jumps via the bilinear form such that(13)

The discrete problem reads: Find (uh, ph) ∈ Uh × Ph such that, for all (vh, qh) ∈ Uh × Ph,(14)

We can compare to the following programming counterpart:

MeshType Th;

auto Uh = newCCGSpace(Th);

auto Ph = newP0Space(Th);

auto u = Uh−> trialArray(Th :: dim);

auto v = Uh−> testArray(Th :: dim);

auto p = Ph−> trial();

auto q = Ph−> test();

FVDomain :: algo :: Range<1> _i(dim);

BilinearForm ah =

integrate(allCells(Th),

  sum(_i)[dot(grad(u(_i)), grad(v(_i))]))

+ integrate(Internal <Face > :: items(Th),

sum(_i)[

   −dot(N(Th), avg(grad(u(_i))))*jump(v(_i))

   −jump(u(_i))*dot(N(), avg(grad(v(_i))))

   +eta/H(Th)*jump(u(_i))*jump(v(_i))

     ]);

BilinearForm bh =

integrate(allCells(Th),

      −id(p)*div(v))

+ integrate(allFaces(Th),

      avg(p)*dot(fn, jump(v)));

BilinearForm bth =

integrate(allCells(Th),

      div(u)*id(q))

 + integrate(allFaces(Th),

      −dot(N(Th), jump(u))*avg(q));

BilinearForm sh =

 integrate(internalFaces(Th),

      H(Th)*jump(p)*jump(q));

LinearForm fh =

integrate(allCells(Th),

      sum(_i)[f(_i)*v(_i)]);

4 Performance Results

In this section, we present some performance results obtained on the heterogeneous diffusive problem. We focus on the two most expensive parts, the global linear system assembly and the linear system resolution. In Figure 3, we can notice that, even if the linear system assembly is not trivial to parallelize due to concurrent access matrix and vector entries, we can obtain not so bad accelerations. For the linear system resolution, our generic layer ALIEN gives us access to various linear solver packages that can perform on multi-core nodes or on GP-GPU. In Figure 4, we see the solver performance up to 16 cores on multi-cores architecture. We can also see how we can even take advantage of the power of GP-GPU with the dedicated solver package for GP-GPU.

thumbnail Figure 3.

Linear system assembly performance.

thumbnail Figure 4.

Linear solver performance.

Conclusion and Perspective

We have presented ArcFVDSL, a DSEL developed on top of the Arcane framework. This high level language enables to implement various lowest-order methods for diffusive problems on general meshes. It hides low level optimisations for new heterogeneous architecture. We have shown how this generative framework combined to the HARTS layer can generate efficient codes and allows to handle seamlessly architectures with multi-core processors enhanced by GP-GPU. This combination turns to be a good solution to provide high level tools to implement new complex numerical methods preserving the performance of the generated code on heterogeneous hardware architectures. In the future, we plan to introduce at the numerical level, new features in the language to handle for example non linear models. At the hardware level, we plan to introduce new optimisations through the HARTS layer to take advantage on new Many integrated cores architectures using for instance Intel Xeon Phi processors.

References

Cite this article as: J.-M. Gratien (2017). ArcFVDSL, a DSEL Combined to HARTS, a Runtime System Layer to Implement Efficient Numerical Methods to Solve Diffusive Problems on New Heterogeneous Hardware Architecture, Oil Gas Sci. Technol 72, 12.

All Figures

thumbnail Figure 1.

Expression tree for the bilinear form defined in Listing 1. Expressions are in light gray, language terminals in dark gray.

In the text
thumbnail Figure 2.

Layer architecture.

In the text
thumbnail Figure 3.

Linear system assembly performance.

In the text
thumbnail Figure 4.

Linear solver performance.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.