Regular Article
A method for parallel scheduling of multirate cosimulation on multicore platforms
^{1}
IFP Energies nouvelles, 14, avenue de BoisPréau, 92852 RueilMalmaison Cedex, France
^{2}
Inria centre de Paris, 2 rue Simone Iff, 75012 Paris, France
^{*} Corresponding author: nicolas.pernet@ifpen.fr
Received:
24
February
2018
Accepted:
11
February
2019
The design of cyberphysical systems is a complex process and relies on the simulation of the system behavior before its deployment. Such is the case, for instance, of joint simulation of the different subsystems that constitute a hybrid automotive powertrain. Cosimulation allows system designers to simulate a whole system composed of a number of interconnected subsystem simulators. Traditionally, these subsystems are modeled by experts of different fields using different tools, and then integrated into one tool to perform simulation at the systemlevel. This results in complex and computeintensive cosimulations and requires the parallelization of these cosimulations in order to accelerate their execution. The simulators composing a cosimulation are characterized by the rates of data exchange between the simulators, defined by the engineers who set the communication steps. The RCOSIM approach allows the parallelization on multicore processors of cosimulations using the FMI standard. This approach is based on the exploitation of the cosimulation parallelism where dependent functions perform different computation tasks. In this paper, we extend RCOSIM to handle additional cosimulation requirements. First, we extend the cosimulation to multirate, i.e. where simulators are assigned different communication steps. We present graph transformation rules and an algorithm that allow the execution of each simulator at its respective rate while ensuring correct and efficient data exchange between simulators. Second, we present an approach based on acyclic orientation of mixed graphs for handling mutual exclusion constraints between functions that belong to the same simulator due to the nonthreadsafe implementation of FMI. We propose an exact algorithm and a heuristic for performing the acyclic orientation. The final stage of the proposed approach consists in scheduling the cosimulation on a multicore architecture. We propose an algorithm and a heuristic for computing a schedule which minimizes the total execution time of the cosimulation. We evaluate the performance of our proposed approach in terms of the obtained execution speed. By applying our approach on an industrial use case, we obtained a maximum speedup of 2.91 on four cores.
© S.E. Saidi et al., published by IFP Energies nouvelles, 2019
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
The recent advancement in merging different technologies and engineering domains has led to the emergence of the field of CyberPhysical Systems (CPS) [1]. In such systems, embedded computers interact with, and control physical processes. Because of the heterogeneity of the involved components (multiphysics, sensors, actuators, embedded computers), CPS may feature very complex architectures and designs, ever raising the need for time, cost, and efforteffective approaches for building robust and reliable systems. Numerical simulation has proven successful in responding to this need, and is therefore increasingly considered to be an indisputable step in design verification and validation. For complex CPS, experts of different engineering disciplines may be involved in the design process by developing models and simulators for different parts of the system. In a later stage, the developed simulators are coupled together in order to perform what is called simulation at the system level [2]. This approach is called cosimulation [3]. Each simulator is assigned an integration step, which in some cases is driven by the dynamics of the modeled system and the control objective, and exchanges data with the other simulators according to a communication step which can be equal or different than its integration step. Enabling cosimulation of heterogeneous models requires using adequate tools and methods. In this scope, the Functional Mockup Interface (FMI) [4] was developed with the goal of facilitating the cosimulation and the exchange of subsystem models and simulators. FMI defines a standardized interface that can be implemented by modeling tools in order to create models and simulators that can be connected with other FMI models and simulators.
Cosimulation is an alternative approach to monolithic simulation where a complex system is modeled as a whole. Here, by complex system, we refer to systems where the controlled physical process constitutes a multiphysics system and is modeled in the continuoustime domain using (hybrid) Ordinary Differential Equations (ODEs). Because they are aimed to be implemented on embedded computers, numerical laws that control the physical process are modeled in the discrete time domain. All of these features add to the complexity of the models and simulators. Cosimulation has a number of advantages over the monolithic approach. It allows modeling each part of the system using the most appropriate tool instead of using a single modeling software. Also, it allows a better intervention of the experts of different fields at the subsystem level. Furthermore, cosimulation facilitates the upgrade, the reuse, and the exchange of models and simulators. In cosimulation, the equations of each Functional Mockup Unit (FMU) are integrated using a solver separately. FMUs exchange data by updating their inputs and outputs according to their communication steps [5]. For connected FMUs with different communication steps, the master algorithm can provide extrapolation techniques to produce the missing data of the continuous part of the simulated system.
Figure 1 shows an example of the evolution of time and data exchange between two FMUs. The horizontal arrows represent the simulated time of each FMU. The vertical double arrows represent data exchange between the FMUs. In general, for each FMU, simulation is performed integration step by integration step. We consider that integration step sizes of the FMUs may differ. Also, we consider that data exchange between FMUs only happens when the simulated time of both FMUs is equal. Consequently, a communication step is defined for every couple of simulators. In addition, we consider, that this communication step is a multiple of the integration steps of both. Cosimulation requires domainspecific information for each involved FMU. Such information, which is beyond the scope of FMI, can be provided by domain experts, for example by using an approach like the one proposed by Sirin et al. [6]. The communication steps of FMUs can be shared as part of this information. It is worth noting that setting communication step sizes might depend on the parameters set for the FMU, and the dynamics of the inputs.
Fig. 1 Evolution of time and data exchange between two FMUs during cosimulation. 
Usually, assembling FMUs results in a heavy cosimulation, requiring high processing power. Increasing CPU performance through frequency scaling has reached some technological limits leading in a stagnation in the transistor count in processors. As a consequence, during the last decade, parallelism has been by far the main way for increasing the performance of processors [7].
In fact, the last years have witnessed a major shift among semiconductor manufacturers to building multicore processors, i.e. integrating multiple processors into one chip allowing parallel processing on a single computer. Enabling parallel execution of heavy cosimulations on multicore processors is keenly sought by the developers and the users of simulation tools. However, fulfilling this objective is not trivial and appropriate parallelization techniques need to be applied on cosimulation simulators in order to accelerate their execution on multicore processors.
In order to achieve (co)simulation acceleration using multicore execution, different approaches are possible and were already explored. From a user point of view, it is possible to modify the model design in order to prepare its multicore execution, for example by using models that are developed using parallel computing libraries as in [8]. Using parallel solvers as in [9] is another alternative. In [10], authors present the DACCOSIM framework which allows multithreaded distributed simulation of FMUs on multicore processors as well as clusters of computers. However, the parallelization achieved through this approach is not automatic as the user has to define the distribution of the FMUs on the architecture. A more detailed discussion of related work is given in [11].
In this paper, we address the problem from a cosimulation tool provider point of view. In such a tool, the user connects different FMUs, embedding solvers or using solvers provided by the cosimulation tool. In this case, it is not possible to change the models, the solvers, nor the modeling tools.
Readers can refer to Figure 16 to figure out the main stages of the proposed solution.
The rest of the paper is organized as follows. Section 2 gives a background about the approach proposed in this paper. The dependence graph model, the graph transformation rules, algorithm and heuristic are presented in Section 3. Then Section 4 presents our multicore scheduling proposal for cosimulation acceleration. In Section 5 we evaluate our proposed approach using both randomly generated graphs and an industrial usecase. Finally, we conclude in Section 7.
2 Background
The Refined COSIMulation (RCOSIM) [11] approach allows the parallelization on multicore processors of cosimulations using the FMI standard. This approach is based on a representation of the cosimulation using a Directed Acyclic Graph (DAG). A scheduling heuristic is then used to accelerate the execution of the cosimulation.
We chose to build on the RCOSIM approach in order to achieve parallelization of FMI cosimulations for accelerated execution. Parallel execution of the functions of one FMU is interesting since, as can be seen in Figure 3, it may not be possible to execute all the functions of one FMU and then all functions of another one. Concurrent execution is needed here which may lead to using synchronization mechanisms hurting performance. However, by exploring the search space of parallelization solutions, an efficient solution can be sought.
We did not use known parallel programming libraries for the following specific reasons. It is clear that MPI is not suitable for our goal since we target shared memory architectures whereas MPI is used to program distributed memory architectures. The other option is to use OpenMP or similar libraries which are adapted to sharedmemory architectures. However, OpenMP is efficient especially in the case of data parallelism (e.g. loop parallelism) which is not apparent in the cosimulations that we target. In fact, since we do not have access to the source code of the functions, we can not perform parallelization of the functions code, e.g. solver function, by using OpenMP pragmas. We only have information about the cosimulation at the function level, i.e. the functions can only be called but their code cannot be accessed. It should be noted that libraries such as OpenMP and Intel TBB offer task programming features which can be used to execute multiple functions in parallel. Note that the code of each function is not parallelized, but two or more functions can be executed in parallel using this solution. However, they rely on online scheduling which may introduce high overhead and thus decreases the performance. In addition, given that information about dependence between functions is available and the execution times can be measured, we assume that offline scheduling is more efficient to achieve our goal.
In this paper, we deal with the limitations of this approach that we highlighted in a recent work [12]. Although RCOSIM resulted in interesting cosimulation speedups, it has two limitations that have to be considered in the multicore scheduling problem in order to obtain better performances. First, RCOSIM supports only monorate cosimulations, i.e. the different FMUs have to be assigned the same communication step size. Nevertheless, some cosimulation scenarios have groups of FMUs that are more tightly coupled than others. This leads to different communication step sizes between some FMUs of the same cosimulation. In particular, in a given cosimulation scenario, we may have small communication steps sizes to ensure accuracy between tightly coupled FMUs and large communication step sizes between loosely coupled FMUs to speed up the cosimulation. The use case we propose in Section 5 is an example of such cosimulation. Note that this work is restricted to fixed integration step sizes and communication step sizes that are greater or equal to the integration step sizes of the respective simulators. Second, the functions of an FMU may not be threadsafe, i.e. they cannot be executed in parallel as they may share some resource (e.g. variables) that might be corrupted if two operations try to use it at the same time. Consequently, if two or more operations belonging to the same FMU are executed on different cores, a mechanism that ensures these operations are executed in disjoint time intervals must be set up. It is worth noting that a cosimulation should be repeatable which would not be possible if mutual exclusion is not handled properly as this may lead to race conditions.
Previously, these mutual exclusion constraints were tackled in RCOSIM by executing the operations of the same FMU on the same core which restricts the exploitation of the parallelism.
We propose in this article a solution for each of the aforementioned limitations by making extensions to RCOSIM. In order to allow handling multirate cosimulations, we present a graph transformation algorithm that is applied to the initial constructed graph. Then, mutual exclusion constraints are represented by adding, in the graph, non oriented edges between operations belonging to the same FMU. In order assign directions to the added non oriented edges, we formulate the problem, propose a resolution using linear programming and finally present an acyclic orientation heuristic. Then we present a multicore scheduling solution for such cosimulation graph. We first propose a resolution using linear programming before presenting a multicore scheduling heuristic. Multicore scheduling problems are known to be NPhard resulting in exponential resolution times when exact algorithms are used. Heuristics have been extensively used in order to solve multicore scheduling problems. In most situations they lead to results of good quality in practicle resolution times. In particular, list heuristics are widely used in the context of offline multicore scheduling.
Automatic parallelization [13] of computer programs embodies the adaptation of the potential parallelism inherent in the program to the effective parallelism that is provided by the hardware. Because computer programs are usually complex (multiple functions, nested function calls, control flow jumps, etc.), this process of adaptation requires the use of a model for abstracting the program to be parallelized. The aim of using such model is to identify which parts of the program can be executed in parallel by expressing some features of the program such as data dependence between different parts of the code. Task dependence graphs are commonly used for this purpose. A task dependence graph is a DAG denoted G(V, A) where:

V is the set of vertices of the graph. The size of the graph n is equal to the number of its vertices. Each vertex v _{i}: 0 ≤ i < n represents a task which is an atomic unit of computation, i.e. it cannot be allocated to several computing resources.

A is the set of arcs of the graph. A directed arc is denoted as a pair (v _{i}, v _{j}) and describes a precedence constraint between v _{i} and v _{j}, i.e. v _{i} has to finish its execution before v _{j} can start its execution.
The task dependence graph defines the partial order to be respected when executing the set of tasks. This partial order describes the potential parallelism of the program, i.e. vertices that are not in precedence relation A which is asymmetric and transitive.
The cosimulation of FMUs lends itself to the task dependence graph representation as shown hereafter. According to the FMI standard, the code of an FMU can be exported in the form of source code or as precompiled binaries. However, most FMU providers tend to adopt the latter option for proprietary reasons. We are thus interested in this case. The method for automatic parallelization of FMU cosimulation that we propose in this article is based on representing the cosimulation as a task dependence graph. We present in the rest of this section how this graph is constructed and a set of attributes that characterize it. The graph construction and characterization method is part of the RCOSIM approach as presented in [11].
2.1 Construction of the dependence graph of an FMU cosimulation
The entry point for the construction of a task dependence graph of an FMU cosimulation is a userspecified set of interconnected FMUs as depicted in Figure 2. Here, we consider only cosimulations where all FMUs are assigned identical communication step sizes. We refer to the graph which represents such cosimulation as monorate graph. The execution of each FMU is seen as computing a set of inputs, a set of outputs, and the state variables of the FMU. A computation of an input, output, or state variables is performed by FMU C function calls. Thanks to FMI, it is additionally possible to access information about the internal structure of a model encapsulated in an FMU. In particular, as shown in Figure 3, FMI allows the identification of Direct Feedthrough (e.g. Y _{B1}) and Non Direct Feedthrough (e.g. Y _{A1}) outputs of an FMU and other information depending on the version of the standard:

FMI 1.0: Dependence between inputs and outputs are given. The computation of the state at a given simulation step k is considered necessary for the computation of every output at the same simulation step k. It is considered that the computation of the state at a simulation step k + 1 requires the computation of each of the inputs at the simulation step k.

FMI 2.0: In addition to the information provided in FMI 1.0, more information is given about data dependence. It is specified which output at a given simulation step depends on the state computation at the same step. Also, it is specified which input at a simulation step k needs to be computed before the computation of the state at the step k + 1.
Fig. 2 InterFMU dependence specified by the user. 
Fig. 3 IntraFMU dependence provided by FMI. 
FMU information on input/output dependence allows transforming the FMU graph into a graph with an increased granularity. For each FMU, the inputs, outputs, and state are transformed into operations. An input, output, or state operation is defined as the set of FMU function calls that are used to compute the corresponding input, output, or state respectively. The cosimulation is described by a task dependence graph G(V, A) called the operation graph where each vertex represents one operation, each arc represents a precedence relation between operations o _{i} and o _{j}, and n = V is the size of the operation graph. The operation graph is built by exploring the relations between the FMUs and between the operations of the same FMU. A vertex is created for each operation and arcs are then added between vertices if a precedence dependence exists between the corresponding operations. If FMI 1.0, which does not give information about the dependence between the state variables computation and the input and output variables computations, is used, we must add arcs between all input operations and the state operation of the same FMU. Furthermore, arcs connect all output operations and the state operation of the same FMU because the computation at the simulation step k of an output must be performed with the same value of the state (computed at simulation step k) as for all the outputs belonging to the same FMU. An execution of the obtained graph corresponds to one simulation step. The operation graph corresponding to the FMUs of Figure 3 is shown in Figure 4. Note that we assume that no rollbacks are used and that the appropriate sequence of function calls is used as stated in the FMI standard.
2.2 Dependence graph attributes
The operation graph is used as input to the scheduling algorithm. In addition to the partial order defined by the graph, the scheduling algorithm uses a number of attributes to compute an efficient schedule of the operation graph. Many list scheduling algorithms are based on the Critical Path Method (CPM) [14] and use a set of attributes and notations to characterize the operation graph.
The notation f _{m} (o _{i}) is used to refer to the FMU to which the operation o _{i} belongs, and T(o _{i}) to denote the type of the operation o _{i}, i.e. . An operation o _{i} is characterized by its communication step H(o _{i}) which is equal to the communication step of its FMU. o _{j} is a predecessor of o _{i} if there is an arc from o _{j} to o _{i}, i.e. . We denote the set of predecessors of o _{i} by pred(o _{i}). In the example shown by Figure 5, pred(d) = {b, c}. o _{j} is an ancestor of o _{i} if there is a path in G from o _{j} to o _{i}. The set of ancestors of o _{i} is denoted by ance(o _{i}). In the example shown by Figure 5, ance(d) = {a, b, c}. o _{j} is a successor of o _{i} if there is an arc from o _{i} to o _{j}, i.e. . We denote the set of successors of o _{i} by succ(o _{i}). In the example shown by Figure 5, succ(d) = {b, c}. o _{j} is a descendant of o _{i} if there is a path in G from o _{i} to o _{j}. The set of descendants of o _{i} is denoted by desc(o _{i}). In the example shown by Figure 5, desc(a) = {b, c, d}. A profiling phase allows measuring the execution time C(o _{i}). For each operation, the average execution time of multiple cosimulation runs is used. Let’s consider that the operations shown in Figure 5 have the following execution times C(a) = 2, C(b) = 2, C(c) = 1, C(d) = 4. The earliest start time from start S(o _{i}) and the earliest end time from start E(o _{i}) are defined by equations (1) and (2) respectively. S(o _{i}) defines the earliest time at which the operation o _{i} can start its execution. S(o _{i}) is constrained by the precedence relations. The earliest time the operation o _{i} can finish its execution is defined by E(o _{i}):
Fig. 5 Example of an operation graph. 
In the example of Figure 5, we have: S(a) = 0, S(b) = 2, S(c) = 2, S(d) = 4 and E(a) = 2, E(b) = 4, E(c) = 3, E(d) = 8.
The latest end time from end denoted by and the latest start time from end denoted by are defined by equations (3) and (4) respectively.
In the example of Figure 5, we have: and .
The critical path of the graph is the longest path in the graph. The length of a path is computed by accumulating the execution times of the operations that belong to it. The length of the critical path of the operation graph denoted by R is defined by equation (5). The critical path is a very important characteristic of the operation graph. It defines a lower bound on the execution time of the graph, i.e. in the best case the time needed to execute the whole graph is equal to the length of the critical path:
In the example of Figure 5, we have: R = max (E(a), E(b), E(c), E(d)) = max (2, 4, 3, 8) = 8.
The flexibility F(o _{i}) is defined by equation (6). F(o _{i}) expresses the length of an execution interval within which operation o _{i} can be executed without increasing the total execution time of the graph:
In the example of Figure 5, we have F(a) = 8−0−2−6 = 0, F(b) = 8−2−2−4 = 0, F(c) = 8−2−1−4 = 1, F(d) = 8−4−4−0 = 0.
3 Extension of the dependence graph model for FMU cosimulation
This section describes our proposed extensions to the dependence graph model that is used for representing an FMU cosimulation. The transformations that it undergoes in order to represent multirate cosimulation and mutual exclusion constraints are presented.
3.1 Dependence graph of a multirate FMU cosimulation
Our operation graph model has to be extended in order to accommodate multirate data exchange between operations.
Consider an operation graph that is constructed as described in the previous section from a multirate cosimulation, i.e. some FMUs have different communication steps. Such graph is referred to as a multirate operation graph. One way for making such operation graph suitable for multicore scheduling is to transform it into a monorate graph. This section presents an algorithm that transforms the initial multirate operation operation graph G(V, A) into a monorate operation operation graph G _{M} (V _{M}, A _{M}). The aim of this transformation is to ensure that each operation is executed according to the communication step of its respective FMU and also to maintain a correct data exchange between the different FMUs, whether they have different or identical communication steps. Similar algorithms have been used in the realtime scheduling literature to account for multirate scheduling problems [15].
We define the HyperStep (HS) as the Least Common Multiple (LCM) of the communication steps of all the operations: HS = LCM (H(o _{1}), H(o _{2}),…, H(o _{n})) where n = V _{I} is the number of operations in the initial graph. The HyperStep is the smallest interval of time for describing an infinitely repeatable pattern of all the operations. The transformation algorithm consists, first of all, in repeating each operation o _{i}, r _{i} times where r _{i} is called the repetition factor of o _{i} and . Each repetition of the operation o _{i} is called an occurrence of o _{i} and corresponds to the execution of o _{i} at a certain simulation step. We use a superscript to denote the number of each occurrence, for instance denotes the sth occurrence of o _{i}. The instant for which is computed is denoted as t(o _{i}, s) = H(o _{i}) × s. Operations belonging to the same FMU have the same repetition factor since they are all executed according to the communication step of the FMU they belong to. Therefore, we define the repetition factor of an FMU to be equal to the repetition factor of its operations. Then, arcs are added between operations following the rules presented hereafter. Consider two operations connected by an arc , then adding an arc to A _{M}, depends on the instants t(o _{i}, s) and t(o _{j}, u) for which and are computed respectively. In other words, if t(o _{i}, s), and t(o _{j}, u) are the simulation steps associated with and respectively, then the inequality t(o _{i}, s) ≤ t(o _{j}, u) is a necessary condition to add the arc to A _{M}. In addition, is connected with the latest occurrence of o _{i} that satisfies this condition. Formally, is connected with such that s = max (0, 1,…, r _{i−1}): t(o _{i}, s) ≤ t(o _{j}, u). In the case where H(o _{i}) = H(o _{j}) (and therefore r _{i} = r _{j}), occurrences and which correspond to the same number, i.e. s = u, are connected by an arc. On the other hand, if , we distinguish between two types of dependence: we call the arc a slow to fast (resp. fast to slow) dependence if H(o _{i}) > H(o _{j}) (resp. H(o _{i}) < H(o _{j})). For a slow to fast dependence , one occurrence of o _{i} is executed while several occurrences of o _{j} are executed. In this case, arcs are added between each occurrence , and the occurrence such that:
We recall that for a slow to fast dependence, the master algorithm can perform extrapolation of the inputs of the receiving FMU. For a fast to slow dependence , arcs are added between each occurrence , and the occurrence such that:
Arcs are added also between the occurrences of the same operation, i.e. where and s′ = s + 1. Finally, for each FMU, arcs are added between the sth occurrence of the state operation, where , and the (s + 1)th occurrences of the input and output operations. The multirate graph transformation is detailed in Algorithm 1. The algorithm traverses all the graph by applying the aforementioned rules in order to transform the graph and finally stops when all the nodes and the edges have been visited.
Figure 6 shows the graph obtained by applying the multirate transformation algorithm on the graph of Figure 4. In this example H _{B} = 2 × H _{A}, where H _{A} and H _{B} are the communication steps of FMUs A and B respectively.
Without any loss of generality, the superscript which denotes the number of the occurrence of an operation is not used in the remainder of this article for the sake of simplicity. Each occurrence of an operation in the graph G(V, A) becomes an operation that is referred to using the notation o _{j}.
Multirate graph transformation algorithm.
Input: Initial operation operation graph G(V, A)
Output: Transformed operation operation graph G(V,A);
foreach o _{i} ∈ V do
Compute the repetition factor of : ;
Repeat the operation o _{i}:
end
foreach (o _{i},o _{j}) ∈ A do
if H(o _{i}) > H(o _{j}) then
for p ← 0 to r(oi) − 1 do
Compute
Add the arc to the graph: ;
end
end
else if H(o _{i}) < H(o _{j}) then
for q ← 0 to r(o _{j}) − 1 do
Compute ;
Add the arc to the graph:
end
else
for p ← 0 to r(o _{i}) − 1 do
Add the arc to the graph: ;
end
end
end
for each o _{i} ∈ V do
for p ← 0 to r(o _{i}) − 2 do
Add an arc between successive occurrences of o _{i}: ;
end
end
foreach o _{i} ∈ V : tpe(p _{i}) = state do
for p ← 0 to r(o _{i}) − 2 do
foreach o _{j} ∈ V : f _{m}(o _{j}) = f _{m}(o _{i}) and tpe(o _{i}) ∈ {input, output} do
Add the arc to the graph: ;
end
end
end
3.2 Dependence graph with mutual exclusion constraints
3.2.1 Motivation
The FMI standard states that “FMI functions of one instance don’t need to be thread safe’’. Consequently, an FMU could be implemented using global variables which introduce errors when calling the different functions of the FMU in parallel. It is up to the executing environment to ensure the calling sequences of functions are respected as specified in the FMI standard. These restrictions introduce mutual exclusion constraints on the operations of the same FMU.
In [12] we have shown that using synchronization objects such as mutexes or allocation constraints strongly reduce the obtained speedup. We consequently propose an alternative solution that could satisfy the mutual exclusion constraints while: i) leaving as much flexibility as possible for allocating the operations to the cores and; ii) introducing lower synchronization overhead. The proposed method is based on modeling the mutual exclusion constraints in the operation graph of the cosimulation.
3.2.2 Acyclic orientation of mixed graphs
The operation graph model can be extended in order to represent scheduling problems that involve precedence constraints and also mutual exclusion constraints. This is commonly done using mixed graphs. A mixed graph G(V, A, D) is a graph which contains a set A of directed arcs denoted (o _{i}, o _{j}): 0 ≤ i, j < n and a set D of undirected edges denoted [o _{i}, o _{j}]: 0 ≤ i, j < n. In the scheduling literature, these graphs are known also as disjunctive graphs [16]. In addition to the arcs corresponding to the previously introduced precedence constraints, mutual exclusion relations are represented by edges in a mixed graph such that:

Precedence constraints: must finish its execution before o _{j} can start its execution.

Mutual exclusion constraints: and o _{j} must be executed in strictly disjoint time intervals.
Operations belonging to the same FMU can be executed in either order but not in parallel. In order to compute a schedule for a mixed graph, an execution order has to be defined for each pair of operations connected by an undirected edge which is interpreted by assigning a direction to this edge. Cycles must not be introduced in the graph while assigning directions to edges, otherwise, the scheduling problem becomes infeasible. Since the final goal is to accelerate the execution of the cosimulation, the acyclic orientation of the mixed graph has to minimize the length of the critical path of the graph.
The acyclic orientation problem is closely related to vertex coloring of a graph. In its general form, i.e. when all edges of the graph are undirected, vertex coloring is a function α:V → {1, 2,…, k} which labels the vertices of the graph with integers, called colors, such that the inequality 9 holds:
The acyclic orientation of the graph can then be obtained by assigning a direction to every edge such that the color of the corresponding tail vertex is smaller than the color of the corresponding head vertex. A graph coloring with k colors is referred to as kcoloring. In its general form, vertex coloring aims at finding a minimum vertex coloring, i.e. minimizing k the number of the used colors. The minimum number of colors required to color an undirected graph is called the chromatic number and is denoted χ(G). The Gallai–Hasse–Roy–Vitaver theorem [17–20] links the length of the longest path of the graph, obtained by the orientation which minimizes this length, to vertex coloring of the graph. It states that the length of the longest path of a directed graph is at least χ(G). Thus, a minimum vertex coloring leads to an acyclic orientation that minimizes the length of the critical path of the resulting graph. Computing the chromatic number of a graph is NPcomplete [21].
The acyclic orientation of a mixed graph can be obtained via vertex coloring also. However, vertex coloring of a mixed graph has to take into account both arcs and edges of the graph. More precisely, a vertex coloring of a mixed graph is a function α:V → {1, 2,…, k} such that inequalities 9 and 10 hold:
A coloring of a mixed graph G(V, A, D) exists only if it is cyclefree [22], i.e. the directed graph G(V, A, ∅) does not contain any cycle. The problem of acyclic orientation of mixed graphs has been studied in the literature in [23–25]. The authors proposed efficient algorithms for the orientation of special types of mixed graphs and showed that, in the general case, the problem is NPHard.
3.2.3 Problem formulation
Let G(V, A) be an operation graph of an FMU cosimulation constructed as described in previous sections. In order to represent mutual exclusion constraints between FMU operations, the initial operation graph G(V, A) is transformed into a mixed graph by connecting each pair of mutually exclusive operations o _{i}, o _{j} by and edge [o _{i}, o _{j}]. The resulting mixed graph is denoted G(V, A, D), where V is the set of operations, A is the set of arcs, and D is the set of edges. Once the mixed graph is constructed, directions have to be assigned to its edges in order to define an order of execution for mutually exclusive operations. The precedence and mutual exclusion relations represented by the mixed graph G(V, A, D) are given by expressions (11) and (12). If operations o _{i} and o _{j} are connected by an arc (o _{i}, o _{j}), the time interval (S(o _{i}), E(o _{i})] must precede the time interval (S(o _{j}), E(o _{j})]. Otherwise, if operations o _{i} and o _{j} are connected by an edge [o _{i}, o _{j})], time intervals (S(o _{i}), E(o _{i})] and (S(o _{j}), E(o _{j})] must be strictly disjoint:
The timing attributes of the operations in the mixed graph G(V, A, D) are the same as in the initial graph G(V, A) because the added set of edges does not impact the computation of these attributes. The attributes of an operation o _{i}, connected by an edge with another operation, may change only when this edge is assigned a direction.
An edge [o _{i}, o _{j}] is called a conflict edge if the intervals (S(o _{i}), E(o _{i})] and (S(o _{j}), E(o _{j})] in the graph G(V, A) overlap (Eq. (13)). If for a given edge [o _{i}, o _{j}] either E(o _{i}) ≤ S(o _{j}) or E(o _{j}) ≤ S(o _{i}), there is no conflict and the edge can be assigned a direction:
It should be noted that, for a given edge [o _{i}, o _{j}], choosing either of the execution orders does not impact the numerical results of the cosimulation since these operations do not have data dependence. Still, we have to ensure mutual exclusion between them due to the nonthreadsafe implementation of FMI. Following the definition given in the previous section, the corresponding coloring is a function α:V → {1, 2,…, k} which is equivalent to mapping the operations to the time intervals [S(o _{1}), E(o _{1})], [S(o _{2}), E(o _{2})],…, [S(o _{n}), E(o _{n})].
The problem of acyclic orientation of the mixed graph G(V, A, D) can be stated as an optimization problem as follows:

Input: Mixed graph G(V, A, D)

Output: DAG G(V, A′)

Find: Coloring α:V → {1, 2,…, k}

Minimize: Number of colors k

Subject to:

3.2.4 Resolution using integer linear programming
Let G(V, A, D) be a mixed graph constructed from the operation graph G(V, A) as described in the previous sections to represent precedence and mutual exclusion constraints between operations of an FMU cosimulation. In the following, we present an Integer Linear Programming formulation for the problem of acyclic orientation of G(V, A, D). The proposed formulation is based on the scheduling notation which gives a more compact set of constraints compared to a formulation that uses the vertex coloring notation.
Tables 1 and 2 summarize the variables and the constants that are used in the ILP formulation respectively.
Variables used in the ILP formulation of the acyclic orientation problem.
Constants used in the ILP formulation of acyclic orientation problem.
The following set of constraints is used in the ILP formulation of the acyclic orientation problem:

Precedence constraints: The start time of each operation is equal to the maximum of the end times of all its predecessors. Expression (14) captures this constraint. Note that expression (14) indicates that the start time of operation o _{j} is greater or equal to the end time of each predecessor o _{i}. This is sufficient to express since the formulated problem is a minimization problem.

Mutual exclusion constraints: We define the binary variable b _{ij} which is associated with the direction that is assigned to the edge [o _{i}, o _{j}]. The assignment of directions to edges is given by the function . b _{ij} is set to 1 if the edge [o _{i}, o _{j}] is assigned a direction from o _{i} to o _{j}, i.e. color ϕ([o _{i}, o _{j}]) = (o _{i}, o _{j}) and to 0 otherwise. Note that b _{ij} is the complement of b _{ji}. For every pair of operations that are connected by and edge, we have to ensure that their time intervals are strictly disjoint, i.e. . Expressions (15) and (16) capture this constraint where M is a large positive integer.

Time intervals: Expression (17) is used to compute the end time of each operation.

Length of the critical path: The critical path P is equal to the maximum of the end times of all the operations (expression (18)).
The objective of this linear program is to minimize the length of the critical path of the operation graph (expression (19)):
3.2.5 Acyclic orientation Heuristic
While exact algorithms such as ILP give optimal results, they suffer from very long execution times not acceptable for the users. For many real world applications, ILP fails to produce the results within acceptable times. Heuristics are usually good alternatives. While the optimality of the solution cannot be guaranteed when using heuristics, they, in most cases, provide results of good quality, not too far from the optimal solution within acceptable execution times. We propose in this section a heuristic for the acyclic orientation of the mixed graph G(V, A, D). A straightforward acyclic orientation can be obtained by sorting the operations in a nondecreasing order of their start times S(o _{i}) and assigning directions to edges following this order, i.e. . This is a fast greedy acyclic orientation, however it can be improved as we will show hereafter.
Let d be the sum of the repetition factors of all the FMUs. The set of operations V can be represented as a union of mutually disjoint non empty subsets such that every subset contains all operations that belong to the same FMU and that correspond to the same occurrence:
It is known that edges in the set D exist only between operations that belong to the same FMU. Furthermore, for every edge , operations o _{i} and o _{j} correspond to the same occurrence. Even if operations which belong to the same FMU and correspond to different occurrences are mutually exclusive, it is not needed to connect them by an edge because an execution order is already ensured for these operations by the way the operation graph is constructed. In other words, all the operations of an FMU, and which correspond to the same occurrence have to finish their execution before the next occurrence of any operation can start its execution. Similarly to the operation set, the edge set D can be subdivided into mutually disjoint non empty subsets:
In view of the above, we define the set of subgraphs which constitute the graph . Theorem 3.1 indicates the relationship between the acyclic orientations of the subgraphs G(V _{k}, D _{k}) and the acyclic orientation of the mixed graph G(V, A, D).
An acyclic orientation of the mixed graph G(V, A, D) can be obtained by finding an acyclic orientation for every subgraph G _{k} (V _{k} , D _{k} ) following the nondecreasing order of the start times of the operations as described previously.
Proof. In order to prove this, we have to show that every edge in D is assigned a direction and that the resulting orientation does not lead to the creation of a cycle. We use a proof by contradiction to prove this statement. Since every edge [o _{i}, o _{j}] belongs to one subset of edges D _{k}, finding acyclic orientations for all the subgraphs G _{k} (V _{k}) leads to assigning a direction to every edge in D. The existence of a cycle in the resulting graph means that there exists at least an edge [o _{i}, o _{j}] that has been transformed into the arc (o _{i}, o _{j}) and S(o _{i}) > S(o _{j}). However, this is not possible because the greedy acyclic orientation assigns directions to edges following a nondecreasing order of the start times of the operations which contradicts the previous assertion and thus proves Theorem 3.1.
Consider now that the acyclic orientation of each subgraph G _{k} (V _{k}, D _{k}) is obtained by finding a vertex coloring for this subgraph. This vertex coloring can be seen as a sequence of assignments , such that every assignment α _{l} assigns a color to one operation and leads to assigning directions to edges that connect o _{i} with other already colored operations . The number of assignments needed to perform the acyclic orientation of G _{k} (V _{k}, D _{k}) is equal to the number of edges . Following the coloring of an operation and the engendered assignment of directions, the attributes of some operations may change. Two situations have to be distinguished:

Coloring α _{l} of operation o _{i} does not lead to assigning a direction to any conflict edge. In this case, no changes of the timing attributes occur.

Coloring α _{l} of operation o _{i} leads to assigning a direction to at least one conflict edge . Without any loss of generality, suppose that the edge [o _{i}, o _{j}] is transformed into the arc (o _{i}, o _{j}), then the start time S(o _{j}) is changed to S(o _{j}) ← E(o _{i}). This leads to changing the end time E(o _{j}) also and possibly causes a domino effect for the start times and end times of all the descendants (see Algorithm 2). Moreover, if , the end time from end is changed to . Similarly, this leads to changing the start time from end and possibly causes a domino effect for the start times and end times of all the ancestors (see Algorithm 3).
Update of the start and end times following an assignment α _{l}.
Input: Attributes of the mixed graph G(V, A, D), partially colored subgraph G _{k} (V _{k}, A _{k}, D _{k});
Output: Update of the start and end times of a subset of operations {o _{i}} ⊂ V;
Set αl the last assignment of color made to an operation o _{i} ∈ V _{k};
Set A _{k,l} = {(o _{t}, o _{h})} the set of all arcs created from the orientations engendered by α _{l};
foreach (o _{t}, o _{h}) ∈ A _{k,l} do
if S(o _{h}) < E(o _{t}) and S(o _{t}) < E(o _{h}) then
S(o _{h}) ← E(o _{t});
E(o _{h}) ← S(o _{h}) + C(o _{h});
update(o _{h});
end
end
Procedure update(o _{h})
if succ(o _{h}) ≠ ∅ then
foreach do
if then
;
update ;
end
end
end
return ;
We now describe our proposed acyclic orientation heuristic. The heuristic takes as input a mixed graph G(V, A, D) and the attributes of the operations as computed for the digraph , and assigns directions to all the edges . Applying Theorem 3.1, the heuristic consists in finding vertex colorings of the subgraphs which constitute the graph G(V, A, D) (see Algorithms 4 and 5). In the first step, the graph obtained by removing all the arcs from the mixed graph G(V, A, D) is partitioned into d subgraphs where d is the number of all occurrences of all FMUs such that each subgraph contains all the operations of one FMU which correspond to the same occurrence and all the edges that connect them: . Then, the set of operations is sorted in a nondecreasing order of the start times S(o _{i}). Next, the heuristic iteratively assigns colors to operations. It keeps a list of already colored operations L _{k} for each subgraph . The operations of every list are sorted in increasing order of their assigned colors. At each iteration, the heuristic selects among the operations not yet colored , the operation which has the earliest start time S(o _{i}) to be assigned a color. Ties are broken by selecting the operation with the least flexibility. We call the operation to be colored at a given iteration, the pending operation. The heuristic checks in the order of L _{k} if the edges which connect the pending operation with the operations are conflict edges. If a conflict edge is detected, the pending operation is assigned the color α(o _{j}) and the colors assigned to all the already colored operations such that α(o _{i′}) ≥ α(o _{i}), are increased α(o _{i′}) = α(o _{i′}) + 1. The corresponding edges are then accordingly assigned directions. Afterward, the timing attributes of the operations are updated using Algorithms 2 and 3. At this point, the increase of R, the critical path of the graph, is evaluated. Next, the operations : α(o _{i′}) > α(o _{i}) are reassigned their previous colors α(o _{i′}) = α(o _{i′})−1, and the pending operation is assigned the color α(o _{i}) = α(o _{i}) + 1. The increase in the critical path is evaluated again similarly. After repeating this process for all the edges , the pending operation is finally assigned the color which leads to the least increase in the critical path, and edges are assigned directions accordingly. The heuristic begins another iteration by selecting a new operation to be colored. The heuristic assigns a color to one operation at each iteration. Every operation is assigned a color higher than the colors of all its predecessors which guarantees that no cycle is generated. The heuristic finally stops when all the operations have been assigned colors.
Update of the start and end times from end following an assignment α _{l}.
Input : Input Attributes of the mixed graph G(V, A, D), partially colored subgraph G _{k}(V _{k}, A _{k}, D _{k});
Output: Output Update of the start and end times from end of a subset of operations {o _{i}} ⊂ V; Set α _{l} the last assignment of color made to an operation o _{i} ∈ V _{k};
Set A _{k,l} = {(o _{t}, o _{h})} the set of all arcs created from the orientations engendered by α _{l}
foreach (o _{t}, o _{h}) ∈ A _{k,l} do
if S (o _{h}) < E(o _{t} ) and S (o _{t}) < E(o _{h}) then
if then
;
;
update o _{t});
end
end
end
Procedure update o _{t})
if pred(o _{t}) ≠ then
foreach o _{t} ∈ pred(o _{t} ) do
if then
;
;
update ;
end
end
end
return ;
3.2.6 Complexity
The outermost loop (while loop) of the acyclic orientation heuristic is repeated n times, such as at each iteration, one operation is assigned a color. Recall that n is the number of operations in the operation graph. The selection of the operation with latest start time is done in . The first inner loop iterates over all the edges connecting the selected operation. It is repeated at most d times, where d is the maximum number of edges connecting one operation. The inner most loop is executed twice in all cases. This results in an execution of the nested inner loops in O(d). In addition Algorithms 2 and 3 that are called in the heuristic have each a complexity of O(n) since they are based on a recursion whose depth is at most n. Therefore, the complexity of the acyclic orientation heuristic is evaluated to .
4 Multicore scheduling of FMU dependence graphs for cosimulation acceleration
This section presents methods for scheduling an operation graph on a multicore architecture. Once the operation graph has been constructed and undergone the different phases of transformations as shown in the previous sections, it is scheduled on the multicore platform with the goal of accelerating the execution of cosimulation.
In order to achieve fast execution of the cosimulation on a multicore processor, an efficient allocation and scheduling of the operation graph has to be achieved. The scheduling algorithm takes into account functional and non functional specification in order to produce an allocation of the operation graph vertices (operations) to the cores of the processor, and assign a starting time to each operation. We present hereafter a linear programming model and a heuristic for scheduling FMU dependence graphs on multicore processors with the aim of accelerating the execution of the cosimulation.
Acyclic orientation heuristic.
Input: Mixed graph G(V, A, D)
Output: DAG G(V, A);
Set s the number of all occurrences of all FMUs;
Partition the graph into s subgraphs: ;
Initialize lists ;
Set Ω the set of all the operations not already colored;
while do
Select the operation (break ties by selecting the operation with the least flexibility);
if then
α(o _{i}) ← 1; L _{k} ← L _{k} ∪ {o _{i}};
end
else
Set σ ← ∞; // Initialize the increase in the critical path
foreach o _{j} ∈ L _{k} do
if S(o _{i}) < E(o _{j}) and S(o _{j}) < E(o _{i}) then
foreach c ∈ {α(o _{j}), α(o _{j}) + 1} do
α(o _{i}) ← c;
evaluate(o _{i}, L _{k});
foreach : do
Reassign its previous color: ;
end
end
end
if S(o _{i}) ≥ E(o _{j}) then
α(o _{i}) ← α(o _{j}) + 1;
evaluate(o _{i}, L _{k});
end
else
α(o _{i}) ← α(o _{j});
evaluate(o _{i}, L _{k});
end
α(o _{i}) = color;
foreach do
if then
Assign a direction to the edge ;
end
else
Assign a direction to the edge ;
end
end
Update the timing attributes using Algorithms 2 and 3
Remove from ;
end
end
end
Evaluate procedure.
Procedure evaluate(o _{i}, L _{k})
foreach do
;
Update the timing attributes using Algorithms 2 and 3;
end
Compute the new critical path and set σ′ the increase in the critical path;
if σ′ < σ then
color ← α(o _{i}); σ ← σ′;
end
return ;
4.1 Problem formulation
The acceleration of the cosimulation corresponds to the minimization of the makespan of the dependence graph. The makespan is the total execution time of the whole graph. The dependence graph that is fed as input to the scheduling algorithm is a DAG, therefore, it represents a partial order relationship in the execution of the operations, since two operations connected by an arc must be executed sequentially whereas the other ones can be executed in parallel. The scheduling algorithm makes decisions on allocating the operations to the cores while respecting this partial order and trying to minimize the total execution time of the dependence graph. In addition to the execution time of the operations, the scheduling algorithm has to take into considerations, the cost of intercore synchronization. The scheduling problem can be stated as an optimization problem as follows:

Input: Operation graph G _{F} (V _{F}, A _{F})

Output: Offline Schedule of operations on multicore processor

Find:

Allocation of operations to cores, α:V → P

Assignment of start times to operations,


Minimize: Makespan of the graph

Subject to: Precedence constraints of the graph G _{F} (V _{F}, A _{F})
4.2 Resolution using linear programming
In this section, we give our ILP formulation of the task scheduling problem for the acceleration of FMU cosimulation.
4.2.1 Variables and constants
Tables 3 and 4 summarize respectively the variables and the constants that are used in the ILP formulation of the scheduling problem for cosimulation acceleration.
Variables used in the ILP formulation of the acyclic orientation problem.
Constants used in the ILP formulation of acyclic orientation problem.
4.2.2 Constraints
We define the decision binary variables x _{ij} which indicate whether the operation o _{i} is allocated to core p _{j} or not. Expression (22) gives the constraint that each operation has to be allocated to one and only one core:
The end time of each operation o _{i} is computed using the expression (23):
For operations that are allocated to the same core and that are completely independent, i.e. no path exists between them, we have to ensure that they are executed in non overlapping time intervals. Expressions (24) and (25) capture this constraint. b _{ij} is a binary variable that is set to one if o _{i} is executed before o _{j}:
The cost of synchronization is taken into account as follows. A synchronization cost is introduced in the computation of the start time of an operation o _{j}, if it has a predecessor that is allocated to a different core and if its start time is the earliest among the successors of this predecessor that are allocated to the same core as the operation o _{j}. sync_{ijp} is a binary variable which indicates whether synchronization is needed between o _{i} and o _{j} if o _{j} is allocated to p. Therefore, . Expressions (26) and (27) capture this constraint. V _{ip} is a binary variable that is set to one only if . It is used to define for which cores a synchronization is needed between o _{i} and its successors, in other words, if the successor is allocated to the same core as o _{i}, no synchronization is needed. Expressions (28) and (29) capture this constraint. Variable Q _{ip} denotes the earliest start time among the start times of all successors of o _{i} that are allocated to processor p. It is computed using expressions (30) and (31):
The start time of each operation is computed using expression (32). The synchronization cost is introduced taking into account the synchronizations with all predecessors of o _{j} that are allocated to different cores:
The makespan is equal to the latest end time among the end times of all the operations as captured by expession (33):
4.2.3 Objective function
The objective of this linear program is to minimize the makespan of the dependence graph.
4.3 Multicore scheduling Heuristic
A variety of list multicore scheduling heuristics exist in the literature and each heuristic may be suitable for some specific kinds of multicore scheduling problems. We detail in this section a heuristic that we have chosen to apply on the final graph G _{F} (V _{F}, A _{F}) in order to minimize its makespan. Because of the number of finegrained operations, and since the execution times and the dependence between the operations are known before runtime, it is more convenient to use an offline scheduling heuristic which has the advantage of introducing lower overhead than online scheduling heuristics. We use an offline scheduling heuristic similar to the one proposed in [26] which is a fast greedy algorithm whose cost function corresponds well to our minimization objective. In accordance with the principle of list scheduling heuristics, this heuristic is prioritybased, i.e. it builds a list of operations that are ready to be scheduled, called candidate operations and selects one operation based on the evaluation of the cost function. We denote by ρ the cost function and call it the schedule pressure. It expresses the degree of criticality of scheduling an operation. The schedule pressure of an operation is computed using its flexibility and the penalty of scheduling which refers to the increase in the critical path resulting from scheduling an operation.
The heuristic is detailed in Algorithm 6 and considers the different timing attributes of each operation in order to compute a schedule that minimizes the makespan of the graph. The heuristic schedules the operations of the graph G _{F} (V _{F}, A _{F}) on the different cores iteratively and aims at minimizing the schedule pressure of an operation on a specific core while taking into account the synchronization costs. The heuristic updates the set of candidate operations to be scheduled at each iteration. An operation is added to the set of candidate operations if it has no predecessor or if all its predecessors have already been scheduled. For each candidate operation, the schedule pressure is computed on each core and the operation is allocated to its best core, the one that minimizes the pressure. Then, a list of candidate operationbest core pairs is obtained. Finally, the operation with the largest pressure on its best core is selected and scheduled. Synchronization operations are added between the scheduled operation and all its predecessors that were allocated to different cores. The heuristic repeats this procedure and finally stops when all the operations have been scheduled.
Multicore scheduling heuristic.
Input: Operation graph G(V, A), set of cores P;
Output: Schedule of operations o _{i} ∈ V on cores p _{k} ∈ P;
Set O the set of operations without predecessors;
Set sync the cost of one synchronization operation;
Set L _{k} : p _{k} ∈ P the length of schedule of core p _{k};
foreach p _{k} ∈ P do
;
end
while do
foreach o _{i} ∈ O do
; // Initialize the schedule pressure of o _{i}
;
foreach p _{k} ∈ P do
;
; // Start time of o _{i} if executed on p _{k}
foreach o _{j} ∈ pred(o _{i}) do
if o _{j} is scheduled on a core
then
syncCost ← syncCost + sync;
end
end
S(o _{i}) ← S(o _{i}) + syncCost;
E(o _{i}) ← S(o _{i}) + C(o _{i});
;
// Cost of o _{i} if executed on p _{k}
if ρ′ < ρ then
Set ρ ← ρ′;
BestCore(o _{i}) ← p _{k};
end
end
end
Find with maximal cost ρ in O;
Schedule on its core ;
;
;
Remove from the set O;
Add to the set O all successors of for which all predecessors are already scheduled;
end
4.3.1 Complexity
The scheduling heuristic contains three nested loops. The outermost loop is executed until all the operations are scheduled. At each iteration, one operation is scheduled. Therefore, the outermost loop is executed n times where n is the number of operations in the operation graph. In the inner loops, the heuristic attempts to schedule all the ready operations on all the available cores. As such, the inner loops execute in , where p is the number of cores. From the foregoing, the complexity of our heuristic is evaluated to .
4.4 Code generation
In this section, we describe how the FMU cosimulation code is generated based on the schedule tables produced by the proposed scheduling algorithms. Since the FMU cosimulation is intended to be executed on multicore desktop computers running general purpose or realtime operating systems, the implementation is achieved using native threads. Such threads consist in threads that are provided by the operating system in contrast to threads that are related to a specific programming language and/or rely on a specific runtime library.
In the generated code, as many threads are created as there are cores. Each thread is responsible for the execution of the schedule of one core. Therefore, each thread reads from the schedule table of its corresponding core and executes the operations that are specified in this table. These operations can be computational operations, i.e. input, output, and state operations, or synchronization operations. The synchronization operations are implemented using semaphores provided by the operating system. They are of two types: signal and wait operations. The execution of a signal operation by a thread consists in signaling the corresponding semaphore. The execution of a wait operation by a thread consists in waiting for the corresponding semaphore. Each thread executes its associated schedule table repeatedly, and thus executes FMU operations and synchronizes with the other threads. Hereafter, we refer to these threads as schedule threads.
The orchestration of the cosimulation is ensured by a master thread which runs the FMI master algorithm. The master thread creates and launches the schedule threads. During the execution, the master thread and the schedule threads are synchronized at fixed points. First, the master thread signals to the schedule threads the start of the cosimulation which launches their execution. Each thread starts, then, the execution of its associated schedule table as described in the previous paragraph. When it finishes the execution of the whole schedule table, it signals this to the master thread and waits for a new signaling from it. The master thread waits until all the schedule threads signal that they finished the execution of their respective schedule tables. Then, the master thread launches a new iteration by signaling to the schedule threads to start executing their corresponding schedule tables again. This process is repeated until the desired simulation time is reached.
5 Evaluation
In this section, we evaluate our proposed approach. We start by describing a method for randomly generating benchmark operation graphs. Then, we present the evaluation of the performances of the acyclic orientation and the scheduling heuristics. Finally, we give runtime performance and numerical accuracy results obtained by applying our approach on an industrial use case.
5.1 Random generator of operation graphs
Due to the difficulty in acquiring enough industrial FMU cosimulation applications for assessing our approach, we had to use a random generator of FMU dependence graphs. The generator creates the graphs and characterizes them with attributes. We set the parameters of this generator, e.g. the number of FMUs and the number of operations based on our experience using industrial FMU cosimulation. Of course, the use of randomness for synthetic graphs generation restraints the possibility of considering exactly the same results for industrial application. Our evaluation nevertheless gives some indications about the achievable level of performance for the heuristics and the graph scale from which the exact algorithms may fail to produce result with an acceptable execution duration.
5.1.1 Random operation graph generation
The random generator that we have implemented consists in two stages. In other words, we have to generate, first, the different FMUs of the cosimulation and their internal structures. Second, we generate the dependence graph by creating interFMU dependence in such a way that the resulting operation graph is a DAG. The proposed generator is based on a technique of assignment of operations to levels. The level of an operation is the number of operations on the longest path from a source operation to this operation. The dependence graph can then be visualized on a grid of levels as depicted in Figure 7. The generator uses the following parameters:

The graph size n: the number of operations;

The number of FMUs m;

The graph height h: the maximum number of levels in the graph;

The graph width w: the maximum number of nodes on one level.
Fig. 7 Random generation of an operation graph. 
Note that parameters n and m are related. In other words, for a given size of a graph n, an adequate number of FMUs m has to be chosen.
The generation of the dependence graph is performed as follows:

Input: Size of the graph n, number of FMUs m, height of the graph h, and width of the graph w.

Step 1: Randomly distribute the n operations across the m FMUs. Given the number of operations of each FMU, we randomly determine the number of its input operations and the number of its output operations. Every FMU has one state operation.

Step 2: Randomly generate the intraFMU arcs. This step is controlled by two parameters. The number of arcs to generate and the number of Non Direct Feedthrough (NDF) outputs of the FMU. These outputs are not considered when randomly generating the arcs.

Step 3: Randomly assign the operations to the grid levels. This step is performed by assigning output operations and then input operations repeatedly.

Assign all NDF operations to level 0 of the grid.

Randomly assign remaining output operations to even levels (2, 4,…, h−3) of the grid.

Assign the input operations to the odd levels (1, 3,…, h−4) of the grid such that any input operation o _{i} that is connected to an output operations o _{j} (intraFMU dependence) is assigned to the level preceding the level to which o _{j} has been assigned.

Assign the remaining input operations (each of which is not connected with any output operation) to the level h−2 of the grid. These operations will be connected only with the state operations of their respective FMUs.

Add the state operations to the last level of the grid.


Step 4: Create the arcs of the dependence graph. At this step, we randomly generate interFMU dependence. For each operation o _{i} on the level lvl of grid, we randomly select an output operation o _{j} from the preceding level lvl−1 and which belongs to a different FMU than o _{i}. We create an arc from o _{j} to o _{i}. If no such output operation is found at level lvl−1, we select randomly an output operation from any level lvl′ < lvl−1 and connect it with the operation o _{i}. Finally the arcs from input and output operations to state operations are created. Note that non oriented edges are automatically added between every pair of operations that belong to the same FMU and are mutually exclusive as described in Section 3.1.
Figure 7 illustrates the steps of our proposed random operation graph generator.
5.1.2 Random operation graph characterization
In addition to random generation of the dependence graph structure, we need to generate the attributes of the graph. In particular, the following attributes are generated by our random generator:

Communication steps of the FMUs: A range or a set for the values of the communication steps is specified. The generator randomly assigns a communication step from this range or set to every FMU.

Execution times of the operations: Different ranges of the execution times are specified for input, output, and state operations. Execution times are generated randomly in such a way that state operations have longer execution times than output and input operations.
5.2 Performances of the Heuristics
We have carried out different tests in order to evaluate our proposed approach. For both the acyclic orientation and the scheduling, we compared the execution time of our proposed heuristic with the execution time of the ILP, and the value of the objective function of the heuristic with the value of the objective function of the ILP. For ILP resolution, we used three solvers: lpsolve [27], Gurobi [28], and CPLEX [29]. With lpsolve, we were only able to solve small instances of the scheduling problem. Gurobi was much more efficient but we obtained the best performance using CPLEX. Therefore, the results presented hereafter were obtained using CPLEX. Tests were performed on a desktop computer with a 6core Intel Xeon processor running at 2.7 GH and 16GB RAM.
5.2.1 Execution time of the acyclic orientation algorithms
In order to compare the execution time of our acyclic orientation heuristic with the execution time of the acyclic orientation ILP, we have generated 200 random operation graphs of different sizes between 5 and 10 000. We considered 10 000 as the maximum size of the operation graph because it corresponds to the size of large industrial applications.
We executed the acyclic orientation heuristic and ILP on all of the generated random graphs and measured the elapsed time between the start and the end of the execution. For the ILP, the execution is stopped if the optimal solution is not found within two days. The obtained execution times are shown on a logarithmic scale in Figure 8. The acyclic orientation ILP cannot be resolved in practical times when the size of the operation graph exceeds 250. When the number of operations is less than 250 the ILP finds the optimal solution in reasonable times, except for two graphs. In addition, we observe that an increase in the graph size does not always result in an increase in the execution time. This can be explained by the fact that other factors impact the speed of resolution, e.g. number of conflict edges. Still, it is important to notice that the application of the acyclic orientation ILP is limited to relatively small graphs. On the other hand, the acyclic orientation heuristic produces results in practical execution times even for very large operation graphs (10 000).
Fig. 8 Comparison of the execution times of the acyclic orientation algorithms. 
5.2.2 Critical path length
We compared the values of the critical path length obtained using the acyclic orientation heuristic and ILP. Tests were performed using the same set of operation graphs described in the previous section. However, we consider only graphs for which the ILP was able to return the optimal solution within the resolution time limit that we set, i.e. two days. Thus, we applied our proposed heuristic and ILP on 12 operation graphs of sizes between 20 and 240 and saved the obtained length of the critical path. Results are depicted in Figure 9. For most of the operation graphs, our acyclic orientation heuristic produces a length of the critical path that is equal to the length of the critical path produced by the acyclic orientation ILP. The heuristic returns a longer length of the critical path for three graphs but the gap is very small remaining below 8%.
Fig. 9 Comparison of the critical path length. 
5.2.3 Execution time of the scheduling algorithms for cosimulation acceleration
Similarly to the acyclic orientation tests, we compared the execution time of the scheduling heuristic with the execution time of the scheduling ILP using 200 generated random operation graphs of different sizes between 5 and 10 000. We set a two day limit for the resolution of the ILP. Tests were run for the scheduling problem with 2, 4, and 8 cores. Execution times were measured by fixing the number of cores and varying the number of operations (graph size). The results are depicted for two and eight cores in Figures 10 and 11 respectively. All results are plotted on a logarithmic scale. In these figures, we see that the execution time of the ILP resolution increases exponentially as the graph size increases, and only small instances are resolved within acceptable times. On the other hand, the acyclic orientation heuristic is very fast and produces results in very short times and even for very large graphs, the execution times remain within practical bounds.
Fig. 10 Comparison of the scheduling execution time for two cores. 
Fig. 11 Comparison of the scheduling execution time for eight cores. 
5.2.4 Makespan
We run tests to compare the value of the makespan obtained using the acyclic orientation heuristic and ILP. For these tests we have generated ten operation graphs of size n = 15. We have used graphs of size 15 because the ILP resolution returns the optimal solution in very short times which is not the case for large graphs. The graphs are different from each other because they are generated randomly which leads to different topologies and execution times of the operations. We run the scheduling heuristic and ILP on these graphs to obtain the values of the makespan. Results are shown in Figures 12 and 13 for two and eight cores respectively. Overall, the results show that the scheduling heuristic produces a makespan which is very close to the makespan produced by the scheduling ILP. The gap between the heuristic and the ILP result lies between 0% and 16%. We notice that the gap is smaller when eight cores are used than when two cores are used. In fact, when two cores are used the maximum gap is 16%, whereas when four or eight cores are used the maximum gap is 6%. This shows that the scheduling heuristic performs better when the effective parallelism is increased. It can be explained by the fact that the scheduling heuristic attempts more allocation possibilities which leads to a better exploitation of the potential parallelism.
Fig. 12 Comparison of the makespan for two cores. 
Fig. 13 Comparison of the makespan for eight cores. 
5.3 Industrial use case
We tested our proposed approach on an industrial use case. Tests have been performed on a computer with an 8core Intel core i7 processor running at 2.7 GH with 16GB RAM. In the rest of this section, we first give a description of the use case and then present the tests and the obtained results.
5.3.1 Use case description
Our use case consists in a Spark Ignition (SI) RENAULT F4RT engine cosimulation. It is a fourcylinder in line Port Fuel Injector (PFI) engine in which the engine displacement is 2000 cm^{3}. The air path is composed of a turbocharger with a monoscroll turbine controlled by a wastegate, an intake throttle and a downstreamcompressor heat exchanger (Fig. 14). This cosimulation is composed of six FMUs: an FMU of the airpath, four FMUs of the four cylinders, and one FMU of the controller. The engine model was developed using Modelica. The engine model was imported into xMOD using the FMI export features of the Dymola^{1} tool. The operation graph of this usecase contains over 100 operations.
Fig. 14 Spark Ignition (SI) RENAULT F4RT engine model. 
5.3.2 Test campaign
We based our tests on three different versions of RCOSIM. We refer to our proposed method as MUORCOSIM (for MultiRate Oriented RCOSIM). We compared the obtained results with two approaches: The first one is RCOSIM which is monorate and thus we had to use the same communication step for all the FMUs. We used a communication step of 20 μs. The second one consists in using RCOSIM with the multirate graph transformation algorithm. We refer to it as MURCOSIM (for MultiRate RCOSIM). For MUORCOSIM and MURCOSIM we used the recommended configuration of the communication steps for this use case. For each cylinder, we used a communication step of 20 μs. The communication step used for the airpath is 100 μs. The airpath has slower dynamics than the cylinders and this configuration of the communication steps corresponds to the specification given by engine engineers. For each FMU, we used a RungeKutta 4 solver with a fixed integration step equal to the communication step. The graph of this use case is transformed by Algorithm 1 into a graph containing over 280 operations that are scheduled by the multicore scheduling heuristic.
5.3.3 Speedup
The speedup obtained using MUORCOSIM is compared with the speedups obtained using RCOSIM and MURCOSIM. The speedup was evaluated by running the cosimulation in xMOD. Execution times measurements were performed by getting the system time stamp at the beginning and at the end of the cosimulation. For a given run of the cosimulation, the speedup is computed by dividing the singlecore cosimulation execution time of RCOSIM by the cosimulation execution time of this run on a fixed number of cores. Figure 15 sums up the results. The same speedup is obtained using MUORCOSIM and MURCOSIM even when only one core is used. This speedup is obtained thanks to using the multirate configuration. More specifically, increasing the communication step of the airpath from 20 μs to 100 μs results in fewer calls to the solver leading to an acceleration in the execution of the cosimulation. By using multiple cores, speedups are obtained using both MUORCOSIM and MURCOSIM. Additionally, MUORCOSIM outperforms MURCOSIM with an improvement in the speedup of, approximately 30% when two cores are used, and approximately 10% when four cores are used. This improvement is obtained thanks to the acyclic orientation heuristic which defines an efficient order of execution for the operations of each FMU that are mutually exclusive. This defined order tends to allow the multicore scheduling heuristic to better adapt the potential parallelism of the operation graph to the effective parallelism of the multicore processor (number of cores) resulting in an improvement in the performance. MURCOSIM, on the other hand, uses the solution of RCOSIM which consists in simply allocating mutual exclusive operations to the same core introducing restrictions on the possible solutions of the multicore scheduling heuristic. When using eight cores, no further improvement is possible since the potential parallelism is fully exploited. Worse still, the overhead of the synchronization between the cores becomes counterproductive, which explains why the speedup with eight cores is less than the speedup with four cores for all the approaches. The best performance is obtained using five cores with slight improvement compared to using four cores.
Fig. 15 Speedup results. 
5.4 Comparison of offline and online scheduling
In our work, we adopted an offline scheduling heuristic assuming it is more efficient than online scheduling since it introduces lower overhead. This choice was based on the fact that the grain size of the operation graph is small which makes it unsuitable for online scheduling which involves more decision overhead in runtime than offline scheduling. In fact, the decision overhead in runtime may become much more costly then the execution of the operations. Moreover, the different operations perform different tasks and have different execution times in contrast to applications that exhibit data parallelism which could be efficiently handled by online scheduling. In addition, the execution times of the operations and the dependence between them are known before the execution which allows the application of offline scheduling. In order to confirm this assumption, we have compared our approach with a runtime scheduling approach, i.e. online scheduling. For this end, we have used Intel TBB library for the parallelization of the cosimulation. We performed several speedup tests and compared the results obtained using the two approaches.
5.4.1 Intel TBB flow graph
We have chosen Intel TBB to implement an online scheduling because it offers a programming interface introduced in Intel TBB 4.0, which allows easy parallelization of programs represented as graphs. It can be combined with loop parallelism supported by Intel TBB to further improve the parallelism exploitation. In Intel TBB, we distinguish between dependence graphs and data flow graphs. In dependence graphs, a dependence represents a precedence constraint between two nodes. During execution, this dependence acts as a signal to inform a node that a predecessor has finished its execution. In data flow graphs, a dependence is accompanied by data transfer from a predecessor to a node. In our implementation we used dependence graphs as explained hereafter. Intel TBB offers a wide range of classes that can be used to implement dependence graphs. In particular the graph class and other related classes are used for this purpose. In general, a dependence graph involves three main components: a graph object, nodes, and edges. A graph object provides methods for the execution of tasks created from the nodes of the graph and to wait for the execution of the dependence graph to finish. Provided node classes allow the creation of different types of nodes. These nodes can be classified into four categories: Functional, Buffering, Split/Join and others.
The user creates a graph, its nodes and then specifies dependence between them. The classes and functions of Intel TBB are highly parametrized to allow many possibilities of implementation.
The execution of the dependence graph follows the partial order specified by the created edges. When a node receives a signal of completion, a task is spawned to execute the body of this node.
We present here the fundamental concepts necessary to describe how we used Intel TBB. The official documentation^{2} of Intel TBB should be consulted for more detailed explanation.
5.4.2 Scheduling in Intel TBB
Intel TBB is based on programming with tasks instead of threads. Tasks are atomic units of execution that are allocated to threads to be executed. The objective is to make programming simpler by thinking at a higher level, i.e. specifying the potential parallelism of the program without having to handle the adaptation to the effective parallelism. The threads that run the tasks are called worker threads. The allocation is automatically done in runtime using an online scheduling algorithm known as work stealing. Each thread keeps a pool of tasks that are ready to be executed in a deque which is a doubleended queue. Elements can be pushed onto or popped from a deque from both ends. Threads are responsible for task creation, known as task spawning. When a task is spawned by a thread, it is pushed onto the deque of this thread from the top. The thread always pops the task on the top of its deque and executes it. As such, a thread uses its local deque as a stack. If the local deque is empty, the thread tries to pick a task from another randomly chosen thread, called the victim. It pops a task from the bottom of the deque of the victim thread, therefore using the deque of the victim as a queue.
In the case of an application implemented as a dependence graph, tasks are spawned on behalf of the nodes of the graph. When a node receives messages from all its predecessors, a task is spawned on behalf of this node. When run, this task executes the body of the node. When a task finishes its execution it sends a message that is transferred to its predecessors.
5.4.3 Implementation
We used Intel TBB to implement parallel FMI cosimulation in xMOD. The first part which consists in creating the operation graph through the analysis of inter and intraFMU dependence is the same as in RCOSIM. If the cosimulation is multirate, the multirate graph transformation is performed as well. Once the operation graph is constructed, an Intel TBB dependence graph which represents this operation graph is automatically created. The graph is of a dependence graph type because we do not manage explicitly data transfer between the different operations since the functions of the FMUs are provided in the form of binaries. Data transfer is implicitly managed by the partial order defined in the operation graph. In other terms, an operation that produces data is necessarily executed before the operation that consumes it. Data writing and reading is done through shared memory and is hidden from the developer. It follows from this that data flow graphs provided by Intel TBB are not suitable for representing such cosimulations because they require explicit management of data transfer between the nodes.
The creation of the dependence graph is done as follows: First, a graph is created and then nodes and edges are added to this graph. A node is created for each operation and added to an array that stores all the created nodes. The first node that is created is a source node which has no predecessor. This node becomes a predecessor of all the nodes that have no predecessor in the operation graph. The body of this node contains the initialization of the cosimulation. Then, for each operation in the operation graph, a function node is created. A function node can have multiple ports to be connected with multiple predecessors and successors. The body of each function node contains the FMU function calls of the corresponding operation. Finally, the edges that connect the nodes are added to the graph. All the created edges are of type continue message. Such edges are used to signal that the execution finished.
The execution of the cosimulation consists in executing this dependence graph repeatedly, similarly to our offline scheduling approach, i.e. the whole graph is executed at each iteration before a new execution can begin. Initially, one thread is responsible of the creation of the dependence graph and launching the source node. Only the source node, which performs the initialization of the cosimulation, is executed explicitly using a function provided by the Intel TBB library. When this function is called, a task is spawned to execute the body of the source node. Afterward, the runtime library handles the flow of messages in the graph. When the execution of the source node body is finished, it sends a continue message to all its predecessors. Tasks are spawned for the nodes that receive the messages to be executed which in turn send continue messages when their execution is finished and so forth. After all the nodes are executed, the execution is restarted in the same way. The scheduling is managed by the runtime library which creates a pool of working threads and uses the work stealing algorithm described above.
5.4.4 Comparison
We implemented a parallelization approach of FMI cosimulations using Intel TBB for the purpose of comparing it with our proposed offline scheduling approach. We have measured the speedups obtained on different numbers of cores using both approaches. First of all, let’s summarize the differences between the two approaches. Figure 16 illustrates the main steps of both approaches. As stated above, the two first two phases which consist in the construction of the operation graph and the graph transformation in the case of a multirate cosimulation are performed in the same way in both approaches. If online scheduling is used, the next step is execution. On the other hand, if offline scheduling is used, two more phases are performed before the execution. The acyclic orientation heuristic is applied on the operation graph to handle mutual exclusion constraints. After this, the offline scheduling heuristic is used to compute a schedule of the operations. During execution, in both the offline and online scheduling approaches, a thread is executed on each core. In the case of offline scheduling, each thread reads the schedule of the corresponding core and executes the operations in the order of this schedule, which does not change during execution. In the case of online scheduling, since no schedule is computed before execution, the runtime library distributes the operations across the threads during execution in such a way to balance the load. Each thread pushes the operations onto its deque from the top. It executes these operations by popping the operation on the top from its deque, or if its deque is empty, it steals work from another thread by popping an operation for the bottom of this victim thread. Mutual exclusion constraints are handled in online scheduling using lightweight mutex locks provided by the runtime library. These locks have lower cost than mutex locks provided by the OS.
Fig. 16 Comparison of the different phases of the offline and online scheduling approaches. 
We run the cosimulation of the industrial use case on an 8core Intel core i7 processor running at 2.7 GH with 16GB RAM. The obtained speedup using offline scheduling raises 2.44 and is better than the one obtained using online scheduling (1.64) which confirms our assumption. The decision overhead of online scheduling is very costly compared to the execution times of operations which decreases the performance.
6 Discussion
In order to discuss the advantages and drawbacks of the proposed approach, we challenge here our methodology in terms of compliance with cosimulation scenario, application to industrial use case and optimality metrics. First, in Section 2 we explain that RCOSIM and our proposed extension is applied to cosimulation of connected models, where each model is imported as a FMU with its own fixed step solver. Since we aim at proposing solution to accelerate industrial cosimulation, we choose to focus on cosimulation solution where all models are imported into a unique tool. Indeed, optimizing a multisoftware cosimulation, from different tool vendors, would not be possible without accessing the code of each simulation tool. On the opposite, the FMI standard meets a real success and most modeling and simulation tools offer now the possibility to export models as FMUs. By focusing on what could be done in the importing software, we keep a large set of possibilities as a real capability to experiment. Our industrial cosimulation goal also prevents us to propose solution including models modification. In large companies or when different parties provide different components of a system, numerous models are available. Each of them was built, validated and used by domain experts. At the cosimulation stage of the development process, the engineers are rarely able to understand all the models, and sometimes they do not have access to the initial model but only to the FMU.
In our approach we restricted to FMU with fixed step solver, mainly because beyond the cosimulation acceleration problem, the next step of our work is to perform the cosimulation under realtime constraints. This goal, which is out of scope of this article, leads us to restrict to cosimulation scenarios where computation duration can be evaluated and bounded. With variable step solvers, the number of computation phases is variable and consequently the computation duration over a simulation time interval is difficult to evaluate. That prevents to ensure that realtime constraints will be met. Nevertheless, if we discard our final objective, our solution could be applied to cosimulation of FMU with variable step solver. Indeed, if for each FMU i with a variable step solver, we can set a fixed macro step value H _{i}, all the previous described process can be applied. The macrostep parameter for a variable step solver forces a regular rate at which computation have to be done. But the solver remains free to cut the macrostep H _{i} in several steps, with variable step sizes. In our approach, all these intermediate computations over the macrostep would be considered as a unique operation.
Since we describe the connected FMU with a directed acyclic graph, it could prevent to apply our methodology for cosimulation scenarios containing algebraic loops (and consequently cycles in the corresponding graph). We propose to handle algebraic loops by inserting a delay function for breaking each cycle, then solving the corresponding initialization problem to be able to parameter the output delivered by the delay functions at the first step.
One limitation of our approach when dealing with industrial use cases could be the size of the HyperStep (HS), the least common multiple of the communication step sizes. Indeed, in the case of a large number of FMUs with communication steps sharing few common factor, HS could be a large number. In our solution, it directly impacts the size of the dependence graph which has to be scheduled. In order to handle this limitation, we suggest, if the communication step cannot be harmonized, to split the cosimulation senario in several ones and apply RCOSIM principles on each part. The partition choice would be made for reducing the HS of each part. Thanks to our approach, it is easy to find how many cores are sufficient to reach the maximum speed up for each subscenario. Then the number of cores could be partitioned, one partition for each scenario. At execution, data exchange between the different cosimulation parts would have to be sequenced by adding synchronization mechanisms.
We do not deny that our solution have a cost in terms of computation time, even if the proposed heuristics offer an interesting tradeoff between optimality and computation time. If a cosimulation execution takes few seconds on a monocore, this is obviously unnecessary to lose time to compute a distributed schedule before launching it. Our approach could be easily implemented to launch the cosimulation scenario on one core while, after profiling the operation cost (from the current monocore execution), another process on another core could compute the different steps of our solution. Then, if the cosimulation is not over, the execution could switch from the initial monocore execution to the one optimized for multicore. This efficient schedule could also be recorded to be used in the next launch of the same scenario.
Our evaluation methodology could rise some observations. With automatic graph generator, we choose to test our solution on a large amount of graph rather than a few number of industrial case studies. The goal is to compare our heuristics performance to an exact solution algorithm. One may argue that the optimality comparison suffers that the exact ILP algorithm is quickly unable to find a solution when the number of operations increase and consequently we do not prove that heuristics continue to give results close to the optimal solution when graphs become bigger. However, the reader should keep in mind that our goal is to accelerate the cosimulation. Optimality is not targeted since we show that optimal solution search is too costly in terms of time. The comparison is consequently more relevant with monocore cosimulation (given by the speedup factor) or with a totally online scheduling algorithm. The presented industrial usecase tends to show that our proposed approach offers a real speedup compared to these two basic solutions. Of course, numerous parameters of the cosimulation scenario could increase the computation cost of our solution. First, the maximal number of operations in one FMU is a factor, leading to increase the time spent by the acyclic orientation heuristic. Second, as previously discussed, the HS length compared to each H _{i} directly grows the number of operations to schedule. Finally, scheduling heuristics are sensitive to the number of cores.
7 Conclusion
The complexity of cyberphysical systems is steadily increasing due to several factors. A lot of efforts is being made in industry as well as in academia in order to implement technologies and methods that respond to the requirements and challenges in the design of complex CPS. Cosimulation is increasingly being adopted as a systemlevel simulation approach in the context of CPS design thanks to its advantages over monolithic simulation. Strengths of cosimulation include easy upgrade, reuse, and exchange of models and simulators, improved computational performance compared to monolithic simulation, and allowing better intervention of experts at the subsystem level in multidomain design projects.
In this article, we are interested in the rising requirements on the computational performance of FMI cosimulation. Precisely, we look for proposing solution to efficiently exploit multicore processor, available on every computer, to accelerate desktop cosimulation. We build on the work that was previously developed at IFP Energies nouvelles and aim at improving the existing RCOSIM method.
We propose extensions to the operation graph model used in RCOSIM to represent the cosimulation. The first extension targets multirate cosimulation. We propose some rules for transforming a multirate operation graph into a monorate one in order to prepare its multicore scheduling. Based on these rules, we propose an algorithm that performs this transformation. The second extension consists in transforming the operation graph in order to handle mutual exclusion constraints between operations. First, the operation graph is transformed into a mixed graph by adding (non oriented) edges between mutually exclusive operations. Then, an acyclic orientation is computed for the mixed graph by assigning a direction to each edge. We propose two algorithms to perform the acyclic orientation: an ILPbased exact algorithm and a heuristic. Then we propose a multicore scheduling algorithms for cosimulation acceleration. For this we propose two multicore scheduling algorithms. The first algorithm is an ILPbased exact algorithm and the second one is a list scheduling heuristic. The schedule is computed using either of these algorithms over the hyperstep. During execution, this schedule is executed repeatedly. Finally, we evaluate our proposed approach. First, we propose a random generator of operation graphs. We use this graph to generate a large number of synthetic operation graphs of different sizes and structures and with different attributes. We evaluate the performances of our proposed ILPbased exact algorithms and heuristics for the acyclic orientation and scheduling for cosimulation acceleration. The obtained results show the efficiency of our heuristics. While the proposed ILP algorithms give optimal results for small operation graphs they suffer from intractable execution times. Our proposed heuristics, on the other hand, give acceptable results within acceptable execution times. Last, we validate our approach for cosimulation acceleration against an industrial use case. The obtained results show the improvements made thanks to using multirate cosimulation and also using the acyclic orientation to handle mutual exclusion constraints. In addition, we compare our approach with a runtime (online) scheduling approach. Our approach outperforms it which consolidates our choice of adopting an offline scheduling approach.
References
 Lee E.A., Seshia S.A. (2016) Introduction to embedded systems: A cyberphysical systems approach, Mit Press, Cambridge, MA. [Google Scholar]
 Van der Auweraer H., Anthonis J., De Bruyne S., Leuridan J. (2013) Virtual engineering at work: The challenges for designing mechatronic products, Eng. Comput. 29, 3, 389–408. [Google Scholar]
 Kübler R., Schiehlen W. (2000) Modular simulation in multibody system dynamics, Multibody Syst. Dyn. 4, 2–3, 107–127. [Google Scholar]
 Blochwitz T., Otter M., Arnold M., Bausch C., Clauß C., Elmqvist H., Junghanns A., Mauss J., Monteiro M., Neidhold T., Neumerkel D., Olsson H., Peetz J.V., Wolf S. (2011) The functional mockup interface for tool independent exchange of simulation models, in: Proc. of the 8th International Modelica Conference, March 20–22, Dresden, Germany. [Google Scholar]
 Gomes C., Thule C., Broman D., Larsen P.G., Vangheluwe H. (2017) Cosimulation: State of the art. arXiv preprint arXiv:1702.00686. [Google Scholar]
 Sirin G., Paredis C.J.J., Yannou B., Coatanéa E., Landel E. (2015) A model identity card to support simulation model development process in a collaborative multidisciplinary design environment, IEEE Syst. J. 9, 4, 1151–1162. [CrossRef] [Google Scholar]
 Sutter H. (2005) The free lunch is over: A fundamental turn toward concurrency in software, Dr. Dobb’s J. 30, 3, 202–210. [Google Scholar]
 Gebremedhin M., Hemmati Moghadam A., Fritzson F., Stavaker K. (2012) A dataparallel algorithmic Modelica extension for efficient execution on multicore platforms, in: Proc. of the 9th International Modelica Conference, September 3–5, Munich, Germany. [Google Scholar]
 Elmqvist H., Mattsson S.E., Olsson H. (2014) Parallel model execution on many cores, in: Proc. of the 10th International Modelica Conference, March 10–12, Lund, Sweden. [Google Scholar]
 Galtier V., Vialle S., Dad C., Tavella J.P., LamYeeMui J.P., Plessis G. (2015) FMIbased distributed multisimulation with DACCOSIM, in: Proc. of the Symposium on Theory of Modeling & Simulation: DEVS Integrative M&S Symposium, April 12–15, Alexandria, Virginia, pp. 804–811. [Google Scholar]
 Ben Khaled A., Ben Gaid M., Pernet N., Simon D. (2014) Fast multicore cosimulation of cyberphysical systems: Application to internal combustion engines, Simulat. Pract. Theory 47, 79–91. [CrossRef] [Google Scholar]
 Saidi S.E., Pernet N., Sorel Y., Ben Khaled A. (2016) Acceleration of FMU cosimulation on multicore architectures, in: Proc. of the First Japanese Modelica Conference, May 23–24, Tokyo, Japan. [Google Scholar]
 Darte A., Robert Y., Vivien F. (2012) Scheduling and automatic Parallelization, Springer Science Business Media, Berlin, Germany. [Google Scholar]
 Kohler W.H. (1975) A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems, IEEE T. Comput. 100, 12, 1235–1238. [CrossRef] [Google Scholar]
 Ramamritham K. (1995) Allocation and scheduling of precedencerelated periodic tasks, IEEE T. Parall. Distr. 6, 4, 412–420. [CrossRef] [Google Scholar]
 Blażewicz J., Pesch E., Sterna M. (2000) The disjunctive graph machine representation of the job shop scheduling problem, Eur. J. Operation. Res. 127, 2, 317–331. [CrossRef] [Google Scholar]
 Gallai T. (1968) On directed paths and circuits, in: Theory of Graphs, Academic Press, New York, NY, pp. 115–118. [Google Scholar]
 Roy B. (1967) Nombre chromatique et plus longs chemins d’un graphe, Revue française d’informatique et de recherche opérationnelle 1, 5, 129–132, in French. [Google Scholar]
 Vitaver L.M. (1962) Determination of minimal coloring of vertices of a graph by means of Boolean powers of the incidence matrix, Dokl. Akad. Nauk SSSR+ 147, 758–759. [Google Scholar]
 Hasse M., Reichel H. (1966) Zur algebraischen Begründung der Graphentheorie. III, Math. Nachr. 31, 5–6, 335–345, in German. [CrossRef] [Google Scholar]
 Karp R.M. (1972) Reducibility among combinatorial problems, in: Complexity of Computer Computations, Springer, Boston, MA, pp. 85–103. [Google Scholar]
 Ries B. (2007) Coloring some classes of mixed graphs, Discrete Appl. Math. 155, 1, 1–6. [Google Scholar]
 Andreev G.V., Sotskov Y.N., Werner F. (2000) Branch and bound method for mixed graph coloring and scheduling, in: Proc. of the 16th International Conference on CAD/CAM, Robotics and Factories of the Future, CARS and FOF, June, Port of Spain, Trinidad, West Indies, pp. 1–8. [Google Scholar]
 Sotskov Y.N., Tanaev V.S., Werner F. (2002) Scheduling problems and mixed graph colorings, Optimization 51, 3, 597–624. [Google Scholar]
 AlAnzi F.S., Sotskov Y.N., Allahverdi A., Andreev G. (2006) Using mixed graph coloring to minimize total completion time in job shop scheduling, Appl. Math. Comput. 182, 2, 1137–1148. [Google Scholar]
 Grandpierre T., Lavarenne C., Sorel Y. (1999) Optimized rapid prototyping for realtime embedded heterogeneous multiprocessors, in: Proc. of the 7th International Workshop on Hardware/Software CoDesign, CODES’99, March 3, Rome, Italy. [Google Scholar]
 Berkelaar M., Eikland K., Notebaert P. (2004) lpsolve: Open source (mixedinteger) linear programming system, Eindhoven University of Technology, Eindhoven, Netherlands. [Google Scholar]
 Gurobi Optimization, LLC (2016) Gurobi optimizer reference manual, Beaverton, Oregon, USA. [Google Scholar]
 IBM ILOG (2017) CPLEX User’s Manual, IBM Corp. Armonk, New York, USA. [Google Scholar]
All Tables
All Figures
Fig. 1 Evolution of time and data exchange between two FMUs during cosimulation. 

In the text 
Fig. 2 InterFMU dependence specified by the user. 

In the text 
Fig. 3 IntraFMU dependence provided by FMI. 

In the text 
Fig. 4 Operation graph obtained from the FMUs of Figure 3. 

In the text 
Fig. 5 Example of an operation graph. 

In the text 
Fig. 6 Graph obtained by applying the multirate transformation algorithm on the graph of Figure 4. 

In the text 
Fig. 7 Random generation of an operation graph. 

In the text 
Fig. 8 Comparison of the execution times of the acyclic orientation algorithms. 

In the text 
Fig. 9 Comparison of the critical path length. 

In the text 
Fig. 10 Comparison of the scheduling execution time for two cores. 

In the text 
Fig. 11 Comparison of the scheduling execution time for eight cores. 

In the text 
Fig. 12 Comparison of the makespan for two cores. 

In the text 
Fig. 13 Comparison of the makespan for eight cores. 

In the text 
Fig. 14 Spark Ignition (SI) RENAULT F4RT engine model. 

In the text 
Fig. 15 Speedup results. 

In the text 
Fig. 16 Comparison of the different phases of the offline and online scheduling approaches. 

In the text 