Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs

. Python is a popular programming language due to the simplicity of its syntax, while still achieving a good performance even being an interpreted language. The adoption from multiple scientiﬁc communities has evolved in the emergence of a large number of libraries and modules, which has helped to put Python on the top of the list of the programming languages [1]. Task-based programming has been proposed in the recent years as an alternative parallel programming model. PyCOMPSs follows such approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. Also, we present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that signiﬁcant performance can be obtained with a few lines of Python.


Introduction
When faced with the selection of a programming language, programmers may take into account aspects such as: syntax simplicity, compilers and tools available, performance attainable, etc.However, other factors such as: language used by the community, libraries specific to the application field, current trends, etc.; can be even more important.In this sense, Python [2] currently fulfills a lot of the requirements: easy to program, good performance trade-off and has a large number of third-party libraries available.Examples of such libraries are NumPy [3] and SciPy [4], that offer vectorized data structures and numerical routines.NumPy is a de facto standard when working with tensors in Python due to the high performance achieved, its ease of use and because it automatically maps operations on vectors and matrices to the BLAS [5] and LAPACK [6] functions when present in the system.When a multi-threaded BLAS version is present in the system (using OpenMP [7] or TBB [8]), the operations are automatically parallelized.
Python is an interpreted language and CPython its most common interpreter.A very well-known limitation of CPython is the use of a Global Interpreter Lock (GIL) which disables concurrent Python threads within one process.This basically means that, although threads are supported in Python, only one will execute at a time.
There are multiple modules that have been developed to provide parallelism in Python, such as the multiprocessing, Parallel Python (PP) and MPI modules.The multiprocessing module provides [9] support for the spawning of processes in SMP machines using an API similar to the threading module, with explicit calls to create processes.On the other hand, the Parallel Python (PP) module [10] provides mechanisms for parallel execution of Python codes, with an API with specific functions to specify the number of workers to be used, submits the jobs for execution, gets the results from the workers, etc.Finally, the mpi4py [11] library provides a binding of MPI for Python which allows the programmer to open parallelism both inter-node and intra-node.In all cases, the management of the parallelism is the programmer's responsibility.
PyCOMPSs [12] is a task-based programming model that offers an interface on Python that follows the sequential paradigm.It enables the parallel execution of applications' tasks by means of building, at execution time, a data dependency graph of the tasks that compose an application.The syntax of PyCOMPSs is minimal, using decorators to enable the programmer to identify those methods that are tasks and a small API for synchronization.PyCOMPSs relies on a Runtime that can exploit the inherent parallelism at task level and execute the application in a distributed parallel platform (clusters and clouds).The Runtime is responsible for scheduling the tasks in the available computation resources, performing the necessary data transfers between distributed memory resources when no shared data system is available, synchronizing all activities, acting as an interface with different computing resources such as cloud middlewares, etc.The Runtime also gives 0support to the new container technologies, such as Docker or Singularity [13].
There are other libraries and frameworks that enable Python distributed and multi-threaded executions such as Dask [14] and PySpark [15].Dask is a native Python library that allows both the creation of custom DAG's and the distributed execution of a set of operations on NumPy and pandas [16] objects.PySpark is a binding to the widely extended framework Spark [17].A previous paper compares several Big Data algorithms using the native version of both COMPSs 1 and Spark runtimes [18], showing that COMPSs is able to get better or competitive results in comparison to Spark.
This paper presents PyCOMPSs functionalities through numerical codes such as matrix multiplication and several matrix factorizations.The main contributions are the extensions to the PyCOMPSs programming model to support multi-threaded libraries, a new scheduling infrastructure, the support for multiple tasks' versions, and its support to heterogeneous architectures (Xeon Phi and GPUs).This set of extensions allows to achieve good performance with a moderate encoding effort, enabling non expert users to reach HPC behaviors even under Big Data conditions (i.e. when data is already present in the infrastructure instead of being initialized for the computation).Moreover, the same code can be executed in a multi-threaded way on a single machine, in the cloud or an heterogeneous cluster, making it highly portable.
An additional contribution of the paper is a set of linear algebra kernels written in PyCOMPSs, which conform a prototype of a parallel linear algebra library that would be made available for the community in the future.
The rest of the paper is organized as follows: Section 2 presents PyCOMPSs and its features, Section 3 describes the linear algebra kernels, Section 4 presents performance results, and Section 5 concludes the paper and gives some guidelines for future work.

PyCOMPSs overview
PyCOMPSs is a task-based programming model that aims to make it easier the development of parallel applications, targeting distributed computing platforms.It relies on the power of its Runtime to exploit the inherent parallelism of the application at execution time by detecting the task calls and the data dependencies between them.
As shown in Figure 1, the PyCOMPSs Runtime allows applications to be executed on top of different infrastructures (such as multi-core machines, grids, clouds or containers) without modifying a single line of the application's code.Thanks to the different connectors, the Runtime is capable of handling all the underlying infrastructure so that the users only define the tasks.It also provides faulttolerant mechanisms for partial failures (with job resubmis-sion and reschedule when task or resources fail), has a live monitoring tool through a built-in web interface, supports instrumentation using the Extrae [19] tool to generate postmortem traces that can be analyzed with Paraver [20,21], has an Eclipse IDE, and has pluggable cloud connectors and task schedulers.
Moreover, since the PyCOMPSs Runtime is written in Java [22], Python syntax is supported through a binding, PyCOMPSs.This Python binding is supported by a Binding-commons layer which focuses on enabling the functionalities of the Runtime to other languages (currently, Python and C/C++).It has been designed as an API with a set of defined functions.It is written in C and performs the communication with the Runtime through the JNI [23].
Regarding the programmability, Tasks are identified by the programmer using simple annotations in the form of Python decorators, which indicate that invocations of a given method will become tasks at execution time.The @task decorator also contains information about the directionality of the method parameters specifying if a given parameter is read (IN), written (OUT) or both read and written in the method (INOUT).Figure 2 shows an example of a task annotation.The parameter c is of type INOUT, and parameters a, b, and MKLProc are set to the default type IN.The directionality tags are used at execution time to derive the data dependencies between tasks and are applied at an object level, taking into account its references to identify when two tasks access the same object.Furthermore, the priority tag is a hint for the PyCOMPSs' scheduler that will force to execute the tasks with this tag earlier, always respecting the data dependencies.
Additionally to the @task decorator, the Óconstraint decorator can be optionally defined to indicate some task hardware or software requirements.Continuing with the previous example, the task constraint Comput-ingUnits shows to the Runtime how many CPUs are consumed by each task execution.The available resources are defined by the system administrator in a separated XML configuration file.Other constraints that can be defined refer to processor architecture, memory size, etc.A tiny synchronization API completes the PyCOMPSs syntax.As shown in Figure 3, the API function compss_wait_on waits until all the tasks modifying the result's value are finished and brings the value to the node executing the main program.Once the value is retrieved, the execution of the main program code is resumed.Given that PyCOMPSs is used mostly in distributed environments, synchronizing may imply a data transfer from a remote storage or memory space to the node executing the main program.

PyCOMPSs Runtime
The PyCOMPSs Runtime handles the execution of the applications in the computing infrastructure.The computing infrastructure is composed of several heterogeneous nodes, and the execution is orchestrated following the master-worker paradigm, where the main program is started on the master node and tasks are offloaded to worker nodes.In the most general case, the node allocating the master node will also allocate a worker node.
As depicted in Figure 4, the Runtime first builds a task graph adding a new node to it every time a task is invoked in the application's code.The directionality of the parameters is used to detect the data dependencies between the new task and previous ones.Secondly, the scheduler will analyze the generated graph in a particular way to execute all the workload among the available resources.We must highlight that this analysis is highly dependent on the different scheduler implementations, but the Runtime provides information about all the data locations so that they can exploit the data locality.Eventually, when a task is scheduled to a given resource, the required objects and files are transferred directly between different memory spaces to guarantee that tasks have available their parameters before execution without passing through the master.Finally, the task is executed in the worker resource and, when specified by the synchronization API, the results are gathered back to the master resource (where the main code is running).
Only concerning the PyCOMPSs binding, when a transfer between different memory spaces is required, the binding serializes and writes the object to disk using the standard library Pickle.The transfer between different resources is then delegated to the Runtime.
Finally, the available Computing Units that each resource can offer to the Runtime is configurable.More specifically, this allows to oversubscribe the amount of work that a resource can receive; meaning that the Runtime can create more threads than the real amount of CPUs that the resource has.

Interaction with external libraries
The PyCOMPSs Runtime supports the execution of multithreaded tasks using the constraint interface.The number of cores assigned to a multi-threaded task can be indicated by the programmer in the ComputingUnits constraint tag.The PyCOMPSs scheduler can assign several cores to a given multi-threaded task.On the other hand, although support for tasks that use several nodes has been added recently, in this work we only consider tasks executing inside a single node.
Before this work, the cores were assigned blindly to the tasks.The performance results observed were relatively poor when running numerical applications, such as those using the NumPy or SciPy libraries that link to the IntelÒMKL library [24].It has been shown that, by default, IntelÒMKL tends to occupy the entire node when the multi-threading is enabled.Not considering this fact can result in a heavy oversubscribing.In addition, each task can be executed in several NUMA sockets.This fact increases the amount of transfers between the different NUMA-nodes, decreasing the performance dramatically.Knowing that this behavior can be found in other libraries, the problem has been solved in a general way.
We have modified the PyCOMPSs task executor in such a way that it is currently able to bind multi-threaded tasks to specific computing units of the infrastructure.This fact, combined with the capability to define the nodes' virtual amount of computing units, allows the user to achieve the desired rate of oversubscribing.However, this is not done blindly: the PyCOMPSs Runtime distributes the tasks evenly between the different NUMA sockets, avoiding at the maximum the transfers between memory spaces.

Scheduling infrastructure
PyCOMPSs Runtime has been extended with a scheduling infrastructure that supports pluggable scheduling policies.Almost all the tests presented in this paper are based on a data locality scheduler that takes into account the node that stores the data accessed by the tasks.More precisely, a task will have a score equal to the amount of input data present in a given node.It is important to realize that even if a centralized task manager or task scheduler can be a bottleneck in some applications, we are tacking distributed computing platforms and the time required for data transfers can be an issue as well, so we are assuming that tasks should have a given granularity that is able to hide the task manager overhead.
Defining a new score policy is enough to change the scheduler behavior.It will prioritize the tasks with the highest score for a given combination of resource, implementation, and data.In addition to the data locality score, three more policies have been defined: First In First Out (FIFO), Last In First Out (LIFO) and data locality with priority to tasks with a shared edge in the dependency graph with the finished task (FIFOData).In this last policy, there are two different scenarios.In the case where there are tasks freed by the job that has just finished, one of them is scheduled in First In First Out order; even before treating the tasks that are already free.Otherwise, data locality is considered between all the available tasks.The first two policies (LIFO, FIFO) have served to probe the robustness of the scheduling system.The third one can be seen as a relaxation of the data locality scheduler to lighten the amount of needed comparisons to schedule a task.
The available schedulers allow the users to configure the execution depending on the expected load.

Python persistent workers
In previous Runtime versions, PyCOMPSs was enhanced with a persistent Java worker, meaning that a Java worker process was started at the beginning of the application execution, communicating with the master to get information about the tasks to be executed and data transfers to be performed.However, every time a Python task was invoked, a new Python interpreter was launched.This process has been enhanced with the implementation of Python persistent workers.
More in detail, the PyCOMPSs worker module has been modified on top of the Python's built-in multiprocessing library.When the application execution begins, the primary worker process in each worker node spawns a set of processes that will be responsible for executing the tasks.These processes are kept alive during the whole application execution and communicate with the Java persistent worker through pipes.The messages that they exchange include information about the task execution requests, job parameters, and computation results.This feature improves the overall performance by reducing the overhead of deploying a new Python interpreter per task.Besides, modules loaded by previous tasks are already present in the interpreter and do not need to be reloaded.

Methods' polymorphism
GPUs are clearly faster than CPUs for some applications.Nevertheless, it is not always easy to decide whether it is better to use one architecture or the other [25].Also, FPGAs are gaining some momentum.In this context, projects with the primary focus of interest on heterogeneous architectures are arising [26].Hence, it seems reasonable to think that, in both HPC and Big Data contexts, we are going towards environments with heterogeneous architectures.
PyCOMPSs can manage those cases by providing support for the definition of different versions of the same method for different architectures.The programmer can use the Óimplements decorator to indicate that a method implements the same behavior than another.Figure 5 shows an example of polymorphism, which together with the Óconstraint decorator allows to indicate to the Runtime that some tasks can only be executed in a given set of computing resources.In fact, using polymorphism and tasks' constraints, the users can define CPU, GPU or FPGA versions of the same task.
Internally, at the beginning of the execution, the Runtime will blindly execute all the available versions that can run in a given resource in order to obtain an execution profile per version.Afterwards, the Runtime is capable to use the profiled information to choose the implementation with the lowest execution time.In this initial version, data size is not taken into account and we take the hypothesis that each kind of implementation is better or worst in the general case.There are no particularities considered capable to combine both implementations depending on the morphology of the input data, even if it would be really interesting and is can be considered in the future work.

Profiling
PyCOMPSs generates postmortem traces under demand using Extrae [19].These files can be explored with Paraver [20,21], obtaining visual information to make easier the code performance fine tuning.Some specific PyCOMPSs events have been added in order to differentiate the different steps done by the master and the workers.More precisely, it is possible to see the different actions performed by a worker each time that a task is executed.
Finally, the dependency graph generated can be plotted at the end of the computation or be explored on runtime with the monitor.

Linear algebra codes
We have evaluated PyCOMPSs with several linear algebra codes.It is important to keep in mind that PyCOMPSs is a general purpose programming model, not a specific one for dense linear algebra computations [27,28].Nevertheless, linear algebra algorithms are the base for several fields such as Machine Learning and Computational physics.Even if other good options like ScaLAPACK [29] already do this job, none of them can be called directly from Python.In general, these libraries are coded in C or Fortran.They are really low level languages that demand a high encoding effort.In return their can achieve really high performances.On the other hand, PyCOMPSs allows the user to have working versions of an algorithm really quick.In addition, the dynamic runtime scheduling empowers the resource efficiency increasing the performance.It is so perfect for prototyping and verifying new concepts and ideas and, as will be shown, can even be competitive under the efficiency point of view.
In general, the matrices are chunked in smaller square matrices (blocks) to distribute the data easily along the available resources and take advantage of this fact to consider the square blocks as the minimum entity to work with [30].All the operations performed on the blocks use the best library available for each architecture, either IntelÒMKL or PyCUDA [31] through skcuda [32].
The following schema has been pursued for all the computations.The initialization is performed in a distributed way, defining tasks to initialize the matrix blocks.These tasks do not take into account the nature of the algorithm and they are scheduled in a round robin manner.Next, all the computations are done considering that the data is already allocated in a given node.The data locality scheduler will assign the tasks taking into account the locality information and reducing the data transfers.This methodology shows that PyCOMPSs is capable of obtaining HPC performance in Big Data environments, where data is already present in the infrastructure, and it is not possible to arrange its location depending on the computation.
The following subsections provide a brief description of the encoded algorithms.

Matrix multiplication
The matrix multiplication code performs a matrix multiplication by blocks.The code has two tasks: one for the creation of the matrices' blocks and one for the blocks' multiply-accumulate.Since the blocks are defined as NumPy arrays, the operations that operate on them are overridden and the corresponding NumPy operation is invoked, which calls as well to the IntelÒMKL operation.Notice that this behavior happens even when indicated with arithmetic operators.
Figure 2 shows the code used to perform the multiplication task.The computational complexity of this algorithm is widely known: 2S 3 , where S is the side size of the matrix.In this context, it may be easier to note S = N • M where N is the amount of blocks in a side and M is the side size of a single block.

Cholesky factorization
The Cholesky factorization can be applied to Hermitian positive-defined matrices.This decomposition is a particular case of the LU factorization, obtaining two matrices of the form U = L t .
There are two main blocked algorithms to perform a Cholesky decomposition.The right-looking algorithm [33] and the left-looking version [34]. Figure 6 shows the code used in this case, corresponding to the right-looking approach.It has been chosen because it is more aggressive, meaning that in an early stage of the computation there are blocks of the solution that are already computed and all the potential parallelism is released as soon as possible.Hence, the Runtime can continue performing the following computations on the matrices.
The functions in bold in the Cholesky code (potrf, solve_triangular, and gemm) are annotated as tasks.Each of these tasks internally calls to IntelÒMKL functions, with a given number of threads.
Figure 7 shows the code of the potrf task from the Cholesky code.This task has a constraint decorator that indicates the number of ComputingUnits (CPU's in this case) required to execute the task that, during the experimentation, matches the amount of IntelÒMKL threads.Notice that the PyCOMPSs Runtime can be configured to oversubscribe the computing nodes with tasks that involve more threads than the actual available computing cores.This way we can overlap I/O operations when starting/ending tasks with the actual computation, increasing the resource efficiency.The other tasks defined in this example look very similar to the potrf one.
On the other hand, this kernel has a GPU version.The PyCOMPSs Runtime automatically handles the GPUs present in a node setting wisely the CUDA visible devices.Figure 8 shows how easy is to take advantage of the accelerators.The polymorphism syntax explained in Section 2 is used to give an alternative version of dgemm.The user can consider the presented code as a template and just change the call to cuBLAS with the desired kernel.Hence, even if the code can seem a little bit complicated, it allows a user that has never seen a line of CUDA to use the GPUs from Python in a distributed context.
Looking at the code and considering the complexity of each call, it is possible to build the Table 1.Hence, the complexity of the algorithm can be computed as shown in Figure 9, where M is the amounts of blocks per side and N the side block size.We can conclude that the algorithm that is the same complexity of the sequential algorithm.

QR factorization
Traditional QR algorithms use the Householder transformation, but this method requires accessing a whole column of the matrix, while our data structures are based on blocks.The approach followed in our implementation uses a method based on Givens rotations, which access the data in the matrices by blocks [35].While the Cholesky approach is the well known typical one, in the QR we have to follow a more complicated procedure to achieve a good degree of parallelism.The following steps are followed: 1. Decomposition of the input matrix in blocks

QR decomposition of the upper left block
Change every block of the column for zero blocks a. Rotation matrix construction through a partial QR decomposition b. Add the rotation matrix to introduce a new zero block in the R matrix Repeat the previous procedure with all the elements in the column, obtaining the following expression Return to the point 1, treating the submatrix result of discarding the first row and the first column of the obtained A N) Once all the transformations have been performed, the following expression is obtained: Figure 10 shows the QR implementation used in this paper that corresponds to the previous algorithm.An auxiliary matrix, mainly composed of identity and zero blocks except for the four blocks corresponding to the positions (i, i), (i, j), (j, i) and (j, j), where (i, j) is the position that is being rotated (changed to zero) performs the rotations.Although in our initial version of the QR algorithm we were allocating this auxiliary matrix, we have developed a second version where only those blocks different to identity or zero are actually allocated, and only the multiplications by values different to identity or zero are performed.With this second approach (called memorySaveQR in the evaluation), we save memory space and reduce useless computations.
Once again, the complexity can be deduced from the code considering the complexity of each call, building the Table 2. Thus, as demonstrated in Figure 11, the complexity of the algorithm is O (20 (NM) 3 ).In this case, the obtained complexity is larger than the sequential one even if it is in the same order.Nevertheless, the parallelism obtained is remarkable so if the main goal is to make the algorithm scalable, the trade-off is worth.

LU factorization
In this case, an approach without pivoting [36] has been the starting point.The partial pivoting blocked algorithm [37] was not considered because requires an entire column to be present in a node to compute the partial column LU decomposition.Knowing that this approach is unstable in general [38], one modification has been done to increase the stability of the algorithm while keeping the block division and avoiding bringing an entire column into a single node.
Figure 12 shows a schema with all the steps taken in an iteration of the LU factorization: 1.The current principal block, considered the one with the lowest column and row index (A in the figure) is decomposed using some underlying library.2. The first U's row (U 12 in the figure) is computed using the row of the current principal block.3. The column of the present main block is used to calculate an auxiliary result for the next steps (P 22 L 21 ). 4. The LU decomposition of the blocks with a row or column index larger than the principal one is launched (recursive step).
5. The first L's column (L 21 in the figure) is computed using the column of the current main block as well as the result of the iterative step.
Finally, Figure 13 shows the code corresponding to the previously presented algorithm.
In the previous cases, all the calls had a cubic complexity.On the other hand, this kernel has some calls like invert_triangular -which calls to solve_triangular, the MKL function to invert triangular matriceswith quadratic complexity that have been neglected, building Table 3.In Figure 14 all the terms have been added to compute the total complexity which is Since in sequential a developer would use either a non pivoting algorithm or a pivoting considering the full columns, it makes no sense to compare the complexity of this algorithm against the sequential code.Nevertheless, it is interesting to realize that it is less than a half of the QR asymptotic complexity.
Table 2. NumPy calls to perform a QR decomposition.

Function
Amount of calls Complexity    4 Results

Computing infrastructure
The execution results presented in this section have been obtained in three different clusters located at the Barcelona Supercomputing Center (BSC).
We have used PyCOMPSs version 2.2 for the initial evaluations, and the same version with the Python persistent workers to test these new features.We have also used IntelÒPython 2.7.13 and IntelÒMKL 2017.

MareNostrum III
This supercomputer was composed by 3056 nodes, each of them with two IntelÒ SandyBridge-EP E5-2670/1600 20M (8 cores at 2, 6 GHz each), main memory that varies from 32 to 128 GB, FDR-10 Infiniband and Gigabit Ethernet network interconnections, and 3 PB of disk storage [39].It was providing service for researchers from a wide range of different areas, such as life science, earth science and engineering until March 2017.

COBI cluster
This cluster is an IntelÒ SSF system composed of 8 nodes with two IntelÒ XeonÒ CPU E5-2690 v4 @ 2.60 GHz and 128 GB of main memory each (Xeon nodes) and 8 nodes with an IntelÒ Xeon PhiÒ CPU 7210 @ 1.30 GHz and 110 GB of main memory each (KNL nodes).The network technology used is Omni-Path.Also, Lustre [40] is used as shared file system.

General comments on the evaluation
We consider that Python users will use the intra-node parallelized version of the linear algebra algorithms through MKL (not only a shared memory implementation, but a one optimized for Intel processors) into their applications.Since these kernels can be directly called from Python without any further installation, we have considered them as a really good reference point for Cholesky, QR, and LU.Hence, instead of comparing the PyCOMPSs implementations against other distributed kernels, we have computed the SpeedUp's against the sequential version encoded to be optimal for this kind of processor.Since the sequential version has shared memory and does not perform transfers, we consider that having a good performance when comparing against these executions is enough to assess the good behavior of the executions.
On the other hand, it is important to keep in mind that, for a given matrix size, increasing the block size implies decreasing the number of total blocks.Hence, even if we show that better performances can be obtained when increasing the block size, this fact decreases the number of blocks and thus the maximum parallelism.Moreover, since the input sizes are bigger, each task requires a higher amount of memory.After considering all these factors and trade-offs, we have decided to perform all the tests with blocks of 4K • 4K.
Finally, we would like to highlight the fact that, for most of the executions, we delegate the transfers to the shared file system.Hence, it is difficult to evaluate accurately the transfer time since the shared file systems have lots of data cached in memory and replicated across the cluster.This is the main reason why this study has not been done.

Matrix multiplication
For the matrix multiplication example, we have first analyzed the impact of the oversubscription in a single node of MareNostrum III.The data square matrices considered had 32K • 32K doubles, organized in blocks of 4K and 8K.Larger blocks have not been tested since there is a limit on the serialization size for a single block of 4 GB for Python 2.7.This limitation is due to a hardcoded parameter in the save_bytes function, present in the class _pickle.c of the module in charge of the serializations.
We have instructed the Runtime to schedule always two tasks per NUMA socket, where all task threads are bounded to a single socket as described in Section 2. The number of threads in each execution ranged from 4 to 32 threads, in such a way that the oversubscription ratio ranged from 1 (when 4 threads/task are used) to 8 (when 32 threads/task are used).Figure 15 shows the results of this evaluation.Notice that PyCOMPSs gets the maximum performance when using a level of oversubscription of 4 and a block size of 8K.The performance in the best case is around Table 3. NumPy calls to perform a LU decomposition Function Amount of calls Complexity 200 GFLOPS, which represents the 60% of the peak of a MareNostrum III node (332 GFLOPS [41]).Next, we have executed the same PyCOMPSs code in a variable number of nodes of MareNostrum III, from 1 to 64 (from 16 to 1024 cores), with an additional node for the master.
Figure 16 shows the performance obtained with matrices of 64K • 64K doubles, organized in blocks of 4K • 4K.The Runtime was configured to execute a maximum of four tasks per node, each of them using 16 IntelÒMKL threads, which represent an oversubscription ratio of four.The light green line labeled Ideal distributed scaling is derived by considering an ideal speedup from the performance obtained in one node with PyCOMPSs.The red line Sequential performance corresponds to the performance obtained with IntelÒMKL in one node with 16 threads for a single block of 4K • 4K (251 GFLOPS).We observe a very good speedup until 16 nodes.After that, the efficiency is lower but the system keeps scaling.
The next result worth mentioning concerning the matrix multiplication is the heterogeneous execution in the COBI cluster.Figure 17 shows a post-mortem trace file obtained with Extrae and analyzed with Paraver.More precisely, the execution uses one Xeon node and one KNL node.Blue tasks at the left-side of the image correspond to initialization tasks and the white ones to block multiplications.Green flags point out beginning and end of tasks.Notice that these many-core architectures allow working with plenty of tasks simultaneously.For instance, this execution multiplies two matrices of 131K • 131K divided into blocks of 4K, with 1024 free tasks in average, and a total amount of 32 768 tasks; showing the capability of the PyCOMPSs Runtime to handle a huge amount of work.Each KNL node executes four tasks concurrently and each Xeon node executes eight tasks at a time.
In this context, and considering that there is not a dedicated node to allocate the master (it runs in a Xeon node that also executes an entire worker), the scalability study has stressed the scheduler.Figure 18 shows how the Data locality scheduler degrades much faster than the FIFOData scheduler.This is due to the simplification in the second scheduling policy, avoiding a lot of comparisons between the score of the different free tasks.In this case, the users have two options to increase the performance, either to change the scheduler or to dedicate more resources to the Runtime.
We have so demonstrated the importance of the choice both of the block size and the scheduler choice.In addition, we show that a good performance on the matrix multiplication can be achieved until 16-32 nodes.Finally, we show that PyCOMPSs Runtime is able to handle different implementations of the same function to take advantage of heterogeneous infrastructures.

Cholesky factorization
Figure 19 shows the Cholesky performance in MareNostrum III.The matrix size is again 64K • 64K doubles, and the block size is 4K • 4K doubles.We have used the same values for the maximum tasks per node and oversubscription than in the previous case (4 tasks per node, 16 threads per task, and an oversubscription ratio of 4).
The red line Sequential performance corresponds to the performance obtained with IntelÒMKL potrf with 16 threads for a block of 4K • 4K doubles (72 GFLOPS) in a single MareNostrum node.The green line Ideal scaling from sequential is equivalent to the ideal speedup calculated from the previous value.
The chart shows the performance result with PyCOMPSs 2.2.The line shows ideal scaling up to four nodes and an increasing degradation from there.
When inspecting the post-mortem performance traces, we have observed that the degradation of the performance is due to the morphology of the task-graph.Combining the observed results shown in Figures 20 and 21, the maximum available parallelism is already filled with eight nodes.We can conclude this by looking at the inverse triangle form of the DAG and the resource fulfillment of the resources in the trace.This is a good example of how useful the profiling tools can be when analyzing the code to understand the obtained performance.
Figure 22 shows the performance obtained with a heterogeneous execution in the COBI cluster.The block size is 4K • 4K and the matrix size is 130K • 130K.For each execution, half of the nodes are Xeon, and half are KNL nodes.In this case, only the gemm function can run on both architectures.The rest can run only in the Xeon nodes.
Figure 20 shows the execution trace of the previous performance test with four Xeon and four KNL nodes.Green segments correspond to initialization tasks, the blue ones to potrf, the red ones to solve_triangular and the white ones to gemm.Yellow lines separate the different nodes.While the nodes with a maximum of four tasks running at a time are KNLs, the ones with a maximum of eight are Xeon nodes.In this execution, the KNL nodes are constrained to execute gemm tasks and readers may notice that the scheduler is capable of handling the fact that not all the machines can execute all tasks' types by only sending the tasks where they can run.
Finally, two executions in the Minotauro cluster have been performed.This is an heterogeneous cluster with two different types of node.It this case, no performance studies have been done.This is mainly due to the fact that considering that each time a GPUs is used, the data has to be copied to its memory and bring back the result.This slows down the execution significantly compared to the maximum theoretical performance.Nevertheless, it is important to realize that the PyCOMPSs Runtime is able to handle a double heterogeneity in this case.On a first level, we have two different kind of nodes in the cluster.On a second level, each one of those nodes has two different kind   of processors: several CPUs and GPUs.PyCOMPSs Runtime handles successfully four different types of processors in a single execution with a minimum effort for the developer.Figure 23 shows the traces obtained when using two (Fig. 23a) and four nodes (Fig. 23b).In each case, half the nodes are of one type and half of the other.The main interest of these executions is to show that we are able to execute a kernel in an heterogeneous GPU cluster.Still, the performance remains really poor since all the data is copied between the CPU and the GPU memory each time a computation is done.We think that the performance could be increased implementing a cache to handle the GPU memory, avoiding some data transfers.Nevertheless, this remains as an idea for future work.

QR factorization
For the QR evaluation, we used a matrix of size 32K • 32K doubles organized in blocks of 4K • 4K doubles.We have evaluated the dense version of the factorization against the memory save version.As described in Section 3.3, the   difference between both versions is that the sparse version only allocates those blocks of the auxiliary matrix that are different to the identity or to zero.Additionally, the operations with this identity or zero blocks are not performed (the original block or a zero block is directly returned).To accomplish this behavior, each element is modeled as a list with an integer to indicate the block type and, if necessary, a NumPy array with its real content.Figure 24 presents the results obtained executing the code in MareNostrum III.As explained in the previous case, the application does not scale well beyond 4 nodes because it reaches the parallelism limit of the application.Nevertheless, the performance until this point is excellent.
Figure 25 shows the dependency graph corresponding to a QR decomposition of a 4 blocks • 4 blocks execution, manifesting that the PyCOMPSs Runtime is not only able to handle an enormous amount of tasks but also really complex dependency graphs.The real execution (8 blocks • 8 blocks) has a much more complicated graph but is too large to be depicted in this paper.

LU factorization
In this case, all the executions have been performed in the COBI cluster using only Xeon nodes.A matrix with 82K • 82K doubles has been factorized, obtaining the matrices P, L and U detailed in Section 3.4.
As shown in Figure 26, the reference execution performance is the maximum achieved with a threaded code executing pure NumPy code, corresponding to a 32K • 32K doubles matrix.The performance obtained with a matrix with 4K • 4K doubles (the block size used in the distributed execution) is 41 GFLOPS, far away from the 218 GFLOPS achieved with the 32K • 32K matrix.This fact suggests that when the matrix is not large enough, not all the resources are fulfilled.Nonetheless, when executing several tasks at the same time, PyCOMPSs takes advantage of all the available resources.
Two remarkable facts can be deduced from this experiment.The blocked LU decomposition scales pretty well until a reasonable amount of cores.In addition, the PyCOMPSs Runtime demonstrates its capacity to work with really complicated DAGs. Figure 27 shows a zoom of a dependency graph for a 8 blocks • 8 blocks execution that makes easier to appreciate the complexity handled by the PyCOMPSs Runtime.On the other hand, Figure 28 shows the dependency graph for a 6 blocks • 6 blocks execution.
On the other hand, not only the DAG complexity is important.The amount of tasks that the Runtime can handle is also remarkable.In this case, we have observed that the scheduler choice has a really big impact in the final performance.Figure 29a shows that, in this case, the data locality scheduler is not the best choice.It adds come overhead to compute the best locality when, in fact, all the data is directly available on lustre.When taking into account this fact, the FIFOData scheduler has been considered.Figure 29b shows that the obtained performance is much better.In this execution, 34 288 tasks are executed in 380 seconds in 196 cores.Figure 29c shows that with this amount of tasks and slots available, the Runtime is able to run correctly with tasks that last 0.7-2.0s.Hence, the throughput is in the microseconds order.This execution can serve as example to better understand PyCOMPSs  capabilities since is handles a huge amount of tasks not lasting so much.

Programming complexity
As stated in Section 2, we aim at providing a programming model that makes it easier the development of parallel applications.While it is difficult to evaluate the easiness or complexity of a programming model, we show in this subsection a summary of the lines of code required to program the PyCOMPSs codes described in this paper as a measure for programmability.
These codes are composed of the main algorithm and the tasks' functions.The main algorithm is usually expressed in a method, like the ones shown in Figures 6,   Table 4 shows a summary of the lines required to program all the codes presented in this paper.The Main function column is the number of lines in the main algorithm of the codes.The Program column is the total number of lines.The column Decorator corresponds to the number of lines specific for PyCOMPSs, that can be considered the added lines to a sequential version of the code.We can see that from very simple codes, we can obtain performance in distributed and heterogeneous platforms.Readers must consider that the codes contain spaces and blank lines to ease its  comprehension and that they include the distributed matrix initialization, the auxiliary functions to join the blocked matrix for the result verification, and the timers to compute the elapsed time.Hence, the number of lines can be considered a superior bound (which could be easily reduced) of a complete program that tests the correct behavior of the kernels.Furthermore, we explicitly state the number of lines added as decorators to obtain the distributed version from the blocked one.We highlight that it makes no sense to compare the distributed version against the sequential one since all the kernels can be called in a single MKL line.

Conclusion
PyCOMPSs is a robust programming model that enables the programmer to achieve remarkable performances with clean and simple codes.Based on the sequential version and just by adding some decorators skillfully, the code is ready to run on distributed and heterogeneous clusters thanks to the power of the Runtime that automatically handles all the tasks' scheduling and data transfers.Once the Runtime is installed on a given infrastructure, the code parallelized with PyCOMPSs is easily executed.
This paper has shown that good performances can be achieved regarding sequential codes when calling lower level libraries tuned for each particular architecture.PyCOMPSs allows the users to quickly turn a Python sequential code into a highly portable distributed version.Depending on the code's nature, PyCOMPSs can so play both the distributed programming language and orchestrator roles.In fact, when increasing the size of the matrices, we are able to take advantage of all the architecture, observing super linear SpeedUp's when considering the sequential execution as the reference.
Finally, all the executions have been performed with the data already present in the shared file system.The PyCOMPSs Runtime has shown that it is capable of achieving a good performance level when the data is already present in the infrastructure.This fact puts the programming model in the frontier between Big Data and HPC, fulfilling the needs of both environments.

Fig. 26 .
Fig. 26.LU performance in the COBI cluster with Xeon nodes.

10
or 13, with lengths of 15-30 lines composed of a set of nested loops with calls to the task functions.The task functions are much simpler, each of them of 5-10 lines (like the ones in Figures 2 or 7 ) and are wrappers to the NumPy calls with the corresponding PyCOMPSs decorators.

Table 1 .
NumPy calls done to perform a Cholesky decomposition.

Table 4 .
Amount of code lines in each kernel.