# Load balancing and OpenMP implementation of nested parallelism


## Parallel Computing

Volume 31, Issues 10–12, October–December 2005, Pages 984–998


## Abstract

Many problems have multiple layers of parallelism. The outer level may consist of a few coarse-grained tasks. Each of these tasks may itself be rich in parallelism and be split into a number of fine-grained tasks, which again may consist of even finer subtasks, and so on. Here we argue, and demonstrate by examples, that utilizing multiple layers of parallelism may give much better scaling than restricting oneself to a single level of parallelism.

Two non-trivial issues for multi-level parallelism are load balancing and implementation. In this paper we provide an algorithm for finding good distributions of threads to tasks and discuss how to implement nested parallelism in OpenMP.

## Introduction

Computationally demanding problems, where parallel computing is needed to find a solution within reasonable time, also tend to be complex problems. When breaking a complex problem into smaller independent subproblems for parallel processing, one typically finds several layers of parallelism. The first, or outer, level may consist of a few large tasks. Each of these outer-level tasks may be split into a number of fine-grained tasks, which again may consist of even finer subtasks, and so on.

If the outer-level tasks are independent, they provide an obvious, coarse-grained parallelism. Unfortunately, coarse-grained parallelism may give poor scalability, as the tasks are often few and of unequal sizes. Consequently, many applications are parallelized at the inner level, exploiting parallelism within each task. Fine-grained parallelism quite often suffers from large parallel overhead and synchronization costs as the number of processes increases, resulting in limited scalability. When scalability is limited for coarse- as well as fine-grained parallelism, combining the two can still provide reasonable scalability, as demonstrated by our examples in Section 4.

In this paper we investigate the advantages and challenges of implementing multi-level parallelism. The great advantage, we claim, is enhanced scalability. However, we see two main challenges. As our examples illustrate, the cases where outer-level parallelism alone is insufficient are in particular those where the outer-level tasks are of unequal sizes. This makes load balancing non-trivial, but no less important: finding a good distribution of threads to outer-level tasks is essential for good efficiency. This question is addressed in Section 2, where we provide an algorithm that distributes threads to tasks.

The second challenge is the increased complexity of the implementation. We investigate this in the context of OpenMP [1], the standard for SMP programming. The popularity of OpenMP is mainly due to its ease of use, as the programmer usually only inserts directives in front of the most time-consuming loops. In the case of multi-level parallelism, one would like it to be sufficient to simply apply nested sets of directives to achieve the desired effect, and, contrary to most of its predecessors, OpenMP allows nested directives.

However, a complicating fact is that the effect of the OpenMP directives for nesting is not well defined. It is, for instance, compliant with the standard to serialize nested directives. Thus there is no guarantee that multiple levels of parallelism are actually used, even if the programmer specifies so by inserting directives. Consequently, the only way to write truly portable OpenMP code that faithfully executes the multi-level parallelism intended by the programmer is the more cumbersome way of explicitly assigning tasks to threads. We demonstrate how this can be done in Section 3.
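To make the idea concrete, here is a minimal sketch (our illustration, not the paper's code) of explicitly assigning outer-level tasks to threads inside one flat parallel region. The helper `task_of_thread` and the hard-coded distribution are hypothetical; the pragma is simply ignored when compiled without OpenMP, so the code degrades to a serial run.

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }  /* serial fallback */
#endif

#define NTASKS 3  /* hypothetical number of outer-level tasks */

/* Map a thread id onto an outer-level task, given a precomputed
 * distribution (threads_per_task would come from the load-balancing
 * step of Section 2). Threads 0..threads_per_task[0]-1 serve task 0,
 * the next threads_per_task[1] threads serve task 1, and so on. */
static int task_of_thread(int tid, const int *threads_per_task) {
    int acc = 0;
    for (int t = 0; t < NTASKS; t++) {
        acc += threads_per_task[t];
        if (tid < acc) return t;
    }
    return NTASKS - 1;
}

/* One flat parallel region: every thread looks up its own task, so the
 * two-level structure is realized without relying on nested directives. */
static void run_tasks(const int *threads_per_task, int nthreads, int *assigned) {
    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        assigned[tid] = task_of_thread(tid, threads_per_task);
        /* ...threads sharing the same task value now cooperate on it... */
    }
}
```

The key point of this style is that team formation is arithmetic on thread ids, so the program behaves identically whether or not the OpenMP runtime honors nesting.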

The effect of 2-level parallelism on two real applications, a wavelet-based data compression code and an adaptive mesh refinement code, has been tested and is reported in Section 4. We conclude in Section 5 with some remarks on the limitations of the current OpenMP standard.

## The distribution algorithm

In this section we will present an algorithm for the distribution of threads to tasks. We will focus our discussion on the 2-level case and try to be exhaustive for this case. We also briefly explain how our algorithms can be extended to multiple levels. The computational units at the outer-level will be called tasks. These are assumed to be independent, and of unequal sizes. Each task is supposed to possess an internal fine-grained parallelism. For our algorithms to work, an estimate of the
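The preview cuts off before the algorithm itself. As a stand-in, a simple greedy heuristic (our sketch, not necessarily the paper's algorithm) that assumes a per-task work estimate `w[i]` could look like:

```c
/* Greedy sketch: give every task one thread, then repeatedly grant an
 * extra thread to the task whose estimated work per thread is currently
 * largest. w[i] is an assumed work estimate for task i; requires
 * nthreads >= ntasks so each task gets at least one thread. */
static void distribute_threads(const double *w, int ntasks,
                               int nthreads, int *alloc) {
    for (int i = 0; i < ntasks; i++)
        alloc[i] = 1;                      /* every task needs one thread */
    for (int left = nthreads - ntasks; left > 0; left--) {
        int best = 0;
        for (int i = 1; i < ntasks; i++)   /* find the most loaded task */
            if (w[i] / alloc[i] > w[best] / alloc[best])
                best = i;
        alloc[best]++;
    }
}
```

With estimates {60, 30, 10} and 10 threads, this hands the heavy task the lion's share of the threads, which is the qualitative behavior any such distribution scheme must have.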

## Implementation of nested parallelism in OpenMP

Nested parallelism can be implemented using message passing. In MPI [11], creating communicators makes it possible to form teams of threads, where the members of the same team can communicate among themselves (fine-grained parallelism), while the coarse-grained parallelism implies communication between communicators.
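For illustration, the bookkeeping behind `MPI_Comm_split`-style team formation — grouping ranks by a "color" (here, the outer task id) and computing each member's rank within its team — can be sketched in plain C (a hypothetical helper; no actual MPI calls):

```c
/* Given each global rank's color (the outer task it belongs to), the
 * local rank within the team is the number of lower global ranks with
 * the same color. This mirrors what MPI_Comm_split computes when the
 * key argument equals the global rank. */
static int local_rank(int rank, const int *color) {
    int lr = 0;
    for (int r = 0; r < rank; r++)
        if (color[r] == color[rank])
            lr++;
    return lr;
}
```

In real MPI code one would simply call `MPI_Comm_split(MPI_COMM_WORLD, color, rank, &team_comm)` and let the library do this computation; the sketch only shows what the resulting (team, local rank) structure is.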

The distribution of work to multiple threads in SMP-programming is usually done
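The snippet breaks off here, but in standard OpenMP the work distribution the authors refer to is expressed with worksharing directives. A minimal single-level sketch (the pragma is ignored if compiled without OpenMP, so the loop then runs serially with the same result):

```c
/* Single-level OpenMP worksharing: the loop iterations are divided
 * among the threads of one team, and reduction(+:sum) combines the
 * per-thread partial sums into the final result. */
static double parallel_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```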

## Experiments

In this section we report on our experience with 2-level parallelism in real applications. The load balancing is done using the work distribution algorithms presented in Section 2. For the implementation we have used the explicit thread programming technique outlined in Section 3.2.

The first example is a wavelet based data compression routine. In this case we have a predefined and fixed number of tasks. Our second example is an adaptive mesh refinement code. Here the number of

## Conclusions

The main purpose of this paper has been to examine the possible gain of utilizing nested parallelism when it is available in the problem. Our findings are very encouraging: using two levels of parallelism turned out to be imperative for good parallel performance on our test cases. Utilizing nested parallelism on a real-life problem, a wavelet compression application, a speedup of 33 was achieved with 50 threads. We find this result very satisfactory.

As always, good load balancing is essential in

## References (14)

• M.J. Berger et al., Adaptive mesh refinement for hyperbolic partial differential equations, Journal of Computational Physics (1984)
• OpenMP. Available from: ...
• G. Ausiello et al., Complexity and Approximation (1999)
• M.J. Berger, Adaptive mesh refinement for hyperbolic conservation laws. Available from: ...
• M.J. Berger et al., Journal of Computational Physics (1984)
• M.J. Berger, R.J. LeVeque, AMRCLAW, Adaptive mesh refinement + CLAWPACK. Available from: ...
• M.J. Berger et al., SIAM Journal on Numerical Analysis (1998)

There are more references available in the full text version of this article.

## Cited by (35)

• NestedMP: Enabling cache-aware thread mapping for nested parallel shared memory applications

2016, Parallel Computing

It is beneficial to exploit multiple levels of parallelism for a wide range of applications, because a typical server already has tens of processor cores. As the number of cores per machine increases rapidly, efficient support of nested parallelism will become more and more important.

We observe that different task-core mapping schemes may result in significant performance differences, because modern HPC servers are NUMA multi-core systems, so it is important to control the task-core mapping for nested parallelism. However, the thread-management mechanisms in current parallel programming models, such as OpenMP, do not provide enough information for runtime systems to make optimized decisions. As a result, current nested parallel applications often suffer from suboptimal task-core mapping and significant performance loss.

To address this problem, we propose NestedMP, a set of directives extending OpenMP. NestedMP specifies the number of threads of each nested parallel branch in a declarative way and allows runtime systems to see the whole picture of the task tree to make locality-aware task-core mappings. We have implemented NestedMP in GCC 4.8.2 and tested the performance on a 4-way 8-core SandyBridge server. The results show that NestedMP improves performance significantly over GCC's OpenMP implementation.

• Performance evaluation of OpenMP-based algorithms for handling Kronecker descriptors

2012, Journal of Parallel and Distributed Computing

Numerical analysis of Markovian models is relevant for performance evaluation and probabilistic analysis of systems’ behavior from several fields in science and engineering. These models can be represented in a compact fashion using Kronecker algebra. The Vector-Descriptor Product (VDP) is the key operation to obtain stationary and transient solutions of models represented by Kronecker-based descriptors. VDP algorithms are usually CPU intensive, requiring alternatives such as data partitioning to produce results in less time. This paper introduces a set of parallel implementations of a hybrid algorithm for handling descriptors and a detailed performance analysis on four real Markovian models. The implementations are based on different scheduling strategies using OpenMP and existing techniques of static and dynamic load balancing, along with data partitioning presented in the literature. The performance evaluation study contains analysis of speed-up, synchronization and scheduling overheads, task mapping policies, and memory affinity. The results presented here provide insights into different implementation choices for an application on shared-memory systems and how this application benefited from this architecture.


## Recommended articles (6)

• Research article

Improving OpenCL Programmability with the Heterogeneous Programming Library

Procedia Computer Science, Volume 51, 2015, pp. 110-119

The use of heterogeneous devices is becoming increasingly widespread. Their main drawback is their low programmability due to the large amount of details that must be handled. Another important problem is the reduced code portability, as most of the tools to program them are vendor or device-specific. The exception to this observation is OpenCL, which largely suffers from the reduced programmability problem mentioned, particularly in the host side. The Heterogeneous Programming Library (HPL) is a recent proposal to improve this situation, as it couples portability with good programmability. While the HPL kernels must be written in a language embedded in C++, users may prefer to use OpenCL kernels for several reasons such as their growing availability or a faster development from existing codes. In this paper we extend HPL to support the execution of native OpenCL kernels and we evaluate the resulting solution in terms of performance and programmability, achieving very good results.

• Research article

A C++11 implementation of arbitrary-rank tensors for high-performance computing

Computer Physics Communications, Volume 185, Issue 6, 2014, pp. 1681-1696

This article discusses an efficient implementation of tensors of arbitrary rank using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aim of providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression template facility is also built around the array class template to provide convenient mathematical syntax. As a result, by using templates, an extra high-level layer is added to the C++ language when dealing with algebraic objects and their operations, without compromising performance. The implementation is tested running on both CPU and GPU.

Program title: cpp-array

Catalogue identifier: AESA_v1_0

Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AESA_v1_0.html

Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland

Licensing provisions: GNU General Public License

No. of lines in distributed program, including test data, etc.: 11698

No. of bytes in distributed program, including test data, etc.: 94363

Distribution format: tar.gz

Programming language: C++.

Computer: All modern architectures.

Operating system: Tested on Linux and Mac OS.

RAM: Problem dependent

Classification: 5.

External routines: GNU CMake build system and BLAS implementation. NVIDIA CUBLAS for GPU computing.

Nature of problem:

Tensors are a basic building block for any program in scientific computing. Yet, tensors are not a built-in component of the C++ programming language.

Solution method:


An arbitrary-rank tensor class template is crafted by using the new features introduced by the C++11 set of requirements. In addition, an entire expression template facility is built on top, to provide mathematical straightforward notation without damaging performance.

Running time:

Problem dependent. The tests provided take only seconds. The examples take approximately 15 min.

• Research article

Parallel and vectorized implementation of analytic evaluation of boundary integral operators

Engineering Analysis with Boundary Elements, Volume 96, 2018, pp. 194-208

In this paper, we describe an efficient analytic evaluation of boundary integral operators. Firstly, we concentrate on a novel approach based on the simultaneous evaluation of all three linear shape functions defined on a boundary triangle. This results in a speedup of 2.35–3.15 times compared to the old approach of separate evaluations. In the second part we comment on the OpenMP parallelized and vectorized implementation of the suggested formulae. The employed code optimizations include techniques such as data alignment and padding, array-of-structures to structure-of-arrays data transformation, or unit-strided memory accesses. The presented scalability results, with respect both to the number of threads employed and the width of the SIMD register obtained on an Intel® Xeon™ processor and two generations of the Intel® Xeon Phi™ family (co)processors, validate the performed optimizations and show that vectorization needs to be an inherent part of modern scientific codes.

• Research article

Last level cache layout remapping for heterogeneous systems

Journal of Systems Architecture, Volume 87, 2018, pp. 49-63

Heterogeneous systems with CPU and GPGPU sharing the last level cache (LLC) provide viability and flexibility. However, the different programming models lead to conflicting memory layouts, each required for the best performance of a different processor. Software conversion that directly accesses the target layout suffers from sub-optimal locality. Converting in GPGPU shared memory also incurs copying and synchronization overhead.

In this paper, we analyze the memory layout requirement and propose to remap the memory layout in the shared LLC. A remap controller in LLC executes a simple program that calculates target requests from an LLC request in the source memory space. The LLC request is thus remapped to the target memory space with the generated requests. Consequently, all processors always access memory in their optimal data layouts. The locality is thus kept through all the private caches, and software remapping overhead is also eliminated.

The tiled-matrix multiplication is discussed as a case study, and benchmarks from Polybench/GPU and Rodinia are modified to take advantage of the LLC layout remapping. The experimental results show that the average benchmark execution time is decreased to 69%. Compared with CPU software layout converting, the CPU time is decreased to 41%–73%.

• Research article

A multi-platform scaling study for an OpenMP parallelization of a discontinuous Galerkin ocean model

Computers & Fluids, Volume 117, 2015, pp. 325-335

We present a cross-platform scaling investigation for an OpenMP parallelization of UTBEST3D – a coastal and regional ocean code based on the discontinuous Galerkin finite element method. The study is conducted for a real life application on an unstructured computational mesh of the Northwest Atlantic with realistic topography and well resolved coast line on a broad selection of current computing platforms. Four numerical setups of increasing physical and computational complexity are used for comparison: barotropic with no vertical eddy viscosity, barotropic with an algebraic eddy viscosity parametrization, baroclinic with an algebraic eddy viscosity, and baroclinic with k-ε vertical turbulence closure. In addition to Intel Xeon and IBM Power6/PowerPC architectures, we also include Intel's new MIC processor Xeon Phi in the evaluation. Good scalability is found across all platforms, with Intel Xeon CPUs producing the best runtime results and Xeon Phi demonstrating the best parallel efficiency.

• Research article

CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters

Parallel Computing, Volume 58, 2016, pp. 27-36

GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs. It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs for OpenSHMEM on GPU clusters do not take advantage of the GDR features, leaving potential performance improvements untapped. In this paper, we introduce "CUDA-Aware" concepts for OpenSHMEM that enable operations to be directly performed from/on buffers residing in GPU memory. We propose novel and efficient designs that ensure "truly one-sided" communication for different intra-/inter-node configurations while working around the hardware limitations. We achieve 2.5× and 7× improvement in point-to-point communication for intra-node and inter-node configurations, respectively. Our proposed framework achieves 2.2 µs for an intra-node 8-byte put operation from CPU to local GPU and 3.13 µs for an inter-node 8-byte put operation from GPU to remote GPU. The proposed designs lead to a 19% reduction in the execution time of the Stencil2D application kernel from the SHOC benchmark suite on the Wilkes system, which is composed of 64 dual-GPU nodes. Similarly, the evolution time of the GPULBM application is reduced by 45% on 64 GPUs. On an 8-GPUs-per-node CS-Storm-based system, we show 50% and 23% improvement on 32 and 64 GPUs, respectively.

