OpenMP: loop parallelization

The scheduling types
- Summary
- Static
- Dynamic
- Guided
- Auto
OpenMP in while-loop

The scheduling types

We include four different scheduling types in this blog:

static
dynamic
guided
auto
runtime

Summary

For loops where each iteration takes roughly equal time, static schedules work best, as they have little overhead.
For loops where each iteration can take very different amounts of time, dynamic schedules, work best as the work will be split more evenly across threads.
Specifying chunks, or using a guided schedule provide a trade-off (uma alternativa) between the two.
Choosing the best schedule depends on understanding your loop.

Static

schedule(static, chunk-size) clause allows OpenMP to divide the iterations into chunks with a chunk-size and it distributes the chunks to threads in a circular order.

OpenMP divides iterations into chunks with equal size if no chunk-size is specified.

schedule(static):      
****************                                                
                ****************                                
                                ****************                
                                                ****************
schedule(static, 4):   
****            ****            ****            ****            
    ****            ****            ****            ****        
        ****            ****            ****            ****    
            ****            ****            ****            ****
schedule(static, 8):   
********                        ********                        
        ********                        ********                
                ********                        ********        
                        ********                        ********

A for loop with 64 iterations and we used four threads to parallelize the for loop. Each row of stars in the examples represents a thread. Each column represents an iteration.

The first example (schedule(static)) has 16 stars in the first row. The first tread executes iterations 1, 2, 3, …, 15 and 16. The second row has 16 blanks and then 16 stars. This means that the second thread executes iterations 17, 18, 19, …, 31, 32. Similar applies to the threads three and four.

The static scheduling type is appropriate when all iterations have the same computational cost.

Dynamic

Each thread executes a chunk of iterations and then requests another chunk until there are no more chunks available. There is no particular order in which the chunks are distributed to the threads. The order changes each time when we execute the for loop. It defaults is one if no chunk-size is specified.

schedule(dynamic):     
*   ** **  * * *  *      *  *    **   *  *  * *       *  *   *  
  *       *     *    * *     * *   *    *        * *   *    *   
 *       *    *     * *   *   *     *  *       *  *  *  *  *   *
   *  *     *    * *    *  *    *    *    ** *  *   *     *   * 
schedule(dynamic, 1):  
    *    *     *        *   *    * *  *  *         *  * *  * *  
*  *  *   * *     *  * * *    * *      *   ***  *   *         * 
 *   *  *  *  *    ** *    *      *  *  * *   *  *   *   *      
  *    *     *  **        *  * *    *          *  *    *  * *  *
schedule(dynamic, 4):  
            ****                    ****                    ****
****            ****    ****            ****        ****        
    ****            ****    ****            ****        ****    
        ****                    ****            ****            
schedule(dynamic, 8):  
                ********                                ********
                        ********        ********                
********                        ********        ********        
        ********                                                

For schedule(dynamic) and schedule(dynamic, 1) OpenMP determines similar scheduling. The size of chunks is equal to one in both instances. The distribution of chunks between the threads is arbitrary.

For schedule(dynamic, 4) and schedule(dynamic, 8) OpenMP divides iterations into chunks of size four and eight, respectively. The distribution of chunks to the threads has no pattern.

The dynamic scheduling type is appropriate when the iterations require different computational costs. This means that the iterations are poorly balanced between each other. The dynamic scheduling type has higher overhead than the static scheduling type because it dynamically distributes the iterations during the runtime.

The following program demonstrates the overhead:

#include <unistd.h> 
#include <stdlib.h> 
#include <omp.h> 
#include <stdio.h>
#define THREADS 16 
#define N 100000000
int main ( ) { 
  int i;
  printf("Running %d iterations on %d threads dynamically.\n", N, THREADS); 
  #pragma omp parallel for schedule(dynamic) num_threads(THREADS) 
  for (i = 0; i < N; i++) {
     /* a loop that doesn't take very long */
  }
  /* all threads done */ printf("All done!\n");
  return 0; 
}

The program will run faster if using static.

Guided

The guided scheduling type is similar to the dynamic scheduling type. OpenMP again divides the iterations into chunks. Each thread executes a chunk of iterations and then requests another chunk until there are no more chunks available.

The difference with the dynamic scheduling type is in the size of chunks. The size of a chunk is proportional to the number of unassigned iterations divided by the number of the threads. Therefore the size of the chunks decreases.

The default chunk size is one if no chunk-size is specified.

schedule(guided):      
                            *********                        *  
                ************                     *******  ***   
                                     *******                   *
****************                            *****       **    * 
schedule(guided, 2):   
                ************                     ****     **    
                                     *******         ***    **  
                            *********                           
****************                            *****       **    **
schedule(guided, 4):   
                                     *******                    
                ************                     ****    ****   
                            *********                           
****************                            *****    ****    ***
schedule(guided, 8):   
                ************                 ********        ***
****************                                                
                                     ********                   
                            *********                ********

We can see that the size of the chunks is decreasing. First chunk has always 16 iterations. This is because the for loop has 64 iterations and we use 4 threads to parallelize the for loop. If we divide 64 / 4, we get 16.

We can also see that the minimum chunk size is determined in the schedule clause. The only exception is the last chunk. Its size might be lower then the prescribed minimum size.

The guided scheduling type is appropriate when the iterations are poorly balanced between each other. The initial chunks are larger, because they reduce overhead. The smaller chunks fills the schedule towards the end of the computation and improve load balancing. This scheduling type is especially appropriate when poor load balancing occurs toward the end of the computation.

Auto

The auto scheduling type delegates the decision of the scheduling to the compiler and/or runtime system.

In the following example, the compiler/system determined the static scheduling.

schedule(auto):        
****************                                                
                ****************                                
                                ****************                
                                                ****************

OpenMP in while-loop

This code is from https://www.cac.cornell.edu/VW/OpenMP/whileloop.aspx

#include <omp.h>
#include <stdio.h>
int main(int argc, char **argv)
{
  /* OpenMP does not provide a parallel while loop,
     so we're going to have to make one ourselves... */
  int sj, sstop, tn, tj, tstop;
  int foo(int j);
  omp_set_num_threads(8);

  /* Start carefully - we don't want all threads
     to (re!)initialize the two shared variables */
  sj = -1;     // shared loop counter
  sstop = 0;   // shared stopping condition

  #pragma omp parallel private(tn,tj,tstop)
  {
    tn = omp_get_thread_num();
    while (!sstop)
    {
      /* Threads update the shared counter by turns */
      #pragma omp critical
      {
        sj++;      // increment the shared loop counter...
        tj = sj;   // ...and keep a private copy of it
      }
      /* Threads evaulate function foo in parallel */
      tstop = foo(tj);
      /* Flip sstop for everyone if tstop is true in a thread */
      if (tstop)
      {
        sstop = 1;
        #pragma omp flush(sstop)
      }
      /* When sstop=1, most threads continue to this statment */
      printf("Thread %d, iteration %d, sstop=%d\n",tn,tj,sstop);
    }
  }
}

Wenbin Fei