
Consider the following code:

#pragma omp parallel
for (int run = 0; run < 10; run++)
{
  std::vector<int> out;
  #pragma omp for
  for (int i = 0; i < 1'000'000; i++)
  {
    ...
  }
}

The intent is to spawn the OpenMP threads only once, before the outer iterations (which are supposed to run sequentially), and then to schedule the inner iterations* repeatedly on the same existing team of threads.

However, the outer loop is not marked #pragma omp single. Should I assume that it will do something like running the same outer code in all the threads at the same time, thereby a) concurrently modifying the run variable at ill-advised times, and b) possibly conflicting in the instantiation of out? Or are these variables implicitly private, not shared between threads, because they are declared inside the parallel section?

How should this be executed according to the OpenMP specification, and what do the various implementations do in practice?

* My actual code is a bit more complex than this; in practice I intend to put the inner for loop in an orphaned function that will be called by an outer function which handles setting up the parallel execution.
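
For reference, here is a minimal sketch of the pattern I mean (the function names inner_work and run_all are illustrative, not my actual code):

#include <vector>

// The inner loop lives in its own function; the omp for inside it is
// "orphaned" and binds to whatever parallel region the caller is in.
void inner_work()
{
  std::vector<int> out;              // automatic storage: one copy per thread
  #pragma omp for
  for (int i = 0; i < 1'000'000; i++)
  {
    // ... per-iteration work ...
  }
}

void run_all()
{
  #pragma omp parallel               // the team is created once, here
  for (int run = 0; run < 10; run++) // executed by every thread in the team
    inner_work();
}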

Comments
  • There is an OpenMP specification, but that's not an official "standard". Perhaps it's just about your wording, but the question sounds as if you are seeking an answer in the C++ standard (which does not cover OpenMP).
  • The run and out variables are thread-private in your case, so there are no conflicts. Live demo: godbolt.org/z/MPocj1acq. The outer loop is run by each thread individually, and the inner loop is run in parallel by all the threads, which seems to be what you want. However, if it logically makes sense, I wouldn't be afraid of putting omp parallel for just before the inner loop (see the sketch after these comments). On modern systems, creation of threads is fast, and OpenMP runtimes are able to reuse threads without creating them repeatedly under the hood.
  • @463035818_is_not_an_ai I was indeed thinking of the OpenMP specification and not the C++ standard itself; poor wording from me. Edited the question!
  • @DanielLangr In your demo the first thread actually finishes both iterations before the second thread gets a chance to run, so in practice each thread independently runs the exact same outer code as fast as it can until it encounters the omp for? How does each thread coordinate to determine which batch of inner iterations it should run, particularly when the schedule is not static?
  • @SergeyAKryukov What I wanted to say was that the outer iterations are not marked single: yes, they effectively run in parallel because they are inside that region. Thanks for confirming that this should not cause issues!
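
A minimal sketch of the alternative suggested in the comments, with a combined parallel for re-entered on each outer iteration (run_all is an illustrative name; note that out, now declared outside the parallel region, would be shared inside it):

#include <vector>

void run_all()
{
  for (int run = 0; run < 10; run++)     // strictly sequential outer loop
  {
    std::vector<int> out;                // shared inside the region below, so
                                         // concurrent writes would need care
    #pragma omp parallel for             // region re-created on every run; the
    for (int i = 0; i < 1'000'000; i++)  // runtime typically reuses its threads
    {
      // ... per-iteration work ...
    }
  }
}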

1 Answer


My current understanding of the situation, based on the comments and my own experimentation, is summarized in the following tentative diagram:

[Diagram: execution of the code in the question, shown with 4 threads and the first 4 outer iterations]

Time is from top to bottom, and threads are from left to right.

  • The program starts with only one thread (in red) until it encounters the first omp parallel construct, at which point it creates $N-1$ additional threads that, together with the main thread, form a team of $N$ threads (in grey).

  • Each thread is scheduled for execution independently by the OS scheduler, over which we have no control. Consequently, each thread executes the run++ of the outer loop and the instantiation of its own private out vector at different, unrelated times.

  • Each time a thread encounters the omp for of the inner loop (in orange, blue, purple and green), it asks the OpenMP runtime (which I haven't drawn attached to any thread because I do not know the specifics of the implementation) which subset of the iterations it should do. This depends on the schedule clause, and again happens at completely independent times. For schedule(static), each thread likely gets an equal batch of $1000000/N$ values of i.

  • At the end of each omp for construct there is an implicit barrier: every thread, including the main thread, waits there until all threads have finished their share of the inner iterations, so no thread can start the next outer iteration early (see the annotated sketch below).
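
Putting this together, here is the code from the question annotated with the synchronization points as I understand them (per the spec passage cited in the comments below):

#pragma omp parallel                   // team of N threads created here
for (int run = 0; run < 10; run++)     // executed redundantly by all threads;
{                                      // run and out are private to each thread
  std::vector<int> out;
  #pragma omp for                      // iterations split per the schedule clause
  for (int i = 0; i < 1'000'000; i++)
  {
    // ...
  }
  // implicit barrier: no thread starts run+1 before all finish this omp for
}
// implicit barrier at the end of the parallel region; the team disbands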


Comments

  • I agree with the text, but the image is wrong, since it does not reflect the implicit barriers you mention. For example, according to the image, the main thread starts run=1 while thread1 hasn't finished run=0.
  • My understanding was that there were implicit barriers only at the end of the omp parallel section, but not between omp for sections unless explicitly added, which was what I wanted to show on the schema. Was that incorrect?
  • According to the OpenMP Specification, "There is an implicit barrier at the end of a worksharing-loop construct unless a nowait clause is specified." See, e.g., openmp.org/spec-html/5.0/openmpsu41.html#x64-1290002.9.2
  • Thanks, I think I missed this one! Edited to reflect that!
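
As a footnote to the spec passage quoted above, a hedged sketch of opting out of that end-of-loop barrier with a nowait clause; this is only safe when nothing after the loop depends on all inner iterations having finished:

#pragma omp parallel
for (int run = 0; run < 10; run++)
{
  #pragma omp for nowait               // no barrier: a thread may move on to
  for (int i = 0; i < 1'000'000; i++)  // run+1 while others are still working
  {
    // ... work that is independent across runs ...
  }
}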
