Number of threads in `do concurrent` loops

Hi all,

I have questions regarding the Saxpy and Jacobi examples from https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk/.

First, the compiler diagnostics give the message `Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x`. Is my assumption correct that for a `do concurrent` loop, there are always 128 threads per block, and that the number of blocks is chosen in such a way that each loop element is associated with one single thread (assuming that the loop is not too large)? Is there a way to control the number of threads used?

Second, you can use thread parallelism on the host CPU with `-stdpar=multicore`. I noticed that some OpenMP environment variables, e.g. `OMP_PLACES` and `OMP_PROC_BIND`, influence the performance, whereas others, especially `OMP_NUM_THREADS`, do not. As far as I understand the CPU parallelism, it is not connected to OpenMP, or is it? Is there a list of environment variables which can be used to pin the threads to cores and especially, set the total number of threads?

Best regards,
Christian

Hi Christian,

> Is my assumption correct that for a `do concurrent` loop, there are always 128 threads per block,

While 128 threads per block is the default, the compiler may use a smaller value, typically when it can determine that the trip count of the thread loop is less than 128.

> and that the number of blocks is chosen in such a way that each loop element is associated with one single thread (assuming that the loop is not too large)?

Not necessarily, but for these small examples, one thread per loop iteration is optimal.
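To make the launch arithmetic concrete, here is a minimal Python sketch of how many blocks would be needed if every loop iteration were mapped to exactly one thread. The one-thread-per-iteration mapping and the function names `blocks_for` and `idle_threads` are assumptions for illustration only; the compiler is free to pick a different schedule.

```python
def blocks_for(n_iterations: int, threads_per_block: int = 128) -> int:
    """Blocks needed so that each iteration gets its own thread (assumed mapping)."""
    if n_iterations < 0 or threads_per_block <= 0:
        raise ValueError("need n_iterations >= 0 and threads_per_block > 0")
    # Ceiling division: a partially filled last block still counts as a block.
    return -(-n_iterations // threads_per_block)


def idle_threads(n_iterations: int, threads_per_block: int = 128) -> int:
    """Threads in the last block that have no iteration to work on."""
    total_threads = blocks_for(n_iterations, threads_per_block) * threads_per_block
    return total_threads - n_iterations
```

Under that assumption, a loop of 1,000 iterations would need 8 blocks of 128 threads, with 24 threads in the last block left without work.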

> Is there a way to control the number of threads used?

No, you’ll need to use OpenACC in order to set this. However, it’s rare that using more than 128 threads per block gives better performance.

> As far as I understand the CPU parallelism, it is not connected to OpenMP, or is it?

STDPAR is built on top of our OpenACC runtime, so to set the number of CPU cores, use the `ACC_NUM_CORES` environment variable.

> Is there a list of environment variables which can be used to pin the threads to cores and especially, set the total number of threads?

Since OpenACC doesn't have environment variables for thread-to-core binding, we leverage `OMP_PROC_BIND` and `OMP_PLACES`. Personally, I use `numactl` since it also supports memory binding.
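A small launcher script is one way to keep these settings together. The Python sketch below only builds the environment and the command line. The helper names, the binary name `./jacobi`, the core count, and the chosen binding values are placeholders, and the `numactl` prefix assumes a Linux system with `numactl` installed.

```python
import os


def stdpar_env(num_cores, proc_bind="close", places="cores", base=None):
    """Return a copy of the environment with the variables named above set."""
    env = dict(os.environ if base is None else base)
    env["ACC_NUM_CORES"] = str(num_cores)  # number of CPU cores to use
    env["OMP_PROC_BIND"] = proc_bind       # binding policy (placeholder value)
    env["OMP_PLACES"] = places             # binding granularity (placeholder value)
    return env


def numactl_command(program, numa_node=0):
    """Prefix a command so CPU and memory are both bound to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        *program,
    ]


# Example use (not run here; ./jacobi is a hypothetical binary
# built with -stdpar=multicore):
#   import subprocess
#   subprocess.run(numactl_command(["./jacobi"]), env=stdpar_env(8), check=True)
```

Whether to combine `numactl` with the `OMP_*` binding variables or use only one of them depends on the machine; the sketch shows both so either part can be dropped.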

-Mat