General CUDA/OpenACC thread question

Hi, all:

I’m in the process of optimizing the performance of my hybrid CUDA Fortran/OpenACC code. I come from an MPI and OpenMP parallel programming background, and my original code is partially parallelized with OpenMP. I have some general questions about the software/hardware configuration of NVIDIA GPUs and the OpenACC platform.

  1. Through research, I found that I need to significantly oversubscribe the cores on the GPU, i.e., the number of threads of a CUDA kernel needs to be many times the number of actual cores the GPU has. This is very different from MPI and OpenMP, where the number of processes/threads is usually the same as the number of CPU cores, and launching too many processes/threads significantly hurts performance. Can someone explain why such oversubscription is good on the GPU platform?

  2. My code needs to use double precision. I’m using a Tesla K40 as the development platform. It has 15 SMs, each with 64 double-precision cores. Is the warp size still 32 for double-precision operations? What is the best OpenACC gang/worker/vector configuration for double-precision computation on the K40? I’m thinking 15 gangs, 2 workers, and a vector length of 32. Is this configuration good, or do I have to use much higher numbers for gangs/workers?

Thanks in advance.


Hi John,

It’s best not to equate an OpenMP thread with a CUDA thread, nor a CPU with a GPU, in terms of execution model. With a CPU, you generally do want only one OpenMP thread running per CPU core. With a GPU, however, you can have a maximum of 2048 threads resident on each SM simultaneously (a minimum of 2 blocks with 1024 threads each up to a maximum of 16 blocks with 128 threads each; newer devices allow up to 32 blocks with 64 threads each but are still limited to 2048 threads total). So you won’t be “oversubscribing” your device unless you use more than 30,720 threads (2048 threads x 15 SMs). Even then, a new block won’t be issued until another has retired, so there’s less contention than would occur if you oversubscribed a CPU. You are correct, though, that the K40 has a limited number of FP64 compute cores, so you may see some contention there (newer NVIDIA GPUs have many more FP64 cores available).

To start, though, I’d let the compiler choose the schedule (typically it will use a vector length of 128 per gang), make sure the program runs correctly, and then get a baseline performance. From there you can start experimenting with the vector length. Keep in mind that lowering the vector length reduces the maximum number of concurrent threads running on a given SM. So while your FP64 contention may go down, the different blocks may be running different sections of code (such as fetching memory), and your overall performance may drop because fewer threads are running. You can try different numbers of gangs as well, but I generally recommend not setting this except in special circumstances; instead, let the runtime determine the number of gangs based on the actual loop trip count at execution time.

I should note that for OpenACC, when targeting an NVIDIA device, a “gang” maps to a CUDA thread block, “worker” maps to the thread block’s “y” dimension, and “vector” to the thread block’s “x” dimension. In general, “worker” is not used except in special cases.

Also, while 2048 is the maximum number of threads per SM, the achievable number may be lower depending on the number of registers used per thread and the shared memory used per block. Both the register file and shared memory are fixed-size resources on each SM, so if each thread uses more than 32 registers, the maximum number of resident threads will be reduced. If you do a web search for “CUDA occupancy”, you can find additional details.

Can someone explain why such over subscribing is good for GPU platform?

The main reason is latency hiding. While CPU memory is optimized to deliver small amounts of data to a core very quickly, GPU memory is optimized for throughput. So, to hide the time it takes to retrieve data from memory, other warps can run while one warp (32 threads) waits on memory.

So while FP64 core contention may be a performance issue, it is more important to ensure that your vector loop accesses data along the contiguous (stride-1) dimension of your arrays. That way, the data for all threads in a warp can be obtained in a single memory fetch.

I’ve probably missed some details and oversimplified others, but hopefully this answers your questions. Of course, if anything was unclear, please feel free to ask follow-up questions.

Also, if you want to provide a code example, I can give more specific suggestions for improving performance.

Hi, Mat:

Thanks for your detailed explanation of the difference between OpenMP threads and CUDA threads. After spending several days trying different settings for the number of gangs and the vector length, I found that these settings are problem-size dependent; it is hard to make one choice that works well for all problem sizes.