Hi John,
Best not to equate an OpenMP thread with a CUDA thread, nor a CPU with a GPU, in terms of execution model. With a CPU you generally only want 1 OpenMP thread running per CPU core. With a GPU, however, you can have a maximum of 2048 threads running on each SM simultaneously (a minimum of 2 blocks with 1024 threads each, up to a maximum of 16 blocks with 128 threads each; newer devices allow a max of 32 blocks with 64 threads each, but are still limited to 2048 total threads). So you won’t be “oversubscribing” your device unless you use over 30,720 threads (2048 threads times the K40’s 15 SMs). Even then, a new block won’t be issued until another has retired, so there’s less contention than would occur if you oversubscribed on a CPU. You are correct, though, that the K40 does have a limited number of FP64 compute cores, so you may see some contention there (newer NVIDIA GPUs have many more FP64 cores available).
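If you'd like to confirm the limits on your particular card, something along these lines (standard CUDA runtime calls, device 0 assumed) will print them; the “deviceQuery” sample that ships with the CUDA toolkit reports the same information:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

    printf("Device:                   %s\n", prop.name);
    printf("SM count:                 %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:       %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    printf("Total concurrent threads: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

On a K40 (15 SMs) the last line works out to the 30,720 figure above.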
To start, though, I’d let the compiler choose the schedule (typically it will use 128 vectors per gang), make sure the program is running correctly, and then get a baseline performance. From there you can start experimenting with the vector length. Keep in mind that lowering the vector length also lowers the maximum number of concurrent threads running on a given SM. So while your FP64 contention may go down, the different blocks may be running different sections of code (such as fetching memory), and your overall performance may drop since fewer threads are running. You can try different numbers of gangs as well, but I generally recommend not setting this except in special circumstances, and instead letting the runtime determine the number of gangs from the actual loop trip count used during execution.
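As a rough sketch of that progression (using a made-up saxpy-style loop, not your code), it would look something like the following; compiling with the PGI flags “-acc -Minfo=accel” will show you the schedule the compiler actually picked:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    const double a = 2.0;
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* Step 1: baseline -- let the compiler pick the schedule
       (it will typically choose 128 vectors per gang). */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    /* Step 2: once correctness and a baseline time are in hand,
       experiment with the vector length.  I'd leave num_gangs unset
       so the runtime sizes it from the loop trip count. */
    #pragma acc parallel loop vector_length(256) copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x); free(y);
    return 0;
}
```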
I should note that for OpenACC when targeting an NVIDIA device, a “gang” maps to a CUDA thread block, a “worker” maps to the thread block’s “y” dimension, and a “vector” to the thread block’s “x” dimension. In general, “worker” is not used except for special cases.
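To make the mapping concrete, here’s a hypothetical loop nest (the array and size names are made up) with the schedule spelled out:

```c
/* Hypothetical example of the OpenACC -> CUDA mapping.  A and B are
   flattened n x m arrays. */
void scale(int n, int m, double *restrict A, const double *restrict B)
{
    #pragma acc parallel loop gang vector_length(128) copyin(B[0:n*m]) copyout(A[0:n*m])
    for (int i = 0; i < n; ++i) {         /* gang   -> CUDA thread block         */
        #pragma acc loop vector
        for (int j = 0; j < m; ++j) {     /* vector -> threadIdx.x (128 threads) */
            A[i*m + j] = 2.0 * B[i*m + j];
        }
    }
}
```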
Also, while 2048 is the maximum number of threads per SM, the achievable number may be lower depending on the number of registers used per thread and the shared memory used per block. Both are fixed-size resources on each SM, so if, for example, each thread uses more than 32 registers, the maximum number of concurrently running threads will be reduced. If you do a web search for “CUDA Occupancy”, you can find additional details.
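As a back-of-the-envelope illustration (using the K40’s 65,536 registers per SM and ignoring allocation granularity and the shared memory limit), you can see why 32 registers per thread is the break-even point:

```c
#include <stdio.h>

/* Rough register-limited occupancy arithmetic for one Kepler (K40) SM.
   Illustrative only -- real occupancy also depends on shared memory,
   block-count limits, and allocation granularity. */
int main(void)
{
    const int regs_per_sm        = 65536;  /* 32-bit registers per SM (K40) */
    const int max_threads_per_sm = 2048;

    for (int regs_per_thread = 32; regs_per_thread <= 128; regs_per_thread *= 2) {
        int threads = regs_per_sm / regs_per_thread;
        if (threads > max_threads_per_sm) threads = max_threads_per_sm;
        printf("%3d registers/thread -> at most %4d concurrent threads per SM\n",
               regs_per_thread, threads);
    }
    return 0;
}
```

With the PGI compilers, I believe adding the “ptxinfo” sub-option (e.g. “-ta=tesla:ptxinfo”) will print the actual register count used by each kernel.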
Can someone explain why such oversubscribing is good for a GPU platform?
The main reason is latency hiding. While a CPU’s memory system is optimized to get small amounts of data from memory to a core very quickly, a GPU’s memory system is optimized for throughput. So, to hide the time it takes to retrieve data from memory, as one warp (32 threads) waits on a memory request, other warps can be running.
So while FP64 core contention may be a performance issue, of greater importance is ensuring that your vector loop accesses data along the contiguous (stride-1) dimension of your arrays. This way, the data for all threads in a warp can be obtained in a single fetch from memory.
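For example (a sketch with made-up names; in C the rightmost index is the contiguous one, while in Fortran it’s the leftmost):

```c
/* Coalescing sketch.  A and B are flattened n x m row-major arrays,
   assumed to be allocated and initialized elsewhere. */

/* Good: the vector loop runs over the stride-1 index 'j', so consecutive
   threads in a warp touch consecutive addresses and the loads coalesce. */
#pragma acc parallel loop gang
for (int i = 0; i < n; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < m; ++j)
        A[i*m + j] = 2.0 * B[i*m + j];
}

/* Bad: the vector loop runs over the strided index 'i', so each thread in
   a warp hits a different cache line and the fetches cannot be combined. */
#pragma acc parallel loop gang
for (int j = 0; j < m; ++j) {
    #pragma acc loop vector
    for (int i = 0; i < n; ++i)
        A[i*m + j] = 2.0 * B[i*m + j];
}
```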
I’ve probably missed some details and oversimplified others, but hopefully this provides some answers to your questions. Of course, if anything was not clear, please feel free to ask follow-up questions.
Also, if you can provide a code example, I can give more specific suggestions on improving performance.