Dear all,
I have a question about CUDA cores and threads. Imagine I have a GPU with 448 cores: will each thread run on one core? Does that mean I can run up to 448 threads in parallel at a time, regardless of my threads/block configuration?
Thanks a lot.
Incidentally, yes. If your GPU has 448 cores, it’s not a compute capability 2.1 device where multiple cores would work on one thread. On all other GPUs, threads will always be scheduled to the same core.
However, this is not something you should care about at all. On a fully loaded GPU, there are many more threads in flight than there are cores (at least 24× more threads than cores). Whether a particular thread always ends up on the same core or on different ones, or on more than one core at the same time, is an implementation detail you should not worry about.
Hi, thanks for the reply.
Another little question: how can I detect the maximum number of threads I can map onto the device?
cudaGetDeviceProperties() will give you a struct with some useful fields:
multiProcessorCount
maxThreadsPerBlock
maxThreadsPerMultiProcessor
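To make that concrete, here is a minimal sketch that queries those fields and prints them. It assumes you want device 0, and the helper name queryDevice is just something made up for this example, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fills *prop for the given device, returns false on error.
bool queryDevice(int device, cudaDeviceProp *prop) {
    cudaError_t err = cudaGetDeviceProperties(prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    cudaDeviceProp prop;
    if (!queryDevice(0, &prop)) return 1;  // assumption: first device

    std::printf("name:                        %s\n", prop.name);
    std::printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    std::printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);
    std::printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Compile with nvcc; the values printed depend on the card you run it on.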
Hi seibert, thanks for the info. My GPU has multiProcessorCount=14 and maxThreadsPerMultiProcessor=1536. Does that mean the maximum number of threads that can run on the GPU at once is 14 x 1536 = 21504? And if I exceed this value, will the kernel raise an exception?
If you exceed that value, the remaining blocks are started when earlier blocks finish. You do not need to worry about this unless you play dirty tricks to implement inter-block communication.
Ideally, you start a lot more threads than 21504, so that different GPUs all get maxed out (and not too much computing power is wasted while the GPU is loaded only partially in the end).
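As a rough illustration of the point above, the sketch below computes the resident-thread bound (SM count times threads per SM) and then launches a kernel with far more threads than that. The kernel fillIndex, the helper maxResidentThreads and the problem size of 1 << 22 elements are all choices made up for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per element: each thread writes its own global index.
__global__ void fillIndex(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Upper bound on threads that can be resident on the GPU at the same time.
long long maxResidentThreads(int smCount, int maxThreadsPerSM) {
    return (long long)smCount * maxThreadsPerSM;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    long long resident = maxResidentThreads(prop.multiProcessorCount,
                                            prop.maxThreadsPerMultiProcessor);

    const int n = 1 << 22;              // about 4 million threads (assumption)
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    // More blocks than fit at once: the extra ones wait until earlier blocks finish.
    fillIndex<<<blocks, threadsPerBlock>>>(d_out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(d_out); return 1; }

    std::vector<int> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h[i] != i) { ok = false; break; }
    }
    std::printf("resident bound: %lld, launched: %d, result %s\n",
                resident, n, ok ? "correct" : "WRONG");
    return ok ? 0 : 1;
}
```

No exception is raised for exceeding the resident bound; every element should still be filled in, just not all at the same instant.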
But is there a theoretical limit on the maximum number of threads for a kernel? Or can I ideally launch millions and millions of threads at a time?
The limits on the number of blocks per kernel are given in Appendix F of the CUDA C Programming Guide.
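The same limits can also be read at runtime from the cudaDeviceProp struct, via the maxGridSize, maxThreadsDim and maxThreadsPerBlock fields. A small sketch, assuming device 0; the helper maxThreads1D is a made-up name and only covers a one-dimensional launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: most threads a single 1D launch can contain,
// i.e. maximum blocks along x times maximum threads per block.
long long maxThreads1D(int maxGridX, int maxThreadsPerBlock) {
    return (long long)maxGridX * maxThreadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;  // assumption: device 0

    std::printf("maxGridSize:        %d x %d x %d blocks\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("maxThreadsDim:      %d x %d x %d threads\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads in one 1D launch: %lld\n",
                maxThreads1D(prop.maxGridSize[0], prop.maxThreadsPerBlock));
    return 0;
}
```

Using more grid dimensions raises the total further, so launching millions of threads per kernel is well within the limits.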