Hi everybody,
I am running my CUDA code on a GeForce 210 and I am wondering how blocks, or rather threads, are mapped to multiprocessors (MPs):
1.) Running deviceQuery showed that the maximum number of blocks per grid (one-dimensional grid) is 65535. So if I launch a kernel with the maximum number of blocks ( kernel<<<65535, X>>>(param1, param2, …, paramN); ), how are the blocks mapped to the MPs? I read that at most 8 blocks can be processed concurrently by one MP.
2.) If I launch a kernel with the maximum number of blocks and a block size of 1 ( kernel<<<65535, 1>>>(param1, param2, …, paramN); ), does that mean that internally each block ignores the block size of 1, runs 32 threads anyway, and simply discards the calculations of the other 31 threads?
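To make the two questions concrete, here is a small host-only sketch of the arithmetic I have in mind. It does not touch the GPU and says nothing about how the hardware scheduler really assigns blocks. It assumes a warp size of 32 and the limit of 8 resident blocks per MP that I read about; the function names are my own, and the MP count is a parameter I would have to take from deviceQuery.

```cpp
// Assumed hardware parameters (hard-coded here, not queried from the device).
const int kWarpSize = 32;        // threads per warp (assumption)
const int kMaxBlocksPerMP = 8;   // resident-block limit per MP that I read about (assumption)

// Number of warps needed to cover one block, rounded up.
int warpsPerBlock(int blockSize) {
    return (blockSize + kWarpSize - 1) / kWarpSize;
}

// Warp lanes in a block that have no thread assigned to them.
int idleLanesPerBlock(int blockSize) {
    return warpsPerBlock(blockSize) * kWarpSize - blockSize;
}

// Upper bound on the number of blocks resident at once across all MPs,
// if the per-MP block limit is the only constraint.
int maxResidentBlocks(int numMPs) {
    return numMPs * kMaxBlocksPerMP;
}

// Minimum number of "rounds" needed to work through the grid if every
// round filled all resident-block slots (an idealized model only).
int minRounds(int gridSize, int numMPs) {
    int resident = maxResidentBlocks(numMPs);
    return (gridSize + resident - 1) / resident;
}
```

Under these assumptions, warpsPerBlock(1) is 1 and idleLanesPerBlock(1) is 31, which is the situation I describe in question 2. For question 1, with a hypothetical 2 MPs, maxResidentBlocks(2) is 16 and minRounds(65535, 2) is 4096.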
I am pretty new to CUDA, so sorry if the answers to my questions seem obvious.