First post – sorry if this is a dumb question:
I’m just trying out CUDA. I have a GT 240, which is specced with 96 CUDA cores, and I wanted to understand the basics of parallel execution before investing in something heavier.
I’m struggling a bit to understand the model of blocks, threads, cores, SMs, streams, etc., so I tried a simple test to eliminate memory effects and get down to raw processing measurements. I want to work out the optimal configuration of blocks and threads per block, on my GT 240, for a simple linear task that I can break into parallel chunks.
I’ve written a kernel that takes no parameters, and basically counts to 1,000,000 (adding the thread id so it has something to do) as a speed test.
static __global__ void MyKernel()
{
    const int tid = threadIdx.x;
    int s = 0;

    // Busy loop: accumulate the thread id a million times (no memory traffic).
    for (int n = 0; n <= 1000000; n++)
    {
        s += tid;
    }
}
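(For reference, I time each launch on the host with something along these lines; this is a simplified sketch using CUDA events, with no error checking, and timeKernel is just an illustrative helper name:)

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: launch MyKernel with the given configuration and
// return the elapsed time in milliseconds, measured with CUDA events.
static float timeKernel(int blocks, int threadsPerBlock)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    MyKernel<<<blocks, threadsPerBlock>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);    // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}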
If I run this as a single thread in a single block, it takes about 200 ms. Fine.
MyKernel<<<1,1>>>(); // = ~ 200 ms
If I keep a single thread per block, and increase the number of blocks, it doesn’t take much longer…
MyKernel<<<50,1>>>(); // = ~ 200 ms
Until I hit 97 blocks, at which point it takes twice as long:
MyKernel<<<97,1>>>(); // = ~ 400 ms
Perfect. I have 96 cores, so it makes perfect sense that the card can run 96 blocks in parallel and then needs an extra pass to complete block 97. Increasing the block count further stays at ~400 ms until I hit 193 blocks, where it jumps to ~600 ms. All as expected.
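(The block sweep is just a loop over the first launch argument, using the timeKernel helper above, roughly:)

for (int blocks = 1; blocks <= 256; blocks++)
    printf("%3d blocks x 1 thread : %6.1f ms\n", blocks, timeKernel(blocks, 1));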
At that point I’m patting myself on the back and making coffee.
The problem came when I tested the opposite: a single block with an increasing number of threads per block.
MyKernel<<<1,1>>>(); // = ~ 200 ms
MyKernel<<<1,50>>>(); // = ~ 200 ms
MyKernel<<<1,100>>>(); // = ~ 210 ms
MyKernel<<<1,150>>>(); // = ~ 220 ms
MyKernel<<<1,200>>>(); // = ~ 240 ms
MyKernel<<<1,250>>>(); // = ~ 240 ms
MyKernel<<<1,300>>>(); // = ~ 260 ms
MyKernel<<<1,400>>>(); // = ~ 300 ms
MyKernel<<<1,512>>>(); // = ~ 360 ms
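(Again, this is just a loop with the timeKernel helper, this time over the second launch argument, roughly:)

for (int threads = 1; threads <= 512; threads++)
    printf("1 block x %3d threads : %6.1f ms\n", threads, timeKernel(1, threads));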
I’m now baffled as to why I can complete 200 parallel runs in not much more time than a single run, and why there is no apparent step change in timing when I hit 97 threads.
What’s going on? And how does this help me determine the optimal number of blocks and threads per block for this kind of task?
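(In case it’s relevant, this is the sort of device query I’ve been looking at to try to relate the numbers to the hardware; I’m assuming multiProcessorCount, warpSize and maxThreadsPerBlock are the fields that matter, but I’m not sure how they combine:)

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);    // assuming the GT 240 is device 0
printf("name                : %s\n", prop.name);
printf("multiProcessorCount : %d\n", prop.multiProcessorCount);
printf("warpSize            : %d\n", prop.warpSize);
printf("maxThreadsPerBlock  : %d\n", prop.maxThreadsPerBlock);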
Charlie