Operating System: Windows XP 64-bit, but host and device code compiled as 32 bit
CUDA toolkit: v. 2.3
SDK: v. 2.3
Compiler: Visual Studio 2008, v. 9.0.21022.8 RTM
Hardware:
CPU Intel Core2 Duo E8600 3.33GHz,
GPU NVIDIA GeForce 9600GT (for display only),
GPU NVIDIA GeForce GTX 285 (for CUDA computation)
Driver: 191.07
Problem: Under certain conditions (see below), not all blocks of a kernel are executed at once, despite the device being capable of doing so
Detail:
Consider launching two kernels (both with occupancy = 1) on a device of compute capability 1.3 with 30 streaming multiprocessors:
kernelA<<<p,256>>>
kernelB<<<q,512>>>
Let q be equal to 60.
If p is smaller than X (X equals #SM*2 = 60 in my case, but I cannot assert that particular value holds elsewhere), then kernelB is initially launched with fewer than q blocks, and only after some blocks finish their work are new ones scheduled in. This can lead to a serious drop in performance and in some cases may cause a program to work incorrectly - for example, if the kernel assumes all its blocks are active.
Example:
__global__ void emptyKernel() {}

__global__ void myKernel(int *control, int *output) {
    int numberOfAttempts=0;
    int terminate=1234567;
    output[blockIdx.x]=1;
    __syncthreads();
    //attempt to synchronise all blocks
    if (threadIdx.x==0) {
        atomicAdd(control,1);
        volatile int v;
        do {
            ++numberOfAttempts;
            v=*(volatile int *)control; //volatile to force a read from global memory
            if (numberOfAttempts==terminate) {
                output[blockIdx.x]=0; //we waited too long: report the problem and quit the loop
                break;
            }
        } while (v<gridDim.x);
    }
    __syncthreads();
}
int main() {
    /* [...] initialisation which I skipped here, e.g. choosing the best GPU */
    const int maxBlocks=60;
    int *gpuControl;
    cudaMalloc((void**)&gpuControl, sizeof(int));
    int cpuControl=0;
    cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);
    int *gpuOutput;
    cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks);
    int cpuOutput[maxBlocks];
    int thr=480;
    int p=13;
    int q=60;
    emptyKernel<<<p,thr>>>();
    cudaThreadSynchronize(); //not needed, but to be on the safe side...
    myKernel<<<q,512>>>(gpuControl, gpuOutput);
    cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks,cudaMemcpyDeviceToHost);
    for (int i=0; i<q; ++i)
        printf("%d ",cpuOutput[i]);
    return 0;
}
Expected output:
Working on GeForce GTX 285, 30 SMs, CC=1.3
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Obtained output:
Working on GeForce GTX 285, 30 SMs, CC=1.3
1 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
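As a side note, even when scheduling behaves as hoped, a spin-wait barrier like the one in myKernel is only safe if every block is simultaneously resident. The usual robust alternative is to split the work at the barrier point into two kernel launches, since launches on the same stream are ordered and the boundary acts as a grid-wide sync. A sketch (phase1/phase2 are hypothetical names, not from the code above):

```cuda
// Instead of spinning on *control inside one kernel, split at the barrier.
__global__ void phase1(int *output) {
    output[blockIdx.x] = 1;   // work done before the barrier
}

__global__ void phase2(int *output) {
    // work that required every block of phase1 to have completed
}

// host side:
//   phase1<<<q,512>>>(gpuOutput);
//   phase2<<<q,512>>>(gpuOutput);  // implicit grid-wide barrier between launches
```

This works for any q, at the cost of an extra launch and loss of per-block state across the boundary.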
Different parameters:
The output depends on the parameters p, q and thr (see the code). For some values the output is correct; for others it is not.
-
If thr is between 481 and 512, the output is correct regardless of the values of p and q. That range of thr makes both kernels have the same number of warps per block (16).
-
If q>60, an obvious and expected error occurs, as the GTX 285 cannot run more than 60 such blocks in parallel.
-
If p is a multiple of 60, the behaviour is correct.
-
If q<=60-p, the behaviour is also correct.
-
So the bug occurs if p<60 and 60-p<q<=60