Hi everyone,
I would like to share a reproducer for a weird CUDA bug. It causes problems when there is more than one active block on an MP.
The full test code is as follows:
#include <stdio.h>
#include <stdlib.h>

#define GRIDSIZE  28
#define BLOCKSIZE 256

__device__ int sumval = 2000;

typedef struct {
#if (0)
    int buf[1786];
#else
    int buf[1785];
#endif
    int flag;
} patch_st;

__global__ void mykernel()
{
    volatile __shared__ patch_st patch;
    volatile __shared__ int points[BLOCKSIZE + 1];
    volatile __shared__ int changed;

    // Initialize values
    if (!threadIdx.x)
    {
        points[0] = 1;
        points[BLOCKSIZE] = 0;
        patch.flag = 1;
    }
    else
        points[threadIdx.x] = 0;
    __syncthreads();

    do
    {
        if (!threadIdx.x)
            changed = 0;
        __syncthreads();
        if (points[threadIdx.x] != points[threadIdx.x + 1])
        {
            points[threadIdx.x + 1] = 1;
            changed = 1;
        }
        __syncthreads();
    } while (changed);

    if (!threadIdx.x)
    {
        int sum = 0, i;
        for (i = 0; i < BLOCKSIZE; i++)
            sum += points[i];
        atomicMin(&sumval, sum);
        patch.flag = 0;
    }
}

int main(void)
{
    dim3 grid, block;
    int hostsum;

    grid.x = GRIDSIZE;
    block.x = BLOCKSIZE;
    //cudaSetDevice(0);
    mykernel<<<grid, block>>>();
    cudaThreadSynchronize();
    printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaMemcpyFromSymbol(&hostsum, "sumval", sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("Min sum = %d\n", hostsum);
    return 0;
}
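To spell out what each block is supposed to compute: the single 1 placed in points[0] should flood forward until every entry of points is 1, so the sum over points[0..BLOCKSIZE-1] must come out as 256. A plain single-threaded C sketch of that intended result (illustration only, not part of the test itself):

#include <stdio.h>

#define BLOCKSIZE 256

/* Serial simulation of the per-block propagation. */
int main(void)
{
    int points[BLOCKSIZE + 1] = {0};
    int changed, i, sum = 0;

    points[0] = 1;
    do
    {
        changed = 0;
        for (i = 0; i < BLOCKSIZE; i++)       /* i plays the role of threadIdx.x */
            if (points[i] != points[i + 1])
            {
                points[i + 1] = 1;
                changed = 1;
            }
    } while (changed);

    for (i = 0; i < BLOCKSIZE; i++)
        sum += points[i];
    printf("sum = %d\n", sum);                /* prints 256 */
    return 0;
}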
The struct patch_st is there to control the amount of shared memory occupied by each block (by varying the size of its buf field).
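A quick back-of-envelope check of the footprint, assuming 4-byte int and that ptxas adds roughly 16 bytes of built-in kernel-argument shared memory on sm_1x:

#include <stdio.h>

#define BLOCKSIZE 256

typedef struct {
    int buf[1785];   /* switch to 1786 for the larger build */
    int flag;
} patch_st;

int main(void)
{
    unsigned long user_smem = sizeof(patch_st)                /* 7144 */
                            + (BLOCKSIZE + 1) * sizeof(int)   /* 1028 */
                            + sizeof(int);                    /*    4 */
    /* 8176 bytes of user data; with the built-in overhead this matches
       the 8192 bytes reported by ptxas. buf[1786] adds 4 bytes -> 8196. */
    printf("user shared memory per block: %lu bytes\n", user_smem);
    return 0;
}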
First of all, I set GRIDSIZE to the number of multiprocessors available on the GPU (27 for my GTX 260) and the per-block shared memory footprint to 8192 bytes (half of the 16 KB available on each MP). With these settings the program yields the correct result (Min sum = 256). In this case each block runs on its own MP, so there is only one active block per MP.
$ nvcc --optimize 3 -arch=sm_11 --ptxas-options=-v -o testbug1 testbug1.cu
testbug1.cu
tmpxft_00002348_00000000-3_testbug1.cudafe1.gpu
tmpxft_00002348_00000000-8_testbug1.cudafe2.gpu
ptxas info : Compiling entry function '_Z8mykernelv'
ptxas info : Used 2 registers, 8192+8188 bytes smem, 8 bytes cmem[1], 4 bytes cmem[14]
tmpxft_00002348_00000000-3_testbug1.cudafe1.cpp
tmpxft_00002348_00000000-13_testbug1.ii
Next, I increase GRIDSIZE to 28 while keeping the shared memory footprint at 8192 bytes. Now one MP has to run 2 blocks, and those 2 blocks can be resident simultaneously (2 active blocks) because each of them needs only half of the available shared memory (and the register usage is small enough not to matter). With these settings the program yields a random incorrect result (less than 256). When more than one block is active, the WHILE loop in the code above does not work as expected: it apparently exits before the 1 has propagated through the whole points array, so the per-block sum comes out under 256.
Then I increase the shared memory footprint to 8196 bytes (just over half of the available space), keeping GRIDSIZE unchanged (28 for my GTX 260). In this case one MP still has to run 2 blocks, but they can no longer be resident simultaneously. With these settings the program works properly again (Min sum = 256).
$ nvcc --optimize 3 -arch=sm_11 --ptxas-options=-v -o testbug1 testbug1.cu
testbug1.cu
tmpxft_0000147c_00000000-3_testbug1.cudafe1.gpu
tmpxft_0000147c_00000000-8_testbug1.cudafe2.gpu
ptxas info : Compiling entry function '_Z8mykernelv'
ptxas info : Used 2 registers, 8196+8192 bytes smem, 8 bytes cmem[1], 4 bytes cmem[14]
tmpxft_0000147c_00000000-3_testbug1.cudafe1.cpp
tmpxft_0000147c_00000000-13_testbug1.ii
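For reference, the occupancy arithmetic behind these settings, assuming the 16 KB of shared memory per MP on sm_1x and that neither registers nor thread count is the limiting factor:

#include <stdio.h>

int main(void)
{
    int smem_per_mp = 16384;             /* sm_1x shared memory per MP   */
    int footprints[2] = {8192, 8196};    /* the two ptxas-reported sizes */
    int i;

    for (i = 0; i < 2; i++)
        printf("smem/block = %d -> %d resident block(s) per MP\n",
               footprints[i], smem_per_mp / footprints[i]);
    /* 8192 -> 2 blocks can be active at once; 8196 -> only 1. */
    return 0;
}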
The issue also happens on my 8200 mGPU.
I’ve run the test on WinXP 32-bit, CUDA 2.2.
Can anybody explain the issue? I am wondering whether this is a hardware or toolchain bug.