Hi everyone,
I am developing an application on an old Quadro FX 580 card and I am running into strange behavior I cannot explain, despite having tried and tested many things. I am on a Linux platform with CUDA 6.0, and the adapter also drives my X Window graphics environment.
My nVidia driver version for Linux is: 334.21
Here are the specs I get from the CUDA API for my adapter:
===========================================================
There is 1 device supporting CUDA
Device 0 name: Quadro FX 580
Computational Capabilities: 1.1
Maximum global memory size: 536150016
Maximum constant memory size: 65536
Maximum shared memory size per block: 16384
Maximum block dimensions: 512 x 512 x 64
Maximum grid dimensions: 65535 x 65535 x 1
Warp size: 32
Multiprocessor count: 4
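The query code is nothing fancy; it is essentially the standard cudaGetDeviceProperties calls (a minimal sketch, the printDeviceSpecs name is just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the device query that produces the listing above.
void printDeviceSpecs() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("There is %d device supporting CUDA\n", count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d name: %s\n", dev, prop.name);
        printf("Computational Capabilities: %d.%d\n", prop.major, prop.minor);
        printf("Maximum global memory size: %lu\n", (unsigned long)prop.totalGlobalMem);
        printf("Maximum constant memory size: %lu\n", (unsigned long)prop.totalConstMem);
        printf("Maximum shared memory size per block: %lu\n", (unsigned long)prop.sharedMemPerBlock);
        printf("Maximum block dimensions: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Maximum grid dimensions: %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("Warp size: %d\n", prop.warpSize);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
    }
}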
My application passes an array to the kernel: the array is copied to device memory, and the kernel loads values into shared memory before computing the cumulative sum along the rows. The shared memory array has (number of y threads) rows and (section size, i.e. twice the number of x threads) columns for a given block size; empty elements are initialized to zero and do not contribute to the sum at the end. So far, so good.
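To illustrate what I mean by the cumulative sum on the rows, here is a simplified sketch of the idea (not my actual scan code: it assumes one block per data row with blockDim.x equal to the section size, and it does the sum serially in one thread just to show the intended result on a zero-padded row):

__global__ void row_cumsum_sketch(const float *in, float *out,
                                  unsigned int num_cols, unsigned int section_size) {
    extern __shared__ float row[];   // one zero-padded row per block (illustration only)
    unsigned int r  = blockIdx.y;    // one block per data row in this sketch
    unsigned int tx = threadIdx.x;

    // Load and zero-pad, as in the test kernel further down
    row[tx] = (tx < num_cols) ? in[r * num_cols + tx] : 0.0f;
    __syncthreads();

    // Serial inclusive scan by one thread (the real code uses a parallel scan)
    if (tx == 0)
        for (unsigned int i = 1; i < section_size; ++i)
            row[i] += row[i - 1];
    __syncthreads();

    out[r * section_size + tx] = row[tx];
}

It would be launched as row_cumsum_sketch<<<dim3(1, num_rows), section_size, section_size * sizeof(float)>>>(d_in, d_out, num_cols, section_size) for a small matrix.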
My kernel works fine for specific block size values, but beyond a certain point it returns garbage/bad values. I did check the shared memory size and I am well below the maximum from the table above; I am supposed to have 16 KB of shared memory per block.
Here is sample output from a working test kernel (it performs no computation; it just loads data into shared memory and writes it back into an output array for display). I am truncating the lengthy rows here for simplicity.
Number of data columns: 7 Number of data rows: 4
Block size X: 128 Block size Y: 4
Section size: 256
Number of blocks X: 1
Number of blocks Y: 1
Shared memory required per block: 4096 bytes
Shared memory required: 4096 bytes
46.500572205 60.121448517 94.641960144 16.583198547 76.388687134 78.664627075 6.791918278 0.000000000 0.000000000 … 0.000000000
35.463073730 93.785087585 76.827529907 47.127616882 80.814987183 48.758403778 70.282768250 0.000000000 0.000000000 … 0.000000000
59.405567169 28.825101852 10.006287575 81.956733704 44.649925232 18.157581329 6.861516953 0.000000000 0.000000000 … 0.000000000
20.348367691 5.681340694 90.369644165 52.945861816 42.453250885 69.568466187 32.529796600 0.000000000 0.000000000 … 0.000000000
You will notice I am using only 4 KB of the 16 KB of shared memory available per block: everything is float, so 4 bytes per element, and the shared table is 256 x 4. Each thread loads two values, so the 128 threads in the x direction fill a 256-element section (the arithmetic for both runs is spelled out after the next listing). Now, if I use a larger block and increase the x dimension from 128 to 256, which is the limit for my hardware, I get the following result:
Number of data columns: 7 Number of data rows: 4
Block size X: 256 Block size Y: 4
Section size: 512
Number of blocks X: 1
Number of blocks Y: 1
Shared memory required per block: 8192 bytes
Shared memory required: 8192 bytes
46.500572205 60.121448517 94.641960144 16.583198547 76.388687134 78.664627075 6.791918278 0.000000000 0.000000000 … 0.000000000
59.405567169 28.825101852 10.006287575 81.956733704 44.649925232 18.157581329 6.861516953 0.000000000 0.000000000 … 0.000000000
26464.785156250 26072.335937500 26362.183593750 25327.906250000 25067.820312500 25915.816406250 26507.703125000 25037.628906250 26638.929687500 … garbage
25484.427734375 25544.488281250 24483.269531250 25638.626953125 25338.503906250 25298.460937500 25466.601562500 25857.996093750 26072.941406250 … garbage
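To spell out the shared-memory arithmetic behind the two runs (per block it is BLOCK_SIZE_Y rows x SECTION_SIZE floats x 4 bytes per float):
first run: 4 x 256 x 4 bytes = 4096 bytes
second run: 4 x 512 x 4 bytes = 8192 bytes
Both are well below the 16384 bytes per block reported by the device query.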
My code seems OK, yet the outcome varies and seems to depend on what I ran on the GPU before this kernel; instead of garbage I may get zeros where I shouldn't. For reference, here is my test kernel (it does no computation, it just loads the data into shared memory, copies it back to global memory, and the host prints the result):
__global__ void test_kernel(float *input, float *shMem, unsigned int num_rows, unsigned int num_cols) {
    __shared__ float M[BLOCK_SIZE_Y][SECTION_SIZE];

    // Define shortcuts
    unsigned int bx = blockIdx.x;
    unsigned int tx = threadIdx.x;
    unsigned int by = blockIdx.y;
    unsigned int ty = threadIdx.y;
    unsigned int section_size = blockDim.x * 2;

    /*
     * Load two values into shared memory.
     *
     * Since the section is twice the number of threads, each thread loads the value at
     * index (block index * section size) + its own thread index, plus a second value
     * blockDim.x further along. The first block loads data indices 0 to section size - 1,
     * the second block loads section size to twice the section size - 1, etc.
     */
    unsigned int xlimit = bx * section_size + tx;
    unsigned int ylimit = by * blockDim.y + ty;
    unsigned int idx1 = xlimit + ylimit * num_cols;
    unsigned int idx2 = idx1 + blockDim.x;
    unsigned int in_size = num_rows * num_cols;

    // Load the first value corresponding to this thread and block
    if (idx1 < in_size && xlimit < num_cols)
        M[ty][tx] = input[idx1];
    else
        M[ty][tx] = 0.0f; // pad the tail of an incomplete block with zeros

    // Load the second value, blockDim.x past the first one
    if (idx2 < in_size && (xlimit + blockDim.x) < num_cols)
        M[ty][tx + blockDim.x] = input[idx2];
    else
        M[ty][tx + blockDim.x] = 0.0f; // pad the tail of an incomplete block with zeros

    // Write the shared memory contents back to the global output array so the host can print them
    unsigned int idx3 = xlimit + ylimit * section_size;
    unsigned int idx4 = idx3 + blockDim.x;
    shMem[idx3] = M[ty][tx];
    shMem[idx4] = M[ty][tx + blockDim.x];
}
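Just to make the index math concrete, here is a small host-only snippet that mirrors the kernel's index calculation for the first run above (128 x 4 block, 7 x 4 data); it is an illustration only, not part of my application:

#include <cstdio>

int main() {
    const unsigned int num_rows = 4, num_cols = 7;        // matches the 7-column, 4-row test data
    const unsigned int blockDim_x = 128, blockDim_y = 4;  // matches the first run above
    const unsigned int section_size = 2 * blockDim_x;
    const unsigned int bx = 0, by = 0;                    // single block in each direction
    const unsigned int in_size = num_rows * num_cols;

    for (unsigned int ty = 0; ty < blockDim_y; ++ty)
        for (unsigned int tx = 0; tx < blockDim_x; ++tx) {
            unsigned int xlimit = bx * section_size + tx;
            unsigned int ylimit = by * blockDim_y + ty;
            unsigned int idx1 = xlimit + ylimit * num_cols;
            unsigned int idx2 = idx1 + blockDim_x;
            bool load1 = idx1 < in_size && xlimit < num_cols;
            bool load2 = idx2 < in_size && (xlimit + blockDim_x) < num_cols;
            // Only 7 of the 128 threads per row actually load data; the rest pad with zeros,
            // which matches the 7 values followed by zeros in the printed output.
            if (load1 || load2)
                printf("ty=%u tx=%u -> M[%u][%u] = input[%u]%s\n",
                       ty, tx, ty, tx, idx1, load2 ? " (plus a second load)" : "");
        }
    return 0;
}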
Below is a snippet of the host code:
#define BLOCK_SIZE_X 512
#define BLOCK_SIZE_Y 1
#define SECTION_SIZE (2 * BLOCK_SIZE_X)

void CudaMeanNormalization::test() {
    float *deviceInput;
    bool error = false;
    cudaError_t err;
    unsigned int inSize;    // Size in bytes of the input/output
    inSize = feat_size * data_size * sizeof(float);
    float *input = transpose();

    err = cudaMalloc((void **)&deviceInput, inSize);
    if (checkError(err, __FILE__, __LINE__)) error = true;

    if (!error) {
        unsigned int numBlocks_x = (data_size - 1) / SECTION_SIZE + 1;  // Number of blocks required in the x direction
        unsigned int numBlocks_y = (feat_size - 1) / BLOCK_SIZE_Y + 1;  // Number of blocks required in the y direction
        std::cout << "Number of data points: " << data_size << "\tNumber of features: " << feat_size << std::endl;
        std::cout << "Block size X: " << BLOCK_SIZE_X << "\tBlock size Y: " << BLOCK_SIZE_Y << "\nSection size: " << SECTION_SIZE << std::endl;
        std::cout << "Number of blocks X: " << numBlocks_x << "\nNumber of blocks Y: " << numBlocks_y << std::endl;

        err = cudaMemcpy(deviceInput, input, inSize, cudaMemcpyHostToDevice);
        if (checkError(err, __FILE__, __LINE__)) error = true;

        if (!error) {
            // Initialize the grid and block dimensions here
            dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y, 1);
            dim3 dimGrid(numBlocks_x, numBlocks_y, 1);

            //@@ Launch the GPU Kernel here
            unsigned int sMem_size = (SECTION_SIZE * numBlocks_x) * (BLOCK_SIZE_Y * numBlocks_y);
            float *deviceSMem;
            float *sMem = new float[sMem_size]();
            err = cudaMalloc((void **)&deviceSMem, sMem_size * sizeof(float));
            if (checkError(err, __FILE__, __LINE__)) error = true;
            if (!error) {
                //cudaMemset(deviceSMem, 0, sMem_size * sizeof(float));
                test_kernel<<<dimGrid, dimBlock>>>(deviceInput, deviceSMem, feat_size, data_size);
            } else
                std::cerr << "Cannot allocate device memory for shared memory output." << std::endl;
            cudaThreadSynchronize();

            std::cout << "Shared memory required per block: " << SECTION_SIZE * BLOCK_SIZE_Y * sizeof(float) << " bytes" << std::endl;
            std::cout << "Shared memory required: " << sMem_size * sizeof(float) << " bytes" << std::endl;

            cudaMemcpy(sMem, deviceSMem, sMem_size * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(deviceSMem);

            for (size_t i = 0; i < sMem_size; i++) {
                std::cout << sMem[i] << "\t";
                if (((i + 1) % SECTION_SIZE) == 0)
                    std::cout << std::endl;
            }
            delete[] sMem;      // allocated with new[]
        }
    }
    cudaFree(deviceInput);
    delete[] input;             // assuming transpose() allocates with new[]
}
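(checkError is just a small helper that returns true when a CUDA call failed; roughly this, simplified — in my code it is a class method, but that does not matter here:)

#include <iostream>
#include <cuda_runtime.h>

// Simplified sketch of checkError: report the error and tell the caller something went wrong.
bool checkError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        std::cerr << file << ":" << line << ": CUDA error: "
                  << cudaGetErrorString(err) << std::endl;
        return true;    // the caller sets its error flag
    }
    return false;
}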
The transpose() call provides the data through the input pointer; don't worry about what it does, it simply returns the matrix as a 1D array in row-major order.
I also noticed that when I pick values of BLOCK_SIZE_X and BLOCK_SIZE_Y such that my shared memory requirement is exactly 16 KB, it seems I actually need more than that: compilation aborts with an error saying my requirement is 40 bytes over the available shared memory. Can anyone tell me why?
I did further testing and it really looks like GPU global memory corruption. Is there anything I can do to pin down this problem and determine whether it is a bug in the nVidia driver or CUDA 6.0, or some limitation I am not aware of?
THX