Random memory errors when kernel writes data

I get random memory errors when a kernel writes byte data to memory.

I want to write a function that processes an image (an array of unsigned char pixels) and
writes the result to another image.
The kernel works fine most of the time, and its result matches the one calculated on the CPU.
But once in a while (approximately 1 run out of 10) a few bytes are wrong.

The problem seems similar to the one discussed in another thread: http://forums.nvidia.com/index.php?showforum=71
But I do not use __syncthreads() anywhere.

The kernel is so simple that I cannot see any error in it.

It does not matter what calculation I do in the kernel.
For example:
__global__ void subtract_pixels(const unsigned char *A, const unsigned char *B, int width, int height, int stride, volatile unsigned char *C){
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int offset = y * stride + x;
    int c1;
    // Threads outside the image (partial blocks at the edges) do nothing.
    if (x < width && y < height) {
        c1 = abs((int)A[offset] - (int)B[offset]) / 2;
        C[offset] = c1;
    }
}

The problem seems to be worse when the horizontal size of the thread block is small.
For example, with a 16x16 thread block it happens often,
but with a 64x16 thread block it is rare.
Can I process pixel (byte) data in a CUDA kernel at all?
Does the horizontal size of the thread block have to be aligned to the cache memory chunk size?
My graphics card is a GeForce GTX 460.
Could it simply be defective graphics memory or a defective graphics card?
Can somebody advise what the problem is?

Run Windows Memory Diagnostic.

I do not think this is the right answer.

The problem is specific to CUDA program development.

Anyway, is there a free program to diagnose NVIDIA graphics card memory?



Memory errors are more likely when the GTX 460 is an overclocked or “superclocked” card - some vendors will sell overclocked chips (either GPU or memory or both) that run fine in games, but are more likely to produce errors in compute applications. Buyer beware.


The gaming cards have no ECC. This means that bits can get flipped randomly without being detected. It seems that this problem occurs more often on overclocked cards. If the code is not the problem, you could fix this by decreasing the memory speed or increasing the voltage.
To be really sure, I suggest you run cuda-memcheck or the CUDA debugger with the memory check enabled, to see whether you are calling the function with the correct parameters. Maybe you can post the line in the code where you call the function, along with the values of the parameters.

Just to make sure there is no simple bug:
You always call this one with (stride >= width), right?
If width is indeed the row width of A and B and the rows are not padded, it even has to be (stride == width).
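
A minimal host-side sketch of that sanity check, in plain C++ (which nvcc also compiles). The helper name layout_fits and the parameter bufferBytes are hypothetical and not taken from the code in this thread; bufferBytes stands for whatever byte count was passed when the device buffers were allocated. The sketch only checks that the largest offset the kernel can compute, (height - 1) * stride + (width - 1), still lies inside the buffer.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): returns true when an 8-bit image
// with the given geometry fits inside a buffer of bufferBytes bytes.
bool layout_fits(int width, int height, int stride, std::size_t bufferBytes) {
    // A row must be at least as long as the pixels stored in it.
    if (width <= 0 || height <= 0 || stride < width) return false;
    // Largest offset the kernel can touch: last pixel of the last row.
    std::size_t lastOffset =
        static_cast<std::size_t>(height - 1) * stride + (width - 1);
    return lastOffset < bufferBytes;
}
```

If this returns false for the values actually passed to the kernel, out-of-bounds reads or writes would be a more likely explanation than faulty hardware.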

I checked the video memory with Video Stress Test. It shows no errors.
My graphics card is not overclocked.
GeForce GTX 460
Image size is width=1920 height=1080 stride=width.

Yet, when processing the image, the output is sometimes right and
sometimes a few bytes are wrong when compared with the CPU-computed reference.
The error seems to happen only when the kernel processes bytes (unsigned char).
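
For reference, a comparison like the one described could be sketched as follows. This is an assumption about how such a check might look, not the code actually used here: the names Mismatch, reference_pixel and find_mismatches are hypothetical. It recomputes |A - B| / 2 on the CPU and records the coordinates of every byte where the GPU output differs.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical record of one differing byte (not from the thread).
struct Mismatch { int x, y; unsigned char expected, actual; };

// CPU reference for the same arithmetic as the kernel: |a - b| / 2.
inline unsigned char reference_pixel(unsigned char a, unsigned char b) {
    return static_cast<unsigned char>(std::abs(int(a) - int(b)) / 2);
}

// Compares the GPU result (already copied back to the host) with the
// CPU reference and returns the position of every byte that differs.
std::vector<Mismatch> find_mismatches(const unsigned char* A,
                                      const unsigned char* B,
                                      const unsigned char* gpuC,
                                      int width, int height, int stride) {
    std::vector<Mismatch> bad;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::size_t off = static_cast<std::size_t>(y) * stride + x;
            unsigned char want = reference_pixel(A[off], B[off]);
            if (gpuC[off] != want) bad.push_back({x, y, want, gpuC[off]});
        }
    }
    return bad;
}
```

Logging the returned coordinates over several runs may help narrow things down: errors that keep landing on the same offsets would be consistent with a bad memory cell, while errors that move around point elsewhere.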

I cannot see the source of this problem. Another thing that just came to mind: problems can occur when you use different streams to launch the kernel and to perform the memory copy. Do you use streams?
Also, you might get wrong results when you fail to wait for the CUDA code to finish before comparing your results. Do you use something like cudaDeviceSynchronize()?
Maybe you can provide a snippet of your host code, to check for problems there.

The host code and kernel test are very simple.

Everything is done in one thread.

I do not see any obvious error.



stride = width;
checkCudaErrors( cudaMemcpy(d_A, h_A, sizeY, cudaMemcpyHostToDevice) );
checkCudaErrors( cudaMemcpy(d_B, h_B, sizeY, cudaMemcpyHostToDevice) );

#define BLOCK_SIZE_X 16
#define BLOCK_SIZE_Y 16
dim3 threadsPerBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 numBlocks((width+(threadsPerBlock.x-1)) / threadsPerBlock.x, (height+(threadsPerBlock.y-1)) / threadsPerBlock.y);

// invoke Kernel
subtract_pixels<<<numBlocks, threadsPerBlock>>> ((unsigned char*)d_A, (unsigned char*)d_B, width, height, stride, (unsigned char*)d_C);
getLastCudaError("kernel launch failure");
checkCudaErrors( cudaDeviceSynchronize() );

// Copy result from device memory to host memory
checkCudaErrors( cudaMemcpy(h_C, d_C, sizeY, cudaMemcpyDeviceToHost) );

// compare with result computed on host …
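
As a worked check of the rounded-up grid size in the launch code above, here is a small host-side sketch in plain C++. The names div_up, Coverage and grid_coverage are hypothetical helpers introduced only for illustration. For the 1920x1080 image and 16x16 blocks from this thread it gives 120 x 68 blocks, which cover 1920 x 1088 threads, so the last 8 rows of threads fall outside the image and rely on the `x < width && y < height` guard in the kernel.

```cpp
// Hypothetical helpers (not from the thread) mirroring the numBlocks formula.

// Integer division rounded up, same as (n + d - 1) / d in the launch code.
constexpr int div_up(int n, int d) { return (n + d - 1) / d; }

struct Coverage {
    int blocksX, blocksY;   // grid dimensions
    int threadsX, threadsY; // total threads launched along each axis
};

// Computes the grid size and how many threads it launches per axis.
constexpr Coverage grid_coverage(int width, int height, int blockX, int blockY) {
    return Coverage{div_up(width, blockX), div_up(height, blockY),
                    div_up(width, blockX) * blockX,
                    div_up(height, blockY) * blockY};
}
```

With 64x16 blocks the same image needs 30 x 68 blocks and launches the same 1920 x 1088 threads, so the number of guarded-out threads is identical for both block shapes mentioned in the thread.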

To verify, I tested on another machine and with other graphics cards.

When tested on another machine or with a different graphics card, this error does not happen.

It must be defective graphics card memory after all.

Thanks all.