Hello,
I’ve come across a very simple CUDA kernel which gives results different from what I expect. This kernel is intended to copy memory using float4 loads and stores:
__global__ void testKernel1(float4* g_out, float4* g_in) {
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    g_out[idx] = g_in[idx];
}
The kernel appears to run correctly on my Quadro NVS 290 card. However, on my Tesla C870 card it gives random errors for a fraction of the array elements. These errors occur most readily with large array sizes and under load (many kernel invocations in rapid succession).
I have also tried kernels that read float4s but write floats (no problem) and kernels that read floats but write float4s (exhibits the same error). So I am led to believe that the problem is specific to writing float4 values.
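For clarity, the variant kernels I mean are along these lines (the exact code is in the attachment; the component/index choices here are just illustrative):

```cuda
// Reads float4 values but writes single floats -- this variant shows no errors.
__global__ void readFloat4WriteFloat(float* g_out, const float4* g_in) {
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    g_out[idx] = g_in[idx].x;
}

// Reads single floats but writes float4 values -- this variant exhibits the same error.
__global__ void readFloatWriteFloat4(float4* g_out, const float* g_in) {
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    g_out[idx] = make_float4(g_in[4 * idx],     g_in[4 * idx + 1],
                             g_in[4 * idx + 2], g_in[4 * idx + 3]);
}
```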
Besides this problem, my Tesla C870 card appears to be working fine.
I have attached an example CUDA file which exhibits the errors (at least on my hardware). It can be compiled and run with the commands: [font="Courier"]nvcc float4test.cu; ./a.out[/font].
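In case the attachment doesn't come through, a minimal standalone harness doing the same kind of check would look roughly like this (this is a sketch, not the attached file; the array size and iteration count are illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Copies memory using float4 loads and stores.
__global__ void testKernel1(float4* g_out, float4* g_in) {
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    g_out[idx] = g_in[idx];
}

int main() {
    const int n = 1 << 20;                      // number of float4 elements
    const size_t bytes = n * sizeof(float4);

    float4* h_in  = (float4*)malloc(bytes);
    float4* h_out = (float4*)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_in[i] = make_float4(i, i + 0.25f, i + 0.5f, i + 0.75f);

    float4 *d_in, *d_out;
    cudaMalloc((void**)&d_in,  bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Launch repeatedly, since the errors show up most readily under load.
    const int threads = 256;
    for (int iter = 0; iter < 100; ++iter)
        testKernel1<<<n / threads, threads>>>(d_out, d_in);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Count elements that did not copy correctly.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i].x != h_in[i].x || h_out[i].y != h_in[i].y ||
            h_out[i].z != h_in[i].z || h_out[i].w != h_in[i].w)
            ++errors;
    printf("%d mismatched elements out of %d\n", errors, n);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in);     free(h_out);
    return errors ? 1 : 0;
}
```

On correctly functioning hardware this should report zero mismatches; on my C870 a fraction of the elements come back wrong.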
If you run this test program then please let me know the results.
Cheers,
Kip
(P.S. I'm having some trouble attaching the CUDA file, so I gave it the .txt extension instead of .cu)
float4test.txt (3.04 KB)