cuda bug?

I tested the following code on a GeForce 8800 GT with CUDA 1.1. When n >= 2^16, the result is wrong: the sum is always 0.
When I remove the line "for (int p = 0; p < n; p++)" from GPU_kernel, the result is correct.
Why?

GPU_kernel.cu:

#ifndef GPU_KERNEL_H
#define GPU_KERNEL_H

// includes, system
#include <math.h>

// includes, project
#include <cutil.h>

__global__ void
GPU_kernel(int n, float *d_array)
{
    // each thread fills its own contiguous chunk of n / THREAD_SIZE elements
    int start = threadIdx.x * (n / THREAD_SIZE);
    int end = (threadIdx.x + 1) * (n / THREAD_SIZE);
    for (int p = 0; p < n; p++)
        for (int i = start; i < end; i++)
            d_array[i] = 1.0f;
    __syncthreads();
}

#endif // GPU_KERNEL_H

main.cu:

// size
#define THREAD_SIZE 32
#define BLOCK_SIZE 1

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// includes, project
#include <cutil.h>

// includes, kernels
#include <GPU_kernel.cu>

int main(int argc, char **argv)
{
    int n;
    n = 1 << atoi(argv[1]);
    printf("n = %d\n", n);

    float *h_part;
    h_part = (float *)malloc(n * sizeof(float));
    assert(h_part);

    float *d_part;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_part, n * sizeof(float)));

    dim3 threads(THREAD_SIZE, 1);
    dim3 grid(BLOCK_SIZE, 1);

    // ************************************************************************************
    GPU_kernel<<< grid, threads >>>(n, d_part);
    // ************************************************************************************

    CUDA_SAFE_CALL(cudaThreadSynchronize());

    // 2.4 check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // 3. copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(h_part, d_part, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_SAFE_CALL(cudaThreadSynchronize());

    // 4. compute sum
    printf("Get the sum ");
    float sum = 0.f;
    for (int i = 0; i < n; i++) {
        sum += h_part[i];
    }
    printf(" %f\n", sum);

    CUDA_SAFE_CALL(cudaFree(d_part));
    free(h_part);

    printf("End execution on GPU\n");

    return 0;
}

Wow, it crashes my front buffer!
After running your code,
my screen shows a lot of randomly positioned coloured points!

also with:
for (int p = 0; p < n; p++)
{
    for (int i = start; i < end; i++)
        d_array[i] = 1.0f;
    __syncthreads();
}

@gmtan,
Does this reproduce with the CUDA 2.0 beta?
Which OS are you using?

Yes, it does with 2.0.

The OS is Red Hat Enterprise Linux 4

The same problem occurs on Windows XP.

So… what’s the problem? A bug in my code, or in CUDA?
It’s making me despair.

Hmm. I’m not sure, but some processors use special counter registers for things like for loops. I have no idea if this is how CUDA-enabled GPUs handle it, or if they use general-purpose registers for loop counters. Try this:

__global__ void
GPU_kernel(int n, float *d_array)
{
    int start = threadIdx.x * (n / THREAD_SIZE);
    int end = (threadIdx.x + 1) * (n / THREAD_SIZE);
    // same work as before, but with the outer for loop rewritten as a while loop
    int p = 0;
    while (p < n)
    {
        for (int i = start; i < end; i++)
            d_array[i] = 1.0f;
        p++;
    }
    __syncthreads();
}

I don’t think that is the problem.
I don’t know why, but I suspect that too many sequential writes
to global memory crash it…
It is not deterministic behaviour; it is quite random.

I tried with n = 1 << 17
and
for (int p = 0; p < 5000; p++)
and it sometimes crashes, sometimes not…
If I use a number higher than 5000, the probability of a crash gets higher…
With n = 1 << 18 I have to use a number lower than 2000 (or it crashes every time)…

With n = 2^17,
every thread writes 4096 (the inner loop) * 5000 (for example) -> 20,480,000 values.

With n = 2^18,
every thread writes 8192 (inner loop) * 1900 (for example) -> 15,564,800 values.

The two examples are similar…
The second example has a lower total number of writes, but because the inner loop is longer, the global writes are “closer” together (and it crashes sooner).

Indeed, without the for (int p = 0; p < n; p++) loop
it runs with n = 2^25… but it crashes with n = 2^26…
With n = 2^26,
every thread writes 2,097,152 values, but in a single loop (so all writes are sequentially “closer”).
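
(A quick sanity check of those numbers, assuming THREAD_SIZE = 32 as in the posted code; this little host-side program is only for illustration:)

#include <stdio.h>

/* per-thread write count = inner-loop length (n / THREAD_SIZE) * outer-loop iterations */
int main(void)
{
    long long n = 1LL << 17;      /* array size */
    long long outer = 5000;       /* p-loop iterations */
    long long chunk = n / 32;     /* elements written per inner loop = 4096 */
    printf("%lld * %lld = %lld writes per thread\n", chunk, outer, chunk * outer);
    return 0;                     /* prints 4096 * 5000 = 20480000 */
}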

OK, my demonstration is very crude and empirical…
but it’s a mystery!

I ran the code in your original post several times (after fixing the compile errors) on a Tesla D870 / CUDA 2.0 beta / linux x86_64 and had no problems.

$ ./write_test 18
n = 262144
Get the sum  262144.000000
End execution on GPU

$ ./write_test 18
n = 262144
Get the sum  262144.000000
End execution on GPU

Are you running this code in an X session? It runs for much longer than 5 seconds and will trigger the watchdog timer, causing the results to be bogus. If you compile in debug mode, CUT_CHECK_ERROR should report “the kernel has timed out and terminated”, but sometimes the CUT_CHECK_ERROR in cutil.h fails to work properly. You can check for errors after the kernel call yourself by calling cudaThreadSynchronize and then checking the value of cudaGetLastError.
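
Something like this (just a minimal sketch of the manual check described above, reusing the launch from main.cu; the message text comes from cudaGetErrorString):

// manual error check after the kernel launch
GPU_kernel<<< grid, threads >>>(n, d_part);
cudaThreadSynchronize();                      // block until the kernel has finished
cudaError_t err = cudaGetLastError();         // pick up any launch or runtime error
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));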

P.S. I hope this is just a toy example and not an attempt to initialize memory efficiently… You launch only one block with a small thread count, and the memory writes are not coalesced.
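
(For what it’s worth, a rough sketch of how the initialization could be written with coalesced writes; the kernel name and launch configuration below are just illustrative, not code from this thread:)

// each thread strides by the total thread count, so consecutive threads
// in a warp write consecutive addresses and the writes coalesce
__global__ void init_coalesced(int n, float *d_array)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        d_array[i] = 1.0f;
}

// launch with many blocks, e.g.
// init_coalesced<<< 128, 256 >>>(n, d_part);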