Code that does nothing

Hi all,

It seems that my code does nothing, and I don’t know what’s wrong with it though…

    int Base = threadIdx.x;
    int End = Base + elementN;
    float sum = 5.0f;

    for(int i = Base; i < End; i += blockDim.x){
        sum += d_data_A[i] * d_data_B[i];
    }

    d_data_C[0] = sum;


When this kernel function returns, the result is 5.0, so the variable sum stays unchanged.

Can anyone tell me what I am doing wrong, please?

In case it helps, this is the call:

    dim3 grid(1);
    dim3 threads(elementN);

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    scalarProdGPU<<<grid, threads>>>(d_data_C, d_data_A, d_data_B, elementN);
    CUDA_SAFE_CALL( cudaThreadSynchronize() );

Thanks in advance

What is the value of elementN? It may be greater than the maximum block size.
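One way to check this, as a rough sketch: query the device limit and look at the launch error right after the kernel call. The empty kernel `dummyKernel` and the value 4096 are made up for illustration, and `cudaDeviceSynchronize` is the newer replacement for `cudaThreadSynchronize`, so this assumes a reasonably recent toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to demonstrate the launch check.
__global__ void dummyKernel() {}

int main() {
    int elementN = 4096;  // assumed value, deliberately large

    // Ask device 0 how many threads a single block may contain.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsPerBlock = %d, requested = %d\n",
           prop.maxThreadsPerBlock, elementN);

    // One block of elementN threads, as in the original call.
    dummyKernel<<<1, elementN>>>();

    // An invalid launch configuration is reported here, not by the launch itself.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }

    cudaDeviceSynchronize();
    return 0;
}
```

If the requested size is above the limit, the kernel body never runs and the output buffer keeps whatever it held before, which can look like "code that does nothing".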

Base is 2, End is 5 and blockDim.x is 3, so this loop is executed (apparently) once.

But what I don’t understand is this: since I have to execute the loop elementN times (3 times), how come it is executed only once?
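A possible explanation: every thread runs its own copy of the loop. With blockDim.x equal to elementN, thread 2 starts at i = 2 and the next step would be i = 5, which fails i < 5, so each thread does exactly one iteration. The three iterations are spread over the three threads, and then all of them write d_data_C[0], so only one thread's value survives. Below is a minimal sketch of a single-block scalar product that combines the per-thread sums in shared memory. The block size `THREADS`, the test data, and the kernel body are assumptions; only the argument order is taken from the call above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128  // assumed block size; must be a power of two for this reduction

// Sketch of a single-block scalar product (not the original poster's kernel).
__global__ void scalarProdGPU(float *d_C, const float *d_A, const float *d_B, int elementN) {
    __shared__ float partial[THREADS];

    // Each thread accumulates a strided slice: i = tid, tid + blockDim.x, ...
    // The bound is elementN itself, so no thread reads past the arrays.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < elementN; i += blockDim.x)
        sum += d_A[i] * d_B[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Only one thread writes the final result.
    if (threadIdx.x == 0)
        d_C[0] = partial[0];
}

int main() {
    const int elementN = 1000;  // assumed size
    float h_A[elementN], h_B[elementN], h_C = 0.0f;
    for (int i = 0; i < elementN; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, elementN * sizeof(float));
    cudaMalloc(&d_B, elementN * sizeof(float));
    cudaMalloc(&d_C, sizeof(float));
    cudaMemcpy(d_A, h_A, elementN * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, elementN * sizeof(float), cudaMemcpyHostToDevice);

    scalarProdGPU<<<1, THREADS>>>(d_C, d_A, d_B, elementN);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // The copy back waits for the kernel to finish.
    cudaMemcpy(&h_C, d_C, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f\n", h_C);  // each product is 2.0, so the sum should be 2000

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

Because the block size is fixed and the loop strides over the data, elementN no longer has to fit within the maximum block size.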

It seems CUDA dislikes out-of-bound access to device memory.

Moreover, strange things happen when you try to use data read from memory cells that are subject to collisions. In my case (I’m writing a raytracing algorithm), such conditions arise often.
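To illustrate what such a collision can look like, here is a small sketch in which many threads update the same cell, once with a plain read-modify-write and once with `atomicAdd`. The kernel names and the launch shape are invented for the example, and it assumes a device that supports integer atomics on global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread adds 1 with a plain read-modify-write: updates can be lost.
__global__ void racyCount(int *counter) { *counter = *counter + 1; }

// Same update through atomicAdd, which serializes the colliding accesses.
__global__ void atomicCount(int *counter) { atomicAdd(counter, 1); }

int main() {
    int *d_counter, h = 0;
    cudaMalloc(&d_counter, sizeof(int));

    // 4 blocks of 256 threads = 1024 increments in total.
    cudaMemset(d_counter, 0, sizeof(int));
    racyCount<<<4, 256>>>(d_counter);
    cudaMemcpy(&h, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy:   %d (unspecified, usually well below 1024)\n", h);

    cudaMemset(d_counter, 0, sizeof(int));
    atomicCount<<<4, 256>>>(d_counter);
    cudaMemcpy(&h, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic: %d\n", h);  // all 1024 increments are counted

    cudaFree(d_counter);
    return 0;
}
```

The racy version is not an error as far as the runtime is concerned, which is part of why such bugs show up only as odd results.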

The worst drawback is that CUDA doesn’t execute my wrong code, thus achieving some excellent speed results!!! :huh: