Ok, what am I doing wrong here? Some kind of overwrite?

Thank you DenisR, this is very useful information.

Just one quick query: __mul24( , ) takes two arguments, not one. Would

__mul24(blockIdx.x * blockDim.x , 1)

be the correct way to define it?

This program will be used to track particles in a particle accelerator simulation. We intend to produce a research paper for the European Particle Accelerator Conference 2008, which takes place in June. Both DenisR and AndreiB have been particularly helpful, and I would like to acknowledge you in the paper. If you would like that, please let me know your names so that I can include them.

Thanks ever so much,

Mike

I guess it is __mul24( blockIdx.x, blockDim.x), but you can replace it with blockIdx.x * blockDim.x.
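
For the thread index that just means the usual pattern:

  // global thread index: block offset plus thread offset within the block
  const unsigned int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

  // equivalent, letting the compiler pick the multiply:
  // const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

__mul24 is only a hint that a 24-bit multiply is sufficient, which it is here because block counts and block sizes are far below 2^24.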

And one more thought on performance: it may run faster if you put the calculation of _XDATA, _PXDATA, etc. inside a loop and perform several iterations per kernel launch. The benefit is that you don't need to store the results in slow global memory after each iteration, which can boost performance quite a bit.
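
Roughly like this, just a sketch with a 2x2 linear map so it fits in a post (trackKernel, the coefficient array R and its four-coefficients-per-iteration layout are made up for illustration; your real kernel would keep all six coordinates and the T terms):

  __global__ void trackKernel(float* x, float* px, const float* R, int nIter)
  {
      const unsigned int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

      float X  = x[tid];     // one global read per coordinate
      float PX = px[tid];

      for (int i = 0; i < nIter; ++i)
      {
          const float* Ri = R + 4 * i;           // coefficients of iteration i: R11, R12, R21, R22
          float Xn  = Ri[0] * X + Ri[1] * PX;    // new x
          float PXn = Ri[2] * X + Ri[3] * PX;    // new px
          X  = Xn;
          PX = PXn;
      }

      x[tid]  = X;           // one global write per coordinate, at the very end
      px[tid] = PX;
  }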

Ok, I’m nearly there, just one last complication. For particles 256 to 40000, everything is OK. However, the first 256 particles show #QNAN for all six values. I believe it is some kind of offset, and I am trying to trace it by systematically offsetting values. If anyone has a potential solution, it would be much appreciated.

Thanks,

Mike

This means there’s something wrong with your kernel.
Please attach relevant files so we can take a closer look.

Yes, post your code again. It looks like you have some value offset by one which is being multiplied by your number of threads per block, if I had to guess.

#ifndef _TEMPLATE_KERNEL_H_
#define _TEMPLATE_KERNEL_H_

#include <stdio.h>

__global__ void testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,
  float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,
  float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,
  float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,
  float T313, float T314, float T323, float T324, float T336, float T346,
  float T413, float T414, float T423, float T424, float T436, float T446,
  float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544)
{
  // access thread id
  const unsigned int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

  // read input data from global memory
  float XDATA  = x[tid];
  float PXDATA = px[tid];
  float YDATA  = y[tid];
  float PYDATA = py[tid];
  float TDATA  = t[tid];
  float DPDATA = dp[tid];

  // perform some computations
  float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA
    + T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA
    + T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;

  float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA
    + T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA
    + T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;

  float _YDATA = R33*YDATA + R34*PYDATA
    + T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;

  float _PYDATA = R43*YDATA + R44*PYDATA
    + T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;

  float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA
    + T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA
    + T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;

  // write data to global memory
  x[tid]  = _XDATA;
  px[tid] = _PXDATA;
  y[tid]  = _YDATA;
  py[tid] = _PYDATA;
  t[tid]  = _TDATA;
}

#endif // #ifndef _TEMPLATE_KERNEL_H_

…and the kernel call…

    // setup execution parameters
    dim3 grid( num_blocks, 1, 1);
    dim3 threads( num_threads, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,
      R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,
      T111, T112, T122, T116, T126, T166, T133, T134, T144,
      T211, T212, T222, T216, T226, T266, T233, T234, T244,
      T313, T314, T323, T324, T336, T346,
      T413, T414, T423, T424, T436, T446,
      T511, T512, T522, T516, T526, T566, T533, T534, T544);

I have also changed the C++ code so that the number of particles is padded to a multiple of 256. Basically, if I change the number of threads per block to a different number, then it is that many particles that have a value of #QNAN. I guess that means only the first ‘block’ is affected.
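
(The padding is along these lines; the names here are illustrative rather than the actual variables in my code:

    const int num_threads   = 256;      // threads per block
    const int num_particles = 40000;    // real particles
    const int padded_count  = ((num_particles + num_threads - 1) / num_threads) * num_threads;
    const int num_blocks    = padded_count / num_threads;

    dim3 grid( num_blocks, 1, 1);
    dim3 threads( num_threads, 1, 1);

so the dummy particles just fill up the last block.)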

Curiously, it affects the value of ‘dp’ first; look at this output…

6.29947e-004 -1.74322e-004 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
5.87786e-004 -3.20012e-004 0.00000e+000 0.00000e+000 -5.37863e-009 1.#QNANe+000
1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000
1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000

As you can see, the dp column is affected first, and then this spreads to the rest. Basically, this kernel acts upon the particles many times, but with different values of R and T each time.

dp is not changed in your kernel (you never write to it), so the bug must be in other code…

Ahh, I see. If I were to use fast shared memory, and then split the operation into a loop, that would minimize global memory usage.

Well, back to the program. DenisR, you are right, it is not the kernel causing this. I printed the value of dp to the screen before and after the kernel and it was fine in both cases. Somewhere in the main code it is getting corrupted.

Curiously, if I run the same code with the same input data two or more times, the #QNAN appears in a different place each time. It seems to be randomly determined, which is very odd behaviour since no part of my code uses a random number generator. Any ideas what this could be?

EDIT: thought it might have been time related, but it appears to be completely random.

We’re almost clueless without full source code. Can you post it here?

UPD:
5 sec limit… You can detect whether you’re hitting it by checking the return value of cudaThreadSynchronize() after the kernel invocation.
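
Something along these lines right after the launch (cudaThreadSynchronize blocks until the kernel has finished, so any launch failure, including the watchdog timeout, shows up in its return value):

    // right after the testKernel<<< grid, threads >>>(...) call shown above:
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));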

UPD2: Ah, you’re editing messages too fast =) 5 sec is not an issue, yes? :)

I’m afraid the full source code is a huge, monstrous mess, too big to sensibly post on the forums. It seems the problem now lies in the C++ of the original code rather than in the CUDA, or at least in its interaction with CUDA. I’ll play with it a bit and see what I can deduce.

Thanks for the help.

No, the 5 second limit did not seem to be the issue. It seems completely random as to where it fails.