Ok, what am I doing wrong here? Some kind of overwrite?

If you are only touching the data once (for example, if you want to add two vectors) and the access pattern gives you good memory transfers (coalesced), there is no need to use shared memory.

I suppose I had better explain what I am trying to do. Basically, I want the kernel to take the values x_in, y_in, z_in, do some computation and then write out to x_out, y_out, z_out. However, I wish to use shared memory in the following way…

A_IN(tid) = x_in[tid]; // copy to shared memory
B_IN(tid) = y_in[tid];
C_IN(tid) = z_in[tid];

A_OUT(tid) = R11*A_IN(tid) + R12*B_IN(tid) + R13*C_IN(tid); // perform computation
B_OUT(tid) = R21*A_IN(tid) + R22*B_IN(tid) + R23*C_IN(tid);
C_OUT(tid) = R31*A_IN(tid) + R32*B_IN(tid) + R33*C_IN(tid);

// where R** is a floating point number

x_out[tid] = A_OUT(tid); // copy back to global
y_out[tid] = B_OUT(tid);
z_out[tid] = C_OUT(tid);

So, basically, I need six variables in shared memory in order to be able to complete this. How would I create these within the kernel? They will all be the same size as each other. Obviously I can create A_IN using…

extern __shared__ float a_in[];

But how do I create the other five to be the same size without overwriting the values of A_IN? Sorry for being such an amateur at the moment.

Many thanks,

Mike

Do you know the size of your arrays at compile time? If you do, then you can simply declare __shared__ float a_in[A_SIZE];
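For instance, here is a minimal sketch assuming a compile-time constant A_SIZE (the kernel name and the value of A_SIZE are placeholders, not from this thread): the six arrays are simply declared side by side, and each gets its own piece of shared memory.

#define A_SIZE 256   // assumed known at compile time

__global__ void transformKernel(const float* x_in, const float* y_in, const float* z_in,
                                float* x_out, float* y_out, float* z_out)
{
    // six statically sized shared arrays, none of which overlap
    __shared__ float a_in[A_SIZE],  b_in[A_SIZE],  c_in[A_SIZE];
    __shared__ float a_out[A_SIZE], b_out[A_SIZE], c_out[A_SIZE];

    const unsigned int tid = threadIdx.x;
    a_in[tid] = x_in[tid];
    b_in[tid] = y_in[tid];
    c_in[tid] = z_in[tid];
    // ... computation into a_out/b_out/c_out and write-back to x_out/y_out/z_out
}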

Unfortunately not. The size of the array is determined by an external data file at runtime.

Then you can index one big array with some array-specific index, i.e. to access a_in[10] you simply access S_DATA(10), to access b_in[7] you access S_DATA(B_IN_OFFSET + 7) etc. Here you’ll need to calculate these offsets at runtime.
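A minimal sketch of that idea (the macro and offset names are placeholders, not AndreiB's exact code): one extern __shared__ buffer is carved into three sub-arrays with offsets computed at runtime from the block size, and the launch supplies 3 * blockDim.x * sizeof(float) bytes of dynamic shared memory.

// one dynamically sized buffer for all three input arrays
extern __shared__ float s_data[];

#define S_DATA(index)  s_data[index]
#define A_IN(index)    S_DATA(index)                     // elements [0, n)
#define B_IN(index)    S_DATA(b_in_offset + (index))     // elements [n, 2n)
#define C_IN(index)    S_DATA(c_in_offset + (index))     // elements [2n, 3n)

__global__ void transformKernelDynamic(const float* x_in, const float* y_in, const float* z_in)
{
    const unsigned int n           = blockDim.x;   // size of each sub-array, known only at runtime
    const unsigned int b_in_offset = n;
    const unsigned int c_in_offset = 2 * n;
    const unsigned int tid         = threadIdx.x;

    A_IN(tid) = x_in[tid];
    B_IN(tid) = y_in[tid];
    C_IN(tid) = z_in[tid];
    __syncthreads();
    // ... computation and write-back as in the pseudocode above
}
// launched as: transformKernelDynamic<<< grid, n, 3 * n * sizeof(float) >>>(d_x, d_y, d_z);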

Mike,

The amount of dynamic shared memory to be allocated needs to be passed as an optional third argument in your execution configuration, alongside the grid and block sizes. Check the manual under the heading “execution configuration”.

So, if you need, let's say, 7 dynamic floats in shared memory, you would call your kernel like this:

mykernel <<< grid, block, 7*sizeof(float) >>> (arguments)

In your kernel, you would access them as an external array (not a pointer)

There's a difference in the way the compiler generates code for an array and a pointer, so be careful. Just declare your extern array as extern __shared__ float input[];
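That is, as a two-line sketch (using the same name input as in the post):

// declare it as the external array itself -- it refers to the dynamic shared allocation
extern __shared__ float input[];

// not as a pointer -- this would just be an uninitialised pointer sitting in shared memory
// __shared__ float* input;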

Hope that helped.

Thank you AndreiB. I was thinking something along those lines, but not sure if it would work. I’ll give it a go.

Ok, I’ve got that all working now and it is great. Just a quick question. How big is the shared memory? My program seems perfectly happy when there are a few hundred elements (single precision) in the x, y and z arrays but 40,000 is not so good. I get the following error in Debug…

First-chance exception at 0x7c812a5b in cppIntegration.exe: Microsoft C++ exception: cudaError at memory location 0x0012f6ec…

For reference, I am using a GeForce 8600m GT with 512 MB dedicated video RAM. Thanks

16 KB per multiprocessor; it is in the programming guide and also in the output of deviceQuery.
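If you would rather query it from code than look it up, here is a small host-side sketch (not from the original posts) using the standard CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("shared memory per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}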

Pants, this code is supposed to handle in the region of a million particles.

What I need to do is create a temporary holding array within the kernel that is dynamic in size, but is not shared memory. Is this possible at all?

But why use shared memory at all? In your example code I see no interaction between particles. So you can just keep your data in registers, like this:

float A_IN = x_in[ tid]; // copy to registers

float B_IN = y_in[ tid];

float C_IN = z_in[ tid];

float A_OUT = R11*A_IN + R12*B_IN + R13*C_IN;

float B_OUT = R21*A_IN + R22*B_IN + R23*C_IN;

float C_OUT = R31*A_IN + R32*B_IN + R33*C_IN;

//where R** is a floating point number

x_out[ tid] = A_OUT;

y_out[ tid] = B_OUT;

z_out[ tid] = C_OUT;

Ok, I see now: you would only use shared memory where one thread is likely to index into data loaded by another.

I converted the kernel to the proposed format, and it works very nicely, thank you. However, when I put 40,000 particles through, I get the same error as before. Any ideas what might be causing that? I assumed it was shared memory, but it looks like that is not the case.

The function is described as “shared” as given in the template example. Basically, this is just a modification of the template example.

Cheers,

Mike

Could you post your kernel code and the call you launch it with?
One thing I can think of that might still be going wrong is that mem_size is still included in the kernel call.

Ok, here goes, it is a bit messy at the moment…

The call…

// setup execution parameters

    dim3  grid( 1, 1, 1);

    dim3  threads( num_threads, 1);

    

    // execute the kernel

    testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,

 R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,

 T111, T112, T122, T116, T126, T166, T133, T134, T144,

 T211, T212, T222, T216, T226, T266, T233, T234, T244,

 T313, T314, T323, T324, T336, T346,

 T413, T414, T423, T424, T436, T446,

 T511, T512, T522, T516, T526, T566, T533, T534, T544);

The kernel…

#ifndef _TEMPLATE_KERNEL_H_

#define _TEMPLATE_KERNEL_H_

#include <stdio.h>

__global__ void testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,
 float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,
 float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,
 float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,
 float T313, float T314, float T323, float T324, float T336, float T346,
 float T413, float T414, float T423, float T424, float T436, float T446,
 float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544)
{
  // access thread id
  const unsigned int tid = threadIdx.x;

  // copy host memory to device
  float XDATA  = x[tid];
  float PXDATA = px[tid];
  float YDATA  = y[tid];
  float PYDATA = py[tid];
  float TDATA  = t[tid];
  float DPDATA = dp[tid];

  __syncthreads();

  // perform some computations
  float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA
   + T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA
   + T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;

  float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA
   + T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA
   + T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;

  float _YDATA = R33*YDATA + R34*PYDATA
   + T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;

  float _PYDATA = R43*YDATA + R44*PYDATA
   + T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;

  float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA
   + T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA
   + T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;

  __syncthreads();

  // write data to global memory
  x[tid]  = _XDATA;
  px[tid] = _PXDATA;
  y[tid]  = _YDATA;
  py[tid] = _PYDATA;
  t[tid]  = _TDATA;
}

#endif // #ifndef _TEMPLATE_KERNEL_H_

Thanks for all the help guys. I wonder if I have the parameters for ‘grid’ or ‘threads’ wrong?

I would do it as below.

Observations:

  • You probably tried to have 40,000 threads in a block. The maximum is 512 (or less, depending on how many registers your kernel uses; add --ptxas-options=-v to your nvcc command line to find out how many your kernel uses).

  • You do not need __syncthreads(), as you have no communication between threads.

  • In a kernel you load global memory, not host memory.

  • If R* and T* are constant (and used in other kernels) it is probably better to use constant memory for them (see the sketch after this list).
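A minimal sketch of that last point (the c_R / c_T names, the helper function and the array sizes are my own placeholders, not from this thread): put the coefficients in __constant__ memory once on the host side instead of passing dozens of float arguments to every launch.

#include <cuda_runtime.h>

// device side: coefficients visible to every kernel in this compilation unit
__constant__ float c_R[15];   // R11, R12, R16, R21, ..., R66
__constant__ float c_T[39];   // T111, T112, ..., T544

// host side: copy the values once, before any kernel launch
void uploadCoefficients(const float h_R[15], const float h_T[39])
{
    cudaMemcpyToSymbol(c_R, h_R, 15 * sizeof(float));
    cudaMemcpyToSymbol(c_T, h_T, 39 * sizeof(float));
}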

// setup execution parameters

   num_threads = 256;                           // check the occupancy calculator for the best number here
   num_blocks  = num_particles / num_threads;   // make sure num_particles = N*num_threads, otherwise you overwrite memory

   dim3  grid( num_blocks, 1, 1);
   dim3  threads( num_threads, 1);

    

    // execute the kernel

    testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,

 R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,

 T111, T112, T122, T116, T126, T166, T133, T134, T144,

 T211, T212, T222, T216, T226, T266, T233, T234, T244,

 T313, T314, T323, T324, T336, T346,

 T413, T414, T423, T424, T436, T446,

 T511, T512, T522, T516, T526, T566, T533, T534, T544);

The kernel…

#ifndef _TEMPLATE_KERNEL_H_

#define _TEMPLATE_KERNEL_H_

#include <stdio.h>

__global__ void testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,
 float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,
 float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,
 float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,
 float T313, float T314, float T323, float T324, float T336, float T346,
 float T413, float T414, float T423, float T424, float T436, float T446,
 float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544)
{
  // access index
  const unsigned int tid = __mul24(blockIdx.x * blockDim.x) + threadIdx.x;

  // copy device memory to registers
  float XDATA  = x[tid];
  float PXDATA = px[tid];
  float YDATA  = y[tid];
  float PYDATA = py[tid];
  float TDATA  = t[tid];
  float DPDATA = dp[tid];

  // perform some computations
  float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA
   + T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA
   + T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;

  float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA
   + T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA
   + T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;

  float _YDATA = R33*YDATA + R34*PYDATA
   + T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;

  float _PYDATA = R43*YDATA + R44*PYDATA
   + T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;

  float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA
   + T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA
   + T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;

  // write data back to global memory
  x[tid]  = _XDATA;
  px[tid] = _PXDATA;
  y[tid]  = _YDATA;
  py[tid] = _PYDATA;
  t[tid]  = _TDATA;
}

#endif // #ifndef _TEMPLATE_KERNEL_H_
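One extra note on the remark that num_particles must be a multiple of num_threads: a common alternative (a hedged sketch, not from DenisR's post; testKernelGuarded and launchKernel are placeholder names) is to round the number of blocks up and guard each thread with a bounds check, so the spare threads in the last block do nothing.

// guarded kernel: threads beyond num_particles return immediately
__global__ void testKernelGuarded( float* x, int num_particles /* , ... the other arguments ... */ )
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_particles)
        return;                      // no loads and no stores for out-of-range threads
    x[tid] = 2.0f * x[tid];          // placeholder for the real computation
}

// host side: round the block count up instead of truncating it
void launchKernel(float* d_x, int num_particles)
{
    const int num_threads = 256;
    const int num_blocks  = (num_particles + num_threads - 1) / num_threads;
    testKernelGuarded<<< num_blocks, num_threads >>>(d_x, num_particles);
}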

Thank you DenisR, this is very useful information.

Just one quick query: __mul24( , ) takes two arguments, not one. Would

__mul24(blockIdx.x * blockDim.x , 1)

be the correct way to define it?

This program will be used to track particles in a particle accelerator simulation. It is our intention to produce a research paper for the European Particle Accelerator Conference 2008. Both DenisR and AndreiB have been particularly helpful. I would like to mention you in the paper for your helpfulness. If you would like that, please let me know your names so that I may include them. The conference will be in June.

Thanks ever so much,

Mike

I guess it is __mul24( blockIdx.x, blockDim.x), but you can replace it with blockIdx.x * blockDim.x.

And one more thought on performance: it may run faster if you put the calculation of _XDATA, _PXDATA, etc. inside a loop and perform several iterations per kernel launch. The benefit is that you don't need to store results in slow global memory after each iteration, and this may boost performance quite a bit.
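A minimal sketch of that suggestion (reduced to two components; n_iterations and the kernel name are placeholders, not from the original code): the state stays in registers across the loop, and global memory is written only once at the end.

__global__ void testKernelIterated( float* x, float* px,
                                    float R11, float R12, float R21, float R22,
                                    int n_iterations )
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load once into registers
    float XDATA  = x[tid];
    float PXDATA = px[tid];

    for (int i = 0; i < n_iterations; ++i)
    {
        // same kind of update as in the full kernel, kept entirely in registers
        float _XDATA  = R11*XDATA + R12*PXDATA;
        float _PXDATA = R21*XDATA + R22*PXDATA;
        XDATA  = _XDATA;
        PXDATA = _PXDATA;
    }

    // single write back to slow global memory after all iterations
    x[tid]  = XDATA;
    px[tid] = PXDATA;
}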

Ok, I'm nearly there, just one last complication. For particles 256 to 40,000, everything is OK. However, the first 256 particles are showing #QNAN for all six values. I believe it is some kind of offset, and I am trying to trace it by systematically offsetting values. If anyone has a potential solution, it would be much appreciated.

Thanks,

Mike

This means there's something wrong with your kernel.
Please attach the relevant files so we can take a closer look.

Yes, post your code again. If I had to guess, it looks like you have some value offset by 1 which gets multiplied by your number of threads per block.