Ok, what am I doing wrong here? Some kind of overwrite?

Ok, using the template example, it works no problem at all. However, I wanted more than just one value in my equation. So, I added, after the equivalent SDATA statements…

  #define SDATA2( index)   CUT_BANK_CHECKER( sdata2, index)

  ...

  extern __shared__ float sdata2[];

  ...

  SDATA2( tid) = g_idata2[ tid];

Where g_idata2 is passed as an argument, and holds twice the value of g_idata.

Everything else in the code is untouched, including the mathematics. I assumed SDATA would be unaffected by this addition, but when I run the program it is affected, and returns values at twice what they should be, as if I had used the value of SDATA2 instead.

If I comment out…

  SDATA2( tid) = g_idata2[ tid];

Then it runs normally. I assume that somehow the value for SDATA is being overwritten by the value of SDATA2. How do I go about preventing this? I haven’t got my head around all the concepts of the GPU memory model yet.

Thanks,

Mike :thumbup:

All definitions of type

extern __shared__

will point to the same memory, i.e. if you declare

extern __shared__ float a;

extern __shared__ int b;

then both a and b will point to exactly the same memory location. This is documented somewhere in the Programming Guide.
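To make the aliasing concrete, here is a minimal sketch (not the actual template code, just an illustration) of how two extern __shared__ declarations end up writing to the same place:

  extern __shared__ float sdata[];
  extern __shared__ float sdata2[];   // same address as sdata

  __global__ void aliasDemo( float* g_idata, float* g_idata2)
  {
      const unsigned int tid = threadIdx.x;
      sdata[ tid]  = g_idata[ tid];
      sdata2[ tid] = g_idata2[ tid];   // overwrites sdata[ tid], because sdata2 aliases sdata
  }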

Thank you very much for that AndreiB. That gives me some insight into what is going on. First off, I’ll try to just take out the externs and see what happens.

What exactly is the purpose of SDATA in the template example? Just taking out the extern declaration leads to an error.

More directly, you could write…

  g_odata[tid] = g_idata[tid] * (float) num_threads;

Which does not use SDATA at all. Why would you not do this? I’m sure there must be a good reason.

Thanks

It’s about performance. In this particular kernel the difference may not be very big, but in general it is recommended to load the data you’re going to work with into shared memory, because it’s much faster than global memory.

You can’t just remove “extern”. If you do, you’ll need to specify the array size, and it should work then.
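For example (just a sketch, assuming a block of 256 threads):

  // statically sized: no extern needed and no size argument at launch
  __shared__ float sdata[256];

  // or, dynamically sized: the size is supplied as the third <<< >>> launch argument
  extern __shared__ float sdata[];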

If you are only touching the data once (for example, if you want to add two vectors) and the access pattern gives you good memory transfers ( coalesced ), there is no need to use shared memory.

I suppose I had better explain what I am trying to do. Basically, I want the kernel to take the values x_in, y_in, z_in, do some computation and then write out to x_out, y_out, z_out. However, I wish to use shared memory in the following way…

A_IN( tid) = x_in[ tid]; // copy to shared memory
B_IN( tid) = y_in[ tid];
C_IN( tid) = z_in[ tid];

A_OUT( tid) = R11*A_IN( tid) + R12*B_IN( tid) + R13*C_IN( tid); // perform computation
B_OUT( tid) = R21*A_IN( tid) + R22*B_IN( tid) + R23*C_IN( tid);
C_OUT( tid) = R31*A_IN( tid) + R32*B_IN( tid) + R33*C_IN( tid);

//where R** is a floating point number

x_out[ tid] = A_OUT( tid); //copy back to global
y_out[ tid] = B_OUT( tid);
z_out[ tid] = C_OUT( tid);

So, basically, I need six variables in shared memory in order to be able to complete this. How would I create these within the kernel? They will all be the same size as each other. Obviously I can create A_IN using…

extern __shared__ float a_in[];

But how do I create the other five so that they are the same size but do not overwrite the values of A_IN? Sorry for being such an amateur at the moment.

Many thanks,

Mike

Do you know the size of your arrays at compile time? If you do, then you can simply declare __shared__ float a_in[A_SIZE];

Unfortunately not. The size of the array is determined by an external data file at runtime.

Then you can index one big array with an array-specific offset, i.e. to access a_in[10] you simply access S_DATA(10), to access b_in[7] you access S_DATA(B_IN_OFFSET + 7), etc. You’ll need to calculate these offsets at runtime.
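As a rough sketch (the macro names and the per-array length n are just placeholders for whatever you use):

  extern __shared__ float s_data[];   // one dynamic shared block holding all six arrays

  #define A_IN( index)   s_data[ (index)]           // first n floats
  #define B_IN( index)   s_data[ n + (index)]       // next n floats
  #define C_IN( index)   s_data[ 2 * n + (index)]   // and so on for A_OUT, B_OUT, C_OUT

  // where n = elements per array (e.g. blockDim.x), known at runtime,
  // and the launch supplies 6 * n * sizeof(float) bytes of shared memory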

Mike,

The amount of dynamic shared memory to be allocated needs to be passed as an argument in your execution configuration, in addition to the grid and block sizes. This is an optional argument. Check out the manual under the heading “execution configuration”.

So, if you need, let’s say, 7 dynamic floats in shared memory, then you would call your kernel like this:

mykernel <<< grid, block, 7*sizeof(float) >>> (arguments)

In your kernel, you would access them as an external array (not a pointer).

There’s a difference in the way the compiler generates code for an array and a pointer, so be careful. Just declare your extern variable as “extern __shared__ float input[]”.
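On the kernel side that would look something like this (just a sketch; the names mykernel, input and g_data are only examples):

  __global__ void mykernel( float* g_data)
  {
      // sized by the third execution-configuration argument, 7*sizeof(float) above
      extern __shared__ float input[];

      if (threadIdx.x < 7)
          input[ threadIdx.x] = g_data[ threadIdx.x];   // accessed as an array, not through a pointer
  }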

Hope that helped.

Thank you AndreiB. I was thinking something along those lines, but not sure if it would work. I’ll give it a go.

Ok, I’ve got that all working now and it is great. Just a quick question. How big is the shared memory? My program seems perfectly happy when there are a few hundred elements (single precision) in the x, y and z arrays but 40,000 is not so good. I get the following error in Debug…

First-chance exception at 0x7c812a5b in cppIntegration.exe: Microsoft C++ exception: cudaError at memory location 0x0012f6ec…

For reference, I am using a GeForce 8600m GT with 512 MB dedicated video RAM. Thanks

16 KB per multiprocessor; it is in the programming guide and also in the output of deviceQuery.

Pants, this code is supposed to handle in the region of a million particles.

What I need to do is create a temporary holding array within the kernel that is dynamic in size but is not shared memory. Is this possible at all?

But why use shared memory at all? In your example code I see no interaction between particles, so you can just keep your data in registers like this:

float A_IN = x_in[ tid]; // copy to a register

float B_IN = y_in[ tid];

float C_IN = z_in[ tid];

float A_OUT = R11*A_IN + R12*B_IN + R13*C_IN;

float B_OUT = R21*A_IN + R22*B_IN + R23*C_IN;

float C_OUT = R31*A_IN + R32*B_IN + R33*C_IN;

//where R** is a floating point number

x_out[ tid] = A_OUT;

y_out[ tid] = B_OUT;

z_out[ tid] = C_OUT;

Ok, I see now, you would only use shared memory where you are likely to be indexing from one thread to another.

I converted the kernel to the proposed format, and it works very nicely, thank you. However, when I put 40,000 particles through, I get the same error as before. Any ideas what might be causing that? I assumed it was shared memory, but it looks like that is not the case.

The function is described as “shared” as given in the template example. Basically, this is just a modification of the template example.

Cheers,

Mike

Could you post your kernel code and the call with which you launch it?
One thing I can think of that could still be going wrong is having mem_size still included in the kernel call.

Ok, here goes, it is a bit messy at the moment…

The call…

// setup execution parameters

    dim3  grid( 1, 1, 1);

    dim3  threads( num_threads, 1);

    

    // execute the kernel

    testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,

 R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,

 T111, T112, T122, T116, T126, T166, T133, T134, T144,

 T211, T212, T222, T216, T226, T266, T233, T234, T244,

 T313, T314, T323, T324, T336, T346,

 T413, T414, T423, T424, T436, T446,

 T511, T512, T522, T516, T526, T566, T533, T534, T544);

The kernel…

#ifndef _TEMPLATE_KERNEL_H_

#define _TEMPLATE_KERNEL_H_

#include <stdio.h>

__global__ void testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,

 float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,

 float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,

 float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,

 float T313, float T314, float T323, float T324, float T336, float T346,

 float T413, float T414, float T423, float T424, float T436, float T446,

 float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544) 

{

// access thread id

  const unsigned int tid = threadIdx.x;

//copy host memory to device

  

  float XDATA=x[tid];

  

  float PXDATA=px[tid];

  

  float YDATA=y[tid];

  

  float PYDATA=py[tid];

  

  float TDATA=t[tid];

  

  float DPDATA=dp[tid];

  

  __syncthreads();

  

 // perform some computations

  

  float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA

   + T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA

    + T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;

    

    

  

  float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA

   + T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA

    + T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;

    

    

  

  float _YDATA = R33*YDATA + R34*PYDATA

   + T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;

   

   

  

  float _PYDATA = R43*YDATA + R44*PYDATA

   + T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;

   

   

  

  float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA

   + T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA

    + T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;

    

  

  

   

  __syncthreads();

  

  

	

	x[tid] = _XDATA;

	px[tid] = _PXDATA;

	y[tid] = _YDATA;

	py[tid] = _PYDATA;

	t[tid] = _TDATA;

  

	

  // write data to global memory

  

  

}

#endif // #ifndef _TEMPLATE_KERNEL_H_

Thanks for all the help guys. I wonder if I have the parameters for ‘grid’ or ‘threads’ wrong?

I would do it as below.

Observations :

  • you probably tried to have 40000 threads in a block. The maximum is 512 (or less, depending on how many registers your kernel uses; add --ptxas-options=-v to your nvcc command line to find out how many registers your kernel uses)

  • You do not need syncthreads, as you have no communication between threads.

  • In a kernel you load global memory, not host memory.

  • If R* and T* are constant (and used in other kernels) it is probably better to use constant memory for them; see the sketch just below this list.
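A rough sketch of that constant-memory idea (the array names and the 6x6 / 6x6x6 shapes are only an assumption about how your R and T coefficients are organised):

  // coefficients kept in constant memory instead of ~60 kernel arguments
  __constant__ float d_R[6][6];
  __constant__ float d_T[6][6][6];

  // host side: copy the coefficients once before launching any kernel that uses them
  // (h_R and h_T are host arrays of the same shape)
  cudaMemcpyToSymbol( d_R, h_R, sizeof(d_R));
  cudaMemcpyToSymbol( d_T, h_T, sizeof(d_T));

  // inside a kernel, a coefficient such as R11 is then read as d_R[0][0]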

// setup execution parameters

   num_threads = 256; (you should check with the occupancy calculator what is the best number here)

    num_blocks = num_particles / num_threads; (make sure that num_particles = N*num_threads, otherwise you overwrite memory; see the note after the kernel code)

   dim3  grid( num_blocks, 1, 1);

    dim3  threads( num_threads, 1);

    

    // execute the kernel

    testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,

 R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,

 T111, T112, T122, T116, T126, T166, T133, T134, T144,

 T211, T212, T222, T216, T226, T266, T233, T234, T244,

 T313, T314, T323, T324, T336, T346,

 T413, T414, T423, T424, T436, T446,

 T511, T512, T522, T516, T526, T566, T533, T534, T544);

The kernel…

#ifndef _TEMPLATE_KERNEL_H_

#define _TEMPLATE_KERNEL_H_

#include <stdio.h>

__global__ void testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,

 float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,

 float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,

 float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,

 float T313, float T314, float T323, float T324, float T336, float T346,

 float T413, float T414, float T423, float T424, float T436, float T446,

 float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544) 

{

// access index

  const unsigned int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

//copy device memory to registers

  

  float XDATA=x[tid];

  

  float PXDATA=px[tid];

  

  float YDATA=y[tid];

  

  float PYDATA=py[tid];

  

  float TDATA=t[tid];

  float DPDATA=dp[tid];

  // perform some computations

  

  float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA

   + T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA

    + T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;

    

    

  

  float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA

   + T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA

    + T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;

    

    

  

  float _YDATA = R33*YDATA + R34*PYDATA

   + T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;

   

   

  

  float _PYDATA = R43*YDATA + R44*PYDATA

   + T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;

   

   

  

  float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA

   + T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA

    + T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;

  

	x[tid] = _XDATA;

	px[tid] = _PXDATA;

	y[tid] = _YDATA;

	py[tid] = _PYDATA;

	t[tid] = _TDATA;

}

#endif // #ifndef _TEMPLATE_KERNEL_H_
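One more note on the block count above: if num_particles is not an exact multiple of num_threads, an alternative (just a sketch of one way to handle the remainder) is to round the number of blocks up and add a bounds check in the kernel:

  // host side: round up so every particle gets a thread
  num_blocks = (num_particles + num_threads - 1) / num_threads;

  // kernel side: pass num_particles in as an extra argument and let
  // threads past the end of the arrays exit early
  const unsigned int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
  if (tid >= num_particles)
      return;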