16 kb per multiprocessor, it is in the programming guide and also in the output of devicequery
Pants, this code is supposed to handle in the region of a million particles.
What I need to do is to create a temporary holding array within the kernel, the is dynamic in size, but is not shared memory. Is this possible at all?
But why use shared memory at all? In your example code I see no interaction between particles. So you can just keep your data in a register like this:
float A_IN = x_in[ tid]; // copy to shared memory
float B_IN = y_in[ tid];
float C_IN = z_in[ tid];
float A_OUT = R11*A_IN + R12*B_IN + R13*C_IN;
float B_OUT = R21*A_IN + R22*B_IN + R23*C_IN;
float C_OUT = R31*A_IN + R32*B_IN + R33*C_IN;
//where R** is a floating point number
x_out[ tid] = A_OUT;
y_out[ tid] = B_OUT;
z_out[ tid] = C_OUT;
Ok, I see now, you would only use shared memory where you are likely to be indexing from one thread to another.
I converted the kernel to the proposed format, and it works very nicely thank you. However, when I put 40,000 particles through, I get the same error as before. Any ideas what might be causing that? I assumed shared memory, but it looks like that is not the case.
The function is described as “shared” as given in the template example. Basically, this is just a modification of the template example.
Cheers,
Mike
Could you post your kernelcode and the call with which you call it?
One thing I could think of going wrong still is having mem_size still included in the kernelcall
Ok, here goes, it is a bit messy at the moment…
The call…
// setup execution parameters
dim3 grid( 1, 1, 1);
dim3 threads( num_threads, 1);
// execute the kernel
testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,
R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,
T111, T112, T122, T116, T126, T166, T133, T134, T144,
T211, T212, T222, T216, T226, T266, T233, T234, T244,
T313, T314, T323, T324, T336, T346,
T413, T414, T423, T424, T436, T446,
T511, T512, T522, T516, T526, T566, T533, T534, T544);
The kernel…
#ifndef _TEMPLATE_KERNEL_H_
#define _TEMPLATE_KERNEL_H_
#include <stdio.h>
testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,
float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,
float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,
float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,
float T313, float T314, float T323, float T324, float T336, float T346,
float T413, float T414, float T423, float T424, float T436, float T446,
float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544)
{
// access thread id
const unsigned int tid = threadIdx.x;
//copy host memory to device
float XDATA=x[tid];
float PXDATA=px[tid];
float YDATA=y[tid];
float PYDATA=py[tid];
float TDATA=t[tid];
float DPDATA=dp[tid];
__syncthreads();
// perform some computations
float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA
+ T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA
+ T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;
float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA
+ T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA
+ T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;
float _YDATA = R33*YDATA + R34*PYDATA
+ T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;
float _PYDATA = R43*YDATA + R44*PYDATA
+ T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;
float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA
+ T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA
+ T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;
__syncthreads();
x[tid] = _XDATA;
px[tid] = _PXDATA;
y[tid] = _YDATA;
py[tid] = _PYDATA;
t[tid] = _TDATA;
// write data to global memory
}
#endif // #ifndef _TEMPLATE_KERNEL_H_
Thanks for all the help guys. I wonder if I have the parameters for ‘grid’ or ‘threads’ wrong?
I would do it as below.
Observations :

you probably tried to have 40000 threads in a block. The maximum is 512 (or less, dependend on how many registers your kernel uses. Add ptxasoptions=v to your nvcc command line to find out how many your kernel uses

You do not need syncthreads, as you have no communications between threads.

In a kernel you load global memory, not host memory.

If R* and T* are constant (and used in other kernels) it is probably better to use constant memory for them.
// setup execution parameters
num_threads = 256; (you should check with the occupancy calculator what is the best number here)
num_blocks = num_particles / num_threads; (make sure that num_particles = N*num_threads, otherwise you overwrite memory)
dim3 grid( num_blocks, 1, 1);
dim3 threads( num_threads, 1);
// execute the kernel
testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,
R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,
T111, T112, T122, T116, T126, T166, T133, T134, T144,
T211, T212, T222, T216, T226, T266, T233, T234, T244,
T313, T314, T323, T324, T336, T346,
T413, T414, T423, T424, T436, T446,
T511, T512, T522, T516, T526, T566, T533, T534, T544);
The kernel…
#ifndef _TEMPLATE_KERNEL_H_
#define _TEMPLATE_KERNEL_H_
#include <stdio.h>
testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,
float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,
float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,
float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,
float T313, float T314, float T323, float T324, float T336, float T346,
float T413, float T414, float T423, float T424, float T436, float T446,
float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544)
{
// access index
const unsigned int tid = __mul24(blockIdx.x * blockDim.x) + threadIdx.x;
//copy device memory to registers
float XDATA=x[tid];
float PXDATA=px[tid];
float YDATA=y[tid];
float PYDATA=py[tid];
float TDATA=t[tid];
float DPDATA=dp[tid];
// perform some computations
float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA
+ T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA
+ T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;
float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA
+ T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA
+ T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;
float _YDATA = R33*YDATA + R34*PYDATA
+ T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;
float _PYDATA = R43*YDATA + R44*PYDATA
+ T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;
float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA
+ T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA
+ T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;
x[tid] = _XDATA;
px[tid] = _PXDATA;
y[tid] = _YDATA;
py[tid] = _PYDATA;
t[tid] = _TDATA;
}
#endif // #ifndef _TEMPLATE_KERNEL_H_
Thank you DenisR, this is very useful information.
Just one quick query, __mul24( , ) takes two arguements, not one. would
__mul24(blockIdx.x * blockDim.x , 1)
be the correct way to define it?
This program will be used to track particles in a particle accelerator simulation. It is our intention to produce a research paper for the European Particle Accelerator Conference 2008. Both DenisR and AndreiB have been particularly helpful. I would like to mention you in the paper for your helpfulness. If you would like that, please let me know your names so that I may include them. The conference will be in June.
Thanks ever so much,
Mike
I guess it is __mul24( blockIdx.x, blockDim.x), but you can replace it with blockIdx.x * blockDim.x.
And one more thought on performance. It may run faster if you’ll put calculation of _XDATA, _PXDATA, etc. inside a loop and perform several iterations. Benefit here is that you don’t need to store results in slow global memory after each iteration and this may boost performance quite a bit.
Ok, I’m nearly there, just one last complication. For particles 256 to 40000, everything is OK. However, for the first 256 particles, they are showing #QNAN for all six values. It is some kind of offset I believe, and I am trying to trace it by systematically offsetting values. However, if anyone has a potential solution, this would be much appreciated.
Thanks,
Mike
THis means there’s something wrong with your kernel.
Please attach relevant files so we can take a closer look.
Yes post again your code. Looks like you have some value offset by 1, which is multiplied by your #threads per block if I would guess.
#ifndef _TEMPLATE_KERNEL_H_
#define _TEMPLATE_KERNEL_H_
#include <stdio.h>
__global__ void testKernel( float* x, float* px, float* y, float* py, float* t, float* dp,
float R11, float R12, float R16, float R21, float R22, float R26, float R33, float R34, float R43, float R44, float R51, float R52, float R55, float R56, float R66,
float T111, float T112, float T122, float T116, float T126, float T166, float T133, float T134, float T144,
float T211, float T212, float T222, float T216, float T226, float T266, float T233, float T234, float T244,
float T313, float T314, float T323, float T324, float T336, float T346,
float T413, float T414, float T423, float T424, float T436, float T446,
float T511, float T512, float T522, float T516, float T526, float T566, float T533, float T534, float T544)
{
// access thread id
const unsigned int tid = __mul24(blockIdx.x , blockDim.x) + threadIdx.x;
// read in input data from global memory
//copy host memory to device
float XDATA=x[tid];
float PXDATA=px[tid];
float YDATA=y[tid];
float PYDATA=py[tid];
float TDATA=t[tid];
float DPDATA=dp[tid];
// perform some computations
float _XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA
+ T111*XDATA*XDATA + T112*XDATA*PXDATA + T122*PXDATA*PXDATA + T116*XDATA*DPDATA + T126*PXDATA*DPDATA
+ T166*DPDATA*DPDATA + T133*YDATA*YDATA + T134*YDATA*PYDATA + T144*PYDATA*PYDATA;
float _PXDATA = R21*XDATA + R22*PXDATA + R26*DPDATA
+ T211*XDATA*XDATA + T212*XDATA*PXDATA + T222*PXDATA*PXDATA + T216*XDATA*DPDATA + T226*PXDATA*DPDATA
+ T266*DPDATA*DPDATA + T233*YDATA*YDATA + T234*YDATA*PYDATA + T244*PYDATA*PYDATA;
float _YDATA = R33*YDATA + R34*PYDATA
+ T313*XDATA*YDATA + T314*XDATA*PYDATA + T323*PXDATA*YDATA + T324*PXDATA*PYDATA + T336*YDATA*DPDATA + T346*PYDATA*DPDATA;
float _PYDATA = R43*YDATA + R44*PYDATA
+ T413*XDATA*YDATA + T414*XDATA*PYDATA + T423*PXDATA*YDATA + T424*PXDATA*PYDATA + T436*YDATA*DPDATA + T446*PYDATA*DPDATA;
float _TDATA = R51*XDATA + R52*PXDATA + R55*TDATA + R56*DPDATA
+ T511*XDATA*XDATA + T512*XDATA*PXDATA + T522*PXDATA*PXDATA + T516*XDATA*DPDATA + T526*PXDATA*DPDATA
+ T566*DPDATA*DPDATA + T533*YDATA*YDATA + T534*YDATA*PYDATA + T544*PYDATA*PYDATA;
// write data to global memory
x[tid] = _XDATA;
px[tid] = _PXDATA;
y[tid] = _YDATA;
py[tid] = _PYDATA;
t[tid] = _TDATA;
}
#endif // #ifndef _TEMPLATE_KERNEL_H_
…and the kernel call…
// setup execution parameters
dim3 grid( num_blocks, 1, 1);
dim3 threads( num_threads, 1, 1);
// execute the kernel
testKernel<<< grid, threads >>>( d_i_x, d_i_px, d_i_y, d_i_py, d_i_t, d_i_dp,
R11, R12, R16, R21, R22, R26, R33, R34, R43, R44, R51, R52, R55, R56, R66,
T111, T112, T122, T116, T126, T166, T133, T134, T144,
T211, T212, T222, T216, T226, T266, T233, T234, T244,
T313, T314, T323, T324, T336, T346,
T413, T414, T423, T424, T436, T446,
T511, T512, T522, T516, T526, T566, T533, T534, T544);
I have also changed the C++ code that ensures the number of particles is a multiple of 256 using padding. Basically, if I change the number of threads to a different number, then it that many particles that have a value of #QNAN. I guess that means that only the first ‘block’ is affected.
Curiously, it affects the value of ‘dp’ first, look at this output…
6.29947e004 1.74322e004 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
5.87786e004 3.20012e004 0.00000e+000 0.00000e+000 5.37863e009 1.#QNANe+000
1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000
1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000 1.#QNANe+000
As you can see, column 5 is affected first, and then this spreads to the rest. Basically, this kernel acts upon the particles many times but with different values of R and T each time.
dp is not changed in your kernel (you don’t write to it), so the bug will be in other code …
Ahh, I see. If I were to use fast shared memory, and then split the operation into a loop, that would minimize global memory usage.
Well, back to the program. DenisR, you are right, it is not the kernel causing this. I output the value of dp to the screen before and after the kernel and it was OK then. Somewhere in the main code it is getting corrupted.
Curiously, if I run the same code, with the same data input, two or more times, each time, the #QNAN appears in a different place. It seems to be randomly determined. This is very odd behaviour since no part of my code uses a random number generator. And ideas what this could be?
EDIT: thought it might have been time related, but it appears to be completely random.
We’re almost clueless without full source code. Can you post it here?
UPD:
5 sec limit… You can detect if you’re hitting it by checking return value from cudaThreadSynchronize() after kernel invocation.
UPD2: Ah, you’re editing messages too fast =) 5 sec is not an issue, yes? :)
I’m afraid the full source code is a huge monstrous mess, too big to sensibly post on the forums. It seems like the problem now lies in the C++ of the original code rather than the CUDA, or at least its interaction with CUDA. I’ll play with it a bit, see what I can deduce.
Thanks for the help.
No, the 5 second limit did not seem to be the issue. It seems completely random as to where it fails.