Density map Getting mad with cuda

Hi all,
I’m trying to develop a simple CUDA application to compute a density map of profiler soudings, ie: know how many points are within a given radius of a given point.
So idea is to load points UTM coordinates (done and working), transfer them to GPU memory (done and working too), launch one thread per point and let it compute its distance to all the other points. Here is the kernel code:

float pts: flat table with points coordinates (x0 y0 z0 x1 y1 z1 … xn yn zn)
float * res: res[idx] contains the number of neighbours of the point with index “idx”. This table should be of type int but I was trying with floats to see if it helps.
global void
lgzeDist(float pts, float res, ulong nb_line)
int idx = (blockDim.x
float dist;
float j;
while(idx < nb_line){
for (int i=0; i<nb_line; i++){
} //for i

Each thread is reading (and reading only) the float *pts table and writing in its dedicated cell of the float *res table so there should not be any shared data conflict.
nb_line is around 160000 with my test file, dist>0 means j should reach the value of nb_line-1 (ie number of neighbours with a distance to the current point strictly positive). As there are too many points I launch “only” 256 threads and each of them is applying the algorithm to several points (the while loop).
I have several problems with this code:

  • If “i” gets too big, say 100000, sometimes pts[i] gives 0 whereas pts[100000] gives the right value
  • If I remove the “if (dist>0)” line, j ends up with the nb_line value (that’s OK) but as soon as I start to add an “if” condition j is stuck to 0 even if the condition is something like if(dist>=0) which should be always true.
  • If instead of res[idx]=j i put res{[dx]=dist (to store the last computed distance) I get always zero, but if I replace “dist” with a cut’n paste of the “sqrtf(pow(…)” line I get the right distance from the current point to the last one in the pts table.

I must be missing something but I don’t know what and I’m becoming a bit mad about that. Do you have any idea ?
I’m using the last available version of the SDK for linux (2.0 beta2), kernel 2.6.24, nvidia drivers 177.13.

Thanks in advance.


You do a while (idx …) in your code. I think that should be an if (idx < nb_lines) You should be running a grid big enough to cover all idx’s. Also I don’t see any radius in your software. I would think you need to do a if(dist <= radius) instead of a if(dist > 0.0f)

Here is a version that uses shared data and other optimizations (valid for 256 threads per block)

__global__ void lgzeDist(float *pts, unsigned int *res, float radius, ulong nb_line)


    unsigned int idx = (blockDim.x*blockDim.y)* blockIdx.x*gridDim.y+blockIdx.y)+threadIdx.x*blockDim.y+threadIdx.y;

   __shared__ float s_pts[256*3];

   // load x,y,z of this particle into registers

    float x= pts[3*idx];

    float y= pts[3*idx+1];

    float z= pts[3*idx+2];

   float dist;

    if(idx < nb_line) {

        unsigned int j=0;

        for (int i=0; i<nb_line; i+=256) {

           if (i+threadIdx.x < nb_line) { // parallel load of data into shared memory

                s_pts[threadIdx.x]     = pts[3*i+threadIdx.x];

                s_pts[threadIdx.x+256] = pts[3*i+threadIdx.x+256];

                s_pts[threadIdx.x+512] = pts[3*i+threadIdx.x+512];



            // use the data from shared memory to update j

            for (int k = 0; (k < 256) && (k+i < nb_line); k++) { 


                if(dist < radius)




        } // for i

        res[idx]= j;



I think that you can find example code like this in the SDK in the particles example I believe. I also think you could use a kd-tree or equivalent, because now you are doing an exhaustive search.

Echoing what has been said above me. Also the last line of your code is not usually seen in cuda apps. As was said, you should use enough threads to cover all points so that each thread only runs the kernel once.

I would just like to say that i believe a __syncthreads() i necessary if you want to use Riedijk’s code, right after the loads in shared memory.

yup, I forgot the syncthreads()…

And another __syncthreads() is needed before the write to shared memory too. You don’t want a thread finishing the calculation and then going on to clobber the shared memory that another thread is reading :blink:

To the OP: E.D. Riedijk’s code shows you how to do a nice sliding window technique to use shared memory to avoid too many memory reads (often the bottleneck on the GPU). But you should read up on coalescing in the programming guide and change your data format so that you can read the data in a coalesced manner, this can make a huge difference in memory performance.

Hi all,

Thanks for all these answers and for the sample code. I was wrong to try to have one thread computing for several base points, I modified my code to take this into account using E.D. Riedijk code. I noticed the two missing __syncthreads() almost immediately so they were not an issue (that’s the kind of tricks I use every year to trap students :biggrin: ).

So now my code looks like that:

__global__ void

lgzeDist(float *pts, unsigned int *res, unsigned int nb_line)


    unsigned int idx = (blockDim.x*blockDim.y)*(blockIdx.x*gridDim.y+blockIdx.y)+threadIdx.x*blockDim.y+threadIdx.y;

    unsigned int thr_id=threadIdx.x*blockDim.y+threadIdx.y;

    unsigned int max=nb_line;

    __shared__ float s_pts[BLOCK_SIZE*BLOCK_SIZE*3];

    unsigned int j=0;

    float x,y,z;


        float dist=0.0f;




        for(int i=0; i<max; i+=NB_THR){








            for(int k=0; (k<NB_THR) && (k+i<max); k++){









Grid size is modified to run as many blocks of threads as needed.

Trouble is that if I run this code as is my screen starts blinking and after a while I get only zeros as result.

Instead of that if I use a last line such as res[idx]=idx I do get number “n” as nth result which is nice but quite useless.

So what’s the matter with copying the j value in the result table ? :argh:

I have to admit I’m a bit discouraged…

P.S: thanks MisterAnderson42 for the memory advice, I’ll do that once this code works.

Cant say i really put my head up to this, but the way you find idx and thr_id seems a bit odd to me at first glance.

This is how i find idx:

const int idx = (blockIdx.y*blockDim.x*gridDim.x)+blockIdx.x*blockDim.x+threadIdx.x;

As for thr_id, i would do the opposite:

thr_id = threadIdx.y*blockDim.x+threadIdx.x;

That second one could be equivalent in the end, i doubt idx is equivalent though (but then again, maybe it is!).

Also, did you try emudebugging your code to see if j++ is actually hit at least once?

Yes I did, j value goes up to several hundreds so it definitely works.

I’m emudebugging the full software right now to see if something happens, but it takes quite a lot of time… wait’n see.




I think it needs to be 3*i + thr_id + [0 1 2], otherwise your reads will not be coalesced (and also they might be getting out of bounds), but I have to admit I am not very lucid after a day of CUDA debugging ;)

Same problem here :yawn:

I’ll check that on monday morning and search the meaning of “coalesced” in french :smile:

Hmm, I don’t even know how to translate that in dutch, let alone in french ;)

Ok guys,
I can’t get that damn simple thing to work :argh:
I modified a bit friday’s code to avoid getting out of bounds and checked it worked in emudebug. I’m absolutely never getting out of bounds.
Still if I try to return j value with res[idx]=j I’ll get only zeros whereas res[idx]=idx for example will return “n” for nth line.
I tried with removing the if condition before j++ in order to be 100% sure to reach it (it should have always returned the number of lines in the file) but that didn’t change anything.
I works with shorter files up to 17k/18k lines so I guess I must somehow be messing up with memory when using longer data files.
I’m truly desperate so if you have any idea…


It works like a charm without X running. I guess it must be eating too much memory as some other tests do. For example the alignedTypes test fails completely when X is running. I’ll try to find a GPU card with more RAM on it as it seems my Quadro FX 570 is a bit underpowered.
However I’m a bit surprised, I’m not storing that many data:
~ 170k points, 3 coords per point, 4 bytes each => ~ 2 MB
256 threads per blocks => ~ 650 blocks, each of them using a shared 768 floats long table=> 3 Ko/block => ~ 2MB total.
The board is supposed to have 256 MB, it should be enough to drive my two monitors (1280102424 bits each) and store these data.
Am I missing something here ?
Thanks for everything.