Density map: getting mad with CUDA

Hi all,
I’m trying to develop a simple CUDA application to compute a density map of profiler soundings, i.e. to know how many points lie within a given radius of a given point.
So the idea is to load the points’ UTM coordinates (done and working), transfer them to GPU memory (done and working too), then launch one thread per point and let it compute its distance to all the other points. Here is the kernel code:

float *pts: flat table with the point coordinates (x0 y0 z0 x1 y1 z1 … xn yn zn)
float *res: res[idx] contains the number of neighbours of the point with index “idx”. This table should be of type int, but I was trying with floats to see if it helps.
__global__ void
lgzeDist(float *pts, float *res, ulong nb_line)
{
    int idx = (blockDim.x*blockDim.y)*(blockIdx.x*gridDim.y+blockIdx.y)+threadIdx.x*blockDim.y+threadIdx.y;
    float dist;
    float j;
    while(idx < nb_line){
        j=0;
        for (int i=0; i<nb_line; i++){
            dist=sqrtf(powf(pts[3*idx]-pts[3*i],2)+powf(pts[3*idx+1]-pts[3*i+1],2)+powf(pts[3*idx+2]-pts[3*i+2],2));
            if(dist>0)
                j++;
        } //for i
        res[idx]=j;
        idx+=blockDim.x*blockDim.y;
    }
}

Each thread is reading (and only reading) the float *pts table and writing to its own dedicated cell of the float *res table, so there should not be any shared-data conflict.
nb_line is around 160000 with my test file; dist>0 means j should reach the value nb_line-1 (i.e. the number of neighbours with a strictly positive distance to the current point). As there are too many points, I launch “only” 256 threads and each of them applies the algorithm to several points (the while loop).
I have several problems with this code:

  • If “i” gets too big, say 100000, sometimes pts[i] gives 0 whereas pts[100000] gives the right value.
  • If I remove the “if (dist>0)” line, j ends up with the value nb_line (that’s OK), but as soon as I add an “if” condition j is stuck at 0, even if the condition is something like if(dist>=0), which should always be true.
  • If instead of res[idx]=j I put res[idx]=dist (to store the last computed distance) I always get zero, but if I replace “dist” with a cut’n’paste of the “sqrtf(powf(…))” line I get the right distance from the current point to the last one in the pts table.

I must be missing something, but I don’t know what, and I’m going a bit mad over it. Do you have any idea?
I’m using the latest available version of the SDK for Linux (2.0 beta2), kernel 2.6.24, nvidia drivers 177.13.

Thanks in advance.


Irvin

You do a while (idx …) in your code. I think that should be an if (idx < nb_line): you should be running a grid big enough to cover all idx’s. Also, I don’t see any radius in your software; I would think you need if(dist <= radius) instead of if(dist > 0.0f).
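A sketch of what “a grid big enough to cover all idx’s” means on the host side, assuming a flat 1D launch (the helper name is illustrative, not from the original code):

```c
/* Round up so that nb_line points are covered by blocks of
 * threads_per_block threads each; the kernel then guards its
 * work with if (idx < nb_line). */
unsigned int blocks_for(unsigned long nb_line, unsigned int threads_per_block)
{
    return (unsigned int)((nb_line + threads_per_block - 1) / threads_per_block);
}
```

With nb_line = 160000 and 256 threads per block this gives 625 blocks, e.g. lgzeDist<<<blocks_for(nb_line, 256), 256>>>(…).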

Here is a version that uses shared data and other optimizations (valid for 256 threads per block)

__global__ void lgzeDist(float *pts, unsigned int *res, float radius, ulong nb_line)
{
    unsigned int idx = (blockDim.x*blockDim.y)*(blockIdx.x*gridDim.y+blockIdx.y)+threadIdx.x*blockDim.y+threadIdx.y;
    __shared__ float s_pts[256*3];

    // load x,y,z of this particle into registers
    float x = pts[3*idx];
    float y = pts[3*idx+1];
    float z = pts[3*idx+2];
    float dist;

    if(idx < nb_line) {
        unsigned int j=0;
        for (int i=0; i<nb_line; i+=256) {
            if (i+threadIdx.x < nb_line) { // parallel load of data into shared memory
                s_pts[threadIdx.x]     = pts[3*i+threadIdx.x];
                s_pts[threadIdx.x+256] = pts[3*i+threadIdx.x+256];
                s_pts[threadIdx.x+512] = pts[3*i+threadIdx.x+512];
            }

            // use the data from shared memory to update j
            for (int k = 0; (k < 256) && (k+i < nb_line); k++) {
                dist=sqrtf(powf(x-s_pts[3*k],2)+powf(y-s_pts[3*k+1],2)+powf(z-s_pts[3*k+2],2));
                if(dist < radius)
                    j++;
            }
        } // for i
        res[idx] = j;
    }
}

I think you can find example code like this in the SDK, in the particles example I believe. I also think you could use a kd-tree or equivalent, because right now you are doing an exhaustive search.

Echoing what has been said above me. Also, the last line of your code is not usually seen in CUDA apps. As was said, you should use enough threads to cover all points, so that each thread only runs the kernel once.

I would just like to say that I believe a __syncthreads() is necessary if you want to use Riedijk’s code, right after the loads into shared memory.

Yup, I forgot the __syncthreads()…

And another __syncthreads() is needed before the write to shared memory too. You don’t want a thread finishing the calculation and then going on to clobber the shared memory that another thread is reading :blink:

To the OP: E.D. Riedijk’s code shows you how to do a nice sliding window technique to use shared memory to avoid too many memory reads (often the bottleneck on the GPU). But you should read up on coalescing in the programming guide and change your data format so that you can read the data in a coalesced manner, this can make a huge difference in memory performance.
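To illustrate the coalescing advice: with the interleaved x0 y0 z0 x1 … layout, thread t reads pts[3*t], so consecutive threads touch addresses 12 bytes apart; with a structure-of-arrays layout, thread t reads x[t] and consecutive threads touch consecutive floats, which the hardware can combine into one memory transaction. A hypothetical host-side repack (names are illustrative, not from the thread):

```c
/* Hypothetical AoS -> SoA repack: after this, thread t reading x[t]
 * touches the address right next to thread t+1's, so reads coalesce. */
void aos_to_soa(const float *pts, float *x, float *y, float *z,
                unsigned long nb_line)
{
    for (unsigned long i = 0; i < nb_line; i++) {
        x[i] = pts[3*i];
        y[i] = pts[3*i+1];
        z[i] = pts[3*i+2];
    }
}
```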

Hi all,

Thanks for all these answers and for the sample code. I was wrong to try to have one thread compute several base points; I modified my code to take this into account using E.D. Riedijk’s code. I noticed the two missing __syncthreads() almost immediately, so they were not an issue (that’s the kind of trick I use every year to trap students :biggrin: ).

So now my code looks like that:

__global__ void
lgzeDist(float *pts, unsigned int *res, unsigned int nb_line)
{
    unsigned int idx = (blockDim.x*blockDim.y)*(blockIdx.x*gridDim.y+blockIdx.y)+threadIdx.x*blockDim.y+threadIdx.y;
    unsigned int thr_id = threadIdx.x*blockDim.y+threadIdx.y;
    unsigned int max = nb_line;
    __shared__ float s_pts[BLOCK_SIZE*BLOCK_SIZE*3];
    unsigned int j=0;
    float x,y,z;

    if(idx<max){
        float dist=0.0f;
        x=pts[idx*3];
        y=pts[idx*3+1];
        z=pts[idx*3+2];
        for(int i=0; i<max; i+=NB_THR){
            __syncthreads();
            if(thr_id+i<3*max){
                s_pts[3*thr_id]=pts[3*thr_id+i];
                s_pts[3*thr_id+1]=pts[3*thr_id+i+1];
                s_pts[3*thr_id+2]=pts[3*thr_id+i+2];
            }
            __syncthreads();
            for(int k=0; (k<NB_THR) && (k+i<max); k++){
                dist=sqrtf(powf(x-s_pts[k],2)+powf(y-s_pts[k+1],2)+powf(z-s_pts[k+2],2));
                if(dist<42.0f)
                    j++;
            }
        }
        res[idx]=j;
    }
}

Grid size is modified to run as many blocks of threads as needed.

Trouble is, if I run this code as-is my screen starts blinking, and after a while I get only zeros as a result.

If instead I use a last line such as res[idx]=idx, I do get the number “n” as the nth result, which is nice but quite useless.

So what’s the matter with copying the j value into the result table? :argh:

I have to admit I’m a bit discouraged…

P.S: thanks MisterAnderson42 for the memory advice, I’ll do that once this code works.

Can’t say I really put my head into this, but the way you compute idx and thr_id seems a bit odd to me at first glance.

This is how i find idx:

const int idx = (blockIdx.y*blockDim.x*gridDim.x)+blockIdx.x*blockDim.x+threadIdx.x;

As for thr_id, I would do the opposite:

thr_id = threadIdx.y*blockDim.x+threadIdx.x;

That second one could be equivalent in the end; I doubt idx is equivalent though (but then again, maybe it is!).

Also, did you try emudebugging your code to see if j++ is actually hit at least once?

Yes I did; the j value goes up to several hundreds, so it definitely works.

I’m emudebugging the full software right now to see if something happens, but it takes quite a lot of time… wait’n see.

    s_pts[3*thr_id]=pts[3*thr_id+i];
    s_pts[3*thr_id+1]=pts[3*thr_id+i+1];
    s_pts[3*thr_id+2]=pts[3*thr_id+i+2];

I think it needs to be 3*i + thr_id + [0 1 2], otherwise your reads will not be coalesced (and they might also be going out of bounds), but I have to admit I am not very lucid after a day of CUDA debugging ;)
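One way to convince yourself of this kind of indexing (here assuming the 256-thread tile pattern from E.D. Riedijk’s code above, i.e. loads at 3*i + thr_id, 3*i + thr_id + 256 and 3*i + thr_id + 512): the three loads done by threads 0..255 should cover each of the 768 floats of the tile exactly once, with each load being a run of consecutive addresses. A quick CPU check of that property:

```c
#include <string.h>

/* Check that the three strided loads done by threads t = 0..255 cover
 * the 768 floats of the tile starting at point i exactly once each. */
int tile_load_covers(unsigned long i)
{
    unsigned char hit[768];
    memset(hit, 0, sizeof hit);
    for (unsigned long t = 0; t < 256; t++) {
        hit[(3*i + t)       - 3*i]++;   /* first load:  offsets   0..255 */
        hit[(3*i + t + 256) - 3*i]++;   /* second load: offsets 256..511 */
        hit[(3*i + t + 512) - 3*i]++;   /* third load:  offsets 512..767 */
    }
    for (int k = 0; k < 768; k++)
        if (hit[k] != 1)
            return 0;
    return 1;
}
```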

Same problem here :yawn:

I’ll check that on Monday morning and look up the meaning of “coalesced” in French :smile:

Hmm, I don’t even know how to translate that in dutch, let alone in french ;)

Ok guys,
I can’t get that damn simple thing to work :argh:
I modified Friday’s code a bit to avoid going out of bounds and checked in emudebug that it worked. I’m absolutely never going out of bounds.
Still, if I try to return the j value with res[idx]=j I get only zeros, whereas res[idx]=idx for example returns “n” for the nth line.
I tried removing the if condition before j++ in order to be 100% sure to reach it (it should then always have returned the number of lines in the file), but that didn’t change anything.
It works with shorter files, up to 17k/18k lines, so I guess I must somehow be messing up memory when using longer data files.
I’m truly desperate so if you have any idea…

Thanks.

It works like a charm without X running. I guess it must be eating too much memory, as some other tests do; for example, the alignedTypes test fails completely when X is running. I’ll try to find a GPU card with more RAM on it, as it seems my Quadro FX 570 is a bit underpowered.
However I’m a bit surprised, since I’m not storing that much data:
~170k points, 3 coords per point, 4 bytes each => ~2 MB
256 threads per block => ~650 blocks, each of them using a shared table of 768 floats => 3 KB/block => ~2 MB total.
The board is supposed to have 256 MB; it should be enough to drive my two monitors (1280×1024, 24 bits each) and store this data.
Am I missing something here ?
Thanks for everything.
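The arithmetic in that last estimate can be double-checked with a couple of throwaway helpers (hypothetical, purely to recompute the figures from the message above):

```c
/* Back-of-envelope check of the memory figures quoted above. */

/* Flat x y z table: 3 floats per point. */
unsigned long point_bytes(unsigned long points)
{
    return points * 3 * sizeof(float);
}

/* Rough framebuffer footprint for the monitors.
 * Note: the 3 KB of shared memory per block is on-chip, per
 * multiprocessor, and is not carved out of the board's DRAM. */
unsigned long framebuffer_bytes(int monitors, long w, long h, int bits)
{
    return (unsigned long)monitors * w * h * (bits / 8);
}
```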