 # Is this a task in which CUDA could speed up things?

Hi. I am new to CUDA, and would really appreciate if you could help me with two questions…

Given a 3D point named POINT, and an array of 3D points named COMP_POINTS_ARRAY, I need to determine which points in COMP_POINTS_ARRAY are close to POINT, i.e., points for which the euclidean distance to POINT is smaller than MAXDISTANCE.

QUESTION 1 - Is this the type of task where a CUDA implementation could be faster than a CPU implementation?

I have this coded in Java, working on the CPU. Given POINT and an array of 15,000,000 points, it does not take long to perform the calculations, . However, since I need to compare thousands of points to the ones in the array, time adds up and the computations can take days.

The euclidean distance is given by:

distance = SQRT( (x2 - x1)^2 + (y2-y1)^2 + (z2-z1)^2 ), where (x1,y1,z1) are the coordinates of one point and (x2,y2,z2) are the coordinates of the second point

So here’s my CUDA strategy:

• load into the GPU memory 3 float arrays compX, compY, and compZ with 15,000,000 elements each; these arrays contain the coordinates for all the points to compare (for example, coordinates for the first point are compX, compY and compZ)

• compute which are close to the POINT with coordinates pointX, pointY and pointZ using the kernel shown in the bottom, and put the comparison results in an array of 15,000,000 booleans

• copy the array with the booleans to the host; leave the other arrays (compX, compY, compZ) in memory so that comparison of the next point does not require loading them to the GPU memory again

QUESTION 2 - the cuda version takes longer than the cpu version when comparing thousands of points. Am I doing something wrong or was this to be expected?

``````#include <Math.h>

extern "C" __global__ void isClose(float maxDistance, float pointX, float pointY, float pointZ, float *compX, float *compY, float *compZ, bool *isClose)

{

int index = blockIdx.x*blockDim.x + threadIdx.x;

if(index>=0) {

float diffX = compX[index] - pointX;

float diffY = compY[index] - pointY;

float diffZ = compZ[index] - pointZ;

diffX = diffX * diffX;

diffY = diffY * diffY;

diffZ = diffZ * diffZ;

float distance = sqrt(diffX + diffY + diffZ);

if(distance < maxDistance) {

isClose[index] = true;

}

else {

isClose[index] = false;

}

}

}
``````

I forgot question number 3 :)

When I’m running the CUDA version to compare the points, my monitors start having “hiccups”: they freeze for a second, start working for 1 second, freeze again, and so on… Is this also something I should expect?

My NVIDIA card is a 9600 GT.

This should be a perfect problem for the GPU. It is completely data-parallel. Using each thread to compute the results for several elements might be better than using one thread per element.

If the 9600 GT is also the video card you use for display then “hiccups” can occur.

How does the performance of the 9600 GT compare to the one on the CPU, and what kind of CPU do you have? The 9600 GT does not have that much computational power. On the other hand your problem suits a GPU very well, so it is not clear to me wether this should be faster on a CPU than on a 9600 GT.

Well, on compute capability <1.2 devices (like your 9600) one byte memory accesses are uncoalesced and turn into 16 separate trips to memory. I’m talking about isClose[index] = true; - this is a write of a single bool per thread. You could try either using 32bit words instead of bools (memory wasted) or have each thread compare four distances and write back a bool4 (decreases parallelism a bit but since you already have 15M elements it shouldn’t be a big problem). Or you could get a GT200 :)

Does your whole algorithm require checking only a single point against N others? Or do you intend to launch this kernel in a loop many times with different float pointX, float pointY & float pointZ?

That’s what I thought! :D

Can you suggest how this could be done? In the kernel I only know how to get 1 index in the array… It’s my first CUDA program, so I still find all that stuff about blocks and threads kind of confusing, I guess I need to read the programming guide Yes, it is… Time to buy a C1060.

The CPU is a Q6600 (quad-core, 2.40GHz). Comparing 100 points to the other 15,000,000 using the CPU takes around 7 minutes. That’s using only one core; I can change the algorithm to use all four, which should speed things up quite a bit. On the GPU I get around 8 minutes. I’ve tried using jCuda (Java bindings for CUDA that works through JNI), and also tried doing it directly in the command line, to avoid the JNI overhead, but the results do not differ much.

Just as a side note, I have profiled the algorithm when running on the CPU, and as expected the greatest waste of time is in the function that calculates the distance between two points, in particular the multiplication of the differences, but also the functions to get the point coordinates for each point (so the memory access is also a problem with the CPU).

Thanks for the help, jjp!

This is a problem that is basically O(N^2), but can be easily re-conducted to O(N)… The number of comparison per point still remain high, so it can be worthwhile the gpu optimization. But using the trick was enough for my case, on CPU, serially.

The trick to re-conduct to O(N) is known as “binning”:

• build a 3D cubic grid (this can be a dense matrix, for comp chem, or a sparse one, for astrophysics) with grid step maxDistance, capable to contain the whole system.

• assign your points to the grid bins. Each bin will contains the points (or a list to their references) that are spatially there.

• evaluate your condition for each particle in each grid cell, comparing just with the particles in the 26 close cells (imagine 3x3x3 cells: you are evaluating distances of particle in the center cell, with respect to particles in itself and in all the surroundings cells)

• enjoy yourself finding the correct way to avoid evaluating distances twice.

This starts being worthwhile with systems larger than 3*MAXDISTANCE obviously. Doing this, I have been able to perform tasks taking hours, e.g. geometry checking, in seconds, just serially. For reference, you can find this trick well documented in a molecular dynamics package: LAMMPS.

Oh, and please, do NOT take the sqrt! This is slow on both CPU and GPU.

``````float distance = sqrt(diffX + diffY + diffZ);

if(distance < maxDistance) {

isClose[index] = true;

}
``````

do:

``````float maxDistance2 = maxDistance*maxDistance; // to be evaluated just once!

...

float distance2 = diffX + diffY + diffZ

if(distance2 < maxDistance2) {

isClose[index] = true;

}
``````

If you have not already done it (that’s quite a typical optimization), it is normal that the profiler do not show the sqrt call time (it is in libc or libm, and by default is not profiled), but you can have evidence of it seeing an important difference in the real elapsed time and in the profiled time.

IHTH, regards,

Sigismondo

heheh I was using a bool cause I thought it would make things faster. I had no idea. :D

However, I also tried making the kernel write the euclidean distances (floats) instead of the bools, and did not see much difference in speed.

Actually I’m thinking about getting the C1060, but first I wanted to make sure I can actually use the GPU correctly. Even if the 9600 GT is nothing special, I was still thinking I’d see a significant performance increase dude to its 64 cores, even if the CPU’s clock is significantly higher…

Yes, and that is the problem that made me look at CUDA. I actually need to compare something like 300,000 points to the 15,000,000 in the array, and I need to do it 3 times :D If it takes me 7 minutes to compare 100, then comparing 300,000 takes… too long :D

So what I have right now is the kernel being called 300,000 times; the 15,000,000 points are kept in memory in-between calls, so I don’t need to copy them from host to gpu memory for each call.

I know what you’re thinking, maybe I’m tackling the problem wrong and maybe I could do it with comparing so many points… But assuming that that is the only way to do it… I was expecting this kind of problem to be the type where CUDA would result in a 1,000x speed improvement ;)

Thanks for helping, Big Mac.

8 minutes? That’s crazy! I had a similar algorithm going the other way around, a million points check which of 200 centroids is the closest (discrete Voronoi diagram), and it took 12ms on a GF 8800. It scaled linearly in respect to both variables, so 15M and 100 centroids would take less than 100ms. I got a 100x speed-up out of the box, you should be able to do great with CUDA. Something is really weird here.

If you’re checking 100 points, it might be better to have them all copied into constant memory and do a loop over them within the kernel (instead of looping over kernels, extra launch overhead). But even your implementation shouldn’t take 8 minutes :)

EDIT:

Just read about 300k points. Well, you could put them into constant memory as well, only in smaller batches (5000 points should fit, so you’d launch 60 kernels each looping over 5000 points).

/EDIT

By the way, I think you don’t need to do sqrt there in the distance calculation. Instead supply maxDistance^2 in the condition.

``````sqrt(diffX + diffY + diffZ) < maxDistance <=> (diffX + diffY + diffZ) < maxDistance^2

//assuming everything is >= 0
``````

Hi sigismondo. Thank you very much for your suggestions. I work in the field of data mining, so I am aware of those sort of optimizations. Partitioning the 3D space is a common strategy for k-nearest neighbor algorithms, and there are several strategies for that (kd-trees, ball trees, etc.). Also, I am already using the SQRT optimization, which by itself represents an improvement of 15% in speed. ^_^

The thing is, I am using the brute force approach because I wanted to see how that would work with CUDA. Had it been fast enough, it could prove useful if one day if I needed to calculate all the distances to all the points, instead of just determining which points are close enough.

Anyway, I do intend to use the strategy you suggested. But as of right now, I want to find a reason to invest in a new NVIDIA card ;)

Thanks again!

My feelings exactly :D

That’s why I believe I’m doing something wrong in the CUDA implementation, which is, without a doubt, do to me not knowing enough about this technology.

I guess I need to read the guides…

Thanks!

I would have said much more… probably your main bottleneck are memory transfers - the GPU is really happy it can perform some more computation in the while…

I see. Well there are many easy ways to transform an O(N^2) algorithm in O(N^3), for this target Seriously speaking, in your CUDA implementation, then, you need to cache the positions in shared memory.

The point is that you are going to evaluate N distances per point in the brute force approach. You shoud preload blocks of positions in shared memory, taking care of coalescing and of bank conflicts (that is, interleaved), to avoid too much traffic to global memory. For this you absolutely need to study the CUDA programming guide. This architecture is quite different to CPUs - the optimization in memory transfers is performed letting adjacent threads load work on close (4 bytes apart) data, both for global and shared memory.

Actually shared memory is a sort of L1 cache where transfers must be programmed by hand: if you need to reuse data, you need to move them to shared memory by hand.

No, no, it’s ok. It already takes 10 days to do it with the O(n^2) algorithm. ;)

Yes, all these different types of memory are going way over my head… Time to read the guide. Anyway, I believe you guys have answered my main question, by assuring me that this is the type of problem in which CUDA should excel… Now it’s up to me to make it work… B)