Is this a task in which CUDA could speed up things?

Hi. I am new to CUDA, and would really appreciate if you could help me with two questions…

Given a 3D point named POINT, and an array of 3D points named COMP_POINTS_ARRAY, I need to determine which points in COMP_POINTS_ARRAY are close to POINT, i.e., points for which the euclidean distance to POINT is smaller than MAXDISTANCE.

QUESTION 1 - Is this the type of task where a CUDA implementation could be faster than a CPU implementation?

I have this coded in Java, running on the CPU. Given POINT and an array of 15,000,000 points, it does not take long to perform the calculations. However, since I need to compare thousands of points against the ones in the array, the time adds up and the computations can take days.

The euclidean distance is given by:

distance = SQRT( (x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2 ), where (x1, y1, z1) are the coordinates of one point and (x2, y2, z2) are the coordinates of the other point

So here’s my CUDA strategy:

  • load three float arrays compX, compY, and compZ, each with 15,000,000 elements, into GPU memory; these arrays contain the coordinates of all the points to compare (for example, the coordinates of the first point are compX[0], compY[0], and compZ[0])

  • compute which of those points are close to the POINT with coordinates pointX, pointY, and pointZ using the kernel shown at the bottom, and put the comparison results in an array of 15,000,000 booleans

  • copy the boolean array back to the host; leave the other arrays (compX, compY, compZ) in GPU memory so that comparing the next point does not require loading them again (see the host-side sketch after this list)
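For reference, here is a rough host-side sketch of that flow in plain CUDA C (the jCuda calls should map onto this; names are placeholders and error checking is omitted):

#include <cuda_runtime.h>

extern "C" __global__ void isClose(float, float, float, float,
                                   float *, float *, float *, bool *);

// Hypothetical helper: compare one point against n points already resident on the GPU.
void compareOnePoint(const float *hX, const float *hY, const float *hZ, bool *hResult,
                     int n, float pX, float pY, float pZ, float maxDistance)
{
    static float *dX = 0, *dY = 0, *dZ = 0;
    static bool *dResult = 0;

    // Step 1 (first call only): copy the three coordinate arrays to the GPU once.
    if (dX == 0) {
        cudaMalloc(&dX, n * sizeof(float));
        cudaMalloc(&dY, n * sizeof(float));
        cudaMalloc(&dZ, n * sizeof(float));
        cudaMalloc(&dResult, n * sizeof(bool));
        cudaMemcpy(dX, hX, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dY, hY, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dZ, hZ, n * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Step 2: launch the kernel, one thread per element. Note that grid * block
    // can exceed n, so the kernel should really guard against index >= n.
    int block = 256;
    int grid  = (n + block - 1) / block;
    isClose<<<grid, block>>>(maxDistance, pX, pY, pZ, dX, dY, dZ, dResult);

    // Step 3: copy only the boolean results back to the host.
    cudaMemcpy(hResult, dResult, n * sizeof(bool), cudaMemcpyDeviceToHost);
}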

QUESTION 2 - The CUDA version takes longer than the CPU version when comparing thousands of points. Am I doing something wrong, or was this to be expected?

Thanks in advance.

#include <math.h>

extern "C" __global__ void isClose(float maxDistance, float pointX, float pointY, float pointZ,
                                   float *compX, float *compY, float *compZ, bool *isClose)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    if(index >= 0) {
        float diffX = compX[index] - pointX;
        float diffY = compY[index] - pointY;
        float diffZ = compZ[index] - pointZ;

        diffX = diffX * diffX;
        diffY = diffY * diffY;
        diffZ = diffZ * diffZ;

        float distance = sqrt(diffX + diffY + diffZ);

        if(distance < maxDistance) {
            isClose[index] = true;
        }
        else {
            isClose[index] = false;
        }
    }
}

I forgot question number 3 :)

When I’m running the CUDA version to compare the points, my monitors start having “hiccups”: they freeze for a second, start working for 1 second, freeze again, and so on… Is this also something I should expect?

My NVIDIA card is a 9600 GT.

This should be a perfect problem for the GPU. It is completely data-parallel. Using each thread to compute the results for several elements might be better than using one thread per element.
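For example, a minimal sketch of that idea using a grid-stride loop (n is the total number of points; launch with whatever grid size fits the device, and each thread covers several elements):

extern "C" __global__ void isCloseStrided(float maxDistance,
                                          float pointX, float pointY, float pointZ,
                                          const float *compX, const float *compY,
                                          const float *compZ, bool *result, int n)
{
    // Each thread walks the array with a stride equal to the total number of
    // threads, so one modest grid covers all 15,000,000 elements.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float dx = compX[i] - pointX;
        float dy = compY[i] - pointY;
        float dz = compZ[i] - pointZ;
        result[i] = sqrtf(dx * dx + dy * dy + dz * dz) < maxDistance;
    }
}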

If the 9600 GT is also the video card you use for display then “hiccups” can occur.

How does the performance on the 9600 GT compare to that on the CPU, and what kind of CPU do you have? The 9600 GT does not have that much computational power. On the other hand, your problem suits a GPU very well, so it is not clear to me whether this should be faster on the CPU than on a 9600 GT.

Well, on compute capability <1.2 devices (like your 9600) one-byte memory accesses are uncoalesced and turn into 16 separate trips to memory. I’m talking about isClose[index] = true; - this is a write of a single bool per thread. You could try either using 32-bit words instead of bools (memory wasted) or have each thread compare four distances and write back a bool4 (decreases parallelism a bit, but since you already have 15M elements it shouldn’t be a big problem). Or you could get a GT200 :)
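For what it’s worth, a rough sketch of the “four results per thread” idea (CUDA has no built-in bool4 vector type, so this uses uchar4, which is 4-byte aligned; the coordinate arrays are read as float4, which assumes the point count is a multiple of four and the usual cudaMalloc alignment; close1 is just a hypothetical helper):

__device__ unsigned char close1(float x, float y, float z,
                                float px, float py, float pz, float maxDistance)
{
    float dx = x - px, dy = y - py, dz = z - pz;
    return sqrtf(dx*dx + dy*dy + dz*dz) < maxDistance;
}

extern "C" __global__ void isClose4(float maxDistance,
                                    float pointX, float pointY, float pointZ,
                                    const float4 *compX4, const float4 *compY4,
                                    const float4 *compZ4, uchar4 *result, int n4)   // n4 = n / 4
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        // One 16-byte load per array gives four consecutive coordinates (coalesced).
        float4 x = compX4[i], y = compY4[i], z = compZ4[i];
        uchar4 r;
        r.x = close1(x.x, y.x, z.x, pointX, pointY, pointZ, maxDistance);
        r.y = close1(x.y, y.y, z.y, pointX, pointY, pointZ, maxDistance);
        r.z = close1(x.z, y.z, z.z, pointX, pointY, pointZ, maxDistance);
        r.w = close1(x.w, y.w, z.w, pointX, pointY, pointZ, maxDistance);
        result[i] = r;   // one 4-byte store instead of four 1-byte stores
    }
}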

Does your whole algorithm require checking only a single point against N others? Or do you intend to launch this kernel in a loop many times with different float pointX, float pointY & float pointZ?

That’s what I thought! :D

Can you suggest how this could be done? In the kernel I only know how to get one index into the array… It’s my first CUDA program, so I still find all that stuff about blocks and threads kind of confusing. I guess I need to read the programming guide.

Yes, it is… Time to buy a C1060.

The CPU is a Q6600 (quad-core, 2.40 GHz). Comparing 100 points against the other 15,000,000 on the CPU takes around 7 minutes. That’s using only one core; I can change the algorithm to use all four, which should speed things up quite a bit. On the GPU I get around 8 minutes. I’ve tried using jCuda (Java bindings for CUDA that work through JNI), and also tried running it directly from the command line to avoid the JNI overhead, but the results do not differ much.

Just as a side note, I have profiled the algorithm when running on the CPU, and as expected most of the time is spent in the function that calculates the distance between two points, in particular in the multiplication of the differences, but also in the functions that fetch the coordinates of each point (so memory access is a problem on the CPU as well).

Thanks for the help, jjp!

This is a problem that is basically O(N^2), but it can easily be reduced to O(N)… The number of comparisons per point still remains high, so the GPU optimization can still be worthwhile. But using the trick was enough in my case, on the CPU, serially.

The trick that reduces it to O(N) is known as “binning”:

  • build a 3D cubic grid (this can be a dense matrix, for computational chemistry, or a sparse one, for astrophysics) with grid step maxDistance, capable of containing the whole system.

  • assign your points to the grid bins. Each bin will contain the points (or a list of references to them) that are spatially located there.

  • evaluate your condition for each particle in each grid cell, comparing it only with the particles in the 26 neighboring cells (imagine 3x3x3 cells: you are evaluating distances from the particles in the center cell to the particles in the center cell itself and in all the surrounding cells).

  • enjoy yourself finding the correct way to avoid evaluating distances twice.

This starts being worthwhile with systems larger than 3*MAXDISTANCE per side, obviously. Doing this, I have been able to perform tasks that took hours, e.g. geometry checking, in seconds, just serially. For reference, you can find this trick well documented in a molecular dynamics package: LAMMPS.
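For illustration, a rough serial sketch of those steps in plain C++ (dense grid, simplified boundary handling; all names are made up). It compares squared distances, see the note about sqrt just below:

#include <vector>

struct Point { float x, y, z; };

// For each query point, count how many stored points lie within maxDistance.
// minX/minY/minZ and nx/ny/nz describe the bounding box of the whole system.
std::vector<int> countNeighbors(const std::vector<Point>& pts,
                                const std::vector<Point>& queries,
                                float maxDistance,
                                float minX, float minY, float minZ,
                                int nx, int ny, int nz)
{
    float inv = 1.0f / maxDistance;
    float md2 = maxDistance * maxDistance;
    auto cellOf = [&](const Point& p, int& cx, int& cy, int& cz) {
        cx = (int)((p.x - minX) * inv);
        cy = (int)((p.y - minY) * inv);
        cz = (int)((p.z - minZ) * inv);
    };

    // Step 1: assign every stored point to its bin.
    std::vector<std::vector<int>> bins(nx * ny * nz);
    for (int i = 0; i < (int)pts.size(); ++i) {
        int cx, cy, cz; cellOf(pts[i], cx, cy, cz);
        bins[(cz * ny + cy) * nx + cx].push_back(i);
    }

    // Step 2: for each query point, compare only against the 3x3x3 block of
    // cells around it instead of against all N points.
    std::vector<int> counts(queries.size(), 0);
    for (size_t q = 0; q < queries.size(); ++q) {
        int cx, cy, cz; cellOf(queries[q], cx, cy, cz);
        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int x = cx + dx, y = cy + dy, z = cz + dz;
            if (x < 0 || y < 0 || z < 0 || x >= nx || y >= ny || z >= nz) continue;
            for (int i : bins[(z * ny + y) * nx + x]) {
                float ddx = pts[i].x - queries[q].x;
                float ddy = pts[i].y - queries[q].y;
                float ddz = pts[i].z - queries[q].z;
                if (ddx*ddx + ddy*ddy + ddz*ddz < md2) counts[q]++;
            }
        }
    }
    return counts;
}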

Oh, and please, do NOT take the sqrt! This is slow on both CPU and GPU.

I mean, instead of:

float distance = sqrt(diffX + diffY + diffZ);

if(distance < maxDistance) {
    isClose[index] = true;
}

do:

float maxDistance2 = maxDistance * maxDistance; // to be evaluated just once!
...
float distance2 = diffX + diffY + diffZ;

if(distance2 < maxDistance2) {
    isClose[index] = true;
}

If you have not already done this (it is quite a typical optimization), it is normal that the profiler does not show the time spent in the sqrt call (it lives in libc or libm, which are not profiled by default), but you can see evidence of it as a noticeable difference between the real elapsed time and the profiled time.

IHTH, regards,

Sigismondo

heheh I was using a bool because I thought it would make things faster. I had no idea. :D

However, I also tried making the kernel write the euclidean distances (floats) instead of the bools, and did not see much difference in speed.

Actually I’m thinking about getting a C1060, but first I wanted to make sure I can actually use the GPU correctly. Even though the 9600 GT is nothing special, I was still expecting a significant performance increase due to its 64 cores, even if the CPU’s clock is significantly higher…

Yes, and that is the problem that made me look at CUDA. I actually need to compare something like 300,000 points to the 15,000,000 in the array, and I need to do it 3 times :D If it takes me 7 minutes to compare 100, then comparing 300,000 takes… too long :D

So what I have right now is the kernel being called 300,000 times; the 15,000,000 points are kept in GPU memory between calls, so I don’t need to copy them from the host for each call.

I know what you’re thinking: maybe I’m tackling the problem wrong and maybe I could do it without comparing so many points… But assuming that this is the only way to do it… I was expecting this kind of problem to be the type where CUDA would result in a 1,000x speed improvement ;)

Thanks for helping, Big Mac.

8 minutes? That’s crazy! I had a similar algorithm going the other way around: a million points each checking which of 200 centroids is the closest (a discrete Voronoi diagram), and it took 12 ms on a GF 8800. It scaled linearly with respect to both variables, so 15M points and 100 centroids would take less than 100 ms. I got a 100x speed-up out of the box, so you should be able to do great with CUDA. Something is really weird here.

If you’re checking 100 points, it might be better to have them all copied into constant memory and loop over them within the kernel (instead of looping over kernel launches, which adds launch overhead). But even your current implementation shouldn’t take 8 minutes :)

EDIT:

Just read about 300k points. Well, you could put them into constant memory as well, only in smaller batches (5000 points should fit, so you’d launch 60 kernels each looping over 5000 points).

/EDIT
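To make the constant-memory idea concrete, a hedged sketch of what one batched kernel could look like (BATCH, isCloseBatch and the result layout are made up for illustration; note that the flag array grows as n * batchSize, so in practice the batch may have to be smaller than 5,000 just to keep the results resident):

#define BATCH 5000                      // 5000 * sizeof(float3) = 60 KB, fits in 64 KB of constant memory

__constant__ float3 queryPoints[BATCH];

extern "C" __global__ void isCloseBatch(float maxDistance, int batchSize, int n,
                                        const float *compX, const float *compY,
                                        const float *compZ, bool *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = compX[i], y = compY[i], z = compZ[i];   // read the data point once
    for (int q = 0; q < batchSize; ++q) {             // same q for all threads: constant memory broadcast
        float dx = x - queryPoints[q].x;
        float dy = y - queryPoints[q].y;
        float dz = z - queryPoints[q].z;
        result[(size_t)q * n + i] = sqrtf(dx*dx + dy*dy + dz*dz) < maxDistance;
    }
}

// Host side, per batch:
//   cudaMemcpyToSymbol(queryPoints, hostBatch, batchSize * sizeof(float3));
//   isCloseBatch<<<grid, block>>>(maxDistance, batchSize, n, dX, dY, dZ, dResult);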

By the way, I think you don’t need to do sqrt there in the distance calculation. Instead supply maxDistance^2 in the condition.

sqrt(diffX + diffY + diffZ) < maxDistance <=> (diffX + diffY + diffZ) < maxDistance^2 

//assuming everything is >= 0

Hi sigismondo. Thank you very much for your suggestions. I work in the field of data mining, so I am aware of those sorts of optimizations. Partitioning the 3D space is a common strategy for k-nearest-neighbor algorithms, and there are several structures for that (kd-trees, ball trees, etc.). Also, I am already using the SQRT optimization, which by itself gives a 15% speed improvement. ^_^

The thing is, I am using the brute-force approach because I wanted to see how it would work with CUDA. Had it been fast enough, it could also prove useful if one day I needed to calculate all the distances to all the points, instead of just determining which points are close enough.

Anyway, I do intend to use the strategy you suggested. But as of right now, I want to find a reason to invest in a new NVIDIA card ;)

Thanks again!

My feelings exactly :D

That’s why I believe I’m doing something wrong in the CUDA implementation, which is, without a doubt, due to me not knowing enough about this technology.

I guess I need to read the guides…

Thanks!

I would have said much more… Probably your main bottleneck is memory transfers - the GPU is really happy to do some more computation in the meanwhile…

I see. Well, there are many easy ways to transform an O(N^2) algorithm into O(N^3) for that goal.

Seriously speaking, in your CUDA implementation you then need to cache the positions in shared memory.

The point is that you are going to evaluate N distances per point in the brute-force approach. You should preload blocks of positions into shared memory, taking care of coalescing and of bank conflicts (that is, interleaved accesses), to avoid too much traffic to global memory. For this you absolutely need to study the CUDA programming guide. This architecture is quite different from CPUs - memory transfers are optimized by letting adjacent threads work on adjacent (4 bytes apart) data, both in global and shared memory.

Shared memory is effectively a sort of L1 cache whose transfers must be programmed by hand: if you need to reuse data, you have to move it into shared memory yourself.
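To illustrate the pattern, here is a very rough tiling sketch (one thread per query point, with the reference points streamed through shared memory; to keep it short it only counts the close points per query, which is a simplification of what you actually need):

#define TILE 256   // threads per block = reference points staged per tile

extern "C" __global__ void countClose(float maxDistance2, int nQueries, int nRef,
                                      const float *qx, const float *qy, const float *qz,
                                      const float *rx, const float *ry, const float *rz,
                                      int *closeCount)
{
    __shared__ float sx[TILE], sy[TILE], sz[TILE];

    int q = blockIdx.x * blockDim.x + threadIdx.x;
    float px = 0.0f, py = 0.0f, pz = 0.0f;
    if (q < nQueries) { px = qx[q]; py = qy[q]; pz = qz[q]; }

    int count = 0;
    for (int base = 0; base < nRef; base += TILE) {
        // Cooperative, coalesced load of one tile of reference points.
        int j = base + threadIdx.x;
        sx[threadIdx.x] = (j < nRef) ? rx[j] : 0.0f;
        sy[threadIdx.x] = (j < nRef) ? ry[j] : 0.0f;
        sz[threadIdx.x] = (j < nRef) ? rz[j] : 0.0f;
        __syncthreads();

        // Every thread compares its query point against the whole tile.
        int limit = min(TILE, nRef - base);
        for (int k = 0; k < limit; ++k) {
            float dx = sx[k] - px, dy = sy[k] - py, dz = sz[k] - pz;
            if (dx*dx + dy*dy + dz*dz < maxDistance2) count++;
        }
        __syncthreads();
    }
    if (q < nQueries) closeCount[q] = count;
}

The point of the shared-memory staging is simply that each coordinate is read from global memory once per block instead of once per thread.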

No, no, it’s ok. It already takes 10 days to do it with the O(n^2) algorithm. ;)

Yes, all these different types of memory are going way over my head… Time to read the guide.

Anyway, I believe you guys have answered my main question, by assuring me that this is the type of problem in which CUDA should excel… Now it’s up to me to make it work… B)