k-nearest neighbour tweaking, hints, etc...

Hi, I am writing a program to classify points by searching for the k nearest neighbours, and I got stuck (maybe because of slow hardware or some bugs in my program).

I decided to try to speed up only the parts where the distance is calculated (Manhattan distance, Euclidean distance). So at the moment I am measuring only kernel execution against the part of the CPU code where the distance is calculated.
The problem is that, no matter which metric I choose, I get only up to 15% speed-up :( and if I add another dimension to the calculation I even get a slow-down.
I don't know if this is due to the slow graphics card (8400 GS, 256 MB memory), some bugs, or bad performance of my program.
Anyway, at 2097152*2 elements, using 32 threads and 256*256 blocks, the speed-up seems to be about 15%. If there is such a speed-up at this size, I would expect an even bigger one on larger data, but at 8388608*2 elements, using 32 threads and 512*512 blocks, the percentage is about the same.
Any advice on how to gain a higher speed-up?

And then second problem:
I tried to run on even larger data, but weird things started to happen (sometimes the screen even got messy until a restart). It worked fine in emulation mode, but not anymore when executing the program on the graphics card. Running with 64 threads and 512*512 blocks (that is 16777216*2 elements in the database), I get wrong results back from the device; in emulation mode, however, it works fine.
By my calculations 16777216*(2+1) elements (the +1 is for the output result) should use 201 326 592 bytes of memory, so memory size shouldn't be the issue.
Any ideas what could be wrong?

Any other hints on how to tweak my program are also welcome.



Hardware on which i was testing:
Ubuntu 7.10
AMD 64 X2 Dual Core Processor 4000+
2GB RAM, dual channel
GeForce 8400 GS, Total amount of global memory: 267 714 560 bytes
nearestNeighborEng.tar (20 KB)

Your memory reads from dataBase are not coalesced (see the programming guide for details). Coalescing can increase performance by a factor of 10 or more if you are memory bound.

Don’t worry about having thread 1 load the x & y values of point 1, etc. Just get the data from global memory into your shared array with a coalesced load, writing into shared memory without bank conflicts:

int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x;
sdata[threadIdx.x] = dataBase[i * DIMENSION + threadIdx.x];
sdata[threadIdx.x + THREADS] = dataBase[i * DIMENSION + THREADS + threadIdx.x];
/* Do what you want to */
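Filled out as a complete kernel, the pattern above might look something like this (my own sketch, assuming 2-D points stored as interleaved x,y pairs and a Manhattan-distance query; the names are illustrative, not from the attached code):

```cuda
#define THREADS   64   /* threads per block     */
#define DIMENSION 2    /* coordinates per point */

/* Each block loads THREADS points (THREADS*DIMENSION floats) into
   shared memory with coalesced reads, then each thread computes the
   Manhattan distance from "its" point to the query point (qx, qy). */
__global__ void manhattanKernel(const float *dataBase, float *dist,
                                float qx, float qy)
{
    __shared__ float sdata[THREADS * DIMENSION];

    int base = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x;

    /* Coalesced loads: consecutive threads read consecutive floats. */
    sdata[threadIdx.x]           = dataBase[base * DIMENSION + threadIdx.x];
    sdata[threadIdx.x + THREADS] = dataBase[base * DIMENSION + THREADS + threadIdx.x];
    __syncthreads();

    /* Thread t now works on point t of this block: its x is
       sdata[2*t] and its y is sdata[2*t + 1]. */
    float x = sdata[DIMENSION * threadIdx.x];
    float y = sdata[DIMENSION * threadIdx.x + 1];
    dist[base + threadIdx.x] = fabsf(x - qx) + fabsf(y - qy);
}
```

It would be launched with something like `dim3 grid(256, 256); manhattanKernel<<<grid, THREADS>>>(dataBase, dist, qx, qy);`. The interleaved reads from shared memory do cost a 2-way bank conflict, but that is negligible next to the gain from coalescing the global loads.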

The other thing is that you want more than 32 threads per block. You want as few blocks as possible, provided you are achieving good occupancy and can fit what you need into shared memory. For example, target something like 32 blocks or so, so that each multiprocessor gets two blocks running on it to hide the latency of syncs & global loads.

  • Paul

thanks to all for help

one more question:
is it possible to get about a 200x speed-up on an 8800 GTS 320 MB, 2.4 GHz Core 2 Duo, 2 GB RAM,
and about an 18x speed-up on an 8400 GS, Athlon 64 X2 4000+, 2 GB RAM?

speed-up is measured as: timeCPU / timeGPU

I am timing only the part where the kernel is executed, not where memory is being uploaded; also I don’t have any ‘for’ loops or ‘if’ statements (except one) in my kernel code.
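For what it's worth, a common way to time only the kernel (and not the transfers) is with CUDA events; a sketch, where the kernel name and launch configuration are placeholders:

```cuda
cudaEvent_t start, stop;
float msGPU;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, threads>>>(/* ... */);   /* placeholder kernel launch */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               /* wait for the kernel to finish */

cudaEventElapsedTime(&msGPU, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

/* speed-up = timeCPU / msGPU, with timeCPU measured over the same work */
```

Because kernel launches are asynchronous, the `cudaEventSynchronize` (or a `cudaThreadSynchronize` before stopping a CPU timer) matters: without it you would only measure the launch overhead, not the execution.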

thanks again for answers

I had a 50x speedup on an 8800 GTX vs. a quad-core 3.66 GHz CPU, where my kernel had a for loop and not a lot of calculation (so memory bound), including the data transfer. So I think yes.