Hi, I am writing a program that classifies points by searching for their k nearest neighbours, and I got stuck (maybe because of slow hardware or some bugs in my program).
I decided to try to speed up only the parts where the distance is calculated (Manhattan distance, Euclidean distance). So at the moment I am timing only the kernel execution and, on the CPU, only the part where the distance is calculated.
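For reference, the CPU side of such a search can be sketched as below. This is a minimal sketch, not the attached code. It assumes points stored row-major as `float` arrays of `dims` coordinates, integer class labels in `[0, MAX_CLASSES)`, and 1 ≤ k ≤ 16. The names `manhattan`, `euclidean_sq` and `knn_classify` are hypothetical. It uses the squared Euclidean distance, which ranks neighbours in the same order as the true Euclidean distance but skips a square root per point.

```c
#include <stddef.h>

#define MAX_K 16       /* largest k supported by this sketch */
#define MAX_CLASSES 16 /* labels are assumed to lie in [0, MAX_CLASSES) */

/* Manhattan (L1) distance between two points with `dims` coordinates. */
float manhattan(const float *a, const float *b, int dims)
{
    float sum = 0.0f;
    for (int d = 0; d < dims; ++d) {
        float diff = a[d] - b[d];
        sum += diff < 0.0f ? -diff : diff;
    }
    return sum;
}

/* Squared Euclidean distance: same neighbour ordering as L2, no sqrt. */
float euclidean_sq(const float *a, const float *b, int dims)
{
    float sum = 0.0f;
    for (int d = 0; d < dims; ++d) {
        float diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

/* Brute-force k-NN: compare `query` with all `n` database points (row-major,
   `dims` floats each), keep the k closest, and return the majority label
   (ties go to the smaller label). Returns -1 if k is out of range. */
int knn_classify(const float *db, const int *labels, size_t n, int dims,
                 const float *query, int k, int num_classes,
                 float (*dist)(const float *, const float *, int))
{
    float best_d[MAX_K];
    int best_label[MAX_K];
    int found = 0;

    if (k < 1 || k > MAX_K)
        return -1;
    if (num_classes > MAX_CLASSES)
        num_classes = MAX_CLASSES;

    for (size_t i = 0; i < n; ++i) {
        float d = dist(db + i * (size_t)dims, query, dims);
        int j;
        if (found < k)
            j = found++;              /* list not full yet: append */
        else if (d < best_d[k - 1])
            j = k - 1;                /* replace the current worst candidate */
        else
            continue;
        /* Shift larger distances right so best_d stays sorted ascending. */
        while (j > 0 && best_d[j - 1] > d) {
            best_d[j] = best_d[j - 1];
            best_label[j] = best_label[j - 1];
            --j;
        }
        best_d[j] = d;
        best_label[j] = labels[i];
    }

    /* Majority vote among the k nearest neighbours. */
    int votes[MAX_CLASSES] = {0};
    for (int j = 0; j < found; ++j)
        votes[best_label[j]]++;

    int winner = 0;
    for (int c = 1; c < num_classes; ++c)
        if (votes[c] > votes[winner])
            winner = c;
    return winner;
}
```

Calling `knn_classify(db, labels, n, 2, query, k, num_classes, manhattan)` classifies a 2D query with the Manhattan metric; passing `euclidean_sq` instead switches to the Euclidean ordering.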
The problem is that, no matter which metric I choose, I get at most a 15% speed-up :( and if I add another dimension to the calculation, I even get a slow-down.
I don't know whether this is due to the slow graphics card (8400 GS, 256 MB memory), some bugs, or poor performance of my program.
Anyway, at 2097152*2 elements, using 32 threads per block and 256*256 blocks, the speed-up seems to be about 15%. If there is already such a speed-up at this size, I would expect an even bigger one on larger data, but at 8388608*2 elements, using 32 threads per block and 512*512 blocks, the percentage is about the same.
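The element counts in this post appear to follow from one thread per database point with 2 coordinates per point. A small sketch of that arithmetic, assuming a 2D grid of 1D blocks (the attached code may lay this out differently). `total_threads`, `total_elements` and `global_index` are hypothetical helpers; `global_index` is a CPU mirror of the index expression a kernel would build from `blockIdx`, `gridDim`, `blockDim` and `threadIdx`.

```c
#include <stddef.h>

/* Number of threads launched by a grid of grid_x * grid_y blocks. */
size_t total_threads(size_t grid_x, size_t grid_y, size_t threads_per_block)
{
    return grid_x * grid_y * threads_per_block;
}

/* Number of floats in the database when each thread handles one point
   with `dims` coordinates. */
size_t total_elements(size_t grid_x, size_t grid_y,
                      size_t threads_per_block, size_t dims)
{
    return total_threads(grid_x, grid_y, threads_per_block) * dims;
}

/* CPU mirror of a common global thread index for a 2D grid of 1D blocks:
   (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x */
size_t global_index(size_t block_x, size_t block_y, size_t grid_x,
                    size_t block_dim_x, size_t thread_x)
{
    return (block_y * grid_x + block_x) * block_dim_x + thread_x;
}
```

For example, `total_elements(512, 512, 32, 2)` evaluates to 16777216, that is 8388608 points with 2 coordinates each, and the last thread of a 256*256 grid with 32 threads per block gets index 2097151.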
Any advice on how to get a higher speed-up?
And then the second problem:
I tried to run it on even larger data, but weird things started to happen (sometimes the screen even got garbled until I restarted). It worked fine in emulation mode, but no longer when executing the program on the graphics card. Running with 64 threads per block and 512*512 blocks (that is 16777216*2 elements in the database), I get a wrong result back from the device; in emulation mode, however, it works fine.
By my calculations, 16777216*(2+1) elements (the +1 is for the output result) should use 201 326 592 bytes of memory, so this shouldn't be an issue.
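The memory estimate can be checked with a few lines of arithmetic. A sketch assuming 4-byte `float` coordinates and outputs, and one output value per point (the "+1" above). `device_bytes_needed` and `fits_in_global_memory` are hypothetical names, and the comparison uses the device total, which is only an upper bound on what a program can actually allocate.

```c
#include <stddef.h>

/* Bytes needed for `points` database points with `dims` float coordinates
   each, plus `outputs_per_point` floats of result per point. */
size_t device_bytes_needed(size_t points, size_t dims, size_t outputs_per_point)
{
    return points * (dims + outputs_per_point) * sizeof(float);
}

/* Nonzero if the estimate fits in `total_bytes` of global memory.
   The device total is an upper bound: memory already in use (for example
   by the display) is not available to the program. */
int fits_in_global_memory(size_t needed, size_t total_bytes)
{
    return needed <= total_bytes;
}
```

With the numbers from this post, `device_bytes_needed(16777216, 2, 1)` is 201 326 592 bytes, below the 267 714 560 bytes the card reports. A third coordinate at the same size would give 268 435 456 bytes, which no longer fits.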
Any ideas what could be wrong?
Any other hints on how to tweak my program are also welcome.
Attached code: nearestNeighborEng.tar (20 KB)
Hardware on which I was testing:
AMD 64 X2 Dual Core Processor 4000+
2GB RAM, dual channel
GeForce 8400 GS, total amount of global memory: 267 714 560 bytes