My current graphics card is for experiments only: it's an 8500 GT (16 stream processors, 128-bit memory bus, and on top of that it sits in the second PCI-e slot, which is slower, etc.).
I've implemented the kernel I need and it actually works (and, I hope, not too slowly: it does not diverge, it uses coalesced memory access, it works with shared memory, etc.). Since my kernel uses a lot of shared memory per thread (for stacks), I can run only about 32-42 threads per block. No matter how many test data sets I provide, the pure kernel run time (without malloc/memcpy and other overhead like that) is always greater than the time for the same calculations on the CPU (including CPU malloc and other service work).
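For context, here is the back-of-envelope arithmetic behind that thread-per-block limit. The 16 KiB of shared memory per block is the published G8x-era figure; the ~384-byte per-thread stack is a hypothetical number I picked so the result lands in the range I actually see, not something I measured:

```python
# Rough estimate of how a per-thread shared-memory stack caps block size.
# Assumptions: 16 KiB of shared memory per block (G8x-era limit) and a
# hypothetical stack of ~384 bytes per thread -- adjust for the real kernel.
SHARED_MEM_PER_BLOCK = 16 * 1024   # bytes available to one block
STACK_BYTES_PER_THREAD = 384       # hypothetical per-thread stack size
WARP_SIZE = 32                     # threads are scheduled in warps of 32

max_threads = SHARED_MEM_PER_BLOCK // STACK_BYTES_PER_THREAD
# Rounding down to a whole number of warps gives the practical block size.
warp_aligned = (max_threads // WARP_SIZE) * WARP_SIZE

print(max_threads)   # 42
print(warp_aligned)  # 32
```

With these assumed numbers the raw limit comes out to 42 threads and the warp-aligned block size to 32, which matches the 32-42 range I observe.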
When the workload is not huge (about 100,000 test cases) the GPU kernel is about 2.5 times slower than the CPU (Athlon X2 4800+, calculations done in a single thread for correct timing). When the workload becomes really huge (about 10 million test cases) the CPU is only about 20% faster than the GPU, but it is still faster. However, a typical workload for my tasks is about 10,000-100,000 test cases.
What I'm trying to figure out is an estimate of my kernel's performance on modern NVIDIA hardware. The specs of the 9800-series cards are known, but I'm not sure whether the kernel's speed relative to the 8500 will grow linearly, better than linearly, or otherwise. The main problem (I guess) is the low number of threads per block in my case, but that is limited by the amount of shared memory.
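A naive linear extrapolation by stream-processor count would look like the sketch below. The SP counts are the published figures for the 8500 GT (16) and the 9800 GTX (128); everything else (clocks, memory bandwidth, and especially my occupancy limit) is ignored, so this is a best-case guess, not a prediction:

```python
# Naive, best-case scaling estimate: assume kernel throughput scales with
# the number of stream processors (SPs). Real scaling also depends on clock
# speed, memory bandwidth, and occupancy, so treat this as an upper bound.
sp_8500_gt = 16    # published SP count for the 8500 GT
sp_9800_gtx = 128  # published SP count for the 9800 GTX

scale = sp_9800_gtx / sp_8500_gt   # 8.0x in the ideal linear case

# Measured on the 8500 GT: the CPU is ~2.5x faster on ~100k cases
# and ~1.2x faster (20%) on ~10M cases.
cpu_over_gpu_small = 2.5
cpu_over_gpu_large = 1.2

# Projected GPU-vs-CPU speedup if scaling really were linear:
print(scale / cpu_over_gpu_small)  # ~3.2x faster than the CPU, small loads
print(scale / cpu_over_gpu_large)  # ~6.7x faster, large loads
```

Whether anything close to this 8x is achievable with only 32-42 threads per block is exactly the part I can't judge, which is why I'm asking.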
I have to decide whether GPUs can speed my tasks up. Maybe somebody has faced the same choice at some point; any opinion would be greatly appreciated.
Thanks in advance.