I’m hoping someone with some CUDA experience can point me in the right direction. I’ve got an application currently running on the CPU that I’d like to port to CUDA to gain some performance. My problem is before I even get too deep into porting the algorithm to CUDA it’s already much slower and I can’t figure out why. I’m hoping it’s something I don’t understand architecturally that’s biting me.
My data is a 1280x1024, 8-bit grayscale frame that I’m acquiring from a CMOS array. For the purposes of performance testing CUDA I’ve saved a frame to disk, and I load it before I start any timing so that I’m only measuring the true performance of the processing.
I allocate two buffers using cudaMalloc; in my simple example I use identically sized buffers for input and output. I cudaMemcpy my image into the source buffer, call my function (which launches the CUDA kernel I’ve written), and once it returns I cudaMemcpy the result out of the device buffer into my application buffer and stop my timer. I do a quick verification that the output is correct, but that’s outside the timer, since I won’t be doing it in the production application. Now, if I effectively NOP the CUDA kernel, this all runs in about 10-12 ms, which I can live with as overhead if the rest of the processing is fast.
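In case the structure matters, here is a sketch of roughly what my host-side test harness does (simplified; error checking and the actual timer calls are omitted, and processFrame is just a placeholder name for my wrapper that launches the kernel):

```cuda
// Host-side test flow (sketch, not my exact code).
const size_t N = 1280 * 1024;            // one 8-bit grayscale frame
unsigned char *dSrc, *dDst;
cudaMalloc((void**)&dSrc, N);
cudaMalloc((void**)&dDst, N);

// --- start timer ---
cudaMemcpy(dSrc, hostFrame, N, cudaMemcpyHostToDevice);
processFrame(dSrc, dDst, N);             // wrapper that launches my kernel
cudaMemcpy(hostResult, dDst, N, cudaMemcpyDeviceToHost);
// --- stop timer ---  (~10-12 ms with the kernel NOP'd out)
```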
My test CUDA kernel just iterates over the 1.3M pixels, totals them up, and copies them into the output buffer. Totally useless, but not entirely dissimilar to what I’ll be performing in the actual routine (though obviously much simpler). Running this with 1 thread gives me a time of about 1 second; 2 threads is about 330 ms, 4 threads about 195 ms, 8 threads 155 ms and, oddly enough, 16 threads 260 ms. It certainly seems I’m handling the threading at least partially right, since I do see performance gains as I scale up the thread count, but two things bother me. One: why do I see such a fall-off in performance between 8 and 16 threads? I’m guessing there’s overhead associated with spawning threads on the device, and once I hit 16 threads I’m no longer performing enough calculations to offset it.
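The kernel itself is roughly the following (a sketch; the real code differs in details, and totals is a small per-thread scratch buffer I added for illustration). Note that I launch it with a single block, e.g. sumKernel<<<1, numThreads>>>(...):

```cuda
// Each thread strides over the frame, keeps a running total,
// and copies pixels to the output buffer.
__global__ void sumKernel(const unsigned char *src, unsigned char *dst,
                          unsigned int *totals, size_t n)
{
    unsigned int sum = 0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) {
        sum += src[i];
        dst[i] = src[i];
    }
    totals[threadIdx.x] = sum;   // per-thread partial total
}
```

Since a single block runs on a single multiprocessor, I wonder if this launch configuration is part of what I’m misunderstanding architecturally.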
The second question is really the important one to me. On my notebook, a 2.5 GHz Core 2 Duo, I can iterate over the 1.3M pixels and total them up in approximately 4 ms. Why in the world would the same operation take 155 ms with 8 threads running in parallel on the device?
Background info, if anyone cares: the actual process is shining a laser (visible red) through some reduction filters onto a CMOS array, finding the center point of the laser spot (definitely not round) and then feeding that information back to the device pointing the laser, for accuracy correction. What leads me to try CUDA is that the final hardware will have 3-6 CMOS arrays, each running at 20+ fps, and the processing time for a single frame (via CPU code) is about 75 ms. Bar-napkin math says 3 arrays @ 20 fps needs more than a 4-core machine (running at a similar clock to my notebook) and 6 arrays needs a 9-core machine. Obviously those are pretty steep hardware requirements, and the OS prices will be ridiculous (the customer will definitely insist on Windows, unfortunately). My hope was to port to CUDA, buy a high-end video card, and be done with it.
Thanks for any suggestions or help.
Hardware info from deviceQuery:
CUDA Capability Minor revision number: 1
Total amount of global memory: 268435456 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.25 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)