Odd performance problem/question

I’m hoping someone with some CUDA experience can point me in the right direction. I’ve got an application currently running on the CPU that I’d like to port to CUDA to gain some performance. My problem is that, before I even get very deep into porting the algorithm to CUDA, it’s already much slower and I can’t figure out why. I’m hoping it’s something architectural that I don’t understand that’s biting me.

My data is a 1280x1024, 8-bit grayscale frame that I’m acquiring from a CMOS array. For the purposes of performance testing CUDA I’ve saved a frame to disk, and I load it before I start any timing so that I’m only looking at the true performance of the processing.

I allocate two buffers using cudaMalloc; in my simple example I use identically sized buffers for input and output. I cudaMemcpy my image into the source buffer, call my function which launches the CUDA kernel I’ve written, and once it returns I cudaMemcpy the result out of the device buffer into my application buffer and stop my timer. I do a quick verification that the output is correct, but that’s outside the timer, since I won’t be doing that in the production application. Now, if I effectively NOP the CUDA kernel, this all runs in about 10-12 ms, which I can live with as overhead if the rest of the processing is fast.
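In outline, the host side looks something like this (heavily simplified: a placeholder copy kernel stands in for my real one, the names are illustrative, and error checking is omitted):

#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel: one thread per pixel, just copies input to output.
__global__ void copyKernel(const unsigned char* src, unsigned char* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

int main()
{
    const int width = 1280, height = 1024;
    const int n = width * height;                   // 8-bit grayscale frame

    std::vector<unsigned char> h_frame(n), h_result(n);
    // ... frame is loaded from disk here, before any timing starts ...

    unsigned char *d_src = 0, *d_dst = 0;
    cudaMalloc((void**)&d_src, n);
    cudaMalloc((void**)&d_dst, n);

    // --- timed region ---
    cudaMemcpy(d_src, h_frame.data(), n, cudaMemcpyHostToDevice);
    copyKernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
    cudaMemcpy(h_result.data(), d_dst, n, cudaMemcpyDeviceToHost);  // blocks until the kernel finishes
    // --- end timed region ---

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}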

My test CUDA kernel just iterates over the 1.3M pixels, totals them up and copies them into the output buffer. Totally useless, but not entirely dissimilar to what I’ll be doing in the actual routine (though obviously much simpler). Running this with 1 thread gives me a time of about 1 second, 2 threads is about 330 ms, 4 threads about 195 ms, 8 threads 155 ms and, oddly enough, 16 threads 260 ms. It certainly seems I’m handling the threading at least partially right, since I do see performance gains as I scale up the thread count, but two things bother me. First, why am I seeing such a fall-off in performance between 8 and 16 threads? I’m guessing there’s overhead associated with spawning threads on the device, and once I hit 16 threads I’m no longer performing enough calculations per thread to offset it.

The second question is really the important one to me. On my notebook (2.5 GHz Core 2 Duo) I can iterate over the 1.3M pixels and total them up in approximately 4 ms. Why in the world would the same operation take 155 ms with 8 threads running in parallel on the device?

Background info, if anyone cares. The actual process is shining a laser (visible red) through some reduction filters onto a CMOS array, finding the center point of the laser spot (definitely not round) and then feeding that information back to the device pointing the laser, for accuracy correction. The problem that leads me to try this on CUDA is that my final hardware will have 3-6 CMOS arrays, each running 20+ fps, and the processing time for a single frame (via the CPU code) is about 75 ms. Bar-napkin math says that 3 arrays @ 20 fps needs more than a 4-core machine (running at a similar clock to my notebook) and 6 needs a 9-core machine. Obviously these are pretty steep hardware requirements, and the OS prices will be ridiculous (the customer will definitely insist on Windows, unfortunately). My hope was to port over to CUDA, buy a high-end video card and be done with it.

Thanks for any suggestions or help.

Hardware info from deviceQuery:

CUDA Capability Minor revision number: 1
Total amount of global memory: 268435456 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.25 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

I am going to take a wild guess that your kernel contains a loop around the per-pixel operation, and that as you increase the thread count you are just adjusting the number of loop iterations each thread performs?
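i.e. something along these lines (just my guess at the shape of it, with made-up names):

__global__ void sumAndCopy(const unsigned char* src, unsigned char* dst,
                           unsigned int* partial, int n)
{
    // Launched as <<<1, nthreads>>> with nthreads = 1, 2, 4, 8, 16...
    int nthreads = blockDim.x;
    int chunk = n / nthreads;
    int start = threadIdx.x * chunk;
    int end   = (threadIdx.x == nthreads - 1) ? n : start + chunk;

    unsigned int sum = 0;
    for (int i = start; i < end; ++i) {    // each thread walks its slice serially
        sum += src[i];
        dst[i] = src[i];
    }
    partial[threadIdx.x] = sum;            // per-thread partial sum
}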

That sort of approach (treating the GPU like a very wide SMP machine with an OpenMP style of programming) is basically incompatible with the CUDA programming model. You don’t want to launch a handful of threads, you want thousands, most likely using a kernel where one thread performs the “atomic” operation on a single pixel (i.e. a 1 Mpixel image would use 1 million threads). In CUDA, thread creation, scheduling and context switching are done in hardware and are basically free. What isn’t free is memory access. It is very slow and optimised for only a handful of access patterns, with no cache. The slow performance you are seeing is down to memory latency. It is only by launching a large number of threads that you can hide that latency and get reasonable performance.
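For the per-pixel style the kernel skeleton looks more like this (illustrative only; your real per-pixel work replaces the copy):

__global__ void perPixel(const unsigned char* src, unsigned char* dst,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int i = y * width + x;
        dst[i] = src[i];          // per-pixel work goes here
    }
}

// Launched with a 2D grid covering the whole 1280x1024 frame:
// dim3 block(16, 16);                             // 256 threads per block
// dim3 grid((1280 + 15) / 16, (1024 + 15) / 16);  // 80 x 64 = 5120 blocks, ~1.3M threads
// perPixel<<<grid, block>>>(d_src, d_dst, 1280, 1024);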

You might find reading Chapters 5 and 6 of the CUDA programming guide instructive.

Ok, I think I understand what you’re saying, but I’m not sure my algorithm maps onto that. Ultimately I need to walk down each column and across each row, totalling values as I go and noting the first and last pixel in the row or column with a value above my threshold. I guess I could try creating one thread per row and one per column and see what happens then.
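Something like this, maybe, for the row pass (one thread per row, with a matching kernel for the columns)? Just a first stab, names made up:

__global__ void rowStats(const unsigned char* src, int width, int height,
                         unsigned char threshold,
                         unsigned int* rowSum, int* firstHit, int* lastHit)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;

    const unsigned char* p = src + row * width;
    unsigned int sum = 0;
    int first = -1, last = -1;

    for (int x = 0; x < width; ++x) {
        unsigned char v = p[x];
        sum += v;
        if (v > threshold) {
            if (first < 0) first = x;   // first pixel above threshold
            last = x;                   // keep overwriting to get the last one
        }
    }
    rowSum[row]   = sum;
    firstHit[row] = first;
    lastHit[row]  = last;
}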

Well, thanks for pointing me in a direction at least, I appreciate it.

That sort of algorithm can probably be done using a form of parallel reduction. There are a couple of examples in the CUDA SDK that might be worth a look.
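For the summing part, the usual pattern is a block-level shared memory reduction, something like this (same idea as the SDK reduction sample, simplified; assumes the block size is a power of two):

__global__ void blockSum(const unsigned char* src, unsigned int* partial, int n)
{
    extern __shared__ unsigned int s[];    // one slot per thread in the block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? src[i] : 0;         // load one pixel per thread
    __syncthreads();

    // Tree reduction within the block: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = s[0];        // one partial sum per block
}

// Launched e.g. as:
// blockSum<<<numBlocks, 256, 256 * sizeof(unsigned int)>>>(d_src, d_partial, n);
// The per-block partials are then summed on the host or in a second, much smaller launch.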