Limitations of a CUDA kernel reached?

Hello, I am trying to code a 3x3-median filter for 2D image data in CUDA and observed some strange problems.

First, I’ve written and tested a sequential host code version of the filter, say
void median(float *data_in, float *filtered_data, int X_dim, int Y_dim)
where the function operates on given data arrays of given dimensions I pass in. This sequential code works.

Then, I’ve created a .cu-file with a function
__global__ void cuda_median(float *data_in, float *filtered_data, int X_dim, int Y_dim)
where the arrays are allocated in the device memory and the data is copied from host to device before execution (these steps give CUDA_SUCCESS). Actually, the code inside my kernel is the same as in the host version.

Everything compiles and links without any error in Visual Studio 2008 on Windows 7 64-bit.

Now if I invoke the kernel in a wrapper function inside the same .cu-file with a launch configuration of one block and one thread, that is, a single thread on the GPU does the entire work, it works for small data up to 300x60 pixels and the results have zero absolute error in comparison with the host code.

  1. Problem - CUDA failure:
    For bigger data (>300x600 pixels) the programme fails and Windows 7 even resets the display driver (266.58 64-bit). This is the case on an 8400 GS 512 MB and a Quadro FX3700M 1 GB. Why??!!
    Does CUDA reach its limitations?

  2. Problem - CUDA speed very low:
    The pure execution time of the CUDA kernel on 300x60 pixel data (without data copying etc.) is about 2 s on both the 8400 GS and the Quadro FX3700M, whereas the CPU needs only 0.004 s (Quad Core Extreme) and 0.0115 s (Athlon X2) respectively. Why is the CUDA thread so slow?

I would really like to know why such behaviour takes place!

Without seeing your code it is hard to say, but setting up the GPU (getting the context, allocating the data, copying the data to the GPU and/or back) takes some time. Probably that’s part of the reason why it is so much slower. Typically a compute-limited serial execution of code (without any compiler optimization) is about 10-20 times slower on a GPU than on a CPU, depending on the CPU and GPU clock frequencies. The driver reset is a timeout: each kernel has a runtime limit of roughly 5 s if the GPU is attached to a display.
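To separate setup and copy time from the kernel itself, the kernel can be timed with CUDA events. The following is only a sketch: `dummy_kernel` and `time_kernel_ms` are hypothetical names, and the kernel is a stand-in for the real filter, which was not posted.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real filter kernel: copies input to output.
__global__ void dummy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Returns the kernel-only time in milliseconds, or -1.0f on any CUDA error.
// d_in and d_out must already be device pointers holding n floats each.
float time_kernel_ms(const float *d_in, float *d_out, int n)
{
    assert(n > 0);
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    dummy_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);

    // Blocks until the kernel and the stop event have completed.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    float ms = -1.0f;
    if (err == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);
    else
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

A kernel that is killed by the display watchdog should show up here as an error string rather than as a time, which makes the two problems from the question easier to tell apart.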


Thank you very much, I was able to enforce longer kernel execution time limits in the OS and indeed, the kernel worked without problems.
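Whether such a run-time limit applies to a given device can be checked from the runtime API before launching long kernels. A minimal sketch, assuming device index 0 is the card of interest; `watchdog_active` is a hypothetical helper name, while `kernelExecTimeoutEnabled` is a field of `cudaDeviceProp`:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if a kernel run-time limit applies to `device`, 0 if not,
// and -1 if the device properties could not be queried.
int watchdog_active(int device)
{
    assert(device >= 0);
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("%s: kernel run-time limit %s\n", prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}
```

Calling `watchdog_active(0)` on a card that drives a display would be expected to report the limit as enabled.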

With regard to the execution time, starting the 3x3 median filter on the GeForce with 256 threads per block instead of 1 resulted in a 6x speedup compared with the sequential version of the filter on the CPU. I guess more speedup is possible after memory optimization…
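For readers following along, a one-thread-per-pixel version with 256 threads per block could look roughly like this. It is a sketch, not the poster's code: the names `median3x3_kernel` and `launch_median3x3`, the 16x16 block shape, and the border handling (clamping coordinates to the image) are all assumptions.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Swap so that a <= b afterwards.
__device__ inline void sort2(float &a, float &b)
{
    if (a > b) { float t = a; a = b; b = t; }
}

// One thread computes one output pixel.
__global__ void median3x3_kernel(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Gather the 3x3 neighbourhood, clamping at the borders (assumption).
    float v[9];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), width - 1);
            int yy = min(max(y + dy, 0), height - 1);
            v[k++] = in[yy * width + xx];
        }

    // Partial selection sort: after 5 passes v[4] holds the median.
    for (int i = 0; i < 5; ++i)
        for (int j = i + 1; j < 9; ++j)
            sort2(v[i], v[j]);

    out[y * width + x] = v[4];
}

// Launches the kernel on device pointers and waits for it to finish.
cudaError_t launch_median3x3(const float *d_in, float *d_out, int width, int height)
{
    assert(width > 0 && height > 0);
    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    median3x3_kernel<<<grid, block>>>(d_in, d_out, width, height);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

Every thread still reads its nine inputs straight from global memory, which is where the remaining memory optimization would come in.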

First of all, having one active thread on the GPU is really really really bad.

The GPU relies on having many threads (i.e. several thousand) to hide latency rather than on caches. That means you should have at least as many blocks as multi-cores (preferably at least an order of magnitude more) and at least 192 threads per block (there are exceptions either way, but that is a beginner’s rule of thumb). For the Quadro FX3700M we are talking 16 multi-cores, if memory serves. This is unlike the CPU, where you want about as many threads as cores.
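The rule of thumb can be checked with a little arithmetic against what the device reports. A rough sketch, assuming 16x16 blocks (256 threads each); `blocks_for_image` and `report_blocks_per_sm` are hypothetical helper names:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Number of 16x16 blocks needed to cover a width x height image.
inline int blocks_for_image(int width, int height)
{
    assert(width > 0 && height > 0);
    const int tile = 16;
    return ((width + tile - 1) / tile) * ((height + tile - 1) / tile);
}

// Prints how many blocks each multiprocessor ("multi-core") would get and
// returns that number rounded down, or -1 if the device cannot be queried.
int report_blocks_per_sm(int width, int height, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int blocks = blocks_for_image(width, height);
    printf("%dx%d image: %d blocks on %d multiprocessors\n",
           width, height, blocks, prop.multiProcessorCount);
    return blocks / prop.multiProcessorCount;
}
```

For the 300x60 image from the question this gives 19 x 4 = 76 blocks, so on 16 multi-cores that is fewer than 5 blocks each, well short of the "order of magnitude more" target; larger images fill the card better.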

What the GPU calls a core is actually an FPU (floating point unit). These are grouped into multi-cores with either 8 (your case), 32 (high-end Fermi) or 48 (low-end Fermi) cores per multi-core. Each multi-core then runs the threads in warps of 32 threads, so whatever you do you are always running at least 32 threads, 31 of which may be doing nothing.

What happens in your case is that you are using 1 of the 16 multi-cores of the GPU, and each operation takes 4 clock cycles rather than 1, because a full 32-thread warp is issued over 8 cores even though you are trying to run a single thread. That core also runs at a lower clock rate than your CPU (should be around a third), and the CPU additionally does dual issue on the instructions, so you need to multiply that by another two. Next, you are probably working directly on global memory, which means no caches (you need to be using shared memory here), whereas the CPU is running through its cache (although probably not as efficiently as it could if you are running in line order rather than block order).
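To illustrate the shared memory point, each block can first stage its tile of the image, plus a one-pixel border, in shared memory and then compute the medians from there. This is a sketch under assumptions: the names `median3x3_shared` and `launch_median3x3_shared`, the 16x16 tile size, and the clamped border handling are illustrative and not taken from the original code.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE 16  // 16x16 = 256 threads per block (illustrative choice)

// Must be launched with blockDim = (TILE, TILE).
__global__ void median3x3_shared(const float *in, float *out, int width, int height)
{
    // The block's tile plus a one-pixel halo on every side.
    __shared__ float tile[TILE + 2][TILE + 2];

    int x0 = blockIdx.x * TILE - 1;  // image x of tile column 0
    int y0 = blockIdx.y * TILE - 1;  // image y of tile row 0

    // Cooperative load: 256 threads fill the 18*18 entries in strided steps.
    int tid = threadIdx.y * TILE + threadIdx.x;
    for (int i = tid; i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        int ty = i / (TILE + 2), tx = i % (TILE + 2);
        int gx = min(max(x0 + tx, 0), width - 1);   // clamp at borders (assumption)
        int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = in[gy * width + gx];
    }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // The pixel sits at tile[threadIdx.y + 1][threadIdx.x + 1].
    float v[9];
    int k = 0;
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            v[k++] = tile[threadIdx.y + dy][threadIdx.x + dx];

    // Partial selection sort: after 5 passes v[4] holds the median.
    for (int i = 0; i < 5; ++i)
        for (int j = i + 1; j < 9; ++j)
            if (v[i] > v[j]) { float t = v[i]; v[i] = v[j]; v[j] = t; }

    out[y * width + x] = v[4];
}

// Launches the kernel on device pointers and waits for it to finish.
cudaError_t launch_median3x3_shared(const float *d_in, float *d_out, int width, int height)
{
    assert(width > 0 && height > 0);
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    median3x3_shared<<<grid, block>>>(d_in, d_out, width, height);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

With this layout each input pixel is fetched from global memory about once per block instead of up to nine times; how much that helps in practice would have to be measured on the cards in question.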

Bottom line, your GPU never stood a chance the way you are working. It’s not designed to work single threaded …