Limitations of a CUDA kernel reached?

Hello, I am trying to code a 3x3 median filter for 2D image data in CUDA and have run into some strange problems.

First, I’ve written and tested a sequential host-code version of the filter, say
void median(float *data_in, float *filtered_data, int x_dim, int y_dim)
where the function operates on the data arrays and dimensions I pass in. This sequential code works.
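For illustration, the host filter looks roughly like this sketch (border clamping and the 9-element insertion sort are just one possible choice; the exact code may differ):

void median(float *data_in, float *filtered_data, int x_dim, int y_dim)
{
    for (int y = 0; y < y_dim; ++y) {
        for (int x = 0; x < x_dim; ++x) {
            float w[9];
            int n = 0;
            /* gather the 3x3 neighbourhood, clamping at the image border */
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = x + dx < 0 ? 0 : (x + dx >= x_dim ? x_dim - 1 : x + dx);
                    int sy = y + dy < 0 ? 0 : (y + dy >= y_dim ? y_dim - 1 : y + dy);
                    w[n++] = data_in[sy * x_dim + sx];
                }
            }
            /* insertion sort of 9 values; the median is element 4 */
            for (int i = 1; i < 9; ++i) {
                float v = w[i];
                int j = i - 1;
                while (j >= 0 && w[j] > v) { w[j + 1] = w[j]; --j; }
                w[j + 1] = v;
            }
            filtered_data[y * x_dim + x] = w[4];
        }
    }
}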

Then, I’ve created a CUDA kernel in a .cu file with a function
__global__ void cuda_median(float *data_in, float *filtered_data, int x_dim, int y_dim)
where the arrays are allocated in device memory and the data is copied from host to device before execution (these steps return CUDA_SUCCESS). The code inside my kernel is essentially the same as in the host version.
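The host-side setup around the kernel is the usual allocate/copy sequence, roughly like this (buffer names are placeholders):

__global__ void cuda_median(float *data_in, float *filtered_data, int x_dim, int y_dim);

void run_filter(const float *h_in, float *h_out, int x_dim, int y_dim)
{
    size_t bytes = (size_t)x_dim * y_dim * sizeof(float);
    float *d_in = 0, *d_out = 0;

    /* allocate device buffers and copy the input image to the GPU */
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    cuda_median<<<1, 1>>>(d_in, d_out, x_dim, y_dim);   /* single thread, as described */

    /* check for launch errors, then copy the result back */
    if (cudaGetLastError() != cudaSuccess) { /* handle error */ }
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}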

Everything compiles and links without errors in Visual Studio 2008 on Windows 7 64-bit.

Now if I invoke the kernel in a wrapper function inside the same .cu file via
cuda_median<<<1,1>>>(…);
that is, a single GPU thread does the entire work, it works for small images (up to 300x60 pixels) and the result has zero absolute error compared with the host code.

  1. Problem - CUDA failure:
    For bigger data (>300x600 pixels) the program fails and Windows 7 even resets the display driver (266.58, 64-bit). This happens on both an 8400 GS (512 MB) and a Quadro FX3700M (1 GB). Why?!
    Have I reached the limits of CUDA?

  2. Problem - CUDA speed very low:
    The pure execution time of the CUDA kernel on 300x60 pixel data (without data copying etc.) is about 2 s on both the 8400 GS and the Quadro FX3700M, whereas the CPU needs only 0.004 s (Quad Core Extreme) and 0.0115 s (Athlon X2) respectively. Why is the CUDA thread so slow?

I would really like to know why this behaviour occurs!

Without seeing your code: setting up the GPU (creating the context, allocating memory, copying data to the GPU and/or back) takes some time, which is probably why it is so much slower. Typically, compute-limited serial execution of code on the GPU (without any compiler optimization) is about 10-20 times slower than on a CPU, depending on the CPU/GPU clock frequencies. The driver reset comes from the watchdog: each kernel has a roughly 5 s runtime limit if the GPU is attached to a display.
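If you want to measure only the kernel itself, excluding the setup and the copies, CUDA events are one way to do it; a sketch, reusing the device buffers from your setup:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cuda_median<<<1, 1>>>(d_in, d_out, x_dim, y_dim);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   /* wait for the kernel to finish */

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);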

Ceearem

Thank you very much. I was able to raise the kernel execution time limit in the OS, and indeed the kernel then ran without problems.

With regard to execution time: launching the 3D median filter on the GeForce with 256 threads per block instead of 1 resulted in a 6x speedup over the sequential CPU version of the filter. I guess more speedup is possible after memory optimization…
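For reference, the parallel launch looks roughly like this (16x16 = 256 threads per block; border clamping as in the host version):

__global__ void cuda_median(float *data_in, float *filtered_data, int x_dim, int y_dim)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= x_dim || y >= y_dim)
        return;                                /* guard threads outside the image */

    /* same per-pixel work as the host loop: gather, sort, write the median */
    float w[9];
    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = min(max(x + dx, 0), x_dim - 1);
            int sy = min(max(y + dy, 0), y_dim - 1);
            w[n++] = data_in[sy * x_dim + sx];
        }
    for (int i = 1; i < 9; ++i) {
        float v = w[i];
        int j = i - 1;
        while (j >= 0 && w[j] > v) { w[j + 1] = w[j]; --j; }
        w[j + 1] = v;
    }
    filtered_data[y * x_dim + x] = w[4];
}

/* launch: 256 threads per block, grid rounded up to cover the whole image */
dim3 block(16, 16);
dim3 grid((x_dim + block.x - 1) / block.x, (y_dim + block.y - 1) / block.y);
cuda_median<<<grid, block>>>(d_in, d_out, x_dim, y_dim);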

First of all, having one active thread on the GPU is really really really bad.

The GPU relies on having many threads (several thousand) to hide latency, rather than on caches. That means you should have at least as many blocks as multiprocessors (preferably an order of magnitude more) and at least 192 threads per block (there are exceptions either way, but that is a beginner's rule of thumb). For the Quadro FX3700M we are talking about 16 multiprocessors, if memory serves. This is unlike the CPU, where you want about as many threads as cores.

What the GPU calls a core is actually an FPU (floating-point unit). These are grouped into multiprocessors with 8 (your case), 32 (high-end Fermi) or 48 (low-end Fermi) cores each. Each multiprocessor then runs its threads in warps of 32, so whatever you do, you are always running at least 32 threads, 31 of which may be doing nothing.

What happens in your case: you are using 1/16 of the GPU's multiprocessors, and each operation takes 4 clock cycles rather than 1 (because a full warp runs even though you only want a single thread), on a core that runs at a lower clock rate than your CPU (around a third). The CPU, in turn, also dual-issues instructions, so multiply that by another two. Next, you are probably reading and writing global memory directly, which means no caches (you need to use shared memory here), whereas the CPU runs through its cache (although probably not as efficiently as it could if you are traversing in line order rather than block order).
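A sketch of the shared-memory version: each block stages a 16x16 tile plus a one-pixel halo into shared memory, then every thread reads its 3x3 window from the fast on-chip tile instead of hitting global memory nine times (the tile size and the clamped border are illustrative choices):

#define TILE 16

__global__ void cuda_median_shared(const float *data_in, float *filtered_data,
                                   int x_dim, int y_dim)
{
    __shared__ float tile[TILE + 2][TILE + 2];   /* 16x16 pixels + 1-pixel halo */

    int bx = blockIdx.x * TILE;
    int by = blockIdx.y * TILE;

    /* cooperatively load the tile plus halo, clamping at the image border */
    for (int ty = threadIdx.y; ty < TILE + 2; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2; tx += TILE) {
            int sx = min(max(bx + tx - 1, 0), x_dim - 1);
            int sy = min(max(by + ty - 1, 0), y_dim - 1);
            tile[ty][tx] = data_in[sy * x_dim + sx];
        }
    __syncthreads();

    int x = bx + threadIdx.x;
    int y = by + threadIdx.y;
    if (x >= x_dim || y >= y_dim)
        return;

    /* the 3x3 window now comes from shared memory, not global memory */
    float w[9];
    int n = 0;
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            w[n++] = tile[threadIdx.y + dy][threadIdx.x + dx];

    /* insertion sort of 9 values; the median is element 4 */
    for (int i = 1; i < 9; ++i) {
        float v = w[i];
        int j = i - 1;
        while (j >= 0 && w[j] > v) { w[j + 1] = w[j]; --j; }
        w[j + 1] = v;
    }
    filtered_data[y * x_dim + x] = w[4];
}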

Bottom line: your GPU never stood a chance the way you are working. It’s not designed to run single-threaded …