Hello, I am trying to code a 3x3 median filter for 2D image data in CUDA and have run into some strange problems.
First, I’ve written and tested a sequential host code version of the filter, say
void median(float *data_in, float *filtered_data, int X_dim, int Y_dim)
where the function operates on given data arrays of given dimensions I pass in. This sequential code works.
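For readers who want something concrete to compare against, here is a minimal sketch of what such a sequential 3x3 median filter might look like. It is not the original code: the row-major layout, the clamp-to-edge border handling, the helper median9 and the parameter names X_dim/Y_dim are all assumptions made for illustration.

```c
/* Hypothetical helper: sorts the 9-element window in place with
   insertion sort and returns the middle element (the median). */
static float median9(float w[9])
{
    for (int i = 1; i < 9; ++i) {
        float v = w[i];
        int j = i - 1;
        while (j >= 0 && w[j] > v) {
            w[j + 1] = w[j];
            --j;
        }
        w[j + 1] = v;
    }
    return w[4];
}

/* Sequential 3x3 median filter.
   Assumptions: data is stored row-major (index = y * X_dim + x) and
   border pixels are handled by clamping coordinates to the image edge. */
void median(float *data_in, float *filtered_data, int X_dim, int Y_dim)
{
    for (int y = 0; y < Y_dim; ++y) {
        for (int x = 0; x < X_dim; ++x) {
            float w[9];
            int n = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy;
                    int xx = x + dx;
                    if (yy < 0) yy = 0;
                    if (yy >= Y_dim) yy = Y_dim - 1;
                    if (xx < 0) xx = 0;
                    if (xx >= X_dim) xx = X_dim - 1;
                    w[n++] = data_in[yy * X_dim + xx];
                }
            }
            filtered_data[y * X_dim + x] = median9(w);
        }
    }
}
```

Each output pixel costs nine reads plus a small sort, so the total work grows linearly with X_dim * Y_dim. That is worth keeping in mind when the same loop nest is executed by a single GPU thread.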
Then, I’ve created a CUDA kernel file (.cu) with a function
__global__ void cuda_median(float *data_in, float *filtered_data, int X_dim, int Y_dim)
where the arrays are allocated in device memory and the data is copied from host to device before execution (these steps return CUDA_SUCCESS). The code inside my kernel is the same as in the host version.
Everything compiles and links without any error in Visual Studio 2008 on Windows 7 64-bit.
Now if I invoke the kernel in a wrapper function inside the same .cu file with a launch configuration of one block containing one thread (<<<1, 1>>>), that is, with a single thread on the GPU doing the entire work, it works for small data up to 300x60 pixels and the results have zero absolute error in comparison with the host code.
Problem - CUDA failure:
For bigger data (>300x600 pixels) the program fails and Windows 7 even resets the display driver (266.58 64-bit). This happens on both an 8400 GS 512 MB and a Quadro FX3700M 1 GB. Why?
Is CUDA reaching its limits here?
Problem - CUDA speed very low:
The pure execution time of the CUDA kernel on 300x60 pixel data (without data copying etc.) is about 2 s on both the 8400 GS and the Quadro FX3700M, whereas the CPU needs only 0.004 s (Quad Core Extreme) or 0.0115 s (Athlon X2). Why is the CUDA thread so slow?
I would really like to know why such behaviour takes place!