Roughly the same processing time for global and shared mem

Hello. I modified the simpleTexture program included in the SDK browser to perform a very simple blur. I’m using a block size of 15x15, and I’m returning the image without its borders (I receive a 512x512 picture and return a 510x510 one).
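For reference, the launch configuration looks roughly like this (blurKernel and d_odata are placeholder names; my actual code is in the attachment):

dim3 dimBlock(15, 15);                             // 225 threads per block
dim3 dimGrid(510 / dimBlock.x, 510 / dimBlock.y);  // 34x34 blocks cover the 510x510 output
blurKernel<<<dimGrid, dimBlock>>>(d_odata, 510, 510);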

NOT USING SHARED MEMORY: 738 ms

USING SHARED MEMORY: 733 ms

What’s wrong? Shouldn’t the texture mem → shared mem → global mem data transfers be faster than texture mem → global mem? How can I circumvent this problem?

Thank you.

No, there is no reason for that to be true. Shared memory is only an advantage if it allows you to avoid unnecessary repetition of global memory reads, or if it helps you convert uncoalesced global memory reads into coalesced ones. In the example you show, the texture cache is already doing this for you, so the staging in shared memory is completely unnecessary.
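To make that concrete, a direct texture-fetch version needs no shared-memory staging at all. A minimal sketch, assuming a 3x3 box blur and a texture reference bound as in simpleTexture (the kernel name and filter size are my guesses, not your code):

// Sketch: 3x3 box blur reading straight through the texture cache.
// 'tex' is assumed to be bound to the 512x512 input, as in simpleTexture.
texture<float, 2, cudaReadModeElementType> tex;

__global__ void blurTexKernel(float *g_odata, int outW, int outH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    if (x >= outW || y >= outH) return;

    // Output pixel (x, y) maps to input pixel (x+1, y+1); the texture cache
    // already serves the reads that neighbouring threads share.
    float sum = 0.0f;
    for (int dy = 0; dy <= 2; ++dy)
        for (int dx = 0; dx <= 2; ++dx)
            sum += tex2D(tex, x + dx + 0.5f, y + dy + 0.5f);

    g_odata[y * outW + x] = sum / 9.0f;
}

Copying the same texels into shared memory first only adds a __syncthreads() and extra instructions without removing any memory traffic, which is why your two timings come out so close.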

What’s the size of the filter and what kind of GPU do you use? 730 ms sounds very slow for such a small image.

Uh, I think I might have forgotten to delete a loop calling the kernel. Now the processing time is about 611.9 ms.

My GPU has a compute capability of 1.1. Aside from the kernel, I’ve only modified this, from the simpleTexture example program:

What baffled me is that the processing time for the equivalent blur program executed only on the CPU was 11.7 ms. These were the MATLAB commands (the fourth line compiles the code):

Although I used the matcuda plugin for MATLAB on the last one, the difference is still appalling. The code for both programs is attached.

Previously, I had written two programs to denoise a 946x946 picture, each calling a function 100 times; both used matcuda, and one of them used 22x22 thread blocks. The one executed exclusively on the CPU ran 1.88 times faster than the other (1311 ms vs 2461 ms). The memory allocation/deallocation and data transfer times were relatively close to each other (23 ms vs 32 ms), so it was really the denoise functions that made the difference. These were the MATLAB commands:

Both are also attached (as outspockCPU.cu and outspock.cu).

Any help in what’s causing this will be appreciated.
blur_plus_denoise.rar (552 KB)

You have to check the model of your GPU; if it does not have many processor cores, it can’t beat the CPU…

By the way, what is matcuda?

I thought being a CUDA-enabled GPU always guaranteed some sort of speedup.

My CPU is an Intel Pentium D @ 3.0 GHz and I have 2GB of RAM.

These are my DeviceQuery results:

Sorry, it’s not matcuda. It’s the MATLAB plug-in for CUDA, which you can find here: http://developer.nvidia.com/object/matlab_cuda.html. It uses the MEX interface as a gateway to pass values from MATLAB to your C code. Matcuda, now called GPUmat, is a toolbox that allows standard MATLAB code to run on GPUs (http://gp-you.org/), but it’s still at an early stage of development.

This is definitely not true. Although the 8600 GT is not the slowest CUDA-enabled GPU, it is pretty close. In addition, if your task is dominated by the time required to copy the data to the GPU, then no matter how fast your GPU is, the CPU could be faster.

It does sound like something odd is going on, but even once that is solved, it is quite possible that CUDA will still be slower than the CPU for this task.

I see. But, as said, the data transfer is not what’s causing the bottleneck: «The memory allocation/deallocation and data transfer times were relatively close to each other (23 ms vs 32 ms), so it was really the denoise functions that made the difference». I guess I’ll have to try the tile division approach described here: http://www.nvidia.com/content/nvision2008/…o_with_CUDA.pdf.
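If I understand the PDF correctly, the idea is something like the sketch below: each block loads its output tile plus a halo into shared memory once, and then every filter read hits shared memory instead of global memory. The names, the 3x3 filter and the 16x16 tile/thread block are my own assumptions, not the code from the slides:

// Sketch of the tile-division approach, assuming a 3x3 filter, a 16x16
// output tile and a 16x16 thread block; clamps at the image borders.
#define TILE   16
#define RADIUS 1   // 3x3 filter -> 1-pixel halo

__global__ void tiledBlur(const float *g_idata, float *g_odata, int w, int h)
{
    __shared__ float smem[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int outX = blockIdx.x * TILE + threadIdx.x;
    int outY = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the (TILE+2)x(TILE+2) tile, halo included.
    for (int y = threadIdx.y; y < TILE + 2 * RADIUS; y += blockDim.y)
        for (int x = threadIdx.x; x < TILE + 2 * RADIUS; x += blockDim.x) {
            int gx = (int)blockIdx.x * TILE + x - RADIUS;
            int gy = (int)blockIdx.y * TILE + y - RADIUS;
            gx = min(max(gx, 0), w - 1);   // clamp to the image
            gy = min(max(gy, 0), h - 1);
            smem[y][x] = g_idata[gy * w + gx];
        }
    __syncthreads();

    if (outX >= w || outY >= h) return;

    // Every read in this loop now comes from shared memory.
    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += smem[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];

    g_odata[outY * w + outX] = sum / 9.0f;
}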

And you are sure that this isn’t CUDA initialization included in the timing? Initializing the CUDA context can be kind of slow, and the first time you run a particular kernel, the driver does some further one-time initialization. Also, which OS is this? I believe the CUDA context initialization takes longer on Windows Vista/7, but I don’t have the posts handy.
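A quick way to rule it out is to launch the kernel once as a warm-up and then time only a subsequent launch with CUDA events; a sketch, with myKernel, grid, block and d_odata as placeholders:

// Warm-up launch absorbs the context and one-time driver initialization.
myKernel<<<grid, block>>>(d_odata);
cudaThreadSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_odata);   // the launch actually being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);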

Well, I didn’t use a warm-up kernel on the denoise programs, but I did on the blur. Or maybe I misunderstood your question.

I’m using Windows XP 32-bit.

Also, in the “simple filter example” in the PDF from my last post, is g_idata in texture memory?