__shared__ memory offers no performance increase. Also, using GPUs to display video while running pro

Hi all,

Here is my problem: I'm running a (complex) 1D PDE grid solver which I rebuilt to run on CUDA, and I want it to go faster. Right now the main integrator loop launches a few kernels in a row (to enforce synchronization across blocks) and stores the GPU data in global 1D vectors (the size of the grid) so that it persists between kernel launches. The last kernel reads this global data a lot: each element i (which gets its own CUDA thread) accesses the i and i+1 elements of one of the vectors about 10 times. For a 250,000 element grid over 1000 cycles I'm guessing that this adds up (total simulation runtime is currently about 53 seconds).

So I gave each block a shared vector of size THREADSPERBLOCK+1, had each thread fill its own element at the beginning of the kernel (the extra i+1 element goes into the last slot of the vector, with blockDim.x equal to THREADSPERBLOCK), and changed all the references in that kernel to the shared vector. The program took exactly the same amount of time to run. Does this make any sense?
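For readers following along, the staging pattern described above could look roughly like the sketch below. It is not the actual solver: the kernel name `stencilShared`, the value 256 for THREADSPERBLOCK, the test data, and the difference `u[i+1] - u[i]` standing in for the real arithmetic are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADSPERBLOCK 256  // assumed value; the post does not state it

// Hypothetical stand-in for the last kernel: thread i uses u[i] and u[i+1].
__global__ void stencilShared(const float *u, float *out, int n)
{
    __shared__ float s[THREADSPERBLOCK + 1];

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Each thread stages its own element.
    if (i < n) s[t] = u[i];

    // The last thread of the block also stages the extra i+1 element.
    if (t == blockDim.x - 1 && i + 1 < n) s[t + 1] = u[i + 1];

    // Every thread reaches the barrier, including out-of-range ones.
    __syncthreads();

    // Placeholder arithmetic; the real solver reads these values ~10 times.
    if (i + 1 < n) out[i] = s[t + 1] - s[t];
}

int main()
{
    const int n = 250000;  // grid size from the post
    std::vector<float> h_u(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_u[i] = 0.5f * i;  // made-up test data

    float *d_u = NULL, *d_out = NULL;
    cudaMalloc(&d_u, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_u, &h_u[0], n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));

    int blocks = (n + THREADSPERBLOCK - 1) / THREADSPERBLOCK;
    stencilShared<<<blocks, THREADSPERBLOCK>>>(d_u, d_out, n);
    cudaMemcpy(&h_out[0], d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // With u[i] = 0.5 * i, every difference should be exactly 0.5.
    int bad = 0;
    for (int i = 0; i + 1 < n; ++i)
        if (h_out[i] != 0.5f) ++bad;
    printf("mismatches: %d\n", bad);

    cudaFree(d_u);
    cudaFree(d_out);
    return bad != 0;
}
```

The barrier sits outside any conditional so that all threads of a block reach it, and the halo load is guarded so that the final block never reads past the end of the grid.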

Also, I am running on a GTX 460 which I am also using as a graphics adapter (Ubuntu 10.04 with desktop cube, but no HD movies or Crysis)… Would this do something like fill up shared memory or otherwise cause the code to perform more slowly than it would otherwise? I have not seen much information on the web about the effects caused by running CUDA like this.



The GTX 460 has an L1 cache, and on this architecture global memory loads are cached in it by default. That may hide the benefit you are expecting from shared memory.

I would expect to see speed increases from shared memory on compute capability 1.x devices, which do not cache global memory loads (unless for some weird reason you induced bank conflicts).


To reinforce cbuchner1's comment: compile your code with -Xptxas -dlcm=cg, which makes global loads bypass the L1 cache (they are cached in L2 only), and check the execution times of both of your implementations again.
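One possible way to run that comparison is to time the kernel with CUDA events and build the same file twice, once with the flag and once without. The following is only a sketch under assumptions: `stencilGlobal`, `timeKernel`, the file name `time_stencil.cu`, and the block size are hypothetical, and `-arch=sm_21` assumes the GTX 460 and a toolkit that still supports it.

```cuda
// Build twice and compare the printed times (file name is hypothetical):
//   nvcc -arch=sm_21 time_stencil.cu -o with_l1
//   nvcc -arch=sm_21 -Xptxas -dlcm=cg time_stencil.cu -o without_l1
#include <cstdio>
#include <cuda_runtime.h>

#define THREADSPERBLOCK 256  // assumed value

// Hypothetical global-memory version: reads u[i] and u[i+1] directly.
__global__ void stencilGlobal(const float *u, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 < n) out[i] = u[i + 1] - u[i];  // placeholder arithmetic
}

typedef void (*StencilKernel)(const float *, float *, int);

// Returns the elapsed time in milliseconds for `cycles` launches of `k`.
float timeKernel(StencilKernel k, const float *d_u, float *d_out,
                 int n, int cycles)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + THREADSPERBLOCK - 1) / THREADSPERBLOCK;
    k<<<blocks, THREADSPERBLOCK>>>(d_u, d_out, n);  // warm-up launch

    cudaEventRecord(start);
    for (int c = 0; c < cycles; ++c)
        k<<<blocks, THREADSPERBLOCK>>>(d_u, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 250000;     // grid size from the original post
    const int cycles = 1000;  // cycle count from the original post

    float *d_u = NULL, *d_out = NULL;
    cudaMalloc(&d_u, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_u, 0, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));

    float ms = timeKernel(stencilGlobal, d_u, d_out, n, cycles);
    printf("global-memory kernel: %.3f ms for %d launches\n", ms, cycles);

    cudaFree(d_u);
    cudaFree(d_out);
    return 0;
}
```

The shared-memory variant of the kernel can be passed to the same `timeKernel` helper, since it only needs to match the `StencilKernel` signature. Timing the kernel in isolation also shows how much of the 53 seconds it actually accounts for.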

Best regards.