Performance drop on CUDA 2.3: as the input size increases, the computing time of a single step increases

Hi:

We implemented an Euler solver to simulate shock waves on CUDA in double precision, and we found that the sample program performs worse on CUDA 2.3 than on CUDA 2.0 as the grid size increases. We are running on a GTX 280 with 1 GB of on-board memory; the host CPU is a Core 2 Duo E8500 with 4 GB of memory, and the OS is Windows XP 64.
Attached is the single-step scaling chart for grid sizes of 128x512, 256x1024, 512x2048 and 1024x4096. It shows that CUDA 2.0 performs better than CUDA 2.3. Has anybody encountered the same problem?

What is the y-axis on that graph supposed to indicate?

At a guess, it is probably occupancy. In my experience, the 2.2 and 2.3 compiler releases actually generate considerably more efficient code, but they do so using more registers. It is possible that the register usage of your kernel has risen and your occupancy has taken a hit as a result. I would suggest profiling, or running nvcc with --ptxas-options="-v -mem" on the two builds to get the kernel resource usage, and then seeing what the occupancy calculator says about the expected active blocks per MP. You might find it is lower in 2.3 than in 2.0.
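To make the register/occupancy link concrete, here is a rough host-side sketch of the kind of estimate the occupancy calculator produces. The per-multiprocessor limits are the ones I would assume for a compute capability 1.3 part such as the GTX 280 (16384 registers, 1024 threads, 8 blocks, 16 KB shared memory); the names SmLimits and activeBlocksPerSm are made up for illustration, and the sketch ignores register allocation granularity, so treat the real calculator as the authority.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits for compute capability 1.3 (GTX 280 class).
// These are illustrative defaults; verify them against the occupancy calculator.
struct SmLimits {
    int registers    = 16384;  // 32-bit registers per multiprocessor
    int max_threads  = 1024;   // resident threads per multiprocessor
    int max_blocks   = 8;      // resident blocks per multiprocessor
    int shared_bytes = 16384;  // shared memory per multiprocessor
};

// Simplified estimate of resident blocks per multiprocessor.
// Ignores register allocation granularity, so it can be slightly optimistic.
int activeBlocksPerSm(int threads_per_block, int regs_per_thread,
                      int smem_per_block, const SmLimits& sm = SmLimits()) {
    assert(threads_per_block > 0);
    int by_threads = sm.max_threads / threads_per_block;
    int by_regs = regs_per_thread > 0
                      ? sm.registers / (regs_per_thread * threads_per_block)
                      : sm.max_blocks;
    int by_smem = smem_per_block > 0 ? sm.shared_bytes / smem_per_block
                                     : sm.max_blocks;
    // The tightest resource limit decides how many blocks can be resident.
    return std::min({sm.max_blocks, by_threads, by_regs, by_smem});
}

// Occupancy as the fraction of the maximum resident threads actually in use.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block, const SmLimits& sm = SmLimits()) {
    int blocks = activeBlocksPerSm(threads_per_block, regs_per_thread,
                                   smem_per_block, sm);
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

With hypothetical numbers: at 256 threads per block, a kernel that goes from 16 to 32 registers per thread drops from 4 to 2 active blocks in this estimate, which halves the occupancy. Plug in the register counts that ptxas reports for your two builds to see whether something like that happened.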

Thanks for replying.

The y-axis is the computing time per grid point, in ms.

We found the problem: the screen saver was interfering with the execution. After disabling the screen saver, the program runs slightly faster on 2.3 than on 2.0, with fewer registers used.

In my case I also experienced a significant drop in performance after updating from CUDA 2.1 to 2.3. Before, my renderer ran at about 30 fps (v2.1), and now it runs at about 20-25 fps (v2.3). I have yet to figure out the source of the slowdown… Hopefully it's the screen saver as well :-)

I have now found the source of the slowdown that appears in OpenGL when changing from CUDA 2.1 to 2.3: it was not the screen saver but the PBO configuration!

If the PBO was created with GL_DYNAMIC_DRAW, the TexSubImage call was terribly slow. Using GL_DYNAMIC_COPY, however, turned out to be as fast as before.

So when creating a PBO, be sure to use GL_DYNAMIC_COPY as follows:

glBufferData( GL_ARRAY_BUFFER, image_width * image_height * (bpp/8), data, GL_DYNAMIC_COPY);
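A side note on the size argument in that call: bpp/8 truncates if bpp is not a multiple of 8, and the product can overflow an int for large images. Below is a small sketch of a helper that avoids both. The name pboSizeBytes is hypothetical, and it assumes tightly packed rows with no row padding, which may not match your unpack alignment settings.

```cpp
#include <cstddef>

// Hypothetical helper: size in bytes of a tightly packed image buffer.
// Assumes no per-row padding; sub-byte remainders are rounded up to a whole byte.
std::size_t pboSizeBytes(std::size_t width, std::size_t height,
                         unsigned bits_per_pixel) {
    std::size_t bytes_per_pixel = (bits_per_pixel + 7) / 8;  // round up
    return width * height * bytes_per_pixel;  // computed in size_t, not int
}

// Intended use (needs a current OpenGL context, not shown here):
//   glBufferData(GL_ARRAY_BUFFER,
//                pboSizeBytes(image_width, image_height, bpp),
//                data, GL_DYNAMIC_COPY);
```

This does not change the usage-hint behaviour described above; it only makes the byte count passed to glBufferData a bit more robust.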

Perhaps somebody can explain to me why this happens …