Slowdown with multiple CUDA files

I am working with an NVIDIA 9800 GX2 and CUDA.

I have developed some CUDA processing algorithms that take roughly 5 ms when run alone. Transfers between the GPU and host take only about 1–2 ms. When I combine several CUDA processing routines, however, the transfer and processing together take something like 220 ms, which doesn’t make sense given the individual transfer and processing times — it should take about 30 ms at the most. I am transferring and processing continuously, and was hoping to get a 30 Hz rate.

I am printing out GPU memory information using GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX, and it seems like there is plenty of memory left. There are still around 3 MB available.

Please let me know what I might do to speed this up, or some ways to investigate this problem.

Thanks a lot!

There’s not enough information to really help. Post some code, and perhaps someone can spot an issue.

A common explanation for “my code is fast, but when I add just a little more computation to the function, it becomes slow!” is compiler dead code elimination. If you compute values but never use them, the compiler can remove potentially large amounts of code, and the function runs very quickly. Then you add one more computation (using the previously unused values), and suddenly your runtime goes way up, because a big chunk of previously dead code becomes live again.
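As a sketch of the effect (the kernel names and the loop body are made up for illustration, not taken from the original poster’s code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The result is never stored to global memory, so the compiler can prove
// the loop has no observable effect and may delete it entirely.  Timing
// this kernel then measures almost nothing.
__global__ void work_discarded(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = 0.0f;
    for (int k = 0; k < 1000; ++k)   // stand-in for real per-pixel work
        v = sinf(v + i);
    // no store: dead code from the compiler's point of view
}

// Identical work, but the store makes it observable, so nothing is
// eliminated -- this kernel shows the "real" runtime.
__global__ void work_kept(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = 0.0f;
    for (int k = 0; k < 1000; ++k)
        v = sinf(v + i);
    out[i] = v;
}
```

If `work_discarded` times dramatically faster than `work_kept`, you are benchmarking eliminated code rather than your algorithm.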

I am doing image processing and am now trying to do real-time video processing. When I process a single image, it takes about 5 ms. Now, I am capturing a video frame, sending it to the GPU (which takes about 1 ms), using one CUDA function to do an initial processing step (which should take about 5 ms), and then running my CUDA processing on the output (5 ms). However, on the next cycle, when the initial CUDA processing step is called, the function takes an incredibly long time, like 220 ms. It seems like there is a problem transferring between the memory space of the first and second CUDA kernels. My CUDA processing step does use a lot of memory, but as far as I can tell there is still memory free on the GPU.

I don’t think it has to do with dead code, since I have verified the output of my image processing by printing it to the screen or to a file. Also, I have compiled the image processing into a DLL, so the compiler can’t remove the dead code.

The next potential issue is that the kernel launch returns before the kernel has actually finished (launches are asynchronous), so you underestimate the time it needs.

Assuming you make several kernel calls, use cudaThreadSynchronize() after the (first) kernel call to see if this is the case.
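A minimal sketch of what that timing looks like, using CUDA events so the measurement covers the full kernel execution (the kernel name and launch configuration are placeholders, not from the poster's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Time a kernel correctly: the launch itself returns immediately, so a
// CPU timer around the launch alone measures almost nothing.  Recording
// events on the stream and synchronizing on the stop event gives the
// real execution time.
void time_kernel(float *d_in, float *d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int block = 256;
    int grid  = (n + block - 1) / block;

    cudaEventRecord(start, 0);
    myKernel<<<grid, block>>>(d_in, d_out, n);  // returns immediately
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait for completion

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Equivalently, with a CPU timer, call cudaThreadSynchronize() (cudaDeviceSynchronize() in newer CUDA versions) after the launch and before reading the timer; otherwise the reported 5 ms is just the launch overhead, and the cost of the still-running kernel gets billed to whatever synchronizing call comes next — which would explain the 220 ms showing up in the following step.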

Thanks for your help. Yes, this is the error. I don’t think I’ll be able to do this processing real-time.