I have a strange problem. I’ve written a kernel that is called from a multithreaded C++ application. When I compile and run the kernel alone in a separate Visual Studio project, it runs 2x faster than when it runs as part of the larger program.
I’ve compared the command line arguments for both projects and they are identical. Both builds use the same amount of shared memory, registers, etc., and run on the same machine and graphics card. I’m using the NVIDIA profiler in Visual Studio to time them, and the standalone version runs 2x as fast.
Any ideas what to look for to get both versions to run at the same speed? Thanks!!
I’m just going to throw out some possible ideas I have:
If your GPU is heating up in the large project, it will likely throttle its clocks and run significantly slower. Even something as simple as GPU-Z could show you whether it is getting too hot.
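If you’d rather check temperature programmatically than with GPU-Z, the NVML library that ships with the CUDA toolkit can read the die temperature. A minimal sketch (you’d need to link against nvml.lib; device index 0 is an assumption):

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // assumes your card is device 0

    unsigned int tempC = 0;
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
    printf("GPU die temperature: %u C\n", tempC);

    nvmlShutdown();
    return 0;
}
```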
If you queue up many kernels, I’m not sure the profiler will accurately report the start/end time of each one (I haven’t used the profiler much). One solution is to call cudaDeviceSynchronize() to explicitly force all previously queued kernels to finish before launching the kernel in question.
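For what it’s worth, you can also time the kernel yourself with CUDA events instead of relying on the profiler. Here’s a rough sketch of what I mean (myKernel and its launch configuration are placeholders for your actual kernel):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for your kernel */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();   // drain any previously queued work first

    cudaEventRecord(start);
    myKernel<<<256, 256>>>();  // placeholder launch configuration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```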
I don’t have any experience with streams, but I do know that kernels can be launched asynchronously. If these kernels run concurrently, each one will see a slowdown, since neither gets all the resources it would have when run alone.
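To illustrate what I mean: if the larger app does something like the following (kernelA and kernelB are hypothetical stand-ins for your kernels), both kernels can end up resident on the GPU at once and split the SMs and memory bandwidth between them:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA() { /* ... */ }
__global__ void kernelB() { /* ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launched on different streams, these may overlap on the device,
    // so each sees fewer resources than it would running alone.
    kernelA<<<64, 256, 0, s1>>>();
    kernelB<<<64, 256, 0, s2>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```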
If I were in your shoes, I would make a copy of the large project and begin commenting out sections around the kernel call (replacing them with constant value inputs/outputs and so on). Note that code that runs “after” your kernel launch can still affect it, since the launch is asynchronous and that host code overlaps with the kernel’s execution; be sure to suspect code that runs both before and after this kernel.
Best of luck to you, and hopefully someone who has experienced this phenomenon can offer a more precise answer.