I’m just going to throw out some possible ideas I have:
If your GPU is heating up under the larger project's load, the driver will thermally throttle the clocks and performance can drop significantly. Even something as simple as GPU-Z (or nvidia-smi on the command line) can show you whether it is getting too hot.
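If you'd rather check temperature programmatically while the project runs, here is a minimal sketch using NVML (link with -lnvidia-ml). It assumes device index 0 is the GPU in question:

```cuda
// Sketch: query the GPU core temperature via NVML, so you can log it
// alongside your kernel timings and see whether throttling lines up
// with the slowdown. Device index 0 is an assumption.
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    unsigned int tempC = 0;
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
    printf("GPU temperature: %u C\n", tempC);

    nvmlShutdown();
    return 0;
}
```

Polling this in a loop while the large project runs would tell you quickly whether heat is the culprit.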
If you queue up many kernels, I'm not sure the profiler will accurately show the start/end time of each one (I haven't used the profiler much). One workaround is to call cudaDeviceSynchronize() to explicitly force all previously queued work to finish before launching the kernel in question, so its measured time isn't polluted by earlier launches.
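As a rough sketch of what I mean: drain the queue first, then bracket just your kernel with CUDA events. myKernel and its launch configuration below are placeholders for your own:

```cuda
// Sketch: time one kernel in isolation. cudaDeviceSynchronize() forces all
// previously queued kernels/copies to finish, so the event pair below
// measures only myKernel. myKernel and <<<256, 256>>> are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* your kernel body here */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Everything queued before this point must complete first.
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    myKernel<<<256, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("myKernel took %.3f ms in isolation\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If the isolated time matches your standalone project but not the large one, contention from other queued work is a likely suspect.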
I don't have any experience with streams, but I do know that kernels launched into different streams can execute concurrently. If these kernels are running concurrently, each one would individually see a slowdown, since neither gets all the SM resources it would have when run alone.
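To illustrate the contention I'm describing, here is a small sketch with a hypothetical busy-wait kernel launched into two streams; the two instances may overlap and each take longer than a solo run would:

```cuda
// Sketch: two copies of a busy-wait kernel in separate streams.
// Launches in the SAME stream serialize; launches in DIFFERENT streams
// may overlap, so each kernel gets only a share of the GPU and runs
// slower than it would alone. spin() is hypothetical, for illustration.
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* burn time on the SM */ }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches are asynchronous; with enough free SMs they overlap.
    spin<<<64, 256, 0, s1>>>(1000000);
    spin<<<64, 256, 0, s2>>>(1000000);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```

Timing one of these kernels alone versus alongside the other would show the per-kernel slowdown directly.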
If I were in your shoes, I would make a copy of the large project and start commenting out sections near the kernel call (replacing them with constant inputs/outputs and so on). Note that code run "after" your kernel can still affect it, since launches are queued asynchronously; be sure to suspect code that runs both before and after the kernel.
Best of luck to you, and hopefully someone who has experienced this phenomenon can offer a more precise answer.