I would like to ask for your help. My CUDA application runs four threads, each with its own CUDA stream. The kernel execution time is the same in all four threads. When I reduce the application to a single thread, the kernel execution time is still the same as in the four-thread case. This is what confuses me: when the application runs with one thread, nvidia-smi already shows 100% GPU utilization, so I assumed that with four threads the kernel execution time would increase. The result is different from what I expected. What is the reason?

By the way, nvidia-smi only ever shows a utilization of 0% or 100%. I never see intermediate values such as 10% or 20%. Why is that?
Please do not hesitate to correct me if I have made any mistakes.