Load balancing is very beneficial: parallel load balance

Hi, I have a few randomly chosen threads that each perform thousands of heavy reads from device memory, while the other threads perform only a few reads. This is slow.
Then I balanced the load so that all threads read about the same amount, while the total workload across all threads stayed the same as before. This is fast.
Why? One possible reason is that if one thread is slow, the other threads in the same warp wait for it even after they have finished. Is that so? Thank you!

I think you pretty much answered your own question. The reason is a combination of two causes:

  • A read from device memory takes up to 300 cycles, so many reads are expensive. The severity of the impact depends on how many cycles you spend reading versus how many you spend on computation. Judging by your description, I’d guess that your application is memory-bound.

  • It also depends on your code. If your threads diverge so that some of them do many reads, the threads that don’t perform many reads still “wait around,” since the processing elements within a multiprocessor execute the same instruction in lock-step.
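
To put rough numbers on the lock-step point, here is a minimal host-side sketch in Python (not CUDA, and not a measurement). It assumes a warp of 32 threads keeps iterating until its busiest thread is done, and it ignores latency hiding, coalescing, and warps running concurrently. The function name `warp_steps` and the 64-thread workload are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_steps(reads_per_thread, warp_size=WARP_SIZE):
    """Estimate lock-step read iterations, summed over all warps.

    Simplified model: a warp keeps issuing the read loop until its
    busiest thread has finished, so each warp costs max(reads)
    iterations no matter how little the other lanes have to do.
    """
    total = 0
    for start in range(0, len(reads_per_thread), warp_size):
        warp = reads_per_thread[start:start + warp_size]
        total += max(warp)
    return total


# Hypothetical workload: 64 threads (2 warps), 6400 reads in total.
# Unbalanced: one thread per warp does 3169 reads, the other 31 do 1 each.
unbalanced = ([3169] + [1] * 31) * 2
# Balanced: every thread does 100 reads.
balanced = [100] * 64

# Same total work in both layouts.
assert sum(unbalanced) == sum(balanced) == 6400

print(warp_steps(unbalanced))  # 6338
print(warp_steps(balanced))    # 200
```

In this toy model the total number of reads is identical, but the unbalanced layout keeps each warp occupied for about 30 times as many loop iterations, which is the "waiting around" described above. Real speedups will differ, since the hardware overlaps memory latency across warps.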