Is it possible to create a large pool of tasks and have CUDA execute them in a master/slave fashion? That is, I have thousands of jobs to do and hundreds of GPU threads that are able to compute. Each thread would run a job, complete it, and immediately take another one, until there are no jobs left to run. I was thinking about this approach because every job has a different execution complexity. As a result, the faster threads are always waiting for the slow ones that are running the hard tasks, and I don't use the GPU efficiently.
You can probably make almost anything “work”, so nearly any work distribution strategy is probably “possible” on a CUDA GPU.
CUDA GPU threads generally need to be following the same code path, and loading and storing adjacent data, in order to get useful performance out of the GPU, at least up to the warp level (groups of 32 threads). It sounds like what you are describing is task-level parallelism rather than the kind of “data parallelism” that I described.
If your actual work fits the “data parallelism” that I described, then it may be a candidate, but having a large work discrepancy between threads doesn’t always indicate “data parallelism”. If you are asking about task parallelism, i.e. disparate work between threads with no commonality/similarity even at the warp level, it’s probably not a good fit for (CUDA) GPUs.
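For reference, the "grab the next job when you finish" scheme from the question is often sketched with a global atomic counter that serves as the job queue. The following is only a minimal sketch under assumptions, not a recommendation: `run_job`, `worker`, `run_dynamic_queue`, the launch configuration, and the made-up per-job cost are all hypothetical placeholders. Note that it does not remove the warp-level concern above: threads in the same warp that pick up jobs of different cost will still diverge.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical job: the iteration count depends on the job index
// to mimic jobs of uneven cost. Runs on host and device.
__host__ __device__ unsigned int run_job(unsigned int job)
{
    unsigned int iters = 1u + (job % 97u) * 50u;
    unsigned int x = job;
    for (unsigned int i = 0; i < iters; ++i)
        x = x * 1664525u + 1013904223u;  // arbitrary arithmetic
    return x;
}

// Each thread loops: claim the next unclaimed job index from a
// global counter, run it, and repeat until the indices run out.
__global__ void worker(unsigned int *next_job, unsigned int num_jobs,
                       unsigned int *results)
{
    while (true) {
        unsigned int job = atomicAdd(next_job, 1u);  // returns the old value
        if (job >= num_jobs) break;
        results[job] = run_job(job);
    }
}

// Launches the workers and returns how many results match a
// host-side recomputation of the same jobs.
unsigned int run_dynamic_queue(unsigned int num_jobs)
{
    unsigned int *d_next = nullptr, *d_results = nullptr;
    cudaMalloc(&d_next, sizeof(unsigned int));
    cudaMalloc(&d_results, num_jobs * sizeof(unsigned int));
    cudaMemset(d_next, 0, sizeof(unsigned int));

    // Far fewer threads than jobs (assumed sizes): 4 blocks of 128 threads.
    worker<<<4, 128>>>(d_next, num_jobs, d_results);

    std::vector<unsigned int> h_results(num_jobs, 0u);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_results.data(), d_results,
                                 num_jobs * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_next);
    cudaFree(d_results);
    if (err != cudaSuccess) return 0u;

    unsigned int ok = 0;
    for (unsigned int j = 0; j < num_jobs; ++j)
        if (h_results[j] == run_job(j)) ++ok;
    return ok;
}

int main()
{
    const unsigned int num_jobs = 10000;
    unsigned int ok = run_dynamic_queue(num_jobs);
    printf("%u of %u jobs verified\n", ok, num_jobs);
    return ok == num_jobs ? 0 : 1;
}
```

This keeps every thread busy until the counter passes `num_jobs`, so no thread idles while unclaimed jobs remain. Whether it actually beats a static one-job-per-thread mapping depends on how much the code paths and memory accesses differ within each warp, which is worth measuring on the real workload.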