dynamic load balancing within blocks

Hi everyone,

I am new to CUDA. In my application, the data cannot easily be partitioned evenly among threads. For example, consider multiplying a sparse matrix by a vector, where the non-zero elements follow a power-law distribution, i.e., most rows have few non-zeros while a few rows have many.
If such a sparse matrix is assigned to a block, and each thread computes the dot product of one row with the vector, then most threads finish quickly while the remaining few stay busy for a long time.
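To make this concrete, here is a minimal sketch of the one-thread-per-row assignment I described, assuming the matrix is stored in CSR format (the array names are just my placeholders):

```cuda
// One thread per row: threads with short rows idle while long rows run.
__global__ void spmv_row_per_thread(int n_rows,
                                    const int   *row_ptr,  // CSR row offsets, length n_rows + 1
                                    const int   *col_idx,  // column indices of non-zeros
                                    const float *vals,     // non-zero values
                                    const float *x,        // dense input vector
                                    float       *y)        // output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[row] = sum;  // a thread with a power-law "heavy" row stalls here long after its neighbors finish
    }
}
```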

Is there a way to let the idle threads steal work from the busy threads at run time?
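The kind of thing I have in mind is a per-block shared counter that threads increment with `atomicAdd` to grab the next unprocessed row, so a thread that finishes a short row immediately picks up another instead of idling. A rough, untested sketch (`ROWS_PER_BLOCK` is just an assumed tile size I made up):

```cuda
#define ROWS_PER_BLOCK 256  // assumed number of rows handled by one block

// Rows are claimed dynamically from a shared counter rather than by a fixed
// thread-to-row mapping, so fast threads keep working until the tile is done.
__global__ void spmv_dynamic_rows(int n_rows,
                                  const int *row_ptr, const int *col_idx,
                                  const float *vals, const float *x, float *y)
{
    __shared__ int next_row;  // per-block work counter
    if (threadIdx.x == 0)
        next_row = blockIdx.x * ROWS_PER_BLOCK;  // first row of this block's tile
    __syncthreads();

    int end = min((blockIdx.x + 1) * ROWS_PER_BLOCK, n_rows);
    for (;;) {
        int row = atomicAdd(&next_row, 1);  // claim the next unprocessed row
        if (row >= end)
            break;  // tile exhausted; this thread is done
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[row] = sum;
    }
}
```

Is something along these lines the right direction, or is there a better-established pattern (e.g., assigning a whole warp to each heavy row)?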