how to balance work load amount all threads

I found for some algorithms, because the processing iterations or processing complexity depend on the input data, processing time is much various between threads, which means a big performance loss on current GPU architecture. is there any general solution or suggestion on this topic?

I am trying to solve this by dividing the GPU program into 2 parts for improving the performance, for instance,

in part1, launch a kernel with input data, each thread will compute how many ‘basic’ works are generated from one input data slot, and prepare the input for each of these ‘basic’ works. These ‘basic’ works are expected to have similar processing complexity and processing time.

in part2, launch a new kernel with more threads, each thread will do a ‘basic’ work.

in OpenCL coding style, it is like:
part 1 kernel:
__kernel void divide_to_basic_work(__global MyStruct* input_data,
__global BasicWork *basic_work,
__global int *basic_work_nums)
int global_id=get_global_id(0);
MyStruct data_slot=input_data[global_id];
int basic_work_num=my_get_basic_work_num(data_slot);
int offset=[b]global_prefix_sum/b;
for(int i=0; i<basic_work_num; i++) {
basic_work[offset+i]=my_get_basic_work(data_slot, i);

CPU code can get the values in basic_work_nums and know how many ‘basic’ work generated by part 1, say A. then starting part 2 kernel using A threads, each of the threads will read a slot of basic_work and do a ‘basic’ work. according the previous assumption, each thread of kernel in part 2 will have similar processing time.

I am trying to make this pattern more general to solve the ‘load unbalanced’ problem amount GPU threads. But what I am missing in the approach is global_prefix_sum() method, is there any implementation for this kind of function? Thanks.

Since you are posting in the CUDA forums, you get a CUDA answer. There are two commonly used libraries for fast prefix sums: thrust and CUDPP.

Thanks, these 2 libraries are brilliant. but for multi-block prefix sum, they seems just using multi-pass of kernels. is there some idea of implementing prefix sum in a global scope (multi-blocks) in one device function?