Hello everyone,

When writing a CUDA program, I used "cudaEvent_t start, stop;" to measure the time and found that it took a very long time to launch a kernel.
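For reference, this is roughly how I measure the time (a minimal sketch; dummy_kernel and the launch configuration are placeholders, not my real kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for my real one.
__global__ void dummy_kernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // record before the launch
    dummy_kernel<<<2, 192>>>();
    cudaEventRecord(stop, 0);           // record after the launch
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```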

Therefore, I tried to merge all the kernels into one and launch it with the largest thread count among the original kernels.

Like:

Original:

kernel_a <<< 2, 192, 0, 0 >>> (…);

kernel_b <<< 4, 576, 0, 0 >>> (…);

New:

__global__ void kernel_ab () {

/*calculate for part A*/

if (threadIdx.x < 192) {

(…)

}

__syncthreads();

/*calculate for part B*/

}

kernel_ab <<< 4, 576, 0, 0 >>> (…);

But here comes a problem: in this example, the threads with threadIdx.x in [192, 575] have no work in part A.

However, I cannot return from them directly, because those threads are needed in the next part of the calculation.

This causes a surge in the time spent on __syncthreads() in part A; the wait at that barrier becomes really long.

I think this is caused by the increase in the number of threads.

Hence, I would like to ask: can the number of threads be adjusted during kernel execution?

Or is there any other way to solve this problem?

Thanks for the reply!