Adjust thread number during the computation

Hello everyone,

When writing a CUDA program, I used `cudaEvent_t start, stop;` to measure execution time and found that launching a kernel takes a very long time.
Therefore, I tried to merge the kernels into one and launch it with the largest thread count among the original kernels, like this:
Original:
kernel_a <<< 2, 192, 0, 0 >>> (…);
kernel_b <<< 4, 576, 0, 0 >>> (…);

New:
__global__ void kernel_ab () {
    /* calculate for part A */
    if (threadIdx.x < 192) {
        (…)
    }
    __syncthreads();
    /* calculate for part B */
}

kernel_ab <<< 4, 576, 0, 0 >>> (…);

But this creates a problem: in this example, the threads with threadIdx.x in [192, 575] have no work to do in part A.
However, I cannot return from them early, because those threads are needed in the next part of the calculation.
As a result, the time spent in __syncthreads(); after part A increases sharply.
I think this is caused by the increased number of threads.
Hence, I would like to ask: can the number of threads be adjusted during kernel execution?
Or is there another way to solve this problem?

Thanks for the reply!

The number of threads cannot be adjusted from within a kernel.

Do you mind showing your kernels?
Does kernel B depend on the results of kernel A ?
Would it be valid to run kernel B with <<< 16, 192 >>> ?
How do you know that __syncthreads() limits the performance? What is “really long”? Does kernel AB run slower than A + B?
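To answer that last question empirically, a minimal timing harness with CUDA events could look like the sketch below (the kernel bodies are empty placeholders; the launch geometries are taken from your post, and a warm-up launch is added on the assumption that one-time launch overhead should be excluded):

```cuda
#include <cstdio>

__global__ void kernel_a() { /* part A work (placeholder) */ }
__global__ void kernel_b() { /* part B work (placeholder) */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization cost is not measured.
    kernel_a<<<2, 192>>>();
    cudaDeviceSynchronize();

    // Time kernel A followed by kernel B on the default stream.
    cudaEventRecord(start);
    kernel_a<<<2, 192>>>();
    kernel_b<<<4, 576>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("A + B: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Timing the fused kernel_ab the same way would show directly whether it is actually slower than A + B.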

As the blocks of kernel A may be executed serially (so other needed blocks may not be finished when the part-B code starts), this only works if each block of kernel B reads data exclusively from the corresponding block of kernel A.

You could possibly change kernel B into a for loop over 3 groups of 192 threads to match the number of threads per block.
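A sketch of that idea, assuming the fused kernel keeps kernel A's 192 threads per block and that part B's per-thread work can be rewritten in terms of a logical index (the loop bound 3 comes from 576 / 192; the grid size and any cross-block indexing are left open):

```cuda
__global__ void kernel_ab() {
    // Part A: all 192 threads of the block participate, so no
    // threads sit idle at the barrier.
    /* work for part A */
    __syncthreads();

    // Part B originally used 576 threads per block; with 192 threads
    // per block, each thread now handles 3 logical indices instead.
    for (int i = 0; i < 3; ++i) {
        int logicalIdx = threadIdx.x + i * 192;
        /* work for part B, using logicalIdx where the original
           kernel_b used threadIdx.x */
        (void)logicalIdx; // placeholder until real work is filled in
    }
}
```

This keeps every thread busy in both parts, at the cost of each thread doing three units of part-B work sequentially.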