Hello everyone,

When writing a CUDA program, I used "cudaEvent_t start, stop;" to measure the time and found that it took a very long time to launch a kernel.
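For reference, this is roughly how I measure the time (a minimal sketch; dummy_kernel and the launch configuration are placeholders, not my real kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for my real one.
__global__ void dummy_kernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // record before the launch
    dummy_kernel<<<2, 192>>>();
    cudaEventRecord(stop, 0);           // record after the launch
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```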

Therefore, I tried to merge all the kernels into one and launch it with the largest thread count among the original kernels.

Like:

Original:

kernel_a <<< 2, 192, 0, 0 >>> (…);

kernel_b <<< 4, 576, 0, 0 >>> (…);

New:

__global__ void kernel_ab () {

/*calculate for part A*/

if (threadIdx.x < 192) {

(…)

}

__syncthreads();

/*calculate for part B*/

}

kernel_ab <<< 4, 576, 0, 0 >>> (…);

But here comes a problem: in this example, the threads with threadIdx.x in [192, 575] have no work in part A.

However, I cannot return from them directly, because those threads are needed in the next part of the calculation.

This causes a surge in the time spent on __syncthreads() in part A; the wait at that barrier becomes really long.

I think this is caused by the increase in the number of threads.

Hence, I would like to ask: can the number of threads be adjusted during kernel execution?

Or is there any other way to solve this problem?

Thanks for the reply!