I have an optimization problem with my code.
With the advice you gave me I managed to optimize my kernel.
Now I’m struggling with optimizing the kernel launches.
I set up different streams, and data is managed with cudaMallocManaged.
I have a loop that selects the arrays that have to be passed to the first kernel.
The first kernel produces a temporary array that is summed up afterwards by a second kernel:
for (unsigned int i = 0; i < N; i += NSTREAMS)
{
    for (unsigned int w = 0; w < NSTREAMS; w++)
    {
        kernel1<<<blocks, threads, 0, stream[w]>>>(parms);
    }
    for (unsigned int w = 0; w < NSTREAMS; w++)
    {
        kernel2<<<blocks, threads, 0, stream[w]>>>(parms);
    }
}
I obviously obtained a great improvement compared to the “streamless” version of this code, but the profiler still says
that “the multiprocessors of the GPU are almost idle”.
And actually they are.
Concurrent kernel execution has a number of requirements that must be satisfied, and it can be difficult to achieve in practice. You haven’t shown any kernel code or kernel launch parameters, but if, for example, your kernel launches consist of a large number of blocks, these will typically “fill” the GPU and prevent any significant concurrency.
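As a minimal sketch (all names, sizes, and the dummy kernel here are my own invention, not taken from your code): launching kernels with small grids into separate non-blocking streams is the scenario where overlap is typically observable, because each launch leaves most SMs free for the others.

```
// Minimal concurrency demo sketch -- assumed names, not the poster's kernels.
// Build with nvcc and inspect the timeline in the profiler for overlap.
#include <cuda_runtime.h>

__global__ void busy_kernel(float *d, int iters)
{
    float v = d[threadIdx.x];
    for (int i = 0; i < iters; ++i)     // burn time so overlap is visible
        v = v * 0.999f + 0.001f;
    d[threadIdx.x] = v;
}

int main()
{
    const int NS = 8;
    cudaStream_t s[NS];
    float *buf[NS];
    for (int i = 0; i < NS; ++i) {
        cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);
        cudaMallocManaged(&buf[i], 512 * sizeof(float));
    }
    // One block per launch: each launch occupies at most one SM,
    // so the other streams' kernels have room to run concurrently.
    for (int i = 0; i < NS; ++i)
        busy_kernel<<<1, 512, 0, s[i]>>>(buf[i], 1 << 20);
    cudaDeviceSynchronize();
    for (int i = 0; i < NS; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```

If instead each launch used a grid large enough to fill all SMs, the profiler would show the kernels executing back to back even though they were issued to different streams.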
The first kernel takes three vectors of size n1 and three vectors of size n2.
The processing is done at warp level with the __shfl() intrinsic, and each kernel uses at most 192 blocks.
The actual number of blocks is min(n1/16, 192), as there are 512 threads per block.
The second kernel takes the output vector from the first kernel (192 elements at most)
and sums them up.
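Roughly, the shape of that second kernel is this (a simplified sketch, not my exact code; names are made up):

```
// Hypothetical shape of kernel2: one block of 192 threads reduces
// up to 192 partial results. Launch as:
//   sum_partials<<<1, 192, 0, stream[w]>>>(partials, out, n);
__global__ void sum_partials(const float *partials, float *out, int n)
{
    __shared__ float acc[192];
    int t = threadIdx.x;
    acc[t] = (t < n) ? partials[t] : 0.0f;
    __syncthreads();
    // Tree reduction; the first pass folds elements 128..191 into 0..63,
    // the remaining passes are a standard power-of-two reduction.
    for (int stride = 128; stride > 0; stride >>= 1) {
        if (t < stride && t + stride < 192)
            acc[t] += acc[t + stride];
        __syncthreads();
    }
    if (t == 0) *out = acc[0];
}
```

A single-block launch like this occupies only one SM, which is part of why I expected the streams to overlap.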
I’m using an 8-multiprocessor card, so I thought that by using 8 streams the card would be
100% occupied.
What am I missing?
What kind of GPU are you using? With older GPUs (generally speaking pre-Kepler), there can be an issue with false dependencies as all kernel launches go through a single work queue. The following article describes this in the context of CUDA Fortran, but it applies to CUDA C just the same:
Make sure your code does not use any blocking API calls. Make sure you do not accidentally set the environment variable CUDA_LAUNCH_BLOCKING=1. If you are on one of the newer GPUs, you may want to increase the number of concurrent streams supported by the driver by setting environment variable CUDA_DEVICE_MAX_CONNECTIONS to a higher value than the default (which is 8, I think). The hardware maximum is 32.
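For example (assuming a bash-like shell; set these in the environment before launching the application):

```shell
# Make sure launches stay asynchronous:
unset CUDA_LAUNCH_BLOCKING
# Raise the number of hardware work queues from the default of 8
# to the hardware maximum of 32:
export CUDA_DEVICE_MAX_CONNECTIONS=32
```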
It certainly should be possible to run multiple kernels in parallel (concurrently), if each of those kernels consists of only a single threadblock. After all the necessary prerequisites for concurrent kernel launch are dealt with (such as the necessity to use streams, a cc 2.x or higher device, etc.), there are a number of resource limits that must be satisfied, concurrently, in order to observe concurrency. Many of those resource limits are covered in the table 11/12 that I linked.
cudaStream_t working_streams[NWORKINGSTREAMS];
for (unsigned int i = 0; i < NWORKINGSTREAMS; i++) {
    // cudaStreamCreate(&working_streams[i]);
    cudaStreamCreateWithFlags(&working_streams[i], cudaStreamNonBlocking);
}
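And the matching cleanup once the work issued to the streams has completed (a sketch, mirroring the creation loop above):

```
cudaDeviceSynchronize();   // ensure all queued work has finished
for (unsigned int i = 0; i < NWORKINGSTREAMS; i++) {
    cudaStreamDestroy(working_streams[i]);
}
```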