I am trying to parallelize the reductions of multiple arrays by launching concurrent kernels from separate CPU threads, with the threads managed by OpenMP. The loop looks like this:
#pragma omp parallel for num_threads(tot_num_thrds)
for (int i = 0; i < n; i++)
{
    // reduce the i-th slice of redArray1 into gridDim2 partial sums
    reduce<blockDim2><<< gridDim2, blockDim2, blockDim2 * sizeof(double) >>>(&redArray1[i * gridDim1], &redArray2[i * gridDim2]);
    gpuERR( cudaPeekAtLastError() );
    gpuERR( cudaStreamSynchronize(0) );

    // reduce those gridDim2 partial sums into the single value redArray3[i]
    reduce<blockDim3><<< 1, blockDim3, blockDim3 * sizeof(double) >>>(&redArray2[i * gridDim2], &redArray3[i]);
    gpuERR( cudaPeekAtLastError() );
    gpuERR( cudaStreamSynchronize(0) );
}
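For reference, reduce is meant to be the standard shared-memory tree reduction templated on the block size. The exact kernel isn't the point of the question, but a simplified sketch of what I mean (one element per thread, matching the blockDim * sizeof(double) dynamic shared memory in the launches above) is:

// Simplified sketch of the templated sum-reduction kernel.
// Each block reduces blockSize consecutive elements of `in` into out[blockIdx.x],
// using the dynamically allocated shared memory passed at launch.
// Assumes blockSize is a power of two.
template <unsigned int blockSize>
__global__ void reduce(const double *in, double *out)
{
    extern __shared__ double sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockSize + tid;

    sdata[tid] = in[idx];
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}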
I am compiling with --default-stream per-thread, which I believe means each CPU thread gets its own default stream. I also pass -Xcompiler -fopenmp and -Xlinker -lgomp. gpuERR is just the usual error-checking wrapper macro. I use cudaStreamSynchronize(0) rather than cudaDeviceSynchronize() so that only the calling thread's stream is waited on instead of the entire device being synchronized, which would defeat the purpose, I think.
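In case it matters, gpuERR is essentially the standard assert-style idiom (shown here only so the snippet above is self-contained):

#include <cstdio>
#include <cstdlib>

// Standard CUDA error-checking wrapper: print the error string and bail out.
#define gpuERR(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

and the full compile line is something like nvcc --default-stream per-thread -Xcompiler -fopenmp -Xlinker -lgomp -o reduce reduce.cu (file names here are just placeholders).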
I want to check whether this is a correct way to accomplish what I'm trying to do, because the result is in fact marginally slower than running the same reductions without the OpenMP/concurrent-stream setup…