Parallel kernel launches via OpenMP with --default-stream per-thread

I am trying to parallelize the reductions of multiple arrays using multiple concurrent kernels, each launched from a separate CPU thread under OpenMP. This looks like

#pragma omp parallel for num_threads(tot_num_thrds)
	for (int i = 0; i < n; i++) {
		// first-stage reduction into gridDim2 partial sums per array
		reduce<blockDim2><<< gridDim2, blockDim2, blockDim2 * sizeof(double) >>>(&redArray1[i * gridDim1], &redArray2[i * gridDim2]);
		gpuErr( cudaPeekAtLastError() );
		gpuErr( cudaStreamSynchronize(0) );
		// second-stage reduction of the partial sums into a single value
		reduce<blockDim3><<< 1, blockDim3, blockDim3 * sizeof(double) >>>(&redArray2[i * gridDim2], &redArray3[i]);
		gpuErr( cudaPeekAtLastError() );
		gpuErr( cudaStreamSynchronize(0) );
	}

I am compiling with --default-stream per-thread, which I believe means each CPU thread runs on its own default stream. I also pass -Xcompiler -fopenmp and -Xlinker -lgomp. gpuErr is the usual error-checking wrapper macro. I use cudaStreamSynchronize(0) rather than cudaDeviceSynchronize() so that only the calling thread's stream is synchronized, not the entire device, which would defeat the purpose, I think.
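For what it's worth, an equivalent sketch that doesn't depend on the --default-stream per-thread compile flag is to create one explicit stream per OpenMP thread and pass it as the fourth launch parameter. The kernel, array, and dimension names below are taken from the snippet above; the stream handling is the assumption here, not something from the original code.

```cuda
#include <omp.h>
#include <vector>

// One explicit stream per OpenMP thread; kernels launched on different
// streams may overlap on the device regardless of nvcc stream flags.
std::vector<cudaStream_t> streams(tot_num_thrds);
for (auto &s : streams) cudaStreamCreate(&s);

#pragma omp parallel for num_threads(tot_num_thrds)
for (int i = 0; i < n; i++) {
    cudaStream_t s = streams[omp_get_thread_num()];
    reduce<blockDim2><<< gridDim2, blockDim2,
                         blockDim2 * sizeof(double), s >>>(
        &redArray1[i * gridDim1], &redArray2[i * gridDim2]);
    gpuErr( cudaPeekAtLastError() );
    // Waits only on this thread's stream, not the whole device.
    gpuErr( cudaStreamSynchronize(s) );
    reduce<blockDim3><<< 1, blockDim3, blockDim3 * sizeof(double), s >>>(
        &redArray2[i * gridDim2], &redArray3[i]);
    gpuErr( cudaPeekAtLastError() );
    gpuErr( cudaStreamSynchronize(s) );
}

for (auto &s : streams) cudaStreamDestroy(s);
```

This also makes the intended concurrency visible in a profiler (e.g. one timeline row per stream), which is a quick way to confirm the launches actually overlap.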

I want to check that this is a correct way to accomplish what I'm trying to do, because the result is in fact marginally slower than the serial loop.

Why do you expect a speedup? The GPU already uses all of its ALUs to run each kernel invocation, so you may see improvements only from overlapping the tail of one kernel with the start of the next, or from overlapping PCIe transfers with computation.