Concurrent streams and Hyper-Q on K20

Hi,

I coded a simple Hyper-Q module, but Hyper-Q does not seem to work: I get the same performance as I would on a compute capability 3.0 card (GTX 680). Any clues as to what could be going wrong?

I launch into 32 streams from a for loop. There is some global memory array data that all the streams read at some point; would that cause any drawbacks?
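A quick sanity check before digging further, as a rough sketch only: Hyper-Q is available from compute capability 3.5 (the K20) and not on 3.0 parts such as the GTX 680, so it is worth confirming which device the program actually runs on and that it reports concurrent kernel support. The device index 0 below is an assumption; on a multi-GPU machine it may not be the K20.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int dev = 0;  // assumption: the K20 is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Hyper-Q needs compute capability 3.5 or higher.
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // Non-zero if the device can run kernels from different streams at the same time.
    std::printf("concurrentKernels = %d\n", prop.concurrentKernels);
    std::printf("multiprocessors   = %d\n", prop.multiProcessorCount);
    std::printf("shared mem/block  = %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```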

Cheers,
Thanasio

This is the code snippet:

for (int i = 0; i < NPROCS; ++i) {
    kF<<<numBlocks, 256, 0, nstreams[i * 4 + 0]>>>(st, cSim, cGridSim.octDists, i);
    kE<<<numBlocks, 256, 0, nstreams[i * 4 + 1]>>>(st, cSim, cGridSim.octDists, i);
    mbfsGrads<<<blocksPerGrid, nThr, 0, nstreams[i * 4 + 2]>>>(st, devMbfs, i);
    eMBFs<<<blocksPerGrid, nThr, sMemSize, nstreams[i * 4 + 3]>>>(st, devMbfs, numBlocks, i);
}
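One way to separate a Hyper-Q problem from a problem with the kernels themselves is to run the same launch pattern with a trivial kernel. The sketch below is not the code above: spin, kStreams, kCycles and timeLaunches are made-up names, and the kernel is a stand-in that simply busy-waits in a single thread so that it cannot fill the GPU. It needs a C++11-capable nvcc for the host timing.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: busy-waits for roughly `cycles` clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launches one spin kernel into each of the first n streams and
// returns the host wall time in milliseconds until all have finished.
static double timeLaunches(cudaStream_t* streams, int n, long long cycles) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        spin<<<1, 1, 0, streams[i]>>>(cycles);
    }
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int kStreams = 32;               // same stream count as in the post
    const long long kCycles = 50000000LL;  // arbitrary spin length
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        if (cudaStreamCreate(&streams[i]) != cudaSuccess) {
            std::fprintf(stderr, "cudaStreamCreate failed for stream %d\n", i);
            return 1;
        }
    }

    timeLaunches(streams, 1, kCycles);  // warm-up, result discarded
    double one = timeLaunches(streams, 1, kCycles);
    double all = timeLaunches(streams, kStreams, kCycles);
    std::printf("1 kernel: %.2f ms, %d kernels: %.2f ms, ratio %.2f\n",
                one, kStreams, all, all / one);

    for (int i = 0; i < kStreams; ++i) cudaStreamDestroy(streams[i]);
    return 0;
}
```

A ratio close to 1 would suggest the kernels overlapped, while a ratio close to 32 would suggest they ran one after another. If this toy case overlaps but the real kernels do not, the limit is more likely in the kernels' resource use than in the stream setup.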

The first two kernels use 10 KB of shared memory and constant memory. They read from the same global memory addresses but write to different global memory locations.

The last two read from the same global memory locations and write to different global memory locations.
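Kernels from different streams can only overlap if blocks of more than one kernel fit on the device at once, so per-block resource use (such as the 10 KB of shared memory mentioned above) and grid size matter. The sketch below shows how the runtime can report this; usesShared is a hypothetical kernel with 10 KB of static shared memory, not kF or kE, and the occupancy call assumes CUDA 6.5 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with 10 KB of static shared memory (2560 floats).
__global__ void usesShared(float* out) {
    __shared__ float buf[2560];
    for (int j = threadIdx.x; j < 2560; j += blockDim.x) buf[j] = (float)j;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[2559 - threadIdx.x];
}

int main() {
    const int blockSize = 256;  // block size used for kF/kE in the post

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, usesShared) != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, usesShared, blockSize, 0) != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // assumption: device 0
        std::fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    std::printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    std::printf("registers per thread:           %d\n", attr.numRegs);
    std::printf("resident blocks per SM:         %d\n", blocksPerSM);
    std::printf("resident blocks on the device:  %d\n",
                blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```

If a single launch has at least as many blocks as the last number printed, that kernel fills the GPU by itself, and kernels in other streams can only start as its blocks retire, which would look like serialization in the profiler.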

Am I doing something wrong?

Attached is the profiler output. The only truly overlapping kernels are kF/kE.

This question has been answered here:

http://stackoverflow.com/questions/14980254/cuda-stream-processing-for-multiple-kernels-disambiguation