Using cudaStream, but nvprof shows sequential launch

Hi,
I am using CUDA streams to launch kernels in parallel (CUDA version 7.5).
My code looks like this:

//create stream array
//loop
//launch kernel one with stream[0]
//launch kernel two with stream[1]

//deviceSynchronize();
//end loop

nvprof result:
launch kernel 1: 90.76 duration: .02 gpu occupancy: 12% (37 grid, 128 block)
launch kernel 2: 90.78 duration: .02 gpu occupancy: 27% (72 grid, 128 block)

So none of the kernels launch concurrently, even though nvprof shows them in different streams. I assume I have enough resources to run multiple kernels in parallel. I don't see any performance improvement either. I'm not sure what might cause this behavior.

Thank you in advance.

Show me your code :) and the compilation command.

What card are you using?

K40c.

//sample code
cudaStream_t stream[10];
for (int i = 0; i < 10; i++) {
    cudaStreamCreate(&stream[i]);
}
kernel1<<<grid, block, 0, stream[0]>>>(a, b);
kernel2<<<grid, block, 0, stream[1]>>>(a, b);

Compiling with the usual command: nvcc -O3 -w -gencode arch=compute_35,code=sm_35 -rdc=true -Xcompiler test.cu
I also tried adding #define CUDA_API_PER_THREAD_DEFAULT_STREAM 1 at the top. It didn’t change anything.

Is that what you were asking for? Please let me know. My original code is huge and includes 3 header files, which is why I didn’t post it.
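
For anyone who wants to reproduce this without the full project, here is a minimal sketch of the two-stream launch pattern from the sample above. It is not the original code: busyKernel and runTwoStreams are hypothetical names, each launch gets its own buffer, and the iteration count is an assumption chosen so that each kernel runs far longer than a launch takes to issue. The idea is to rule out one possible explanation: if a kernel finishes before the next launch reaches the GPU, the timeline will look sequential even with separate streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: a dependent multiply-add chain that keeps
// each thread busy long enough for overlap to be visible in a profiler.
__global__ void busyKernel(float *data, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int i = 0; i < iters; ++i) {
            v = v * 1.000001f + 0.5f;
        }
        data[idx] = v;
    }
}

// Launches busyKernel once in each of two non-default streams.
// Returns 0 on success, 1 on any CUDA error (early returns skip cleanup,
// which is acceptable for a throwaway test but not for production code).
int runTwoStreams()
{
    const int block = 128;
    const int grid  = 37;        // grid size taken from the first nvprof line
    const int n     = grid * block;
    const int iters = 1 << 20;   // assumption: long enough to outlast launch overhead

    float *buf[2] = {0, 0};
    cudaStream_t stream[2];

    for (int i = 0; i < 2; ++i) {
        if (cudaMalloc(&buf[i], n * sizeof(float)) != cudaSuccess) return 1;
        if (cudaMemset(buf[i], 0, n * sizeof(float)) != cudaSuccess) return 1;
        if (cudaStreamCreate(&stream[i]) != cudaSuccess) return 1;
    }

    // Back-to-back launches with no default-stream work in between.
    for (int i = 0; i < 2; ++i) {
        busyKernel<<<grid, block, 0, stream[i]>>>(buf[i], n, iters);
    }
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr   = cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(buf[i]);
    }

    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s / %s\n",
                cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
        return 1;
    }
    return 0;
}
```

Call runTwoStreams() from main, build with nvcc for sm_35, and inspect the run with nvprof --print-gpu-trace. If the two busyKernel entries overlap in time here but the real kernels do not, the streams themselves are probably fine and the difference lies in the kernels or in what is issued between the launches.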

It’s possible that you don’t have enough resources to run both kernels concurrently.
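
Whether the device has room for both kernels can be checked rather than assumed. The sketch below is only an illustration: kernel1 here is a trivial placeholder with the same name as the one in the post (the real kernel's register and shared memory use will give different numbers), and residentBlockCapacity is a hypothetical helper. It uses cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor from the CUDA runtime to estimate how many blocks of one kernel can be resident on the whole GPU at once.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for kernel1 from the post; substitute the real kernel so the
// occupancy query reflects its actual register and shared memory usage.
__global__ void kernel1(float *a, float *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    b[idx] = a[idx] + 1.0f;
}

// Estimates how many blocks of kernel1 can be resident on the device at once
// for the given block size. Also reports whether the device supports
// concurrent kernel execution at all. Returns -1 on any CUDA error.
int residentBlockCapacity(int device, int blockSize, int *concurrentKernels)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int blocksPerSm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSm, kernel1, blockSize, 0) != cudaSuccess) return -1;

    *concurrentKernels = prop.concurrentKernels;
    return blocksPerSm * prop.multiProcessorCount;
}
```

Calling residentBlockCapacity(0, 128, &flag) and comparing the result with the grid sizes in the nvprof output (37 and 72 blocks) gives a rough idea of whether a second kernel could fit alongside the first. This only counts resident blocks, so a capacity above the combined grid size does not guarantee overlap; it just makes a resource shortage less likely as the explanation.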