I am using CUDA streams to launch kernels in parallel (CUDA version 7.5).
My code looks like this:
//create stream array
//launch kernel one with stream
//launch kernel two with stream
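A minimal sketch of that structure is shown here. The kernels `kernelOne` and `kernelTwo`, their bodies, the buffer sizes, and the use of exactly two streams are all hypothetical placeholders, since the real code is not shown. Only the grid and block sizes (37 and 72 blocks of 128 threads) are taken from the profiler output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; the real kernels are not shown in the post.
__global__ void kernelOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void kernelTwo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kNumStreams = 2;   // assumption: one stream per kernel
    const int threads = 128;     // block size from the profiler output
    const int blocksOne = 37;    // grid size of kernel 1
    const int blocksTwo = 72;    // grid size of kernel 2
    const int nOne = blocksOne * threads;
    const int nTwo = blocksTwo * threads;

    // Separate device buffers so the two kernels have no data dependency.
    float *dOne = nullptr, *dTwo = nullptr;
    cudaMalloc(&dOne, nOne * sizeof(float));
    cudaMalloc(&dTwo, nTwo * sizeof(float));
    cudaMemset(dOne, 0, nOne * sizeof(float));
    cudaMemset(dTwo, 0, nTwo * sizeof(float));

    // create stream array
    cudaStream_t streams[kNumStreams];
    for (int i = 0; i < kNumStreams; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    // launch kernel one with stream
    kernelOne<<<blocksOne, threads, 0, streams[0]>>>(dOne, nOne);
    // launch kernel two with stream
    kernelTwo<<<blocksTwo, threads, 0, streams[1]>>>(dTwo, nTwo);

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < kNumStreams; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(dOne);
    cudaFree(dTwo);
    return err == cudaSuccess ? 0 : 1;
}
```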
launch kernel 1: 90.76 duration: .02 gpu occupancy: 12% (37 grid, 128 block)
launch kernel 2: 90.78 duration: .02 gpu occupancy: 27% (72 grid, 128 block)
So the kernels do not run concurrently, even though nvprof reports that they are in different streams: kernel 2 starts only when kernel 1 finishes. I assume I have enough resources to run multiple kernels in parallel. I don't see any performance improvement either, and I am not sure what might cause this behavior.
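The assumption about available resources could be checked by querying the device properties. The sketch below assumes device 0 is the GPU in use and only prints a few fields of `cudaDeviceProp`. It does not by itself explain why the kernels serialize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;  // assumption: the kernels run on device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // concurrentKernels is 1 if the device can run multiple kernels at once.
    printf("device: %s\n", prop.name);
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```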
Thank you in advance.