I am trying to use streams on CUDA 10.1 to run kernels in parallel, each launched as a 4x1 grid of 16x16-thread blocks.
I call createStream on each element of an array of cudaStream_t and get a distinct handle back for each one.
The code runs slowly, but what Nsight Systems 2019.5.1 shows is a bit vague.
For some reason there are 2 contexts, and it looks like the streams held in single variables exist, while the ones in the array are all given the same number. I will try to type out the tree by hand, since there is no cut-and-paste. The v and > are the tree arrows (expanded and collapsed):
> CPU(56)
v processes(2)
    v mtcnn_p2.exe
        > threads (144)
        v CUDA (GeForce RTX 2080 Ti, 0000:02:00:0)
            v 1.4% Context 1
                v 82.0% Stream 10
                    v 100% Memory
                        100.0% HtoD memcpy //32 bytes - 2us. Should be before each kernel call on each stream but all show stream 10
                v 7.4% Stream 14
                    100.0% Kernels
                        BGRAsurfaceWriteKernel 36us
                v 7.4% Default stream (7)
                    > 82.0% Kernels
                    > 17.9% Memory
                        11.2% HtoD
                        88.8% DtoD
                v 3.2% Stream 151
                    v 100% memory
                        100.0% DtoA memcpy
            v 98.6% Context 2 with tooltip "Combined view with less than 1% impact"
                100.0% other kernels
This second context holds the work that needs to be parallelized across streams.
Each kernel launch of <<< [4,1],[16,16],0, stream>>> is passed a different stream on the call.
But in this view they all show the same stream ID???
In the events view there is a column called Context that shows “Stream 2147483647” for every kernel, and the kernels are shown running strictly one after the other, which stretches the work out.
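Now obviously 2147483647 is INT_MAX for a signed 32-bit int (0x7FFFFFFF), so it looks like a placeholder value, much like a -1, rather than a real stream ID :-(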
I have spent many, many hours on this and tried the defines and compiler flags that change the default-stream behavior (per-thread default stream and so on), and none of them seems to make a difference.
If it weren’t running so damn slow, I might think the tool is just collapsing the stream view and hiding all the stream IDs behind that one placeholder value.
Also, each stream should get one memcpy and one kernel launch, but instead of one of each on each stream, I see all the memcpys collected into Stream 10 and all the GetBlockImagePatchesFromTextureKernel launches on Stream 2147483647 in “Context 2”.
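For reference, here is a minimal sketch of the pattern described above: one small host-to-device copy followed by one kernel launch on each stream. It is not the actual application code. The names patchKernel and runStreams, the buffer names, the stream count of 4, and the kernel body are all placeholders. It also assumes pinned host memory with cudaMemcpyAsync, since copies from pageable host memory generally do not overlap with other work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function on any CUDA error (sketch-level handling).
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s failed: %s\n", #call,                      \
                    cudaGetErrorString(err_));                             \
            return -1;                                                     \
        }                                                                  \
    } while (0)

// Hypothetical stand-in for the real kernel: each thread writes one value.
__global__ void patchKernel(const float* params, float* out)
{
    int width = gridDim.x * blockDim.x;               // 64 for a 4x1 grid of 16x16 blocks
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // 0..63
    int y = threadIdx.y;                              // 0..15
    out[y * width + x] = params[0] + x + y;
}

// Issues one 32-byte HtoD copy and one kernel launch per stream.
// Returns 0 on success, -1 on the first CUDA error.
int runStreams(int nStreams)
{
    const int kParams = 8;        // 8 floats = 32 bytes per copy (assumed layout)
    const int kOut = 64 * 16;     // one output element per thread

    cudaStream_t* streams = new cudaStream_t[nStreams];
    float* hParams = nullptr;     // pinned host buffer, needed for truly async copies
    float* dParams = nullptr;
    float* dOut = nullptr;

    CHECK(cudaMallocHost(&hParams, nStreams * kParams * sizeof(float)));
    CHECK(cudaMalloc(&dParams, nStreams * kParams * sizeof(float)));
    CHECK(cudaMalloc(&dOut, nStreams * kOut * sizeof(float)));
    for (int i = 0; i < nStreams * kParams; ++i) hParams[i] = float(i);

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    dim3 grid(4, 1), block(16, 16);
    for (int s = 0; s < nStreams; ++s) {
        // The copy and the launch share streams[s], so they are ordered with
        // respect to each other but are free to overlap with other streams.
        CHECK(cudaMemcpyAsync(dParams + s * kParams, hParams + s * kParams,
                              kParams * sizeof(float),
                              cudaMemcpyHostToDevice, streams[s]));
        patchKernel<<<grid, block, 0, streams[s]>>>(dParams + s * kParams,
                                                    dOut + s * kOut);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());   // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(dOut));
    CHECK(cudaFree(dParams));
    CHECK(cudaFreeHost(hParams));
    delete[] streams;
    return 0;
}

int main()
{
    return runStreams(4) == 0 ? 0 : 1;
}
```

With this shape I would expect the profiler to show one memcpy and one kernel under each of the created streams, which is what I am not seeing.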
Any insight (no pun intended) would be greatly appreciated!