I am trying to run CUDA kernels inside two different pthreads. From what I have read, if we don’t specify an SM, the kernel goes to the default SM 0.
I have tried running on the default SM and on a specific SM, selecting it with the nvcc --default-stream per-thread flag as described at https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
But when I run it, the kernels don’t execute at all.
Can somebody guide me on how to run CUDA kernels in two different pthreads on two different SMs?
Any example would help.
Thanks in advance.
What do you mean by “two different SMs”? What do you mean by SM specifically?
In my understanding, SM means streaming multiprocessor, which is a CUDA GPU hardware building block (component).
CUDA programs (whether multi-threaded or not) don’t normally target specific SMs for execution, and using CUDA Streams also does not target specific SMs.
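Since the original post asked for an example, here is a minimal sketch of two pthreads that each launch a kernel on their own CUDA stream. It does not select an SM anywhere; the hardware decides where the blocks run. The kernel `addValue`, the struct `WorkerArgs`, and the functions `worker` and `launchFromTwoThreads` are names made up for illustration, and the sketch assumes a single GPU and a build along the lines of `nvcc two_threads.cu -lpthread`.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical kernel: adds `value` to every element of `data`.
__global__ void addValue(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

// Per-thread input and output (illustrative names).
struct WorkerArgs {
    int value;   // value the kernel adds
    int result;  // first element copied back from the device
    int ok;      // 1 if every CUDA call succeeded
};

// Each pthread creates its own stream and issues all of its work on it.
static void *worker(void *p)
{
    WorkerArgs *args = static_cast<WorkerArgs *>(p);
    const int n = 1 << 20;
    int *d_data = nullptr;
    cudaStream_t stream = nullptr;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc(&d_data, n * sizeof(int));
    if (err == cudaSuccess) err = cudaMemsetAsync(d_data, 0, n * sizeof(int), stream);
    if (err == cudaSuccess) {
        addValue<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, args->value);
        err = cudaGetLastError();  // reports launch-time errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(&args->result, d_data, sizeof(int),
                              cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    args->ok = (err == cudaSuccess);
    if (!args->ok) fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    if (stream) cudaStreamDestroy(stream);
    return nullptr;
}

// Starts two pthreads and returns how many of them failed.
int launchFromTwoThreads()
{
    pthread_t threads[2];
    WorkerArgs args[2] = {{1, -1, 0}, {2, -1, 0}};
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], nullptr, worker, &args[i]);

    int failures = 0;
    for (int i = 0; i < 2; ++i) {
        pthread_join(threads[i], nullptr);
        if (!args[i].ok || args[i].result != args[i].value) ++failures;
        printf("thread %d: added %d, first element = %d\n",
               i, args[i].value, args[i].result);
    }
    return failures;
}

int main()
{
    return launchFromTwoThreads() == 0 ? 0 : 1;
}
```

Whether the two kernels actually overlap depends on the GPU and on how many resources each kernel needs; the streams only make overlap possible.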
Thanks for your quick response…
Yes, by SM I do mean streaming multiprocessor.
OK, I get your point. I misunderstood the concepts of streams and SMs, but:
- How many streams can we create?
- Do we have any control over SMs? How can we utilize SMs optimally?
You can create a lot of streams. I don’t know how many, but there is no practical limit AFAIK. In most cases, it should not be necessary to create more than 3-4 streams per device, since streams can be reused.
You have very little control over scheduling of work onto SMs. For those who are trying to do unusual things, there are some tricks that can be played:
https://devtalk.nvidia.com/default/topic/932044/cuda-programming-and-performance/how-to-limit-number-of-cuda-cores/
but these would not be ordinary CUDA programming techniques. In general, the machine handles scheduling of work onto SMs, and you have essentially no control over the detailed scheduling.
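On the point about reusing a small number of streams: a rough sketch of what that can look like is below. It creates a few streams once and hands out many kernel launches to them round-robin. The kernel `tagChunk`, the function `reuseStreams`, and the sizes are assumptions made up for the example, not anything from a specific application.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: tags each element of a chunk with the chunk index.
__global__ void tagChunk(int *chunk, int n, int tag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = tag;
}

// Launches numChunks kernels over numStreams reused streams.
// Returns the number of wrong elements, or -1 on a CUDA error.
int reuseStreams(int numStreams, int numChunks)
{
    const int chunkSize = 1 << 16;
    const size_t total = static_cast<size_t>(numChunks) * chunkSize;

    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    int *d_data = nullptr;
    cudaMalloc(&d_data, total * sizeof(int));

    // Round-robin: chunk c is issued to stream c % numStreams.
    for (int c = 0; c < numChunks; ++c) {
        tagChunk<<<(chunkSize + 255) / 256, 256, 0, streams[c % numStreams]>>>(
            d_data + static_cast<size_t>(c) * chunkSize, chunkSize, c);
    }

    // Wait for all streams before reading the results back.
    cudaError_t err = cudaDeviceSynchronize();
    std::vector<int> h_data(total);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_data.data(), d_data, total * sizeof(int),
                         cudaMemcpyDeviceToHost);

    int wrong = 0;
    if (err == cudaSuccess) {
        for (size_t i = 0; i < total; ++i)
            if (h_data[i] != static_cast<int>(i / chunkSize)) ++wrong;
    }

    cudaFree(d_data);
    for (auto &s : streams) cudaStreamDestroy(s);
    return err == cudaSuccess ? wrong : -1;
}

int main()
{
    // 16 launches shared across 4 streams.
    int wrong = reuseStreams(4, 16);
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Work issued to the same stream runs in order, so reusing a stream only serializes the launches that share it; launches on different streams remain free to overlap.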
Kepler and later GPUs can execute grids from up to 32 streams simultaneously, so it’s better to limit the number of streams to that.
Can someone point to an example of multiple CPU processes interacting with multiple GPU threads over multiple streams?
I am running two pthreads, thread1 and thread2. thread1 launches 5 to 6 kernels and thread2 launches only one kernel.
The kernel launches in thread1 do not use streams, while thread2 launches its kernel on a stream.
After the application starts, everything works fine for a while, but then no more kernels are launched and the application hangs.
What could be preventing the kernel launches in this case?
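Not a diagnosis of this particular hang, but a common first step is to check the return status of every CUDA call and of every kernel launch, since a failed launch is otherwise silent. The sketch below shows one way to do that. The macro `CUDA_CHECK`, the kernel `touch`, and the function `launchChecked` are hypothetical names. It also creates the stream with `cudaStreamNonBlocking`, which is an assumption about what might be wanted here: launches without a stream go to the legacy default stream, which implicitly synchronizes with streams created by plain `cudaStreamCreate`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and exit if a CUDA call fails.
#define CUDA_CHECK(call)                                         \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical kernel: sets a flag so the host can see that it ran.
__global__ void touch(int *flag) { *flag = 1; }

// Launches one kernel with full error checking; returns the flag value.
int launchChecked()
{
    int *d_flag = nullptr;
    int h_flag = 0;
    cudaStream_t stream;

    // This stream does not implicitly synchronize with the legacy
    // default stream used by launches that name no stream.
    CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
    CUDA_CHECK(cudaMalloc(&d_flag, sizeof(int)));
    CUDA_CHECK(cudaMemsetAsync(d_flag, 0, sizeof(int), stream));

    touch<<<1, 1, 0, stream>>>(d_flag);
    CUDA_CHECK(cudaGetLastError());             // errors at launch time
    CUDA_CHECK(cudaStreamSynchronize(stream));  // errors during execution

    CUDA_CHECK(cudaMemcpy(&h_flag, d_flag, sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_flag));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return h_flag;
}

int main()
{
    printf("flag = %d\n", launchChecked());
    return 0;
}
```

If every call reports success and the application still stalls, the next thing to look at would be where each host thread is blocked, for example a synchronizing call that is waiting on work from the other thread.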