Hi,
1.) I have a Numba kernel:
my_kernel[blockspergrid,threadsperblock](data_d)
Is it possible to execute multiple instances of ‘my_kernel’ with different blockspergrid, threadsperblock, and data_d in parallel? Or is there an alternative method to achieve such parallelism?
Can it be done using Numba CUDA streams?
2.) Also, if I have a function with multiple Numba kernels, how can it be done? Can I run multiple instances of ‘my_task’ in parallel with Python threads? Do Python threads work with streams?
For example,
from numba import cuda
from threading import Thread

def my_task(stream, data):
    data_d = cuda.to_device(data)
    my_kernel1[bgrids, tblocks, stream](data_d)
    stream.synchronize()
    my_kernel2[bgrids, tblocks, stream](data_d)
    stream.synchronize()
    my_kernel3[bgrids, tblocks, stream](data_d)
    stream.synchronize()
    return data_d.copy_to_host()

stream1 = cuda.stream()
stream2 = cuda.stream()
t1 = Thread(target=my_task, args=(stream1, data1))
t2 = Thread(target=my_task, args=(stream2, data2))
t1.start()
t2.start()
t1.join()
t2.join()
Will the above code work?
Thanks,
Yes, something like:
my_kernel[blockspergrid,threadsperblock,stream1](data_d1)
my_kernel[blockspergrid,threadsperblock,stream2](data_d2)
(it's not really necessary to have separate data, but presumably that would be the sane approach)
With the “launch” into separate streams (stream1, stream2), it presents the possibility for the two instances of my_kernel to run simultaneously or “in parallel”. However, actually witnessing concurrent kernel execution can be somewhat difficult; CUDA never guarantees simultaneity in this way. To have a reasonable chance to actually witness it (see the sketch after the list below), the kernel must:
- not use so many resources that it “fills up” the GPU, leaving no space for another kernel to run
- run for a long enough period of time to overcome launch latency and other timing uncertainties
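For illustration, here is a minimal self-contained sketch of the two-stream launch (the kernel body, array sizes, and iteration count are my own illustrative choices, not from the original post). The grid is kept deliberately small and the kernel spins long enough that overlap is plausible; a profiler such as Nsight Systems can show whether the two launches actually overlapped:

import numpy as np
from numba import cuda

@cuda.jit
def my_kernel(data, iters):
    i = cuda.grid(1)
    if i < data.size:
        acc = data[i]
        for _ in range(iters):   # busy-work so the kernel outlives launch latency
            acc = acc * 1.0000001
        data[i] = acc

threadsperblock = 128
blockspergrid = 2                # small on purpose, so one kernel cannot fill the GPU

stream1 = cuda.stream()
stream2 = cuda.stream()
data_d1 = cuda.to_device(np.ones(256, dtype=np.float32), stream=stream1)
data_d2 = cuda.to_device(np.ones(256, dtype=np.float32), stream=stream2)

# the two launches go into separate streams, so they *may* overlap
my_kernel[blockspergrid, threadsperblock, stream1](data_d1, 1_000_000)
my_kernel[blockspergrid, threadsperblock, stream2](data_d2, 1_000_000)
stream1.synchronize()
stream2.synchronize()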
You should be able to use threads and streams both, approximately as you have shown. Remember, kernels launched into the same stream (so those in a particular thread, in your example) will run sequentially. Only kernels launched into separate streams have the opportunity (not guarantee) to run concurrently.
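One practical wrinkle with your sketch: Thread discards the target's return value, so the host copy returned by my_task is lost. A common workaround, shown in the sketch below (the trivial kernel stands in for your my_kernel1/2/3, and all names and sizes are illustrative), is to have each thread store its result under its own key. It also drops the per-kernel synchronize() calls, since kernels issued into one stream already execute in order:

import numpy as np
from numba import cuda
from threading import Thread

@cuda.jit
def my_kernel(data):             # stand-in for my_kernel1/2/3 from the question
    i = cuda.grid(1)
    if i < data.size:
        data[i] += 1.0

bgrids, tblocks = 8, 128
results = {}                     # each thread writes only its own key

def my_task(key, stream, data):
    data_d = cuda.to_device(data, stream=stream)
    my_kernel[bgrids, tblocks, stream](data_d)   # same stream: these three
    my_kernel[bgrids, tblocks, stream](data_d)   # launches run one after
    my_kernel[bgrids, tblocks, stream](data_d)   # another, in issue order
    results[key] = data_d.copy_to_host(stream=stream)
    stream.synchronize()         # ensure the async copy back has finished

data1 = np.zeros(bgrids * tblocks, dtype=np.float32)
data2 = np.zeros(bgrids * tblocks, dtype=np.float32)
t1 = Thread(target=my_task, args=(1, cuda.stream(), data1))
t2 = Thread(target=my_task, args=(2, cuda.stream(), data2))
t1.start(); t2.start()
t1.join(); t2.join()
print(results[1][0], results[2][0])   # both should be 3.0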
This may also be of interest.