Default Stream per Thread - Driver API

Hi,
I was looking into using --default-stream per-thread, or the equivalent #define CUDA_API_PER_THREAD_DEFAULT_STREAM, for driver API calls (specifically the FFmpeg cuvid decoder), but couldn’t find any documentation about it.

The program loads the DLL dynamically, so there is no nvcc invocation to which I could add --default-stream per-thread, and defining CUDA_API_PER_THREAD_DEFAULT_STREAM will not affect the functions loaded from the DLL.

Looking into cuda.h, I saw the following macros, which take effect when CUDA_API_PER_THREAD_DEFAULT_STREAM is defined:

#if defined(__CUDA_API_VERSION_INTERNAL) || defined(CUDA_API_PER_THREAD_DEFAULT_STREAM)
    #define __CUDA_API_PER_THREAD_DEFAULT_STREAM
    #define __CUDA_API_PTDS(api) api ## _ptds
    #define __CUDA_API_PTSZ(api) api ## _ptsz
#else
    #define __CUDA_API_PTDS(api) api
    #define __CUDA_API_PTSZ(api) api
#endif

These macros are then used for some of the APIs as follows:

#define cuMemcpyHtoD                        __CUDA_API_PTDS(cuMemcpyHtoD_v2)
...
#define cuMemcpy2D                          __CUDA_API_PTDS(cuMemcpy2D_v2)
...
#define cuStreamSynchronize                 __CUDA_API_PTSZ(cuStreamSynchronize)
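
To make the expansion concrete, here is a minimal sketch that re-creates the two suffix macros under hypothetical names (MY_PTDS, MY_PTSZ, chosen so nothing clashes with cuda.h) and stringifies the result. It only shows which symbol names the preprocessor would produce when the per-thread define is active. It does not touch the driver, and the helper function names are my own.

```c
/* Hypothetical stand-ins for __CUDA_API_PTDS / __CUDA_API_PTSZ, renamed so
   this file compiles without the CUDA toolkit installed. */
#define MY_PTDS(api) api ## _ptds
#define MY_PTSZ(api) api ## _ptsz

/* Two-level stringification: the argument is macro-expanded first,
   then turned into a string literal. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Symbol name that cuMemcpyHtoD would map to with the per-thread define. */
const char *memcpy_htod_symbol(void)
{
    return STR(MY_PTDS(cuMemcpyHtoD_v2));
}

/* Symbol name that cuMemcpy2D would map to with the per-thread define. */
const char *memcpy_2d_symbol(void)
{
    return STR(MY_PTDS(cuMemcpy2D_v2));
}

/* Symbol name that cuStreamSynchronize would map to with the per-thread define. */
const char *stream_sync_symbol(void)
{
    return STR(MY_PTSZ(cuStreamSynchronize));
}
```

So with the define in place, a call written as cuMemcpyHtoD(...) is compiled as a call to cuMemcpyHtoD_v2_ptds(...), and cuStreamSynchronize(...) becomes cuStreamSynchronize_ptsz(...).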

Does this mean that by dynamically loading the _ptds / _ptsz versions of the APIs used in FFmpeg, I would be able to achieve “default stream per thread” behaviour?

As indicated here:

One possible approach is to explicitly access the per-thread default stream.

In the CUDA runtime API, that stream has a particular handle:

cudaStreamPerThread

In the CUDA driver API, that handle is:

CU_STREAM_PER_THREAD

CUDA Driver API :: CUDA Toolkit Documentation
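
As a rough illustration of passing that handle explicitly, here is a sketch for a program that reaches the driver through function pointers resolved at run time. Several things are assumptions on my part: the DriverFns table and upload_on_thread_stream are hypothetical names, the type declarations are stand-ins that mirror cuda.h so the sketch compiles without the toolkit (real code would include cuda.h or the loader's own header), and the two pointers are meant to be filled with the driver's cuMemcpyHtoDAsync and cuStreamSynchronize entry points. The recording stubs at the end exist only so the logic can be exercised without a GPU.

```c
#include <stddef.h>

/* Stand-in declarations mirroring cuda.h (assumption: real code uses <cuda.h>). */
typedef int CUresult;                    /* CUDA_SUCCESS is 0 in cuda.h */
typedef unsigned long long CUdeviceptr;
typedef struct CUstream_st *CUstream;
#define CU_STREAM_PER_THREAD ((CUstream)0x2)

/* Hypothetical table of driver entry points resolved at run time. */
typedef struct {
    /* expected to point at cuMemcpyHtoDAsync */
    CUresult (*memcpy_htod_async)(CUdeviceptr dst, const void *src,
                                  size_t bytes, CUstream stream);
    /* expected to point at cuStreamSynchronize */
    CUresult (*stream_synchronize)(CUstream stream);
} DriverFns;

/* Enqueue a host-to-device copy on the calling thread's own default stream,
   then wait for that stream only. */
CUresult upload_on_thread_stream(const DriverFns *fns, CUdeviceptr dst,
                                 const void *src, size_t bytes)
{
    CUresult rc = fns->memcpy_htod_async(dst, src, bytes, CU_STREAM_PER_THREAD);
    if (rc != 0)
        return rc;
    return fns->stream_synchronize(CU_STREAM_PER_THREAD);
}

/* Recording stubs: remember which stream each call received. */
CUstream last_copy_stream;
CUstream last_sync_stream;

CUresult fake_copy(CUdeviceptr dst, const void *src, size_t bytes, CUstream s)
{
    (void)dst; (void)src; (void)bytes;
    last_copy_stream = s;
    return 0;
}

CUresult fake_sync(CUstream s)
{
    last_sync_stream = s;
    return 0;
}
```

Note that this only helps at call sites that take a stream argument and that you can change. Calls that implicitly use the default stream are not affected by it.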

Thanks!
I don’t know how I missed it :)