cuRAND host APIs hurt the performance of subsequent cuBLAS calls

I found that cublasDgemm() runs about 2x slower after a call to curandGenerateUniformDouble().

It seems the cuRAND host APIs change the shared memory configuration from 48 KB to 16 KB and never change it back,
which hurts the performance of every call made after the cuRAND APIs.

Any ideas on how to solve this?

I would suggest filing a bug via the registered developer website, attaching a self-contained repro case that demonstrates the issue, so the library team can have a look. Thanks!
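
For what it's worth, a self-contained repro could look something like the sketch below. It times cublasDgemm() before and after one curandGenerateUniformDouble() call and prints the device-wide cache preference at both points. The matrix size, repetition count, seed, zero-filled inputs, and the helper names time_dgemm and print_cache_config are all assumptions made up for illustration, not taken from the original report. Error checking is omitted for brevity. Also note that cudaDeviceGetCacheConfig() only reports the device-wide preference, so it may not reveal a per-kernel setting.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <curand.h>

// Hypothetical helper: average time in milliseconds of one n x n DGEMM over `reps` calls.
static float time_dgemm(cublasHandle_t handle, int n,
                        const double* A, const double* B, double* C, int reps) {
    const double alpha = 1.0, beta = 0.0;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time initialization is not included in the timing.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) {
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Hypothetical helper: print the device-wide cache preference.
// Enum values: 0 = PreferNone, 1 = PreferShared, 2 = PreferL1, 3 = PreferEqual.
static void print_cache_config(const char* label) {
    cudaFuncCache cfg;
    cudaDeviceGetCacheConfig(&cfg);
    printf("%s: cudaFuncCache = %d\n", label, (int)cfg);
}

int main() {
    const int n = 2048;      // assumed matrix dimension
    const int reps = 10;     // assumed number of timed calls
    const size_t count = (size_t)n * n;

    double *A, *B, *C;
    cudaMalloc(&A, count * sizeof(double));
    cudaMalloc(&B, count * sizeof(double));
    cudaMalloc(&C, count * sizeof(double));
    cudaMemset(A, 0, count * sizeof(double));
    cudaMemset(B, 0, count * sizeof(double));

    cublasHandle_t blas;
    cublasCreate(&blas);

    print_cache_config("before cuRAND");
    float before = time_dgemm(blas, n, A, B, C, reps);

    // The call suspected of changing the configuration.
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniformDouble(gen, A, count);
    cudaDeviceSynchronize();

    print_cache_config("after cuRAND");
    float after = time_dgemm(blas, n, A, B, C, reps);

    printf("DGEMM before cuRAND: %.3f ms/call\n", before);
    printf("DGEMM after  cuRAND: %.3f ms/call\n", after);

    curandDestroyGenerator(gen);
    cublasDestroy(blas);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
```

This should build with something like `nvcc repro.cu -lcublas -lcurand`. If the second timing really is slower, one experiment worth trying (not a confirmed fix) is calling cudaDeviceSetCacheConfig(cudaFuncCachePreferShared) after the cuRAND call and timing DGEMM again.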