I found that cublasDgemm() runs about 2 times slower after I call curandGenerateUniformDouble().
It seems the cuRAND host APIs change the shared memory configuration from 48KB to 16KB and never change it back,
which hurts the performance of every call that follows the cuRAND APIs.
Any ideas on how to solve this?