Multi-thread, multi-stream optimization with cuBLAS

I just want to know whether there are any performance optimization points for this multi-threaded architecture. I have 32 or more threads (the thread count is configurable) all doing the same task. Each thread creates a single stream and has its own memory resources (pairs of matrices A, B, C). Assume I have 8 GPUs; in that case, each GPU serves 32/8 = 4 threads.
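For context, the per-thread resources are created roughly like this before the worker threads start. This is a simplified sketch, not my exact code; the round-robin device mapping and names such as numThreads/numGpus are illustrative:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

struct threadCtx {
    int            deviceId;   // 0~7
    cublasHandle_t handle;     // one cuBLAS handle per thread
    cudaStream_t   stream;     // one stream per thread
    // ... this thread's own A/B/C device buffers ...
};

std::vector<threadCtx> createContexts(int numThreads, int numGpus)
{
    std::vector<threadCtx> ctxs(numThreads);
    for (int t = 0; t < numThreads; ++t) {
        threadCtx &c = ctxs[t];
        c.deviceId = t % numGpus;              // 32 threads over 8 GPUs -> 4 threads per GPU
        cudaSetDevice(c.deviceId);             // resources below are created on this device
        cudaStreamCreate(&c.stream);
        cublasCreate(&c.handle);
        cublasSetStream(c.handle, c.stream);   // bind the handle to this thread's stream
        // ... allocate this thread's A, B, C matrices on the same device ...
    }
    return ctxs;
}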

The overall design is a producer/consumer model: all 32 threads are consumers, which means each thread's body is invoked many times. There is no H2D memory transfer; there are only GemmBatched calls and a D2H copy.
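To make the call pattern concrete, each consumer thread sits in a loop roughly like the one below and runs the per-thread body (shown afterwards) once per task. The queue and Task types here are illustrative placeholders for my real dispatch code:

#include <condition_variable>
#include <mutex>
#include <queue>

struct Task { /* pointers to this request's A/B/C batches, sizes, ... */ };

std::queue<Task>        g_tasks;
std::mutex              g_mtx;
std::condition_variable g_cv;

void consumerLoop(/* threadCtx *dataPtr */)
{
    for (;;) {
        Task task;
        {
            // wait for the producer to enqueue work
            std::unique_lock<std::mutex> lock(g_mtx);
            g_cv.wait(lock, [] { return !g_tasks.empty(); });
            task = g_tasks.front();
            g_tasks.pop();
        }
        // run the per-thread body shown below: GemmBatchedEx calls plus the
        // D2H copy, all issued on this thread's own stream
    }
}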

Each thread's body is structured like the code below:

Thread(threadCtx *dataPtr)
{
    // dataPtr->deviceId is the GPU index, 0~7
    // dataPtr->handle is a cublasHandle_t created beforehand; each thread has its own handle
    // dataPtr->stream is a cudaStream_t created beforehand; each thread has its own stream

    cudaSetDevice(dataPtr->deviceId);
    // because each GPU serves 4 threads, there are 4 streams on each GPU
    cublasSetStream(dataPtr->handle, dataPtr->stream);

    for (int i = 0; i < n; i++) {
        // each thread issues many GemmBatchedEx calls on its own stream
        cublasGemmBatchedEx(dataPtr->handle, A, B, C, .....);
    }

    // copy results to host (D2H)
    cudaMemcpyAsync(.....);

    // do something else

    // sync only this thread's stream
    cudaStreamSynchronize(dataPtr->stream);
}
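
For completeness, here is a fleshed-out sketch of the same worker with the full cublasGemmBatchedEx argument list. It assumes FP32 batched GEMMs, device-resident pointer arrays for A/B/C (as GemmBatchedEx requires), and a pinned host buffer for the D2H copy; the field names and the compute/algo choices are illustrative, not my exact code:

#include <cublas_v2.h>
#include <cuda_runtime.h>

struct threadCtx {
    int            deviceId;   // 0..7
    cublasHandle_t handle;     // one handle per thread
    cudaStream_t   stream;     // one stream per thread
    // device arrays of device pointers, as cublasGemmBatchedEx requires
    const void   **dA, **dB;
    void         **dC;
    float         *dCdata;     // contiguous device storage behind dC
    float         *hC;         // pinned host buffer (cudaMallocHost) for the D2H copy
    int            m, n, k, batchCount;
    int            iterations; // the "n" loop count in the pseudocode above
};

void workerThread(threadCtx *ctx)
{
    cudaSetDevice(ctx->deviceId);               // bind this thread to its GPU
    cublasSetStream(ctx->handle, ctx->stream);  // all cuBLAS work goes to this thread's stream

    const float alpha = 1.0f, beta = 0.0f;

    for (int i = 0; i < ctx->iterations; ++i) {
        cublasGemmBatchedEx(ctx->handle,
                            CUBLAS_OP_N, CUBLAS_OP_N,
                            ctx->m, ctx->n, ctx->k,
                            &alpha,
                            ctx->dA, CUDA_R_32F, ctx->m,
                            ctx->dB, CUDA_R_32F, ctx->k,
                            &beta,
                            ctx->dC, CUDA_R_32F, ctx->m,
                            ctx->batchCount,
                            CUBLAS_COMPUTE_32F,
                            CUBLAS_GEMM_DEFAULT);
    }

    // D2H copy on the same stream, so it is ordered after the GEMMs;
    // hC must be pinned for the copy to be truly asynchronous
    size_t bytes = (size_t)ctx->m * ctx->n * ctx->batchCount * sizeof(float);
    cudaMemcpyAsync(ctx->hC, ctx->dCdata, bytes, cudaMemcpyDeviceToHost, ctx->stream);

    cudaStreamSynchronize(ctx->stream);         // wait for this thread's work only
}

Since everything is issued on the thread's own stream, only that stream needs to be synchronized at the end, and the other 3 streams on the same GPU are not blocked.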