The cublasGemmGroupedBatchedEx API results in an additional cudaMemcpyAsync H2D

As the title says, I ran the official sample benchmark, CUDALibrarySamples/cuBLAS/Extensions/GemmGroupedBatchedEx/cublas_GemmGroupedBatchedEx_example.cu (master branch of NVIDIA/CUDALibrarySamples on GitHub), on the latest CUDA version, 12.8.
Profiling with nsys shows an additional cudaMemcpyAsync HtoD call. If this copy is issued by the cuBLAS API itself, can it be avoided? I ask because copying from pageable memory hurts performance.