Why does cudaMemcpyAsync behave differently on different CPU platforms?

liuchengan2020 · October 20, 2022, 9:03am

I wrote a simple asynchronous copy test program:

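#include <cuda_runtime.h>
#include <cuComplex.h>      // cuDoubleComplex, make_cuDoubleComplex
#include <stdio.h>
#include <stdlib.h>         // rand, RAND_MAX
#include <chrono>
#include <thread>
#define MAX_LINEPOINTS  650
#define MAX_BEAMS       32
#define MAX_TXNUMBER    17
#define MAX_ENSEMBLE    40
#define MAX_FRAMESIZE   (MAX_LINEPOINTS * MAX_BEAMS * MAX_TXNUMBER)
int main()
{
    cudaError_t cudaStatus;
    // Select the device before creating any streams on it.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(cudaStatus));
        return 1;
    }
    cudaStream_t cudaStream1 = NULL;        // CUDA stream used for the copies
    cudaStream_t cudaStream2 = NULL;        // second CUDA stream (unused in this test)
    cudaStreamCreateWithFlags(&cudaStream1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&cudaStream2, cudaStreamNonBlocking);
    // Pinned host buffer holding one frame
    cuDoubleComplex *h_tmp;
    cudaStatus = cudaMallocHost(&h_tmp, MAX_FRAMESIZE * sizeof(cuDoubleComplex));

    for (int i = 0; i < MAX_FRAMESIZE; i++)
        h_tmp[i] = make_cuDoubleComplex(rand() / (double)RAND_MAX, rand() / (double)RAND_MAX);
    // Device buffer holding MAX_ENSEMBLE frames
    cuDoubleComplex *d_tmp;
    cudaStatus = cudaMalloc(&d_tmp, MAX_FRAMESIZE * MAX_ENSEMBLE * sizeof(cuDoubleComplex));
    // Issue one asynchronous copy per frame, 5 ms apart
    for (int i = 0; i < MAX_ENSEMBLE; i++) {
        cudaStatus = cudaMemcpyAsync(d_tmp + i * MAX_FRAMESIZE, h_tmp, MAX_FRAMESIZE * sizeof(cuDoubleComplex),
            cudaMemcpyHostToDevice, cudaStream1);
        if (cudaStatus != cudaSuccess)
            fprintf(stderr, "cudaMemcpyAsync failed: %s\n", cudaGetErrorString(cudaStatus));
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }

    cudaStatus = cudaStreamSynchronize(cudaStream1);
    // Release streams and buffers
    cudaStreamDestroy(cudaStream1);
    cudaStreamDestroy(cudaStream2);
    cudaFreeHost(h_tmp);
    cudaFree(d_tmp);
    cudaStatus = cudaDeviceReset();

    return 0;
}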

When I run it on different CPU platforms, I see different behavior:
platform1: QuadCore Intel Xeon E5-1620 v4, 3800 MHz (38 x 100)
platform2: QuadCore Intel Xeon E3-1270 v5, 4000 MHz (40 x 100)
Both platforms have the same graphics card (NVIDIA Quadro RTX 4000), the same operating system, and the same drivers.
I used Nsight to observe the execution:
platform1: [Nsight timeline screenshot]

Platform1 runs as expected: every asynchronous copy call is followed by an actual transfer. On platform2, however, no transfer follows the asynchronous copy calls as they are issued; the transfers do not start until cudaStreamSynchronize is called.
I also found that the asyncEngineCount attribute reported for the graphics card differs between the two platforms (2 on platform1, 6 on platform2).
I would like to know why this happens. If I want my program to run the way it does on platform1, what should I look for when choosing a CPU?

njuffa · October 20, 2022, 9:48am

Which operating system and driver versions are those?

liuchengan2020 · October 20, 2022, 12:45pm

Microsoft Windows 10 Pro 10.0.19044.2130 (Win10 21H2 November 2021 Update)
The graphics driver is the latest NVIDIA version: Quadro RTX 4000 (517.40) WHQL

Robert_Crovella · October 20, 2022, 1:39pm

Do both systems have the same setting for hardware accelerated GPU scheduling?

liuchengan2020 · October 21, 2022, 2:22am

Thank you for solving my problem. I checked this setting on the two systems, and they are indeed different: the E3 platform (platform2) did not have this setting turned on.
