Issue with cudaMemPrefetchAsync on DRIVE Orin device

Issue Description
I tried a simple CUDA sample on my DRIVE Orin device and tried to use cudaMemPrefetchAsync. Even though my CUDA device is detected, the cudaMemPrefetchAsync call still fails with an error. Did I do something wrong in my code?

$ ./vectorAdd
Device count: 1
Using device 0: Orin (CC 8.7)
[UM VectorAdd N=50000000 (190.73 MiB per array), reps=5]
vectorAdd.cu:305 invalid device ordinal

#define CHECK(x) do{ cudaError_t e=(x); if(e){fprintf(stderr,"%s:%d %s\n",__FILE__,__LINE__,cudaGetErrorString(e)); exit(1);} }while(0)

int main(int argc, char** argv){
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    printf("Device count: %d\n", count);

    int dev = 0;  // Or parse from argv, then clamp to [0, count)
    CHECK(cudaSetDevice(dev));  // check this call too, so a bad device fails here

    cudaDeviceProp prop{};
    CHECK(cudaGetDeviceProperties(&prop, dev));
    printf("Using device %d: %s (CC %d.%d)\n", dev, prop.name, prop.major, prop.minor);


    // --- problem size ---
    int N    = (argc>1)? atoi(argv[1]) : 50000000; // default 50M elements
    int reps = (argc>2)? atoi(argv[2]) : 5;
    size_t bytes = (size_t)N * sizeof(float);
    printf("[UM VectorAdd N=%d (%.2f MiB per array), reps=%d]\n",
           N, bytes/1024.0/1024.0, reps);

    // --- Unified Memory allocation ---
    float *A, *B, *C;
    CHECK(cudaMallocManaged(&A, bytes));
    CHECK(cudaMallocManaged(&B, bytes));
    CHECK(cudaMallocManaged(&C, bytes));

    // --- initialize on host ---
    for (int i=0;i<N;i++){ A[i]=rand()/(float)RAND_MAX; B[i]=rand()/(float)RAND_MAX; }

    // --- prefetch to GPU 0 ---
    CHECK(cudaMemPrefetchAsync(A, bytes, dev));
    CHECK(cudaMemPrefetchAsync(B, bytes, dev));
    CHECK(cudaMemPrefetchAsync(C, bytes, dev));
    CHECK(cudaDeviceSynchronize());

    return 0;
}

Dear @rlu1,
Could you check whether the /usr/local/cuda/samples/1_Utilities/UnifiedMemoryPerf sample works on your device, and refer to it for the usage of the API?
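
One thing worth checking alongside the sample: according to the CUDA Runtime API documentation, when the prefetch destination is a GPU, that device must report a non-zero cudaDevAttrConcurrentManagedAccess attribute. Integrated (Tegra-class) GPUs commonly report 0 for it, which may explain the "invalid device ordinal" error even though device 0 is valid; this is a likely cause to verify on the device, not a confirmed diagnosis. The sketch below guards the prefetch on that attribute. The helper name prefetchIfSupported is hypothetical (it is not from the post or from the CUDA samples), and it assumes the CUDA 11.x/12.x signature of cudaMemPrefetchAsync that takes an int destination device, as used in the code above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post or the CUDA samples).
// Issues cudaMemPrefetchAsync to `dev` only if the device reports
// concurrentManagedAccess; otherwise it skips the prefetch.
// Assumes the CUDA 11.x/12.x signature with an int destination device.
// Returns cudaSuccess when the prefetch was issued or deliberately skipped.
static cudaError_t prefetchIfSupported(const void* ptr, size_t bytes, int dev,
                                       cudaStream_t stream = 0) {
    int concurrent = 0;
    cudaError_t e = cudaDeviceGetAttribute(
        &concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    if (e != cudaSuccess) return e;

    if (!concurrent) {
        // Prefetching to this GPU is not supported; managed memory still
        // works, it is simply accessed without an explicit prefetch.
        fprintf(stderr,
                "device %d: concurrentManagedAccess=0, skipping prefetch\n",
                dev);
        return cudaSuccess;
    }
    return cudaMemPrefetchAsync(ptr, bytes, dev, stream);
}
```

In the code from the post, each of the three prefetch calls could then be written as CHECK(prefetchIfSupported(A, bytes, dev)); (and likewise for B and C). If the attribute turns out to be 0 on Orin, the kernel can still be launched on the managed pointers without the prefetch step.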

okay, thanks
