cudaMallocHost caching behavior

wenzel.jakob · February 28, 2019, 1:53pm

Dear all,

I had the following question about host-pinned memory allocated via cudaMallocHost:

From what I’ve gathered so far, read accesses to host-pinned memory issued within the GPU will be dispatched to CPU memory via PCI express as that memory will never migrate to the GPU. That said, I was wondering if the GPU provides any caching over host-pinned memory? (e.g. do two subsequent reads to the same address generate two PCIE transactions?)

My use case is an application with a few scattered arrays allocated on the CPU that may occasionally be read by the GPU as well. I could just use managed memory, but this seemed like potential overkill to me (accesses are read-only, and caching would likely also make this work with host-pinned memory)

Thoughts?

Thanks,
Wenzel

Robert_Crovella · February 28, 2019, 9:33pm

I think this should be easy to determine with a profiler. As far as I know, there is no caching, and 2 reads to the same location in system memory will result in 2 transactions over PCIE. See below.

wenzel.jakob · March 1, 2019, 12:32am

I ended up running a benchmark, and it seems to indicate the opposite. The following snippet reads from a 128-entry float array in host-pinned memory:

#include <cuda.h>
#include <stdio.h>

__global__ void mysum(size_t size, float *data, float *out) {
    float accum = 0.f;
    for (size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x; i < size;
         i += (size_t) blockDim.x * gridDim.x) {
        accum += data[(i*3)% 128];
    }
    atomicAdd(out, accum);
}

int main(int argc, char **argv) {
    float *data = nullptr, *out = nullptr;
    cudaMallocHost(&data, sizeof(float)*128);
    cudaMalloc(&out, sizeof(float));
    for (int i = 0; i<128; ++i)
        data[i] = i;

    for (int i = 0; i<2; ++i) {
        cudaMemset(out, 0, 4);
        mysum<<<128,128>>>((size_t) 1024*1024*1024*1024, data, out);
        float out_h = 0;
        cudaMemcpy(&out_h, out, 4, cudaMemcpyDeviceToHost);
        printf("%f\n", out_h);
    }
}

The program is quite simple, so I checked the PTX to ensure that the “ld.global.f32” in the inner loop doesn’t get optimized away (it’s there).

BB0_2:
        mul.lo.s64      %rd13, %rd18, 3;
        and.b64         %rd14, %rd13, 127;
        shl.b64         %rd15, %rd14, 2;
        add.s64         %rd16, %rd4, %rd15;
        ld.global.f32   %f6, [%rd16];
        add.f32         %f9, %f9, %f6;
        add.s64         %rd18, %rd3, %rd18;
        setp.lt.u64     %p2, %rd18, %rd7;
        @%p2 bra        BB0_2;

This takes about 5.5 sec to run on my RTX card. nvvp doesn’t show anything interesting related to bus traffice, so that part was unconclusive. More interestingly, there is no change in performance when I replace cudaMallocHost by cudaMallocManaged. (I would expect the performance to change significantly if one variant has caching and the other doesn’t)

Do you think that makes sense as benchmark, and that the conclusion is correct? I’d also be curious to hear from the CUDA gurus whether this is something one can generally expect to hold true, or if it’s just the case for compute capability > X.

Thanks,
Wenzel

Robert_Crovella · March 1, 2019, 3:33am

It appears to be cached in L1(*) but not in L2. So my previous comment is not correct.

analyzing PTX isn’t very reliable. The tool that converts PTX to the machine code that actually executes is an optimizing compiler.

I think a simpler test case can be used.

$ cat t403.cu
#include <stdio.h>

__global__ void k(int *d){

  int a = d[threadIdx.x];
#ifdef USE_ADD
  a += d[blockDim.x-threadIdx.x-1];
#endif
  if (a > 0) d[threadIdx.x] = a;
}

const int ds = 128;
int main(){

  int *d;
  cudaHostAlloc(&d, ds*sizeof(d[0]), cudaHostAllocDefault);
  memset(d, 0, ds*sizeof(d[0]));
  k<<<128,ds>>>(d);
  cudaDeviceSynchronize();
}

$ nvcc -arch=sm_60 -o t403 t403.cu
$ cuobjdump -sass ./t403

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_60

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_60
                Function : _Z1kPi
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                           /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                  /* 0x4c98078000870001 */
        /*0010*/                   S2R R2, SR_TID.X ;                      /* 0xf0c8000002170002 */
        /*0018*/                   SHR.U32 R0, R2.reuse, 0x1e ;            /* 0x3828000001e70200 */
                                                                           /* 0x001ed400fc4007e6 */
        /*0028*/                   ISCADD R2.CC, R2, c[0x0][0x140], 0x2 ;  /* 0x4c18810005070202 */
        /*0030*/                   IADD.X R3, R0, c[0x0][0x144] ;          /* 0x4c10080005170003 */
        /*0038*/                   LDG.E R0, [R2] ;                        /* 0xeed4200000070200 */
                                                                           /* 0x001ff400fd4107ed */
        /*0048*/                   ISETP.GE.AND P0, PT, R0, 0x1, PT ;      /* 0x366d038000170007 */
        /*0050*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*0058*/              @!P0 EXIT ;                                  /* 0xe30000000008000f */
                                                                           /* 0x001fbc00fde007f1 */
        /*0068*/                   STG.E [R2], R0 ;                        /* 0xeedc200000070200 */
        /*0070*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*0078*/                   NOP ;                                   /* 0x50b0000000070f00 */
                                                                           /* 0x001ffc00fc6007ef */
        /*0088*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*0090*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*0098*/                   EXIT ;                                  /* 0xe30000000007000f */
                                                                           /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                              /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                    /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                    /* 0x50b0000000070f00 */
                .................

Fatbin ptx code:
================
arch = sm_60
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvprof --metrics sysmem_read_bytes ./t403
==13132== NVPROF is profiling process 13132, command: ./t403
==13132== Profiling application: ./t403
==13132== Profiling result:
==13132== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: k(int*)
          1                         sysmem_read_bytes                  System Memory Read Bytes       28672       28672       28672
$ nvcc -arch=sm_60 -o t403 t403.cu -DUSE_ADD
$ cuobjdump -sass ./t403

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_60

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_60
                Function : _Z1kPi
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                           /* 0x003fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                  /* 0x4c98078000870001 */
        /*0010*/                   S2R R2, SR_TID.X ;                      /* 0xf0c8000002170002 */
        /*0018*/                   LOP.PASS_B R0, RZ, ~R2 ;                /* 0x5c4707000027ff00 */
                                                                           /* 0x001f8800fc8207f1 */
        /*0028*/                   SHR.U32 R3, R2.reuse, 0x1e ;            /* 0x3828000001e70203 */
        /*0030*/                   ISCADD R2.CC, R2, c[0x0][0x140], 0x2 ;  /* 0x4c18810005070202 */
        /*0038*/                   IADD R0, R0, c[0x0][0x8] ;              /* 0x4c10000000270000 */
                                                                           /* 0x001fc440fe0007f4 */
        /*0048*/                   IADD.X R3, R3, c[0x0][0x144] ;          /* 0x4c10080005170303 */
        /*0050*/         {         SHR.U32 R5, R0.reuse, 0x1e ;            /* 0x3828000001e70005 */
        /*0058*/                   LDG.E R6, [R2]         }
                                                                           /* 0xeed4200000070206 */
                                                                           /* 0x001ed400fc4007e6 */
        /*0068*/                   ISCADD R4.CC, R0, c[0x0][0x140], 0x2 ;  /* 0x4c18810005070004 */
        /*0070*/                   IADD.X R5, R5, c[0x0][0x144] ;          /* 0x4c10080005170505 */
        /*0078*/                   LDG.E R4, [R4] ;                        /* 0xeed4200000070404 */
                                                                           /* 0x001ff400fda107f6 */
        /*0088*/                   IADD R0, R4, R6 ;                       /* 0x5c10000000670400 */
        /*0090*/                   ISETP.GE.AND P0, PT, R0, 0x1, PT ;      /* 0x366d038000170007 */
        /*0098*/              @!P0 EXIT ;                                  /* 0xe30000000008000f */
                                                                           /* 0x001fbc00fde007f1 */
        /*00a8*/                   STG.E [R2], R0 ;                        /* 0xeedc200000070200 */
        /*00b0*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*00b8*/                   NOP ;                                   /* 0x50b0000000070f00 */
                                                                           /* 0x001ffc00fc6007ef */
        /*00c8*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*00d0*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*00d8*/                   EXIT ;                                  /* 0xe30000000007000f */
                                                                           /* 0x001f8000fc0007ff */
        /*00e8*/                   BRA 0xe0 ;                              /* 0xe2400fffff07000f */
        /*00f0*/                   NOP;                                    /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                    /* 0x50b0000000070f00 */
                .................

Fatbin ptx code:
================
arch = sm_60
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvprof --metrics sysmem_read_bytes ./t403
==13189== NVPROF is profiling process 13189, command: ./t403
==13189== Profiling application: ./t403
==13189== Profiling result:
==13189== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: k(int*)
          1                         sysmem_read_bytes                  System Memory Read Bytes       28672       28672       28672
$

In the above example, we see that the total number of sysmem read bytes is 56*512, which is 56 SMs in my P100, times 512 bytes (128*sizeof(float)). Therefore it’s not cached in L2. We see that the total number of bytes is unchanged if I do 2 reads in the kernel, so it is being cached in L1. In the first case, the SASS shows one LDG instruction, whereas in the second test case there are 2.

(*)Note that the behavior could possibly be different based on the GPU. Kepler devices have more limited usage of the L1 cache; sysmem transactions don’t seem to be cached by default on cc3.5 devices. On those devices, a similar metric test (sysmem_read_transactions) shows a doubling of the transaction count when we have 2 LDG instructions in the kernel.

wenzel.jakob · March 1, 2019, 4:10pm

Dear Robert,

thank you very much for the thorough analysis, and for showing how to extract the right data to draw this conclusion.

Best,
Wenzel