May I ask what the granularity of GPU memory access to cudaMallocHost memory is? If I wish to measure this granularity, how should I proceed?

Below is one possible approach that I have considered:

#include <stdio.h>

#define KB 1024
#define MB (1024 * 1024)

__global__ 
void get(int *array,int i){
    int lengthMod = 8 * MB - 1;
    int tmp = 0;
    for (int j = 0; j < 6; j++) {  
        for (unsigned int k = 0; k < 512 * MB; k++) {
            tmp += array[(k * i) & lengthMod];
        }
    }
}

void test_time(){
    int * array;
    cudaMallocHost(&array, 8 * MB * sizeof(int)); // the kernel masks indices with 8 * MB - 1, so allocate 8M ints
    int i;
    // warm-up launches so the first timed run is not skewed
    get<<<1,1>>>(array,1);
    get<<<1,1>>>(array,1);
    get<<<1,1>>>(array,1);
    for(i = 1; i <= 2048*2048; i *= 2){
        cudaEvent_t startEvent, stopEvent;
        cudaEventCreate(&startEvent);
        cudaEventCreate(&stopEvent);
        cudaEventRecord(startEvent, 0);
        
        get<<<1,1>>>(array,i);
        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);
        float time;

        cudaEventElapsedTime(&time, startEvent, stopEvent);
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);    
        time /= 6;  // the kernel repeats its scan 6 times
        printf("For i value: %d    Time Taken: %f\n", i, time);
    }
    cudaFreeHost(array);
}



int main() {
    test_time();
    return 0;
}

I hope to measure the access granularity through the above approach, analogous to measuring the cache line size on a CPU. However, the results are as follows:

For i value: 1    Time Taken:0.000853
For i value: 2    Time Taken:0.000853
For i value: 4    Time Taken:0.000848
For i value: 8    Time Taken:0.000667
For i value: 16    Time Taken:0.000709
For i value: 32    Time Taken:0.000715
For i value: 64    Time Taken:0.000667
For i value: 128    Time Taken:0.000720
For i value: 256    Time Taken:0.000683
For i value: 512    Time Taken:0.000683
For i value: 1024    Time Taken:0.000853
For i value: 2048    Time Taken:0.000677
For i value: 4096    Time Taken:0.000512
For i value: 8192    Time Taken:0.000683
For i value: 16384    Time Taken:0.000683
For i value: 32768    Time Taken:0.000683
For i value: 65536    Time Taken:0.000512
For i value: 131072    Time Taken:0.000683
For i value: 262144    Time Taken:0.000683
For i value: 524288    Time Taken:0.000683
For i value: 1048576    Time Taken:0.000512
For i value: 2097152    Time Taken:0.000683
For i value: 4194304    Time Taken:0.000683

As you can see, the time does not change significantly across the different values of i. I see two possibilities: either the access granularity really is 1 byte, or the effect of granularity is masked by other, larger overheads. I hope someone can help clarify my confusion. Thank you very much.

Hi there @AuroraCelestial, welcome to the NVIDIA developer forums.

I honestly do not know, but I think the CUDA experts might have some suggestions or answers to figure out granularity. I hope it is OK if I move this topic to the CUDA programming category.

Thanks!

Edit: Sorry, I think I misread your question. If you did not ask about host-side allocation granularity, please disregard the text below.

The allocation granularity may depend on the specific system, but it's likely 2 MB.

You can use this test program.

//nvcc main.cu -o main
#include <iostream>
#include <cassert>

int main(){
    cudaFree(0); // init CUDA context

    std::cout << "before cudaMallocHost\n";
    char* ptr;

    cudaError_t status = cudaMallocHost(&ptr, 1);
    assert(status == cudaSuccess);

    std::cout << "after cudaMallocHost\n";
    std::cout << "ptr is " << (void*)ptr << "\n";

    char* ptr2;
    std::cout << "before cudaMallocHost2\n";
    status = cudaMallocHost(&ptr2, 1);
    assert(status == cudaSuccess);

    std::cout << "after cudaMallocHost2\n";
    std::cout << "ptr2 is " << (void*)ptr2 << "\n";
}

On Linux, you can use strace ./main to see all system calls performed by your program. By initializing the CUDA context up front and printing with std::cout before and after each cudaMallocHost, you can quickly identify the system calls relevant to your question.

write(1, "before cudaMallocHost\n", 22before cudaMallocHost
) = 22
mmap(0x7fb4b3200000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fb4b3200000
...
write(1, "after cudaMallocHost\n", 21after cudaMallocHost
)  = 21
write(1, "ptr is 0x7fb4b3200000\n", 22ptr is 0x7fb4b3200000
) = 22
write(1, "before cudaMallocHost2\n", 23before cudaMallocHost2
) = 23
write(1, "after cudaMallocHost2\n", 22after cudaMallocHost2
) = 22
write(1, "ptr2 is 0x7fb4b3200200\n", 23ptr2 is 0x7fb4b3200200
) = 23

mmap maps 2097152 bytes (2 MB) of data.
The runtime then sub-allocates from this mapping, returning 512-byte-aligned pointers.
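
If you want to observe this sub-allocation without strace, a minimal sketch along the following lines prints the spacing between successive pinned allocations. The file name, loop count, and allocation size are illustrative, and the observed spacing is an implementation detail that may vary by driver version and system:

//nvcc suballoc.cu -o suballoc
#include <cstdio>

int main(){
    cudaFree(0); // init CUDA context

    // Allocate several tiny pinned buffers and print the deltas between
    // the returned pointers. On the system above, consecutive pointers
    // were 512 bytes apart inside a single 2 MB mapping.
    char* prev = nullptr;
    for (int n = 0; n < 8; n++) {
        char* p;
        if (cudaMallocHost(&p, 1) != cudaSuccess) return 1;
        if (prev) printf("ptr %d: %p (delta: %td bytes)\n", n, (void*)p, p - prev);
        else      printf("ptr %d: %p\n", n, (void*)p);
        prev = p;
    }
    return 0;
}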

I’m honored that you are willing to provide assistance.

Thank you for your response. I appreciate your clarification regarding host-side allocation granularity. However, my inquiry was focused on measuring the GPU's access granularity when it reads cudaMallocHost memory. If you have insights or suggestions on how to perform such measurements on the GPU side, I would be grateful for your guidance. Thank you again for your assistance.

The access granularity is a single byte. This must be true, because it is legal to allocate a single byte, and any larger granularity of access would result in illegal behavior in that case.

I appreciate your explanation. However, I’m still trying to grasp the concept. Could you please share your source code to help deepen my understanding? Your assistance in this matter would be greatly valued. Thank you.

One byte example:

#include <cstdio>
__global__ void k(unsigned char *d) {
  unsigned char val = *d;  // accesses data area on host
  printf("val: %u\n", (unsigned)val);
}

int main(){
  unsigned char *data;
  cudaMallocHost(&data, 1); // allocates only 1 byte
  *data = 0;
  k<<<1,1>>>(data);
  cudaDeviceSynchronize();
}

The cudaMallocHost line formally allocates only one byte of space.
The commented line in the kernel code accesses this data region. If the access/transaction size were larger than one byte (at that point in the kernel code), it would be illegal behavior in C++.

I apologize if I didn’t express my thoughts clearly. I have updated my question. Thank you very much for your willingness to help.

The kernel get has no side effects, so the compiler reduces it to a null kernel. The result is that no matter what value of i you pass, the time taken is approximately the same.

The PCIe standard supports 1B reads. Requests from the SM L1 to the L2 are in terms of 32B sectors. The L2 will likely also make a request of a 32B sector. A PCIe read request is ~3 x 4B. A PCIe read response is also approximately 3 x 4B + payload (1-32B). Given the way you are making the PCIe requests, I would guess the minimal read granularity is 32B. The thread will only receive 1B; the L1 and L2 will have the full 32B. SYSMEM reads are not cached in the GPU L2; the data passes through the L2.

get is not likely to be a useful benchmark if you execute only 1 thread. Given (a) protocol overhead, (b) the maximum in-flight reads from your kernel, and (c) the maximum in-flight reads from the GPU, you are unlikely to saturate PCIe, so you will not see an increase in latency.
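
To turn this into something measurable, here is a rough sketch (my own construction, not code from this thread) that reads a fixed number of elements from pinned host memory at increasing byte strides, using many threads and an atomicAdd side effect so nvcc keeps the loads. If the minimum transfer really is a 32B sector, the elapsed time should grow with stride up to 32 bytes (fewer elements share a sector) and then roughly flatten. The kernel name, grid dimensions, and sizes are all assumptions for illustration:

//nvcc stride.cu -o stride
#include <cstdio>
#include <cstring>

// Each thread strides through pinned host memory and accumulates into tmp;
// the atomicAdd is a side effect so nvcc cannot strip out the loads.
__global__ void stride_read(const int* in, size_t nElems, size_t stride, int* out){
    size_t tid  = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t step = gridDim.x * (size_t)blockDim.x;
    int tmp = 0;
    for (size_t k = tid; k < nElems; k += step)
        tmp += in[k * stride];   // uncached read of pinned host memory over PCIe
    atomicAdd(out, tmp);
}

int main(){
    const size_t nElems    = 1 << 20;  // elements read per run (1M)
    const size_t maxStride = 16;       // in ints: up to 64 bytes between reads
    int *in, *out;
    cudaMallocHost(&in, nElems * maxStride * sizeof(int)); // 64 MB pinned
    memset(in, 0, nElems * maxStride * sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemset(out, 0, sizeof(int));

    stride_read<<<256, 256>>>(in, nElems, 1, out); // warm-up launch
    cudaDeviceSynchronize();

    for (size_t s = 1; s <= maxStride; s *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        stride_read<<<256, 256>>>(in, nElems, s, out);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("stride %3zu bytes: %.3f ms\n", s * sizeof(int), ms);
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }
    cudaFreeHost(in);
    cudaFree(out);
    return 0;
}

Averaging several timed runs, and comparing against a device-memory baseline, would make the numbers more trustworthy; a single run over PCIe can be noisy.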

Thank you very much for taking the time to provide such detailed insights over the weekend. I understand that the "get" kernel may not be an ideal benchmark due to PCIe protocol overhead and the specifics of in-flight reads. I truly appreciate it!

Hi Greg, I want to express my gratitude for your valuable insights. I comprehend that the “get” kernel being reduced to a null kernel results in consistent timing irrespective of the input value. Your explanation of PCIe reads and the potential minimal read granularity is much appreciated.

I have tried various methods, including reading the value of the cudaMallocHost memory on the host and using nvcc -O0 to disable optimization. However, even after inspecting the GPU activity with Nsight Systems, I still observed that the kernel function was not executed. I would like to ask how to test the granularity. Regarding your speculation of 32 bytes, I am inclined to agree, but I am still interested in a definitive statement from NVIDIA engineers. As you know, this is crucial for the collaborative performance of software and hardware in high-performance computing.

Also, I’ll keep in mind the considerations for benchmarking with only one thread. Thanks again for your assistance over the weekend.

Yours

Yanru Li


"I have tried various methods, including reading the value of cudaMallocHost memory on the host, using nvcc -O0 to disable optimization, …"

nvcc will eliminate code even with -O0. Please review the GPU disassembly (SASS) to verify.
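
For example, something like cuobjdump -sass ./main shows the compiled SASS; if the body of get contains no global load (LDG) instructions, the loads were eliminated.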

If you add a side effect then the code will not be eliminated.

__global__ 
void get(int* pin, int i, int* pout){
    int lengthMod = 8 * MB - 1;
    int tmp = 0;
    for (int j = 0; j < 6; j++) {  
        for (unsigned int k = 0; k < 512 * MB; k++) {
            tmp += pin[(k * i) & lengthMod];
        }
    }
    *pout = tmp;  // side effect to make sure nvcc does not strip out all of the code
}
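
One practical caveat (my note, not from the post above): once the loads actually execute, 6 x 512M mostly serialized reads over PCIe from a single thread will take an extremely long time, so you will likely want to shrink the loop bounds or use many threads when running this side-effecting version.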

The minimum load size in the GPU memory subsystem is 32 bytes.

The SM L1 will mark all sectors (32 bytes) missed by the instruction. The SM L1 will read 32B sectors from the L2. If the address is in SYSMEM then the load requests will pass through the L2 (uncached) and the load request will go to the PCIe interface. The returned 32Bs will be returned to the SM L1. The SM L1 will return the requested bytes to each thread.