May I ask what the granularity of GPU memory access to cudaMallocHost memory is? If I wish to measure this granularity, how should I proceed?

Below is one possible approach that I have considered:

#include <stdio.h>

#define KB 1024
#define MB (1024 * 1024)

__global__ 
void get(int *array,int i){
    int lengthMod = 8 * MB - 1;
    int tmp = 0;
    for (int j = 0; j < 6; j++) {  
        for (unsigned int k = 0; k < 512 * MB; k++) {
            tmp += array[(k * i) & lengthMod];
        }
    }
}

void test_time(){
    int * array;
    cudaMallocHost(&array, 8 * MB * sizeof(int)); // the kernel masks indices with 8 * MB - 1, so allocate 8M ints
    int i;
    // warm-up launches so the first timed run is not skewed
    get<<<1,1>>>(array,1);
    get<<<1,1>>>(array,1);
    get<<<1,1>>>(array,1);
    for(i = 1; i <= 2048*2048; i *= 2){
        cudaEvent_t startEvent, stopEvent;
        cudaEventCreate(&startEvent);
        cudaEventCreate(&stopEvent);
        cudaEventRecord(startEvent, 0);
        
        get<<<1,1>>>(array,i);
        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);
        float time;

        cudaEventElapsedTime(&time, startEvent, stopEvent);
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);    
        time /= 6;  // the kernel repeats its scan 6 times
        printf("For i value: %d    Time Taken: %f\n", i, time);
    }
    cudaFreeHost(array);
}



int main() {
    test_time();
    return 0;
}

I hope to measure the access granularity through the above approach, analogous to measuring the cache line size on a CPU. However, the results are as follows:

For i value: 1    Time Taken:0.000853
For i value: 2    Time Taken:0.000853
For i value: 4    Time Taken:0.000848
For i value: 8    Time Taken:0.000667
For i value: 16    Time Taken:0.000709
For i value: 32    Time Taken:0.000715
For i value: 64    Time Taken:0.000667
For i value: 128    Time Taken:0.000720
For i value: 256    Time Taken:0.000683
For i value: 512    Time Taken:0.000683
For i value: 1024    Time Taken:0.000853
For i value: 2048    Time Taken:0.000677
For i value: 4096    Time Taken:0.000512
For i value: 8192    Time Taken:0.000683
For i value: 16384    Time Taken:0.000683
For i value: 32768    Time Taken:0.000683
For i value: 65536    Time Taken:0.000512
For i value: 131072    Time Taken:0.000683
For i value: 262144    Time Taken:0.000683
For i value: 524288    Time Taken:0.000683
For i value: 1048576    Time Taken:0.000512
For i value: 2097152    Time Taken:0.000683
For i value: 4194304    Time Taken:0.000683

As you can see, the time does not change significantly across the different values of i. I see two possibilities: either the access granularity really is 1 byte, or the effect of granularity is masked by other, larger overheads. I hope someone can help clarify my confusion. Thank you very much.

Hi there @AuroraCelestial, welcome to the NVIDIA developer forums.

I honestly do not know, but I think the CUDA experts might have some suggestions or answers to figure out granularity. I hope it is OK if I move this topic to the CUDA programming category.

Thanks!

Edit: Sorry, I think I misread your question. If you did not ask about host-side allocation granularity, please disregard the text below.

The allocation granularity may depend on the specific system, but it's likely 2 MB.

You can use this test program.

//nvcc main.cu -o main
#include <iostream>
#include <cassert>

int main(){
    cudaFree(0); // init CUDA context

    std::cout << "before cudaMallocHost\n";
    char* ptr;

    cudaError_t status = cudaMallocHost(&ptr, 1);
    assert(status == cudaSuccess);

    std::cout << "after cudaMallocHost\n";
    std::cout << "ptr is " << (void*)ptr << "\n";

    char* ptr2;
    std::cout << "before cudaMallocHost2\n";
    status = cudaMallocHost(&ptr2, 1);
    assert(status == cudaSuccess);

    std::cout << "after cudaMallocHost2\n";
    std::cout << "ptr2 is " << (void*)ptr2 << "\n";
}

On Linux, you can use strace ./main to see all system calls performed by your program. By initializing the CUDA context up front and printing with std::cout before and after each cudaMallocHost, you can quickly identify the system calls relevant to your question.

write(1, "before cudaMallocHost\n", 22before cudaMallocHost
) = 22
mmap(0x7fb4b3200000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fb4b3200000
...
write(1, "after cudaMallocHost\n", 21after cudaMallocHost
)  = 21
write(1, "ptr is 0x7fb4b3200000\n", 22ptr is 0x7fb4b3200000
) = 22
write(1, "before cudaMallocHost2\n", 23before cudaMallocHost2
) = 23
write(1, "after cudaMallocHost2\n", 22after cudaMallocHost2
) = 22
write(1, "ptr2 is 0x7fb4b3200200\n", 23ptr2 is 0x7fb4b3200200
) = 23

mmap maps 2097152 bytes (2 MB) of data.
The runtime then sub-allocates from this mapping, returning 512-byte-aligned pointers.
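
If you want to observe this sub-allocation without strace, a minimal sketch along the following lines prints the spacing between successive pinned allocations. The file name, loop count, and allocation size are illustrative, and the observed spacing is an implementation detail that may vary by driver version and system:

//nvcc suballoc.cu -o suballoc
#include <cstdio>

int main(){
    cudaFree(0); // init CUDA context

    // Allocate several tiny pinned buffers and print the deltas between
    // the returned pointers. On the system above, consecutive pointers
    // were 512 bytes apart inside a single 2 MB mapping.
    char* prev = nullptr;
    for (int n = 0; n < 8; n++) {
        char* p;
        if (cudaMallocHost(&p, 1) != cudaSuccess) return 1;
        if (prev) printf("ptr %d: %p (delta: %td bytes)\n", n, (void*)p, p - prev);
        else      printf("ptr %d: %p\n", n, (void*)p);
        prev = p;
    }
    return 0;
}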

I’m honored that you are willing to provide assistance.

Thank you for your response. I appreciate your clarification regarding host-side allocation granularity. However, my inquiry was focused on measuring the GPU's access granularity when it reads cudaMallocHost memory. If you have insights or suggestions on how to perform such measurements on the GPU side, I would be grateful for your guidance. Thank you again for your assistance.

The access granularity is a single byte. This must be true, because it is legal to allocate a single byte, and any larger granularity of access would result in illegal behavior in that case.

I appreciate your explanation. However, I’m still trying to grasp the concept. Could you please share your source code to help deepen my understanding? Your assistance in this matter would be greatly valued. Thank you.

One byte example:

#include <cstdio>
__global__ void k(unsigned char *d) {
  unsigned char val = *d;  // accesses data area on host
  printf("val: %u\n", (unsigned)val);
}

int main(){
  unsigned char *data;
  cudaMallocHost(&data, 1); // allocates only 1 byte
  *data = 0;
  k<<<1,1>>>(data);
  cudaDeviceSynchronize();
}

The cudaMallocHost line formally allocates only one byte of space.
The commented line in the kernel code accesses this data region. If the access/transaction size were larger than one byte (at that point in the kernel code), it would be illegal behavior in C++.

I apologize if I didn’t express my thoughts clearly. I have updated my question. Thank you very much for your willingness to help.

The kernel get has no side effects, so the compiler reduces it to a null kernel. The result is that no matter what value of i you pass, the time taken is approximately the same.

The PCIe standard supports 1B reads. Requests from the SM L1 to the L2 are in terms of 32B sectors. The L2 will likely also make a request of a 32B sector. A PCIe read request is ~3 x 4B. A PCIe read response is also approximately 3 x 4B + payload (1-32B). Given the way you are making the PCIe requests, I would guess the minimal read granularity is 32B. The thread will only receive 1B; the L1 and L2 will have the full 32B. SYSMEM reads are not cached in the GPU L2; the data passes through the L2.

get is not likely to be a useful benchmark if you execute only 1 thread. Given (a) protocol overhead, (b) the maximum in-flight reads from your kernel, and (c) the maximum in-flight reads from the GPU, you are unlikely to saturate PCIe, so you will not see an increase in latency.
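
To turn this into something measurable, here is a rough sketch (my own construction, not code from this thread) that reads a fixed number of elements from pinned host memory at increasing byte strides, using many threads and an atomicAdd side effect so nvcc keeps the loads. If the minimum transfer really is a 32B sector, the elapsed time should grow with stride up to 32 bytes (fewer elements share a sector) and then roughly flatten. The kernel name, grid dimensions, and sizes are all assumptions for illustration:

//nvcc stride.cu -o stride
#include <cstdio>
#include <cstring>

// Each thread strides through pinned host memory and accumulates into tmp;
// the atomicAdd is a side effect so nvcc cannot strip out the loads.
__global__ void stride_read(const int* in, size_t nElems, size_t stride, int* out){
    size_t tid  = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t step = gridDim.x * (size_t)blockDim.x;
    int tmp = 0;
    for (size_t k = tid; k < nElems; k += step)
        tmp += in[k * stride];   // uncached read of pinned host memory over PCIe
    atomicAdd(out, tmp);
}

int main(){
    const size_t nElems    = 1 << 20;  // elements read per run (1M)
    const size_t maxStride = 16;       // in ints: up to 64 bytes between reads
    int *in, *out;
    cudaMallocHost(&in, nElems * maxStride * sizeof(int)); // 64 MB pinned
    memset(in, 0, nElems * maxStride * sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemset(out, 0, sizeof(int));

    stride_read<<<256, 256>>>(in, nElems, 1, out); // warm-up launch
    cudaDeviceSynchronize();

    for (size_t s = 1; s <= maxStride; s *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        stride_read<<<256, 256>>>(in, nElems, s, out);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("stride %3zu bytes: %.3f ms\n", s * sizeof(int), ms);
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }
    cudaFreeHost(in);
    cudaFree(out);
    return 0;
}

Averaging several timed runs, and comparing against a device-memory baseline, would make the numbers more trustworthy; a single run over PCIe can be noisy.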

Thank you very much for taking the time to provide such detailed insights over the weekend. I understand that the "get" kernel may not be an ideal benchmark due to PCIe protocol overhead and the specifics of in-flight reads. I truly appreciate it!

Hi Greg, I want to express my gratitude for your valuable insights. I comprehend that the “get” kernel being reduced to a null kernel results in consistent timing irrespective of the input value. Your explanation of PCIe reads and the potential minimal read granularity is much appreciated.

I have tried various methods, including reading the value of the cudaMallocHost memory on the host and using nvcc -O0 to disable optimization. However, even after inspecting the GPU activity with Nsight Systems, I still observed that the kernel function was not executed. I would like to ask how to test the granularity. Regarding your speculation of 32 bytes, I am inclined to agree, but I am still interested in a definitive statement from NVIDIA engineers. As you know, this is crucial for the collaborative performance of software and hardware in high-performance computing.

Also, I’ll keep in mind the considerations for benchmarking with only one thread. Thanks again for your assistance over the weekend.

Yours

Yanru Li


"I have tried various methods, including reading the value of cudaMallocHost memory on the host, using nvcc -O0 to disable optimization, …"

nvcc will eliminate code even with -O0. Please review the GPU disassembly (SASS) to verify.
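
For example, something like cuobjdump -sass ./main shows the compiled SASS; if the body of get contains no global load (LDG) instructions, the loads were eliminated.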

If you add a side effect then the code will not be eliminated.

__global__ 
void get(int* pin, int i, int* pout){
    int lengthMod = 8 * MB - 1;
    int tmp = 0;
    for (int j = 0; j < 6; j++) {  
        for (unsigned int k = 0; k < 512 * MB; k++) {
            tmp += pin[(k * i) & lengthMod];
        }
    }
    *pout = tmp;  // side effect to make sure nvcc does not strip out all of the code
}
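
One practical caveat (my note, not from the post above): once the loads actually execute, 6 x 512M mostly serialized reads over PCIe from a single thread will take an extremely long time, so you will likely want to shrink the loop bounds or use many threads when running this side-effecting version.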

The minimum load size in the GPU memory subsystem is 32 bytes.

The SM L1 will mark all sectors (32 bytes) missed by the instruction. The SM L1 will read 32B sectors from the L2. If the address is in SYSMEM then the load requests will pass through the L2 (uncached) and the load request will go to the PCIe interface. The returned 32Bs will be returned to the SM L1. The SM L1 will return the requested bytes to each thread.