Cache coherence of pinned memory on Jetson Xavier NX

Hello!
I’m investigating methods to transfer data between the CPU and the GPU.

I executed the following program, compiled with nvcc (release 10.2, V10.2.89), on a Jetson Xavier NX (compute capability 7.2).
On the GPU, it keeps writing an increasing counter to an int variable in pinned memory.
On the CPU, after launching the CUDA kernel, it polls the variable and writes its value to stdout.

#include <iostream>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>

__global__
void kernel(volatile int* ptr) {
    for (volatile int i = 0; i < 1e8; ++i) {
        *ptr = i;
    }
}

int main() {
    int *ptr, *dptr;
    const auto flags = cudaHostAllocMapped;
    checkCudaErrors(cudaHostAlloc(&ptr, sizeof(int), flags));
    checkCudaErrors(cudaHostGetDevicePointer((void**)&dptr, ptr, 0));

    dim3 d = {1, 1};
    kernel<<<d, d>>>(dptr);

    volatile int v = 0;
    while (v != 1e8 - 1) {
        v = *ptr;
        std::cout << v << '\n';
    }
    std::cout << std::endl;
}

And the output looks like this:

0
1350
1459
1553
1656
1758
1859
1960
2062
2316
2421
2523
2626
2728
2829
2930
3046
3147
3541
3652
3753
3855
3956
...

My question is: why can the CPU read values that the GPU updates this frequently? Does the CUDA driver flush the CPU cache when the GPU writes to the memory, or is this achieved with hardware coherency?

According to this document, on Tegra devices with I/O coherency, pinned memory is cached on the CPU but not cached on the GPU.
With I/O coherency, the GPU can access the latest updates in the CPU caches. However, there is no description of why
the CPU can see the latest updates made by the GPU.
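
For reference, the direction the document does describe (CPU to GPU) is easy to exercise: the GPU can spin on a flag in mapped pinned memory until the CPU sets it. The following is only my own sketch of that, reusing checkCudaErrors from the helper headers above; it is not taken from the document.

#include <iostream>
#include <cuda_runtime.h>
#include <helper_cuda.h>

__global__
void wait_for_flag(volatile int* flag, int* result) {
    // Spin until the host sets *flag; with I/O coherency the GPU read
    // observes the CPU's (possibly cached) write without an explicit flush.
    while (*flag == 0) { }
    *result = 42;
}

int main() {
    int *flag, *dflag, *result, *dresult;
    checkCudaErrors(cudaHostAlloc((void**)&flag, sizeof(int), cudaHostAllocMapped));
    checkCudaErrors(cudaHostAlloc((void**)&result, sizeof(int), cudaHostAllocMapped));
    checkCudaErrors(cudaHostGetDevicePointer((void**)&dflag, flag, 0));
    checkCudaErrors(cudaHostGetDevicePointer((void**)&dresult, result, 0));
    *flag = 0;
    *result = 0;

    wait_for_flag<<<1, 1>>>(dflag, dresult);  // kernel starts spinning

    *flag = 1;  // CPU write that releases the kernel

    checkCudaErrors(cudaDeviceSynchronize());
    std::cout << *result << std::endl;        // prints 42
    checkCudaErrors(cudaFreeHost(flag));
    checkCudaErrors(cudaFreeHost(result));
}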

Thanks.

Hi,

Yes. I/O coherence is achieved by hardware.
Only GPU architectures >= 7.2 (Xavier and Xavier NX) have this feature.

Thanks.

Thank you for the reply.
I’m sorry I didn’t explain it clearly enough.

According to this document, the I/O coherency is one-way: from CPU to GPU. That is why the GPU can read the latest updates made by the CPU.
On the other hand, it seems the coherence from GPU to CPU has to be achieved by the CUDA driver.

I think that if the CUDA driver flushed the CPU caches as frequently as the example program suggests, it would degrade memory access performance.
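
As a side note, the mechanism CUDA exposes to device code for making writes visible to the host is the system-wide fence __threadfence_system(). A variant of my kernel that uses it explicitly would look like the sketch below; whether anything like this is needed (or done internally by the driver) on Xavier is exactly what I'm asking about.

__global__
void kernel_fenced(volatile int* ptr) {
    for (volatile int i = 0; i < 1e8; ++i) {
        *ptr = i;
        // Order the store above so it is visible to host threads (and all
        // device threads) before any later writes by this thread.
        __threadfence_system();
    }
}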

Is the coherency really achieved by the driver?

Hi,

Based on the design, the GPU can read the CPU cache, but not vice versa.
Sorry, we don’t have the right to disclose more details.

Please note that Jetson doesn’t support concurrent access,
so only one processor can access the buffer at a time.
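
For example, a pattern that avoids simultaneous CPU and GPU access to the buffer (only a sketch, reusing the names from the program above) is to separate the two phases with a synchronization:

// In the program above, drop the polling loop and read the buffer only
// after the kernel has completed.
kernel<<<d, d>>>(dptr);
checkCudaErrors(cudaDeviceSynchronize());  // GPU phase: wait for the kernel to finish
std::cout << *ptr << std::endl;            // CPU phase: read the final value (1e8 - 1)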

Thanks.
