Cache coherence of pinned memory on Jetson Xavier NX

Hello!
I’m investigating methods to transfer data between the CPU and the GPU.

I executed the following program, compiled with nvcc (release 10.2, V10.2.89), on a Jetson Xavier NX (compute capability 7.2).
On the GPU, it keeps writing an increasing counter to an int variable in pinned memory.
On the CPU, after launching the CUDA kernel, it polls the variable and writes its value to stdout.

#include <iostream>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>

__global__
void kernel(volatile int* ptr) {
    for (volatile int i = 0; i < 1e8; ++i) {
        *ptr = i;
    }
}

int main() {
    int *ptr, *dptr;
    const auto flags = cudaHostAllocMapped;
    checkCudaErrors(cudaHostAlloc(&ptr, sizeof(int), flags));
    checkCudaErrors(cudaHostGetDevicePointer((void**)&dptr, ptr, 0));

    dim3 d = {1, 1};
    kernel<<<d, d>>>(dptr);

    volatile int v = 0;
    while (v != 1e8 - 1) {
        v = *ptr;
        std::cout << v << '\n';
    }
    std::cout << std::endl;
}

And the output looks like this:

0
1350
1459
1553
1656
1758
1859
1960
2062
2316
2421
2523
2626
2728
2829
2930
3046
3147
3541
3652
3753
3855
3956
...

My question is: why can the CPU read values that the GPU updates this frequently? Does the CUDA driver flush the CPU cache when the GPU writes to the memory, or is this achieved with hardware coherency?

According to this document, on Tegra devices with I/O coherency, pinned memory is cached on the CPU but not cached on the GPU.
With I/O coherency, the GPU can access the latest updates in the CPU caches. However, there is no description of why
the CPU can see the latest updates made by the GPU.
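
For reference, the direction the document does describe (CPU to GPU) is easy to exercise: the GPU can spin on a flag in mapped pinned memory until the CPU sets it. The following is only my own sketch of that, reusing checkCudaErrors from the helper headers above; it is not taken from the document.

#include <iostream>
#include <cuda_runtime.h>
#include <helper_cuda.h>

__global__
void wait_for_flag(volatile int* flag, int* result) {
    // Spin until the host sets *flag; with I/O coherency the GPU read
    // observes the CPU's (possibly cached) write without an explicit flush.
    while (*flag == 0) { }
    *result = 42;
}

int main() {
    int *flag, *dflag, *result, *dresult;
    checkCudaErrors(cudaHostAlloc((void**)&flag, sizeof(int), cudaHostAllocMapped));
    checkCudaErrors(cudaHostAlloc((void**)&result, sizeof(int), cudaHostAllocMapped));
    checkCudaErrors(cudaHostGetDevicePointer((void**)&dflag, flag, 0));
    checkCudaErrors(cudaHostGetDevicePointer((void**)&dresult, result, 0));
    *flag = 0;
    *result = 0;

    wait_for_flag<<<1, 1>>>(dflag, dresult);  // kernel starts spinning

    *flag = 1;  // CPU write that releases the kernel

    checkCudaErrors(cudaDeviceSynchronize());
    std::cout << *result << std::endl;        // prints 42
    checkCudaErrors(cudaFreeHost(flag));
    checkCudaErrors(cudaFreeHost(result));
}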

Thanks.

Hi,

Yes. I/O coherence is achieved by hardware.
Only GPU architectures >= 7.2 (Xavier and Xavier NX) have this feature.

Thanks.

Thank you for the reply.
I’m sorry I didn’t explain it clearly enough.

According to this document, the I/O coherency is one-way: from CPU to GPU. That is why the GPU can read the latest updates made by the CPU.
On the other hand, it seems the coherence from GPU to CPU has to be achieved by the CUDA driver.

I think that if the CUDA driver flushed the CPU caches as frequently as the example program suggests, it would degrade memory access performance.
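
As a side note, the mechanism CUDA exposes to device code for making writes visible to the host is the system-wide fence __threadfence_system(). A variant of my kernel that uses it explicitly would look like the sketch below; whether anything like this is needed (or done internally by the driver) on Xavier is exactly what I'm asking about.

__global__
void kernel_fenced(volatile int* ptr) {
    for (volatile int i = 0; i < 1e8; ++i) {
        *ptr = i;
        // Order the store above so it is visible to host threads (and all
        // device threads) before any later writes by this thread.
        __threadfence_system();
    }
}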

Is the coherency really achieved by the driver?

Hi,

Based on the design, the GPU can read the CPU cache, but not vice versa.
Sorry, we don’t have the right to disclose more details.

Please note that Jetson doesn’t support concurrent access,
so only one processor can access the buffer at a time.
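
For example, a pattern that avoids simultaneous CPU and GPU access to the buffer (only a sketch, reusing the names from the program above) is to separate the two phases with a synchronization:

// In the program above, drop the polling loop and read the buffer only
// after the kernel has completed.
kernel<<<d, d>>>(dptr);
checkCudaErrors(cudaDeviceSynchronize());  // GPU phase: wait for the kernel to finish
std::cout << *ptr << std::endl;            // CPU phase: read the final value (1e8 - 1)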

Thanks.
