Dual problems with unified memory

I have modified the CUDA sample code “12_camera_v4l2_cuda” so that it calculates the means of a 32x32 grid of rectangles (each rectangle is 128 pixels wide by 64 pixels high). The source image is 4096x2048 pixels. I want the CPU that calls the CUDA code to be able to read the 32x32 = 1024 means calculated for the sub-rectangles.

Unfortunately I am experiencing two problems:

  1. I get a “bus error” when I attempt to print out the mean for the first of the rectangles.
  2. The code runs slowly if I access the unified memory in my CUDA function. That is to say, if I keep the algorithm local I get 30 fps, but storing the means in the passed-in unified memory drops me to 12 fps.

Here is my code. I am using unified memory for the stats array.

    #include <cstdio>
    #include <cstdint>
    #include <cuda.h>
    #include <cuda_runtime.h>

    // 32x32 grid of rectangles over the 4096x2048 image.
    #define BOX_W 32
    #define BOX_H 32

    struct rectangleStats
    {
        float mean;
        // These fields are the rectangle offset, i.e. 0-31.
        uint8_t rect_x;
        uint8_t rect_y;
    };

    // Total of 1024 threads for an image 4096 x 2048:
    // one thread per 128x64 rectangle.
    __global__ void findSdMeanKernel(int* pDevPtr, rectangleStats* stats)
    {
        // Each row is 4096 bytes long, 1 byte per luminance pixel.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int start_row = row * 64;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int start_col = col * 128;
        uint8_t x_offset, y_offset;
        uint32_t pixel;
        float mean;
        uint32_t sum_pixels = 0;

        for (x_offset = 0; x_offset < 128; x_offset++)
        {
            for (y_offset = 0; y_offset < 64; y_offset++)
            {
                // Read as uint8_t, not char: a plain char may be signed,
                // which would sign-extend luminance values above 127.
                pixel = ((uint8_t *)pDevPtr)[(start_row + y_offset) * 4096 + start_col + x_offset];
                sum_pixels += pixel;
            }
        }

        mean = (float)sum_pixels / (128.0f * 64.0f);

        stats[row * BOX_W + col].mean = mean;
        stats[row * BOX_W + col].rect_x = col;
        stats[row * BOX_W + col].rect_y = row;
    }

    int findSdMean(CUdeviceptr pDevPtr)
    {
        // One 32x32 block: one thread per rectangle.
        dim3 threadsPerBlock(BOX_W, BOX_H);
        dim3 blocks(1, 1);
        rectangleStats *stats;

        cudaMallocManaged(&stats, BOX_W * BOX_H * sizeof(rectangleStats));

        findSdMeanKernel<<<blocks, threadsPerBlock>>>((int *)pDevPtr, stats);

        // The line below causes the "bus error".
        printf("Mean at x, y %d, %d is %f\n", stats[0].rect_x, stats[0].rect_y, stats[0].mean);

        cudaFree(stats);

        return 0;
    }

I have discovered that the malloc/free calls do not slow things down, so it seems to be the fact that I access the unified memory that is slow.

Any ideas as to the cause of these two issues? Thanks.

Solved 1): I forgot to call cudaDeviceSynchronize() before accessing the unified memory.

Still have issue 2): if I don’t pass data between host and device it runs at full frame rate, but I get a massive frame-rate drop as soon as I pass the data back.
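
For anyone hitting the same thing, the fix for 1) is just a synchronize between the kernel launch and the CPU read. A minimal sketch against my code above:

    findSdMeanKernel<<<blocks, threadsPerBlock>>>((int *)pDevPtr, stats);

    // Wait for the kernel to finish: on Jetson the CPU must not touch
    // managed memory while the GPU may still be using it.
    cudaDeviceSynchronize();

    // Safe to read from the CPU now.
    printf("Mean at x, y %d, %d is %f\n", stats[0].rect_x, stats[0].rect_y, stats[0].mean);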

Hi,

1. The Jetson platform doesn’t support concurrent access to managed memory.
So you will need to call a CUDA synchronize to make sure all GPU tasks are finished before accessing the memory from the CPU.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

2. Here is our document with memory recommendations for the Jetson system.
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Please note that unified memory relies on the CUDA driver to handle the synchronization.
For read-once input data, it’s recommended to use zero-copy memory to avoid the overhead of unified memory.

Thanks.
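
To make that concrete, a minimal zero-copy sketch for a generic read-once input buffer (the frame size matches your case; the buffer names are only illustrative):

    // Zero-copy: pin a host buffer and map it into the GPU address space.
    // The kernel reads it directly; no host-to-device copy is made.
    uint8_t *h_frame;   // host pointer, e.g. filled by the capture code
    uint8_t *d_frame;   // device alias of the same physical memory

    cudaHostAlloc((void **)&h_frame, 4096 * 2048, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_frame, (void *)h_frame, 0);

    // ... fill h_frame with a luminance plane on the CPU ...

    findSdMeanKernel<<<blocks, threadsPerBlock>>>((int *)d_frame, stats);
    cudaDeviceSynchronize();
    cudaFreeHost(h_frame);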

When you say “read once”, do you mean “read once by the host”?

Hi,

Read once by the CUDA kernel.

For example, a DNN detection pipeline is a typical read-once use case:
the input frame is only accessed once, by the TensorRT engine.

Thanks.

Okay, so you hand the data from host to device, the kernel reads it, and that is the end of that data being used?

Hi,

It’s recommended to read this document first:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Zero-copy memory doesn’t copy data from host to device; instead, the GPU directly accesses a buffer pinned in host memory.
Since the memory is physically pinned in CPU system memory, access may be fast or slow depending on which processor is accessing it.

Unified memory is more like migration on demand. It decouples the memory and execution spaces so that all data accesses are fast.
However, the migration mechanism itself adds some overhead.

Thanks.

Okay, I have read that first link. So basically I have to use cudaStreamAttachMemAsync to tell the system where the unified memory will be accessed, in order to speed things up.
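
If I understand the document correctly, that would look something like this sketch (untested on my side, and the stream handling is my own assumption):

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMallocManaged(&stats, BOX_W * BOX_H * sizeof(rectangleStats));

    // Attach the managed buffer to this stream so that only work queued
    // on this stream has to complete before the CPU may touch it.
    cudaStreamAttachMemAsync(stream, stats, 0, cudaMemAttachSingle);

    findSdMeanKernel<<<blocks, threadsPerBlock, 0, stream>>>((int *)pDevPtr, stats);

    // Synchronize just this stream rather than the whole device.
    cudaStreamSynchronize(stream);
    printf("Mean is %f\n", stats[0].mean);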

I suppose I could also use pinned memory for getting the data back from the GPU, since “Pinned memory is recommended for small buffers because the caching effect is negligible for such buffers and also because pinned memory does not involve any additional overhead, unlike Unified Memory.”
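
For the 1024-entry stats buffer, that might look like the following sketch, replacing the cudaMallocManaged call (again untested; the variable names are just mine):

    rectangleStats *h_stats;   // pinned host buffer, written directly by the GPU
    rectangleStats *d_stats;   // device-side alias of the same memory

    cudaHostAlloc((void **)&h_stats, BOX_W * BOX_H * sizeof(rectangleStats),
                  cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_stats, (void *)h_stats, 0);

    findSdMeanKernel<<<blocks, threadsPerBlock>>>((int *)pDevPtr, d_stats);
    cudaDeviceSynchronize();   // the kernel must finish before the CPU reads

    // 1024 small structs: exactly the kind of buffer the quote is about.
    printf("Mean is %f\n", h_stats[0].mean);
    cudaFreeHost(h_stats);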