CUDA pointer alignment for GPUDirect Remote Direct Memory Access

I am new to GPUDirect and I found an official example at this link: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html. There is a section of code as follows.

// for boundary alignment requirement
#define GPU_BOUND_SHIFT   16
#define GPU_BOUND_SIZE    ((u64)1 << GPU_BOUND_SHIFT)
#define GPU_BOUND_OFFSET  (GPU_BOUND_SIZE-1)
#define GPU_BOUND_MASK    (~GPU_BOUND_OFFSET)

struct kmd_state {
	nvidia_p2p_page_table_t *page_table;
    // ...
};

void kmd_pin_memory(struct kmd_state *my_state, void *address, size_t size)
{ 
    // do proper alignment, as required by NVIDIA kernel driver
    u64 virt_start = address & GPU_BOUND_MASK;
    size_t pin_size = address + size - virt_start;
    if (!size)
    	return -EINVAL;
    int ret = nvidia_p2p_get_pages(0, 0, virt_start, pin_size, &my_state->page_table, free_callback, &my_state);
    if (ret == 0) {
        // Successfully pinned, page_table can be accessed
    } else {
        // Pinning failed
    }
}

I am confused about line 15 (u64 virt_start = address & GPU_BOUND_MASK;), which is a bitwise operation. It seems that GPU_BOUND_MASK is just 0…01111111111111111, where there are 16 ones and the rest are all zeros. I do not understand why this gives us an aligned pointer.

To be clear, when you say 0…01111111111111111 you are referring to a binary view of GPU_BOUND_MASK, and that wouldn’t be a correct (or achievable) value anyway.

For simplicity of typing, let’s pretend that GPU_BOUND_SHIFT were 4 instead of 16. This would mean we are expecting to align to a boundary where the lowest 4 bits are zero, i.e. a 16-byte boundary. (With the real shift of 16, the lower 16 bits of GPU_BOUND_MASK being zero simply means the expected alignment boundary has its lower 16 bits equal to zero, i.e. a 64 KB boundary.)

The way this works is as follows.

address is passed to the function, and it is assumed it may take on any numerical value. Let’s pick an arbitrary value: suppose address is 0x81. That is not on a 16-byte boundary (the 16-byte boundaries would be 0, 16, 32, 48, 64, 80, etc., or in hex: 0x10, 0x20, 0x30, 0x40, …).

If GPU_BOUND_SHIFT is 4, then:

#define GPU_BOUND_SHIFT   4
#define GPU_BOUND_SIZE    ((u64)1 << GPU_BOUND_SHIFT) == 1ULL << 4 = 16 = 0x10 = 010000b
#define GPU_BOUND_OFFSET  (GPU_BOUND_SIZE-1)          == 16 - 1 = 15     = 0x0F = 01111b
#define GPU_BOUND_MASK    (~GPU_BOUND_OFFSET)         ==     1111111111..111111110000b  (all ones, then four zeros)

So it should be evident that you had your sense of 1 and 0 backwards when you suggested that GPU_BOUND_MASK is a bunch of zeros followed by 16 ones. If we take GPU_BOUND_MASK as above and do a bitwise AND with 0x81 (our address value), the result is 0x80, which is the closest 16-byte-aligned pointer less than or equal to address. This gives us virt_start.
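If it helps, here is a minimal stand-alone sketch of the same arithmetic (plain user-space C, not driver code, using uint64_t in place of the kernel's u64 and the illustrative shift of 4 rather than 16). The address and size values are made up purely for demonstration:

#include <stdio.h>
#include <stdint.h>

/* same macros as the documentation example, but with the illustrative shift of 4 */
#define GPU_BOUND_SHIFT   4
#define GPU_BOUND_SIZE    ((uint64_t)1 << GPU_BOUND_SHIFT)   /* 16              */
#define GPU_BOUND_OFFSET  (GPU_BOUND_SIZE - 1)                /* 0x0F            */
#define GPU_BOUND_MASK    (~GPU_BOUND_OFFSET)                 /* 0xFF...FFF0     */

int main(void)
{
    uint64_t address = 0x81;    /* arbitrary, unaligned starting address */
    size_t   size    = 0x20;    /* arbitrary requested length            */

    uint64_t virt_start = address & GPU_BOUND_MASK;     /* rounds down: 0x81 -> 0x80      */
    size_t   pin_size   = address + size - virt_start;  /* 0x81 + 0x20 - 0x80 = 0x21      */

    printf("address    = 0x%llx\n", (unsigned long long)address);
    printf("virt_start = 0x%llx\n", (unsigned long long)virt_start);
    printf("pin_size   = 0x%zx\n", pin_size);
    return 0;
}

Running this prints virt_start = 0x80 and pin_size = 0x21, i.e. the start is rounded down to the nearest boundary and the size is grown by the same amount so the original range [address, address+size) is still covered.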

By the way, this kind of GPUDirect driver code is not actually CUDA; it is ordinary C/C++, and the treatment above doesn’t use any CUDA concepts, just ordinary C/C++ coding concepts.

Thank you so much!

One more question, please. I see that we just find the closest 16-byte-aligned pointer less than or equal to our original address, and in the subsequent computation we pass that address to nvidia_p2p_get_pages. But I am confused about how we can make sure that the memory between virt_start and address is available for use. If that region is being used by some other API, it seems there might be something wrong with that. Could you please help me figure it out? Thank you.

The GPU KMD (kernel mode driver) requires that buffers used for this capability (GPUDirect RDMA) be pinned. The pinning must be done per page, which in this case means in 64 KB chunks of memory, properly aligned (on a 64 KB boundary).
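To put some (made-up) numbers on that, here is a small stand-alone sketch of the same rounding with the real 16-bit shift. The address and size are hypothetical, and the page count is just back-of-the-envelope arithmetic for illustration, not something the driver reports:

#include <stdio.h>
#include <stdint.h>

#define GPU_BOUND_SHIFT   16
#define GPU_BOUND_SIZE    ((uint64_t)1 << GPU_BOUND_SHIFT)   /* 0x10000 = 64 KB */
#define GPU_BOUND_OFFSET  (GPU_BOUND_SIZE - 1)
#define GPU_BOUND_MASK    (~GPU_BOUND_OFFSET)

int main(void)
{
    /* hypothetical values, chosen only to illustrate the rounding */
    uint64_t address = 0x7f1234512345ULL;   /* caller's (unaligned) buffer start */
    size_t   size    = 0x20000;             /* caller's requested length (128 KB) */

    uint64_t virt_start   = address & GPU_BOUND_MASK;     /* rounded down to a 64 KB boundary     */
    uint64_t extra_before = address - virt_start;          /* bytes in front of the caller's buffer */
    size_t   pin_size     = address + size - virt_start;   /* span covering [address, address+size) */
    uint64_t pages        = (pin_size + GPU_BOUND_OFFSET) >> GPU_BOUND_SHIFT;  /* whole 64 KB pages */

    printf("virt_start   = 0x%llx\n", (unsigned long long)virt_start);
    printf("extra_before = 0x%llx\n", (unsigned long long)extra_before);
    printf("pin_size     = 0x%zx (%llu whole 64 KB pages)\n", pin_size, (unsigned long long)pages);
    return 0;
}

The extra_before bytes are simply the front part of the first 64 KB page that the caller’s buffer starts in; since pinning happens in whole, aligned pages, they come along for the ride.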

The pinning of that memory range does not prevent anyone else from using it, nor does it imply that it is somehow reserved for GPUDirect RDMA.

From the caller’s perspective, it requested size bytes to be pinned, starting at address. The caller doesn’t know that some number of bytes prior to address was also pinned, nor does the caller care. From the caller’s perspective, the requested region was pinned.

For any other user of the space between virt_start and address, the fact that it is now pinned does not impede their use of that space.

Thank you!