Am I allocating too much memory?

Hi, I have a kernel that is erroring out. When I run it though compute-sanitizer I get an Invalid __global__ write of size 4 bytes, but when I look at the offending line I don’t see how it can be out of bounds. (Maybe I am wrong though)

Here is the code and the offending line (with some unrelated code stripped out for brevity):

__device__ inline int64_t divider() {
    static const uint32_t look_up_table[] = { /* 33 const values*/ };

    struct divider_node {
        int64_t diff;
        int32_t left;
        int32_t sign;
    }

    divider_node* div_nodes = new divider_node[33];

    for (size_t i = 0; i < 32; ++i) {
        div_nodes[i].left = (int32_t)look_up_table[i]; // <--- offending line
       
        // rest of loop
    }

    // rest of function

}

I’m not super well-versed in how Cuda handles local memory. I know that GPU memory is handled via register file, but not sure if that’s just for the stack or if that also for the heap when you allocate via new.

I have tried reducing my block sizes. At first I was running 32x16x1 and then tried 8x8x1 but was still getting the same error, so I am not certain if it’s a memory overflow or not.

That is strange. Hopefully somebody finds a solution.

Generally, instead of new for each thread, you can allocate globally an array with two of the indices being the block and thread numbers.

Oh I hadn’t considered that. A 2D global array might fix it. I’ll give it a go

In the cases, in which you can estimate the temporary space (better more than less), it is advantageous to allocate it in the beginning of the program or before the performance-critical part and possibly reuse it, instead of inside the kernel or in a performance critical loop. Memory allocations are slow.

For better (coalesced) memory accesses, make the lane number (i.e. thread number % 32) be the right-most index, even if it looks strange at first.

Better
div_nodes[i][blockthreadidx]
than
div_nodes[blockthreadidx][i]

Or rather with a struct as data type divide your struct into

div_nodes_diff[i][blockthreadIdx]
div_nodes_left[i][blockthreadIdx]
div_nodes_sign[i][blockthreadIdx]

or

    struct divider_node32 {
        int64_t diff[32]; // for the 32 lanes
        int32_t left[32];
        int32_t sign[32];
    }

and

div_nodes[blockthreadIdx / 32][i].left[blockthreadIdx % 32]

Now the three different arrays div_nodes_diff, div_nodes_left, div_nodes_sign look simpler than using a struct.

On the other hand you can do

int lane = blockthreadIdx % 32;
auto& my_div_nodes = div_nodes[blockthreadIdx / 32];

for (int i = 0; i < 32; i++)
    my_div_nodes[i].left[lane];

(Internally as optimization the assembler will probably combine getting the offset for my_div_nodes when the reference is made with the offset for lane and the fixed offset for .diff, .left or .sign so that only i is dynamic.)

BTW
Your allocation goes from 0…32, your loop from 0…31.

If your needed space is small enough, do it in shared memory instead of in global memory.

The code is not checking the return value of new. It is very likely that new returned nullptr indicating out of memory resulting in div_nodes[i].eft = value writing to out of bounds address.

CUDA driver allocates a heap when a kernel uses new/malloc. The code examples is performing new per thread resulting in a 528B allocation per thread. CUDA Programing Guide Section Dynamic Global Memory Allocation and Operations provides more information on use of new/malloc and API calls to resize the heap.

    divider_node* div_nodes = new divider_node[33];
    ASSERT(div_nodes);

    for (size_t i = 0; i < 32; ++i) {
        int32_t lut = (int32_t)look_up_table[i];
        div_nodes[i].left = lut;
       
        // rest of loop
    }
  1. Check the return value of new
  2. Separating lines with multiple memory accesses into separate statements improves the ability of tools that only support line-info to report the location of the error.
  3. When using device heap it will help you to write the calculation for size allocation and the location of the delete.

Thanks, asserting the pointer return an error. Interestingly when I did a basic if (div_nodes == nullptr) check it didn’t print that the pointer is null.

Increasing the heap sized fixed the issue

It could be that after an exception afterwards the buffer, your message was printed into, was not shown on the console.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.