Why would cuMemAllocAsync want to "insert new stream dependencies"?

One of the attributes of a memory pool in CUDA is:

     * Allow cuMemAllocAsync to insert new stream dependencies
     * in order to establish the stream ordering required to reuse
     * a piece of memory released by cuFreeAsync (default enabled).

I don’t understand what this means. If the pool is not certain it would be able to reuse a piece of already-allocated memory - why would it risk slowing down my computation rather than just increasing its backing memory size?

cuFreeAsync doesn’t free memory by itself. With this default enabled, it inserts a stream marker so that when stream processing gets to that point, the memory will be submitted for freeing (returned to the pool). Reuse can happen after that. This gives the mem pool a mechanism to manage that reuse.

If the stream processing has reached that point, then the memory in question can be reused, and is subject to reuse by the mem pool (for another incoming allocation request). If the stream processing has not reached that point, then it cannot be reused. There is no lack of certainty here.

gets to which point?

I mean, if you mean some point stream-ordered-after the execution of a free operation on a stream, then - that should be handled by CU_MEMPOOL_ATTR_REUSE_FOLLOW_EVENT_DEPENDENCIES; and if you mean a point stream-ordered-before such a free - it’s illegitimate to reclaim the allocation then. Isn’t it?

the point at which the free operation inserted a marker into the stream.

Suppose you have a created stream, and you issue a kernel call, a memcpy, another kernel call, and another memcpy, then the free:

int *d_a;
cudaMallocAsync(&d_a, ..., streamA);
k<<<..., streamA>>>(d_a, ...);
cudaMemcpyAsync(d_a, ..., streamA);
k<<<..., streamA>>>(d_a, ...);
cudaMemcpyAsync(d_a, ..., streamA);
cudaFreeAsync(d_a, streamA);

cudaFreeAsync in the above example inserts a marker into streamA, so that when the streamA processing reaches that point (i.e. when it has completed both kernel calls and both copy operations), then the memory associated with d_a will be returned to the pool.

That is "the stream ordering required to reuse a piece of memory released by cudaFreeAsync".
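To make that concrete, here is a hedged sketch continuing the example: a later allocation issued on the same stream, after the free, can be satisfied from the pool without any new dependencies being inserted, because stream order alone already guarantees the free completes first.

```cuda
// Sketch only, continuing the streamA example above (d_b is illustrative).
int *d_b;
// This allocation is stream-ordered after cudaFreeAsync(d_a, streamA),
// so the pool may satisfy it with the very memory that backed d_a,
// with no new dependencies needed.
cudaMallocAsync(&d_b, ..., streamA);
k<<<..., streamA>>>(d_b, ...);
cudaFreeAsync(d_b, streamA);
```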

Here is the documentation description:

The asynchronous allocator allows the user to allocate and free in stream order. All asynchronous accesses of the allocation must happen between the stream executions of the allocation and the free. If the memory is accessed outside of the promised stream order, a use before allocation / use after free error will cause undefined behavior.

The allocator is free to reallocate the memory as long as it can guarantee that compliant memory accesses will not overlap temporally. The allocator may refer to internal stream ordering as well as inter-stream dependencies (such as CUDA events and null stream dependencies) when establishing the temporal guarantee. The allocator may also insert inter-stream dependencies to establish the temporal guarantee.

The “compliant memory accesses” referenced above would refer to one before the free operation (on the “old” state of the allocation) and one after a subsequent malloc operation (operating on the “new” state of the allocation).
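As a hedged sketch of one way that temporal guarantee can be established across streams (names like d_a, d_b, streamA, streamB, bytes are illustrative, not from the thread):

```cuda
// Sketch: cross-stream reuse made legal by an event dependency.
cudaEvent_t done;
cudaEventCreate(&done);

cudaFreeAsync(d_a, streamA);            // free, in streamA's order
cudaEventRecord(done, streamA);         // event recorded after the free
cudaStreamWaitEvent(streamB, done, 0);  // streamB waits on the free

// With CU_MEMPOOL_ATTR_REUSE_FOLLOW_EVENT_DEPENDENCIES enabled, this
// allocation on streamB may reuse d_a's memory: every compliant access
// to d_b is temporally after the last compliant access to d_a.
int *d_b;
cudaMallocAsync(&d_b, bytes, streamB);
```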

This blog may also be of interest.

Adopting your example - we may want to use the memory of d_a

  • on the same stream (streamA), in an allocation scheduled after the async-free - no dependency insertion is necessary, that can just happen.
  • on another stream streamB,
    • with an event scheduled later in the code than the cudaFreeAsync(), and a wait on that event scheduled on streamB. That’s possible if we set the pool’s CU_MEMPOOL_ATTR_REUSE_FOLLOW_EVENT_DEPENDENCIES attribute.
    • without such a stream-ordering, if, when execution eventually occurs, it so happens that d_a in fact gets freed before the attempted allocation. That’s possible if we set the pool’s CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC attribute.
    • without such a stream-ordering, and a priori. That doesn’t make any sense - and should just not happen.
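For reference, the cases above correspond to per-pool attributes that can be toggled individually; a hedged sketch using the runtime-API spellings of the driver attributes (device 0 assumed):

```cuda
// Sketch: toggling the pool's reuse policies via the runtime API.
cudaMemPool_t pool;
cudaDeviceGetDefaultMemPool(&pool, 0);  // default pool for device 0

int enable = 1;
// Reuse across streams when an event dependency already orders free -> alloc:
cudaMemPoolSetAttribute(pool, cudaMemPoolReuseFollowEventDependencies, &enable);
// Reuse when the free happens to have already completed at allocation time:
cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &enable);
// Let cudaMallocAsync insert new stream dependencies itself (the attribute
// this thread is about):
cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &enable);
```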

So I don’t get what CU_MEMPOOL_ATTR_REUSE_ALLOW_INTERNAL_DEPENDENCIES does for me, even with your example.

Edit: I’ll go read the blog.

@Robert_Crovella : So, I read the blog, whose last part deals with CU_MEMPOOL_ATTR_REUSE_ALLOW_INTERNAL_DEPENDENCIES. What it suggests seems nonsensical! It says:

If kernelA has not finished execution, the CUDA driver can add an implicit dependency between the two streams such that kernelB does not begin executing until kernelA finishes.

If I read that correctly, that means the CUDA driver will arbitrarily choose other streams, which have some de-allocations scheduled already, and force the current stream (on which the user is scheduling an allocation) to wait on one or more of those other streams - without the user having OKed this - to have more available memory for reuse. That’s preposterous! As I wrote in my previous post in this thread - I absolutely don’t understand how this can make sense.

It seems preposterous to you because you are comparing it to an ideal case that is not necessarily achievable.

The ideal case is certainly an application that does not need to do any “on-the-fly” allocations. When I am teaching CUDA, I always strongly encourage this, and you can find suggestions online to “reuse allocations” and “keep synchronizing ops out of the work issuance loop”. In this ideal case, the proposed behavior does seem obnoxious. But it’s irrelevant, because in that case we have no need for the pool allocator at all. Of course you should do this if you can.

The whole premise here is that I have decided I need to do allocations during work issuance. If we posit that, what are our options?

Well, in the “old days”, we would have used cudaMalloc (and cudaFree, I guess). And what does cudaMalloc do? It does exactly that synchronization that is depicted in that blog picture (or, at least, it has the potential to do that, depending on where/when exactly you issue it).

Now in the modern days, we have stream oriented memory pools, as discussed in the blog. What happens when the pool is running low? Well, you could either let the pool allocator try to “scavenge” potentially usable space (using this method, which may or may not have a noticeable impact on your app), or else it can “go back to the well”, and carve out more memory space, which is just like doing a new cudaMalloc operation under the hood. And we are back to a synchronizing operation.
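One knob that reduces trips "back to the well" (a hedged sketch; by default the pool may return unused memory to the OS when a stream synchronizes) is the pool's release threshold:

```cuda
// Sketch: keep up to 64 MiB cached in the pool across synchronizations,
// so fewer allocations require carving out fresh device memory.
cudaMemPool_t pool;
cudaDeviceGetDefaultMemPool(&pool, 0);  // default pool for device 0
uint64_t threshold = 64ull * 1024 * 1024;
cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);
```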

So among those choices, I think the potential dependency choice actually looks elegant. It might be the least disruptive to your carefully crafted work issuance loop. After all, are concurrent kernels really that easy to witness, anyway? How bad is the serialization? I don’t know. And you don’t either, until we talk specifics, rather than generalities.

And if you get to a specific case, and don’t like that idea/behavior, turn the behavior off, as indicated in the blog.
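Turning it off is a one-liner per pool; a hedged sketch using the runtime API (device 0 assumed):

```cuda
// Sketch: opt out of driver-inserted inter-stream dependencies for this pool.
cudaMemPool_t pool;
cudaDeviceGetDefaultMemPool(&pool, 0);
int disable = 0;
cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &disable);
```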

How is it bad or preposterous to give you choices?

I’m not likely to argue this any further. It’s OK if we disagree. It’s the nature of community.

and carve out more memory space, which is just like doing a new cudaMalloc operation under the hood.

Hmm. You’re saying there’s no possibility of the pool obtaining more global device memory without synchronizing everything happening on the device. OK, that sounds less preposterous now.

If that were generally a possibility (a low cost cudaMalloc operation) there would be no need for any of this. The reason we want to use a pool allocator is because the basic supplied allocation mechanism (cudaMalloc) is expensive. It is expensive in a few ways, one of which is often requiring a synchronization, as if you issued work into the null stream.

AFAIK, the pool allocator system does not have a magic wand, i.e. a special mechanism to avoid this expense when “going back to the well”. (If such a mechanism were generally available/possible/feasible, there would be no sensible reason for the CUDA developers to withhold it from general usage.)