CUDA performance of managed array pointers.

I’m drunk right now so I might regret asking a dumb question in the morning.

This question is about the performance difference between globally declared device pointers and pointers passed to kernel functions as parameters. It seems that large arrays (2 GB) allocated with cudaMallocManaged(…) yield inferior performance to arrays allocated with cudaMalloc(…) and passed to __global__ functions. Is it safe to say that externally declared array pointers will remain in the same place in the L2 cache (Pascal) as those passed as function parameters?

Example 1:

uint32_t* bigarray;
__global__ void doSomething(uint32_t* param)
{
    unsigned x = ...;
    uint32_t val = param[x];
    ...
}

void callDoSomething()
{
    //declare dim3s...
    //dim3...
    cudaMalloc(&bigarray,...);
    doSomething<<<...>>>(bigarray);
}

vs

__managed__ __device__ uint32_t* bigarray;

//everything else is the same here
__global__ void doSomething()
{
    //refer to bigarray instead of param
}

[s]arrays created with cudaMalloc vs cudaMallocManaged may behave differently.

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

usually performance can be “restored” with proper use of cudaMemPrefetchAsync

The above statements are assuming you are working on a Pascal or later device, and on linux.[/s]

This is on the TX2 (6.2 Pascal). Sorry, I should’ve clarified: I’m using cudaMalloc to allocate the memory that the pointer points to. It appears that declarations of global pointers that have the managed and device modifiers don’t require cudaMalloc to be called specifically for the pointer. e.g., I don’t have to say CHECK(cudaMalloc(&myPointer, sizeof(uint32_t))).
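Concretely, the pattern I mean is something like this sketch (names are illustrative, not my actual code):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// The pointer variable itself is a static managed allocation: both host
// and device code can read/write it, with no cudaMalloc for the pointer.
__managed__ __device__ uint32_t *bigarray;

__global__ void readFirst(uint32_t *out)
{
    *out = bigarray[0];   // device code dereferences the global directly
}

int main()
{
    // What the pointer points TO still needs an explicit allocation;
    // here it is an ordinary (unmanaged) device allocation.
    cudaMalloc(&bigarray, 1024 * sizeof(uint32_t));
    // ...
    cudaFree(bigarray);
    return 0;
}
```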

Are you saying that I can use cudaMallocManaged for my array and then call cudaMemPrefetchAsync so that it’s as fast as it would be if I had allocated it with cudaMalloc? I had been explicitly loading one array and then using cudaMemcpy(…) to copy it into another block of memory allocated with cudaMalloc, because the extra performance matters.
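In other words, something like this sketch (the sizes, kernel, and launch config are placeholders, untested):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

__global__ void doSomething(uint32_t *buf, size_t n) { /* ... */ }

int main()
{
    const size_t n = size_t(1) << 29;   // ~2 GB of uint32_t
    uint32_t *bigarray = nullptr;
    cudaMallocManaged(&bigarray, n * sizeof(uint32_t));

    // ... fill bigarray on the host, e.g. from a file ...

    int dev = 0;
    cudaGetDevice(&dev);
    // Migrate the pages to the GPU up front so the kernel doesn't
    // take demand-paging faults on first touch.
    cudaMemPrefetchAsync(bigarray, n * sizeof(uint32_t), dev);

    doSomething<<<4096, 256>>>(bigarray, n);
    cudaDeviceSynchronize();
    cudaFree(bigarray);
    return 0;
}
```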

[s]I didn’t know you were on TX2. That changes things. You might want to ask TX2 questions on the TX2 forum.

It should not be necessary to use cudaMemPrefetchAsync with TX2. The items in the article I linked relating to performance don’t apply to the TX2 case.

TX2 cudaMallocManaged allocations have implications for performance with respect to caching behavior however. I’m not sure there is enough posted here for me to go on to say anything further about performance.

However this:

__managed__ __device__ uint32_t* bigarray;

is kind of a silly construct in my opinion.

__managed__ __device__ means a static allocation. The sensible use of it would be (IMO):

__managed__ __device__ uint32_t bigarray[SOME_CONSTANT];

That puts all of that allocation in managed memory.

This:

__managed__ __device__ uint32_t* bigarray;

puts the storage for the pointer itself in managed memory, but who knows what the pointer points to? Whatever it points to is not necessarily a managed allocation.

If you do a cudaMalloc on it, then you have a managed pointer pointing to an unmanaged allocation. Confusing.

If it were me I would do cudaMalloc on an ordinary pointer and pass that to the kernel, which is typical CUDA usage.[/s]
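For reference, that typical usage, with error checking, would look something like this (a sketch; the CHECK macro here is illustrative, matching the style used elsewhere in this thread):

```cuda
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                    \
            exit(1);                                             \
        }                                                        \
    } while (0)

__global__ void doSomething(uint32_t *param, size_t n)
{
    size_t x = blockDim.x * blockIdx.x + threadIdx.x;
    if (x < n) param[x] *= 2;
}

int main()
{
    const size_t n = 1 << 20;
    uint32_t *bigarray = nullptr;   // ordinary host-side pointer
    CHECK(cudaMalloc(&bigarray, n * sizeof(uint32_t)));
    doSomething<<<(n + 255) / 256, 256>>>(bigarray, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(bigarray));
    return 0;
}
```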

The reason for using that construct (correct me if I’m wrong) is so that I can allocate memory on the host and access it within kernel functions without having to change signatures and calling code everywhere the kernels are invoked. It seems to work fine for that. The only other way to access the pointers both on the host and in the kernel is to declare them locally in the calling function and then pass them as arguments. I’m using pairs of optimized tables (index + data) for large masked operations, to avoid needlessly repeating kernel calls that won’t do anything.

Example:

__managed__ __device__ int32_t indexSize;
__managed__ __device__ uint32_t *index;
__managed__ __device__ float *data;

__global__ void computeResults(float *results)
{
    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < indexSize)
    {
        unsigned x = index[idx];
        results[x] = degrees(sin(radians(data[idx])));
    }
}

int main()
{ 
    ifstream fd_in(PATH, ios::in | ios::binary);
    if(fd_in)
    {
    ...
    }
    fd_in.read((char*)&indexSize, sizeof(int32_t));
    CHECK(cudaMallocManaged(&index, sizeof(uint32_t) * indexSize));
    fd_in.read((char*)index ...);
    CHECK(cudaMallocManaged... &data);
    fd_in.read(...data);
    fd_in.close();
    float *resultData;
    CHECK(cudaMalloc(&resultData, sizeof(float) * BIG_NUMBER));
    computeResults<<<(indexSize + 255) / 256, 256>>>(resultData);
    SYNC;
    CHECK(cudaMemcpy(...));
}

Responding here was a mistake on my part. Please disregard my comments.