Aliasing of pointers to members of objects passed to kernel

I’m considering writing kernels that accept references to objects as opposed to fundamental types, for example

structure Geometry
{
	float* 	data;
	size_t	size;
};

__global__ void munge( Geometry& geometry1 , Geometry& geometry2  )
{
	// ...
}

My concern however is that how can I ensure that geometry1.data and geometry2.data are not seen by the compiler as aliasing each other?

Even if I passed the arguments by pointer like so

__global__ void munge_( Geometry* __restrict__ geometry1 , Geometry* __restrict__ geometry2  )
{
	// ...
}

This doesn’t seem sufficient to convince the compiler that geometry1.data and geometry2.data do not alias each other, or is it?

Or maybe mark the data member as restrict, like so:

structure Geometry
{
	float* 	__restrict__ data;
	size_t	size;
};

This is illegal unless you are using managed memory:

global void munge( Geometry& geometry1 , Geometry& geometry2 )

The safest course of action if you want to tell the compiler that pointers are not aliased is to decorate the pointers in question. The pointer to a struct has no bearing on a struct member pointer.

I want to be perfectly sure that I understand what you are saying. Do you mean that kernels arguments cannot be references? If so are you suggesting that I pass pointers to the Geometry objects instead – with data members marked restrict, of course.

PS: Geometry::data is allocated with cudaMalloc() and free’d with cudaFree()

That’s correct. Kernel arguments cannot be references. (unless they refer to something allocated with cudaMallocManaged()) Try it. It won’t work. The kernel launch syntax is host code. The reference to anything in host code is a host reference (kind of like taking the address of a host variable. It is a host address. Completely useless in device code.)

Now I get it!! … Thanks.

Any address that is passed to a cuda kernel MUST reside on the device, because attempting to access host memory from the device or vice-versa is not meaningless – unless said address is created in unified memory (by definition accessible to the host and device).

I think it should also be possible to pass by reference something that is within a mapped space.

For example:

struct foo {
   int a;
}

__global void kernel(int &b){};

...
foo *data;

cudaHostAlloc(&data, sizeof(foo), cudaHostAllocDefault);

...

kernel<<<...>>>(data[0].a);

I think that should work, haven’t tried it.

I’m watching a bunch of videos on unified/managed managed memory as this is the first time I’ve considered using this type of memory. From what I’ve learnt so far the unified memory does API deep copies of pointers. But what if I don’t want a deep copy, for example in the case where the data has already been created on device as in the following struct?

struct Geometry
{
	float* __restrict__ data;	// initlaized by cudaMalloc
	size_t	            size;
	short               dim;
};

I’m also toying with the idea of creating a corresponding Geometry object in global memory but I don’t know how to copy the device data pointer from the Geometry object on the host to the one on the device.

If you are referring to allocations performed on the host:

If you don’t provide an allocation for data using cudaMallocManaged, then that allocation is not part of the managed memory cohort and won’t be automatically migrated. The numerical value of the pointer itself is part of the (higher level) managed allocation, but what it points to is not. Therefore this is probably exactly the behavior you want, according to my read.

The “automatic deep copy” associated with managed memory refers to the idea that you are using a managed memory allocator “throughout” your code. In that case, everything (that you allocated) gets managed.

You probably just need to give this some careful thought. There is a difference between a pointer (i.e. its location and value) and the region that it points to.

If you are referring to allocations done on the device (i.e. in device code):

  1. cudaMallocManaged is not available in the device runtime API
  2. allocations created by malloc, new, or cudaMalloc in device code cannot participate in a host<->device transfers of any kind, ever. these types of allocations are done out of the “device heap”, and allocations in the device heap cannot participate in host<->device transfer operations.

I’ve just come across several references that suggest that kernel arguments can be POD. Is this true?

Assuming POD means “plain old data”:

http://en.cppreference.com/w/cpp/concept/PODType

of course they can. Kernel arguments can be anything you can pass by value (subject to some size limits, documented in the programming guide). That includes POD, pointers to POD, etc.

Great! Problem solved. (Now I no longer have to resort to the sort of template trickery I’d been dreading.)

I have done a search of the programming guide but did not find the term POD.

What section of the programming guide are you referencing? I can’t find any description of the sort of arguments kernels expect in the January 2017 edition of the programming guide, especially section 2.1 (Programming Model: Kernels).

I didn’t say POD is in the programming guide. I said:

Kernel arguments can be anything you can pass by value (subject to some size limits, documented in the programming guide).

The size limits documented in the programming guide are here:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-parameters

Yes. This is a viable option for you, I suppose. Here is a full example so as not to leave out any details. Be careful doing this, it’s easy to shoot yourself in the foot. As a rule of thumb, the only kinds of pointers that work with this are generic data types (float, int, etc). This is not actually true, but just be careful!

If you called the file pod.cu, you would compile with nvcc -std=c++11 -o pod_test pod.cu and run ./pod_test. Note that c++11 really isn’t necessary, it’s just for nullptr and constexpr. Just make sure you are ALWAYS initializing your pointers to nullptr or NULL to save yourself future headache!

struct Geometry {
    float *data = nullptr;
    size_t size = 0;
};

// Note: these are *NOT* references
__global__ void munge(Geometry geometry1, Geometry geometry2) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // set all g1 to 1
    if (idx < geometry1.size)
        geometry1.data[idx] = 1.0f;

    // set all g2 to 2
    if (idx < geometry2.size)
        geometry2.data[idx] = 2.0f;
}

#include <iostream>

int main(void) {
    static constexpr size_t N = 20;
    // Create / allocate g1, set to 0 to verify kernel works
    Geometry g1;
    cudaMalloc(&g1.data, N * sizeof(float));
    cudaMemset(g1.data, 0, N * sizeof(float));
    g1.size = N;

    // Same for g2
    Geometry g2;
    cudaMalloc(&g2.data, N * sizeof(float));
    cudaMemset(g2.data, 0, N * sizeof(float));
    g2.size = N;

    // create some host buffers to verify things
    float *h_g1_data = (float *)malloc(N * sizeof(float));
    float *h_g2_data = (float *)malloc(N * sizeof(float));

    // copy down and make sure they are zero
    cudaMemcpy(h_g1_data, g1.data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_g2_data, g2.data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    // tally the number of zeros we found for both
    size_t n_g1_zero = 0;
    size_t n_g2_zero = 0;
    for (size_t i = 0; i < N; ++i) {
        if (h_g1_data[i] == 0.0f) ++n_g1_zero;
        if (h_g2_data[i] == 0.0f) ++n_g2_zero;
    }

    std::cout << "Number of zeros:" << std::endl
              << "  - G1: " << n_g1_zero << std::endl
              << "  - G2: " << n_g2_zero << std::endl << std::endl;

    // now we just call our kernel, g1 and g2 are POD because all that is
    // happenning is we are copying the pointer location and the size.

    // uncomment this line to verify that size is getting copied correctly,
    // by doing this we only do half the problem for g2
    // g2.size = N / 2;    

    munge<<< N , 1 >>>(g1, g2);// only doing this because 20 fits in one warp
    cudaDeviceSynchronize();

    // copy back and verify
    cudaMemcpy(h_g1_data, g1.data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_g2_data, g2.data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    size_t n_g1_one = 0;
    size_t n_g2_two = 0;
    for (size_t i = 0; i < N; ++i) {
        if (h_g1_data[i] == 1.0f) ++n_g1_one;
        if (h_g2_data[i] == 2.0f) ++n_g2_two;
    } 

    std::cout << "Number of G1 were 1: " << n_g1_one << std::endl
              << "Number of G2 were 2: " << n_g2_two << std::endl;

    // free up the data we used
    cudaFree(g1.data);
    cudaFree(g2.data);
    free(h_g1_data);
    free(h_g2_data);

    return 0;
}

So it works fine, and may be useful, but for this problem I personally would just change the signature of munge to be

__global__ void munge(float * __restrict__ g1_data, size_t g1_size,
                      float * __restrict__ g2_data, size_t g2_size) { /* ... */ }

if you’re really that worried about aliasing here. If I recall correctly, in c++ restrict is not equivalent to restrict in C. I don’t think it necessarily always guarantees strict aliasing the way it does in C. But I’m far from an authority, I just vaguely recall this being the case.

Edit: my memory seems correct, it’s not officially standardized, but it seems most compilers will support it https://stackoverflow.com/questions/776283/what-does-the-restrict-keyword-mean-in-c

it’s the same - standard on C, but not in C++. hopefully, some compilers support it as extension. But since extension it’s only extension, they use word in a form reserved for compilers i.e. starting with underscores