How to pass a struct to a kernel?

Hello everyone, I recently ran into a problem when trying to pass a struct to a kernel: the member variables cannot be used inside the kernel. In fact, this is an example from the book "CUDA by Example". I will show the code below. Could you please help me find the bug? Thank you very much.

#include "cuda.h"
#include "cuda_runtime.h"
#include <device_launch_parameters.h>
#include <device_functions.h>

#define imin(a,b) (a<b?a:b)
const int N = 33*1024 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);

struct Lock {
	int *mutex;
	Lock(void) {
		int state = 0;
		cudaMalloc((void**)&mutex, sizeof(int));
		cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
	}
	~Lock(void) {
		cudaFree(mutex);
	}
	__device__ void lock(void) {
		while (atomicCAS(mutex, 0, 1) != 0);
		__threadfence();
	}
	__device__ void unlock(void) {
		atomicExch(mutex, 0);
		__threadfence();
	}
};

__global__ void dot(Lock &lock, float *a, float *b, float *c)
{
	__shared__ float cache[threadsPerBlock];
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	int cacheIndex = threadIdx.x;

	float temp = 0;
	while (tid<N)
	{
		temp += a[tid] * b[tid];
		tid += blockDim.x*gridDim.x;
	}
	cache[cacheIndex] = temp;

	__syncthreads();

	int i = blockDim.x / 2;
	while (i!=0)
	{
		if (cacheIndex < i)
			cache[cacheIndex] += cache[cacheIndex + i];
		__syncthreads();
		i /= 2;
	}

	if (cacheIndex == 0)
	{
		lock.lock();
		*c += cache[0];
		lock.unlock();
	}

}

#define sum_squares(x) (x*(x+1)*(2*x+1)/6)
int main()
{
	float *a, *b, c = 0;
	float *dev_a, *dev_b, *dev_c;

	a = (float*)malloc(N * sizeof(float));
	b = (float*)malloc(N * sizeof(float));

	cudaMalloc((void**)&dev_a, N* sizeof(float));
	cudaMalloc((void**)&dev_b, N * sizeof(float));
	cudaMalloc((void**)&dev_c, sizeof(float));

	for (int i = 0; i < N; i++)
	{
		a[i] = i;
		b[i] = i * 2;
	}

	cudaMemcpy(dev_a, a, N*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_c, &c, sizeof(float), cudaMemcpyHostToDevice);
	
	Lock lock;
	dot<<<blocksPerGrid, threadsPerBlock>>>(lock, dev_a, dev_b, dev_c);
	cudaMemcpy(&c, dev_c, sizeof(float), cudaMemcpyDeviceToHost);

	printf("Does GPU value %.6g=%.6g?\n",c,2*sum_squares((float(N-1))));

	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);

	free(a);
	free(b);
}

When posting code here, please post it using code formatting.

Select the code, then click the </> button at the top of the edit window.

You can do this right now by editing your post above.

Please describe again what the problem is. Are you talking about a compile error or a runtime error? If it's a runtime error, have you run the code with cuda-memcheck?

If it's a runtime error, how are you determining that something is wrong? What results do you get, and what results do you expect?

This is incorrect:

__global__ void dot(Lock &lock, float *a, float *b, float *c)
                         ^

Unless you are using managed memory (you are not), it is illegal to pass kernel parameters by reference.

If you remove the above indicated ampersand, the code appears to run correctly for me.
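
To illustrate the distinction (a minimal sketch of my own, not from the book): kernel arguments are copied by value into the launch, so a struct holding a device pointer travels fine, whereas a reference parameter would leave the device chasing a host address.

struct Params { int *dev_buf; };   // dev_buf itself points to device memory

__global__ void ok(Params p)       // p arrives by value; p.dev_buf is usable
{
	*p.dev_buf = 1;
}

// __global__ void bad(Params &p); // illegal without managed memory: p would
//                                 // refer to an object in host memory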

This is also not recommended:

~Lock(void) {
cudaFree(mutex);

This doesn’t cause the printed result to be incorrect, but will return a runtime error if you check for runtime errors or if you use cuda-memcheck.

The reason for this is outlined here:

https://stackoverflow.com/questions/42844997/cudamemcpy-error-when-copying-from-device-to-host-after-device-class-member/42848812#42848812
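
In short (my paraphrase of the linked answer, with a sketch): the Lock is passed to the kernel by value, so a temporary copy is created, and that copy's destructor runs as soon as the launch returns. A cudaFree(mutex) in the destructor would therefore release the allocation while the kernel is still using it, and the original object's destruction at the end of main would then free it a second time. A safer layout, sketched under those assumptions:

struct Lock {
	int *mutex;
	Lock() {
		int state = 0;
		cudaMalloc((void**)&mutex, sizeof(int));
		cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
	}
	// No cudaFree here: the by-value copy passed to the kernel is destroyed
	// right after the launch, and freeing mutex at that point would pull it
	// out from under the running kernel.
	~Lock() {}
	// Hypothetical explicit cleanup instead, to be called exactly once:
	void release() { cudaFree(mutex); }
	__device__ void lock() { while (atomicCAS(mutex, 0, 1) != 0); __threadfence(); }
	__device__ void unlock() { atomicExch(mutex, 0); __threadfence(); }
};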

Thank you very much for your answer.

Following your advice, I deleted the ampersand as well as the destructor. However, nothing changed; the runtime error is still there.
I started CUDA debugging with Nsight and stopped at a breakpoint on line 3 of the kernel. There I checked the values: lock.mutex == 0 and lock.mutex[0] == ???. However, I thought that after defining a variable lock of type Lock, the default constructor would already have allocated memory for lock.mutex and set lock.mutex[0] to 0. I don't understand why this happens. Do you know how to assign a value to lock.mutex[0] in main and then use it in the kernel? Thank you in advance.

__global__ void dot(Lock lock, float *a, float *b, float *c)
{
	__shared__ float cache[threadsPerBlock];
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	int cacheIndex = threadIdx.x;
        ......
}

Can you be sure that lock.mutex hasn't been freed?

When a copy of the ‘lock’ object is passed to the kernel, are we sure that ~Lock() isn't called, potentially freeing the global-memory allocation behind the ‘mutex’ variable?

EDIT:

Nevermind :)

Hello Jimmy,
Thank you for your attention. Following up on your concern, I changed my destructor as follows:

~Lock(void) {
	printf("destructor has been called");
}

However, when I step into the kernel, the message is not printed and the runtime error persists.

Hello everyone,

I finally found a solution, which is proposed by Mark Harris in his blog:
https://devblogs.nvidia.com/unified-memory-in-cuda-6/

Following his example code, I rewrote the struct as follows, and it works perfectly. However, I do not really understand the 14th line, where we use cudaMemcpy to copy the address of d_mutex. This seems very tricky. Could anyone explain this?

    struct Lock {
    	int *mutex;
    	Lock(void) {
    		mutex = (int*)malloc(sizeof(int));
    		mutex[0] = 0;
    	}
    	Lock* get_d_lock() {
    		Lock *d_lock;
    		int *d_mutex;
    		cudaMalloc((void**)&d_lock, sizeof(Lock));
    		cudaMalloc((void**)&d_mutex, sizeof(int));
    		cudaMemcpy(d_lock, this, sizeof(Lock), cudaMemcpyHostToDevice);
    		cudaMemcpy(d_mutex, this->mutex, sizeof(int), cudaMemcpyHostToDevice);
    		cudaMemcpy(&(d_lock->mutex), &d_mutex, sizeof(int*), cudaMemcpyHostToDevice);
    		return d_lock;
    	}
    	~Lock(void) {
    		printf("deconstructor has been called");
    	}
    	__device__ void lock(void) {
    		while (atomicCAS(mutex, 0, 1) != 0);
    		__threadfence();
    	}
    	__device__ void unlock(void) {
    		atomicExch(mutex, 0);
    		__threadfence();
    	}
    };
    

I will reiterate that the changes I previously suggested seem to work for me.

Here's a completely worked example, showing that the lock (mutex) pointer value set in host code is the same as the one observed in device code:
$ cat t419.cu
#include <stdio.h>

#define imin(a,b) (a<b?a:b)
const int N = 33*1024 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);

struct Lock {
	int *mutex;
	Lock(void) {
		int state = 0;
		cudaMalloc((void**)&mutex, sizeof(int));
		cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
		printf("constructor, lock pointer: %p\n", mutex);
	}
	~Lock(void) {
		//cudaFree(mutex);
		printf("destructor\n");
	}
	__device__ void lock(void) {
		while (atomicCAS(mutex, 0, 1) != 0);
		__threadfence();
	}
	__device__ void unlock(void) {
		atomicExch(mutex, 0);
		__threadfence();
	}
};

__global__ void dot(Lock lock, float *a, float *b, float *c)
{
	__shared__ float cache[threadsPerBlock];
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	if (tid == 0) printf("in kernel, lock pointer: %p\n", lock.mutex);
	int cacheIndex = threadIdx.x;

	float temp = 0;
	while (tid<N)
	{
		temp += a[tid] * b[tid];
		tid += blockDim.x*gridDim.x;
	}
	cache[cacheIndex] = temp;

	__syncthreads();

	int i = blockDim.x / 2;
	while (i!=0)
	{
		if (cacheIndex < i)
			cache[cacheIndex] += cache[cacheIndex + i];
		__syncthreads();
		i /= 2;
	}

	if (cacheIndex == 0)
	{
		lock.lock();
		*c += cache[0];
		lock.unlock();
	}
}

#define sum_squares(x) (x*(x+1)*(2*x+1)/6)
int main()
{
	float *a, *b, c = 0;
	float *dev_a, *dev_b, *dev_c;

	a = (float*)malloc(N * sizeof(float));
	b = (float*)malloc(N * sizeof(float));

	cudaMalloc((void**)&dev_a, N* sizeof(float));
	cudaMalloc((void**)&dev_b, N * sizeof(float));
	cudaMalloc((void**)&dev_c, sizeof(float));

	for (int i = 0; i < N; i++)
	{
		a[i] = i;
		b[i] = i * 2;
	}

	cudaMemcpy(dev_a, a, N*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_c, &c, sizeof(float), cudaMemcpyHostToDevice);

	Lock lock;
	printf("launching kernel\n");
	dot<<<blocksPerGrid, threadsPerBlock>>>(lock, dev_a, dev_b, dev_c);
	cudaMemcpy(&c, dev_c, sizeof(float), cudaMemcpyDeviceToHost);

	printf("Does GPU value %.6g=%.6g?\n",c,2*sum_squares((float(N-1))));

	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);

	free(a);
	free(b);
}
$ nvcc -o t419 t419.cu
$ cuda-memcheck ./t419
========= CUDA-MEMCHECK
constructor, lock pointer: 0x7ff768400200
launching kernel
destructor
in kernel, lock pointer: 0x7ff768400200
Does GPU value 2.76217e+22=2.76217e+22?
destructor
========= ERROR SUMMARY: 0 errors
$

With respect to your question about line 14, I would have to make some assumptions about how you are using this new struct, because the usage must clearly be different. Anyway, your get_d_lock function allocates space for a Lock struct on the device, then copies the host Lock struct (this) to the newly allocated device copy. That puts a pointer allocated by malloc into the mutex slot of the device copy, which clearly won't work. So we then allocate a new pointer for device usage (d_mutex) and, in line 14, copy that pointer value into the mutex slot of the device copy, replacing the malloc'd pointer that was there.

Your methodology seems overly complex IMO, but I believe it should work. I'm unable to explain why the code posted above would not work in your setup, however.
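
Presumably the kernel now takes a pointer to the device copy. Here is a sketch of what I assume the usage looks like (your post doesn't show it, so this is guesswork on my part):

__global__ void dot(Lock *lock, float *a, float *b, float *c)
{
	__shared__ float cache[threadsPerBlock];
	// ... per-thread products and shared-memory reduction as before ...
	if (threadIdx.x == 0)
	{
		lock->lock();    // note the ->: lock is a pointer to device memory now
		*c += cache[0];
		lock->unlock();
	}
}

// host side:
Lock h_lock;
Lock *d_lock = h_lock.get_d_lock();
dot<<<blocksPerGrid, threadsPerBlock>>>(d_lock, dev_a, dev_b, dev_c);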

In fact, I use Visual Studio 2015 to run this code, with the Nsight integration that lets me stop on lines inside the kernel and inspect GPU memory. First, thank you for showing me your code. When I run it in VS2015, I finally get the same results as you do, but I cannot inspect the GPU memory (lock.mutex) inside the kernel.

Now I understand the problem. In your code, the lock is allocated on the host, so it has a host memory address. Even though lock.mutex holds a device memory address, Nsight cannot inspect it: to find lock.mutex, Nsight first has to find lock and then add an offset to reach mutex. Since lock lives in host memory, Nsight cannot find it, and therefore cannot apply the offset either.

In my latest version of the code, d_lock has a device memory address, and d_lock->mutex also holds a device address. The 14th line just links d_mutex into the d_lock->mutex field. Because this link is made in device memory, we have to use cudaMemcpy to perform what is really just an assignment (it would be d_lock->mutex = d_mutex if everything were on the host).
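
In other words, the 14th line performs the pointer patch below (the same call, annotated):

cudaMemcpy(&(d_lock->mutex),   // device address of the mutex field inside d_lock
           &d_mutex,           // host location holding the new device pointer value
           sizeof(int*),
           cudaMemcpyHostToDevice);
// Morally "d_lock->mutex = d_mutex;", except the destination lives in device
// memory, so only a memcpy can write it from the host.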

One last comment: as you said, this is just like allocating a struct on the device and copying the whole host struct over to it. The same job could be done more easily with unified memory, so I will try that instead of the cudaMemcpy calls. My objective is not only to get the right result, but also to be able to use Nsight to inspect every piece of memory the kernel uses.
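
For example, something along these lines (a minimal sketch assuming cudaMallocManaged, following the pattern in the blog post; I have not verified it under Nsight yet):

struct Lock {
	int *mutex;
	Lock() {
		// Managed memory is reachable from both host and device, so no
		// explicit cudaMemcpy and no pointer patching are needed.
		cudaMallocManaged(&mutex, sizeof(int));
		*mutex = 0;
	}
	__device__ void lock() { while (atomicCAS(mutex, 0, 1) != 0); __threadfence(); }
	__device__ void unlock() { atomicExch(mutex, 0); __threadfence(); }
};
// Pass by value as before; call cudaDeviceSynchronize() before touching
// *mutex on the host after a kernel launch.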

You made me think about this problem much more clearly. I really appreciate your help.