Hi all!
So far my experience of CUDA has been wonderful and I have realised vast speedups in my applications. I started using 3D memory allocations (via cudaMalloc3D) the other day, and since then things have gone downhill quickly.
At first I found little documentation about cudaMalloc3D and the related 3D elements (cudaMemcpy3D, cudaMemset3D, cudaPitchedPtr, cudaExtent), and what I found was conflicting. I am writing in C++ under Visual Studio and so cannot use helper functions such as make_cudaExtent and make_cudaPitchedPtr (these appear to be available only in .cu files, but this is not documented anywhere). Fortunately I have managed to allocate my 3D array via:
// create the three dimensional pose network on the GPU
extent.width = 64 * sizeof(float);
extent.height = 64;
extent.depth = 32;
cudaMalloc3D(&pose_network, extent);
cudaMalloc3D(&pose_network_swap, extent);
// Initialise the network to some arbitrary energy
cudaMemset3D(pose_network, 8, extent);
cudaMemset3D(pose_network_swap, 8, extent);
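One thing that may be relevant to the values seen later: cudaMemset3D, like memset, writes its value into every byte, not into every float. The host-only sketch below imitates that on a single float. The helpers fill_float_bytes and format_fixed are hypothetical names made up for this illustration, and it assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not part of the CUDA API): fills every byte of one
// float with `value`, which is what a byte-wise memset does to each element.
float fill_float_bytes(int value) {
    assert(value >= 0 && value <= 255);   // only the low byte is used
    float f;
    std::memset(&f, value, sizeof f);     // each of the 4 bytes becomes `value`
    return f;
}

// Hypothetical helper: formats a float the way printf("%f") would.
std::string format_fixed(float f) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%f", f);
    return std::string(buf);
}
```

With a fill value of 8, each float ends up with the bit pattern 0x08080808, which is a tiny positive number (roughly 4e-34) and not 8.0f, so format_fixed(fill_float_bytes(8)) gives "0.000000". To set every element to an arbitrary float value, a small initialisation kernel or a copy from a host buffer would probably be needed.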
However, when I debug my GPU kernel in the emulator (I can’t debug a kernel on the GPU itself without Nexus, which isn’t supported), dereferencing any element of the pose_network array gives 0.000000 when cast to a float. Code below:
__global__ void translate(cudaPitchedPtr pitched_pose_network, cudaPitchedPtr pitched_pose_network_swap, cudaExtent extent, float v) {
register float reference;
// calculate the target cell reference
unsigned int x = threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int z = blockIdx.x;
float* pose_network = (float*)pitched_pose_network.ptr;
float* pose_network_swap = (float*)pitched_pose_network_swap.ptr;
size_t pitch = pitched_pose_network.pitch / sizeof(float); // pitch is in bytes; convert to a count of floats (not float*)
size_t layer_pitch = pitch * extent.height;
// dereference
reference = pose_network[z * layer_pitch + y * pitch + x];
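For comparison, the addressing pattern usually shown for cudaMalloc3D memory keeps the pitch in bytes and steps through a char pointer, which avoids having to pick the right divisor (sizeof(float), not sizeof(float*), which differs on 64-bit builds). The sketch below mirrors that arithmetic on the host so it can be checked without a GPU. The names element_at and pitch_in_floats are hypothetical helpers, not CUDA functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: address of element (x, y, z) in a pitched 3D block.
// `base` and `pitch_bytes` play the role of cudaPitchedPtr.ptr and .pitch,
// `height` the role of cudaExtent.height.
float* element_at(void* base, std::size_t pitch_bytes, std::size_t height,
                  std::size_t x, std::size_t y, std::size_t z) {
    char* bytes = static_cast<char*>(base);
    std::size_t slice_pitch = pitch_bytes * height;          // bytes per z-slice
    char* row = bytes + z * slice_pitch + y * pitch_bytes;   // start of row (y, z)
    return reinterpret_cast<float*>(row) + x;                // x-th float in that row
}

// Hypothetical helper: pitch expressed as a number of floats.
// Only meaningful when the byte pitch is a multiple of sizeof(float).
std::size_t pitch_in_floats(std::size_t pitch_bytes) {
    assert(pitch_bytes % sizeof(float) == 0);
    return pitch_bytes / sizeof(float);
}
```

The same arithmetic should carry over into a kernel unchanged: take the .ptr and .pitch fields of the cudaPitchedPtr argument, compute the row address in bytes, and only then cast to float*.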
The other peculiar thing about my situation is that when debugging, pose_network never seems to be allocated. It is always 0x00000000 even after
float* pose_network = (float*)pitched_pose_network.ptr;
Also, when I add x, y and z to the Watch 1 window, it says they do not exist within the scope of the stack frame for the entire duration of the kernel.
Has anyone had random issues with the emulator before, such as corruption of variables or inability to access arrays? I need to get this fixed before writing any further code. I cannot write a hundred lines of code on the GPU and just assume it will work correctly, but a debugger that produces crap isn’t much better.
Any ideas would be greatly appreciated.