When I pass a struct (consisting only of basic C++ datatypes) to a CUDA kernel, do I have to add an ‘alignment’ specifier to the declaration of the struct in the header file? If so, what alignment is necessary: 8 bytes, 16 bytes, or something else?
According to some internet resources, it seems to be necessary.
The reason I am asking is that I have a strange problem. I have a struct:
template <class T>
struct ImageBuffer
{
    // pointer to image data buffer. Is NULL if no image was set.
    T* data;
    int pitchElems;
};
I am setting the (64-bit pointer) ‘data’ and ‘pitchElems’ values on the host, and then passing an ‘ImageBuffer<uint8_t>’ object as parameter to the CUDA kernel. In the kernel, I have a statement like
if (image.data) {
... write some elements to array starting at address 'image.data' ...
}
which writes to the buffer only if it is not a null pointer.
This works fine on my PC (Windows 10 64-bit, VS 2013, CUDA Toolkit 7.0, GTX 960 / GTX 770). On a different PC with a GTX 970, however, it crashes. The crash occurs only in ‘Release’ mode. After some debugging, it looks like the condition ‘if (image.data)’ evaluates to true inside the kernel even when I set the ‘image.data’ pointer to NULL on the host. Values are then written through the NULL buffer pointer, which gives me an ‘invalid store’ error when I enable the CUDA memory checker in NSIGHT.
Tomorrow I will try adding such an ‘ALIGN’ macro (or the C++11 ‘alignas’ keyword, which I suppose also works with NVCC in recent toolkits) to the structure declaration in the header file, but it would be nice to hear upfront whether that really could be the problem. It’s strange because the problem appears only in the Release build and only on certain GPUs, so I am not 100% sure that it is really due to the missing alignment specifier.
The CUDA compiler makes sure that each constituent part of a struct is naturally aligned, since that is a basic requirement of the hardware architecture. This may involve padding; however, no padding between members is required when the constituent elements of a struct are ordered by decreasing size, as is the case here. Alignment in excess of the alignment required for the largest struct member (here: 8 bytes, a 64-bit pointer) may sometimes help with performance, but I do not see how that applies here, since the issue is of a functional nature.
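For illustration, the layout described above can be checked on the host at compile time. This is only a minimal sketch: it assumes a C++11 host compiler and a platform where a pointer is at least as large and as strictly aligned as an int, and the alias Buf is a name introduced purely for this example. It checks the host-side layout only and says nothing about what a particular driver does with the argument.

```cpp
#include <cstddef>      // offsetof
#include <cstdint>      // std::uint8_t
#include <type_traits>  // std::is_standard_layout, std::is_trivially_copyable

// Same members as the ImageBuffer in the question, declared as a struct
// so that the members are public.
template <class T>
struct ImageBuffer
{
    T*  data;        // pointer to image data buffer, NULL if no image was set
    int pitchElems;
};

// Hypothetical alias, used only to keep the checks below short.
using Buf = ImageBuffer<std::uint8_t>;

// A struct passed by value to a kernel is copied as plain bytes, so it
// should be a simple, standard-layout type.
static_assert(std::is_trivially_copyable<Buf>::value, "plain byte copy must be valid");
static_assert(std::is_standard_layout<Buf>::value, "offsetof requires standard layout");

// The struct takes the alignment of its most strictly aligned member,
// which is the pointer (assumption: pointer alignment >= int alignment).
static_assert(alignof(Buf) == alignof(std::uint8_t*), "aligned like the pointer member");

// Members are ordered by decreasing size, so nothing is inserted between them.
static_assert(offsetof(Buf, data) == 0, "pointer sits at the start");
static_assert(offsetof(Buf, pitchElems) == sizeof(std::uint8_t*), "int follows the pointer directly");

// Tail padding rounds the size up to a multiple of the alignment.
static_assert(sizeof(Buf) % alignof(Buf) == 0, "size is a multiple of the alignment");
```

If all of these assertions pass without any explicit alignment specifier, the declaration already has the natural alignment discussed above.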
The description of your issue suggests any of the following to me: (1) uninitialized source data, (2) data corruption (overwriting of valid data), (3) a race condition (causing stale data to be used). These error scenarios may exist in the device or host portions of the code; it is impossible to tell from the snippets shown.
A compiler issue can never be excluded. However, given the maturity of the CUDA toolchain, I consider that unlikely in the absence of evidence pointing in that direction. Does the behavior change if you reduce compiler backend optimizations (e.g. -Xptxas -O2, -Xptxas -O1)?
I would suggest applying standard debugging techniques, including fixing all issues reported by cuda-memcheck on the device and valgrind on the host. Code instrumentation plus logging is something I personally use extensively for debugging purposes, with good effect (meaning I have been able to track down root causes of bugs where others failed to do so). You may want to try reducing the code to find the smallest program that reproduces the issue.
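As one possible form of such instrumentation, the host code can log the fields of the kernel argument immediately before the launch. The sketch below is an assumption about how this could look, not code from the application in question: describe is a hypothetical helper name, and the struct simply repeats the members shown in the question.

```cpp
#include <cstdint>  // std::uint8_t, std::uintptr_t
#include <cstdio>   // std::snprintf
#include <string>

// Same members as the ImageBuffer in the question.
template <class T>
struct ImageBuffer
{
    T*  data;        // pointer to image data buffer, NULL if no image was set
    int pitchElems;
};

// Hypothetical logging helper: formats the fields of a kernel argument as
// text, so that the value passed on the host can be written to a log.
template <class T>
std::string describe(const ImageBuffer<T>& img)
{
    // Print the pointer as a hexadecimal integer so that the text looks
    // the same with every compiler (the output of %p varies).
    const unsigned long long address =
        static_cast<unsigned long long>(reinterpret_cast<std::uintptr_t>(img.data));

    char text[96];
    std::snprintf(text, sizeof(text), "data=0x%llx pitchElems=%d%s",
                  address, img.pitchElems,
                  img.data ? "" : " (null buffer)");
    return std::string(text);
}
```

Logging describe(image) on the host right before the launch, and printing the same two fields with a device-side printf at the start of the kernel, would show whether the argument arrives in the kernel with the values the host passed.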
Then I updated the driver from 350.12 (for GTX 970, Windows 7 x64) to the newest (373.06) and … everything worked. NSIGHT with mem-checker enabled did NOT complain, and also the command-line ‘cuda-memcheck’ tool did not complain.
Good to read that a driver update fixed the issue. I am wondering, though, whether this is due to a fix in a CUDA API, or in the JIT compiler. I would generally recommend avoiding JIT compilation by building a binary that includes SASS (GPU machine code) for all intended target architectures, i.e. a fat binary.
Yep, it’s good that the update fixed the issue. Still, it took me about 1.5 work days to figure out, because I always assume first that I did something wrong, and not the runtime or driver.
For quite some time, we have compiled our libraries as ‘fat binaries’ (including, of course, CC 5.2 for the GTX 970), so the fix must have been in the CUDA API (most likely in the CUDA driver API functions that push the kernel launch arguments into constant memory).
My experience is that bugs in CUDA driver and tool chain are quite rare these days, so that an initial assumption of a bug in user code is usually the correct way to start the debugging process. I do routinely update my drivers to the latest WHQL package, but realize that this may not always be feasible in a production environment.