Passing a struct to CUDA kernel as parameter - 'align' specifier needed ?

HannesF99 · October 10, 2016, 6:49pm

When I pass a struct (consisting only of basic C++ datatypes) to a CUDA kernel, do I have to add an alignment specifier to the declaration of the struct in the header file? If so, what alignment is necessary: 8 bytes or 16 bytes or …?

According to some internet resources, it seems to be necessary.

The reason I am asking is because I have a strange problem. I have a struct

template <class T>
class ImageBuffer
{
public:
  // Pointer to the image data buffer. Is NULL if no image was set.
  T* data;
  int pitchElems;
};

I am setting the (64-bit pointer) ‘data’ and ‘pitchElems’ values on the host, and then passing an ‘ImageBuffer<uint8_t>’ object as parameter to the CUDA kernel. In the kernel, I have a statement like

if (image.data) { 
    ... write some elements to array starting at address 'image.data' ... 
}

which writes to the buffer only if it is not a null pointer.

This works fine on my PC (Windows 10 64-bit, VS 2013, CUDA Toolkit 7.0, GTX 960 / GTX 770). On a different PC with a GTX 970, however, it does not work and leads to a crash. The crash occurs only in ‘Release’ mode. After some debugging, it looks like the condition ‘if (image.data)’ evaluates to true inside the kernel even when I set the ‘image.data’ pointer to NULL on the host. Values are then written through the NULL buffer pointer, which gives me an ‘invalid store’ error when I enable the CUDA memory checker in Nsight.

Tomorrow I will try adding such an ‘ALIGN’ macro (or the C++11 ‘alignas’ keyword, which I suppose also works with NVCC in recent toolkits) to the structure declaration in the header file, but it would be nice to hear upfront whether that really could be the problem. It’s strange because it appears only in the Release build and only on certain GPUs, so I am not 100% sure that it is really due to the missing alignment specifier.

njuffa · October 10, 2016, 7:26pm

The CUDA compiler makes sure that each constituent part of a struct is naturally aligned, since that is a basic requirement of the hardware architecture. This may involve padding; however, no padding between members is required when the constituent elements of a struct are ordered by decreasing size, as is the case here. Alignment in excess of the alignment required for the largest struct member (here: 8 bytes, a 64-bit pointer) may sometimes help with performance; however, I do not see how that applies here, since the issue is of a functional nature.
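The layout described above can be checked at compile time. The following is only a minimal host-side sketch, assuming a 64-bit build (8-byte pointers, 4-byte int); the alias `Buf` is a name introduced here for illustration, and the class from the question is written as a struct so that its members are public. The pointer sits at offset 0, the int follows directly at offset 8 with no gap between the members, and the only padding is 4 bytes at the tail.

```cpp
#include <cstddef>   // offsetof
#include <cstdint>   // std::uint8_t

// Same members as the ImageBuffer in the question, declared as a struct so
// that they are public and the type is standard-layout.
template <class T>
struct ImageBuffer
{
    T*  data;        // 8 bytes on a 64-bit platform
    int pitchElems;  // 4 bytes, followed by 4 bytes of tail padding
};

// Hypothetical alias, used only to keep the checks below readable.
using Buf = ImageBuffer<std::uint8_t>;

// Assumption: 64-bit build. The checks below do not hold for 32-bit pointers.
static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit build");
static_assert(alignof(Buf) == 8, "struct takes the alignment of its pointer member");
static_assert(offsetof(Buf, data) == 0, "pointer sits at offset 0");
static_assert(offsetof(Buf, pitchElems) == 8, "int follows the pointer with no gap");
static_assert(sizeof(Buf) == 16, "size is rounded up to a multiple of the alignment");
```

If these assertions compile in both the host translation units and the .cu files that include the header, the two sides agree on the layout without any explicit alignment specifier.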

The description of your issue suggests any of the following to me: (1) uninitialized source data, (2) data corruption (overwriting of valid data), (3) a race condition (causing stale data to be used). These error scenarios may exist in the device or host portions of the code; it is impossible to tell from the snippets shown.

A compiler issue can never be excluded; however, given the maturity of the CUDA toolchain, I consider that unlikely in the absence of evidence pointing in that direction. Does the behavior change if you reduce compiler backend optimizations (e.g. -Xptxas -O2, -Xptxas -O1)?

I would suggest applying standard debugging techniques, including fixing all issues reported by cuda-memcheck on the device and valgrind on the host. Code instrumentation plus logging is something I personally use extensively for debugging purposes, with good effect (meaning I have been able to track down root causes of bugs where others failed to do so). You may want to try reducing the code to find the smallest program that reproduces the issue.
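As one possible form of such instrumentation, the host can log the kernel argument exactly as it exists in memory right before the launch. This is only a sketch: `hexDump` and `logKernelArg` are hypothetical helpers, not part of any CUDA API, and the struct repeats the members from the question. The same fields could then be printed inside the kernel with device-side `printf` and compared against the host log.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the raw bytes of an object as
// space-separated two-digit hex values, in memory order.
std::string hexDump(const void* obj, std::size_t size)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(obj);
    std::string out;
    char buf[4];  // two hex digits, one space, terminating null
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(buf, sizeof(buf), "%02x ", static_cast<unsigned>(bytes[i]));
        out += buf;
    }
    if (!out.empty()) {
        out.pop_back();  // drop the trailing space
    }
    return out;
}

// Same members as the ImageBuffer in the question.
template <class T>
struct ImageBuffer
{
    T*  data;
    int pitchElems;
};

// Hypothetical logging hook, meant to be called right before the kernel launch.
// Note that the tail padding bytes in the raw dump are not initialized by
// assigning the members, so they may differ between runs.
void logKernelArg(const ImageBuffer<std::uint8_t>& image)
{
    std::fprintf(stderr, "host: data=%p pitchElems=%d raw=[%s]\n",
                 static_cast<const void*>(image.data), image.pitchElems,
                 hexDump(&image, sizeof(image)).c_str());
}
```

If the host log shows a null `data` pointer while the kernel sees a non-null one, the problem lies in how the argument reaches the device rather than in the code that fills the struct.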

HannesF99 · October 11, 2016, 5:09pm

Update: I tried aligning the template class to 8 bytes (as a workaround, by deriving from a properly aligned class; see the Stack Overflow question “Aligning Member Variables By Template Type”). That did NOT help.
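A possible explanation for why the 8-byte alignment made no difference is that the struct is already 8-byte aligned on a 64-bit platform, so the request changes nothing. The sketch below checks this on the host using the C++11 `alignas` keyword directly rather than the base-class workaround. It assumes a 64-bit build, and the type names (`PlainBuffer`, `Aligned8Buffer`, `Aligned16Buffer`) are introduced here purely for illustration.

```cpp
#include <cstddef>   // offsetof
#include <cstdint>   // std::uint8_t

// Layout from the question, without any alignment specifier.
template <class T>
struct PlainBuffer
{
    T*  data;
    int pitchElems;
};

// Same members with an explicit 8-byte alignment request.
template <class T>
struct alignas(8) Aligned8Buffer
{
    T*  data;
    int pitchElems;
};

// Same members with a 16-byte alignment request.
template <class T>
struct alignas(16) Aligned16Buffer
{
    T*  data;
    int pitchElems;
};

// Hypothetical aliases to keep the checks readable.
using Plain     = PlainBuffer<std::uint8_t>;
using Aligned8  = Aligned8Buffer<std::uint8_t>;
using Aligned16 = Aligned16Buffer<std::uint8_t>;

// Assumption: 64-bit build (8-byte pointers).
static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit build");

// alignas(8) is a no-op here: alignment, size, and member offsets are unchanged.
static_assert(alignof(Aligned8) == alignof(Plain), "same alignment");
static_assert(sizeof(Aligned8) == sizeof(Plain), "same size");
static_assert(offsetof(Aligned8, pitchElems) == offsetof(Plain, pitchElems), "same offsets");

// alignas(16) raises only the alignment of the object as a whole; the members
// stay where they were and the size is still 16 bytes.
static_assert(alignof(Aligned16) == 16, "stricter alignment of the whole object");
static_assert(sizeof(Aligned16) == sizeof(Plain), "same size");
static_assert(offsetof(Aligned16, pitchElems) == offsetof(Plain, pitchElems), "same offsets");
```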

Then I updated the driver from 350.12 (for GTX 970, Windows 7 x64) to the newest (373.06) and … everything worked. NSIGHT with mem-checker enabled did NOT complain, and also the command-line ‘cuda-memcheck’ tool did not complain.

njuffa · October 11, 2016, 6:52pm

Good to read that a driver update fixed the issue. I am wondering, though, whether this is due to a fix in a CUDA API, or in the JIT compiler. I would generally recommend avoiding JIT compilation by building a binary that includes SASS (GPU machine code) for all intended target architectures, i.e. a fat binary.

HannesF99 · October 12, 2016, 6:49am

Yep, it’s good that the update fixed the issue. Still, it took me about 1.5 work days to figure out, because I always assume first that I did something wrong, not the runtime or driver.
For quite some time now, we have always compiled our libraries as a ‘fat binary’ (including, of course, CC 5.2 for the GTX 970), so the fix must have been in the CUDA API (most likely in the CUDA driver API functions that push the kernel launch arguments into constant memory).

njuffa · October 12, 2016, 7:05am

My experience is that bugs in CUDA driver and tool chain are quite rare these days, so that an initial assumption of a bug in user code is usually the correct way to start the debugging process. I do routinely update my drivers to the latest WHQL package, but realize that this may not always be feasible in a production environment.
