Different struct size between g++ and nvcc

Hello everyone,

I have a problem and hopefully someone can point me in the right direction.
I am using Armadillo (a C++ linear algebra library) and want to accelerate some parts of my code using CUDA. I saw some very strange behaviour and finally figured out that the cx_mat class (and probably others as well) has a different size depending on whether I compile it with nvcc or g++.
I came across this:
The answer there is to append -malign-double to the g++ command line, as nvcc is supposed to do the same. I am aware that compilers are allowed to pack structs/classes differently.
However, in my case this did not help. Interestingly, the size of cx_mat is smaller with nvcc, which should not be the case if nvcc packs it in 8-byte blocks and g++ doesn't.

As far as I understand, the host code is compiled by the host compiler, which should be the same g++ in my case. How can I figure out which arguments nvcc passes down to that host compiler, so that I can set the same ones for my g++ part? Then everything should be fine.

Edit: Here is a small example program that shows the problem:
// Compile with g++:  g++ cudaSizeProblem.cpp -o cudaSizeProblemG++
// Compile with nvcc: nvcc cudaSizeProblem.cpp -o cudaSizeProblemNvcc

#include <iostream>
#include <armadillo>

using namespace std;
using namespace arma;

int main(int argc, char** argv)
{
    // Print the size of Armadillo's complex matrix class as seen by this compiler
    cout << "sizeof(cx_mat): " << sizeof(cx_mat) << endl;
    return 0;
}

Edit2: I was able to intercept the call(s) to g++ by manually specifying a host compiler using -ccbin and pointing it at a small bash script that writes its parameters to a text file. Here are the three calls:
-c -x c++ -D__NVCC__ -I/usr/local/cuda-10.0/bin/…/targets/x86_64-linux/include -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=130 -m64 -o /tmp/tmpxft_0000349b_00000000-4_cudaSizeProblem.o cudaSizeProblem.cpp

-c -x c++ -DFATBINFILE="/tmp/tmpxft_0000349b_00000000-3_cudaSizeProblemNvcc_dlink.fatbin.c" -DREGISTERLINKBINARYFILE="/tmp/tmpxft_0000349b_00000000-2_cudaSizeProblemNvcc_dlink.reg.c" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ -I/usr/local/cuda-10.0/bin/…/targets/x86_64-linux/include -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=130 -m64 -o /tmp/tmpxft_0000349b_00000000-6_cudaSizeProblemNvcc_dlink.o /usr/local/cuda-10.0/bin/crt/link.stub

-m64 -o cudaSizeProblemNvcc -Wl,--start-group /tmp/tmpxft_0000349b_00000000-6_cudaSizeProblemNvcc_dlink.o /tmp/tmpxft_0000349b_00000000-4_cudaSizeProblem.o -L/usr/local/cuda-10.0/bin/…/targets/x86_64-linux/lib/stubs -L/usr/local/cuda-10.0/bin/…/targets/x86_64-linux/lib -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group

I cannot see anything suspicious.

Maybe I overlooked it, but what is the output of your little test app when compiled with g++ vs nvcc?
Have you tried printing the offset of each class member to determine where the discrepancy occurs?

cudaSizeProblemG++ prints 288 and cudaSizeProblemNvcc prints 280. Interestingly, when I do the same in the large application I am working on, I get yet another value from g++, although I cannot figure out why.
I have not tried to go through all the members of this class; how would that help? I guess most of them are private and I would have to muck around in a global include file.

It would allow you to zero in on the particular class member(s) that trigger(s) the size difference. I would expect this to provide a clue as to why there is a size difference for the class overall.

Your initial hypothesis seems to have been that the underlying root cause for the size difference is ‘double’ alignment of class members, but this may not be so. I do not have a viable alternative hypothesis at the moment, but looking at the offset information could lead to one. [Later:] It might be an #ifdef’ed class element where the #ifdef condition makes use of compiler-specific or configuration-specific data?

Without having an idea about the root cause, I don’t know how one would try to devise a fix or workaround.

Do at least all offsets for the public class members match? You might be able to examine the offsets of private class members when running in a debugger.

I know for a fact that at least some of the public class members do not match. That is what I was observing in the debugger, and it drove me crazy. I was passing a pointer to a cx_mat to a function in the CUDA file, and out of the blue, during the call, the debugger thought that the values of the instance had changed. Some were the same, some seemed to be in the wrong variable.
I can have a further look into this tomorrow.

The version of armadillo that I downloaded (9.300.2) appears to be aware of the CUDA compiler and it’s not clear to me they want to support or allow use of it, at least not in the fashion you may be thinking about.

Including armadillo also includes compiler_setup.hpp

In compiler_setup.hpp, they are specifically checking for __CUDACC__ and __NVCC__, and if they see either, they define ARMA_FAKE_GCC

When ARMA_FAKE_GCC is defined, a variety of “standard” alignment directives are not used. Given this, it's really no surprise that the structures don't match. (Also, given that compiling under CUDA removes alignment directives that would otherwise be in place, it's not really surprising that the observed CUDA structure size is smaller than the observed gcc structure size.)

Anyway, if a 3rd party library has specifically chosen to recognize the CUDA compiler and do something special as a result, and that results in broken behavior, I think they are at least partly to blame.

Anyway, fixing this would require a change to armadillo. I would take it up with them.

If you want to file a bug against CUDA, you can do so at developer.nvidia.com. The instructions are linked at the top of this forum in a sticky post. I doubt that there is anything our developers could or would do on their own to fix this.

For workarounds, unless you want to modify armadillo, you should restrict your usage of it either to .cpp files, or .cu files, but not both. Use wrapper functions to tie together what is needed, and don’t pass any armadillo constructs in the wrapper functions.

I make no claim that that is a complete guarantee of correctness; it's merely what I would do if I wanted to push forward in this scenario. Your mileage may vary.

Thanks for the very detailed answer.
I think you are right that this is a problem with Armadillo and not with CUDA. I will work around it by not passing the cx_mat itself, but instead the memptr() and the size of the data. As I do not need to modify the size or any other attributes of the matrix and just read/write the data, that should work.