However, when profiling, I still see 32-bit writes instead of 64-bit writes for float2.
The array contains structures of 8 floats, and idx points to the beginning of a structure.
The following statement is executed as two 32-bit writes:
float x;
int y;
float2 data = make_float2(x, __int_as_float(y));
((float2*)array)[idx * 4 + 3] = data;  // expected: a single 64-bit store
I cannot reproduce this with the CUDA 6.5 tool chain using the sample kernel below. I see two 32-bit loads and one 64-bit store in the SASS. Are you looking at the disassembled binary code from cuobjdump --dump-sass? Don’t look at the PTX.
Are you doing a release build with full optimization? What tool chain do you use and what is the nvcc command line? It would be helpful if you would post the smallest complete, buildable example that reproduces the issue.
Note: Converting a pointer of one data type to a pointer of a data type with tighter alignment requirements, e.g. casting a float* to a float2*, may lead to silent program failure unless you can guarantee that the converted pointer is naturally aligned to the width of the wider type (so 8-byte alignment in the case of float2*).
I’m using CUDA 7.0 for Visual Studio 2013. The project is compiled in 64-bit release mode (the test program below, compiled as 32-bit, still shows two separate 32-bit memory writes).
After fixing the missing header file includes, I compiled your code for various architectures using CUDA 6.5:
nvcc -arch={sm_20 | sm_30 | sm_50} -o ray --machine=32 ray.cu
In all cases the generated SASS code contains a 64-bit store (see below). The issue may affect CUDA 7.0 in particular; maybe someone else can try it with that tool chain.
Seems like a regression in the compiler. Consider filing a bug with NVIDIA. The bug reporting form is linked from the CUDA registered developer webpage.
For completeness, I took the liberty of also testing this in CUDA 6.5 on my desktop. The results show that the program works correctly with a single 64-bit write: