Uncoalesced float2 write when using __int_as_float

Hello,

I’m trying to store vectorized data as shown here: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

However, when profiling, I still see writes using size 32 instead of 64 for float2.
The array contains structures of 8 floats, and idx points to the begin of a structure.

The following statement writes as two size32:

float x;
int y;
float2 data = make_float2(x, __int_as_float(y));
((float2*)array)[idx * 4 + 3] = data;

If I change it to the following:

float2 data = make_float2(x, 0.0f);
((float2*)array)[idx * 4 + 3] = data;

The write will correctly be of size 64.

Can someone shed some light on this?

I cannot reproduce this with the CUDA 6.5 tool chain using the sample kernel below. I see two 32-bit loads and one 64-bit store in the SASS. Are you looking at the disassembled binary code from cuobjdump --dump-sass? Don’t look at the PTX.

Are you doing a release build with full optimization? What tool chain do you use and what is the nvcc command line? It would be helpful if you would post the smallest complete, buildable, example that reproduces the issue.

Note: Converting a pointer of one data type to a pointer of a data type with tighter alignment requirements, e.g. casting a float* to a float2*, may lead to silent program failure unless you can guarantee that the converted pointer is naturally aligned to the width of the wider type (so 8-byte alignment in the case of float2*).

__global__ void kernel (float2 *out, const int *in1, const float *in2)
{
    *out = make_float2 (*in2, __int_as_float (*in1));
}

I’m using CUDA 7.0 for Visual Studio 2013. The project is compiled in 64 bit release mode (the test program below is compiled with 32 bits, still has two separate 32b memory writes).

Screenshot of memory access: http://i.imgur.com/1KBcHx5.png
Screenshot of command line from CUDA: http://i.imgur.com/OZDeKnm.png (standard settings)

So here is a small program that shows this behavior:

#include "cuda_runtime.h"
#include <stdio.h>
#include <tchar.h>
#include <vector>

struct Ray
{
	float3 Position;
	float3 Direction;
	float2 Data;
};

__global__ void Test(Ray *rays)
{
	int idx = blockIdx.x;
	float test2 = 2.0f;
	float2 data = make_float2(test2, __int_as_float(idx));
	((float2*)rays)[idx * 4 + 3] = data;
}

int _tmain(int argc, _TCHAR* argv[])
{
	std::vector<Ray> h_rays (100);
	for (int i = 0; i < 100; i++)
	{
		h_rays[i].Position = make_float3(1, 1, 1);
		h_rays[i].Direction = make_float3(0, 1, 0);
		h_rays[i].Data = make_float2(1, 0);
	}

	Ray *rays;
	cudaMalloc(&rays, sizeof(Ray) * 100);
	cudaMemcpy(rays, &h_rays[0], sizeof(Ray) * 100, cudaMemcpyHostToDevice);

	dim3 block(100);
	dim3 grid(1);
	Test<<<grid, block>>>(rays);
}

Edit: added headers.
Forgot to mention I use sm_30 and Nsight to profile memory access

SASS:

MOV R1, c[0x0][0x44];								
S2R R0, SR_CTAID.X;								
MOV32I R3, 0x40000000;								
ISCADD R2, R0, c[0x0][0x140], 0x5;								
ST [R2+0x18], R3;	
ST [R2+0x1c], R0;
EXIT;					
BRA 0x40; # Target=0x0005b700

After fixing the missing header file includes, I compiled your code for various architectures using CUDA 6.5:

nvcc -arch={sm_20 | sm_30 | sm_50} -o ray --machine=32 ray.cu

In all cases the generated SASS code contains a 64-bit store (see below). The issue may affect CUDA 7.0 in particular, maybe someone else can try it with that tool chain.

code for sm_20
                Function : _Z4TestP3Ray
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];             /* 0x2800440400005de4 */
        /*0008*/         S2R R3, SR_CTAID.X;                /* 0x2c0000009400dc04 */
        /*0010*/         MOV32I R2, 0x40000000;             /* 0x1900000000009de2 */
        /*0018*/         ISCADD R0, R3, c[0x0][0x20], 0x5;  /* 0x4000400080301ca3 */
        /*0020*/         ST.64 [R0+0x18], R2;               /* 0x9000000060009ca5 */
        /*0028*/         EXIT;                              /* 0x8000000000001de7 */
                .............................

        code for sm_30
                Function : _Z4TestP3Ray
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                       /* 0x2006e04282804007 */
        /*0008*/                   MOV R1, c[0x0][0x44];               /* 0x2800400110005de4 */
        /*0010*/                   S2R R3, SR_CTAID.X;                 /* 0x2c0000009400dc04 */
        /*0018*/                   MOV32I R2, 0x40000000;              /* 0x1900000000009de2 */
        /*0020*/                   ISCADD R0, R3, c[0x0][0x140], 0x5;  /* 0x4000400500301ca3 */
        /*0028*/                   ST.64 [R0+0x18], R2;                /* 0x9000000060009ca5 */
        /*0030*/                   EXIT;                               /* 0x8000000000001de7 */
        /*0038*/                   BRA 0x38;                           /* 0x4003ffffe0001de7 */
                .............................

        code for sm_50
                Function : _Z4TestP3Ray
        .headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                       /* 0x001ff400e22007e6 */
        /*0008*/                   MOV R1, c[0x0][0x20];               /* 0x4c98078000870001 */
        /*0010*/                   S2R R3, SR_CTAID.X;                 /* 0xf0c8000002570003 */
        /*0018*/                   MOV32I R2, 0x40000000;              /* 0x010400000007f002 */
                                                                       /* 0x001ffc00fe200fe2 */
        /*0028*/                   ISCADD R0, R3, c[0x0][0x140], 0x5;  /* 0x4c18028005070300 */
        /*0030*/                   STG.64 [R0+0x18], R2;               /* 0xeedd000001870002 */
        /*0038*/                   EXIT;                               /* 0xe30000000007000f */
                                                                       /* 0x001f8000fc0007ff */
        /*0048*/                   BRA 0x48;                           /* 0xe2400fffff87000f */
        /*0050*/                   NOP;                                /* 0x50b0000000070f00 */
        /*0058*/                   NOP;                                /* 0x50b0000000070f00 */
                                                                       /* 0x001f8000fc0007e0 */
        /*0068*/                   NOP;                                /* 0x50b0000000070f00 */
        /*0070*/                   NOP;                                /* 0x50b0000000070f00 */
        /*0078*/                   NOP;                                /* 0x50b0000000070f00 */
                .............................

I’ve tried the same program on CUDA 7.5 RC.

nvcc -arch=sm_30 -o ray --machine=32 ray.cu

The output is the same: 2 32bit writes:

code for sm_30
                Function : _Z4TestP3Ray
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
/* 0x22e042e282804007 */        
/*0008*/                   MOV R1, c[0x0][0x44];               /* 0x2800400110005de4 */        
/*0010*/                   S2R R0, SR_CTAID.X;                 /* 0x2c00000094001c04 */        
/*0018*/                   MOV32I R3, 0x40000000;              /* 0x190000000000dde2 */        
/*0020*/                   ISCADD R2, R0, c[0x0][0x140], 0x5;  /* 0x4000400500009ca3 */        
/*0028*/                   ST [R2+0x18], R3;                   /* 0x900000006020dc85 */        
/*0030*/                   ST [R2+0x1c], R0;                   /* 0x9000000070201c85 */        
/*0038*/                   EXIT;                               /* 0x8000000000001de7 */        
/*0040*/                   BRA 0x40;                           /* 0x4003ffffe0001de7 */        
/*0048*/                   NOP;                                /* 0x4000000000001de4 */        
/*0050*/                   NOP;                                /* 0x4000000000001de4 */        
/*0058*/                   NOP;                                /* 0x4000000000001de4 */        
/*0060*/                   NOP;                                /* 0x4000000000001de4 */        
/*0068*/                   NOP;                                /* 0x4000000000001de4 */        
/*0070*/                   NOP;                                /* 0x4000000000001de4 */        
/*0078*/                   NOP;                                /* 0x4000000000001de4 */

Seems like a regression in the compiler. Consider filing a bug with NVIDIA. The bug reporting form is linked from the CUDA registered developer webpage.

For completeness, I took the liberty to also test this in CUDA 6.5 on my desktop. The results show that the program works correctly with one 64bit write:

MOV R1, c[0x0][0x44];								
S2R R3, SR_CTAID.X;								
MOV32I R2, 0x40000000;							
ISCADD R0, R3, c[0x0][0x140], 0x5;							
ST.64 [R0+0x18], R2;	
EXIT;					
BRA 0x38; # Target=0x0005d7f8

I will file the bug report.