Float2 Coalescing

Hi everyone,

I’m having trouble getting float2 reads to coalesce properly. I’ve put together a simple example to run in the Visual Profiler, but it always reports non-coalesced reads. If anyone could shed some light on this I would be really grateful, thanks.

The code is:

#include <stdio.h>
#include <cuda_runtime_api.h>

__global__ void kernel(float2 *in, float2 *out) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float2 d = in[idx];
    d.x = 100.f;
    out[idx] = d;
}

int main() {
    const int dataSize = 32;

    float2 *in;
    cudaMalloc((void**)&in, dataSize*sizeof(float2));

    float2 *out;
    cudaMalloc((void**)&out, dataSize*sizeof(float2));

    kernel<<<1,32>>>(in, out);

    return 0;
}

EDIT: I’m running CUDA 3.1 on an 8800GTX

My suggestion would be to leave out the “volatile” qualifier and to do something with the data you read, e.g. write it to another buffer.

I noticed in some of my tests that using volatile with vector types would in some situations break the coalescing, i.e. the x and y members were loaded with separate load instructions - and strided access doesn’t coalesce.

The other part is that if you don’t use the variable, then the compiler will optimize it out (unless you compile with -O0), so store the value somewhere.
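Something along these lines is what I mean (an untested sketch, just to illustrate; the kernel names are made up):

__global__ void copy_volatile(volatile float2 *in, float2 *out) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float2 d;
    d.x = in[idx].x;   // with volatile the compiler may emit a separate 4-byte load here
    d.y = in[idx].y;   // and another separate 4-byte load here
    out[idx] = d;
}

__global__ void copy_plain(const float2 *in, float2 *out) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float2 d = in[idx];   // without volatile, and with both halves used, this can stay one 8-byte load
    out[idx] = d;
}

In the first version the x and y members can end up in separate ld.global.f32 instructions; in the second the read can remain a single vectorized load.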

Christian

Thanks for the reply, Christian. I only added the volatile keyword after reading another thread, but removing it still results in non-coalesced reads.

I’ve modified the example to reflect your changes but I’m still struggling to coalesce the reads.

Laurence

It looks like a compiler optimization, believe it or not. Your first code compiles to this:

$LDWbegin__Z6kernelP6float2S0_:
        .loc    28      5       0
        cvt.u32.u16     %r1, %tid.x;
        mov.u16         %rh1, %ctaid.x;
        mov.u16         %rh2, %ntid.x;
        mul.wide.u16    %r2, %rh1, %rh2;
        add.u32         %r3, %r1, %r2;
        cvt.s64.s32     %rd1, %r3;
        mul.wide.s32    %rd2, %r3, 8;
        ld.param.u64    %rd3, [__cudaparm__Z6kernelP6float2S0__in];
        add.u64         %rd4, %rd3, %rd2;
        ld.global.f32   %f1, [%rd4+4];
        .loc    28      8       0
        ld.param.u64    %rd5, [__cudaparm__Z6kernelP6float2S0__out];
        add.u64         %rd6, %rd5, %rd2;
        mov.f32         %f2, 0f42c80000;        // 100
        st.global.v2.f32        [%rd6+0], {%f2,%f1};
        .loc    28      9       0
        exit;

You can see that the compiler has elected not to load both halves of the float2: the first word is overwritten before its loaded value would ever be used, so only the second word is actually loaded. The store is coalesced.

Changing the kernel to this:

__global__ void kernel(float2 *in, float2 *out) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float2 d = in[idx];
    d.x += 100.f;
    out[idx] = d;
}

produces this:

$LDWbegin__Z6kernelP6float2S0_:
        .loc    28      5       0
        cvt.u32.u16     %r1, %tid.x;
        mov.u16         %rh1, %ctaid.x;
        mov.u16         %rh2, %ntid.x;
        mul.wide.u16    %r2, %rh1, %rh2;
        add.u32         %r3, %r1, %r2;
        cvt.s64.s32     %rd1, %r3;
        mul.wide.s32    %rd2, %r3, 8;
        ld.param.u64    %rd3, [__cudaparm__Z6kernelP6float2S0__in];
        add.u64         %rd4, %rd3, %rd2;
        ld.global.v2.f32        {%f1,%f2}, [%rd4+0];
        .loc    28      8       0
        ld.param.u64    %rd5, [__cudaparm__Z6kernelP6float2S0__out];
        add.u64         %rd6, %rd5, %rd2;
        mov.f32         %f3, 0f42c80000;        // 100
        add.f32         %f4, %f1, %f3;
        st.global.v2.f32        [%rd6+0], {%f4,%f2};
        .loc    28      9       0
        exit;

This version loads both halves of the float2 and should be coalesced. So the trick seems to be you need to use both parts of the float2, otherwise the compiler will optimize the redundant load away and break coalescing rules.

Hi avidday, thanks for the reply.

I’ve made the correction to the kernel, recompiled, and am still getting non-coalesced reads. I’ve checked the PTX generated at my end and it’s a little different. I’m starting to wonder if there is a problem with the x64 version.

The code I showed was generated using the 3.2 toolkit on Linux x86_64. If the load compiles to a ld.global.v2.f32 instruction, it should be coalesced. If you get that instruction and the profiler indicates non-coalesced reads, I would begin to question whether the profiler is correct or not…
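For a quick check at your end, something like this should do it (file name is just an example; on Windows you can open the generated .ptx in an editor and search for ld.global):

nvcc -ptx hello.cu -o hello.ptx

A single ld.global.v2.f32 for the read means the load is vectorized; two separate ld.global.f32 instructions mean it has been split.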

EDIT: Another possibility might be to try disassembling the cubin and see what that code is doing. I suppose it isn’t beyond the realms of possibility that the PTX is getting miscompiled or optimized in some way which breaks things.

OK, so I decided to go back and try the 32-bit compiler and toolkit to see if there was a problem there. I ran the executable through the profiler and it’s STILL not coalesced…

I’ve looked at the PTX and it looks like there are still two reads from the input array:

$LDWbegin__Z6kernelP6float2S0_:
$LDWbeginblock_203_1:
	.loc	28	6	0
	cvt.u32.u16 	%r1, %tid.x;
	cvt.u32.u16 	%r2, %ctaid.x;
	cvt.u32.u16 	%r3, %ntid.x;
	mul.lo.u32 	%r4, %r2, %r3;
	add.u32 	%r5, %r1, %r4;
	mov.s32 	%r6, %r5;
	.loc	28	8	0
	ld.param.u32 	%r7, [__cudaparm__Z6kernelP6float2S0__in];
	cvt.s64.s32 	%rd1, %r6;
	mul.wide.s32 	%rd2, %r6, 8;
	cvt.s32.u64 	%r8, %rd2;
	add.u32 	%r9, %r7, %r8;
	ld.global.f32 	%f1, [%r9+0];
	st.local.f32 	[__cuda_local_var_82562_16_non_const_d_8+0], %f1;
	ld.param.u32 	%r10, [__cudaparm__Z6kernelP6float2S0__in];
	cvt.s64.s32 	%rd3, %r6;
	mul.wide.s32 	%rd4, %r6, 8;
	cvt.s32.u64 	%r11, %rd4;
	add.u32 	%r12, %r10, %r11;
	ld.global.f32 	%f2, [%r12+4];
	st.local.f32 	[__cuda_local_var_82562_16_non_const_d_8+4], %f2;
	.loc	28	9	0
	ld.local.f32 	%f3, [__cuda_local_var_82562_16_non_const_d_8+0];
	mov.f32 	%f4, 0f42c80000;     	// 100
	add.f32 	%f5, %f3, %f4;
	st.local.f32 	[__cuda_local_var_82562_16_non_const_d_8+0], %f5;
	.loc	28	11	0
	ld.local.f32 	%f6, [__cuda_local_var_82562_16_non_const_d_8+0];
	ld.param.u32 	%r13, [__cudaparm__Z6kernelP6float2S0__out];
	cvt.s64.s32 	%rd5, %r6;
	mul.wide.s32 	%rd6, %r6, 8;
	cvt.s32.u64 	%r14, %rd6;
	add.u32 	%r15, %r13, %r14;
	st.global.f32 	[%r15+0], %f6;
	ld.local.f32 	%f7, [__cuda_local_var_82562_16_non_const_d_8+4];
	ld.param.u32 	%r16, [__cudaparm__Z6kernelP6float2S0__out];
	cvt.s64.s32 	%rd7, %r6;
	mul.wide.s32 	%rd8, %r6, 8;
	cvt.s32.u64 	%r17, %rd8;
	add.u32 	%r18, %r16, %r17;
	st.global.f32 	[%r18+4], %f7;
$LDWendblock_203_1:
	.loc	28	12	0
	exit;

If I compile the kernel I posted with the 3.1 toolkit on Linux x86_64, I get PTX which is no different from that produced by CUDA 3.2. What OS are you doing this with?

Windows 7 for debugging with NVIDIA Parallel Nsight. Do you think this could be a bug?

What build flags are you using to compile it?

I’m using Visual Studio, and this is what is generated when I build:

C:\Users\Laurence Dawson\Documents\Visual Studio 2010\Projects\testing_x64\testing_x64>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2008 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\common\inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include"  -G0 --keep --keep-dir "Debug\" -maxrregcount=32  --machine 32 --compile  -D_NEXUS_DEBUG -g    -Xcompiler "/EHsc /nologo /Od /Zi  /MDd " -o "Debug\hello.obj" "C:\Users\Laurence Dawson\Documents\Visual Studio 2010\Projects\testing_x64\testing_x64\hello.cu"

This is expected behavior for a debug build. Vectorization of loads is an optimization. All optimizations are disabled when building with -g. For a release build (with full optimization) you should see vectorized loads, provided all load data is used (as avidday already pointed out).
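As a hypothetical minimal comparison on the command line, using only flags that already appear in your build log:

nvcc --keep -g -G0 hello.cu     (debug build as in your log: device optimizations disabled)
nvcc --keep hello.cu            (same source without the debug options: the float2 read can stay a single ld.global.v2.f32)

The --keep option retains the intermediate .ptx so you can compare the two. In Visual Studio, the equivalent is building the Release configuration rather than Debug.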

Ah, so that’s all it was! Thanks avidday and njuffa for your help!