Performance: local for loop vs macro

Hi,

I have a question about a performance difference that I do not understand.

I want to copy some memory (256 * uint32_t) from global to shared memory, using a block size of 32.

Here is the definition of my shared memory:

#define _SHARED_LENGTH 256

struct _BUFFER_2
{
    uint32_t ui32Buffer[_SHARED_LENGTH];
};
struct _BUFFER_3
{
    uint64_t ui64Buffer[2][32];
};
union _SHARED_BUFFER
{
    struct _BUFFER_2 Buffer2;
    struct _BUFFER_3 Buffer3;
};

static __device__ __shared__ union _SHARED_BUFFER dev_sharedBuffer[32];
uint32_t* l_ui32SharedBufferPtr = dev_sharedBuffer[threadIdx.x].Buffer2.ui32Buffer;

I have a macro to copy some memory:

#define memcpy_ui32(p_pDestination, p_pSource, p_i32Count)							\
		do {																		\
			uint32_t* l_ui32DestinationPtr = (uint32_t*)(p_pDestination);			\
			uint32_t* l_ui32SourcePtr = (uint32_t*)(p_pSource);						\
			for( int32_t l_nIndex = 0; l_nIndex < p_i32Count; l_nIndex++ )			\
			{																		\
				l_ui32DestinationPtr[l_nIndex] = l_ui32SourcePtr[l_nIndex];			\
			}																		\
		} while( 0 )

I have two versions of this copy: one with a for loop, the other using my macro. This is my for-loop version:

for( int32_t l_nLoop = 0; l_nLoop < _SHARED_LENGTH; l_nLoop++ )
{
    l_ui32SharedBufferPtr[l_nLoop] = l_ui32GlobalPtr[_GLOBAL_MEMORY_SIZE_UI32 - _SHARED_LENGTH + l_nLoop];
}

and the other one uses the copy macro memcpy_ui32:

memcpy_ui32( l_ui32SharedBufferPtr, &l_ui32GlobalPtr[_GLOBAL_MEMORY_SIZE_UI32 - _SHARED_LENGTH], _SHARED_LENGTH );

Using the for loop gives me a 10% performance gain. I was under the impression that my macro would expand to almost the same code as my for loop, but that does not seem to be the case.

Looking at the generated PTX shows that the two versions result in different code.
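(For completeness: I dump the PTX with something like the command below; the file names are just placeholders, and my real build uses more options than shown here.)

nvcc -ptx -lineinfo kernel.cu -o kernel.ptx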

Here is the resulting PTX for the slower macro version (only the copy part):

$L__BB0_7:
.loc 1 1551 1
mul.wide.s32 %rd699, %r2223, 4;
add.s64 %rd700, %rd16, %rd699;
ld.global.u32 %r1665, [%rd700+261120];
shl.b64 %rd701, %rd145, 10;
mov.u64 %rd702, dev_sharedBuffer;
add.s64 %rd703, %rd702, %rd701;
add.s64 %rd704, %rd703, %rd699;
st.shared.u32 [%rd704], %r1665;
ld.global.u32 %r1666, [%rd700+261124];
st.shared.u32 [%rd704+4], %r1666;
ld.global.u32 %r1667, [%rd700+261128];
st.shared.u32 [%rd704+8], %r1667;
ld.global.u32 %r1668, [%rd700+261132];
st.shared.u32 [%rd704+12], %r1668;
ld.global.u32 %r1669, [%rd700+261136];
st.shared.u32 [%rd704+16], %r1669;
ld.global.u32 %r1670, [%rd700+261140];
st.shared.u32 [%rd704+20], %r1670;
ld.global.u32 %r1671, [%rd700+261144];
st.shared.u32 [%rd704+24], %r1671;
ld.global.u32 %r1672, [%rd700+261148];
st.shared.u32 [%rd704+28], %r1672;
ld.global.u32 %r1673, [%rd700+261152];
st.shared.u32 [%rd704+32], %r1673;
ld.global.u32 %r1674, [%rd700+261156];
st.shared.u32 [%rd704+36], %r1674;
ld.global.u32 %r1675, [%rd700+261160];
st.shared.u32 [%rd704+40], %r1675;
ld.global.u32 %r1676, [%rd700+261164];
st.shared.u32 [%rd704+44], %r1676;
ld.global.u32 %r1677, [%rd700+261168];
st.shared.u32 [%rd704+48], %r1677;
ld.global.u32 %r1678, [%rd700+261172];
st.shared.u32 [%rd704+52], %r1678;
ld.global.u32 %r1679, [%rd700+261176];
st.shared.u32 [%rd704+56], %r1679;
ld.global.u32 %r1680, [%rd700+261180];
st.shared.u32 [%rd704+60], %r1680;
ld.global.u32 %r1681, [%rd700+261184];
st.shared.u32 [%rd704+64], %r1681;
ld.global.u32 %r1682, [%rd700+261188];
st.shared.u32 [%rd704+68], %r1682;
ld.global.u32 %r1683, [%rd700+261192];
st.shared.u32 [%rd704+72], %r1683;
ld.global.u32 %r1684, [%rd700+261196];
st.shared.u32 [%rd704+76], %r1684;
ld.global.u32 %r1685, [%rd700+261200];
st.shared.u32 [%rd704+80], %r1685;
ld.global.u32 %r1686, [%rd700+261204];
st.shared.u32 [%rd704+84], %r1686;
ld.global.u32 %r1687, [%rd700+261208];
st.shared.u32 [%rd704+88], %r1687;
ld.global.u32 %r1688, [%rd700+261212];
st.shared.u32 [%rd704+92], %r1688;
ld.global.u32 %r1689, [%rd700+261216];
st.shared.u32 [%rd704+96], %r1689;
ld.global.u32 %r1690, [%rd700+261220];
st.shared.u32 [%rd704+100], %r1690;
ld.global.u32 %r1691, [%rd700+261224];
st.shared.u32 [%rd704+104], %r1691;
ld.global.u32 %r1692, [%rd700+261228];
st.shared.u32 [%rd704+108], %r1692;
ld.global.u32 %r1693, [%rd700+261232];
st.shared.u32 [%rd704+112], %r1693;
ld.global.u32 %r1694, [%rd700+261236];
st.shared.u32 [%rd704+116], %r1694;
ld.global.u32 %r1695, [%rd700+261240];
st.shared.u32 [%rd704+120], %r1695;
ld.global.u32 %r1696, [%rd700+261244];
st.shared.u32 [%rd704+124], %r1696;
add.s32 %r2223, %r2223, 32;
setp.ne.s32 %p4, %r2223, 256;
@%p4 bra $L__BB0_7;

And this is the PTX of the faster for loop version:

$L__BB0_7:
.loc 1 1556 1
mul.wide.s32 %rd698, %r2191, 4;
add.s64 %rd699, %rd16, %rd698;
ld.global.u32 %r1665, [%rd699+261120];
shl.b64 %rd700, %rd145, 10;
mov.u64 %rd701, dev_sharedBuffer;
add.s64 %rd702, %rd701, %rd700;
add.s64 %rd703, %rd702, %rd698;
st.shared.u32 [%rd703], %r1665;
ld.global.u32 %r1666, [%rd699+261124];
st.shared.u32 [%rd703+4], %r1666;
ld.global.u32 %r1667, [%rd699+261128];
st.shared.u32 [%rd703+8], %r1667;
ld.global.u32 %r1668, [%rd699+261132];
st.shared.u32 [%rd703+12], %r1668;
ld.global.u32 %r1669, [%rd699+261136];
st.shared.u32 [%rd703+16], %r1669;
ld.global.u32 %r1670, [%rd699+261140];
st.shared.u32 [%rd703+20], %r1670;
ld.global.u32 %r1671, [%rd699+261144];
st.shared.u32 [%rd703+24], %r1671;
ld.global.u32 %r1672, [%rd699+261148];
st.shared.u32 [%rd703+28], %r1672;
ld.global.u32 %r1673, [%rd699+261152];
st.shared.u32 [%rd703+32], %r1673;
ld.global.u32 %r1674, [%rd699+261156];
st.shared.u32 [%rd703+36], %r1674;
ld.global.u32 %r1675, [%rd699+261160];
st.shared.u32 [%rd703+40], %r1675;
ld.global.u32 %r1676, [%rd699+261164];
st.shared.u32 [%rd703+44], %r1676;
ld.global.u32 %r1677, [%rd699+261168];
st.shared.u32 [%rd703+48], %r1677;
ld.global.u32 %r1678, [%rd699+261172];
st.shared.u32 [%rd703+52], %r1678;
ld.global.u32 %r1679, [%rd699+261176];
st.shared.u32 [%rd703+56], %r1679;
ld.global.u32 %r1680, [%rd699+261180];
st.shared.u32 [%rd703+60], %r1680;
.loc 1 1554 56
add.s32 %r2191, %r2191, 16;
.loc 1 1554 1
setp.ne.s32 %p4, %r2191, 256;
@%p4 bra $L__BB0_7;

Why does my macro compile differently from my for loop? How can I modify my macro so that it generates similar code to a direct for loop? I tried modifying my macro and also made an inline function out of it, but all without success.
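For reference, the inline-function variant I tried looked roughly like this (simplified sketch, not the exact code from my project):

static __forceinline__ __device__ void copy_ui32( uint32_t* p_pDestination, const uint32_t* p_pSource, int32_t p_i32Count )
{
    for( int32_t l_nIndex = 0; l_nIndex < p_i32Count; l_nIndex++ )
    {
        p_pDestination[l_nIndex] = p_pSource[l_nIndex];
    }
}

// called like this:
copy_ui32( l_ui32SharedBufferPtr, &l_ui32GlobalPtr[_GLOBAL_MEMORY_SIZE_UI32 - _SHARED_LENGTH], _SHARED_LENGTH );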

Thanks a lot,
Daniel

First, a general remark: it is good practice to provide a complete minimal example when asking questions about the performance of a specific piece of code (so others can play around with it, etc.).

The loop with 256 iterations in the macro version is unrolled 32 times; without the macro, the unroll factor is 16 instead. I don’t know why the compiler chooses different factors. But the two source codes are not identical anyway, are they? The macro also wraps the loop in a do-while, which is missing in the alternate version.
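If you want to take that decision away from the compiler, you could try forcing a specific unroll factor yourself. An untested sketch based on your code (inside the macro you would need the standard _Pragma operator, since a #pragma directive cannot appear inside a #define):

// plain for-loop version with an explicit unroll hint
#pragma unroll 16
for( int32_t l_nLoop = 0; l_nLoop < _SHARED_LENGTH; l_nLoop++ )
{
    l_ui32SharedBufferPtr[l_nLoop] = l_ui32GlobalPtr[_GLOBAL_MEMORY_SIZE_UI32 - _SHARED_LENGTH + l_nLoop];
}

// macro version with the same hint via _Pragma
#define memcpy_ui32(p_pDestination, p_pSource, p_i32Count)                 \
        do {                                                               \
            uint32_t* l_ui32DestinationPtr = (uint32_t*)(p_pDestination);  \
            uint32_t* l_ui32SourcePtr = (uint32_t*)(p_pSource);            \
            _Pragma("unroll 16")                                           \
            for( int32_t l_nIndex = 0; l_nIndex < p_i32Count; l_nIndex++ ) \
            {                                                              \
                l_ui32DestinationPtr[l_nIndex] = l_ui32SourcePtr[l_nIndex];\
            }                                                              \
        } while( 0 )

With an explicit factor of 16, both variants should at least end up with the same unrolling; whether 16 or 32 is actually better for your kernel is something only profiling can tell you.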

If I understand your code correctly, you would like to copy the same row of 256 elements from global memory into 32 rows of 256 elements in shared memory, and you are using one thread per row of shared memory. If that is the case, your code probably has a bad memory access pattern (I can’t confirm this without a minimal example). Ideally, each thread would operate on a separate shared memory column, not on a separate row.
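Purely as an illustration of what I mean by "column per thread" (the array name and indexing below are made up, since I don’t know your real layout, and the surrounding kernel is omitted):

// inside the kernel, with a 32-thread block:
// element [i][threadIdx.x] makes consecutive threads touch consecutive
// 4-byte words, i.e. consecutive shared memory banks (no bank conflicts)
__shared__ uint32_t l_sharedT[_SHARED_LENGTH][32];

for( int32_t l_nLoop = 0; l_nLoop < _SHARED_LENGTH; l_nLoop++ )
{
    l_sharedT[l_nLoop][threadIdx.x] =
        l_ui32GlobalPtr[_GLOBAL_MEMORY_SIZE_UI32 - _SHARED_LENGTH + l_nLoop];
}

The shared memory usage stays at 32 KB per block (256 * 32 * 4 bytes), the same as your union, but each thread’s later accesses then walk down a column instead of along a row.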

Hi striker,

Thanks for your fast reply. Yes, I’m aware that it is much better to have a complete sample, but I picked this out of a very large project in the hope that this is something common and/or known.

I copy 256 uint32_t values from global to shared memory. I mentioned the block size of 32 to show that the required shared memory fits into the allocated shared memory.

The memory is not shared across the thread block; each thread uses only its own portion. It is just there to avoid using global memory. I see the problem with the access pattern, but I assume it is not related to my for loop / macro observation. I saw the unroll difference between the two versions in the PTX, but since I do not have an unroll pragma in place, I wonder why the compiler uses different factors here.

The do { ... } while(0) construct is just there to force a ‘;’ at the end of the macro call. I was hoping the compiler would optimize it away.

I will dive deeper into the unroll pragma and try to understand the different handling.

Thanks a lot for your insights and findings so far!