CUDA performance struct vs multiple variables

I am currently trying to improve the usability of my code and wanted to reduce the number of function parameters.
This can be achieved by grouping the logically connected values (the data array, width, and height) into a structure.

Before I changed my complete code I created a test file to compare the performance of two otherwise identical implementations.

The result is quite bad: the function using structures runs about 20% slower.

Here are the functions and the generated assembler code:

First function, separated variables:

__global__ void  mul1(float * array, const float * array2,const int width, const int height)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	for(int j = i; j < width*height; j+=gridDim.x*blockDim.x)
	{
			array[j] = array[j] * array2[j];
	}
}

The assembler code generated for line 6 of the kernel (the loop body) looks like this:

SHL R6, R5, 0x2;
IADD R6, R0, R6;
MOV R6, R6;
LD R7, [R6];
SHL R6, R5, 0x2;
IADD R6, R2, R6;
MOV R6, R6;
LD R6, [R6];
FMUL R7, R7, R6;
SHL R6, R5, 0x2;
IADD R6, R0, R6;
MOV R6, R6;
ST [R6], R7;

Second function, using a structure containing array, width and height:

__global__ void mul2(s_array Array1, const s_array Array2)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	for(int j = i; j < Array1.height*Array1.width; j+=gridDim.x*blockDim.x)
	{
			Array1.data[j] = Array1.data[j] * Array2.data[j];
	}
}

The assembler code generated for line 6 of the kernel (the loop body) looks like this:

IADD R3, R0, RZ;
MOV R3, R3;
LD R3, [R3];
SHL R4, R2, 0x2;
IADD R3, R3, R4;
MOV R3, R3;
LD R4, [R3];
IADD R3, R0, 0x10;
MOV R3, R3;
LD R3, [R3];
SHL R5, R2, 0x2;
IADD R3, R3, R5;
MOV R3, R3;
LD R3, [R3];
FMUL R5, R4, R3;
IADD R3, R0, RZ;
MOV R3, R3;
LD R3, [R3];
SHL R4, R2, 0x2;
IADD R3, R3, R4;
MOV R3, R3;
ST [R3], R5;

Also noticeable is that the calculation of width*height is done multiple times, even though width and height are passed as constant arguments, so the value could be computed once and stored instead of being multiplied again.

Is there any reason the generated code is so different?
To me it looks like the uglier code produces the faster program.

Compiler optimizations are set to maximum, and no restriction on the number of registers was imposed.
I am using Nsight Eclipse Edition with CUDA Toolkit 6.5.

Beauty or ugliness is usually in the eye of the beholder. To me, the first code looks more readable than the second.

The kernel appears to be memory bound, and the second variant appears to have many more load instructions than the first, which does not seem to make sense. Can you post complete, buildable and runnable code, along with the exact nvcc command line used to build the code?

I suspect performance differences may also have to do with the use of the 'const' qualifier. Using a "const T *" is not the same semantically as sticking a pointer in a struct and declaring the struct as const. In the first case, the data the pointer is pointing to has the read-only attribute; in the second case, it is the struct itself that does.

When working with potentially memory-bound code, it is recommended practice to use the __restrict__ modifier on pointer arguments, provided that the pointers are not aliased. This gives the compiler more freedom to re-order loads and stores:

__global__ void mul1(float * __restrict__ array, const float * __restrict__ array2, const int width, const int height)

Thanks for the answer; unfortunately I am at home now and won't be able to post any code until Monday.

About the use of the restrict keyword: as far as I understood it, it would only have an effect if there were multiple "theoretical reads" of the same variable, right? There is no second chance for the program to load Array1 or Array2 again, so in my particular case I should not notice a difference?

I tried the restrict keyword on the same example earlier today and did not notice any performance change. I also used restrict on the array pointer inside the structure (I am not sure that can even have an effect when it is buried that deep).

What you can notice in the second example is that the access to the structure makes the compiler generate code that accesses the data fields of the structure by adding an offset to the base address of the structure. In my eyes it should be possible to optimize this: the calculation is done multiple times, and I cannot see why the compiler chooses to recalculate the address instead of using an additional register.

Could it be that the compiler will use registers to save intermediate results if I restrict the kernel with appropriate launch bounds?

restrict is an instruction to the compiler that says your arrays do not overlap each other. It can be used (in both host and GPU code) by compilers to do some optimisation. On K20 and newer GPUs, the nvcc compiler can use it together with const to direct global reads via a small read-only cache. This can help performance.

Bill

I am not going to speculate on what may be going on with the code generation of the two code variants without having buildable, runnable code to try out.

Use of restrict is perfectly appropriate for streaming kernels without data re-use. While it may not provide any performance benefits in a particular case, across the universe of all code it provides potential performance upside with negligible risk of performance downside. So it is a good idea to use it whenever it is guaranteed by construction that pointer arguments to functions do not alias.

restrict is basically a promise by the programmer that the referenced memory does not overlap, as already pointed out by wlangdon. In other words, access to each object is restricted to occur through that particular pointer and no other. If the promise is broken, the behavior of the resulting code is indeterminate. As wlangdon further points out, in addition to allowing more liberal re-ordering of loads and stores, restrict in combination with the ‘const’ modifier also enables (but does not guarantee!) use of the LDG instruction on newer GPU architectures, which can improve load performance. I have certainly experienced several cases where the simple addition of restrict (few minutes of work) resulted in a 20% speedup. Compare the recommendations of the Best Practices Guide.

The "restrict" modifier was first introduced in ISO C99 and is completely standard in C. However, it is one of the C features that has not made it into the C++ standard, for reasons I do not know. Most C++ compilers offer it as a language extension, but require the reserved-identifier spelling (marked by leading double underscores), which gives us the __restrict__ supported by the CUDA compiler.

As requested, here is the full executable code. I am using host memory because my platform is a Jetson TK1.
The execution times on my system for exactly this code are:
54% mul2
46% mul1

mul2 uses 9 registers, mul1 uses 8; apart from minor differences, the overall active warps and occupancy are the same.

#include <stdio.h>

struct s_array
{
	float * data;
	int width;
	int height;
};

void fill_num(float * array, int width, int height, int num)
{
	for(int i = 0; i < width*height; i++)
	{
		array[i] = num;
	}
}

__global__ void  mul1(float * array1, const float * array2,const int width, const int height)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	for(int j = i; j < height*width; j+=gridDim.x*blockDim.x)
	{
			array1[j] = array1[j] * array2[j];
	}
}

__global__ void mul2(s_array Array1,  s_array Array2)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	for(int j = i; j < Array1.height*Array1.width; j+=gridDim.x*blockDim.x)
	{
			Array1.data[j] = Array1.data[j] * Array2.data[j];
	}
}

int main(void)
{
	float * Array1,* Array2;
	s_array Array1_s, Array2_s;
	int width = 1000, height =1000;
	Array1_s.height=height;
	Array1_s.width=width;
	Array2_s.height=height;
	Array2_s.width=width;

	cudaHostAlloc((void**)&Array1,width*height*sizeof(float),cudaHostAllocDefault);
	cudaHostAlloc((void**)&Array2,width*height*sizeof(float),cudaHostAllocDefault);
	cudaHostAlloc((void**)&Array1_s.data,Array1_s.width*Array1_s.height*sizeof(float),cudaHostAllocDefault);
	cudaHostAlloc((void**)&Array2_s.data,Array2_s.width*Array2_s.height*sizeof(float),cudaHostAllocDefault);

	fill_num(Array1,width,height,7);
	fill_num(Array2,width,height,5);
	fill_num(Array1_s.data,Array1_s.width,Array1_s.height,7);
	fill_num(Array2_s.data,Array2_s.width,Array2_s.height,5);


	mul2<<<20,256>>>(Array1_s,Array2_s);
	cudaDeviceSynchronize();
	printf("Data is: %f\n",Array1_s.data[1000]);
	mul1<<<20,256>>>(Array1,Array2,width, height);
	cudaDeviceSynchronize();
	printf("Data is: %f\n",Array1[1000]);

	for(int i =0 ; i < width*height;i++)
	{
		if(Array1[i]!=Array1_s.data[i])
		{
			printf("Error in element %i\n", i);
		}
	}


	return 0;
}

On the GPU I have here, with the code compiled as is, I see pretty much identical execution times for the two kernel variants; in fact mul2() is a tad faster, but the difference is at noise level (< 2%). I used the simple profiler built into the CUDA driver to measure kernel execution times, which is turned on by exporting CUDA_PROFILE=1. My build command line was: nvcc -o struct struct.cu

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Quadro 2000
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR 13cafda47fef7af8
method,gputime,cputime,occupancy
method=[ _Z4mul27s_arrayS_ ] gputime=[ 1273.184 ] cputime=[ 22.285 ] occupancy=[ 1.000 ]
method=[ _Z4mul1PfPKfii ] gputime=[ 1289.888 ] cputime=[ 18.766 ] occupancy=[ 1.000 ]

The machine code for the two kernels looks pretty much identical (see disassembly below). Are you doing a debug build by any chance, or specifying -G or -lineinfo as part of your build?

code for sm_20
                Function : _Z4mul1PfPKfii
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];                /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_CTAID.X;                   /* 0x2c00000094001c04 */
        /*0010*/         S2R R2, SR_TID.X;                     /* 0x2c00000084009c04 */
        /*0018*/         MOV R3, c[0x0][0x34];                 /* 0x28004000d000dde4 */
        /*0020*/         IMAD R0, R0, c[0x0][0x8], R2;         /* 0x2004400020001ca3 */
        /*0028*/         IMUL R5, R3, c[0x0][0x30];            /* 0x50004000c0315ca3 */
        /*0030*/         ISETP.GE.AND P0, PT, R0, R5, PT;      /* 0x1b0e00001401dc23 */
        /*0038*/     @P0 EXIT;                                 /* 0x80000000000001e7 */
        /*0040*/         MOV R2, c[0x0][0x14];                 /* 0x2800400050009de4 */
        /*0048*/         MOV32I R10, 0x4;                      /* 0x1800000010029de2 */
        /*0050*/         IMUL R4, R2, c[0x0][0x8];             /* 0x5000400020211ca3 */
        /*0058*/         IMAD R8.CC, R0, R10, c[0x0][0x28];    /* 0x20158000a0021ca3 */
        /*0060*/         IMAD.HI.X R9, R0, R10, c[0x0][0x2c];  /* 0x20948000b0025ce3 */
        /*0068*/         IMAD R2.CC, R0, R10, c[0x0][0x20];    /* 0x2015800080009ca3 */
        /*0070*/         LD.E R6, [R8];                        /* 0x8400000000819c85 */
        /*0078*/         IMAD.HI.X R3, R0, R10, c[0x0][0x24];  /* 0x209480009000dce3 */
        /*0080*/         IADD R0, R4, R0;                      /* 0x4800000000401c03 */
        /*0088*/         LD.E R7, [R2];                        /* 0x840000000021dc85 */
        /*0090*/         ISETP.LT.AND P0, PT, R0, R5, PT;      /* 0x188e00001401dc23 */
        /*0098*/         FMUL R6, R7, R6;                      /* 0x5800000018719c00 */
        /*00a0*/         ST.E [R2], R6;                        /* 0x9400000000219c85 */
        /*00a8*/     @P0 BRA 0x58;                             /* 0x4003fffea00001e7 */
        /*00b0*/         EXIT;                                 /* 0x8000000000001de7 */
                ...............................

Function : _Z4mul27s_arrayS_
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];                /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_CTAID.X;                   /* 0x2c00000094001c04 */
        /*0010*/         S2R R2, SR_TID.X;                     /* 0x2c00000084009c04 */
        /*0018*/         MOV R3, c[0x0][0x2c];                 /* 0x28004000b000dde4 */
        /*0020*/         IMAD R0, R0, c[0x0][0x8], R2;         /* 0x2004400020001ca3 */
        /*0028*/         IMUL R5, R3, c[0x0][0x28];            /* 0x50004000a0315ca3 */
        /*0030*/         ISETP.GE.AND P0, PT, R0, R5, PT;      /* 0x1b0e00001401dc23 */
        /*0038*/     @P0 EXIT;                                 /* 0x80000000000001e7 */
        /*0040*/         MOV R2, c[0x0][0x14];                 /* 0x2800400050009de4 */
        /*0048*/         MOV32I R10, 0x4;                      /* 0x1800000010029de2 */
        /*0050*/         IMUL R4, R2, c[0x0][0x8];             /* 0x5000400020211ca3 */
        /*0058*/         IMAD R8.CC, R0, R10, c[0x0][0x30];    /* 0x20158000c0021ca3 */
        /*0060*/         IMAD.HI.X R9, R0, R10, c[0x0][0x34];  /* 0x20948000d0025ce3 */
        /*0068*/         IMAD R2.CC, R0, R10, c[0x0][0x20];    /* 0x2015800080009ca3 */
        /*0070*/         LD.E R6, [R8];                        /* 0x8400000000819c85 */
        /*0078*/         IMAD.HI.X R3, R0, R10, c[0x0][0x24];  /* 0x209480009000dce3 */
        /*0080*/         IADD R0, R4, R0;                      /* 0x4800000000401c03 */
        /*0088*/         LD.E R7, [R2];                        /* 0x840000000021dc85 */
        /*0090*/         ISETP.LT.AND P0, PT, R0, R5, PT;      /* 0x188e00001401dc23 */
        /*0098*/         FMUL R6, R7, R6;                      /* 0x5800000018719c00 */
        /*00a0*/         ST.E [R2], R6;                        /* 0x9400000000219c85 */
        /*00a8*/     @P0 BRA 0x58;                             /* 0x4003fffea00001e7 */
        /*00b0*/         EXIT;                                 /* 0x8000000000001de7 */
                ..................................

When I build for the Jetson TK1 compute capability of 3.2, there is also no significant difference in the code generated for the two kernel variants. Using CUDA 6.5, I built the code as follows:

nvcc -arch=sm_32 -o struct struct.cu

The generated machine code is:

code for sm_32
               Function : _Z4mul1PfPKfii
       .headerflags    @"EF_CUDA_SM32 EF_CUDA_PTX_SM(EF_CUDA_SM32)"
                                                                         /* 0x08b0a08c908c1000 */
       /*0008*/                   MOV R1, c[0x0][0x44];                  /* 0x64c03c00089c0006 */
       /*0010*/                   S2R R0, SR_CTAID.X;                    /* 0x86400000129c0002 */
       /*0018*/                   MOV R2, c[0x0][0x154];                 /* 0x64c03c002a9c000a */
       /*0020*/                   S2R R3, SR_TID.X;                      /* 0x86400000109c000e */
       /*0028*/                   IMUL R4, R2, c[0x0][0x150];            /* 0x61c018002a1c0812 */
       /*0030*/                   IMAD R0, R0, c[0x0][0x28], R3;         /* 0x51080c00051c0002 */
       /*0038*/                   ISETP.GE.AND P0, PT, R0, R4, PT;       /* 0xdb681c00021c001e */
                                                                         /* 0x0800808010a010b8 */
       /*0048*/               @P0 EXIT;                                  /* 0x180000000000003c */
       /*0050*/                   MOV R2, c[0x0][0x34];                  /* 0x64c03c00069c000a */
       /*0058*/                   MOV32I R11, 0x4;                       /* 0x74000000021fc02e */
       /*0060*/                   IMUL R3, R2, c[0x0][0x28];             /* 0x61c01800051c080e */
       /*0068*/                   NOP;                                   /* 0x85800000001c3c02 */
       /*0070*/                   NOP;                                   /* 0x85800000001c3c02 */
       /*0078*/                   NOP;                                   /* 0x85800000001c3c02 */
                                                                         /* 0x08a010a09c908cb0 */
       /*0088*/                   IMAD R8.CC, R0, R11, c[0x0][0x148];    /* 0x910c2c00291c0022 */
       /*0090*/                   IMAD.HI.X R9, R0, R11, c[0x0][0x14c];  /* 0x93182c00299c0026 */
       /*0098*/                   IMAD R6.CC, R0, R11, c[0x0][0x140];    /* 0x910c2c00281c001a */
       /*00a0*/                   LD.E R2, [R8];                         /* 0xc4800000001c2008 */
       /*00a8*/                   IMAD.HI.X R7, R0, R11, c[0x0][0x144];  /* 0x93182c00289c001e */
       /*00b0*/                   LD.E R5, [R6];                         /* 0xc4800000001c1814 */
       /*00b8*/                   IADD R0, R3, R0;                       /* 0xe0800000001c0c02 */
                                                                         /* 0x0800b810b810a0b8 */
       /*00c8*/                   ISETP.LT.AND P0, PT, R0, R4, PT;       /* 0xdb181c00021c001e */
       /*00d0*/                   FMUL R2, R5, R2;                       /* 0xe3400000011c140a */
       /*00d8*/                   ST.E [R6], R2;                         /* 0xe4800000001c1808 */
       /*00e0*/               @P0 BRA 0x80;                              /* 0x12007fffcc00003c */
       /*00e8*/                   MOV RZ, RZ;                            /* 0xe4c03c007f9c03fe */
       /*00f0*/                   EXIT;                                  /* 0x18000000001c003c */
       /*00f8*/                   BRA 0xf8;                              /* 0x12007ffffc1c003c */
               ...............................

Function : _Z4mul27s_arrayS_
       .headerflags    @"EF_CUDA_SM32 EF_CUDA_PTX_SM(EF_CUDA_SM32)"
                                                                         /* 0x08b0a08c908c1000 */
       /*0008*/                   MOV R1, c[0x0][0x44];                  /* 0x64c03c00089c0006 */
       /*0010*/                   S2R R0, SR_CTAID.X;                    /* 0x86400000129c0002 */
       /*0018*/                   MOV R2, c[0x0][0x14c];                 /* 0x64c03c00299c000a */
       /*0020*/                   S2R R3, SR_TID.X;                      /* 0x86400000109c000e */
       /*0028*/                   IMUL R4, R2, c[0x0][0x148];            /* 0x61c01800291c0812 */
       /*0030*/                   IMAD R0, R0, c[0x0][0x28], R3;         /* 0x51080c00051c0002 */
       /*0038*/                   ISETP.GE.AND P0, PT, R0, R4, PT;       /* 0xdb681c00021c001e */
                                                                         /* 0x0800808010a010b8 */
       /*0048*/               @P0 EXIT;                                  /* 0x180000000000003c */
       /*0050*/                   MOV R2, c[0x0][0x34];                  /* 0x64c03c00069c000a */
       /*0058*/                   MOV32I R11, 0x4;                       /* 0x74000000021fc02e */
       /*0060*/                   IMUL R3, R2, c[0x0][0x28];             /* 0x61c01800051c080e */
       /*0068*/                   NOP;                                   /* 0x85800000001c3c02 */
       /*0070*/                   NOP;                                   /* 0x85800000001c3c02 */
       /*0078*/                   NOP;                                   /* 0x85800000001c3c02 */
                                                                         /* 0x08a010a09c908cb0 */
       /*0088*/                   IMAD R8.CC, R0, R11, c[0x0][0x150];    /* 0x910c2c002a1c0022 */
       /*0090*/                   IMAD.HI.X R9, R0, R11, c[0x0][0x154];  /* 0x93182c002a9c0026 */
       /*0098*/                   IMAD R6.CC, R0, R11, c[0x0][0x140];    /* 0x910c2c00281c001a */
       /*00a0*/                   LD.E R2, [R8];                         /* 0xc4800000001c2008 */
       /*00a8*/                   IMAD.HI.X R7, R0, R11, c[0x0][0x144];  /* 0x93182c00289c001e */
       /*00b0*/                   LD.E R5, [R6];                         /* 0xc4800000001c1814 */
       /*00b8*/                   IADD R0, R3, R0;                       /* 0xe0800000001c0c02 */
                                                                         /* 0x0800b810b810a0b8 */
       /*00c8*/                   ISETP.LT.AND P0, PT, R0, R4, PT;       /* 0xdb181c00021c001e */
       /*00d0*/                   FMUL R2, R5, R2;                       /* 0xe3400000011c140a */
       /*00d8*/                   ST.E [R6], R2;                         /* 0xe4800000001c1808 */
       /*00e0*/               @P0 BRA 0x80;                              /* 0x12007fffcc00003c */
       /*00e8*/                   MOV RZ, RZ;                            /* 0xe4c03c007f9c03fe */
       /*00f0*/                   EXIT;                                  /* 0x18000000001c003c */
       /*00f8*/                   BRA 0xf8;                              /* 0x12007ffffc1c003c */
               ..................................

If I add the __restrict__ modifier to the pointers in mul1() as I previously suggested, an LDG instruction is generated for the load from the 'const' array, as expected (line 23 of the disassembly):

code for sm_32
                Function : _Z4mul1PfPKfii
        .headerflags    @"EF_CUDA_SM32 EF_CUDA_PTX_SM(EF_CUDA_SM32)"
                                                                         /* 0x08b0a08c908c1000 */
        /*0008*/                   MOV R1, c[0x0][0x44];                 /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_CTAID.X;                   /* 0x86400000129c0002 */
        /*0018*/                   MOV R2, c[0x0][0x154];                /* 0x64c03c002a9c000a */
        /*0020*/                   S2R R3, SR_TID.X;                     /* 0x86400000109c000e */
        /*0028*/                   IMUL R4, R2, c[0x0][0x150];           /* 0x61c018002a1c0812 */
        /*0030*/                   IMAD R0, R0, c[0x0][0x28], R3;        /* 0x51080c00051c0002 */
        /*0038*/                   ISETP.GE.AND P0, PT, R0, R4, PT;      /* 0xdb681c00021c001e */
                                                                         /* 0x0800808010a010b8 */
        /*0048*/               @P0 EXIT;                                 /* 0x180000000000003c */
        /*0050*/                   MOV R2, c[0x0][0x34];                 /* 0x64c03c00069c000a */
        /*0058*/                   MOV32I R9, 0x4;                       /* 0x74000000021fc026 */
        /*0060*/                   IMUL R3, R2, c[0x0][0x28];            /* 0x61c01800051c080e */
        /*0068*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0070*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0078*/                   NOP;                                  /* 0x85800000001c3c02 */
                                                                         /* 0x08a010a0b080a0b0 */
        /*0088*/                   IMAD R6.CC, R0, R9, c[0x0][0x148];    /* 0x910c2400291c001a */
        /*0090*/                   IMAD.HI.X R7, R0, R9, c[0x0][0x14c];  /* 0x93182400299c001e */
        /*0098*/                   LDG.E R2, [R6];                       /* 0x600210847f9c1809 */
        /*00a0*/                   IMAD R6.CC, R0, R9, c[0x0][0x140];    /* 0x910c2400281c001a */
        /*00a8*/                   IMAD.HI.X R7, R0, R9, c[0x0][0x144];  /* 0x93182400289c001e */
        /*00b0*/                   LD.E R5, [R6];                        /* 0xc4800000001c1814 */
        /*00b8*/                   IADD R0, R3, R0;                      /* 0xe0800000001c0c02 */
                                                                         /* 0x08b810b810a30880 */
        /*00c8*/                   ISETP.LT.AND P0, PT, R0, R4, PT;      /* 0xdb181c00021c001e */
        /*00d0*/                   TEXDEPBAR 0x0;                        /* 0x77000000001c0002 */
        /*00d8*/                   FMUL R2, R5, R2;                      /* 0xe3400000011c140a */
        /*00e0*/                   ST.E [R6], R2;                        /* 0xe4800000001c1808 */
        /*00e8*/               @P0 BRA 0x80;                             /* 0x12007fffc800003c */
        /*00f0*/                   MOV RZ, RZ;                           /* 0xe4c03c007f9c03fe */
        /*00f8*/                   EXIT;                                 /* 0x18000000001c003c */
        /*0100*/                   BRA 0x100;                            /* 0x12007ffffc1c003c */
        /*0108*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0110*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0118*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0120*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0128*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0130*/                   NOP;                                  /* 0x85800000001c3c02 */
        /*0138*/                   NOP;                                  /* 0x85800000001c3c02 */
                ...............................

Thanks for the fast answer.
I am using the Nsight Eclipse programming environment. Indeed, for the debug build the -G flag is set.
When I clean the project and start the profiling, the Release build seems to be started, so I assumed the profiling was done on the release executable.

I have just tested creating the same dump as yours on the Jetson, without using Nsight, with the following commands:

nvcc -cubin -arch=sm_32 --ptxas-options=-v struct.cu
and
nvcc -G -cubin -arch=sm_32 --ptxas-options=-v struct.cu

after this I used
cuobjdump -sass struct.cubin

The dump is very different when the -G option is specified, while without -G it looks the same as yours, as you expected.

The code I am seeing when I profile the application with Nsight looks more like the one I got when using the -G option.

When I clean the project and just press Profile, the command it issues is the same as the one for the release build. I do not really know if there is any special flag included; maybe you are able to clear this up:

Invoking: NVCC Compiler
/usr/local/cuda-6.5/bin/nvcc -O3 -gencode arch=compute_32,code=sm_32  -odir "" -M -o "cuda_structure_test.d" "../cuda_structure_test.cu"
/usr/local/cuda-6.5/bin/nvcc -O3 --compile --relocatable-device-code=false -gencode arch=compute_32,code=compute_32 -gencode arch=compute_32,code=sm_32  -x cu -o  "cuda_structure_test.o" "../cuda_structure_test.cu"

In the profiling view of Nsight I am shown 8/9 registers used, while 7 are used after issuing the command above directly on the Jetson.

[b]Edit: In a hidden setting of Nsight, the Debug executable was selected for the profiling process. After changing the executable to Release, the result is similar to yours.

Thank you very much for the help![/b]

The -G flag is needed to produce code suitable for debugging. It causes all compiler optimizations to be disabled. There is no point in timing code that is compiled with optimizations turned off. I have never used Nsight, so I do not know how it sets the compiler flags. I generally distrust IDEs and tend to use makefiles, where I can explicitly control (and check) compilation flags with ease.

After experiencing this issue I can very well understand your mistrust of them. All performance measurements I ever made might be wrong, since the default setting seems to be to profile the Debug executable, which, as you just said, makes no sense.

The good part is that I just reduced the runtime of the program I actually want to optimize by a factor of 9, which is very good news! Thanks again!

Nsight is pretty good though. It’s taken some getting used to but there are options for release and debug builds when building. Also, they’ll auto-generate Makefiles for you and I think that’s pretty sexy.

The only thing I don’t like is that they force you to store all your extra command line options in a separate ASCII file. But I guess that’s just being super organized.