code generated with cuda 5.0 is slower than with cuda 4.2 !?

Hi there,
a simple example

__global__ void testCU(float2 *ptr)
	float2 a;
	a = *ptr;
	a.x += 2.0f;
	a.y += 3.0f;
	*ptr = a;

If compiled using 5.0 we get pairs of ST and LD from/to global memory.
Whiles, 4.2 generates single loads and stores using LD.64 and ST.64 respectively.
Both compilations were done for 3.0 architecture.
I would attach dissembler for both but this website seem to be very unhappy when I do so, and didn’t let me create a topic with them.

This is just an example of very wrong compilation, as it’s clear it should use vector load/store operations for coalesced memory access.

For my own kernel that works on complex data (float2) I gained a noticeable speed up by switching from 5.0 to 4.2, or should I say my code got slower with upgrading to 5.0.

Is it a bug in the new release ?
Where can I submit it ?

Have you noticed anything like this ?
Can I force 5.0 to generate correct code ?


Sorry for the trouble with attachments to forum posts. There are known issues with various types of attachments (using a different file extension seems to work as a workaround for some of them). I will ping the relevant team. What file name or file extension did you use for the files with disassembled code? What exact error was reported by the forum software?

I am aware of issues with missed load / store vectorization opportunities when using the CUDA 5.0 toolchain, but was under the impression that instances are isolated rather than wide-spread. While these issues may be fixed by now, to be completely sure they have been addressed for your particular use case it would be best to file a bug with a self-contained, runnable, repro case. The bug reporting form is accessible via the registered developer website
sample_disassembly.txt (2.31 KB)

Hi njuffa,
thank you for a very quick reply.
I was trying to put them in text as ‘code’.
They weren’t longer than ~15 lines.
Every time I tried that the page simply died with blank response (something about ‘no return data’).

Regarding missing vectorizations. Example I posted hear is the most basic I could come up with and yet compiler got it wrong.
I’ll follow it with a bug report as you suggested, thank you.

I tried to attach a simple text file with disassembled code to my previous post, and while I got an error message telling me the system was unable to upload the file at the time, I now see that this attachment is there after all. Weird. I have filed a bug about the error message. Again, sorry for inconvenience in attaching files to posts. I view this as important functionality in a forum for programmers for sharing code that extends beyond the confies of a forum post of reasonable length.

Thank you for following up with a bug report on the missing load / store vectorization in your code.

Being surprised this is a bug, I gave it a quick try. And for some reason, I can’t reproduce, I do get the correct output with the loads and stores merged using LD.64 and ST.64.

code for sm_20
         Function : _Z6testCUP6float2
    /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
    /*0008*/     /*0x80019de428004000*/     MOV R6, c [0x0] [0x20];
    /*0010*/     /*0x9001dde428004000*/     MOV R7, c [0x0] [0x24];
    /*0018*/     /*0x80011de428004000*/     MOV R4, c [0x0] [0x20];
    /*0020*/     /*0x90015de428004000*/     MOV R5, c [0x0] [0x24];
    /*0028*/     /*0x00609ca584000000*/     LD.E.64 R2, [R6];
    /*0030*/     /*0x00209c005000d000*/     FADD R2, R2, 0x40000;
    /*0038*/     /*0x0030dc005000d010*/     FADD R3, R3, 0x40400;
    /*0040*/     /*0x00409ca594000000*/     ST.E.64 [R4], R2;
    /*0048*/     /*0x00001de780000000*/     EXIT;

Can this be a Linux vs. Windows compiler problem? I’m using Linux (NVCC 5.0, V0.2.1221) and noticed njuffa’s file has Windows line-endings (and the order of the columns of the output of cuobjdump is different for some reason).

Glupol, does it use a single store if you write

__global__ void testCU(float2 *ptr)
	ptr->x += 2.0f;
	ptr->y += 3.0f;

And which compute capability are you compiling for - 1.x by any chance?
I remember previous releases of CUDA specifically required either this or your form of the code depending on whether the Open64- or the LLVM-based compiler was used. And 5.0 is the first release to use the LLVM-based compiler for compute capability 1.x code as well.

thank you for interest in the post.
As I wrote, I used cuda 5.0 & 4.2 toolkits compiling for 2.0 and 3.0 compute capabilities with --compile option.
I’m using vs 2010 with nsight.
One thing I left out was that I compiled it for debug.
I reran it with debug and release options for 4.2 and 5.0.
It’s not as bad as I though, but it’s a bit messy.
nvcc from 5.0 for debug compilation is the only one that didn’t get it right.
Probably it’s not a massive issue but it’s a bit surprising, and confusing result.
I attached dissemble for all for compilations.

Thank you,
test_rel_5_0.txt (1.43 KB)
test_rel_4_2.txt (1.39 KB)
test_deb_5_0.txt (5.89 KB)
test_deb_4_2.txt (5.1 KB)

Note that the CUDA compiler turns off all optimizations for debug builds. Since vectorization is a performance optimization, it is not surprising if vectorization does not occur for debug builds.

@tera: Could you point me to the source of the statement that CUDA 5.0 uses the LLVM-based frontend for sm_1x targets?

Sorry Norbert. I probably thought so because the -nvvm has disappeared from nvcc, but a quick test with -dryrun proves me wrong, nvopencc is still used on compute capability 1.x.

It’s a pity that -nvvm is gone because it usually produces much better code on 1.x devices and gets around a few limitations (e.g. it can unroll nested loops)

Thanks for confirming that I didn’t miss any functionality changes. Use of nvvm with the sm_1x backend was never supported functionality, and the combination was untested. I am aware that the potential for serious code generation bugs with this combination was not merely theoretical.