When compiled with CUDA 5.0, we get pairs of scalar LD and ST instructions to/from global memory.
CUDA 4.2, on the other hand, generates single loads and stores using LD.64 and ST.64 respectively.
Both compilations were done for the 3.0 architecture.
I would attach the disassembly for both, but this website seems very unhappy when I do so and didn’t let me create a topic with the files attached.
This is just one example of clearly wrong code generation, as the compiler should be using vector load/store operations for coalesced memory access.
For my own kernel, which works on complex data (float2), I gained a noticeable speedup by switching from 5.0 back to 4.2; or should I say, my code got slower after upgrading to 5.0.
Is it a bug in the new release?
Where can I submit it?
Have you noticed anything like this?
Can I force 5.0 to generate correct code?
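For reference, a minimal kernel exhibiting the pattern in question might look like the following. This is only a sketch: the kernel name is taken from the disassembly posted later in this thread, and the constants are placeholders.

// hypothetical reconstruction of a minimal float2 test kernel
__global__ void testCU(float2 *p)
{
    float2 v = *p;   // should compile to a single LD.E.64
    v.x += 2.0f;     // placeholder constants
    v.y += 3.0f;
    *p = v;          // should compile to a single ST.E.64
}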
Sorry for the trouble with attachments to forum posts. There are known issues with various types of attachments (using a different file extension seems to work as a workaround for some of them). I will ping the relevant team. What file name or file extension did you use for the files with disassembled code? What exact error was reported by the forum software?
I am aware of issues with missed load/store vectorization opportunities when using the CUDA 5.0 toolchain, but was under the impression that instances are isolated rather than widespread. While these issues may be fixed by now, to be completely sure they have been addressed for your particular use case it would be best to file a bug with a self-contained, runnable repro case. The bug reporting form is accessible via the registered developer website: https://developer.nvidia.com/user/register
sample_disassembly.txt (2.31 KB)
Hi njuffa,
thank you for the very quick reply.
I was trying to put them in the text as ‘code’.
They weren’t longer than ~15 lines.
Every time I tried, the page simply died with a blank response (something about ‘no return data’).
Regarding the missed vectorizations: the example I posted here is the most basic one I could come up with, and yet the compiler got it wrong.
I’ll follow up with a bug report as you suggested, thank you.
G.
I tried to attach a simple text file with disassembled code to my previous post, and while I got an error message telling me the system was unable to upload the file at the time, I now see that the attachment is there after all. Weird. I have filed a bug about the error message. Again, sorry for the inconvenience in attaching files to posts. I view this as important functionality in a forum for programmers, for sharing code that extends beyond the confines of a forum post of reasonable length.
Thank you for following up with a bug report on the missing load / store vectorization in your code.
Surprised that this would be a bug, I gave it a quick try. For some reason I can’t reproduce it: I get the correct output, with the loads and stores merged into LD.64 and ST.64.
code for sm_20
Function : _Z6testCUP6float2
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20];
/*0010*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24];
/*0018*/ /*0x80011de428004000*/ MOV R4, c [0x0] [0x20];
/*0020*/ /*0x90015de428004000*/ MOV R5, c [0x0] [0x24];
/*0028*/ /*0x00609ca584000000*/ LD.E.64 R2, [R6];
/*0030*/ /*0x00209c005000d000*/ FADD R2, R2, 0x40000;
/*0038*/ /*0x0030dc005000d010*/ FADD R3, R3, 0x40400;
/*0040*/ /*0x00409ca594000000*/ ST.E.64 [R4], R2;
/*0048*/ /*0x00001de780000000*/ EXIT;
..................................
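(In case anyone wants to check their own build: SASS like the above can be dumped with cuobjdump. The file names below are just placeholders.)

# compile for sm_20 and disassemble the embedded machine code
nvcc -arch=sm_20 -c test.cu -o test.o
cuobjdump -sass test.o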
Could this be a Linux vs. Windows compiler problem? I’m using Linux (NVCC 5.0, V0.2.1221) and noticed njuffa’s file has Windows line endings (and, for some reason, the order of the columns in the cuobjdump output is different).
And which compute capability are you compiling for? 1.x, by any chance?
I remember that previous CUDA releases specifically required one form of the code or the other, depending on whether the Open64-based or the LLVM-based compiler was used. And 5.0 is the first release to use the LLVM-based compiler for compute capability 1.x code as well.
Hi,
thank you for your interest in the post.
As I wrote, I used the CUDA 5.0 and 4.2 toolkits, compiling for compute capabilities 2.0 and 3.0 with the --compile option.
I’m using VS 2010 with Nsight.
One thing I left out was that I compiled it for debug.
I have now rerun it with both debug and release options for 4.2 and 5.0.
It’s not as bad as I thought, but it’s a bit messy:
nvcc from 5.0 in a debug compilation is the only one that didn’t get it right.
It’s probably not a massive issue, but it is a surprising and confusing result.
I attached the disassembly for all four compilations.
Note that the CUDA compiler turns off all optimizations for debug builds. Since vectorization is a performance optimization, it is not surprising if vectorization does not occur for debug builds.
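For illustration, the difference comes down to the -G flag, which enables device-side debugging and disables device code optimization (file names here are placeholders):

# debug build: -G turns off all device code optimizations,
# so vectorized LD.64/ST.64 accesses generally will not appear
nvcc -arch=sm_20 -G --compile kernel.cu -o kernel_debug.obj

# release build: optimizations, including load/store vectorization, are enabled
nvcc -arch=sm_20 --compile kernel.cu -o kernel_release.obj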
@tera: Could you point me to the source of the statement that CUDA 5.0 uses the LLVM-based frontend for sm_1x targets?
Sorry, Norbert. I probably thought so because the -nvvm option has disappeared from nvcc, but a quick test with -dryrun proves me wrong: nvopencc is still used for compute capability 1.x.
It’s a pity that -nvvm is gone, because it usually produced much better code for 1.x devices and got around a few limitations (e.g., it could unroll nested loops).
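(For anyone who wants to verify this themselves: -dryrun makes nvcc print the compilation steps without executing them, so you can see which front end is invoked for a 1.x target. The source file name below is a placeholder.)

# under CUDA 5.0 the listed steps still include nvopencc for sm_1x targets
nvcc -dryrun -arch=sm_13 kernel.cu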
Thanks for confirming that I didn’t miss any functionality changes. Use of nvvm with the sm_1x backend was never supported functionality, and the combination was untested. I am aware that the potential for serious code generation bugs with this combination was not merely theoretical.