Coding Guideline for CUDA: Make CUDA more readable!

It would be great if NVIDIA could release a coding guideline document for CUDA so that CUDA code becomes a bit more readable!

For example:

  1. You could follow Hungarian notation or whatever notation you like, BUT all variables must be prefixed by a qualifier that specifies where the variable resides:
    a) PARAM – Variable is a parameter (accessed with ld.param/st.param)
    b) SH – Variable resides in shared memory (ld.shared/st.shared)
    c) LL – Local variable (register access OR ld.local/st.local)
    d) GG – Global variable (ld.global/st.global)

This will help programmers identify performance bottlenecks when writing the code itself.

For example, I can easily say “GGintHello * GGintWorld” will take more time to execute than “SHintHello * SHintWorld”.

Also, this will let programmers define variables with the same name but in different memory spaces:
GGintHello, LLintHello, SHintHello etc…

“SHlpintHello” means that the variable is a POINTER and RESIDES in shared memory. It does NOT say anything about WHERE the POINTER points to.
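To make the proposal concrete, here is a minimal sketch of a kernel that applies the PARAM/SH/LL/GG prefixes listed above. The kernel, the helper runScale, and every variable name are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128

// GG: file-scope variable that resides in global memory.
__device__ int GGintScale = 3;

// PARAM: kernel parameters. PARAMlpintIn is a pointer held in parameter
// space; the prefix says nothing about the memory it points to.
__global__ void scaleKernel(const int *PARAMlpintIn, int *PARAMlpintOut,
                            int PARAMintCount)
{
    // SH: one tile per block in shared memory.
    __shared__ int SHintTile[BLOCK_SIZE];

    // LL: per-thread local variable, normally kept in a register.
    int LLintIndex = blockIdx.x * blockDim.x + threadIdx.x;

    if (LLintIndex < PARAMintCount)
        SHintTile[threadIdx.x] = PARAMlpintIn[LLintIndex];
    __syncthreads();

    if (LLintIndex < PARAMintCount)
        PARAMlpintOut[LLintIndex] = SHintTile[threadIdx.x] * GGintScale;
}

// Hypothetical host helper: scales 0..count-1 and returns out[index].
int runScale(int index)
{
    const int count = 256;
    int hostIn[count], hostOut[count];
    for (int i = 0; i < count; ++i) hostIn[i] = i;

    int *devIn = NULL, *devOut = NULL;
    cudaMalloc((void **)&devIn, sizeof(hostIn));
    cudaMalloc((void **)&devOut, sizeof(hostOut));
    cudaMemcpy(devIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice);

    const int blocks = (count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    scaleKernel<<<blocks, BLOCK_SIZE>>>(devIn, devOut, count);
    cudaMemcpy(hostOut, devOut, sizeof(hostOut), cudaMemcpyDeviceToHost);

    cudaFree(devIn);
    cudaFree(devOut);
    return hostOut[index];
}

int main()
{
    printf("out[7] = %d\n", runScale(7));
    return 0;
}
```

Reading the last line of the kernel, the prefixes alone show one shared-memory read, one global-memory read, and one global-memory write through a pointer parameter.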

Let us know what you guys think!!!

Best Regards,
Sarnath

Yeah, when I advise people on writing CUDA kernels, I always suggest that they adopt some sort of naming scheme for variables that makes it clear where things are being stored. The scheme I’ve used in a few cases is just prepending prefixes that indicate where the data actually resides:
c_foo -> constant memory
s_bar -> shared memory
g_baz -> global memory

I didn’t start doing this until I had a lot of CUDA kernels floating around and had to explain to people what they were doing and how they worked. Then it became immediately evident that naming variables more carefully would be an easy way to clarify the operation of the code.
That addresses much of what I typically need to see when reading an unfamiliar CUDA kernel…
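As a rough illustration of the c_/s_/g_ prefixes, a small kernel might look like the sketch below. The kernel affine, the helper runAffine, and the coefficient values are invented for this example; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64

// c_: constant memory
__constant__ float c_coeff[2];

// g_: pointers to data in global memory
__global__ void affine(const float *g_in, float *g_out, int n)
{
    // s_: shared memory
    __shared__ float s_tile[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        s_tile[threadIdx.x] = g_in[i];
    __syncthreads();

    if (i < n)
        g_out[i] = c_coeff[0] * s_tile[threadIdx.x] + c_coeff[1];
}

// Hypothetical host helper: computes 2*x + 1 for x = 0..n-1, returns out[index].
float runAffine(int index)
{
    const int n = 128;
    float in[n], out[n];
    const float coeff[2] = {2.0f, 1.0f};
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    float *devIn = NULL, *devOut = NULL;
    cudaMalloc((void **)&devIn, sizeof(in));
    cudaMalloc((void **)&devOut, sizeof(out));
    cudaMemcpy(devIn, in, sizeof(in), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_coeff, coeff, sizeof(coeff));

    affine<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(devIn, devOut, n);
    cudaMemcpy(out, devOut, sizeof(out), cudaMemcpyDeviceToHost);

    cudaFree(devIn);
    cudaFree(devOut);
    return out[index];
}

int main()
{
    printf("out[5] = %f\n", runAffine(5));
    return 0;
}
```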

John Stone

Also, using d_* for device variables and h_* for host variables in host code helps prevent errors. Every pointer you pass to a kernel needs to be a d_* variable, and the memcpy functions take both the h_name and the d_name variable.
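A minimal host-side sketch of the h_/d_ convention follows. The kernel addOne and the helper runAddOne are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void addOne(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] += 1.0f;
}

// Hypothetical helper: adds 1 to the values 0..n-1 and returns element `index`.
float runAddOne(int index)
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);  // h_: host memory
    float *d_data = NULL;                    // d_: device memory
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    cudaMalloc((void **)&d_data, bytes);

    // The copies pair the h_ and d_ variables of the same name.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);  // kernels only see d_ pointers
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    float result = h_data[index];
    cudaFree(d_data);
    free(h_data);
    return result;
}

int main()
{
    printf("h_data[10] = %f\n", runAddOne(10));
    return 0;
}
```

With this naming, passing h_data to the kernel or swapping the cudaMemcpy arguments stands out immediately when reading the code.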

I use basically the same scheme as tachyon_john for my kernel code. I didn’t do it for constants, but it might be worth changing that the next time I need to touch my code; even I sometimes wonder where something came from in long kernels.

Ah yes, the h_ and d_ are used in some of our codes too, forgot those, good catch!

John

This is getting better, guys. Thanks for your input.

Can someone from NVIDIA take this up and put together a small coding guideline for CUDA programmers?

When CUDA code starts spreading, there will be a need to understand code written by someone else, and so on…

Maybe we should add this to the wishlist.

Thanks for your time, guys.

I would recommend that you keep functions short enough that the type declaration is never so far from where it’s used that you can’t just look a few lines up to figure it out. I don’t think Hungarian notation adds value; it’s a net harm IMO. But this is likely to be Holy War territory.

I believe that for CPU code, most compilers are smart enough to inline functions where it is possible and worthwhile, especially where the “inline” hint is used. I would hope the same is true for GPU. Can anyone comment on observations for CUDA code? Should I expect worse performance (or greater register usage, etc.) from a bunch of “__device__ inline float foo(float bar)” functions than had I kept the logic for “foo()” inside the __global__ kernel?

My recollection is that CUDA inlines everything on the device side. So break your functions up as much as you need to for clarity.

Technically, that doesn’t really answer my question. The 1.1 Programming Guide does have content like this: “By default, a __device__ function is always inlined” that agrees with your statement. I’m not questioning whether nvcc will inline functions or not.

I guess I’m probably barking up the wrong tree, but there’s nothing that says that inlining is without any ill consequence to performance. For the record, I would think there wouldn’t be any consequence to performance. I was just looking to see if anyone could back it up with actual evidence. Something like, “I tried xyz algorithm with inlined __device__ calls and with the content in the body of the __global__ function and saw no difference in the profiler output, or no difference in the duration of the call.” Or someone authoritative from NVIDIA who could say, “yeah, the end result is the same between the two. The same code is generated.”
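One way to set up such a comparison is sketched below: the same arithmetic once behind a __device__ helper and once written directly in the kernel body. The kernels withHelper and handInlined and the helper countMismatches are hypothetical. The program only checks that both variants produce the same results; whether the generated code or the timing matches would still have to be checked separately, for example by comparing the output of nvcc -ptx for the two kernels or by timing them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Logic factored out into a __device__ function.
__device__ inline float foo(float bar)
{
    return bar * bar + 1.0f;
}

__global__ void withHelper(const float *d_in, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = foo(d_in[i]);
}

// The same logic kept inside the __global__ kernel.
__global__ void handInlined(const float *d_in, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_in[i] * d_in[i] + 1.0f;
}

// Hypothetical helper: runs both kernels and counts differing outputs.
int countMismatches()
{
    const int n = 1024;
    static float h_in[n], h_a[n], h_b[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = NULL, *d_a = NULL, *d_b = NULL;
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_a, sizeof(h_a));
    cudaMalloc((void **)&d_b, sizeof(h_b));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    withHelper<<<(n + 255) / 256, 256>>>(d_in, d_a, n);
    handInlined<<<(n + 255) / 256, 256>>>(d_in, d_b, n);

    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) ++mismatches;

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", countMismatches());
    return 0;
}
```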

Ask and ye shall receive.

I have one complicated kernel in my code where each block loops over 27 regions of memory, loads them into shared memory and then does a lot of processing on each region. I took the original function (no __device__ functions) and moved the part that loads a region into shared memory and processes it into a __device__ function, leaving just the loop in the __global__ kernel. There was no difference in performance.
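The structure described here (a loop in the __global__ kernel, with the load-and-process step in a __device__ function) could look roughly like the sketch below. This is not the actual kernel from the post: the names processRegion, sumRegions and runRegions, the region size, and the summation standing in for the "processing" are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define REGION_SIZE 64   // assumed: one thread per element of a region
#define NUM_REGIONS 27

// Loads one region into shared memory, then processes it.
// The "processing" here is just a sum over the region.
__device__ float processRegion(const float *g_region)
{
    __shared__ float s_region[REGION_SIZE];

    s_region[threadIdx.x] = g_region[threadIdx.x];
    __syncthreads();

    float sum = 0.0f;
    for (int j = 0; j < REGION_SIZE; ++j)
        sum += s_region[j];

    __syncthreads();  // finish reading before the next region overwrites s_region
    return sum;
}

// Only the loop over the regions stays in the __global__ kernel.
__global__ void sumRegions(const float *g_in, float *g_out)
{
    float total = 0.0f;
    for (int r = 0; r < NUM_REGIONS; ++r)
        total += processRegion(g_in + r * REGION_SIZE);
    g_out[threadIdx.x] = total;
}

// Hypothetical helper: all inputs are 1.0f, so each thread's total
// is the number of input elements.
float runRegions()
{
    const int n = NUM_REGIONS * REGION_SIZE;
    static float h_in[n];
    float h_out[REGION_SIZE];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sumRegions<<<1, REGION_SIZE>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[0];
}

int main()
{
    printf("total = %f\n", runRegions());
    return 0;
}
```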

Better yet, I unrolled the outer 27-iteration loop by hand, thus calling the same __device__ function 27 times with slightly different arguments. This version takes ~30 seconds to compile and produces a very large cubin file (because of all the inlined functions), but the performance actually improved. Note that the improvement came from reduced register usage: for the unrolled loop, I turned what had been loop indices held in registers into values in shared memory (each thread loops over the same 27 things).

If you really have a burning curiosity, the code I mentioned can be viewed here:

http://trac2.assembla.com/hoomd/browser/tr…rnel.cu?rev=958

(note that the versions I mentioned are commented out: a new and improved and much simpler version is active in the code now).

I am attaching a coding guideline to start with. If anyone is interested, you could add more to it.
CUDA_CodingGuidelines.doc (31 KB)

Well, I replaced several macros with functions and the performance stayed the same, but I would not promise that the code will always be the same.

E.g., I seem to remember from previous discussions that pointer arguments in particular are likely to cause issues despite the inlining, especially global vs. shared memory pointers, etc.