Implementation Questions Arising from Ch. 5 on Performance Guidelines in the Programming Guide 2.0

Hi,

I have a few implementation-related questions after going through Ch. 5 on Performance Guidelines in the Programming Guide 2.0 and playing around a bit with the Occupancy Calculator:

  1. From the Occupancy Calculator I could understand the meaning of “active thread blocks”, i.e. the number of thread blocks that can be scheduled at a time on a given multiprocessor, given the limited shared memory and registers per multiprocessor and the physical limits of the GPU hardware. I can understand the limitations due to shared memory and the GPU hardware, but not those due to registers. I understand that register spills are stored in local memory, so there should be some way for me to decide/specify how much to store in registers and how much in local memory, so that I can minimize register usage. I vaguely remember coming across a discussion in which this could be specified somewhere, but I can’t find it now. Can anyone please suggest?

  2. Quoting from the occupancy calculator:

Now I am confused: how do I figure out whether my code is bottlenecked by bandwidth, by computation, or by global memory accesses?

  3. This is regarding coalesced memory accesses. I understand that a memory access by a half-warp should be limited to within a “segment”, where segments start at address zero and are aligned to multiples of the segment size. Now, how do I make sure that the half-warp knows the start address of a segment? E.g. for 64B segments, the starting addresses of the different segments would be 0, 64, 128, 192 and so on, as in Figure 5-4 of the programming guide.

  4. Can I use __align__(16) for arrays too, or should it be used only with structs? Will using __align__(16) with large arrays help in reducing the number of load instructions when I move data from device to shared memory?

  5. Quoting from Section 5.3 of the programming guide:

What variable types are we talking about, device or shared? If intermediate data structures can be created and destroyed, am I correct in thinking that the creation and destruction should happen inside the kernel? If yes, then why was I not able to declare a new variable as __device__ type from inside the kernel? I could only ever declare a device-type variable in the global space (i.e. outside the CPU and GPU code, where we declare global variables in normal C code). If my understanding is wrong, can someone please paste a small code snippet to illustrate the right way to create and destroy such intermediate variables?

  6. Quoting from Section 5.1.1.1 of the programming guide:

I believe [(i%n) == (i&(n-1))] would be true for n = any multiple of 2, and for n = 1 (not just for powers of 2). I wanted to reconfirm, since I have a lot of modulo divisions, at almost every step in my code. Is there a faster way to do modulo division with any integer n?

Thanks & regards,

Aditi

No, it’s true only when n is a power of 2, not any multiple of 2. So it’s fine for 2, 4, 8, 16, 32, 64, etc.

But not, for example, for 10, which is a multiple of 2, yet i%10 != i&9.
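To spell it out (a minimal illustration, with i and n as placeholder names):

    r = i & (n - 1);   // same result as i % n, but only when n is a power of 2 (1, 2, 4, 8, ...)

For n = 10 the mask would be 9 (binary 1001), which simply drops the wrong bits: 10 % 10 is 0, but 10 & 9 is 8.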

Yes, there are great math tricks for doing faster moduli if the modulus is known ahead of time. For example, for mod 3 you can do something like

i%3 == ((i&3) + (i>>2)) % 3

which doesn’t look like a win until you realize that applying the fold repeatedly leaves only a handful of possible values, and you can use a table or a test to map them to the 3 appropriate values.

This huge topic is covered in an entire chapter of Hacker’s Delight.

Some compilers will do these tricks for you, especially for certain easy values like 3, 5, 7, 9, 15 and 17, but there are tricks in that book for most categories of modulus.
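To make the folding concrete, here is a minimal sketch for a full 32-bit unsigned integer (the function name is mine; the whole thing rests on 2^16, 2^8, 2^4 and 4 all being congruent to 1 mod 3):

    __device__ unsigned int mod3(unsigned int i)
    {
        // Adding the two halves of i preserves its value mod 3, because
        // the weight of the high half (2^16, 2^8, 2^4 or 4) is 1 mod 3.
        i = (i >> 16) + (i & 0xFFFFu);
        i = (i >>  8) + (i & 0xFFu);
        i = (i >>  4) + (i & 0xFu);
        i = (i >>  2) + (i & 3u);
        i = (i >>  2) + (i & 3u);   // i is now in [0, 6]
        if (i >= 6) i -= 6;         // fold the leftovers down to 0..2
        if (i >= 3) i -= 3;
        return i;
    }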

The __device__ variables reside in global memory; that might be the reason.

n should be a power of two; n = 1 works as well, since 2^0 = 1.

Check the -maxrregcount nvcc option (two r’s required, that is not a spelling mistake).
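For example (a sketch; kernel.cu is a placeholder file name):

    nvcc -maxrregcount=24 -Xptxas -v kernel.cu

-maxrregcount caps the number of registers per thread, spilling the rest to local memory, and -Xptxas -v makes ptxas print the actual register, shared and local memory usage per kernel, so you can see the effect of the cap.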

If you have written the kernel, you should know it. For example, if you have huge for loops fetching a lot of data from global memory, fetching more often than computing, then you know you have a global memory problem. If you are frequently doing sin(x), cos(x), tan(x), mul, sub, add, div, %, then most likely you are arithmetic-bound.

Often you need to find your latency by doing small experiments (reducing your data size and plotting it against performance). This is needed only for very fine tuning, though.

Usually, if you access memory in a coalesced way, stage data in shared memory for repeated computation, and minimize warp divergence and serialization, you WILL get good performance. (Serialization and un-coalesced accesses are killers.)
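As a minimal sketch of that pattern (the names and the trivial computation are mine):

    #define BLOCK 256

    __global__ void process(const float *d_in, float *d_out, int n)
    {
        __shared__ float tile[BLOCK];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced load: each half-warp reads 16 consecutive floats,
        // i.e. one aligned 64B segment.
        if (i < n)
            tile[threadIdx.x] = d_in[i];
        __syncthreads();

        // ...re-use tile[] from shared memory for the repeated computation...
        if (i < n)
            d_out[i] = tile[threadIdx.x] * 2.0f;   // placeholder computation
    }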

Mostly you will be accessing arrays, and cudaMalloc() takes care of the alignment (it always aligns to 256 bytes, IIRC). So if the start address of your array is aligned, you are set for life.

I don’t know. Read the compiler documentation.

You need to read it with the context given in the background. The context is that you should move more code into the GPU kernel instead of doing multiple kernel launches that require memcopies back and forth.

However, the guide fails to account for the fact that a kernel launch can still re-use the results of a previous kernel launch without actually copying them to host memory.

But there could be some occasions where you will be doing:

kernel_launch

copy results back to host

manipulate the result

kernel_launch with the new results

Now, this is what the guide is talking about. Instead of copying that intermediate result back to host memory, move the entire sequence into the GPU kernel.

The 3rd step (manipulate the result) can also happen within the GPU kernel.

The intermediate data structure is nothing but a buffer allocated by cudaMalloc() that will NEVER be copied to the host.
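A minimal host-side sketch of that pattern (kernel_a, kernel_b and the sizes are placeholders):

    float *d_tmp, *d_out;
    cudaMalloc((void**)&d_tmp, n * sizeof(float));   // the intermediate buffer
    cudaMalloc((void**)&d_out, n * sizeof(float));

    kernel_a<<<grid, block>>>(d_tmp, n);             // produce the intermediate result
    kernel_b<<<grid, block>>>(d_out, d_tmp, n);      // consume it directly on the GPU

    // Only the final result ever crosses the bus.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_tmp);                                 // “destroyed” without ever visiting the host
    cudaFree(d_out);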

Modulo is the slowest operator in CUDA, I think.

The fastest way to do it is to avoid doing it :-)

Hi.

A nice way to figure this out is to have a look at the IPC and bandwidth values calculated by the profiler. (WARNING: if your grid size is not a multiple of your number of MPs, these are unreliable.)

If IPC approaches 1.0, you are running into compute limits; if bandwidth approaches whatever bandwidthTest is reporting, that’s your limit.

If neither is high, you are stalling, which probably means non-coalesced accesses, branch divergence, bank conflicts or partition camping, as we’ve recently learned. ;)

Everything but camping should be directly visible in the profiler.

Well, you can use it with static arrays, but most of the time your arrays will be heap objects, which should be 16-byte aligned on most platforms anyway. If they are not, you will have to align them manually, as __align__ is a compile-time directive and will not change runtime behavior.
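For illustration (a sketch; Particle and the kernel are made up):

    // __align__(16) tells the compiler the struct starts on a 16-byte
    // boundary, so it can be fetched with a single 16-byte load.
    struct __align__(16) Particle {
        float x, y, z, w;
    };

    // For plain float arrays, reading through a 16-byte built-in type
    // such as float4 gives the same reduction in load instructions:
    __global__ void copy16(const float4 *d_in, float4 *d_out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            d_out[i] = d_in[i];   // one 128-bit load and one 128-bit store
    }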

Think about it: if you create a global-memory variable in a kernel, how is the memory management going to be handled, and by whom? Is that one variable per thread? One per block? One per kernel?

What you need to do is allocate a block of global memory and manage it yourself.

Just pass in an array, cast it to your struct type and knock yourself out. ;)
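Something like this (a sketch; Scratch and the kernel body are made up):

    // One “intermediate variable” per block, carved out of a single
    // cudaMalloc’d buffer that is never copied back to the host.
    struct Scratch {
        float partial;
        int   count;
    };

    __global__ void work(Scratch *scratch)
    {
        Scratch *mine = &scratch[blockIdx.x];   // this block’s slice
        if (threadIdx.x == 0) {                 // “create”: initialize it
            mine->partial = 0.0f;
            mine->count   = 0;
        }
        // ...use *mine as the intermediate data structure...
    }

    // Host side:
    //   Scratch *d_scratch;
    //   cudaMalloc((void**)&d_scratch, numBlocks * sizeof(Scratch));
    //   work<<<numBlocks, threadsPerBlock>>>(d_scratch);
    //   cudaFree(d_scratch);                   // “destroy”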

Not really. Depends on what you’re doing.

I’ve had some good success with using iterative updates. So if I were scanning through 2D-array slices based on linear addresses, I would not compute x = i%pitch; instead I would add the step width to x at each step and then subtract the pitch whenever the result was larger than or equal to the pitch. If your step width is larger than your pitch, treat a step as a vector addition in 2D, then cap x as before, adding to y in case of overflow. That’s a lot faster.
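A minimal sketch of the simple case, where the step width is smaller than the pitch (the names are mine):

    // Track the column x incrementally instead of recomputing x = i % pitch.
    int x = 0;
    for (int i = 0; i < n; i += step) {
        // ...do the work that needs data[i] and the column x...
        x += step;
        if (x >= pitch)
            x -= pitch;   // one compare + subtract replaces the modulo
    }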

I hope some of it helped. I’m in a bit of a hurry right now, so the implementations are left as an exercise to the reader. ;)

Thanks, everyone, for your replies. They were helpful. But now I am facing this problem:

I was trying to use the CUDA profiler. I am supposed to set the environment variable CUDA_PROFILE=1. To do that I tried a bunch of different things, alone and in combination: I changed it in Control Panel -> System Properties -> Advanced -> Environment Variables -> Add New (in both boxes); I changed it in nvcc.profile; I used CUDA_PROFILE=1 on the command line. But nothing seems to have helped: I have not gotten any .log or .csv file (in spite of setting CUDA_PROFILE_CSV=1 the same way) in any of the folders (project, CUDA, MATLAB; I searched the entire computer). I am using nvmex (CUDA + MATLAB). Is that what is causing it to not work? Does anyone know how I should get the CUDA profiler to work with nvmex? No amount of googling helped either :(

Thanks & regards,

Aditi
:(

You may need to open a new CMD prompt for the changes to be visible. They won’t be reflected in existing CMD prompts.

After launching the new CMD prompt, do “echo %CUDA_PROFILE%” and see if it prints 1. This will confirm whether the environment variable is really OK or not.
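For a quick test without touching the system settings, you can also set the variables for the current prompt only (anything you launch from that prompt inherits them, and they vanish when the window closes):

    set CUDA_PROFILE=1
    set CUDA_PROFILE_CSV=1
    echo %CUDA_PROFILE%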

This post discusses profiling CUDA+Matlab MEX files, at least for the visual profiler…maybe it will help?

Thanks for the help. I went through this post and various other posts discussing the same topic, and after a lot of trial and error I finally got it running!! Here are some tips for others who may stumble on this thread looking for an answer, feeling as miserable as I did a while ago!!

With Profiler 1.0

  1. CUDA_PROFILE and CUDA_PROFILE_CSV set to 1 in Control Panel -> System -> Advanced -> Environment Variables (in both boxes).
  2. Run the CUDA project using the Visual Profiler. For that, set the following in Visual Profiler -> Profile -> Session Settings:
    a) Launch: “C:/Program Files/MATLAB/R2008a/bin/win32/MATLAB.exe” (it took me a long time to figure out that I was supposed to use this and not C:/Program Files/MATLAB/R2008a/bin/MATLAB.exe; win32 made all the difference!!)
    b) Working Directory: your CUDA project directory
    c) Arguments: “-noprompt” OR “-noprompt -r <name of the .m file to run, without the .m extension>”
    d) Max Execution Time: set it long enough, e.g. 2000 secs
  3. In the Configuration: enable “Time Stamp” and at most 4 “Profile Counters” at any time, so do three runs to cover all the counters.
  4. Ignore Error -94.

With Profiler 1.1 and onwards:

  1. Same.
  2. a) Same.
    b) Same.
    c) Arguments: “-r <name of the .m file to run, without the .m extension>”
  3. Check everything → all parameters, time stamps, counters, extra parameters, etc. All are available in one go.
  4. Error -94 doesn’t appear.
  5. Keep closing every MATLAB session as it ends. If you wait for that to happen automatically, it never does.

BOOM!! You have all the results!!

It is always a good community gesture to share your findings in the forum. Good work!

It does if you put ‘exit’ at the end of your MATLAB script; that also makes it much easier to time.