Some advice needed, please: doubts we have as we're starting with CUDA programming

Hi everyone

We’re starting a project on CUDA technology. We’re trying to implement in CUDA a business process for the tourism industry which is parallelizable, but which also includes a lot of logic. By that I mean branching, looping, etc.

After reading perhaps half of the documentation and listening to a good deal of the webinars, I think we are starting to grasp the main concepts involved in effective GPU programming. But at this point we would also like to receive feedback to confirm whether our understanding of the architecture is right or whether we’re still missing something. We therefore have a bunch of questions, and we would appreciate it if someone could help us with some of them.

OK, here are the questions:

Regarding memory usage:

We understand that there are global, texture, local, and shared memory, as well as registers.

1-Shared memory and registers are hundreds of times faster than global, local or texture memory. Is that right?
2-Preloading data into shared memory is appropriate as long as the data is accessed more than once by any given thread. If that’s not the case, delaying the use of the data with an intermediate operation would be enough to mask the global memory access latency. Is that right?
3-Perhaps because we’re used to CPU programming, we are not sure we will be comfortable using texture memory. Are we OK with only using global memory? Also, for instance, in a Tesla 2050 card there are 6 GB of DRAM. Are all of those 6 GB global memory? We don’t understand the ratio between global, local and texture memory in DRAM.
4-What’s the difference between local and global memory? We know both are in DRAM and not cached, but that’s it.
5-To get good performance, there should be an initial data load from host to device global memory. Is that right?
6-Data coalescing. We understand that to get good memory performance we should be using SoA (structure of arrays). But we also understand that if the primitive struct size is less than or equal to 128 bits, we’re OK. For instance:

typedef struct {
    double net, commission;
} Price;

__global__ void kernel(Price *inputData);

and

typedef struct {
    double *net, *commission;
} PriceVector;

__global__ void kernel(PriceVector data);

are both good. Is that right?

7-We’re confused about what will happen with register and shared memory pressure. We understand that there are 32 KBytes (or is it KBits?) of shared memory on every scalar processor. Is that right? What will happen if we overflow the available memory, be it because of register or shared memory pressure? How do you monitor and design around that problem?

Now regarding data processing:
1-Our process involves branching and looping when designed on a CPU. When porting it to the GPU, we understand we should be preprocessing (possibly expanding) the input data as much as possible in order to avoid as much branching and looping as possible. Are we right?

2-If branching is unavoidable, short conditions will avoid divergent branching. For instance:
if (ConditionA)
    DoA();
else
    DoB();

This will have no performance penalty as long as ConditionA is short enough. Is that right?

3-We’ve seen that loop unrolling is recommended. We’re wondering whether this piece of code:
for (int i = 0, max = InputData; i < max; i++)
    DoSomething();
will have any performance impact if max is the same for all the threads in a warp.

4-As a general rule, GPUs are less efficient than CPUs at managing branching and looping, even if no divergence happens. Is that assumption right?

5-We anticipate that despite all our efforts, we will be forced to use divergent branching in some of our code. What is the exact performance impact of divergent branching? How can we code to minimize it?
6-Functions. Is there an especially severe performance penalty for using functions? We know CPU compilers do a very good job of optimizing function calls in release mode. Does the same happen with the GPU compiler and code?

I think that’s all.

Thanks to you all in advance.
Miquel

Global/texture memory is cached. If you’re accessing a memory address that’s in the cache, it is fast. Otherwise, it’s quite slow.

More or less. You need a lot of operations to mask global memory latency. Preloading to shared memory is not always necessary, because of caching (see above).

It’s all physically the same memory, all 6 GB of it. The same memory region can be accessed either as global or as texture. Depending on circumstances, one or the other method may be slightly faster. For simplicity, you could stick with global.

Local memory is basically a kludge that the compiler resorts to when you try to use too many variables in your kernel. It is global memory in disguise. The best advice is, “don’t use it.”

There are 16 to 48 KB of shared memory per multiprocessor (configurable at runtime) on the Tesla 2050. If you try to launch a kernel that requires more shared memory than you have, the kernel will fail to launch.
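
If it helps, here is roughly what that runtime configuration looks like. This is a minimal sketch of my own, not code from your project; the kernel name and the 256-element tile are made-up placeholders, and the kernel assumes it is launched with 256 threads per block.

#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out)
{
    __shared__ float tile[256];     // static shared memory, sized for 256 threads/block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * 2.0f;
}

void preferSharedMemory()
{
    // On Fermi, request the 48 KB shared / 16 KB L1 split for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}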

Yes, registers are very fast, and shared memory is basically just as fast, assuming you do not have bank conflicts in your access pattern. However, even with bank conflicts, shared memory is much faster than other kinds of memory.
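
For what it’s worth, the classic trick for avoiding bank conflicts is to pad a shared-memory tile. The transpose sketch below is a generic illustration (not your pricing code); it assumes a square matrix whose side is a multiple of TILE and a launch with 32x32 thread blocks.

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width)
{
    // Padding each row to TILE+1 floats shifts rows to different banks,
    // so reading tile[threadIdx.x][threadIdx.y] (a column) is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}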

That does help. The main way to mask global memory latency is to have enough active threads per multiprocessor. This is ultimately why the CUDA architecture encourages the use of many more threads than CUDA cores.

Global, local, texture and constant memory are stored in the DRAM, with no specific boundaries. Texture memory is global memory accessed through the special texture cache on the GPU. Similarly, there is another cache for constant memory that is optimized for broadcasting the same value to many threads. Local memory is memory allocated to each thread, whereas global memory can be accessed by all threads.
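
In code, the different spaces show up roughly like this. This is only an illustrative sketch: taxRate and the buffer sizes are invented, and the kernel assumes blocks of at most 128 threads.

__constant__ float taxRate;          // constant memory: cached, broadcast to all threads

__global__ void memorySpaces(const float *gIn, float *gOut)   // gIn/gOut point to global memory
{
    __shared__ float blockBuf[128];  // shared memory: visible to the whole block
    float scratch[64];               // large per-thread array: may be placed in
                                     // "local" memory, i.e. per-thread DRAM
    float r = gIn[threadIdx.x];      // plain scalars normally live in registers

    scratch[threadIdx.x % 64] = r * taxRate;
    blockBuf[threadIdx.x] = scratch[threadIdx.x % 64];
    __syncthreads();
    gOut[threadIdx.x] = blockBuf[threadIdx.x];
}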

You will not be able to use all of the memory on the CUDA device because the driver usually reserves a few hundred MB for its own use. That can be further reduced if you have OpenGL or a GUI desktop active on the CUDA device.
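
You can query how much is actually left at any moment with cudaMemGetInfo, e.g. something like:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    // Reports device memory available right now, after the driver
    // (and any display/OpenGL use) has taken its share.
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %zu MB, total: %zu MB\n", freeBytes >> 20, totalBytes >> 20);
    return 0;
}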

See above. Note that Fermi-class GPUs (like the Tesla C2050) include a 16/48 kB per-multiprocessor L1 cache and a per-GPU 384 or 768 kB L2 cache for all local/global memory accesses.

To get best performance, you want to minimize data transfer over the PCI-Express bus, as that is the slowest link in the GPU processing chain. If you can load all your data at the beginning and reuse it for many calculations, then that will work. The real goal is to maximize the ratio of arithmetic operations to bytes transferred.
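
As a sketch of that pattern (the kernel, names and sizes are made up): copy the data across PCI-Express once, launch the kernel many times against the resident copy, and copy the result back once.

#include <cuda_runtime.h>

__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 1.01f;            // stand-in for the real per-element work
}

void runManySteps(const float *hostData, float *hostResult, int n, int iterations)
{
    float *devData = 0;
    cudaMalloc(&devData, n * sizeof(float));

    // One transfer in over PCI-Express...
    cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    // ...many kernel launches that reuse the resident data...
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int it = 0; it < iterations; ++it)
        step<<<blocks, threads>>>(devData, n);

    // ...and one transfer back.
    cudaMemcpy(hostResult, devData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devData);
}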

Yes, that is true now. The underlying hardware can issue 32, 64 and 128-bit transactions per thread. If your data element is not one of those sizes, then loading it will require several instructions, and each instruction will result in a memory transfer with gaps. The L1 and L2 cache may mitigate some of the penalty here on Fermi devices.
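
Tied to your Price example, this is the kind of thing I mean (my own sketch, assuming the struct stays at two doubles): forcing 16-byte alignment lets each thread fetch an element with a single 128-bit transaction when the accesses are coalesced.

// Two doubles = 16 bytes; with 16-byte alignment each element can be read
// with one 128-bit transaction per thread instead of two 64-bit ones.
struct __align__(16) Price {
    double net;
    double commission;
};

__global__ void totalPrice(const Price *prices, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Price p = prices[i];          // single 128-bit read
        out[i] = p.net + p.commission;
    }
}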

Each multiprocessor has a fixed amount of shared memory (16 kB on pre-Fermi devices, 16 or 48 kB on Fermi). You can’t overflow it (except with a pointer bug), as the total shared memory each block needs is known at kernel launch time. If your kernel uses too much memory, it will fail to launch, and you will see an error return code on the next CUDA function call. (Always check these error codes, otherwise you will have no idea that your kernel is not running!)
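
A minimal sketch of the kind of checking I mean (the oversized dynamic shared memory request is deliberate, just to trigger the launch failure):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bigSharedKernel()
{
    extern __shared__ float buf[];   // size supplied at launch time
    buf[threadIdx.x] = threadIdx.x;
}

int main()
{
    // Deliberately request far more dynamic shared memory than the hardware has:
    bigSharedKernel<<<1, 32, 1024 * 1024>>>();

    cudaError_t err = cudaGetLastError();        // catches launch failures
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();               // catches errors during execution
    if (err != cudaSuccess)
        printf("execution failed: %s\n", cudaGetErrorString(err));
    return 0;
}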

The key issue is “warp divergence”. Threads are bundled into groups of 32 to make warps, and warps are the unit of instruction execution. (And this assignment of threads to warps is fixed. Threads 0-31 go to warp 0, 32-63 to warp 1, etc.) If at a branching point, like an if-statement or while condition, some threads go one direction and other threads go a different direction, then the warp will need to be executed twice to cover each code path. A warp cannot process faster, even if some of its threads are masked off, so throughput is now cut in half. More branches will make this problem worse. However, the hardware is pretty good about merging code paths again as soon as possible, so don’t worry too much about short branches where some threads need to do a little extra processing before rejoining the group.
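
A toy illustration of the difference (not from your code): a condition that is uniform within each warp never diverges, while a condition on the lane index splits every warp.

__global__ void branching(float *data)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;     // same value for all 32 threads of a warp
    int lane = threadIdx.x % 32;     // differs within a warp

    if (warp % 2 == 0)               // uniform per warp: no divergence
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;

    if (lane < 16)                   // splits every warp: both paths get executed
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}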

Similarly, looping is no problem at all as long as all threads in the warp complete the loop at approximately the same time.

Divergent branching always has a performance penalty, but you probably have some performance to burn. Most kernels are limited by memory bandwidth and not instruction throughput.

The benefit of loop unrolling depends entirely on the contents of DoSomething(). Also keep in mind that you can only unroll the loop if max is known at compile time, or if you tell the compiler to partially unroll it because you know that max is divisible by some integer.
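
For example, something like this (made-up kernel) can be fully unrolled because the trip count is a compile-time constant; "#pragma unroll 4" would ask for partial unrolling instead.

#define MAX_ITEMS 8        // known at compile time, so full unrolling is possible

__global__ void accumulate(const float *in, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * MAX_ITEMS;
    float sum = 0.0f;

    #pragma unroll
    for (int i = 0; i < MAX_ITEMS; ++i)
        sum += in[base + i];

    out[tid] = sum;
}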

It depends. GPUs lack the sophisticated branch prediction logic that CPUs use to mitigate the performance loss of a pipeline flush when the branch is mispredicted. That would be a problem if GPUs were single threaded devices. However, as long as you have enough threads per block active, a pipeline flush is seldom needed, even in the presence of branching. The pipeline will be unlikely to contain two instructions from the same warp, since it will be processing instructions from all the other available warps.

This depends entirely on what your branching looks like, and on how much other performance factors, like memory latency and bandwidth, outweigh any delays caused by branching.

No. CUDA by default inlines all function calls. Newer architectures actually can perform more traditional function calls with a stack, but that generally has to be requested deliberately.
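
To illustrate (the helper names are made up): a small __device__ function is inlined by default, and on Fermi you can explicitly opt out with __noinline__ if you ever want a real call.

// Inlined by default, so the call itself costs nothing extra:
__device__ float addCommission(float net, float commission)
{
    return net + commission;
}

// Opting out and forcing a real (non-inlined) call, rarely worth it:
__device__ __noinline__ float addCommissionCall(float net, float commission)
{
    return net + commission;
}

__global__ void pricing(const float *net, const float *commission, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = addCommission(net[i], commission[i]);
}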

As a general comment: Do not worry too much up front about branching as long as your overall approach is to apply roughly the same calculation in each thread. The emphasis on avoiding divergent branching is mostly to be sure people realize that CUDA programming is data-parallel and not task-parallel. You should worry a lot more about the number of global memory reads/writes you will need, and how coalesced these accesses will be. Device memory speed is often the bottleneck for CUDA programs.

Hello Seibert and Hamster

Thank you very much for your answers and the time you devoted to write them.

I have only one doubt regarding your last comment, Seibert, about global memory access and its impact on performance. We were under the impression that bringing data into shared memory is a must only if:

1-The data is read more than once or twice during the block’s lifetime.
2-The data is actually shared by all the threads in the block (in that case, if I understood correctly, there are no bank conflicts, as only one broadcast operation happens).

Is that right?

Best Regards
Miquel

Correct. My comment regarding memory performance is not about choosing between shared memory and global memory. It is a suggestion that you want to think about your algorithm design and how many times (and in what order) you read and write data in general. Most mid-to-high range CUDA devices have so much floating point power that they can process data faster than it can be read from global memory. Designing your algorithm to be efficient about memory reads and writes is often more important than worrying about branch divergence.

OK, understood. We agree completely. Success with CUDA performance comes from massive parallelism combined with data coalescing and wise use of shared memory.

Thank you very much again for your help.

Best regards

Hello,

I have a very simple kernel function:

if (threadIdx.x < WARP_SIZE/2)
    // do something
else
    // do something else

I called this kernel with N blocks, where the block size is 32.
I did this experiment on a Tesla M2050 (Fermi). When viewing the profiler results, the number of divergent branches came out with a value X.
When placing the same code within a for loop, like this:

for (int i = 0; i < 10; i++)
    if (threadIdx.x < WARP_SIZE/2)
        // do something
    else
        // do something else

one would expect the number of divergent branches to be 10*X; however, the profiler gave me the same number X!

On the other hand, doing the same experiment on a Quadro FX 3800M, the profiler there gave me what I was expecting: 10*X.

So either the profiler results are not correct, or the M2050 GPU (Fermi) dealt with the situation in a smart way…

Has anyone had a similar issue? Any feedback is greatly appreciated…

The compiler can choose to deal with branches in your code either by allowing divergence or by using predicated instructions. I don’t think a predicated instruction will show up in the divergence counter. You should compile with the --ptx flag to keep the PTX intermediate code and see what the compiler did.
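
For reference, a branch as short as the one below is a typical candidate for predication rather than a true branch (sketch only; whether it is actually predicated is up to the compiler, which is exactly why dumping the PTX is worth doing):

__global__ void shortBranch(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each side is a single cheap operation, so the compiler may emit
    // predicated instructions instead of a branch; the divergent-branch
    // counter would then not increase.
    if (threadIdx.x < 16)
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;
}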

Thank you.
Is there an option to disable branch predication?