Some advice needed, please: doubts we have as we're starting with CUDA programming

Hi everyone

We’re starting a project on CUDA technology. We’re trying to implement in CUDA a business process for the tourism industry which is parallelizable, but which also includes a lot of logic. By that I mean branching, looping, etc.

After reading perhaps half of the documentation and listening to a good deal of the webinars, I think we are starting to grasp the main concepts involved in effective GPU programming. But at this point we would also like to receive feedback to confirm whether our understanding of the architecture is right or whether we’re still missing something. We therefore have a bunch of questions, and we would appreciate it if someone could help us with some of them.

OK, here are the questions:

Regarding memory usage:

We understand that there are global, texture, local, and shared memory, as well as registers.

1-Shared memory and registers are hundreds of times faster than global, local or texture memory. Is that right?
2-Preloading data into shared memory is appropriate as long as the data is accessed more than once by any given thread. If that’s not the case, delaying the use of the data with an intermediate operation would be enough to mask the global memory access latency. Is that right?
3-Perhaps because we’re used to CPU programming, we are not sure we will be comfortable using texture memory. Are we OK with only using global memory? Also, for instance, in a Tesla 2050 card there are 6 GB of DRAM. Are all of those 6 GB global memory? We don’t understand the ratio between global, local and texture memory in DRAM.
4-What’s the difference between local and global memory? We know both are in DRAM and not cached, but that’s it.
5-To get good performance, there should be an initial data load from host to device global memory. Is that right?
6-Data coalescing. We understand that to get good memory performance we should be using SoA (structure of arrays). But we also understand that if the primitive struct size is less than or equal to 128 bits, we’re OK. For instance:

typedef struct {
    double net, commission;
} Price;

__global__ void kernel(Price *inputData);

and

typedef struct {
    double *net, *commission;
} PriceVector;

__global__ void kernel(PriceVector data);

are both good. Is that right?

7-We’re confused about what will happen with register and shared memory pressure. We understand that there are 32 KBytes (or is it KBits?) of shared memory on every scalar processor. Is that right? What will happen if we overflow the available memory, be it because of register or shared memory pressure? How do you monitor and design around that problem?

Now regarding data processing:
1-Our process involves branching and looping when designed on a CPU. When porting it to the GPU, we understand we should be preprocessing (possibly expanding) the input data as much as possible in order to avoid as much branching and looping as possible. Are we right?

2-If branching is unavoidable, short conditions will avoid divergent branching. For instance:
if (ConditionA)
    DoA();
else
    DoB();

This will have no performance penalty as long as ConditionA is short enough. Is that right?

3-We’ve seen that loop unrolling is recommended. We’re wondering whether this piece of code:
for (int i = 0, max = InputData; i < max; i++)
    DoSomething();
will have any performance impact if max is the same for all the threads in a warp.

4-As a general rule, GPUs are less efficient than CPUs at managing branching and looping, even if no divergence happens. Is that assumption right?

5-We anticipate that despite all our efforts, we will be forced to use divergent branching in some of our code. What is the exact performance impact of divergent branching? How can we code to minimize it?
6-Functions. Is there an especially severe performance penalty for using functions? We know CPU compilers do a very good job of optimizing function calls in release mode. Does the same happen with the GPU compiler and code?

I think that’s all.

Thanks to you all in advance.
Miquel

Global/texture memory is cached. If you’re accessing a memory address that’s in the cache, it is fast. Otherwise, it’s quite slow.

More or less. You need a lot of operations to mask global memory latency. Preloading to shared memory is not always necessary, because of caching (see above).

It’s all physically the same memory, all 6 GB of it. The same memory region can be accessed either as global or as texture. Depending on circumstances, one or the other method may be slightly faster. For simplicity, you could stick with global.

Local memory is basically a kludge that the compiler resorts to when you try to use too many variables in your kernel. It is global memory in disguise. The best advice is, “don’t use it.”

There are 16 to 48 KB of shared memory per multiprocessor (configurable at runtime) on the Tesla 2050. If you try to launch a kernel that requires more shared memory than you have, the kernel will fail to launch.
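
If it helps, here is roughly what that runtime configuration looks like. This is a minimal sketch of my own, not code from your project; the kernel name and the 256-element tile are made-up placeholders, and the kernel assumes it is launched with 256 threads per block.

#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out)
{
    __shared__ float tile[256];     // static shared memory, sized for 256 threads/block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * 2.0f;
}

void preferSharedMemory()
{
    // On Fermi, request the 48 KB shared / 16 KB L1 split for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}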

Yes, registers are very fast, and shared memory is basically just as fast, assuming you do not have bank conflicts in your access pattern. However, even with bank conflicts, shared memory is much faster than other kinds of memory.
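
For what it’s worth, the classic trick for avoiding bank conflicts is to pad a shared-memory tile. The transpose sketch below is a generic illustration (not your pricing code); it assumes a square matrix whose side is a multiple of TILE and a launch with 32x32 thread blocks.

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width)
{
    // Padding each row to TILE+1 floats shifts rows to different banks,
    // so reading tile[threadIdx.x][threadIdx.y] (a column) is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}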

That does help. The main way to mask global memory latency is to have enough active threads per multiprocessor. This is ultimately why the CUDA architecture encourages the use of many more threads than CUDA cores.

Global, local, texture and constant memory are stored in the DRAM, with no specific boundaries. Texture memory is global memory accessed through the special texture cache on the GPU. Similarly, there is another cache for constant memory that is optimized for broadcasting the same value to many threads. Local memory is memory allocated to each thread, whereas global memory can be accessed by all threads.
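
In code, the different spaces show up roughly like this. This is only an illustrative sketch: taxRate and the buffer sizes are invented, and the kernel assumes blocks of at most 128 threads.

__constant__ float taxRate;          // constant memory: cached, broadcast to all threads

__global__ void memorySpaces(const float *gIn, float *gOut)   // gIn/gOut point to global memory
{
    __shared__ float blockBuf[128];  // shared memory: visible to the whole block
    float scratch[64];               // large per-thread array: may be placed in
                                     // "local" memory, i.e. per-thread DRAM
    float r = gIn[threadIdx.x];      // plain scalars normally live in registers

    scratch[threadIdx.x % 64] = r * taxRate;
    blockBuf[threadIdx.x] = scratch[threadIdx.x % 64];
    __syncthreads();
    gOut[threadIdx.x] = blockBuf[threadIdx.x];
}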

You will not be able to use all of the memory on the CUDA device because the driver usually reserves a few hundred MB for its own use. That can be further reduced if you have OpenGL or a GUI desktop active on the CUDA device.
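
You can query how much is actually left at any moment with cudaMemGetInfo, e.g. something like:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    // Reports device memory available right now, after the driver
    // (and any display/OpenGL use) has taken its share.
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %zu MB, total: %zu MB\n", freeBytes >> 20, totalBytes >> 20);
    return 0;
}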

See above. Note that Fermi-class GPUs (like the Tesla C2050) include a 16/48 kB per-multiprocessor L1 cache and a per-GPU 384 or 768 kB L2 cache for all local/global memory accesses.

To get best performance, you want to minimize data transfer over the PCI-Express bus, as that is the slowest link in the GPU processing chain. If you can load all your data at the beginning and reuse it for many calculations, then that will work. The real goal is to maximize the ratio of arithmetic operations to bytes transferred.
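
As a sketch of that pattern (the kernel, names and sizes are made up): copy the data across PCI-Express once, launch the kernel many times against the resident copy, and copy the result back once.

#include <cuda_runtime.h>

__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 1.01f;            // stand-in for the real per-element work
}

void runManySteps(const float *hostData, float *hostResult, int n, int iterations)
{
    float *devData = 0;
    cudaMalloc(&devData, n * sizeof(float));

    // One transfer in over PCI-Express...
    cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    // ...many kernel launches that reuse the resident data...
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int it = 0; it < iterations; ++it)
        step<<<blocks, threads>>>(devData, n);

    // ...and one transfer back.
    cudaMemcpy(hostResult, devData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devData);
}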

Yes, that is true now. The underlying hardware can issue 32, 64 and 128-bit transactions per thread. If your data element is not one of those sizes, then loading it will require several instructions, and each instruction will result in a memory transfer with gaps. The L1 and L2 cache may mitigate some of the penalty here on Fermi devices.
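
Tied to your Price example, this is the kind of thing I mean (my own sketch, assuming the struct stays at two doubles): forcing 16-byte alignment lets each thread fetch an element with a single 128-bit transaction when the accesses are coalesced.

// Two doubles = 16 bytes; with 16-byte alignment each element can be read
// with one 128-bit transaction per thread instead of two 64-bit ones.
struct __align__(16) Price {
    double net;
    double commission;
};

__global__ void totalPrice(const Price *prices, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Price p = prices[i];          // single 128-bit read
        out[i] = p.net + p.commission;
    }
}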

Each multiprocessor has a fixed amount of shared memory (16 kB on pre-Fermi devices, 16 or 48 kB on Fermi). You can’t overflow it (except with a pointer bug), as the total shared memory each block needs is known at kernel launch time. If your kernel uses too much memory, it will fail to launch, and you will see an error return code on the next CUDA function call. (Always check these error codes, otherwise you will have no idea that your kernel is not running!)
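
A minimal sketch of the kind of checking I mean (the oversized dynamic shared memory request is deliberate, just to trigger the launch failure):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bigSharedKernel()
{
    extern __shared__ float buf[];   // size supplied at launch time
    buf[threadIdx.x] = threadIdx.x;
}

int main()
{
    // Deliberately request far more dynamic shared memory than the hardware has:
    bigSharedKernel<<<1, 32, 1024 * 1024>>>();

    cudaError_t err = cudaGetLastError();        // catches launch failures
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();               // catches errors during execution
    if (err != cudaSuccess)
        printf("execution failed: %s\n", cudaGetErrorString(err));
    return 0;
}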

The key issue is “warp divergence”. Threads are bundled into groups of 32 to make warps, and warps are the unit of instruction execution. (And this assignment of threads to warps is fixed. Threads 0-31 go to warp 0, 32-63 to warp 1, etc.) If at a branching point, like an if-statement or while condition, some threads go one direction and other threads go a different direction, then the warp will need to be executed twice to cover each code path. A warp cannot process faster, even if some of its threads are masked off, so throughput is now cut in half. More branches will make this problem worse. However, the hardware is pretty good about merging code paths again as soon as possible, so don’t worry too much about short branches where some threads need to do a little extra processing before rejoining the group.
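
A toy illustration of the difference (not from your code): a condition that is uniform within each warp never diverges, while a condition on the lane index splits every warp.

__global__ void branching(float *data)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;     // same value for all 32 threads of a warp
    int lane = threadIdx.x % 32;     // differs within a warp

    if (warp % 2 == 0)               // uniform per warp: no divergence
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;

    if (lane < 16)                   // splits every warp: both paths get executed
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}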

Similarly, looping is no problem at all as long as all threads in the warp complete the loop at approximately the same time.

Divergent branching always has a performance penalty, but you probably have some performance to burn. Most kernels are limited by memory bandwidth and not instruction throughput.

The benefit of loop unrolling depends entirely on the contents of DoSomething(). Also keep in mind that you can only unroll the loop if max is known at compile time, or if you tell the compiler to partially unroll it because you know that max is divisible by some integer.
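
For example, something like this (made-up kernel) can be fully unrolled because the trip count is a compile-time constant; "#pragma unroll 4" would ask for partial unrolling instead.

#define MAX_ITEMS 8        // known at compile time, so full unrolling is possible

__global__ void accumulate(const float *in, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * MAX_ITEMS;
    float sum = 0.0f;

    #pragma unroll
    for (int i = 0; i < MAX_ITEMS; ++i)
        sum += in[base + i];

    out[tid] = sum;
}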

It depends. GPUs lack the sophisticated branch prediction logic that CPUs use to mitigate the performance loss of a pipeline flush when the branch is mispredicted. That would be a problem if GPUs were single threaded devices. However, as long as you have enough threads per block active, a pipeline flush is seldom needed, even in the presence of branching. The pipeline will be unlikely to contain two instructions from the same warp, since it will be processing instructions from all the other available warps.

This depends entirely on what your branching looks like, and on how much other performance factors, like memory latency and bandwidth, outweigh any delays caused by branching.

No. CUDA by default inlines all function calls. Newer architectures actually can perform more traditional function calls with a stack, but that generally has to be requested deliberately.
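
To illustrate (the helper names are made up): a small __device__ function is inlined by default, and on Fermi you can explicitly opt out with __noinline__ if you ever want a real call.

// Inlined by default, so the call itself costs nothing extra:
__device__ float addCommission(float net, float commission)
{
    return net + commission;
}

// Opting out and forcing a real (non-inlined) call, rarely worth it:
__device__ __noinline__ float addCommissionCall(float net, float commission)
{
    return net + commission;
}

__global__ void pricing(const float *net, const float *commission, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = addCommission(net[i], commission[i]);
}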

As a general comment: Do not worry too much up front about branching as long as your overall approach is to apply roughly the same calculation in each thread. The emphasis on avoiding divergent branching is mostly to be sure people realize that CUDA programming is data-parallel and not task-parallel. You should worry a lot more about the number of global memory reads/writes you will need, and how coalesced these accesses will be. Device memory speed is often the bottleneck for CUDA programs.

Hello Seibert and Hamster

Thank you very much for your answers and the time you devoted to write them.

I have only one doubt regarding your last comment, Seibert, about global memory access and its impact on performance. We were under the impression that bringing data into shared memory is a must only if:

1-The data is read more than once or twice during the block’s lifetime.
2-The data is actually shared by all the threads in the block (in that case, if I understood correctly, there are no bank conflicts, as only one broadcast operation happens).

Is that right?

Best Regards
Miquel

Correct. My comment regarding memory performance is not about choosing between shared memory and global memory. It is a suggestion that you want to think about your algorithm design and how many times (and in what order) you read and write data in general. Most mid-to-high range CUDA devices have so much floating point power that they can process data faster than it can be read from global memory. Designing your algorithm to be efficient about memory reads and writes is often more important than worrying about branch divergence.

OK, understood. We agree completely. Success with CUDA performance comes from massive parallelism combined with data coalescing and wise use of shared memory.

Thank you very much again for your help.

Best regards

Hello,

I have a very simple kernel function:

if (threadIdx.x < WARP_SIZE/2)
    // do something
else
    // do something else

I called this kernel with N blocks, where the block size is 32.
I did this experiment on a Tesla M2050 (Fermi). When viewing the profiler results, the number of divergent branches came out with a value X.
When placing the same code within a for loop, like this:

for (int i = 0; i < 10; i++)
    if (threadIdx.x < WARP_SIZE/2)
        // do something
    else
        // do something else

one would expect the number of divergent branches to be 10*X; however, the profiler gave me the same number X!

On the other hand, doing the same experiment on a Quadro FX 3800M, the profiler there gave me what I was expecting: 10*X.

So either the profiler results are not correct, or the M2050 GPU (Fermi) dealt with the situation in a smart way…

Has anyone had a similar issue? Any feedback is greatly appreciated…

The compiler can choose to deal with branches in your code either by allowing divergence or by using predicated instructions. I don’t think a predicated instruction will show up in the divergence counter. You should compile with the --ptx flag to keep the PTX intermediate code and see what the compiler did.
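
For reference, a branch as short as the one below is a typical candidate for predication rather than a true branch (sketch only; whether it is actually predicated is up to the compiler, which is exactly why dumping the PTX is worth doing):

__global__ void shortBranch(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each side is a single cheap operation, so the compiler may emit
    // predicated instructions instead of a branch; the divergent-branch
    // counter would then not increase.
    if (threadIdx.x < 16)
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;
}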

Thank you.
Is there an option to disable branch predication?