Hi everyone
We’re starting a project on CUDA technology. We’re trying to implement on CUDA a business process for the tourism industry which is parallelizable, but which also includes a lot of logic. By that I mean branching, looping, etc.
After reading perhaps half of the documentation and listening to a good deal of the webinars, I think we are starting to grasp the main concepts of effective GPU programming. At this point, though, we would like some feedback to confirm whether our understanding of the architecture is right or we’re still missing something. We therefore have a bunch of questions, and we would appreciate it if someone could help us with some of them.
OK, here are the questions:
Regarding memory usage:
We understand that there are global, texture, local, and shared memory, plus registers.
1-Shared memory and registers are hundreds of times faster than global, local or texture memory. Is that right?
2-Preloading data into shared memory is appropriate as long as the data is accessed more than once by any given thread. If that’s not the case, delaying the use of the data with an intermediate operation would be enough to mask the global memory access latency. Is that right?
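Just to make question 2 concrete, this is the kind of preloading we have in mind (a minimal sketch, assuming the block size equals TILE and the grid exactly covers the input; the neighbour averaging is only a placeholder for real work):

#define TILE 256

__global__ void smoothTile(float *out, const float *in)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];          // one global read per thread
    __syncthreads();                    // tile is fully populated from here on

    // The shared copy pays off because each thread reads more than one
    // element of the tile (its two neighbours, with wrap-around).
    float left  = tile[(threadIdx.x + TILE - 1) % TILE];
    float right = tile[(threadIdx.x + 1) % TILE];
    out[i] = 0.5f * (left + right);
}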
3-Perhaps because we’re used to CPU programming, we are not sure we will be comfortable using texture memory. Are we OK with only using global memory? Also, for instance, in a Tesla C2050 card there are 6 GB of DRAM. Are all of those 6 GB global memory? We don’t understand the ratio between global, local, and texture memory in DRAM.
4-What’s the difference between local and global memory? We know both are in DRAM and not cached, but that’s it.
5-To get good performance, there should be a single initial load of the data from host to device global memory, so that kernels then work entirely out of device memory. Is that right?
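This is the pattern we are assuming for that (a minimal sketch, no error checking; scalePrices and runOnDevice are just placeholder names):

#include <cuda_runtime.h>

__global__ void scalePrices(float *prices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) prices[i] *= 1.1f;   // placeholder for the real business logic
}

void runOnDevice(float *hPrices, int n)
{
    float *dPrices;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&dPrices, bytes);                                  // device global memory
    cudaMemcpy(dPrices, hPrices, bytes, cudaMemcpyHostToDevice);  // one bulk transfer up front

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scalePrices<<<blocks, threads>>>(dPrices, n);                 // kernel reuses resident data

    cudaMemcpy(hPrices, dPrices, bytes, cudaMemcpyDeviceToHost);  // copy results back once
    cudaFree(dPrices);
}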
6-Data coalescing. We understand that to get good memory performance we should be using SoA (structure of arrays). But we also understand that if the primitive struct size is less than or equal to 128 bits, we’re OK. For instance:
typedef struct {
    double net, commission;
} Price;

__global__ void kernel(Price *inputData);
and
typedef struct {
    double *net, *commission;
} PriceVector;

__global__ void kernel(PriceVector data);
are both good. Is that right?
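To spell out what we mean by those two layouts, we picture them being consumed roughly like this (a sketch; the kernel names are made up, and we assume PriceVector’s members point into device memory):

// AoS: each thread reads one 16-byte (128-bit) Price struct.
__global__ void kernelAoS(Price *inputData, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        inputData[i].net += inputData[i].commission;   // one struct per thread
}

// SoA: each warp reads consecutive doubles from two separate arrays.
__global__ void kernelSoA(PriceVector data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data.net[i] += data.commission[i];             // fully coalesced accesses
}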
7-We’re confused about what will happen with register and shared memory pressure. We understand that there are 32 KBytes (or is it Kbits?) of shared memory on every scalar processor. Is that right? What will happen if we overflow that available memory, be it because of register or shared memory pressure? How do you guys monitor and design around that problem?
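For what it’s worth, our current plan is to compile with nvcc --ptxas-options=-v (which reports the registers and shared memory each kernel uses) and to query the hardware limits at runtime, along these lines:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // limits of device 0

    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:     %d\n",        prop.regsPerBlock);
    printf("max threads per block:   %d\n",        prop.maxThreadsPerBlock);
    return 0;
}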
Now regarding data processing:
1-Our process involves branching and looping when designed for the CPU. When porting it to the GPU, we understand we should be preprocessing (possibly expanding) the input data as much as possible in order to avoid as much branching and looping as we can. Are we right?
2-If branching is unavoidable, keeping the branches short will avoid divergent branching. For instance:
if (ConditionA)
    DoA();
else
    DoB();
This will have no performance penalty as long as the two branches are short enough. Is that right?
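Related to that, when both branch bodies are trivial we were thinking a conditional select could replace the branch entirely. Is something like this sketch (names made up) what the compiler’s predication does anyway?

__global__ void selectPrice(double *out, const double *a, const double *b,
                            const int *useA, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Both operands are available; the ternary should compile to a
        // select / predicated instruction rather than a divergent branch.
        out[i] = useA[i] ? a[i] : b[i];
    }
}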
3-We’ve seen that loop unrolling is recommended. We’re wondering if this piece of code:

for (int i = 0, max = InputData; i < max; i++)
    DoSomething();

will have any performance impact if max is the same for all the threads in a warp.
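The kind of unrolling we have seen recommended is the compiler pragma, something like this sketch (the trip count and the summing body are placeholders, and the layout is kept simple rather than tuned for coalescing):

__global__ void sumSixteen(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    #pragma unroll            // trip count is a compile-time constant, so nvcc can fully unroll
    for (int k = 0; k < 16; k++)
        sum += in[i * 16 + k];   // each thread sums its own 16 consecutive values
    out[i] = sum;
}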
4-As a general rule, GPUs are less efficient than CPUs when managing branching and looping, even if no divergence happens. Is that assumption right?
5-We anticipate that despite all our efforts, we will be forced to use divergent branching in some of our code. What is the exact performance impact of divergent branching? How can we code to minimize it?
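One technique we picked up from the Best Practices Guide (please correct us if we misread it) is to make the branch granularity a multiple of the warp size, so whole warps take the same path:

__global__ void warpUniformBranch(float *out, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int warpId = threadIdx.x / warpSize;   // same value for all 32 threads of a warp
    if (warpId % 2 == 0)                   // whole warp takes one path: no divergence
        out[i] = a[i] + b[i];
    else
        out[i] = a[i] * b[i];
}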
6-Functions. Is there an especially severe performance penalty for using functions? We know CPU compilers do a very good job of optimizing function calls in release mode. Does the same happen with the GPU compiler and code?
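For example, we are assuming that small __device__ helpers like the one below get inlined by nvcc (and that __forceinline__ can be used as an explicit hint); the names here are made up:

// Small helper; we believe nvcc generally inlines __device__ functions like this.
__forceinline__ __device__ double finalPrice(double net, double commission)
{
    return net + commission;
}

__global__ void priceKernel(double *out, const double *net,
                            const double *commission, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = finalPrice(net[i], commission[i]);   // no call overhead once inlined
}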
I think that’s all.
Thanks to you all in advance.
Miquel