Doubts related to CUDA

Hello,
I am fairly new to CUDA and have some doubts that I have not been able to find answers to. Please help me with these questions:

  1. How does the number of blocks running simultaneously per SM affect performance? As far as I know, at the hardware level threads are executed in warps. So at any one time only threads of one block can run on an SM, no matter how many resources are available to accommodate more blocks, and having more blocks running simultaneously doesn’t seem to increase performance. For instance, suppose I have a kernel that can be written in two ways: in one, each block uses about 14 kB of shared memory, so each SM can hold only one block; in the other, each block uses only 5 kB of shared memory, so an SM can hold two blocks. Which way is better for performance?

  2. What is the maximum size of texture memory that can be used for general-purpose computation?

I’m new to CUDA too and haven’t had the chance to start writing any code yet so don’t take my opinions seriously :unsure:

  1. To me, it seems the idea of a block doesn’t have much to do with performance, only with memory access patterns. Maybe it helps you divide and organise your threads in a clear way (with all the x, y and z indexes). My unreliable suggestion is that, where performance is concerned, you just figure out the best number of threads per SM without thinking too much about the block size. Also, since you mentioned shared memory, it might be better if you could squeeze everything into registers. According to some article on developers.nvidia.com, there is indeed a difference between the access width of registers and shared memory.

  2. I have read loads of stuff over the last two days, and from what I can recall, a kernel can have a maximum of 128 texture objects. Besides, every 4 SMs form a Graphics Processing Cluster (GPC), which has a 32 KB texture cache shared among the 4 SMs, so effectively each SM gets around 8 KB of texture cache. As for the total size of texture memory allowed, I somehow have the impression that it’s 128 KB, which is very likely not the correct answer. :biggrin: Let’s wait for someone else to answer that!

Also, just a reminder: the texture cache only speeds up memory access when your access pattern is not coalesced. I don’t remember which document I read this in, but it said that the texture cache only reduces the demand for global memory bandwidth, meaning it would still take around 400 cycles to load from texture memory.

I hope someone can point out if there’s anything wrong with what I said :smile:

Thanks for your reply.

  1. Registers are very limited and I have a large amount of data to load. I have the following options:

a. Load all of the data into texture memory (and it turns out that even that is not enough, as my data size is more than 20 kB). I am not sure how to do this.

b. I have two data chunks: one of 5 kB and another of around 20 kB. Would it be better to use 5 kB of shared memory for the smaller chunk and the remaining space for 10 kB of the bigger chunk, or to use the complete 16 kB of shared memory for the bigger chunk and read the 5 kB chunk from global memory? Which would be better, and why?

  2. Frankly speaking, I haven’t clearly understood the concept of texture memory (what exactly is it?? :argh:). When is it beneficial to use it? I have a huge array that every thread on the GPU needs. Would it be advisable to load this array into texture memory, or is there any other suggestion for this?

I think I gave you some wrong information. Texture cache should be 12KB/4SM.

Looks like you’re using compute capability 1.x, because 2.x can have up to 48 KB of shared memory per SM.

“On devices of compute capability 1.x, some kernels can achieve a speedup when using (cached) texture fetches rather than regular global memory loads (e.g., when the regular loads do not coalesce well). Unless texture fetches provide other benefits such as address calculations or texture filtering (Section 5.3.2.5), this optimization can be counter-productive on devices of compute capability 2.0, however, since global memory loads are cached in L1 and the L1 cache has higher bandwidth than the texture cache.”
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_FermiTuningGuide.pdf

In my understanding, the texture cache is kind of like a separate global memory with its own memory interface. Access to it does not have to be coalesced, which makes it easier to use for certain forms of random memory access. Since it’s just another global memory, its latency is long. But since it has a separate memory interface, access to it does not compete with access to the main global memory. So when your global memory access is already very busy, you could consider using the texture cache.

In all, do not use the texture cache unless:

  1. You’re using uncoalesced memory access.
  2. Your global memory bandwidth is already at its limit.

Actually 2.x has cache for constant memory too. But since you’re using 1.x, I guess this doesn’t help.

Just try greater thread-level parallelism to hide the latency of global memory access?

I have another doubt about the first point, i.e. the number of threads per SM.

  1. At any one point only one warp is executing on an SM, so why does it matter how many threads per SM there are? Even with more threads, the SM will still execute one warp and then the next. I was going through the video lecture on the CUDA website, and the lecturer mentions that if we increase the number of registers per thread, we may end up with fewer blocks per SM. In his example, every thread uses 10 registers and a block has 256 threads, so a block needs 256 × 10 = 2560 registers. G80 hardware supports 8192 registers, so 3 blocks can run per multiprocessor. But if each thread uses 11 registers, a block needs 256 × 11 = 2816 registers. Now 8192 < 2816 × 3 = 8448, so only two blocks can run per SM. My question, again: why does it matter how many blocks can run together per SM, when in hardware they won’t actually be executing together? Correct me if my understanding is wrong somewhere.
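
For readers following the lecture example, the register arithmetic can be sketched in a few lines of host-side C++. This is only an illustration: `blocksPerSmByRegisters` is a hypothetical helper, not a CUDA API, and it ignores register allocation granularity and the other per-SM limits (shared memory, maximum blocks and warps) that real hardware also applies.

```cpp
#include <cassert>

// Hypothetical helper (not part of CUDA): number of blocks that fit on one
// multiprocessor when the register file is the only limiting resource.
// Assumes registers are allocated exactly, with no rounding granularity.
int blocksPerSmByRegisters(int regsPerSm, int threadsPerBlock, int regsPerThread) {
    assert(regsPerSm >= 0 && threadsPerBlock > 0 && regsPerThread > 0);
    int regsPerBlock = threadsPerBlock * regsPerThread;  // e.g. 256 * 10 = 2560
    return regsPerSm / regsPerBlock;                     // integer division rounds down
}
```

With the lecture's numbers, `blocksPerSmByRegisters(8192, 256, 10)` gives 3 and `blocksPerSmByRegisters(8192, 256, 11)` gives 2.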

There is one key reason why you want far more threads than CUDA cores active on a multiprocessor: memory latency. In CPU code, oversubscribing your processors with extra threads is helpful to increase throughput if compute threads find themselves often waiting for disk or network I/O, which is very high latency. CUDA uses the same idea, but applied to the on-board global memory.

For large streaming workloads, the overhead of “wait for data from memory, then run instruction” would kill performance unless your entire working set fit into a low latency cache. However, if you can make the overhead of switching threads essentially zero (which is true for CUDA, but not true on a CPU), then you can use a large number of threads to keep the compute units busy while other threads wait for memory accesses to finish. Overall throughput goes up, even though every individual thread still has to wait for data from memory.
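
A toy model makes the effect concrete. Assume, purely for illustration, that each warp alternates between a few cycles of arithmetic and a long wait on global memory, and that switching between warps is free. The function names and the 4-cycle compute figure are assumptions; the 400-cycle latency is the rough figure quoted earlier in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Toy model, not a hardware simulator. Each warp repeats: computeCycles of
// work, then latencyCycles of waiting on memory. Warp switching costs nothing.
// Returns the fraction of cycles in which some warp has work to issue.
double issueUtilization(int activeWarps, double computeCycles, double latencyCycles) {
    assert(activeWarps >= 0 && computeCycles > 0.0 && latencyCycles >= 0.0);
    double busy = activeWarps * computeCycles / (computeCycles + latencyCycles);
    return std::min(1.0, busy);
}

// Smallest number of warps that keeps the issue slot fully busy in this model.
int warpsToHideLatency(double computeCycles, double latencyCycles) {
    assert(computeCycles > 0.0 && latencyCycles >= 0.0);
    return static_cast<int>(std::ceil((computeCycles + latencyCycles) / computeCycles));
}
```

In this model a single warp keeps the issue slot busy about 1% of the time (4 out of 404 cycles), and `warpsToHideLatency(4, 400)` returns 101. A kernel that does more arithmetic per load needs proportionally fewer warps.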

Can anyone help me understand what it means for a number of blocks to be running simultaneously? If in the end everything runs at warp level, then what is the meaning of multiple blocks running simultaneously on an SM?

In that context, “running” really means “active”. Each multiprocessor has dedicated scheduling and context-switching hardware that can manage the local state and resources of multiple blocks simultaneously. Even though instructions are being executed from only one or two warps at a given clock cycle (the latter applies to Fermi, where warps are dual-issued), the multiprocessor is free to context switch between different active blocks to maintain instruction throughput, because there is considerable pipeline and memory latency in the architecture. Each hardware generation’s multiprocessor design has a limit on the total number of blocks and warps it can manage simultaneously, and how many will be active at a time depends mostly on the per-block register and shared memory consumption. The ratio of active warps to the total the hardware can handle is usually referred to as “occupancy”. There is a discussion of it in the programming guide, and a spreadsheet is available that calculates the occupancy for a given kernel and generation of hardware.
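
The occupancy calculation described above can be approximated in a few lines of host-side C++. This is a simplified sketch, not the logic of NVIDIA's spreadsheet: it ignores allocation granularity, and the struct and function names are made up for illustration. The per-SM limits are supplied by the caller; the values in the usage note (8 blocks, 24 warps, 8192 registers, 16 KB of shared memory) are assumed for a G80-class device, so check them against the programming guide for your hardware.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits, supplied by the caller (hypothetical struct).
struct SmLimits {
    int maxBlocks;    // maximum resident blocks
    int maxWarps;     // maximum resident warps
    int registers;    // size of the register file
    int sharedBytes;  // shared memory in bytes
};

// Per-block resource usage of one kernel (hypothetical struct).
struct KernelUsage {
    int threadsPerBlock;
    int regsPerThread;
    int sharedBytesPerBlock;
};

constexpr int kWarpSize = 32;

int warpsPerBlock(const KernelUsage& k) {
    assert(k.threadsPerBlock > 0);
    return (k.threadsPerBlock + kWarpSize - 1) / kWarpSize;  // round up
}

// Active blocks per SM: the tightest of the four limits wins.
int activeBlocksPerSm(const SmLimits& sm, const KernelUsage& k) {
    int byWarps = sm.maxWarps / warpsPerBlock(k);
    int regsPerBlock = k.threadsPerBlock * k.regsPerThread;
    int byRegs = regsPerBlock > 0 ? sm.registers / regsPerBlock : sm.maxBlocks;
    int byShared = k.sharedBytesPerBlock > 0 ? sm.sharedBytes / k.sharedBytesPerBlock
                                             : sm.maxBlocks;
    return std::min({sm.maxBlocks, byWarps, byRegs, byShared});
}

// Occupancy: active warps divided by the maximum the SM can hold.
double occupancy(const SmLimits& sm, const KernelUsage& k) {
    assert(sm.maxWarps > 0);
    int activeWarps = activeBlocksPerSm(sm, k) * warpsPerBlock(k);
    return static_cast<double>(activeWarps) / sm.maxWarps;
}
```

With `SmLimits{8, 24, 8192, 16384}` and a 256-thread block using 10 registers per thread, 14 KB of shared memory per block gives one active block (occupancy 8/24), while a hypothetical 7 KB per block gives two (occupancy 16/24).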

Thanks a lot for your reply.

I have two data sets, of 5 kB and 9 kB, which I load into shared memory. So in this case I can have only one block per multiprocessor. Should I try to reduce this data so that more blocks fit, or is one block per SM also OK?

I am also facing another problem. In my kernel I load the 5 kB data first, but when I load the other 9 kB data, my 5 kB data gets corrupted. I am not sure why this is happening, because my GPU has 16 kB of shared memory, so it should be able to hold both. Any idea what could be going wrong?
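
The kernel is not shown, so the cause can only be guessed at. One common source of this symptom, assuming the two arrays are declared as separate `extern __shared__` arrays, is that all such declarations start at the same address, so the second load overwrites the first. The usual remedy is to declare one buffer and place the arrays at explicit offsets. Other candidates are an out-of-range index or a missing `__syncthreads()` between the loads and the first reads. The host-side C++ below only sketches the offset arithmetic; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Byte offsets for two arrays carved out of one buffer (hypothetical struct).
struct SharedLayout {
    std::size_t firstOffset;
    std::size_t secondOffset;
    std::size_t totalBytes;
};

// Round n up to the next multiple of alignment.
constexpr std::size_t alignUp(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) / alignment * alignment;
}

// Place the second array directly after the first, respecting alignment.
SharedLayout layoutTwoArrays(std::size_t firstBytes, std::size_t secondBytes,
                             std::size_t alignment) {
    assert(alignment > 0);
    SharedLayout layout;
    layout.firstOffset = 0;
    layout.secondOffset = alignUp(firstBytes, alignment);
    layout.totalBytes = layout.secondOffset + secondBytes;
    return layout;
}

// True if byte ranges [aOff, aOff + aBytes) and [bOff, bOff + bBytes) intersect.
bool regionsOverlap(std::size_t aOff, std::size_t aBytes,
                    std::size_t bOff, std::size_t bBytes) {
    return aOff < bOff + bBytes && bOff < aOff + aBytes;
}
```

For 5 kB and 9 kB with 16-byte alignment, the second array starts at byte 5120 and the total is 14336 bytes, which fits in 16 kB. Two regions that both start at offset 0 overlap, which is the aliasing case described above.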
