SM, shared memory size, constant memory size?

I use a GTX960 to program CUDA C code.

Now the device query shows the GTX940M has 5 SMs (multiprocessors).

According to some sources, one SM (multiprocessor) has 48KB shared memory, 16KB L1 cache, and 64KB constant memory.

Now I have 2 questions:

  1. When I compile the code and run the program, will the GTX960 use all 5 SMs (multiprocessors) to run the CUDA program by default, or just one of the 5 SMs?

If it only starts 1 SM, then in order to speed up, which compiler command or setting should I use to make the other SMs run at the same time?

  2. If one SM (multiprocessor) has 48KB shared memory, 16KB L1 cache, and 64KB constant memory, then the GTX940M should have a total of 48KB*5=240KB shared memory, 16KB*5=80KB L1 cache, and 64KB*5=320KB constant memory.

    When I declare a 100KB constant variable, the compiler shows an error (out of range for constant memory size). How can I place a 100KB constant variable across the different SMs' constant memory? Do I need some compiler setting? Or am I wrong, and the GTX960M only has a total of 64KB constant memory for all 5 SMs (multiprocessors)?

I use VS2017 and CUDA 9.0 for debugging…

When you launch a kernel on the GPU, as long as your grid contains as many blocks as there are SMs (or more), all SMs will be in use. You can assume that to first order, blocks in the grid are distributed roughly equally across the available SMs. For typical CUDA programs, you would want to use hundreds of blocks in your launch configuration, that way your code will scale well from the smallest to the biggest GPUs.
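As a minimal sketch of that point (the kernel name and sizes here are made up for illustration), a launch configuration with many more blocks than SMs lets the hardware spread the work across all of them automatically:

```cuda
#include <cstdio>

// Hypothetical element-wise kernel; each block handles one 256-thread tile.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // guard against the last, partially-filled block
}

int main(void)
{
    const int n = 1 << 20;               // 1M elements
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Thousands of blocks: the compute work distributor spreads them
    // over all SMs on its own; no per-SM compiler setting exists or is needed.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;   // 4096 blocks here
    scale<<<blocks, threads>>>(d_x, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```

The same binary then scales from a small GPU to a large one, because more SMs simply drain the block queue faster.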

Constant memory is limited to 64 KB; this is documented. At the hardware level, constant memory is split into various banks of different sizes. Only one of the banks is exposed to programmers as constant memory. When you disassemble GPU machine code with cuobjdump --dump-sass, you can see constant memory references of the form c[bank][index]. E.g.

/*00b0*/                   XMAD.MRG R8, R9, c[0x0] [0x14].H1, RZ;
/*00b8*/                   FFMA R5, R4, c[0x2][0x0], R5;
/*01a8*/                   IADD.X R3, R11, c[0x0][0x14c];
/*01b8*/              @!P0 ICMP.LE R5, RZ, c[0x2][0x20], R4;

Thanks juffa,

Regarding shared memory… now the device query shows the GTX940M has 5 SMs (multiprocessors), and one SM has 48KB shared memory. Then the GTX940M with 5 SMs has a total of 48KB*5=240KB shared memory, right? But each SM can only see its own shared memory? Is that right?

Correct. Note that depending on the GPU architecture, a single thread block may not be able to utilize all the shared memory available per SM.

GTX940M is a GM108 and GTX960 is a GM107. Both are compute capability 5.0.

The sizes that you have mentioned above are incorrect for CC 5.0.

The values are available at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

Maximum amount of shared memory per multiprocessor: 64 KB
Maximum amount of shared memory per thread block: 48 KB

Constant memory size: 64 KB
Cache working set per multiprocessor for constant memory: 8 KB
Cache working set per multiprocessor for texture memory: between 12 KB and 48 KB

The cache size is listed as a working set for texture and constant memory because there are multiple caches per SM. It is very common that both caches have almost identical content, in which case the working set size is smaller than what is listed.

In terms of SM count, the GTX940M has 3 SMs and the GTX960 has 5 SMs.
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

To answer the questions above.

(1) The compute work distributor will distribute blocks across all SMs. For most GPUs this is done in a breadth-first distribution.

(2) If you have 100 KB of read-only data, you will probably see a performance improvement by moving to read-only accesses via global memory instead of using constant memory. Constant memory is limited to 64 KB. Constant memory is only fast if all threads in the warp access the same address and the working set is small (e.g. kernel parameters).
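A sketch of the two options described above (all names and sizes here are illustrative): data within the 64 KB limit can live in `__constant__` memory, while a larger read-only table stays in global memory and is marked `const` plus `__restrict__` so the compiler can route loads through the read-only data path:

```cuda
// Fits in the 64 KB constant bank; fast when all threads of a warp
// read the same address in the same cycle.
__constant__ float coeffs[1024];                     // 4 KB, OK

// A 100 KB table exceeds the 64 KB constant limit, so it is kept in
// global memory. const + __restrict__ lets the compiler use the
// read-only cache path (the effect of __ldg() on CC 3.5 and later).
__global__ void lookup(const float *__restrict__ table,   // 25600 floats = 100 KB
                       float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[i % 25600] * coeffs[i % 1024];
}
```

The table would be allocated with cudaMalloc and filled with cudaMemcpy on the host side, like any other global-memory buffer.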

The problem is that I did not assign which SM… I just use a __shared__ declaration in the code. How does the compiler know which SM this shared memory block is in?

Then how can I assign X1 (30KB) shared memory for SM1 and X2 (20KB) shared memory for SM2?

It seems you have misconceptions about how the GPU works. I would suggest spending more time reading through the CUDA documentation (the Programming Guide in particular), and also systematically working through the example apps that ship with CUDA, starting with the simplest ones. There are probably some useful free online video lectures you could use to get up to speed, but I don't have a good overview, so I can't point you to any.

Maybe I used too many confusing words. In the last question, SM means the 5 SMs (multiprocessors).

In your information above, the maximum amount of shared memory per multiprocessor is 64 KB.

Is any other command or setting needed, besides the __shared__ declaration, in order to access the total 64KB*5=320KB shared memory? Of course I will take care that each block only accesses a maximum of 48KB shared memory.

Each thread block runs on some SM and can access only that SM's shared memory. You don't need to ensure that, and you cannot control it in any way. Multiple thread blocks running on the same SM simultaneously share the 64 KB shared memory area. So if your thread block uses, e.g., 20 KB, the GPU can run no more than 3 thread blocks per SM simultaneously. All thread blocks in a single grid use the same amount of shared memory, so you can't allocate 20 KB for one block but 30 KB for another block without more sophisticated techniques (e.g. merging those two blocks into a single one using 20+30 KB, or running multiple grids simultaneously).

Since the API doesn't allow allocating more than 48 KB per block anyway, it's usually better to run multiple blocks per SM using 10-20 KB per block rather than one or two larger blocks, so try hard to reduce shared memory usage per block - you can relocate data into registers and the L1/L2/texture/constant caches. Read about the SHFL instruction - it allows reading data from the registers of adjacent threads in a warp, and the register pool is as large as 256 KB per SM (!). It also has roughly 5x lower latency than reading from shared memory or the L1 cache.
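A typical use of the shuffle instruction mentioned above is a warp-level sum that needs no shared memory at all. A sketch, assuming CUDA 9.0 or later (which the question says is in use), where the `_sync` variants of the intrinsics were introduced:

```cuda
// Sum `val` across the 32 threads of a warp using only registers.
__device__ float warp_sum(float val)
{
    // Each step folds the upper half of the active lanes onto the
    // lower half, halving the number of distinct partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // after the loop, lane 0 holds the full warp sum
}
```

A block-level reduction can then combine one partial sum per warp, needing only 32 floats of shared memory instead of one per thread.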

I also suggest reading a CUDA book to get a proper picture of the GPU architecture.