How to run the same program on different numbers of SMs?

Hello everyone,

Could you help me design an experiment to test one program (for example, a vector adder) running on one SM, 2 SMs, and 4 SMs?

I want to measure the execution time with different numbers of SMs. In my view, if I use 2 SMs to run the vector adder, it should take half the time compared to using one SM. How should I design this experiment?

That’s definitely possible, but it’s not how things are usually arranged on a GPU.

Just as an example: one of the top NVIDIA GPUs, the GTX 1080, has 2560 ALUs. If you implement a naive vector addition kernel that performs one addition per thread, the total number of threads in the grid equals the number of array elements. If the number of elements is less than 2560, that is a very small amount of work. More likely, each kernel call will add a million elements or more. This means the kernel call will start and finish a million threads during its execution. Of course, with 2560 ALUs, a million threads can’t all execute simultaneously, so new threads are started on the same cores as older threads finish.

That’s the standard arrangement: each kernel call has more jobs (i.e. a larger grid) than there are ALUs, and each job is run by a separate thread, so most of the time ALL SMs are occupied. Only at the end of execution are there some tail threads that occupy only part of the SMs, and programmers try to minimize this tail effect in various ways.
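To put rough numbers on the tail effect, one could ask the runtime how many blocks fit on the device at once and divide the grid size by that. This is only a sketch under assumptions: the kernel `vaddNaive`, the helper `wavesNeeded`, and the block size of 256 are illustrative choices, not something from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical naive kernel: one addition per thread.
__global__ void vaddNaive(float *c, const float *a, const float *b, int n) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: how many "waves" of blocks are needed when only
// `resident` blocks fit on the GPU at the same time (ceiling division).
int wavesNeeded(long long totalBlocks, long long resident) {
  return (int)((totalBlocks + resident - 1) / resident);
}

int main() {
  const int n = 1 << 20;      // about a million elements
  const int blockSize = 256;  // illustrative choice

  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    printf("no usable CUDA device\n");
    return 1;
  }

  // Maximum number of blocks of this kernel that can be resident on one SM.
  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vaddNaive, blockSize, 0);
  long long resident = (long long)blocksPerSM * prop.multiProcessorCount;
  if (resident <= 0) {
    printf("occupancy query failed\n");
    return 1;
  }

  long long totalBlocks = (n + blockSize - 1) / blockSize;
  printf("%d SMs, %d blocks per SM, %lld blocks resident at once\n",
         prop.multiProcessorCount, blocksPerSM, resident);
  printf("%lld blocks in the grid -> about %d waves\n",
         totalBlocks, wavesNeeded(totalBlocks, resident));
  return 0;
}
```

If the last wave is only partly full, that is the tail: some SMs sit idle while the remaining blocks finish.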

Making a kernel call that occupies only part of the SMs is possible (i.e. the entire grid will be in the tail zone), but it usually doesn’t make much sense, since it slows down execution.

In the other thread, SPWorley already gave you a hint about structuring this: use a grid-striding loop.

Here’s a fully worked example:

$ cat
#include <stdio.h>
const int ds=1024*1024*32;

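// sz is unused in the kernel body; it only gives each launch a distinct name in the profiler output
template <int sz>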
__global__ void vadd(float *c, const float *a, const float *b, const int dsize){

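  int idx= threadIdx.x+blockDim.x*blockIdx.x;
  // grid-stride loop: each thread handles idx, then idx + (total threads in the grid), and so on
  while (idx < dsize){
    c[idx] = a[idx] + b[idx];
    idx += gridDim.x*blockDim.x;}
}
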
int main(){

  float *a, *b, *c;
  cudaMalloc(&c, ds*sizeof(float));
  cudaMalloc(&a, ds*sizeof(float));
  cudaMalloc(&b, ds*sizeof(float));

  vadd<0><<<ds/1024,1024>>>(c, a, b, ds); // warm-up
  vadd<1><<<1,1024>>>(c, a, b, ds);
  vadd<2><<<2,1024>>>(c, a, b, ds);
  vadd<4><<<4,1024>>>(c, a, b, ds);
  vadd<8><<<8,1024>>>(c, a, b, ds);
  vadd<16><<<16,1024>>>(c, a, b, ds);
  cudaDeviceSynchronize(); // wait for all kernels to finish before the program exits
}
$ nvcc -arch=sm_61 -o t352
$ nvprof ./t352
==26500== NVPROF is profiling process 26500, command: ./t352
==26500== Profiling application: ./t352
==26500== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 45.67%  12.013ms         1  12.013ms  12.013ms  12.013ms  void vadd<int=1>(float*, float const *, float const *, int)
 24.23%  6.3734ms         1  6.3734ms  6.3734ms  6.3734ms  void vadd<int=2>(float*, float const *, float const *, int)
 13.40%  3.5241ms         1  3.5241ms  3.5241ms  3.5241ms  void vadd<int=4>(float*, float const *, float const *, int)
  7.88%  2.0726ms         1  2.0726ms  2.0726ms  2.0726ms  void vadd<int=8>(float*, float const *, float const *, int)
  4.86%  1.2795ms         1  1.2795ms  1.2795ms  1.2795ms  void vadd<int=16>(float*, float const *, float const *, int)
  3.95%  1.0385ms         1  1.0385ms  1.0385ms  1.0385ms  void vadd<int=0>(float*, float const *, float const *, int)

==26500== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 90.48%  323.45ms         3  107.82ms  557.40us  322.33ms  cudaMalloc
  7.36%  26.294ms         1  26.294ms  26.294ms  26.294ms  cudaDeviceSynchronize
  1.16%  4.1646ms       364  11.441us     298ns  509.09us  cuDeviceGetAttribute
  0.89%  3.1969ms         4  799.23us  771.42us  828.77us  cuDeviceTotalMem
  0.08%  299.38us         4  74.844us  68.688us  89.177us  cuDeviceGetName
  0.02%  56.046us         6  9.3410us  4.8850us  27.819us  cudaLaunch
  0.00%  6.4090us        24     267ns     138ns  2.5990us  cudaSetupArgument
  0.00%  6.0590us        12     504ns     306ns  1.4260us  cuDeviceGet
  0.00%  3.6390us         3  1.2130us     397ns  2.4440us  cuDeviceGetCount
  0.00%  2.3460us         6     391ns     207ns  1.1340us  cudaConfigureCall

In each kernel call, the amount of work (the size of the vectors to be added) is the same. However, when we launch only 1 threadblock, the duration is the longest, and it decreases by about half each time we double the number of threadblocks. We can safely assume that the block distributor will generally distribute blocks to “empty” SMs first, so as we go from 1 to 2 to 4 blocks in the grid, we can assume that we are engaging 1, 2, and 4 SMs respectively.
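The assumption that each block lands on its own SM can be checked rather than taken on trust. The following is a minimal sketch, assuming inline PTX is acceptable: one thread per block reads the `%smid` special register and records it. The names `currentSm`, `recordSm`, and `countDistinct` are illustrative and not part of the program above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM the calling thread is running on (PTX special register %smid).
__device__ unsigned int currentSm() {
  unsigned int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;
}

// Thread 0 of every block records which SM its block was placed on.
__global__ void recordSm(unsigned int *smOfBlock) {
  if (threadIdx.x == 0) smOfBlock[blockIdx.x] = currentSm();
}

// Host helper: count how many different values appear in v[0..n-1].
int countDistinct(const unsigned int *v, int n) {
  int distinct = 0;
  for (int i = 0; i < n; i++) {
    bool seen = false;
    for (int j = 0; j < i; j++)
      if (v[j] == v[i]) seen = true;
    if (!seen) distinct++;
  }
  return distinct;
}

int main() {
  const int nBlocks = 4;
  unsigned int h[nBlocks];
  unsigned int *d = nullptr;
  if (cudaMalloc(&d, nBlocks * sizeof(unsigned int)) != cudaSuccess) {
    printf("cudaMalloc failed (no usable CUDA device?)\n");
    return 1;
  }

  recordSm<<<nBlocks, 1024>>>(d);
  // cudaMemcpy on the default stream waits for the kernel to finish first.
  cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);

  for (int i = 0; i < nBlocks; i++)
    printf("block %d ran on SM %u\n", i, h[i]);
  printf("%d block(s) used %d distinct SM(s)\n", nBlocks, countDistinct(h, nBlocks));

  cudaFree(d);
  return 0;
}
```

SM IDs are not guaranteed to be contiguous, so only the number of distinct values is meaningful here, not the values themselves.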

The GPU you run on here will definitely matter. I am running this on a Pascal Titan X. If you run on a very small GPU with only 1 or 2 SMs, the performance pattern may be different.
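Before interpreting the timings on your own card, it may help to check how many SMs it has. Here is a small sketch using `cudaDeviceGetAttribute`; the helper `maxPow2Blocks` is a hypothetical name, and it simply reports the largest power-of-two block count that still leaves at most one block per SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: largest power of two that does not exceed smCount (smCount >= 1).
int maxPow2Blocks(int smCount) {
  int blocks = 1;
  while (blocks * 2 <= smCount) blocks *= 2;
  return blocks;
}

int main() {
  int smCount = 0;
  if (cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0) != cudaSuccess
      || smCount < 1) {
    printf("could not query device 0\n");
    return 1;
  }
  printf("device 0 has %d SMs\n", smCount);
  printf("doubling the block count is only expected to help up to about %d blocks\n",
         maxPow2Blocks(smCount));
  return 0;
}
```

Beyond that block count, additional blocks have to share SMs, so the "half the time per doubling" pattern should not be expected to continue in the same way.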

Thank you very much for your help, but I am new to CUDA. Could you give me a program for Visual Studio?

chickennight, I suggest you try to create a dummy project in Visual Studio first. You’ll find some on the net. There’s lots of information out there, for example:

Come back and ask questions if you’re having difficulty with anything specific. It’s going to be difficult to upload a Visual Studio project for you, even if anyone is willing to do that.
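If running nvprof is inconvenient, the same experiment can be sketched as a single source file that times each launch with CUDA events and prints the results itself. This is a minimal sketch, not the program posted above: the template parameter is dropped, the helper `timeLaunch` is a hypothetical name, and the assumption is that the file is pasted into the `.cu` file of an otherwise empty CUDA project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int ds = 1024*1024*32;

// Grid-stride vector add, as in the example earlier in the thread.
__global__ void vadd(float *c, const float *a, const float *b, int dsize) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  while (idx < dsize) {
    c[idx] = a[idx] + b[idx];
    idx += gridDim.x * blockDim.x;
  }
}

// Hypothetical helper: run one launch with `blocks` threadblocks and
// return its duration in milliseconds, measured with CUDA events.
float timeLaunch(int blocks, float *c, const float *a, const float *b) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  vadd<<<blocks, 1024>>>(c, a, b, ds);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // wait until the kernel and the stop event are done
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  float *a, *b, *c;
  if (cudaMalloc(&a, ds*sizeof(float)) != cudaSuccess ||
      cudaMalloc(&b, ds*sizeof(float)) != cudaSuccess ||
      cudaMalloc(&c, ds*sizeof(float)) != cudaSuccess) {
    printf("cudaMalloc failed\n");
    return 1;
  }
  // Give the inputs defined contents; the values do not matter for timing.
  cudaMemset(a, 0, ds*sizeof(float));
  cudaMemset(b, 0, ds*sizeof(float));

  timeLaunch(ds/1024, c, a, b);  // warm-up launch, result ignored
  for (int blocks = 1; blocks <= 16; blocks *= 2)
    printf("%2d block(s): %.3f ms\n", blocks, timeLaunch(blocks, c, a, b));

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}
```

Event timings include a little launch overhead, so the numbers will not match profiler output exactly, but the trend from 1 to 16 blocks should be comparable.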