How to run with more than 65535 blocks?

Hi,
I call the kernel with kernelfun<<<102400,1024>>>, but the launch is simply skipped and the kernel never runs.
When I instead divide the task into 1024 parts and launch each part with <<<100,1024>>>, it works well.

I used deviceQuery to check my GPUs.
There are two GPUs in the desktop, a GTX780 and a GT610, and I use the GT610 for display.

For GTX780, the maximum grid size is <2147483647,65535,65535>.
For GT610, the maximum grid size is <65535,65535,65535>.
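
For reference, these limits can be queried at runtime. Below is a minimal standalone sketch (the file layout and printf formatting are just illustrative), assuming the CUDA runtime API:

    // Illustrative sketch: print each device's grid-size limits and compute capability
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, max grid size <%d,%d,%d>, compute capability %d.%d\n",
                   d, prop.name,
                   prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2],
                   prop.major, prop.minor);
        }
        return 0;
    }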

I expected it to automatically choose the GTX780 to run the program, so there should be no problem running with more than 65535 blocks.

How can I fix this problem?

Thanks.

I forget how NVIDIA decides on device ordering, but it is definitely not automatic; I believe it chooses device 0 on your machine unless you explicitly tell it otherwise. You have a few options:

  1. Hide the GT610 via the environment variable CUDA_VISIBLE_DEVICES: http://www.resultsovercoffee.com/2011/02/cudavisibledevices.html

  2. Use code like this, which in your case would work, since GTX780 has a larger number of SMs:

    // Selects the card with the largest number of multiprocessors.
    // (Assumes the CUDA runtime API and <cstdio> are available for printf.)
    int num_devices = 0;
    int device;
    cudaGetDeviceCount(&num_devices);
    if (num_devices == 0) {
        printf("No CUDA-capable device found!\n\n");
    }
    if (num_devices > 1) {
        int max_multiprocessors = 0, max_device = 0;
        for (device = 0; device < num_devices; device++) {
            cudaDeviceProp properties;
            cudaGetDeviceProperties(&properties, device);
            if (max_multiprocessors < properties.multiProcessorCount) {
                max_multiprocessors = properties.multiProcessorCount;
                max_device = device;
            }
        }
        cudaSetDevice(max_device);
    }
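
A small follow-up check (purely illustrative, assuming printf is available) to confirm which device actually ended up active, using cudaGetDevice and cudaGetDeviceProperties:

    // After cudaSetDevice(max_device): confirm which device is active
    int current = 0;
    cudaGetDevice(&current);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, current);
    printf("Using device %d: %s (%d SMs)\n", current, prop.name, prop.multiProcessorCount);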

Thank you for your reply.

I’ve copied your code into my program, but it still does not work.

Also, device 0 on my desktop is in fact the GTX780.

I tried hiding the GT610; deviceQuery now shows only one GPU, but the kernel launch is still skipped.

It is not enough just to choose the GTX780 device; you also have to specify the compute capability in the compile flags. Use -arch=sm_30.

By the way, what cc is the GT 610? Is it 2.0 or 3.0? (Is it a Kepler card or a Fermi card?)

Thank you for your reply.

I’m working with CUDA + VS2012. How do I set this option?
I found that under Project Properties → CUDA C/C++ → Device → Code Generation the default is compute_10,sm_10, and there is no other selection available. I typed compute_30,sm_35 into it, but it still does not work.

I also checked Project Properties → CUDA C/C++ → Command Line and don’t see anything like -arch=sm_30 there. Is there something else I need to do to apply the sm_35 setting?

And the cc of the GT610 is 2.1.

Type compute_30,sm_30 or compute_35,sm_35 into that same option.

I tried both settings; neither works.

After I apply those settings, I don’t see any change under CUDA C/C++ → Command Line. Did those settings take effect?

When I click the triangle on the right side of the field, there are no selections like compute_** or sm_**; I have to type the value into the field manually. Is that a problem?

When you say ‘not work’, do you mean you can’t run a launch larger than, say, kernelfun<<<65535,1024>>>? It may also be that you are exceeding some other GPU resource that is not related to the launch bounds.

Also look in the build output after you compile to see whether -gencode=arch=compute_30,code="sm_30,compute_30" is being passed to NVCC; it will not show up in the project Command Line settings.
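
As a rough sketch of that kind of check (reusing kernelfun and the launch configuration from this thread as stand-ins, and assuming printf is available), querying the error state right after the launch shows whether the launch itself was rejected, typically with an invalid-configuration error, or whether something failed while the kernel ran:

    // Illustrative error check around the kernel launch
    kernelfun<<<102400, 1024>>>(/* kernel arguments */);
    cudaError_t launchErr = cudaGetLastError();       // error from the launch itself (e.g. bad configuration)
    cudaError_t syncErr   = cudaDeviceSynchronize();  // errors that surface while the kernel runs
    if (launchErr != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(launchErr));
    if (syncErr != cudaSuccess)
        printf("Runtime error: %s\n", cudaGetErrorString(syncErr));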