How to run with more than 65535 blocks?

I call the kernel function with kernelfun<<<102400,1024>>>, but the launch is simply skipped and the kernel function never runs.
Then I divided my task into 1024 parts, so that each part only needs <<<100,1024>>>, and it works well.

However, I used deviceQuery to check my GPUs.
There are two GPUs in the desktop, one GTX780 and one GT610, and I use the GT610 for display.

For GTX780, the maximum grid size is <2147483647,65535,65535>.
For GT610, the maximum grid size is <65535,65535,65535>.

The program should automatically choose the GTX780, so there should be no problem running with more than 65535 blocks.

How can I fix this problem?


I forget how NVIDIA decides on device ordering, but it is definitely not automatic; I believe it chooses device 0 on your machine unless you explicitly tell it otherwise. You have a few options:

  1. Hide the GT610 via the environment variable CUDA_VISIBLE_DEVICES (set it to the index of the GTX780 before starting the program).

  2. Use code like this, which in your case would work, since GTX780 has a larger number of SMs:

    // Selects the card with the largest number of multiprocessors.
    int num_devices = 0; int device;
    cudaGetDeviceCount(&num_devices);
    if (num_devices == 0) {
        printf("No CUDA-capable device found!\n\n"); }
    if (num_devices > 1) {
        int max_multiprocessors = 0, max_device = 0;
        for (device = 0; device < num_devices; device++) {
            cudaDeviceProp properties;
            cudaGetDeviceProperties(&properties, device);
            if (max_multiprocessors < properties.multiProcessorCount) {
                max_multiprocessors = properties.multiProcessorCount;
                max_device = device;
            }
        }
        cudaSetDevice(max_device);  // make the chosen card the current device
    }
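
To confirm which card ends up selected, something like the following could be called after the selection code. It is only a sketch: print_current_device_limits is a made-up helper name, not part of CUDA, and it prints the same grid limits that deviceQuery shows. Note that these are the limits of the hardware; as far as I know, a binary built for an older architecture can still be held to the smaller limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints name, compute capability and grid limits of
// the current device and stores the x-dimension limit in *max_grid_x.
// Returns the CUDA status so the caller can react to a failure.
cudaError_t print_current_device_limits(int *max_grid_x) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    printf("device %d: %s, compute capability %d.%d, max grid <%d,%d,%d>\n",
           device, prop.name, prop.major, prop.minor,
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    *max_grid_x = prop.maxGridSize[0];
    return cudaSuccess;
}
```

If the printed name is the GT610, the launch is going to the wrong card; if it is the GTX780, device selection is not the problem.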

Thank you for your reply.

I’ve copied your code into my program, but it still does not work.

Also, device 0 on my desktop is in fact the GTX780.

I tried hiding the GT610. deviceQuery now shows only one GPU. However, the call to the kernel function is still skipped.

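It is not enough just to choose the GTX780 device; you also have to specify the compute capability (cc) in the compile flags. Use -arch=sm_30. The 2147483647 limit on the first grid dimension only applies to code built for compute capability 3.0 or higher; code built for an older architecture is still limited to 65535 blocks in that dimension.

By the way, what cc is the GT 610? Is it 2.0 or 3.0? (Is it a Kepler card or a Fermi card?)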

Thank you for your reply.

I’m working with CUDA + VS2012. How do I set this option?
I’ve found that under project properties -> CUDA C/C++ -> Device -> Code Generation, the default is compute_10,sm_10 and there is no other choice. I typed compute_30,sm_35 into it, but it still does not work.

I checked project properties -> CUDA C/C++ -> Command Line and did not find anything like -arch=sm_30. Is there anything else I should do to apply the sm_35 setting?

And the cc of the GT610 is 2.1.

Type compute_30,sm_30 or compute_35,sm_35 in that same option.

I tried both settings; neither works.

After I apply those settings, I can’t see any change in CUDA C/C++ -> Command Line. Do those settings take effect?

When I click the triangle at the right side of the field, there are no choices like compute_** or sm_**, so I have to type the value in manually. Is that a problem?

When you say ‘not work’, do you mean you can’t launch anything larger than, say, kernelfun<<<65535,1024>>>? It may also be that you are exceeding some other GPU resource rather than the launch bounds.
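
One way to find out why a launch is being skipped is to check the error status right after it. The sketch below assumes a stand-in kernelfun that just writes to an array (the real kernel from the question is not shown in this thread), and launch_checked is a made-up wrapper name. cudaGetLastError should catch a rejected launch configuration, and cudaDeviceSynchronize reports errors raised while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question.
__global__ void kernelfun(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;
}

// Hypothetical wrapper: launches kernelfun with the given configuration
// and returns the first CUDA error encountered (cudaSuccess if none).
cudaError_t launch_checked(unsigned int blocks, unsigned int threads) {
    size_t n = (size_t)blocks * threads;
    float *d_data = NULL;
    cudaError_t err = cudaMalloc((void **)&d_data, n * sizeof(float));
    if (err != cudaSuccess) return err;

    kernelfun<<<blocks, threads>>>(d_data, n);
    err = cudaGetLastError();           // rejected launch configurations show up here
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // errors raised while the kernel runs
    if (err != cudaSuccess)
        printf("kernelfun<<<%u,%u>>> failed: %s\n",
               blocks, threads, cudaGetErrorString(err));

    cudaFree(d_data);
    return err;
}
```

Calling launch_checked(102400, 1024) and reading the printed message should show whether the grid size, the build target, or something else is being rejected.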

Also look in the build output after you compile to see whether -gencode=arch=compute_30,code="sm_30,compute_30" is being passed to NVCC; it will not show up in the project command line settings.
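
If the build settings cannot be changed, a grid-stride loop is a common way to stay under 65535 blocks without splitting the work into many launches. This is only a sketch under assumptions: kernelfun_strided and capped_blocks are made-up names, and the kernel body is a placeholder for whatever the real kernel does per element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles elements i, i + stride, i + 2*stride, ...
// so the grid does not need one thread per element.
__global__ void kernelfun_strided(float *data, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 1.0f;
}

// Hypothetical host helper: blocks needed to cover n elements, capped at max_blocks.
unsigned int capped_blocks(size_t n, unsigned int threads, unsigned int max_blocks) {
    size_t needed = (n + threads - 1) / threads;
    return needed > max_blocks ? max_blocks : (unsigned int)needed;
}
```

The launch would then look like kernelfun_strided<<<capped_blocks(n, 1024, 65535), 1024>>>(d_data, n), which stays within the smaller grid limit on either card.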