Max number of threads per block and max number of blocks

I’m trying to understand the example where the GPU is used to calculate the sum of all the elements of an array (the reduction sample, in folder 6_Advanced).

Specifically, I’m trying to figure out the maximum number of threads and blocks I can use to run this example on a specific Jetson board (TX2, Xavier, etc.).

Running deviceQuery on the Xavier NX, I get the following results:

...
Total amount of global memory:                 7764 MBytes (8140709888 bytes)
( 6) Multiprocessors, ( 64) CUDA Cores/MP:     384 CUDA Cores
GPU Max Clock rate:                            1109 MHz (1.11 GHz)
...
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
...

With the default values, the test passes:
maxThreads = 256;
whichKernel = 6;
maxBlocks = 64;

If I change maxThreads from 256 to 1024, the test fails. However, deviceQuery says the maximum number of threads per block is 1024. Why does 1024 make the test fail?

Thank you.

What about 1023?

It also fails with 1023:

./vectorReduction Starting...

GPU Device 0: "Xavier" with compute capability 7.2

Using Device 0: Xavier

Reducing array of type int

16777216 elements
1023 threads (max) 
64 blocks

Reduction, Throughput = 33.4858 GB/s, Time = 0.00200 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 1023

GPU result = 814203
CPU result = 2139353471

Test failed!