New finding that needs verification: maximum threads per block appears not to be 1024 on K20

Hello everyone,
This is my first post here. I accidentally ran a program with several block sizes whose number of threads exceeds 1024. It appears to work fine: the result is right, and the performance is the same as with 1024 threads per block. I wonder what is happening inside. Does the number of threads requested at launch correspond to actual threads?

Thanks,
Finger Lake

1 16,16,192.623901
2 16,32,216.237720
3 16,48,203.623804
4 16,64,103.184595
5 16,80,112.258691
6 16,96,92.113696
7 16,112,103.654834
8 16,128,104.031885
9 16,144,111.264795
10 16,160,111.328296
11 16,176,104.204199
12 16,192,109.268994
13 16,208,103.413794
14 16,224,92.161792
15 16,240,102.891797
16 16,256,110.897290
17 16,272,110.212915
18 16,288,103.546704
19 16,304,
20 16,320,
21 32,16,160.154590
22 32,32,104.332397
23 32,48,104.071802
24 32,64,111.025903
25 32,80,103.804077
26 32,96,109.769800
27 32,112,109.313208
28 32,128,111.482788
29 32,144,112.331201
30 32,160,101.548877
31 32,176,
32 32,192,
33 32,208,
34 32,224,
35 32,240,
36 32,256,
37 32,272,
38 32,288,
39 32,304,
40 32,320,
41 48,16,144.636987
42 48,32,103.406592
43 48,48,103.028711
44 48,64,103.537573
45 48,80,99.180103
46 48,96,104.066699
47 48,112,
48 48,128,
49 48,144,
50 48,160,
51 48,176,
52 48,192,
53 48,208,
54 48,224,
55 48,240,
56 48,256,
57 48,272,
58 48,288,
59 48,304,
60 48,320,
61 64,16,103.425000
62 64,32,102.947583
63 64,48,98.968604
64 64,64,103.164600
65 64,80,103.684204
66 64,96,
67 64,112,
68 64,128,

The first column is the line number, the second is the x dimension of the thread block, the third is the y dimension of the thread block, and the last is the running time. An empty running time means the run failed.

It appears that the restriction of 1024 applies to the x, y, or z dimension of the block individually, but not to the product x*y*z.

The kernel will launch as long as the number of threads is less than the maximum number of resident threads per block, each of the block's dimensions (x/y/z) is less than 1024, and other resources, such as register file size and shared memory size, are not exceeded.

EDIT: I gave wrong information, see below.

The restriction of 1024 threads per block applies to the product. Claiming that code which is not shown produces results suggesting the rule is violated is not productive. For starters, add proper CUDA error checking to your code. You will surely discover that kernel launches requesting more than 1024 threads per block (the product of the x, y, and z thread block dimensions) fail.
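As a minimal sketch of what such error checking could look like: the original code was not posted, so the kernel `touch` and the helper `tryLaunch` below are hypothetical stand-ins, and the block shapes are simply a few taken from the timing list. The point is only that the error state is read right after the launch and again after synchronization.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; the poster's real kernel is unknown.
__global__ void touch(int *flag) {
    if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0) {
        *flag = 1;
    }
}

// Launch one block of the given shape and return the first error seen.
cudaError_t tryLaunch(dim3 block, int *d_flag) {
    cudaGetLastError();                    // clear any stale error state
    touch<<<1, block>>>(d_flag);
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();     // catches errors during execution
    }
    return err;
}

int main() {
    int *d_flag = nullptr;
    cudaError_t err = cudaMalloc(&d_flag, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // A few block shapes from the timing list: 1024, 1280, 1024, 5120 threads.
    const dim3 shapes[] = { dim3(16, 64), dim3(16, 80), dim3(32, 32), dim3(32, 160) };
    const int count = sizeof(shapes) / sizeof(shapes[0]);
    for (int i = 0; i < count; ++i) {
        err = tryLaunch(shapes[i], d_flag);
        printf("%u x %u = %u threads per block: %s\n",
               shapes[i].x, shapes[i].y, shapes[i].x * shapes[i].y,
               cudaGetErrorString(err));
    }

    cudaFree(d_flag);
    return 0;
}
```

On a device whose limit is 1024 threads per block, the 1280 and 5120 thread launches should report an invalid configuration error instead of running. Without the `cudaGetLastError()` call, the program simply continues, and a timing loop around the launch measures a kernel that never ran.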

Ah right, at first I missed the “maximum number of threads per block” entry in the table detailing the compute capabilities, so I gave an incorrect response two posts earlier.

CUDA - Wikipedia (scroll to section Version features and specifications / Table: Technical specifications)

Any launch with more than 1024 threads per block should fail. If it appears to work for you, the cause must be missing or incorrect error checking.
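Rather than relying on the table, the same limits can be read from the device at run time. This is a rough sketch, assuming the card of interest is device 0; the helper `blockFits` is a made-up name for illustration, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the block shape respects both the per-dimension
// limits and the limit on the total number of threads per block.
bool blockFits(const cudaDeviceProp &p, dim3 b) {
    unsigned long long total = (unsigned long long)b.x * b.y * b.z;
    return b.x >= 1 && b.y >= 1 && b.z >= 1 &&
           b.x <= (unsigned)p.maxThreadsDim[0] &&
           b.y <= (unsigned)p.maxThreadsDim[1] &&
           b.z <= (unsigned)p.maxThreadsDim[2] &&
           total <= (unsigned long long)p.maxThreadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s: maxThreadsPerBlock = %d, maxThreadsDim = (%d, %d, %d)\n",
           prop.name, prop.maxThreadsPerBlock,
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    printf("16 x 64 fits: %s\n", blockFits(prop, dim3(16, 64)) ? "yes" : "no");
    printf("16 x 80 fits: %s\n", blockFits(prop, dim3(16, 80)) ? "yes" : "no");
    return 0;
}
```

Note that a particular kernel can have a lower limit than the device-wide one, for example because of its register usage; `cudaFuncGetAttributes` reports that per-kernel value in its `maxThreadsPerBlock` field.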

Christian