Fewer than MaxThreads per Block Fails (Code Included) 400 threads per block fails while 300 successf


EDIT: SOLVED BELOW (REGISTER PROBLEM)


I am running on 2.6.32-3-amd64 #1 SMP Wed Feb 24 18:07:42 UTC 2010 x86_64 GNU/Linux.

Both of my NVidia cards support up to 512 threads per block (found from deviceQuery). When I run a simple code with 300 threads per block, it runs successfully. When I run 400 threads per block, it fails to print anything.

This is the smallest piece of code that I can provide that displays this behavior. I have the problem with my larger code where 256 threads per block works while 257 threads per block doesn’t. I am hoping that the solution to this sample problem will steer me towards the solution for my larger problem.

I am including a simple segment of code that displays this strange behavior. Running ‘compile’ will compile the sample code contained in the code.tar file.

Any ideas?

Richard

Code segment:

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include "cuPrintf.cu"

const unsigned int THREADS_USED = 300; // 300 works

//const unsigned int THREADS_USED = 400; // 400 does not work

// Device code

__global__ void GAKernel(int number) {

  if (threadIdx.x == THREADS_USED-1) { // only print for last thread

    cuPrintf("test %d\n", number);

  }

}

// Host code

int main() {

  cudaPrintfInit();

GAKernel<<<200, THREADS_USED>>>(2);

cudaPrintfDisplay(stdout, true);

  cudaPrintfEnd();

}

$ nvcc --version

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2010 NVIDIA Corporation

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

%%%%%%%%%%%%%%%%%%%%%%%%%%%

$ gcc --version

gcc-4.3 (Debian 4.3.5-1) 4.3.5

Copyright © 2008 Free Software Foundation, Inc.

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

%%%%%%%%%%%%%%%%%%%%%%%%%%%

When I run ./deviceQuery I get:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: “GeForce 8500 GT”

CUDA Driver Version: 3.20

CUDA Runtime Version: 3.20

CUDA Capability Major/Minor version number: 1.1

Total amount of global memory: 536674304 bytes

Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 0.92 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Concurrent kernel execution: No

Device has ECC support enabled: No

Device is using TCC driver mode: No

Device 1: “GeForce 8400 GS”

CUDA Driver Version: 3.20

CUDA Runtime Version: 3.20

CUDA Capability Major/Minor version number: 1.1

Total amount of global memory: 536150016 bytes

Multiprocessors x Cores/MP = Cores: 1 (MP) x 8 (Cores/MP) = 8 (Cores)

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 1.40 GHz

Concurrent copy and execution: No

Run time limit on kernels: Yes

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Concurrent kernel execution: No

Device has ECC support enabled: No

Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 2, Device = GeForce 8500 GT, Device = GeForce 8400 GS

code.tar (40 KB)

You are probably running out of registers. Compile with [font=“Courier New”]–ptxas-options=-v[/font] to see ressource usage and use the occupancy calculator to check the maximum blocksize with that many registers.

I am guessing that the kernel is running (it compiles to 14 registers with cuda 3.2), but that some part of cuPrintf is failing for larger block sizes.

When I compile with the flag I get:

$ nvcc --ptxas-options=-v -g -G StaticTests.cu -o static -gencode arch=compute_11,code=sm_11 -gencode arch=compute_20,code=sm_20

ptxas info : Compiling entry function ‘_Z8GAKerneli’ for ‘sm_11’

ptxas info : Used 20 registers, 16+0 bytes lmem, 4+16 bytes smem, 21 bytes cmem[0], 36 bytes cmem[1], 8 bytes cmem[14]

ptxas info : Compiling entry function ‘_Z8GAKerneli’ for ‘sm_20’

ptxas info : Used 34 registers, 36 bytes cmem[0], 12 bytes cmem[2], 12 bytes cmem[14]

The same result comes from compiling with 300 threads and 400 threads in a block. Only 300 threads does show print statements while 400 doesn’t. I only show statements for the last thread so the size that is displayed should be constant.

I started using cuprint when I tried debugging my larger program which started having this type of behavior.

I’ve tried running the compiled code on a friend’s Tesla card and found the same behavior.

Computing only one block of 400 threads demonstrates this behavior.

GAKernel<<< 1, 400 >>>(2);

It seems strange that cuPrintf would show this behavior with 400 threads and not have it be more prominent from the forum comments. I’ve tried becoming a registered developer to make sure that I have the latest version of cuPrintf but it has been a few weeks now. Could someone confirm that the tar file contains the latest version of cuPrintf?

I’m at a lost regarding what else it could be.

Thanks,

Richard

Just solved it. Thought I would boil down a few weeks of lessons-learned in case it helps anyone else out there.

  1. You might be used to programs volunteering that an error has occurred while running. CUDA is happy to tell you about any errors but it won’t volunteer that information (in my case only showing mangled results). You should try to ask it regarding any errors after each CUDA function call. Useful search terms would be CudaCheckError and CudaSafeCall for more info.

  2. Registers are a limited number of very quick memory locations on a processor. If you compile using the --ptxas-options=-v flag you will

be able to see how many registers your program uses per thread. If the number of registers times the number of threads in a block is over

8192, for most cards, the Kernel will not have sufficient resources to start. CUDA won’t tell you that is what has happening, of course,

unless you ask.

  1. If you compile using the --maxrregcount=# flag you can suggest that the compiler try to compile the code so that it only requires #

registers per thread. This could allow your (register count) times (threads per block) to fall under the 8192 limit. Awesome if you need

more threads per block (me).

Hope this helps,

Richard

Now you awakened my interest: Why did it fail in your case? For the kernel where you gave the ressource usage, 400x20 is still lower than 8192.
Or are you saying that the failure happened for a different kernel, and you just didn’t check for errors until after this kernel’s invocation?

For both Kernels, the example here and the larger one elsewhere, the answer was the use of --maxrregcount=#. It’s unsatisfying because 400*20 is less than the 8192 deviceQuery stated was my register limit per block.

To be specific, the current example code that displays this behavior is:

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include "cuPrintf.cu"

#include "ErrorCode.c"

//const unsigned int THREADS_USED = 300; // below 385 works

const unsigned int THREADS_USED = 385; // 385 and above doesn't work

const unsigned int NUM_CASES = 1;

// Device code

__global__ void GAKernel(int numCases, int numThreads) {

  if (threadIdx.x == THREADS_USED-1) { // only print for last thread

    cuPrintf("test %d\n", numCases);

  }

}

// Host code

int main() {

  cudaPrintfInit();

GAKernel<<<NUM_CASES, THREADS_USED>>>(NUM_CASES, THREADS_USED);

  cudaThreadSynchronize();

  CudaCheckError();

cudaPrintfDisplay(stdout, true);

  cudaPrintfEnd();

}

When I compile and run using the code above:

$ nvcc --ptxas-options=-v -g -G StaticTests.cu -o static -gencode arch=compute_11,code=sm_11

ptxas info : Compiling entry function ‘_Z8GAKernelii’ for ‘sm_11’

ptxas info : Used 20 registers, 16+0 bytes lmem, 8+16 bytes smem, 21 bytes cmem[0], 36 bytes cmem[1], 8 bytes cmem[14]

$ ./static

cudaCheckError() failed at StaticTests.cu:25 : too many resources requested for launch.

%%%%%%%%%%%%%%%%%%%

When I change the compile command to read:

$ nvcc --maxrregcount=10 --ptxas-options=-v -g -G StaticTests.cu -o static -gencode arch=compute_11,code=sm_11

ptxas info : Compiling entry function ‘_Z8GAKernelii’ for ‘sm_11’

ptxas info : Used 10 registers, 80+0 bytes lmem, 8+16 bytes smem, 21 bytes cmem[0], 36 bytes cmem[1], 8 bytes cmem[14]

$ ./static

[0, 384]: test 1

%%%%%%%%%%%%%%%%%%%

For 450 threads:

$ nvcc --maxrregcount=10 --ptxas-options=-v -g -G StaticTests.cu -o static -gencode arch=compute_11,code=sm_11

ptxas info : Compiling entry function ‘_Z8GAKernelii’ for ‘sm_11’

ptxas info : Used 10 registers, 80+0 bytes lmem, 8+16 bytes smem, 21 bytes cmem[0], 36 bytes cmem[1], 8 bytes cmem[14]

$ ./static

[0, 449]: test 1

%%%%%%%%%%%%%%%%%%%

For 525 threads:

$ nvcc --maxrregcount=10 --ptxas-options=-v -g -G StaticTests.cu -o static -gencode arch=compute_11,code=sm_11

ptxas info : Compiling entry function ‘_Z8GAKernelii’ for ‘sm_11’

ptxas info : Used 10 registers, 80+0 bytes lmem, 8+16 bytes smem, 21 bytes cmem[0], 36 bytes cmem[1], 8 bytes cmem[14]

$ ./static

cudaCheckError() failed at StaticTests.cu:25 : invalid configuration argument.

%%%%%%%%%%%%%%%%%%%

I don’t know what is happening regarding the configuration argument. I’ve attached a refresh of the code. Though using --maxrregcount=# allowed both kernels to get beyond the limitation that had been a problem for me, they should have been able to do more given the results from --ptxas-options=-v unless I’m still missing something.

Best regards,

Richard
codeExample_v2.tar.gz (44.1 KB)

Ah, sorry for the confusion, simple arithmetic error on my side: 400 is a multiple of 16 but not of 32 (and thus not of 64 as well).

The number of threads per block is rounded up to get an integer number of warps (for compute capability 1.x its even rounded up to get an even number of warps). So your block will have 14 warps for an effective size of 448 threads per block. And since 448*20=8960 > 8192, the kernel cannot be launched.

To check this, you could convince yourself that the kernel launches for 384 threads per block (12 warps), but not 385.

Check section 4.2 of the Programming Guide for the exact formula for registers allocated per block.

No need to convince myself, I’ve seen that to be true, it launches for 384 but not for 385. ;)

Thank you for clarifying what’s happening here in a satisfying way!