EDIT: SOLVED BELOW (REGISTER PROBLEM)
I am running on 2.6.32-3-amd64 #1 SMP Wed Feb 24 18:07:42 UTC 2010 x86_64 GNU/Linux.
Both of my NVIDIA cards support up to 512 threads per block (according to deviceQuery). When I run a simple kernel with 300 threads per block, it runs successfully. With 400 threads per block, it prints nothing.
The code below is the smallest example I can provide that shows this behavior. I see the same problem in my larger code, where 256 threads per block works but 257 does not. I am hoping that a solution to this sample will steer me towards the solution for the larger problem.
The sample is also in the attached code.tar file; running ‘compile’ builds it.
Any ideas?
Richard
Code segment:
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include "cuPrintf.cu"

const unsigned int THREADS_USED = 300; // 300 works
//const unsigned int THREADS_USED = 400; // 400 does not work

// Device code
__global__ void GAKernel(int number) {
    if (threadIdx.x == THREADS_USED-1) { // only print for last thread
        cuPrintf("test %d\n", number);
    }
}

// Host code
int main() {
    cudaPrintfInit();
    GAKernel<<<200, THREADS_USED>>>(2);
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221
%%%%%%%%%%%%%%%%%%%%%%%%%%%
$ gcc --version
gcc-4.3 (Debian 4.3.5-1) 4.3.5
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
%%%%%%%%%%%%%%%%%%%%%%%%%%%
When I run ./deviceQuery I get:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA
Device 0: "GeForce 8500 GT"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 536674304 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 1: "GeForce 8400 GS"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 536150016 bytes
Multiprocessors x Cores/MP = Cores: 1 (MP) x 8 (Cores/MP) = 8 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.40 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 2, Device = GeForce 8500 GT, Device = GeForce 8400 GS
code.tar (40 KB)