Device printf not working: CUDA 3.1 printf produces no output on a Tesla C2050

Since emulation mode has been deprecated as of CUDA 3.0 and printf is now supported in device code, I expected there to be no difference in what I’m trying to do. However, when I run the kernel the device stalls, the fan ramps from 30% to 80%, and no output is sent to stdout.

I compiled the code using the SDK makefile with -arch sm_20, and the device is of compute capability 2.0 (a Tesla C2050). Is this a bug? The driver is 256.35 for 64-bit Linux.
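A stripped-down version of what I’m trying looks roughly like this (a simplified sketch, not the exact code; the kernel name and launch configuration are just for illustration):

#include <cstdio>

// Minimal device-side printf test: one block, a handful of threads.
__global__ void hello(void)
{
	printf("hello from thread %d\n", threadIdx.x);
}

int main(void)
{
	hello<<<1, 8>>>();
	cudaThreadSynchronize();	// wait for the kernel to finish
	return 0;
}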

You might try reducing the number of printfs. I had a similar problem when I was trying to print from a large number of threads (probably thousands), but when I was only trying to execute a couple of printfs it worked fine.

(This was also on a C2050, Linux, same driver.)
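Something at roughly this level is what worked for me (an illustrative sketch, not my actual code): only one thread per block prints, so even a large grid produces a modest amount of output.

#include <cstdio>

__global__ void blockReport(void)
{
	// Only thread 0 of each block prints, keeping the total number of
	// printf calls small regardless of block size.
	if (threadIdx.x == 0) {
		printf("block %d checking in\n", blockIdx.x);
	}
}

int main(void)
{
	blockReport<<<4, 256>>>();
	cudaThreadSynchronize();
	return 0;
}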

That leaves the question of why it should behave so…

Have you maybe figured out how to fix this problem?

Reducing the number of printfs doesn’t help either; I launch the kernel with only a few threads and a single printf statement…

I have the same problem:

error: calling a host function from a __device__/__global__ function is not allowed

but on a GeForce GTX 295.

The device driver is 256.40 (64-bit Linux, RHEL5).

Thanks,

W

You don’t have the same problem. In CUDA 3.1, device-side printf is only supported when compiling for sm_20 or sm_21 (i.e. Fermi) cards.
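If you want the same source to build for both Fermi and pre-Fermi targets, one option is to guard the call on the target architecture. A sketch, relying only on the standard __CUDA_ARCH__ macro (the kernel name is just for illustration):

#include <cstdio>

__global__ void debugKernel(void)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
	// Device-side printf only exists when compiling for sm_20 or later,
	// so the call is compiled out for sm_1x targets such as the GTX 295.
	printf("thread %d in block %d\n", threadIdx.x, blockIdx.x);
#endif
}

int main(void)
{
	debugKernel<<<2, 4>>>();
	cudaThreadSynchronize();
	return 0;
}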

Whenever I try to use printf, I get a kernel invocation error (GTX 480, CUDA 3.1).

I just instrumented a kernel I am working on at the moment like this (it is doing some sparse matrix assembly in GPU memory):

__global__ void diffusion(const float deltat,
                          const unsigned int dimx, const unsigned int dimy, const unsigned int dimz,
                          const int blockSpace, int2 *opos, float *odata, unsigned int *onnz)
{
	__shared__ int2 buffa[64];
	__shared__ float buffb[64];

	volatile int tidx = threadIdx.x + threadIdx.y * blockDim.x;

	.....

	buffa[tidx].x = lnnz;
	__syncthreads();

	// First warp performs shared memory reduction to partial nnz sums
	if (tidx < warpSize) {
		for (int i = warpSize; i < blocksize; i += warpSize) {
			buffa[tidx].x += buffa[tidx+i].x;
		}
		buffa[tidx].x += buffa[tidx+16].x;
		buffa[tidx].x += buffa[tidx+8].x;
		buffa[tidx].x += buffa[tidx+4].x;
		buffa[tidx].x += buffa[tidx+2].x;
		buffa[tidx].x += buffa[tidx+1].x;
	}

	// Finalise and write out the results to global memory
	if (tidx == 0) {
		onnz[bidx] = (unsigned int)buffa[0].x;
		printf("(%d, %d)\n", bidx, buffa[0].x);
	}
}

compiled it like this:
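Presumably a plain sm_20 build, something along these lines (the source file name is an assumption for illustration):

nvcc -arch=sm_20 -o sparse2 sparse2.cu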

and ran it:

avid@cuda:~/code/sparse$ ./sparse2
(8, 4680)
(36, 4680)
(0, 4680)
(28, 4680)
(4, 4680)
(32, 4680)
(16, 4680)
(44, 4680)
(10, 4680)
(24, 4680)
(12, 4680)
(38, 4680)
(52, 4680)

As you can see, at least for this relatively simple case, it works. That kernel was run with 784 blocks on a GTX470, so there was a fair amount of output in play as well.

I have the following kernel:

__global__ void sayHello_kernel(void) {
	printf("Hello from thread %d\n", threadIdx.x);
}

It compiles, but when I run it, it doesn’t print any Hello; the output is just:

Using device 0: GeForce GTX 460

Press ENTER to exit…

I’m using CUDA 3.2. In the example I have 8 threads in only 1 block.

Am I missing something to use the new function?

I had this problem. After stumbling around for a while I found that putting

cudaThreadSynchronize();

immediately after the kernel invocation got my prints to magically appear.
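In other words, something like this (a sketch, reusing the sayHello_kernel from the post above):

#include <cstdio>

__global__ void sayHello_kernel(void)
{
	printf("Hello from thread %d\n", threadIdx.x);
}

int main(void)
{
	sayHello_kernel<<<1, 8>>>();
	// Without this synchronisation the host can reach the end of the program
	// before the device's printf buffer has been flushed to stdout.
	cudaThreadSynchronize();
	return 0;
}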

Gene W

Good catch! Thanks! :-)