Device printf not working: CUDA 3.1 printf produces no output on a Tesla C2050

Since emulation mode has been deprecated as of CUDA 3.0 and printf is now supported in device code, I expected there to be no difference in what I’m trying to do. However, when I run the kernel the device stalls, the fan ramps from 30% to 80%, and no output is sent to stdout.

I compiled the code using the SDK makefile with -arch sm_20, and the device is of compute capability 2.0 (a Tesla C2050). Is this a bug? The driver is 256.35 for 64-bit Linux.
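A stripped-down version of what I’m trying looks roughly like this (a simplified sketch, not the exact code; the kernel name and launch configuration are just for illustration):

#include <cstdio>

// Minimal device-side printf test: one block, a handful of threads.
__global__ void hello(void)
{
	printf("hello from thread %d\n", threadIdx.x);
}

int main(void)
{
	hello<<<1, 8>>>();
	cudaThreadSynchronize();	// wait for the kernel to finish
	return 0;
}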

You might try reducing the number of printfs. I had a similar problem when I was trying to print from a large number of threads (probably thousands), but when I was only trying to execute a couple of printfs it worked fine.

(This was also on a C2050, Linux, same driver.)
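Something at roughly this level is what worked for me (an illustrative sketch, not my actual code): only one thread per block prints, so even a large grid produces a modest amount of output.

#include <cstdio>

__global__ void blockReport(void)
{
	// Only thread 0 of each block prints, keeping the total number of
	// printf calls small regardless of block size.
	if (threadIdx.x == 0) {
		printf("block %d checking in\n", blockIdx.x);
	}
}

int main(void)
{
	blockReport<<<4, 256>>>();
	cudaThreadSynchronize();
	return 0;
}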

That leaves the question of why it should behave so…

Have you maybe figured out how to fix this problem?

Reducing the number of printfs doesn’t help either; I launch the kernel with only a few threads and a single printf statement…

I have the same problem:

error: calling a host function from a __device__/__global__ function is not allowed

but on a GeForce GTX 295.

The device driver is 256.40 (64-bit Linux, RHEL5).

Thanks,

W

You don’t have the same problem. In CUDA 3.1, device-side printf is only supported when compiling for sm_20 or sm_21 (i.e. Fermi) cards.
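If you want the same source to build for both Fermi and pre-Fermi targets, one option is to guard the call on the target architecture. A sketch, relying only on the standard __CUDA_ARCH__ macro (the kernel name is just for illustration):

#include <cstdio>

__global__ void debugKernel(void)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
	// Device-side printf only exists when compiling for sm_20 or later,
	// so the call is compiled out for sm_1x targets such as the GTX 295.
	printf("thread %d in block %d\n", threadIdx.x, blockIdx.x);
#endif
}

int main(void)
{
	debugKernel<<<2, 4>>>();
	cudaThreadSynchronize();
	return 0;
}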

Whenever I try to use printf, I get a kernel invocation error (GTX 480, CUDA 3.1).

I just instrumented a kernel I am working on at the moment like this (it is doing some sparse matrix assembly in GPU memory):

__global__ void diffusion(const float deltat,
                          const unsigned int dimx, const unsigned int dimy, const unsigned int dimz,
                          const int blockSpace, int2 *opos, float *odata, unsigned int *onnz)
{
	__shared__ int2 buffa[64];
	__shared__ float buffb[64];

	volatile int tidx = threadIdx.x + threadIdx.y * blockDim.x;

	.....

	buffa[tidx].x = lnnz;
	__syncthreads();

	// First warp performs shared memory reduction to partial nnz sums
	if (tidx < warpSize) {
		for (int i = warpSize; i < blocksize; i += warpSize) {
			buffa[tidx].x += buffa[tidx+i].x;
		}
		buffa[tidx].x += buffa[tidx+16].x;
		buffa[tidx].x += buffa[tidx+8].x;
		buffa[tidx].x += buffa[tidx+4].x;
		buffa[tidx].x += buffa[tidx+2].x;
		buffa[tidx].x += buffa[tidx+1].x;
	}

	// Finalise and write out the results to global memory
	if (tidx == 0) {
		onnz[bidx] = (unsigned int)buffa[0].x;
		printf("(%d, %d)\n", bidx, buffa[0].x);
	}
}

compiled it like this:
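Presumably a plain sm_20 build, something along these lines (the source file name is an assumption for illustration):

nvcc -arch=sm_20 -o sparse2 sparse2.cu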

and ran it:

avid@cuda:~/code/sparse$ ./sparse2
(8, 4680)
(36, 4680)
(0, 4680)
(28, 4680)
(4, 4680)
(32, 4680)
(16, 4680)
(44, 4680)
(10, 4680)
(24, 4680)
(12, 4680)
(38, 4680)
(52, 4680)

As you can see, at least for this relatively simple case, it works. That kernel was run with 784 blocks on a GTX470, so there was a fair amount of output in play as well.

I have the following kernel:

__global__ void sayHello_kernel(void) {
	printf("Hello from thread %d\n", threadIdx.x);
}

It compiles, but when I run it, it doesn’t print any Hello; the output is just:

Using device 0: GeForce GTX 460

Press ENTER to exit…

I’m using CUDA 3.2. In the example I have 8 threads in only 1 block.

Am I missing something to use the new function?

I had this problem. After stumbling around for a while I found that putting

cudaThreadSynchronize();

immediately after the kernel invocation got my prints to magically appear.
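In other words, something like this (a sketch, reusing the sayHello_kernel from the post above):

#include <cstdio>

__global__ void sayHello_kernel(void)
{
	printf("Hello from thread %d\n", threadIdx.x);
}

int main(void)
{
	sayHello_kernel<<<1, 8>>>();
	// Without this synchronisation the host can reach the end of the program
	// before the device's printf buffer has been flushed to stdout.
	cudaThreadSynchronize();
	return 0;
}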

Gene W

Good catch! Thanks! :-)