Since emulation has been deprecated as of CUDA 3.0 and printf is now supported in device code, there should be no difference in what I’m trying to do. However, when I run the kernel the device stalls, the fan spins up from 30% to 80%, and no output is sent to stdout.
I compiled the code using the SDK makefile with -arch sm_20, and the device is of compute capability 2.0 (a Tesla C2050). Is this a bug? The driver is 256.35 for 64-bit Linux.
lc412
July 13, 2010, 3:53pm
2
You might try reducing the number of printf calls. I had a similar problem when I was trying to print from a large number of threads (probably thousands), but when I only executed a couple of printfs it worked fine.
(This was also on a c2050, linux, same driver.)
That leaves the question of why it should behave so…
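One possible explanation (my guess, not something established in this thread): device-side printf writes into a fixed-size FIFO buffer (1 MB by default) that is only drained to stdout at synchronization points, so printing from thousands of threads can overflow it. On CUDA 3.x the buffer can be enlarged before the launch; a minimal sketch:

```cuda
#include <cstdio>

__global__ void chatty_kernel(void)
{
    // Every thread writes one line into the device-side printf FIFO.
    printf("thread %d\n", threadIdx.x);
}

int main(void)
{
    // Enlarge the printf FIFO (default 1 MB) before launching a kernel
    // that prints from many threads. cudaThreadSetLimit is the CUDA
    // 3.x-era call; later toolkits renamed it cudaDeviceSetLimit.
    cudaThreadSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);

    chatty_kernel<<<1, 32>>>();

    // The FIFO is only flushed to stdout at explicit sync points.
    cudaThreadSynchronize();
    return 0;
}
```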
Have you maybe figured out how to fix this problem?
Reducing the number of printf calls doesn’t help either; I launch the kernel with only a few threads and a single printf statement…
Leeuw
August 31, 2010, 11:11pm
6
I have the same problem:
error: calling a host function from a __device__/__global__ function is not allowed
but on a GeForce GTX 295.
The device driver is 256.40 (Linux 64-bit, RHEL5).
Thanks,
W
avidday
September 1, 2010, 6:44am
8
You don’t have the same problem. printf is only supported in CUDA 3.1 when using sm_20 and sm_21 (so Fermi) cards.
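If you need the same source to build for pre-Fermi cards as well, one option (my sketch, not from this thread) is to guard the printf with the __CUDA_ARCH__ macro, which nvcc defines during device compilation:

```cuda
__global__ void maybe_print(void)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    // Device-side printf only exists on sm_20 and newer.
    printf("hello from thread %d\n", threadIdx.x);
#endif
}
```

Compiling the same file with -arch=sm_13 then simply drops the printf instead of producing the host-function error.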
Whenever I try to use printf, I get a kernel invocation error. GTX 480 and CUDA 3.1.
avidday
September 1, 2010, 8:36am
12
I just instrumented a kernel I am working on at the moment like this (it is doing some sparse matrix assembly in GPU memory):
__global__ void diffusion( const float deltat,
        const unsigned int dimx, const unsigned int dimy, const unsigned int dimz,
        const int blockSpace, int2 *opos, float *odata, unsigned int *onnz)
{
    __shared__ int2  buffa[64];
    __shared__ float buffb[64];

    volatile int tidx = threadIdx.x + threadIdx.y * blockDim.x;

    .....

    buffa[tidx].x = lnnz;
    __syncthreads();

    // First warp performs shared memory reduction to partial nnz sums
    if (tidx < warpSize) {
        for (int i = warpSize; i < blocksize; i += warpSize) {
            buffa[tidx].x += buffa[tidx+i].x;
        }
        buffa[tidx].x += buffa[tidx+16].x;
        buffa[tidx].x += buffa[tidx+8].x;
        buffa[tidx].x += buffa[tidx+4].x;
        buffa[tidx].x += buffa[tidx+2].x;
        buffa[tidx].x += buffa[tidx+1].x;
    }

    // Finalise and write out the results to global memory
    if (tidx == 0) {
        onnz[bidx] = (unsigned int)buffa[0].x;
        printf("(%d, %d)\n", bidx, buffa[0].x);
    }
}
compiled it like this:
avid@cuda:~/code/sparse$ nvcc -arch=sm_20 --ptxas-options="-v" -g -I. -o sparse2 sparse2.cu
ptxas info : Compiling entry function '_Z15spmv_ell_kernelIjfEvT_S0_S0_S0_PKS0_PKT0_S5_PS3_' for 'sm_20'
ptxas info : Used 14 registers, 80 bytes cmem[0], 48 bytes cmem[2], 4 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z12deviceMemsetIfEvPT_S0_m' for 'sm_20'
ptxas info : Used 7 registers, 56 bytes cmem[0], 48 bytes cmem[2], 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z12deviceMemsetI4int2EvPT_S1_m' for 'sm_20'
ptxas info : Used 8 registers, 56 bytes cmem[0], 48 bytes cmem[2], 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z9diffusionfjjjiP4int2PfPj' for 'sm_20'
ptxas info : Used 30 registers, 768+0 bytes smem, 80 bytes cmem[0], 48 bytes cmem[2], 8 bytes cmem[14]
ptxas info : Compiling entry function '_Z9assemblebPK4int2PKfPjjjjS4_Pf' for 'sm_20'
ptxas info : Used 11 registers, 88 bytes cmem[0], 48 bytes cmem[2], 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z9assembleaPK4int2PKfjjPjS4_' for 'sm_20'
ptxas info : Used 10 registers, 512+0 bytes smem, 72 bytes cmem[0], 48 bytes cmem[2], 4 bytes cmem[14]
and ran it:
avid@cuda:~/code/sparse$ ./sparse2
(8, 4680)
(36, 4680)
(0, 4680)
(28, 4680)
(4, 4680)
(32, 4680)
(16, 4680)
(44, 4680)
(10, 4680)
(24, 4680)
(12, 4680)
(38, 4680)
(52, 4680)
As you can see, at least for this relatively simple case, it works. That kernel was run with 784 blocks on a GTX 470, so there was a fair amount of output in play as well.
theblur
October 22, 2010, 1:35pm
14
I have the following kernel:
__global__ void sayHello_kernel(void) {
    printf("Hello from thread %d", threadIdx.x);
}
It compiles, but when I run it, it doesn’t print any Hello; all I get is:
Using device 0: GeForce GTX 460
Press ENTER to exit…
I’m using CUDA 3.2. In the example I have 8 threads in only 1 block.
Am I missing something to use the new function?
I had this problem. After stumbling around for a while I found that putting
cudaThreadSynchronize();
immediately after the kernel invocation got my prints to magically appear.
Gene W
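To make the fix concrete, here is a minimal, self-contained version of the hello-world kernel above with the synchronize call added (my sketch; I also added a trailing \n, which helps because host stdout is typically line-buffered):

```cuda
#include <cstdio>

__global__ void sayHello_kernel(void)
{
    // Trailing newline added so the host-side flush shows each line.
    printf("Hello from thread %d\n", threadIdx.x);
}

int main(void)
{
    sayHello_kernel<<<1, 8>>>();

    // Without this, main() can return before the device printf buffer
    // is copied back and flushed, so nothing appears on stdout.
    cudaThreadSynchronize();
    return 0;
}
```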