Oops, I must have been really tired. The code should be:
if ( blockIdx.x * blockDim.x + threadIdx.x == 0 )
    for ( int k = 0; k < 9; k++ )
        cuPrintf("%d,",k);
and the output is: 4,5,6,7,8, (so the first four values, 0 through 3, are missing).
It seems that if I call cuPrintf too many times in a row, the beginning of the buffer disappears.
I played with the number of blocks allocated for the kernel, and that seems to be where the problem lies. If I lower the number of blocks from 63 to 2, the problem disappears. I should experiment with this more.
I should also note that I need to 'init' the application by running it a few times, getting garbage results, before I get real results.
Okay, I changed the buffer size from the default to:
cudaPrintfInit(1048576*4);
and now it works. I'm not sure why it needs such a large buffer when I end up writing fewer than 20 characters. I think I need to study how the cuPrintf mechanism works. I would have expected the required buffer size to depend on the amount of text I write, not on the number of blocks and threads in the kernel.
Does anyone else use cuPrintf, or does everyone have two or more GPUs?
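For anyone trying to reproduce this, here is a minimal sketch of the whole host-side sequence around that call. It assumes cuPrintf.cu from the SDK sample is in the same directory. The kernel name countKernel and the 64 threads per block are my own placeholders, not values from my real program, and I have not confirmed that this exact layout triggers the problem.

```cuda
// Minimal sketch: print from one thread using an enlarged cuPrintf buffer.
// Assumption: cuPrintf.cu (NVIDIA SDK sample) sits next to this file.
#include <cstdio>
#include "cuPrintf.cu"

// Hypothetical kernel name; only global thread 0 prints, as in the snippet above.
__global__ void countKernel()
{
    if (blockIdx.x * blockDim.x + threadIdx.x == 0)
        for (int k = 0; k < 9; k++)
            cuPrintf("%d,", k);
}

int main()
{
    // 4 MiB buffer instead of the 1 MiB default.
    if (cudaPrintfInit(1048576 * 4) != cudaSuccess) {
        fprintf(stderr, "cudaPrintfInit failed\n");
        return 1;
    }

    // 63 blocks as in the post; 64 threads per block is an assumed value.
    countKernel<<<63, 64>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    // Copy the device-side buffer to the host and print it.
    // The second argument (true) prefixes each line with block/thread IDs.
    cudaPrintfDisplay(stdout, true);

    // Release the device-side buffer.
    cudaPrintfEnd();
    return 0;
}
```

Calling cudaPrintfInit with no argument falls back to the default size, so comparing the two runs should show whether the buffer length is really what makes the difference.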