A simple question about printf() inside a kernel with no convincing answer on google or nvidia docs

amintahertalari · August 4, 2019, 7:05am

Hi. im writing a code for accelerated pulse compression and im having trouble with just printig the results. everything else works just fine. the code used to work all fine before i turned off windows WDDM TDR. the problem was execution time limitation since this code can go on for hours because of the while(1) loop. when windows terminated the process the already printed result were correct so the code works just fine. i disabled TDR through regedit as instructed here : https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
but now just nothing is printed. neither the process is terminated nor anything is printed.

i suspect something is wrong with the printf being inside a loop. i think it wont print until the buffer is full or process is being terminated. any idea why is this happening or how it can be avoided? heres the code:

__global__ void chk() {
int id = threadIdx.x + blockDim.x*blockIdx.x;

int NBITS = 96;
int RPSL = 9;
unsigned long long int CFF = 0xffffffffffffffff;
unsigned long long int C33 = 0x3333333333333333;
unsigned long long int C55 = 0x5555555555555555;
unsigned long long int C0F = 0x0f0f0f0f0f0f0f0f;
unsigned long long int C01 = 0x0101010101010101;
unsigned long long int ZERO64 = 0x0000000000000000;
unsigned long long int LSB64 = 0x0000000000000001;
unsigned long long int MSB64 = 0x8000000000000000;

unsigned long long int n0, n1;
unsigned long long int b0, b1;
unsigned long long int c0, c1;
unsigned long long int ones0;
unsigned long long int ones1;
unsigned long long int cnt;
int PSL = 0;
int i;
int SL;
unsigned long long int msb;
unsigned long long int partialset;

curandState s;
partialset = CFF >> (128 - NBITS);
curand_init(id , 0ULL, 0ULL, &s);
n0 = 0x1234567812345678ULL;
n1 = n0;
	while (1) {

		n0++;
		n1++;
		n0 += curand(&s);
		n1 += curand(&s);
//printf("%x \n%llx   BEFORE PARTIALSET thread id is: %d\n",n1 , n0 , id);
		n1 &= partialset;
//printf("%x \n%llx   thread id is: %d\n",n1 , n0 , id);
		b0 = n0;
		b1 = n1;
		ones0 = CFF;
		ones1 = partialset;

		PSL = 0;
		i = 1;
		
		while ((PSL <= RPSL) && (i <= (NBITS - 64))) {

			b0 >>= 1;
			msb = (b1 & LSB64) ? MSB64 : ZERO64;
			b0 |= msb;
			b1 >>= 1;

			ones1 >>= 1;


			c0 = n0 ^ b0;
			c1 = n1 ^ b1;

			c1 &= ones1;

			c0 -= (c0 >> 1) & C55;              // put count of each 2 bits into those 2 bits
			c0 = (c0 & C33) + ((c0 >> 2) & C33);// put count of each 4 bits into those 4 bits
			c0 = (c0 + (c0 >> 4)) & C0F;        // put count of each 8 bits into those 8 bits
			cnt = (c0 * C01) >> 56;            // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...

			c1 -= (c1 >> 1) & C55;              // put count of each 2 bits into those 2 bits
			c1 = (c1 & C33) + ((c1 >> 2) & C33);// put count of each 4 bits into those 4 bits
			c1 = (c1 + (c1 >> 4)) & C0F;        // put count of each 8 bits into those 8 bits
			cnt += (c1 * C01) >> 56;            // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...

			SL = NBITS - i - 2 * cnt;
			SL = SL > 0 ? SL : -SL;

			if (SL > PSL) {
				PSL = SL;
			}

			i++;

		}

		while ((PSL <= RPSL) && (i < (NBITS - RPSL))) {

			b0 >>= 1;
			ones0 >>= 1;

			c0 = n0 ^ b0;
			c0 &= ones0;

			c0 -= (c0 >> 1) & C55;              // put count of each 2 bits into those 2 bits
			c0 = (c0 & C33) + ((c0 >> 2) & C33);// put count of each 4 bits into those 4 bits
			c0 = (c0 + (c0 >> 4)) & C0F;        // put count of each 8 bits into those 8 bits
			cnt = (c0 * C01) >> 56;            // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...

			SL = NBITS - i - 2 * cnt;

			SL = SL > 0 ? SL : -SL;

			if (SL > PSL) {
				PSL = SL;
			}

			i++;
		}

		if (PSL <= RPSL) {
			printf("%x%llx	,	", n1, n0);
			// cout << hex << setw(16) << n1 << n0 << ", ";

			printf("PSL = %d	, thread id = %d \n", PSL, threadIdx.x);
			//cout << "PSL = " << dec << PSL << ", thread id = " << id << endl;
			cudaDeviceSynchronize();
		}	
	}	
}


void main(int argc, char **argv) {
	cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1024*1024*80);
	chk<<<32, 64 >>>();

	cudaDeviceSynchronize();

	}

amintahertalari · August 4, 2019, 7:12am

forget the algorithm its at early stage and needs a lot of improvements. i would just appreciate if anyone can help with the printing. and antother thing to mention is that although 2048 threads are executing at same time, the odds of finding suitable answers are low enough in order for the buffer not to be filled even at 1 megabytes.

amintahertalari · August 4, 2019, 7:19am

well i just tested something else. when using a finite but high magnitude for loop instead of while(1), all the results are printed to the output at the same time at the end of execution when for loop finishes. so the problem is that nothing is printed out of the while loop because it never finishes. am i right?

Robert_Crovella · August 4, 2019, 1:00pm

You may not get printout while a kernel is running. The printout is not guaranteed to occur until the kernel is finished. Adding cudaDeviceSynchronize to your kernel won’t fix that. So having an infinite loop in your kernel means you may never see any printout from your kernel. That’s the way printf from the kernel works.

Because of these differences with the behavior of host-side printf, the printf from the kernel is not really a reliable mechanism, when your expectations about printf usage are formed from experience with host-side printf. For these reasons, using it for extended, long-term, or large-scale printout may be simply not a good idea.

Transfer the data you need to printout to the host and print it out from there. And of course the usual methods to do this will also require your kernel to finish, before you can see the printout.

This limitation is documented:

[url]Programming Guide :: CUDA Toolkit Documentation

“It is flushed only when one of these actions is performed:…”

Your attempt to use cudaDeviceSynchronize() in-kernel does not meet the definition given. The requirement is for a host-side call to one of the synchronizing functions, after the kernel launch. While that may not be very clear for that particular item, every other indication given there is that the kernel must finish, before it is guaranteed that the buffer will be flushed.

amintahertalari · August 4, 2019, 1:26pm

Robert_Crovella:

You may not get printout while a kernel is running. The printout is not guaranteed to occur until the kernel is finished. Adding cudaDeviceSynchronize to your kernel won’t fix that. So having an infinite loop in your kernel means you may never see any printout from your kernel. That’s the way printf from the kernel works.

Because of these differences with the behavior of host-side printf, the printf from the kernel is not really a reliable mechanism, when your expectations about printf usage are formed from experience with host-side printf. For these reasons, using it for extended, long-term, or large-scale printout may be simply not a good idea.

Transfer the data you need to printout to the host and print it out from there. And of course the usual methods to do this will also require your kernel to finish, before you can see the printout.

thank you for the answer.
So are you saying that there’s no way to interrupt the stream 0? not even if i add another stream for memcpy?

if thats it, then i think i gotta go the hard way; making loops long but finite, storing the latest curandState of each thread on host side, then launching the entire kernel again but with the latest state of each thread. that would just kill so much time.

Robert_Crovella · August 4, 2019, 1:42pm

I didn’t say any of those things.

You showed a kernel that had a

while(1)

statement in it, and wondered why you are not getting printout. I was trying to answer that.

First, I don’t think using in-kernel printf should be considered a production worthy method of delivering results from a CUDA kernel. We can disagree on that point. Do as you wish. Second, if you’re claiming that you are writing a code for a serious purpose that requires a

while(1)

statement in it, then I can’t help you. I’ve never witnessed a code of any type in all of computer science, that needs to run forever.

It is possible to transfer data from a running kernel, and do something useful with it (e.g. print it out from the host side) while the kernel is still running. The methodology requires more than what you have shown here, but it works pretty well on linux and GPUs in windows TCC driver mode. For GPUs in windows WDDM driver mode, its quite a bit more difficult to get it to work “correctly” and so I usually don’t even try to make it work on windows WDDM. You are evidently on a GPU in windows WDDM mode.

amintahertalari · August 4, 2019, 2:13pm

Robert_Crovella:

I didn’t say any of those things.

You showed a kernel that had a

while(1)

statement in it, and wondered why you are not getting printout. I was trying to answer that.

First, I don’t think using in-kernel printf should be considered a production worthy method of delivering results from a CUDA kernel. We can disagree on that point. Do as you wish. Second, if you’re claiming that you are writing a code for a serious purpose that requires a

while(1)

statement in it, then I can’t help you. I’ve never witnessed a code of any type in all of computer science, that needs to run forever.

It is possible to transfer data from a running kernel, and do something useful with it (e.g. print it out from the host side) while the kernel is still running. The methodology requires more than what you have shown here, but it works pretty well on linux and GPUs in windows TCC driver mode. For GPUs in windows WDDM driver mode, its quite a bit more difficult to get it to work “correctly” and so I usually don’t even try to make it work on windows WDDM. You are evidently on a GPU in windows WDDM mode.

well, thanks a lot, your answer made many things clear.
unfortunately the code is supposed to run on a Geforce device which makes omits TCC from my options.
and in case you are curious, the code is only going to run forever if we use random start points to search all 96 bit integers. if we do a full search from 0x0…0 to 0xff…f then it would have a known end.

either way it would be better to see the results as they are generated.

thanks again.
lets see what i can do with host calls

Robert_Crovella · August 4, 2019, 2:31pm

You can see results as they are generated from a running kernel using a methodology something like this:

[url]cuda - How can I check the progress of matrix multiplication? - Stack Overflow

However it’s more difficult to get it working correctly on windows WDDM, due to command batching. I usually don’t even try to make it work on WDDM.

amintahertalari · August 4, 2019, 3:14pm

that indeed looks difficult.
thank you for your time. this was really helpful.

Topic		Replies	Views
No printf(".") output from the kernel CUDA Programming and Performance	6	3170	March 15, 2023
Doubts in using printf in kernel code CUDA Programming and Performance	1	941	October 9, 2011
Cuda 8.0 - Simple question regarding the use of printf from kernel. CUDA Programming and Performance	2	3664	April 4, 2017
Printf and thread idx problems CUDA Programming and Performance	8	8683	March 30, 2012
Printf isn't working in my kernel? CUDA Programming and Performance	1	7335	January 10, 2011
device printf not working cuda 3.1 printf not working on tesla c2050 CUDA Programming and Performance	16	63495	May 27, 2011
printf inside a kernel is not working nVIDIA Quadro 4000 CUDA Programming and Performance	2	3764	November 7, 2011
printf problem CUDA Programming and Performance	3	1125	December 5, 2012
printf not working CUDA Programming and Performance	0	2309	November 2, 2007
printf vs cuPrintf in kernels CUDA Programming and Performance	2	5951	February 5, 2013

A simple question about printf() inside a kernel with no convincing answer on google or nvidia docs

Related topics