A simple question about printf() inside a kernel with no convincing answer on google or nvidia docs

Hi. im writing a code for accelerated pulse compression and im having trouble with just printig the results. everything else works just fine. the code used to work all fine before i turned off windows WDDM TDR. the problem was execution time limitation since this code can go on for hours because of the while(1) loop. when windows terminated the process the already printed result were correct so the code works just fine. i disabled TDR through regedit as instructed here : https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
but now just nothing is printed. neither the process is terminated nor anything is printed.

i suspect something is wrong with the printf being inside a loop. i think it wont print until the buffer is full or process is being terminated. any idea why is this happening or how it can be avoided? heres the code:

__global__ void chk() {
int id = threadIdx.x + blockDim.x*blockIdx.x;

int NBITS = 96;
int RPSL = 9;
unsigned long long int CFF = 0xffffffffffffffff;
unsigned long long int C33 = 0x3333333333333333;
unsigned long long int C55 = 0x5555555555555555;
unsigned long long int C0F = 0x0f0f0f0f0f0f0f0f;
unsigned long long int C01 = 0x0101010101010101;
unsigned long long int ZERO64 = 0x0000000000000000;
unsigned long long int LSB64 = 0x0000000000000001;
unsigned long long int MSB64 = 0x8000000000000000;

unsigned long long int n0, n1;
unsigned long long int b0, b1;
unsigned long long int c0, c1;
unsigned long long int ones0;
unsigned long long int ones1;
unsigned long long int cnt;
int PSL = 0;
int i;
int SL;
unsigned long long int msb;
unsigned long long int partialset;

curandState s;
partialset = CFF >> (128 - NBITS);
curand_init(id , 0ULL, 0ULL, &s);
n0 = 0x1234567812345678ULL;
n1 = n0;
	while (1) {

		n0++;
		n1++;
		n0 += curand(&s);
		n1 += curand(&s);
//printf("%x \n%llx   BEFORE PARTIALSET thread id is: %d\n",n1 , n0 , id);
		n1 &= partialset;
//printf("%x \n%llx   thread id is: %d\n",n1 , n0 , id);
		b0 = n0;
		b1 = n1;
		ones0 = CFF;
		ones1 = partialset;

		PSL = 0;
		i = 1;
		
		while ((PSL <= RPSL) && (i <= (NBITS - 64))) {

			b0 >>= 1;
			msb = (b1 & LSB64) ? MSB64 : ZERO64;
			b0 |= msb;
			b1 >>= 1;

			ones1 >>= 1;


			c0 = n0 ^ b0;
			c1 = n1 ^ b1;

			c1 &= ones1;

			c0 -= (c0 >> 1) & C55;              // put count of each 2 bits into those 2 bits
			c0 = (c0 & C33) + ((c0 >> 2) & C33);// put count of each 4 bits into those 4 bits
			c0 = (c0 + (c0 >> 4)) & C0F;        // put count of each 8 bits into those 8 bits
			cnt = (c0 * C01) >> 56;            // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...

			c1 -= (c1 >> 1) & C55;              // put count of each 2 bits into those 2 bits
			c1 = (c1 & C33) + ((c1 >> 2) & C33);// put count of each 4 bits into those 4 bits
			c1 = (c1 + (c1 >> 4)) & C0F;        // put count of each 8 bits into those 8 bits
			cnt += (c1 * C01) >> 56;            // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...

			SL = NBITS - i - 2 * cnt;
			SL = SL > 0 ? SL : -SL;

			if (SL > PSL) {
				PSL = SL;
			}

			i++;

		}

		while ((PSL <= RPSL) && (i < (NBITS - RPSL))) {

			b0 >>= 1;
			ones0 >>= 1;

			c0 = n0 ^ b0;
			c0 &= ones0;

			c0 -= (c0 >> 1) & C55;              // put count of each 2 bits into those 2 bits
			c0 = (c0 & C33) + ((c0 >> 2) & C33);// put count of each 4 bits into those 4 bits
			c0 = (c0 + (c0 >> 4)) & C0F;        // put count of each 8 bits into those 8 bits
			cnt = (c0 * C01) >> 56;            // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...

			SL = NBITS - i - 2 * cnt;

			SL = SL > 0 ? SL : -SL;

			if (SL > PSL) {
				PSL = SL;
			}

			i++;
		}

		if (PSL <= RPSL) {
			printf("%x%llx	,	", n1, n0);
			// cout << hex << setw(16) << n1 << n0 << ", ";

			printf("PSL = %d	, thread id = %d \n", PSL, threadIdx.x);
			//cout << "PSL = " << dec << PSL << ", thread id = " << id << endl;
			cudaDeviceSynchronize();
		}	
	}	
}


void main(int argc, char **argv) {
	cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1024*1024*80);
	chk<<<32, 64 >>>();

	cudaDeviceSynchronize();

	}

forget the algorithm its at early stage and needs a lot of improvements. i would just appreciate if anyone can help with the printing. and antother thing to mention is that although 2048 threads are executing at same time, the odds of finding suitable answers are low enough in order for the buffer not to be filled even at 1 megabytes.

well i just tested something else. when using a finite but high magnitude for loop instead of while(1), all the results are printed to the output at the same time at the end of execution when for loop finishes. so the problem is that nothing is printed out of the while loop because it never finishes. am i right?

You may not get printout while a kernel is running. The printout is not guaranteed to occur until the kernel is finished. Adding cudaDeviceSynchronize to your kernel won’t fix that. So having an infinite loop in your kernel means you may never see any printout from your kernel. That’s the way printf from the kernel works.

Because of these differences with the behavior of host-side printf, the printf from the kernel is not really a reliable mechanism, when your expectations about printf usage are formed from experience with host-side printf. For these reasons, using it for extended, long-term, or large-scale printout may be simply not a good idea.

Transfer the data you need to printout to the host and print it out from there. And of course the usual methods to do this will also require your kernel to finish, before you can see the printout.

This limitation is documented:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#limitations

“It is flushed only when one of these actions is performed:…”

Your attempt to use cudaDeviceSynchronize() in-kernel does not meet the definition given. The requirement is for a host-side call to one of the synchronizing functions, after the kernel launch. While that may not be very clear for that particular item, every other indication given there is that the kernel must finish, before it is guaranteed that the buffer will be flushed.

thank you for the answer.
So are you saying that there’s no way to interrupt the stream 0? not even if i add another stream for memcpy?

if thats it, then i think i gotta go the hard way; making loops long but finite, storing the latest curandState of each thread on host side, then launching the entire kernel again but with the latest state of each thread. that would just kill so much time.

I didn’t say any of those things.

You showed a kernel that had a

while(1)

statement in it, and wondered why you are not getting printout. I was trying to answer that.

First, I don’t think using in-kernel printf should be considered a production worthy method of delivering results from a CUDA kernel. We can disagree on that point. Do as you wish. Second, if you’re claiming that you are writing a code for a serious purpose that requires a

while(1)

statement in it, then I can’t help you. I’ve never witnessed a code of any type in all of computer science, that needs to run forever.

It is possible to transfer data from a running kernel, and do something useful with it (e.g. print it out from the host side) while the kernel is still running. The methodology requires more than what you have shown here, but it works pretty well on linux and GPUs in windows TCC driver mode. For GPUs in windows WDDM driver mode, its quite a bit more difficult to get it to work “correctly” and so I usually don’t even try to make it work on windows WDDM. You are evidently on a GPU in windows WDDM mode.

well, thanks a lot, your answer made many things clear.
unfortunately the code is supposed to run on a Geforce device which makes omits TCC from my options.
and in case you are curious, the code is only going to run forever if we use random start points to search all 96 bit integers. if we do a full search from 0x0…0 to 0xff…f then it would have a known end.

either way it would be better to see the results as they are generated.

thanks again.
lets see what i can do with host calls

You can see results as they are generated from a running kernel using a methodology something like this:

https://stackoverflow.com/questions/20345702/how-can-i-check-the-progress-of-matrix-multiplication/20381924#20381924

However it’s more difficult to get it working correctly on windows WDDM, due to command batching. I usually don’t even try to make it work on WDDM.

that indeed looks difficult.
thank you for your time. this was really helpful.