Size of printf buffer

Hi all,

I am aware that the size of the printf buffer is 1MB. cudaDeviceGetLimit(&printfBufferSz, cudaLimitPrintfFifoSize) also reports 1048576 on my system.
However, I have some code that generates quite a lot of printf data, and when I run it I get roughly 120KB of output. What’s even weirder is that if I set cudaThreadSetLimit(cudaLimitPrintfFifoSize, printfBufferSz * 2) I indeed get roughly 240KB of output. Is there any reason for this inconsistency? Am I missing something?
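For reference, this is roughly how I query and resize the limit (a minimal host-side sketch, error checking omitted; it needs the CUDA toolkit to compile). As far as I can tell cudaThreadSetLimit is the deprecated name for cudaDeviceSetLimit, so I assume they behave the same:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t printfBufferSz = 0;

    // Query the current device-side printf FIFO size (1048576 here).
    cudaDeviceGetLimit(&printfBufferSz, cudaLimitPrintfFifoSize);
    printf("printf FIFO size: %zu bytes\n", printfBufferSz);

    // Double it. The docs say this must happen before launching any
    // kernel that calls printf for the new size to take effect.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, printfBufferSz * 2);
    return 0;
}
```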

I am working on a GTX 950 with CUDA 10.

The buffer contains an unprocessed version of the printf data, which has overhead. Just because you have a 1MB buffer does not mean that, after processing, it will translate into 1048576 characters of print-out.

You can get an idea of this from a careful read of the docs:

Unlike the C-standard printf(), which returns the number of characters printed, CUDA’s printf() returns the number of arguments parsed.

Final formatting of the printf() output takes place on the host system.

The following API functions get and set the size of the buffer used to transfer the printf() arguments and internal metadata

@Robert_Crovella Thanks for the response. I actually read that, but I get 1/9th of the buffer. Could this internal metadata account for that much? Even if the format string were copied into the buffer along with the arguments for each individual printf, I still don’t understand how that discrepancy is justified.

I would presume that it does. I haven’t measured it myself. I also imagine that the overhead varies with a number of undocumented parameters, such as the exact format string and the number and types of arguments.

I don’t have any further explanation. I generally advise people that the in-kernel printf is not really suitable for large scale “bulk” output.

With a very simple test application, outputting 10 bytes per thread (no arguments beyond the format string), I get 40960 bytes of output from a 1MB buffer, so I am getting 1/25 of the available space.

I wouldn’t be surprised, based on my read of the docs, if the highest “throughput” or “efficiency” came from passing multiple arguments per printf call, perhaps up to the limit (32).

When I change to a format string with 5 %s arguments, 10 bytes each, I get ~250k out of the 1MB possible.

This suggests to me that the “efficiency” is a function of exactly how you format things.