Can I print-to-file from a kernel?

I know I can printf from a kernel if I #include <iostream>, but that goes to the console. I want to print to a file. Yes, yes, I know it will then be in random order in the file; I don’t mind that, because I’ll read it in with another program that can easily sort by the first column (thread id).

There is only a device-side printf(); there is no device-side fprintf(). Device-side printf() works by depositing data into a buffer that is copied back to the host and processed there via stdout. Note that the buffer can overflow if a kernel produces a lot of output. Programmers can select a size different from the default (I seem to recall it is 1 MB) by specifying the desired size with a call to:

cudaDeviceSetLimit (cudaLimitPrintfFifoSize, size_t size)
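For example, a minimal sketch of enlarging the buffer before launching any kernel that prints (the 8 MB value is arbitrary, chosen purely for illustration):

#include <cstdio>
#include <cstdlib>

int main (void)
{
    // enlarge the printf FIFO to 8 MB; must happen before the printing kernel runs
    cudaError_t err = cudaDeviceSetLimit (cudaLimitPrintfFifoSize, 8 * 1024 * 1024);
    if (err != cudaSuccess) {
        fprintf (stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString (err));
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}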

On the host side, one can re-direct the stdout stream using the standard freopen() function of cstdio. A simple example:

#include <cstdio>
#include <cstdlib>

__global__ void kernel1 (void)
{
    printf ("Written by kernel 1\n");
}
__global__ void kernel2 (void)
{
    printf ("Written by kernel 2\n");
}

int main ()
{
    fflush (stdout);                               // flush any pending host-side output
    fclose (stdout);
    freopen ("kernel1_output.txt", "w", stdout);   // redirect stdout to the first file
    kernel1<<<1,1>>>();
    cudaDeviceSynchronize();                       // device-side printf output is flushed here
    fflush (stdout);
    fclose (stdout);
    freopen ("kernel2_output.txt", "w", stdout);   // redirect stdout to the second file
    kernel2<<<1,1>>>();
    cudaDeviceSynchronize();
    fflush (stdout);
    fclose (stdout);
    return EXIT_SUCCESS;
}

After running this program, and assuming permissions and disk-space requirements allowed the files to be written, there should now be two files in the current directory, one containing the output from kernel1 and the other containing the output from kernel2.

Thank you very much, that is very helpful. Very educational.

But that doesn’t seem to solve my problem. Even with something like cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 999999999999999999); my program exits with <terminated> (exit value: -1) when I make a certain array too large. 500 works; 510 already fails. I can see in nvtop that with 510 it doesn’t even launch, it just throws in the towel. With 500 I see GPU activity and GPU memory activity come up nicely, and then come down again.

So my problem is not some output limitation of the print, and the printing is only for testing anyway. I have to find out why a certain array can’t be made too big; otherwise it fails no matter what, apparently unrelated to the print matter. So it’s an unrelated problem. While I have you: is it even possible to give an array in the kernel a variable size? Say, instead of double myarray[1000] I want double myarray[n], where n is a parameter of the kernel call? Probably not directly like that, but I’m sure there is another way; this can’t be an uninteresting use case. I want that array to be longer or shorter based on a certain size integer. This should perhaps be a separate, different question in this forum.

Minor comment: apparently I can’t use cudaSetDevice(4); with that line I get an empty file, whether I place it before or after the cudaDeviceSetLimit. But that’s ok, I only print for testing, so I can do that on my default GPU. Without that line it works.

The printf buffer size you specified exceeds the amount of memory on the GPU, so it is no surprise at all that the call to cudaDeviceSetLimit failed. More reasonable values will work fine (see code below).

Re arrays in kernel: There is a device-side malloc() if I recall correctly, although I haven’t used it. Take a look at the documentation to confirm.
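To illustrate the idea, here is a minimal, untested sketch (the kernel name, sizes, and launch configuration are made up for this example): each thread allocates an array of n doubles with device-side malloc(), where n is a kernel parameter. The device heap may need to be enlarged via cudaLimitMallocHeapSize before the first launch:

#include <cstdio>
#include <cstdlib>

__global__ void variable_array_kernel (int n)
{
    // each thread allocates its own array of n doubles from the device heap
    double *myarray = (double *)malloc (n * sizeof (double));
    if (!myarray) {
        printf ("thread %u: device-side malloc failed\n", threadIdx.x);
        return;
    }
    for (int i = 0; i < n; i++) {
        myarray[i] = (double)i;
    }
    printf ("thread %u: last element = %f\n", threadIdx.x, myarray[n-1]);
    free (myarray);
}

int main (void)
{
    // enlarge the device heap before the first kernel launch (64 MB is arbitrary)
    cudaDeviceSetLimit (cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    variable_array_kernel<<<1,4>>>(1000);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}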

Re cudaSetDevice: Obviously you can only select devices that are physically present in your machine, recognized by the software stack, and not excluded from CUDA use by user action. I have no idea how your system is set up; checking the return status of cudaSetDevice should tell you what is going on (see the short sketch after the code below).

#include <cstdio>
#include <cstdlib>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

__global__ void kernel1 (void)
{
    printf ("Written by kernel 1\n");
}

__global__ void kernel2 (void)
{
    printf ("Written by kernel 2\n");
}

int main ()
{
    size_t buffer_size;
    CUDA_SAFE_CALL (cudaDeviceGetLimit (&buffer_size, cudaLimitPrintfFifoSize));
    printf ("Size of device-side printf buffer is:           %zu\n", buffer_size);
    buffer_size *= 10;
    CUDA_SAFE_CALL (cudaDeviceSetLimit (cudaLimitPrintfFifoSize, buffer_size));
    CUDA_SAFE_CALL (cudaDeviceGetLimit (&buffer_size, cudaLimitPrintfFifoSize));
    printf ("Size of device-side printf buffer increased to: %zu\n", buffer_size);
    buffer_size *= 10;
    CUDA_SAFE_CALL (cudaDeviceSetLimit (cudaLimitPrintfFifoSize, buffer_size));
    CUDA_SAFE_CALL (cudaDeviceGetLimit (&buffer_size, cudaLimitPrintfFifoSize));
    printf ("Size of device-side printf buffer increased to: %zu\n", buffer_size);
    fflush (stdout);
    fclose (stdout);
    freopen ("kernel1_output.txt", "w", stdout);
    kernel1<<<1,1>>>();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    fflush (stdout);
    fclose (stdout);
    freopen ("kernel2_output.txt", "w", stdout);
    kernel2<<<1,1>>>();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    fflush (stdout);
    fclose (stdout);
    return EXIT_SUCCESS;
}
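
Regarding the cudaSetDevice(4) observation: here is a small diagnostic sketch (my own illustration, not taken from your program) that reports how many devices the runtime sees and checks the status returned by cudaSetDevice, instead of silently producing an empty output file:

#include <cstdio>
#include <cstdlib>

int main (void)
{
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount (&device_count);
    if (err != cudaSuccess) {
        fprintf (stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString (err));
        return EXIT_FAILURE;
    }
    printf ("%d CUDA device(s) visible\n", device_count);
    err = cudaSetDevice (4);   // fails if fewer than 5 devices are visible
    if (err != cudaSuccess) {
        fprintf (stderr, "cudaSetDevice(4): %s\n", cudaGetErrorString (err));
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}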

I see nothing about it in the docs, so I’m assuming that when an overflow occurs, there is no way of checking that this has happened?

Anecdotally, when the device-side buffer overflows, new incoming data is discarded. Not all programmers know this, but printf() is actually defined by the C++ standard to return the number of characters written, which would theoretically allow programmers to detect data loss in some instances. But I do not know what device-side printf() does in this regard.

Even if the device-side printf() actually returns a meaningful value (I have never checked that it does, one would have to run some experiments), it cannot be the number of printed characters because it does not print anything, it merely puts data into a buffer to be printed later on the host side. If we are lucky, it might return the number of bytes inserted into the buffer, with zero being a clear indication that the buffer was full.

If it does not do that as of now, adding such functionality seems like a reasonable enhancement request that interested parties could file.

Thanks.

OK, thank you again for your helpful reply.

It seems the file size can be larger than the size used in cudaDeviceSetLimit. With 104857600 from your code and file size of 12101351 the file size is larger. I guess that’s ok and expected.
What I find intriguing is that the binary log of 104857600 is 26.6xxxx; rounded down, that gives the 26 that minGridSize reports when I use the occupancy API: cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, compute, 0, 0);
And when I print what I really want from the host, the file size is much smaller. So I think print buffers are not the problem, neither on the host nor in the kernel.
I’m running into some other problem; ideally I’d find out what exit value: -1 means and how arrays in the kernel can be given different sizes.

The cudaDeviceSetLimit call puts a limit on the size of the buffer that device-side printf uses to store the raw, unformatted data before it is retrieved by the host. In a typical situation, when this data is formatted according to printf() rules on the host, one would see an expansion of the amount of data from raw to printed. As an example, consider “%d” expanding into “-123456789”.

For the record and according to the Programming Guide:

“Unlike the C-standard printf() , which returns the number of characters printed,
CUDA’s printf() returns the number of arguments parsed. If no arguments follow the
format string, 0 is returned. If the format string is NULL, -1 is returned. If an internal
error occurs, -2 is returned.”

Interesting! Right now I am drawing a blank as to what CUDA programmers could possibly do with that information …