Guidance regarding output processing

I have a task that involves a number of kernels in a loop that may run anywhere from 1 to 64K times, producing an 8-byte result quite sporadically: roughly 260 occurrences per 2.5 seconds, with one loop iteration taking 11.8 ms, in the current test setup on a single GTX 1060.

Given the variability of the output and its small size, my thought was to emit the results via a printf("%16llx",…) in the last kernel.

Is this a realistic and efficient option?

Due to a lack of both CUDA and C coding experience, I am having difficulty capturing the output stream for further processing, which entails loading an eight-byte buffer, testing it, and repeating for as long as results are arriving:

    for (i = start; i < end; i++) {
        kernel1<<<BLOCKS,THREADS,0,0>>>();
        kernel2<<<BLOCKS,THREADS,0,0>>>();
        kernel3<<<BLOCKS,THREADS,0,0>>>();
        kernel4<<<BLOCKS,THREADS,0,0>>>();
    }

    int fd = -1;
    uint8_t buf[8];

    fd = dup(1); // dup stdout and close it to prevent screen clutter.
    close(1);

    if (fd != -1) {
        while (read(fd, buf, 8) == 8) {
            // Test results in buf
        }
        exit(EXIT_SUCCESS);
    } else {
        ..........
From this point on, messages are sent via fprintf(stderr,…). Unfortunately, testing up to this point (printing to both file and screen) seems to indicate that nothing is getting into buf, with or without the close(1).

Looking at the man page for read():
“On files that support seeking, the read operation commences at the file
offset, and the file offset is incremented by the number of bytes read.
If the file offset is at or past the end of file, no bytes are read,
and read() returns zero.”
Is read() not appropriate here? (Is seeking even possible on stdout?)

I have been spending a lot of time reading over the last few days, so any help is very much appreciated. I am certainly not expecting a fully worked solution, just a few pointers (pun not intended) and, indeed, whether this is even the right approach.

Kind regards.

I wouldn’t recommend printing from device code except for debugging purposes. Why not simply copy the buffer from the device to the host first? You can apply whatever post-processing is desired or required after that.

FWIW, I recently posted a minimal example of printing from device code into a file here:

Thanks, I did participate in that thread and did consider this, but thought it would be cleaner to capture it directly, rather than managing variable file sizes.

What does “capturing it directly” mean? The device-side printf shuffles data into a ring buffer. The CUDA runtime takes the ring-buffer contents and forwards them to stdout on the host. To the best of my knowledge, you cannot intercept data on this path. The earliest possible intercept means redirecting stdout to some other place, which is what I demonstrated in the previous thread.

Again, the direct path that offers the most control is not to use device-side printf, but to copy the raw binary output of the kernel content to the host yourself, where you can slice and dice it however you see fit.

I was under the impression that that was what I was doing above, "dup"ing the stdout fd, closing stdout, and reading from the "dup"ed fd.

Is that not redirecting?

Somehow we are going in circles here. I pointed out how to go about redirecting stdout. You replied: “I did … consider this, but thought it would be cleaner to capture it directly”. What do you mean by “capture it directly”, and why is what I suggested earlier in terms of redirecting stdout “not clean enough”?

I meant reading directly from the stdout stream, rather than writing a file to disk.

Redirecting from the stdout stream to where? stdin?

buf[8]

I certainly wouldn’t know how to do this, and I am not sure it is doable at all. Maybe it can be done through some streambuf magic by creating your own custom stream buffer? Even if you can get that to work, it seems an overly convoluted way of transporting a bit of data.

I would stick to simply copying the raw data from the device to the host with cudaMemcpy{Async}.
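As an illustration of that approach, here is a hedged CUDA sketch (all names — record_result, d_results, d_count, MAX_RESULTS — are hypothetical, not from the original code): each sporadic hit appends its 8-byte result to a device buffer via an atomic counter, and after the loop of launches the host copies out only as many entries as were produced.

```cuda
// Hypothetical sketch: collect sporadic 8-byte results in a device buffer
// instead of printing them from device code.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define MAX_RESULTS 4096            // sized generously for ~260 hits per batch

__device__ uint64_t d_results[MAX_RESULTS];
__device__ unsigned int d_count = 0;

// Called from a kernel wherever the device-side printf used to be.
__device__ void record_result(uint64_t value)
{
    unsigned int idx = atomicAdd(&d_count, 1u); // claim one slot per hit
    if (idx < MAX_RESULTS)
        d_results[idx] = value;
}

// Host side, after the loop of kernel launches.
void collect_results(void)
{
    unsigned int count = 0;
    cudaMemcpyFromSymbol(&count, d_count, sizeof(count));
    if (count > MAX_RESULTS)
        count = MAX_RESULTS;        // clamp in case the buffer overflowed

    uint64_t host_results[MAX_RESULTS];
    cudaMemcpyFromSymbol(host_results, d_results, count * sizeof(uint64_t));

    for (unsigned int i = 0; i < count; i++)
        printf("%16llx\n", (unsigned long long)host_results[i]);

    unsigned int zero = 0;          // reset the counter for the next batch
    cudaMemcpyToSymbol(d_count, &zero, sizeof(zero));
}
```

Since a copy-out is only needed once per batch rather than per loop iteration, the occasional empty transfer should be cheap.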

Thanks, I appreciate the help. I’ve based what I’ve done on something similar in the textbook I’m learning from, as I thought it might be more efficient, given that roughly 259 out of 260 cudaMemcpy* calls would be empty if the copy is done at the end of each loop.

EDIT: The above is not correct - it’s actually closer to 1 empty per 260 loops, so not a problem.