Guidance regarding output processing

I have a task that involves a number of kernels in a loop that may run anywhere from 1 to 64K times, producing an 8-byte result quite sporadically: roughly 260 occurrences per 2.5 seconds, with one loop iteration taking 11.8 ms, in the current test setup on a single GTX 1060.

Given the variability of the output and its small size, my thought was to emit the results via a printf("%16llx",…) in the last kernel.

Is this a realistic and efficient option?

Due to a lack of both CUDA and C coding experience, I am having difficulty capturing the output stream for further processing, which entails loading an eight-byte buffer, testing it, and repeating for as long as results are arriving:

    for (i = start; i < end; i++) {
        kernel1<<<BLOCKS,THREADS,0,0>>>();
        kernel2<<<BLOCKS,THREADS,0,0>>>();
        kernel3<<<BLOCKS,THREADS,0,0>>>();
        kernel4<<<BLOCKS,THREADS,0,0>>>();
    }

    int fd = -1;
    uint8_t buf[8];

    fd = dup(1); // dup stdout and close it to prevent screen clutter.
    close(1);

    if (fd != -1) {
        while (read(fd, buf, 8) == 8) {
            // Test results in buf
        }
        exit(EXIT_SUCCESS);
    } else {
        ..........
From this point on, messages are sent via fprintf(stderr,…). Unfortunately, testing up to this point (printing to both file and screen) seems to indicate that nothing is getting into buf, with or without the close(1).

Looking at the man page for read():
“On files that support seeking, the read operation commences at the file
offset, and the file offset is incremented by the number of bytes read.
If the file offset is at or past the end of file, no bytes are read,
and read() returns zero.”
Is read() not appropriate here? (Is seeking even possible on stdout?)

I have been spending a lot of time reading over the last few days, so any help is very much appreciated. I am certainly not expecting a fully worked solution, just a few pointers (pun not intended) and, indeed, whether this is even the right approach.

Kind regards.

I wouldn’t recommend printing from device code except for debugging purposes. Why not simply copy the buffer from the device to the host first? You can apply whatever post-processing is desired or required after that.

FWIW, I recently posted a minimal example of printing from device code into a file here:

Thanks, I did participate in that thread and did consider this, but thought it would be cleaner to capture it directly, rather than managing variable file sizes.

What does “capturing it directly” mean? The device-side printf shuffles data into a ring buffer. The CUDA runtime takes the ring-buffer contents and forwards them to stdout on the host. To the best of my knowledge, you cannot intercept data on this path. The earliest possible intercept means redirecting stdout to some other place, which is what I demonstrated in the previous thread.

Again, the direct path that offers the most control is not to use device-side printf, but to copy the raw binary output of the kernel content to the host yourself, where you can slice and dice it however you see fit.

I was under the impression that that was what I was doing above, "dup"ing the stdout fd, closing stdout, and reading from the "dup"ed fd.

Is that not redirecting?

Somehow we are going in circles here. I pointed out how to go about redirecting stdout. You replied: “I did … consider this, but thought it would be cleaner to capture it directly”. What do you mean by “capture it directly”, and why is what I suggested earlier in terms of redirecting stdout “not clean enough”?

I meant reading directly from the stdout stream, rather than writing a file to disk.

Redirecting from the stdout stream to where? stdin?

buf[8]

I certainly wouldn’t know how to do this, and I am not sure it is doable at all. Maybe it can be done through some streambuf magic by creating your own custom stream buffer? Even if you can get that to work, it seems an overly convoluted way of transporting a bit of data.

I would stick to simply copying the raw data from the device to the host with cudaMemcpy{Async}.
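As an illustration of that approach, here is a hedged CUDA sketch (all names — record_result, d_results, d_count, MAX_RESULTS — are hypothetical, not from the original code): each sporadic hit appends its 8-byte result to a device buffer via an atomic counter, and after the loop of launches the host copies out only as many entries as were produced.

```cuda
// Hypothetical sketch: collect sporadic 8-byte results in a device buffer
// instead of printing them from device code.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define MAX_RESULTS 4096            // sized generously for ~260 hits per batch

__device__ uint64_t d_results[MAX_RESULTS];
__device__ unsigned int d_count = 0;

// Called from a kernel wherever the device-side printf used to be.
__device__ void record_result(uint64_t value)
{
    unsigned int idx = atomicAdd(&d_count, 1u); // claim one slot per hit
    if (idx < MAX_RESULTS)
        d_results[idx] = value;
}

// Host side, after the loop of kernel launches.
void collect_results(void)
{
    unsigned int count = 0;
    cudaMemcpyFromSymbol(&count, d_count, sizeof(count));
    if (count > MAX_RESULTS)
        count = MAX_RESULTS;        // clamp in case the buffer overflowed

    uint64_t host_results[MAX_RESULTS];
    cudaMemcpyFromSymbol(host_results, d_results, count * sizeof(uint64_t));

    for (unsigned int i = 0; i < count; i++)
        printf("%16llx\n", (unsigned long long)host_results[i]);

    unsigned int zero = 0;          // reset the counter for the next batch
    cudaMemcpyToSymbol(d_count, &zero, sizeof(zero));
}
```

Since a copy-out is only needed once per batch rather than per loop iteration, the occasional empty transfer should be cheap.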

Thanks, I appreciate the help. I’ve based what I’ve done on something similar in the textbook I’m learning from, as I thought it might be more efficient, given that roughly 259 out of 260 cudaMemcpy* calls would be empty if the copy is done at the end of each loop.

EDIT: The above is not correct - it’s actually closer to 1 empty per 260 loops, so not a problem.