GTX480 Streams Issues

Jnesp · April 9, 2014, 11:33am

Hello,

I’m having trouble with this GPU (GTX 480) when I try to use streams in my code. I’ve checked the three requirements for overlap: deviceOverlap is reported as supported, the kernel executions and data transfers to be overlapped are issued in different non-default streams, and the host memory involved is pinned. Even so, the NVIDIA profiler shows that the data transfers and kernels do not overlap.

Finally, I tried running this basic example, just to make sure the problem isn’t in my code:

https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu

But it still does not work: the memcpys and kernels do not overlap.

Does anyone know whether there is a known problem with using streams on this GPU?
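
For readers following along, the three requirements listed above can be sketched roughly as follows. This is a minimal illustration written for this thread, not the linked async.cu sample: the kernel `scale`, the helper `blocksFor`, the `CHECK` macro, and the sizes are all made up for the example. Whether the profiler actually shows overlap still depends on the GPU, driver model, and profiler settings.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-check macro (not part of the CUDA API).
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s failed: %s\n", #call,                      \
                    cudaGetErrorString(err_));                             \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel: multiply each element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int nStreams = 4;
    const int chunk = 1 << 20;            // elements per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    // Requirement: host memory must be pinned (page-locked).
    float* hostData = nullptr;
    CHECK(cudaMallocHost((void**)&hostData, n * sizeof(float)));
    float* devData = nullptr;
    CHECK(cudaMalloc((void**)&devData, n * sizeof(float)));
    for (int i = 0; i < n; ++i) hostData[i] = 1.0f;

    // Requirement: use different non-default streams.
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stream copies its chunk in, runs the kernel, and copies it back.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        CHECK(cudaMemcpyAsync(devData + offset, hostData + offset, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<blocksFor(chunk, 256), 256, 0, streams[s]>>>(
            devData + offset, chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hostData + offset, devData + offset, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    // Verify that every element was doubled.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (hostData[i] != 2.0f) { ok = false; break; }
    }
    printf("result %s\n", ok ? "OK" : "WRONG");

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(devData));
    CHECK(cudaFreeHost(hostData));
    return ok ? 0 : 1;
}
```

Compiled with nvcc and run under the profiler, a program like this gives the copy engine and the kernel work from different streams a chance to overlap; it does not guarantee that they will.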

Sumit_Kumar · April 9, 2014, 2:25pm

This may help you: I just copied the code and ran it on my Tesla K20c, and the overlap works perfectly. H2D, kernel, and D2H all overlap.

Jnesp · April 9, 2014, 5:38pm

Ok, thanks!

So you simply copied the code, compiled it, and ran the executable in the profiler, without adding anything. Is that correct?

Sumit_Kumar · April 9, 2014, 6:17pm

Yes.

njuffa · April 9, 2014, 6:56pm

I am not familiar with the GTX 480 or the linked example. A couple of thoughts:

(1) The GTX 480 has a single Copy Engine (DMA engine), while the Tesla K20c has two. This means that the GTX 480 can overlap kernel execution with a copy in one direction (either host->device, OR device->host), but cannot perform simultaneous up- and downloads. The Tesla K20c can, at the same time, execute a kernel, transfer data host->device, and transfer device->host.

(2) There are various ways in which concurrency could be disabled. Examples include setting the environment variable CUDA_LAUNCH_BLOCKING=1, invoking the profiler with --concurrent-kernels-off, or enabling serialized trace. Check the option under Nsight|Options…|Analysis|CUDA Kernel Trace Mode.
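
Both points can be checked from a small program. The sketch below queries the number of copy engines through `cudaDeviceGetAttribute` and looks at the CUDA_LAUNCH_BLOCKING environment variable. The helpers `describeCopyEngines` and `launchBlockingEnabled` are hypothetical names introduced here, and treating only the exact value "1" as enabled is an assumption of this sketch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: what a given copy-engine count allows.
const char* describeCopyEngines(int asyncEngineCount) {
    if (asyncEngineCount <= 0) return "no copy/kernel overlap";
    if (asyncEngineCount == 1) return "kernel + copy in one direction";
    return "kernel + H2D copy + D2H copy";
}

// Hypothetical helper: true if the variable's value is exactly "1".
bool launchBlockingEnabled(const char* value) {
    return value != nullptr && value[0] == '1' && value[1] == '\0';
}

int main() {
    int device = 0;  // assumption: the GPU of interest is device 0
    int engines = 0;
    int concurrentKernels = 0;

    cudaError_t err =
        cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device);
    if (err == cudaSuccess) {
        err = cudaDeviceGetAttribute(&concurrentKernels,
                                     cudaDevAttrConcurrentKernels, device);
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "attribute query failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("asyncEngineCount:  %d (%s)\n", engines,
           describeCopyEngines(engines));
    printf("concurrentKernels: %d\n", concurrentKernels);

    // With CUDA_LAUNCH_BLOCKING=1, launches are synchronous, so no overlap
    // should be expected.
    if (launchBlockingEnabled(std::getenv("CUDA_LAUNCH_BLOCKING"))) {
        printf("CUDA_LAUNCH_BLOCKING=1 is set; launches are synchronous\n");
    } else {
        printf("CUDA_LAUNCH_BLOCKING is not set to 1\n");
    }
    return 0;
}
```

Going by point (1), a GTX 480 would be expected to report one copy engine and a Tesla K20c two. The profiler-side settings in point (2) cannot be detected this way and still have to be checked in the tool itself.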

Jnesp · April 10, 2014, 3:02pm

OK, I have gone through all of these suggestions and it still does not work. One question: which operating system are you using, Linux or Windows?

Sumit_Kumar · April 10, 2014, 7:23pm

Me?
Anyways, Linux.

Jnesp · April 24, 2014, 3:15pm

Note this: on Linux, my code uses streams without problems. I do not know whether there is some issue between this card (GTX 480) and Windows that prevents data transfers and kernels from overlapping.

Thanks for your help!

Robert_Crovella · April 24, 2014, 3:26pm

In my experience, it’s difficult to get a WDDM GPU in Windows to work correctly with concurrency. One of the issues is that WDDM batches commands to the GPU. This batching can interfere with the expected sequencing of operations, which becomes visible when you try to profile the app.

You won’t be able to put your GeForce device in TCC mode, but for GPUs that can be run in TCC mode, it’s usually easier to get expected results in these cases.
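
To see which case applies, the runtime exposes a device attribute that reports whether the TCC driver is in use. The following is a small sketch assuming device 0; the helper `driverModelLabel` is a hypothetical name, and reading a value of 0 on Windows as WDDM is an inference from the discussion above rather than something the attribute states directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: label for the value of cudaDevAttrTccDriver.
const char* driverModelLabel(int tccDriver) {
    return tccDriver != 0 ? "TCC" : "not TCC";
}

int main() {
    int device = 0;  // assumption: the GPU of interest is device 0
    int tcc = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "attribute query failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // On Windows, "not TCC" would imply the WDDM driver model.
    printf("Device %d driver model: %s\n", device, driverModelLabel(tcc));
    return 0;
}
```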
