If I have a for loop invoking cudaMemcpyAsync where I always use the zero stream (the default stream), can I expect the data to be copied to the destination in parallel and asynchronously, and therefore see a speedup in my program? Or do I need to associate a distinct stream with each value of i to see a speedup? For example:
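A minimal sketch of the two variants I mean (`h_dst`, `d_src`, `CHUNK`, `N`, and `streams` are placeholder names, not code from my actual program):

```cuda
// Variant 1: every copy issued into the default (0) stream
for (int i = 0; i < N; i++) {
    cudaMemcpyAsync(h_dst + i * CHUNK, d_src + i * CHUNK,
                    CHUNK * sizeof(float), cudaMemcpyDeviceToHost, 0);
}

// Variant 2: one non-default stream per iteration
for (int i = 0; i < N; i++) {
    cudaMemcpyAsync(h_dst + i * CHUNK, d_src + i * CHUNK,
                    CHUNK * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
```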
cudaMemcpy operations issued in the same direction (e.g. host to device) will always serialize. The data will not be "copied in parallel". This is due to the characteristics of the PCIe bus: only one operation can be transmitted in a given direction at a time.
It's not really clear what you are trying to accomplish. The usual reasons for use of the async API are for overlap:
kernel - kernel
memcpy - kernel
memcpy - memcpy (one in one direction, the other in the other direction)
host compute - device compute
There are many nuances to get this correct. I would suggest that you start by reading the section on asynchronous concurrency in the programming guide.
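A rough sketch of the memcpy - kernel case (names here are illustrative; the host buffer must be pinned, i.e. allocated with cudaMallocHost or cudaHostAlloc, for cudaMemcpyAsync to actually be asynchronous):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
// h_a is a pinned host buffer; d_a and d_b are device buffers
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0); // copy in stream s0
kernel<<<grid, block, 0, s1>>>(d_b);   // kernel in stream s1 can overlap the copy
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
```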
I'm going to bump this because I'm doing something similar, except I'm trying to use cudaMemcpyAsync to pull back data from the GPU at random times. What I want to do is launch a kernel on the default null stream, and then create another stream to handle the async memory copies. I'm trying to get a kernel running that's constantly copying back data to the host until it completes.
For example:
cudaMallocHost(&img, size); // pinned host buffer; fixed typo and moved before kernel launch
cudaStream_t stream1;
cudaStreamCreate(&stream1);
kernel<<<blocks, threads>>>(d_img); // launched into the default null stream
while(...) {
cudaMemcpyAsync(img, d_img, size, cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream1); // wait for the copy before touching img
cpuFunction(img);
}
cudaMemcpy(finalimg, d_img, size, cudaMemcpyDeviceToHost);
cudaFreeHost(img);
cudaStreamDestroy(stream1);
But when I do this, I get back an array of zeros in "img", which cannot be right, because the final processed image with real and complex values always comes out fine at the end when I do a regular cudaMemcpy.
Does this just not work in the way I am intending? I'm not even sure if the portion between the kernel and cudaMemcpy is happening asynchronously to the kernel computing, because if I add a "Sleep(5000)" right under the kernel, it adds almost exactly that much time to the program run time. That doesn't make sense to me, because I thought kernel launches were asynchronous with respect to the host, and because the kernel takes a lot longer than 5 seconds to complete, I figured I wouldn't see the increase in total run time at all.
Kernel launches are asynchronous with the host. Kernel launches are synchronous with respect to other activity in the same stream, i.e. all operations in the same stream execute in order and consecutively. If you want overlap of kernel execution and asynchronous copies, avoid the null stream, as it has special properties.
I am reasonably confident that all CUDA memory allocation and de-allocation API calls from the host are completely synchronous with the GPU, i.e. they only take place when the GPU is idle (otherwise one might "pull out the rug" from underneath running kernels).
The logic of pulling data from the GPU at random times, while a kernel is running, escapes me. How do you make sure each data set pulled is consistent (e.g. from the same iteration of an algorithm running on the kernel)?
There is potentially a lot of ground to cover and a lot of things to unravel here. You might want to run things with the visual profiler to understand what is overlapping with what.
In general I don't recommend using the default null stream at all when you are trying to do multi-stream concurrency. If you read the programming guide, it explains why. So my suggestion would be to create a non-default stream for the kernel.
There is no cudaHostMalloc. I assume you meant cudaHostAlloc (there is also cudaMallocHost). A cudaHostAlloc issued after a kernel like this might not begin until the kernel has completed. The reason is that cudaHostAlloc can modify the GPU memory map (depending on the parameters of the call, which you haven't shown, as well as whether UVA is active in your setup), and the GPU memory map cannot be modified while a kernel is running. Therefore I would expect this to be a blocking call, which might be inconsistent with your statements about adding the Sleep() function. Apart from that, I would do the cudaHostAlloc prior to launching the kernel.
Given the above, your kernel should have completed by the point at which your very first cudaMemcpyAsync begins. This means that the first call should retrieve whatever is in d_img. Again, inconsistent with your report.
In a producer-consumer model like this, there is no guarantee that global memory holds what you think it does. This depends on the states of the caches and what your code looks like exactly. If you want global memory to definitely have data, there are various code mechanisms like volatile, __syncthreads(), __threadfence(), and atomics, which will affect visibility of data in global memory. Having said that, L2 cache should be a pretty good proxy, and your cudaMemcpyAsync function will hit in L2 before it goes to global.
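For instance (a sketch only, with placeholder names, not your code), a kernel that wants its in-progress results to be visible to concurrent copies might publish them like this:

```cuda
__global__ void progressKernel(volatile float *out, volatile int *iter, int niters) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int it = 0; it < niters; it++) {
        out[idx] *= 1.01f;          // update the result in global memory
        __threadfence();            // make the write visible device-wide before publishing
        if (idx == 0) *iter = it;   // publish progress only after the data is flushed
    }
}
```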
So my suggestion would be a short, complete test case that demonstrates your behavior. And if this is on windows, WDDM can have all sorts of interfering effects on concurrency, so a description of your test setup would be in order also.
It doesn't need to be consistent, only to show "compute progress". No matter when or what I pull out, I should see some kind of image that slowly "comes into focus". The further the kernel gets, the clearer the image will be. I want to be able to do this without having to break at some point during the computation.
Iāll take that into account.
Sorry, I meant cudaMallocHost. I'll fix that typo. And I've also moved it to before the kernel launch.
Visual Profiler just doesn't want to work with this program, for reasons unknown, and it's definitely not fun going without it. It's a larger, deeply embedded C program that I've now added some GPU acceleration to. When I run it through the profiler, I see the program outputting in the console, but an error comes out saying no CUDA application was profiled (I've added cudaProfilerStart() and Stop() in the relevant portions of the code).
I'm going to work on this a little bit more and see if I can get anywhere. I'll probably just start a new CUDA project and try to make something similar where the profiler should at least work.
"Visual Profiler just doesn't want to work with this program"
Your kernel/code sounds persistent, or semi-persistent.
I am wondering whether the profiler can indeed profile persistent code - it seems to hate kernels that hardly terminate, as it then struggles to "wrap" the kernel.
The profiler cannot profile a kernel that never terminates. You may be able to work around this in some long-running scenarios with profiler start/stop controls, but a kernel that never terminates cannot be profiled.
The profiler also has a variety of limits in what it can successfully profile. Many of these limits arise from the underlying counters used to keep track of statistics. The underlying counters have finite widths, and when an overflow/rollover condition occurs, profiling will be impacted. The usual suggestion is to trim down your kernel in some way.
I'm just thinking it may not be possible to do what I want.
Here's some code I quickly wrote up demonstrating what I am trying to accomplish: I want to launch a kernel on stream0, and then continuously do some cudaMemcpyAsync on stream1 to work on the data on the host side. The problem is that if I do not call cudaStreamSynchronize() directly after the D2H async copy, I get an array of zeros on the host. And if I do call cudaStreamSynchronize(), it completely blocks the host, and in a really awkward way!
Case 1:
Without cudaStreamSynchronize()
Result: D2H overlaps as it should, but I cannot work with the data, as I get zeros - notice the sums in the lower left of the screenshot. I'm assuming this is because the D2H doesn't finish before I try working on it.
Case 2:
With cudaStreamSynchronize()
Result: Completely blocked, and somehow the D2H copies are happening before the kernel launch, even though the kernel launch comes first in the code? What is happening???
The code is below. This should just compile and run as is on Windows/VS2013/CUDA 7.5RC.
#include <stdio.h>
#include <cstdlib>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda_profiler_api.h>
#define NPS 512
#define NX 4096
#define NY 4096
#define BLOCKSIZEX 16
#define BLOCKSIZEY 16
__global__ void fKernel(float *arrout) {
__shared__ float var[BLOCKSIZEX][BLOCKSIZEY];
//Global coordinates and index
size_t i = (blockIdx.x * blockDim.x) + threadIdx.x;
size_t j = (blockIdx.y * blockDim.y) + threadIdx.y;
size_t index = j * NX + i;
if (index < NX * NY) {
//Local tile coordinates
size_t idx = threadIdx.x;
size_t idy = threadIdx.y;
//Copy to shared memory
var[idx][idy] = arrout[index];
__syncthreads();
//Work on the shared memory
for (int i = 0; i < NPS; i++)
{
var[idx][idy] *= 1.01f;
}
//Copy back to the global memory
arrout[index] = var[idx][idy];
}
}
float sum(float *arr, int len)
{
float sum = 0;
for (int i = 0; i < len; i++) {
sum += abs(arr[i]);
}
return sum;
}
int main()
{
cudaProfilerStart();
dim3 numThreads(BLOCKSIZEX, BLOCKSIZEY);
dim3 numBlocks(NX / numThreads.x, NY / numThreads.y);
float *h_outVector = (float *)malloc(NX * NY * sizeof(float));
for (int i = 0; i < NX * NY; i++)
{
h_outVector[i] = (float)i;
}
float *d_outVector;
cudaMalloc(&d_outVector, NX * NY*sizeof(float));
cudaMemcpy(d_outVector, h_outVector, NX * NY *
sizeof(float), cudaMemcpyHostToDevice);
float *res00, *res01, *res02;
cudaMallocHost(&res00, NX * NY * sizeof(float));
cudaMallocHost(&res01, NX * NY * sizeof(float));
cudaMallocHost(&res02, NX * NY * sizeof(float));
cudaStream_t stream00;
cudaStreamCreate(&stream00);
fKernel<<<numBlocks, numThreads, 0, stream00>>>(d_outVector);
cudaStream_t stream01;
cudaStreamCreate(&stream01);
size_t asyncCounter = 0;
double sum00 = 5, sum01 = 3, sum02 = 0;
/*The reason for this loop is to break when two subsequent copies are identical,
which should only be possible after the kernel has completed. But I cannot test that
yet, since my overlapped transfers always return an array of zeros. */
while (asyncCounter < 6) {
if (asyncCounter % 2 == 0) {
cudaMemcpyAsync(res00, d_outVector, NX * NY *
sizeof(float), cudaMemcpyDeviceToHost, stream01);
cudaStreamSynchronize(stream01); //Result is array of zeros without this
sum00 = sum(res00, NX * NY);
printf("sum00: %.2f\n", sum00);
}
else {
cudaMemcpyAsync(res01, d_outVector, NX * NY *
sizeof(float), cudaMemcpyDeviceToHost, stream01);
cudaStreamSynchronize(stream01); //Result is array of zeros without this
sum01 = sum(res01, NX * NY);
printf("sum01: %.2f\n", sum01);
}
asyncCounter += 1;
}
cudaMemcpy(res00, d_outVector, NX * NY *
sizeof(float), cudaMemcpyDeviceToHost);
sum02 = sum(res00, NX * NY);
printf("Final sum02: %.2f\n", sum02);
cudaProfilerStop();
cudaFreeHost(res00); cudaFreeHost(res01); cudaFreeHost(res02);
cudaFree(d_outVector);
free(h_outVector);
cudaDeviceReset(); // reset last, after all CUDA allocations have been freed
return 0;
}
So it's perhaps closer to what you were expecting (I think).
The problem is that independent streams are asynchronous to each other, so they can have any timing relationship at all with respect to each other, including the behavior you've demonstrated, where a CUDA kernel issued before a cudaMemcpyAsync operation executes after the cudaMemcpyAsync operation. Windows isn't helping you here, but the problem is not Windows; the problem is that you are assuming a particular kind of synchronization should occur, which your application does not enforce.
Try taking a look at this example instead (i.e. the code in the answer that provides a worked code), which does something similar:
The use of the "mailboxes" there forces the host-issued cudaMemcpyAsync operations to "wait" until the data is ready to be consumed.
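Roughly, the idea is this (a sketch of such a mailbox, not the linked code verbatim; it assumes the flag lives in mapped pinned memory allocated with cudaHostAlloc and the cudaHostAllocMapped flag):

```cuda
// Device side: produce a batch, publish it, wait for the host to consume it.
__global__ void producer(float *data, volatile int *mailbox) {
    // ... write one batch into data ...
    __threadfence_system();           // flush writes so the host / copy engine can see them
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *mailbox = 1;                 // signal: batch ready
        while (*mailbox != 0) { }     // spin until the host acknowledges
    }
}

// Host side: poll the mapped flag, copy, consume, acknowledge.
void consumeOne(volatile int *h_mailbox, float *h_buf, float *d_data,
                size_t bytes, cudaStream_t copyStream) {
    while (*h_mailbox != 1) { }        // wait for the kernel to publish
    cudaMemcpyAsync(h_buf, d_data, bytes, cudaMemcpyDeviceToHost, copyStream);
    cudaStreamSynchronize(copyStream); // the copy must finish before consuming
    // ... use h_buf ...
    *h_mailbox = 0;                    // acknowledge; the kernel can proceed
}
```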
As an aside, I would create all my streams up front in an application like this. Avoid creating streams, doing cudaMalloc, initializing libraries, doing cudaHostAlloc, or other operations like these in time-critical loops or in areas where you are trying to achieve a particular concurrency.
And not calling cudaStreamSynchronize after a cudaMemcpyAsync operation, when you intend to "consume" the data on the host, is completely broken. You're getting all zeroes because, without that barrier, your sum function is using data that the cudaMemcpyAsync operation has not even written yet, because that operation has not even run yet. The cudaStreamSynchronize forces the operation to complete before the sum routine tries to use the data.
That's the behavior I was looking to see (increasing random values every time an async memory copy is made, because that is what the kernel is doing). I'll take a look at the link early next week and see what I can do, as I'd like the code to work multi-platform. I really appreciate the time and help, txbob!