From the 3 kernels (as seen so far with Nsight Systems), the bottleneck is clearly the H2D copies.
What would be interesting are the Memory section and the Warp State section.
You can open the more detailed views by clicking the small triangle next to the section titles.
The Warp State statistics will show whether the threads were all blocked waiting for data.
The Memory statistics will show how much memory was transferred.
The overview reports an L2 Cache Throughput of 24.2% and flags it as low (below 60%). But 24.2% L2 cache throughput is actually a lot for PCIe traffic.
So the throughput is much higher than expected, whereas the kernel duration is much longer than expected. Perhaps the detailed numbers give a hint as to the reason.
You do not have to understand all the numbers at once (neither do I). When you hover the mouse over a parameter, an explanation is shown. The diagrams in the opened/detailed sections are easier to grasp.
Okay. Well. I watched two videos on how to work with Nsight Compute. There seems to be a lot to gain from using this profiler but also a lot you have to understand first.
These are the main suggestions Nsight Compute mentions on the summary page when launched to perform a full profiling. I suppose they already offer valuable information, although I don’t fully understand what they’re saying.
Here’s the Warp State section you asked for. I suppose the highlighted suggestion is a valuable hint as well. As far as I understand it, I’m not supposed to use thread-local memory. Which brings me to the next screenshot…
Nsight Compute specifically mentions lines 97 and 98 in my source code. While I don’t understand what the problem with line 97 is, my guess about line 98 was that the profiler doesn’t want me to refer to the same array element twice here. Maybe I should store the address in a local variable and access that instead? The suggestion in the screenshot mentioned before, though, makes me think it doesn’t like that I’m creating a new PointXY object within the thread. However, the point of this was to make sure the contents of the host object get copied to device memory. I will try what happens when I use ptRequestDevice[i] = ptRequestHost[i]; instead, though.
Another thing that came to my mind: I do have two M.2 SSDs installed in my system. As far as I know, this may affect the PCIe bandwidth available for other devices such as GPUs. I suppose this alone shouldn’t make the copy kernel take that long, though.
Hi hstlr,
thank you for sharing the images.
That is the copy kernel for H2D transfers?
I do not think it is using local memory. That was just one possible explanation/hint for some values. Local and global memory use very similar hardware features.
First a medium issue
Why are 40 MB read and 78 MB written?
The 40 MB are from reading PCIe and the 78 MB from writing to global memory?
What size is the data format of the elements of PointXY?
In the Memory Workload Analysis it is mentioned that only 16 bytes of the 32 bytes are used per sector.
I would assume that each warp reads and writes data aligned to 32 bytes? Even when considering the offset?
In general
We cannot get full PCIe bandwidth: The pipes are full, the threads are blocking, and nearly nothing gets through.
I think you already used all threads and SMs?
We could switch to wider (vector) transfers (128 bits per access per thread instead of 32 bits) to reduce the number of transactions in the pipes, as recommended in the LG Throttle stalls hint, which you marked in red.
We could use asynchronous transfer commands.
We could even give up zero copy for H2D transfers.
If you want to continue that avenue, I would try to use asynchronous transfers with memcpy_async.
Either
or
With memcpy_async, the data goes into shared memory first; that is why the data is copied as a batch of several transfers. The compute part of the copy kernel would then just store the result in global memory.
There probably is a complete example in the CUDA samples.
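A minimal sketch (my own, untested, not from the CUDA samples) of how the H2D copy kernel could stage data through shared memory with cooperative_groups::memcpy_async; it assumes PointXY is two packed floats (8 bytes), and names like TILE are made up:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void cpyToDeviceAsync(const PointXY* __restrict__ ptRequestHost, // mapped host memory
                                 PointXY* __restrict__ ptRequestDevice,     // device memory
                                 long n) {
    constexpr int TILE = 1024;                       // elements staged per block per iteration
    __shared__ float2 tile[TILE];
    const float2* src = reinterpret_cast<const float2*>(ptRequestHost);
    float2* dst = reinterpret_cast<float2*>(ptRequestDevice);
    cg::thread_block block = cg::this_thread_block();

    for (long base = (long)blockIdx.x * TILE; base < n; base += (long)gridDim.x * TILE) {
        long count = (n - base < TILE) ? (n - base) : TILE;
        // one bulk asynchronous copy per block instead of one load per thread
        cg::memcpy_async(block, tile, src + base, sizeof(float2) * count);
        cg::wait(block);                             // block until the staged data has arrived
        for (long i = threadIdx.x; i < count; i += blockDim.x)
            dst[base + i] = tile[i];                 // write the tile to global memory
        block.sync();                                // the tile can be reused in the next iteration
    }
}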
…I mean, I can see why the profiler would think that I don’t have to copy the memory here as I’m already using mapped pinned memory. But then, this line is the whole point of having a copy kernel…
I have no idea…
sizeof(PointXY) says it’s 8 bytes which makes sense as each PointXY object contains two coordinates of type float.
I experimented with several grid configurations. I think in the profiling run shown in the screenshots I was using the following:
Grid configuration
int blockSize = 1024;
int numBlocks = 60;
const int blockSizeCpy = 512;
const int numBlocksCpy = 60;
int iChunkSize = 5000000;
…while using 3 streams. I suppose with 3 streams it would have been more sensible to set blockSizeCpy = 256 in order to make H2D copying, D2H copying and ptInPoly execution possible all at the same time, but I don’t think this should be the issue here, right?
As mentioned earlier, I’m exploring possibilities to employ the GPU for extensive geometrical calculations on behalf of my company. So, part of what I want to achieve is well-documented proof-of-concept code, ideally following the most important best practices while, on the other hand, offering reasonably good performance. That’s why I tried to dig into overlapping transfers in the first place. Obviously, this doesn’t work in all possible environments, which is a bummer. This brought me to implementing zero-copy kernels in order to enforce overlapping “manually”, which seemed to be a promising way. While digging in further is really interesting and I see there’s a whole world to explore there, you kind of lost me with the in-depth analysis of how the kernels work.
Right now, I feel like I’m opening parenthesis after parenthesis and can’t seem to find the closing ones. I don’t really know what the best way to go on from here would be.
For some reason, CUDA probably reads the x coordinates first and then the y coordinates, instead of combining them into one vector access.
Looking at the memory read by the whole warp, each of the two memory accesses has holes (for the other coordinate); that is why 50% of the bandwidth is lost.
Quick fixes for this: reinterpret_cast the data as float2, or handle the array as a float array with twice as many elements.
A reason could also be that CUDA cannot be sure that PointXY is aligned at 8-byte boundaries. Could you add a manual alignment? (e.g. class alignas(8) PointXY { … };)
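Untested sketch of what the reinterpret_cast quick fix could look like in the copy kernel (assuming PointXY is exactly two packed floats, i.e. 8 bytes without padding; the same pattern with float4 would give the 128-bit accesses mentioned above, copying two points per thread):

__global__ void cpyToDeviceVec(const PointXY* __restrict__ ptRequestHost,
                               PointXY* __restrict__ ptRequestDevice,
                               long iChunkSize, int offset) {
    const float2* src = reinterpret_cast<const float2*>(ptRequestHost);
    float2* dst = reinterpret_cast<float2*>(ptRequestDevice);
    long start = offset + (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;
    for (long i = start; i < offset + iChunkSize; i += stride)
        dst[i] = src[i];                             // one 8-byte load and one 8-byte store per element
}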
If you do those profiling runs, I would just use one single stream to focus on that specific kernel.
I (Munich based) could offer Cuda consulting/implementation work for your company, if needed.
I think zero-copy is working for D2H-Transfers fast enough.
So one option would be to use
cudaMemcpyAsync for H2D (perhaps even with write-combined memory, if it works better with cudaMemcpyAsync), even if the transfers do not overlap, and
zero-copy for D2H, perhaps even integrated into the kernel (instead of a separate copy kernel).
When writing, the threads are not blocked, as they do not have to wait for a result. So it would be simpler to integrate with computations.
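An untested host-side sketch of that first option (buffer names and the ptInPoly parameter list are only assumptions on my part): the H2D transfer becomes an explicit cudaMemcpyAsync into a device buffer, and the kernel writes its bool results directly into the mapped pinned host buffer, so no separate D2H copy is needed afterwards.

cudaStream_t stream;
checkCuda(cudaStreamCreate(&stream));

// H2D: explicit asynchronous copy from the pinned host buffer into device memory
checkCuda(cudaMemcpyAsync(ptRequestDevice, ptRequest,
                          iChunkSize * sizeof(PointXY),
                          cudaMemcpyHostToDevice, stream));

// compute: reads from device memory, stores the bool results directly into the
// mapped host buffer (zero-copy D2H); with UVA the host pointer can be passed directly
ptInPoly<<<numBlocks, blockSize, 0, stream>>>(ptRequestDevice, bPtRequestIn, iChunkSize);

checkCuda(cudaStreamSynchronize(stream));            // results already sit in host memory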
Another option is to try to get memcpy_async with zero-copy kernels to run. memcpy_async copies larger blocks of memory and could be faster and better for H2D.
…and now use getCoordinates() in the copy kernel in order to make CUDA load all the needed information at once.
copy kernel
__global__ void cpyToDeviceKernel(PointXY* __restrict__ ptRequestHost, PointXY* __restrict__ ptRequestDevice, long iChunkSize, int offset) {
    int startIndex = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (long i = startIndex; i < offset + iChunkSize; i += stride) {
        ptRequestDevice[i] = PointXY(ptRequestHost[i].getCoordinates());
    }
}
Now, I would like to check if that makes any difference. (Although, my first test using Nsight System suggests it actually does!)
I will try this in the next step, thank you!
Understood!
Thank you for your offer. If we do actually implement serious, production-aimed software using CUDA in the future, this may be something we will consider. Right now, I suppose, that would be a bit over-the-top.
So, you don’t think this is merely due to the fact that on D2H, only booleans - meaning: wayyy less data - get transferred?
Interesting suggestions, thank you! I will look at those once I have evaluated whether your quick fixes had any impact.
…using float2 member and reading it using getCoordinates() method.
So, apparently, when I use a float2 property, CUDA reads both components of it immediately and caches the not-yet-needed component somewhere. That would explain why there’s only a slight difference between using getX()/getY() and getCoordinates(), respectively. What do you think?
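Roughly, the class now has this shape (a simplified sketch, details of the real class omitted):

// simplified sketch of a PointXY with a single float2 member; the real class differs in details
class alignas(8) PointXY {
    float2 xy;                                       // both coordinates live in one 8-byte member
public:
    __host__ __device__ PointXY() : xy(make_float2(0.0f, 0.0f)) {}
    __host__ __device__ PointXY(float x, float y) : xy(make_float2(x, y)) {}
    __host__ __device__ explicit PointXY(float2 c) : xy(c) {}
    __host__ __device__ float2 getCoordinates() const { return xy; } // one 64-bit load
    __host__ __device__ float getX() const { return xy.x; }
    __host__ __device__ float getY() const { return xy.y; }
};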
[EDIT 2:]
Plus, now I get partly new hints on optimizations in Nsight Compute!
I suppose you did hit something important there :-)
[EDIT 3:]
Btw, I marked @njuffa’s reply as solution because it really answers the original question of the thread. I hope it’s still okay to keep this discussion going. If not, I can open a separate thread.
In the source code image you showed, it refers to the following lines:
You can see two LDG operations on the right-hand side, one with an offset of 4.
Also we got those reads, which were nearly twice as much memory as needed (78 MB instead of 40 MB).
And the 50% coalescing errors.
By using a member variable of type float2, you implicitly changed the alignment to 8 bytes. So it could be better, because of the alignment alone or because of getCoordinates().
But the improvement seems to be much more than 2x?
Are the new speed numbers from Nsight Systems with a single stream?
Better, but still much longer than expected from the amount of transferred memory and the speed of PCIe.
No problem, will be glad to help if needed, just give me a message.
Surely that is one reason, but perhaps also because, when writing, the CUDA threads do not wait for a result and are more efficient; they can continue the program at once instead of blocking as they do for reads.
The hint is that you are using 60 * 256 threads. That is half the maximum.
Which could be fine to run in parallel to compute, but not fine for getting the highest speed possible for the kernel.
Could you please compare with the duration of cudaMemcpyAsync for the current amount of data to be copied in both directions? (Gladly from a past measurement.)
The expectation is that the speed in the end only depends on the PCIe bandwidth.
For efficient background loading, the data transfer should be at least as fast as compute (if possible), otherwise they would still be the bottleneck, when run in the background.
The memcpy_async could lead to a further speed-up, as the threads do not have to wait for each element. On the other hand, I am not sure if memcpy_async was performance-optimized by Nvidia for this use case (zero-copy memory).
Currently you get 40 MB / 19 ms = 2.1 GB/s?
PCIe should give around 10x as much speed. Depending on with how many lanes your GPU is connected.
Right! Reviewing that screenshot, I notice there is also an STG operation with 4 bytes of offset. I suppose this is when I call the PointXY constructor in order to create a deep copy of the object. In the new version, I’m also calling the constructor with a float2 object, so that’s something else that got improved.
Yes and yes.
Here are two profiling runs for comparison.
Both runs used:
1 cudaStream for copying
20,000,000 PointXY objects to copy
a chunksize of also 20,000,000 objects
60 blocks with 1024 threads each for the copy kernel
the first run used two float properties and respective methods
the second run used one float2 property and respective methods
So, this refactoring alone made the copying almost 23 times faster, if my math is right here. In this example, about 152 MiB of data get copied within 74 ms. That’s still around 2 GiB/s of speed.
Here's a test using cudaMemcpyAsync
Here, also 20,000,000 elements got copied using 1 stream.
Thank you for the graphs and results.
The GPU seems to be connected either with PCIe 3.0 x16 or PCIe 4.0 x8, both with a theoretical peak rate of 15.75 GB/s. The bandwidth is bidirectional: both directions can sustain this rate concurrently.
So you could continue in one of the following ways: either
find existing examples of high-bandwidth zero-copy memory read sample programs or benchmarks and compare, or
give up zero-copy for H2D but keep it for D2H (either as a separate copy kernel or integrated with the computation for true zero-copy), accepting non-perfect overlap between transfers and computation, or
implement memcpy_async in the H2D copy kernel, which could get higher bandwidth with fewer threads, but copies to shared memory first and is a small bit more involved, though doable, or
depending on the duration of computation, accept the current copy speed (fast enough), and try to make the H2D copy kernel use fewer threads and work in parallel to the computation.
Maybe this now is due to me having two M.2 SSDs installed. Regretfully, I can’t find any further information on the concrete mainboard built into my laptop, so I can’t definitely confirm that. Anyway…
I will check those suggestions and see which works best for my project. Thank you.
I suppose with the last point you’re suggesting to get back to what we had in mind earlier, using this combined zero-copy approach in order to make copying and kernel execution work in parallel, right? That would require the copy kernels to work at a similar speed when using fewer threads, right? Meaning: copying would still have to take approximately the same duration as the calculations.
Probably it is just the way the GPU is connected. There are some Mini PCs, which limit their PCIe to x8.
But it was important to know to compare the copy kernel speed to what is possible.
Yes, you got 74.3 ms for copying 160 MB H2D with the copy kernel on a single stream with maximum threads. If your computation for that amount of data takes a similar time, longer, or only slightly less, the copy would not have to be faster; it can just happen in the background. If not, then even non-overlapping copies are faster.
If you want to go that way, you have the two options to
make two or three streams and use the copy kernel with fewer threads (so they can run concurrently), or
try to integrate the copy kernel into your computation kernel again, either
directly reading the data as needed (zero-copy fashion; see the sketch further below) or
use dedicated warps to copy to shared memory (or to global memory) and read the data from there with the computation threads, or
have warps sometimes do computation, sometimes do data loading
But probably 74.3ms is too long and one of the other three approaches in my previous post would be better.
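For the variant that reads the data directly as needed, the computation kernel could look roughly like this (untested sketch; ptInPolyTest stands in for your actual point-in-polygon routine, and the polygon parameters are placeholders):

__global__ void ptInPolyZeroCopy(const PointXY* __restrict__ ptRequestHost,  // mapped host memory
                                 bool* __restrict__ bPtRequestIn,            // mapped host memory
                                 const PointXY* __restrict__ polygon, int nPolygon,
                                 long iChunkSize, int offset) {
    long start = offset + (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;
    for (long i = start; i < offset + iChunkSize; i += stride) {
        float2 p = ptRequestHost[i].getCoordinates();          // H2D read over PCIe (zero-copy)
        bPtRequestIn[i] = ptInPolyTest(p, polygon, nPolygon);  // placeholder for the actual test
    }
}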
Okay. Thank you.
I will have another look at Nsight Compute at this point. It tells me something about warp stalls; resolving those supposedly gives more opportunity to optimize copying speed. Maybe I will be able to get more out of that.
I just tried using our custom copy kernels (formerly zero-copy) with the same kind of pinned memory as the original overlapping implementation, instead of the mapped memory we needed for zero-copy. Meaning:
Code snippets for pinned memory with "zero-copy" approach
switch (testProperties.gpuMemoryStrategy) {
case OVERLAP_TRANSACTIONS:
    checkCuda(cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY))); // memory for the PointXY objects
    checkCuda(cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool))); // memory for returning the results
    break;
case ZERO_COPY:
    if (devProps.canMapHostMemory) {
        //checkCuda(cudaHostAlloc(&ptRequest, cPtRequest * sizeof(PointXY), cudaHostAllocMapped /* | cudaHostAllocWriteCombined*/)); // memory for the PointXY objects
        //checkCuda(cudaHostAlloc(&bPtRequestIn, cPtRequest * sizeof(bool), cudaHostAllocMapped /* | cudaHostAllocWriteCombined */ )); // memory for returning the results
        checkCuda(cudaMallocHost((void**)&ptRequest, cPtRequest * sizeof(PointXY))); // memory for the PointXY objects
        checkCuda(cudaMallocHost((void**)&bPtRequestIn, cPtRequest * sizeof(bool))); // memory for returning the results
    }
    else {
        return false;
    }
    break;
}
So, just using pinned memory makes this 2x as fast. Does this make any sense at all?
Seeing that I don’t actually use “zero-copied” data in this approach I wouldn’t really need mapped host memory, right?
Hm, okay. How is this still zero-copy? I thought zero-copy means that the GPU directly accesses the data stored in host RAM?
Nice. I’ll keep it like this then.
Yes, I know. I didn’t think this was relevant here as I just wanted to directly compare mapped host memory and pinned memory. For further tests concerning overall performance I will experiment with a few grid configurations for H2D, D2H and kernel execution.
The H2D copy kernel itself is seen by CUDA like a normal kernel. When it reads, it does so directly from host memory. (That it stores the values afterwards is incidental.)
It does not need an explicit copy like cudaMemcpy(Async) before the H2D copy kernel.
You can argue that it is not actual zero-copy when a copy kernel is used; then it is just a matter of naming. But the memory requirements should actually be the same for a compute kernel using zero-copy and for a copy kernel using zero-copy memory to read data from the host.
I see! This makes sense. Thank you for your explanation. In that case, I’m surprised that zero-copy seems to be possible without using mapped host memory and the cudaHostGetDevicePointer call. Anyway, I will dig into this a little bit more.
I’m currently going away from in-depth analyzing the copy kernel a bit. Instead, I am looking at efficiently parallelizing the copying and kernel executions.
I can’t seem to get CUDA to execute the first ptInPoly kernel in parallel with the second H2D device copying operation though.
const int iNumStreams = 2;
int blockSize = 1024;
int numBlocks = 20;
const int blockSizeCpy = 256;
const int numBlocksCpy = 20;
This is for testing purposes only. I am aware that this configuration is not ideal in respect to device occupancy.
In the screenshot, we can see that the later ptInPoly calls do in fact happen in parallel to H2D or D2H copying operations. However, the fact that the second H2D copying operation only starts after the ptInPoly execution has finished suggests to me that the device is fully occupied with ptInPoly. This shouldn’t be the case, though. Or am I overlooking something here?
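To make clear what I’m doing, the launch pattern is roughly the following (a simplified sketch, not the literal code; some names are approximations):

for (int chunk = 0; chunk < numChunks; ++chunk) {
    cudaStream_t s = streams[chunk % iNumStreams];
    int offset = chunk * iChunkSize;

    // H2D copy kernel: pull this chunk from the pinned host buffer into device memory
    cpyToDeviceKernel<<<numBlocksCpy, blockSizeCpy, 0, s>>>(ptRequest, ptRequestDevice, iChunkSize, offset);

    // compute on the chunk just copied (same stream, so it waits for the copy above)
    ptInPoly<<<numBlocks, blockSize, 0, s>>>(ptRequestDevice, bPtRequestInDevice, iChunkSize, offset);

    // D2H copy kernel: write the bool results back to the host buffer
    cpyToHostKernel<<<numBlocksCpy, blockSizeCpy, 0, s>>>(bPtRequestInDevice, bPtRequestIn, iChunkSize, offset);
}
for (int i = 0; i < iNumStreams; ++i)
    checkCuda(cudaStreamSynchronize(streams[i]));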
According to the device properties, my GPU should support the following:
Could you reduce blockSize to 896? (896 is also divisible by 128, but still larger than 1536/2) and reduce blockSizeCpy to 128?
As your numBlocks is 20, you could even go lower with blockSize
Just to make it easier for the kernels to be run in parallel.
Reasons that they are not parallel could be
rounding (OTOH 1024 and 256 are quite round numbers): perhaps blocks cannot be combined in arbitrary numbers, but only in multiples of some size.
Another reason could be small CUDA kernels running in the background by the operating system (but wouldn’t Nsight Systems also detect those?)
The aim should be to get any configuration, which runs the streams or the kernels in parallel. And afterwards increase the block sizes again.
Same result… Even with blockSize = 640, the second H2D copy operation only starts after the ptInPoly kernel has finished. Strangely, the overall runtime decreased by about 0.1 seconds for blockSize = 640; blockSizeCpy = 128 compared to blockSize = 896; blockSizeCpy = 128. Apparently, the ptInPoly kernels get executed somewhat quicker. Anyway, maybe this is another issue.
Even the following configuration doesn’t give me the best possible parallelism:
int blockSize = 640;
int numBlocks = 8;
const int blockSizeCpy = 128;
const int numBlocksCpy = 8;
I mean…this way the GPU should be under-occupied all the time, right?