Jamming lots of little things into a big thing, quickly. We have lots of images. We need them in a single 3D array on the card, fast.

Hi, we have a bandwidth/latency problem.

We have a stack of images, typically ~500, each about 512×512 at 2 bytes per pixel (512 KB apiece). These are scattered about in host memory. We want to upload them into a single 3D array on the card, and we want to do it fast. Worst case, we’ll want to upload them again for every rendering we do. The overhead of memcpying each image in the stack to the GPU is about 230 microseconds, adding up to a massive ~125 milliseconds for the whole stack. There doesn’t seem to be any kind of queueing or overlap between these calls either.
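For illustration, the upload pattern described here might look like the following sketch (function names and fixed sizes are assumptions, not the poster’s actual code): each scattered slice gets its own `cudaMemcpy3D` call into the destination `cudaArray`, so the per-call overhead is paid once per slice.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: upload ~500 scattered 512x512 16-bit slices into one
// cudaArray, one API call per slice -- this is where the per-call overhead
// multiplies up.
void upload_slices(unsigned short** slices, int nSlices, cudaArray* volume)
{
    for (int z = 0; z < nSlices; ++z) {
        cudaMemcpy3DParms p = {0};
        p.srcPtr   = make_cudaPitchedPtr(slices[z],
                                         512 * sizeof(unsigned short), 512, 512);
        p.dstArray = volume;
        p.dstPos   = make_cudaPos(0, 0, z);   // destination slice z
        p.extent   = make_cudaExtent(512, 512, 1);
        p.kind     = cudaMemcpyHostToDevice;
        cudaMemcpy3D(&p);                     // one host->device copy per slice
    }
}
```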

I’ve tried memcpying on the host into a contiguous block first, and this results in a refreshingly fast upload (a pinned contiguous block is even better, but less practical). However, the host-to-host memcpy is really slow and rules this out. (Not to mention that doubling the memory footprint of these image stacks is rather unfriendly.)

So, I’m looking for any possible solutions to the problem! Any suggestions are greatly appreciated. (or comments from Nvidia regarding why multiple transfers in this pattern won’t overlap nicely?)

Assuming you have access to the code that allocates memory for the images on the host, modify it: allocate a huge array for the images you will need beforehand, and initialize all your images into this array. That will eliminate the need for the host-to-host memcpy.
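A minimal sketch of this suggestion (the struct and function names are made up for illustration): reserve one contiguous, ideally pinned, block up front and hand out slice pointers into it, so the stack is already in an upload-ready layout and no host-to-host memcpy is needed later.

```cuda
#include <cuda_runtime.h>

// Illustrative: one contiguous (pinned) allocation backs the whole stack.
struct ImageStack {
    unsigned short* base;        // single allocation for all slices
    size_t          sliceElems;  // pixels per slice
    int             nSlices;
};

bool stack_init(ImageStack* s, int nSlices)
{
    s->sliceElems = 512 * 512;
    s->nSlices    = nSlices;
    // Pinned (page-locked) memory makes the later host->device copy faster.
    return cudaHostAlloc((void**)&s->base,
                         (size_t)nSlices * s->sliceElems * sizeof(unsigned short),
                         cudaHostAllocDefault) == cudaSuccess;
}

unsigned short* stack_slice(ImageStack* s, int z)
{
    return s->base + (size_t)z * s->sliceElems;  // image z lives here
}
```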

-Debdatta Basu.

Thanks for the reply, Debdatta. The tricky thing is, while we do have access to the image stack code (in fact it’s our code), it’s common to many upstream users who are (very) resistant to changes in the memory layout (some still use 32-bit and need to worry about memory fragmentation). From our point of view, allocating big contiguous chunks from the word go is a great solution - the memcpy is a cheap simulation of changing the image stack structure. While we will keep trying to persuade people in favour of an altered layout, an ideal solution would be something that doesn’t change the existing memory layout.

If you or anyone else have any alternative ideas, go ahead. (no matter how crazy they might be, if they’re faster than the current method, I’ll try them!)

How about making the memory layout change optional, depending on whether CUDA is being used or not? It certainly makes the image stack code more complicated, but at the same time more flexible for different usage scenarios.

@Bomadeno

Won’t allocating huge contiguous chunks of memory for the image stack REDUCE fragmentation? (Assuming, of course, that the images are never deallocated.)
And I don’t understand how it can be a problem for upstream users if it improves performance. They probably access the image stack through your API, and the contiguous memory change can be incorporated without any changes to it.

-Debdatta Basu

So it sounds like you want to eliminate a host-to-host memory transfer. The only way to avoid that is to have the data already where you want it, i.e. either transfer it from where it already is in memory, or have it already in a better memory layout.

I don’t understand how putting it in a better memory layout could generate so much friction. They should change.

Instead of cudaMemcpy-ing lots of times, write a kernel that gathers the scattered images into a contiguous block on the GPU using zero-copy memory. You only pay the kernel-launch overhead once, and afterwards you can just bind the texture to the GPU memory.
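A sketch of the zero-copy gather idea (illustrative only, with made-up names): it assumes each slice lives in mapped, page-locked memory (e.g. allocated with `cudaHostAlloc(..., cudaHostAllocMapped)` after `cudaSetDeviceFlags(cudaDeviceMapHost)`), and that `hostSlices` is a device-resident array of the device pointers obtained from `cudaHostGetDevicePointer`. The kernel then reads the scattered host slices directly over PCIe and writes one contiguous device volume in a single launch.

```cuda
#include <cuda_runtime.h>

// Gather scattered zero-copy (mapped host) slices into one contiguous
// device buffer. hostSlices must itself be an array in device memory whose
// entries are the mapped device pointers for each host slice.
__global__ void gather_slices(unsigned short* const* hostSlices,
                              unsigned short* volume,   // contiguous device buffer
                              int width, int height, int nSlices)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x < width && y < height && z < nSlices)
        volume[((size_t)z * height + y) * width + x] =
            hostSlices[z][y * width + x];   // read travels over PCIe
}
```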

A host-to-host memcpy should be faster than a host-to-device memcpy, so you can do the upload in parallel with the in-host memcpy:

Copy images 0…9 to chunk A
upload A & Copy images 10…19 to chunk B
upload B & Copy images 20…29 to chunk A
[repeat]

(Use pinned memory for chunks A and B for a faster upload.)

Edit: multithreaded host-to-host memcpy also seems to be faster than single-threaded.
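The pipeline above could be sketched like this (buffer and stream handling simplified for illustration; names are made up): while one pinned chunk is uploading asynchronously on its stream, the CPU packs the next batch of slices into the other chunk, then the roles swap.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Double-buffered upload: pack batch into chunk[b] while chunk[b^1] is
// still being transferred by cudaMemcpyAsync on its own stream.
void pipelined_upload(unsigned short** slices, int nSlices,
                      unsigned short* dVolume, size_t slicePixels, int batch)
{
    unsigned short* chunk[2];
    cudaStream_t    stream[2];
    size_t sliceBytes = slicePixels * sizeof(unsigned short);
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaHostAlloc((void**)&chunk[i], batch * sliceBytes, cudaHostAllocDefault);
    }
    for (int first = 0, b = 0; first < nSlices; first += batch, b ^= 1) {
        int n = (nSlices - first < batch) ? nSlices - first : batch;
        cudaStreamSynchronize(stream[b]);       // chunk[b]'s last upload is done
        for (int i = 0; i < n; ++i)             // host-to-host pack; overlaps the
            memcpy(chunk[b] + i * slicePixels,  // other chunk's async upload
                   slices[first + i], sliceBytes);
        cudaMemcpyAsync(dVolume + (size_t)first * slicePixels, chunk[b],
                        n * sliceBytes, cudaMemcpyHostToDevice, stream[b]);
    }
    cudaDeviceSynchronize();                    // wait for the final upload
}
```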

Wow, thanks for all the replies people.

@cbuchner1, that’s got the same problem as changing the image stack. While we own the code, our upstream users use it, and often they just make the image stack point to some previously allocated memory. (our API, their rules) Again, we can and will try pressuring them to change, if there is no alternative ‘fast way’ with the current memory layout.

@Debdatta So you’d think. But we’re at the bottom of the pile: while we are essential to everything, we get our memory allocation last, after the upstream users have (literally) filled the 32-bit memory space. We have to rely on slipping the image stack into what’s left. If the user has 200 MB free, but it’s in 10 chunks of 20 MB, we still need to be able to load a 190 MB image stack.

@E.D. Riedijk Wouldn’t that require a host-host memcpy to pinned memory first?

@Nighthawk13 Sounds like a plan… I’m going to give that a try and I’ll let you know how it goes. Thanks!

Yes, it would require that, but given that you own the API, you can allocate the memory as pinned in the first place, as I understand it? Or do they do the memory allocation themselves? That would completely invalidate the option ;)

Apparently on Linux it should not be required to allocate pinned memory to be able to do zero-copy, but I doubt NVIDIA has put a lot of effort into making that happen, and it would only be an option on Linux…
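One possible middle ground worth noting (a sketch, not something proposed in the thread): newer CUDA releases (4.0 and later) can page-lock memory the caller already allocated, via `cudaHostRegister`, so the upstream layout stays untouched while each slice still gets pinned-speed transfers.

```cuda
#include <cuda_runtime.h>

// Illustrative: pin existing, user-allocated slices in place. Registration
// can fail (e.g. resource limits), and each region must eventually be
// released with cudaHostUnregister.
bool pin_existing_slices(unsigned short** slices, int nSlices, size_t sliceBytes)
{
    for (int z = 0; z < nSlices; ++z)
        if (cudaHostRegister(slices[z], sliceBytes,
                             cudaHostRegisterDefault) != cudaSuccess)
            return false;
    return true;
}
```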
