Jamming lots of little things into a big thing, quickly. We have lots of images. We need them in a single 3D array on the card, fast.

Hi, we have a bandwidth/latency problem.

We have a stack of images, typically ~500, each about 512×512 at 2 bytes per pixel (512 KB apiece). These are scattered about in host memory. We want to upload them into a single 3D array on the card, and we want to do it fast. Worst case, we’ll want to upload them again for every rendering we do. The overhead of memcpying each image in the stack to the GPU is about 230 microseconds, adding up to a massive ~125 milliseconds for the whole stack. There doesn’t seem to be any kind of queueing or overlap between these calls either.
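For illustration, the upload pattern described here might look like the following sketch (function names and fixed sizes are assumptions, not the poster’s actual code): each scattered slice gets its own `cudaMemcpy3D` call into the destination `cudaArray`, so the per-call overhead is paid once per slice.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: upload ~500 scattered 512x512 16-bit slices into one
// cudaArray, one API call per slice -- this is where the per-call overhead
// multiplies up.
void upload_slices(unsigned short** slices, int nSlices, cudaArray* volume)
{
    for (int z = 0; z < nSlices; ++z) {
        cudaMemcpy3DParms p = {0};
        p.srcPtr   = make_cudaPitchedPtr(slices[z],
                                         512 * sizeof(unsigned short), 512, 512);
        p.dstArray = volume;
        p.dstPos   = make_cudaPos(0, 0, z);   // destination slice z
        p.extent   = make_cudaExtent(512, 512, 1);
        p.kind     = cudaMemcpyHostToDevice;
        cudaMemcpy3D(&p);                     // one host->device copy per slice
    }
}
```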

I’ve tried memcpying on the host into a contiguous block first, and this results in a refreshingly fast upload (a pinned contiguous block is even better, but less practical). However, the host-to-host memcpy is really slow and rules this out. (Not to mention that doubling the memory footprint of these image stacks is rather unfriendly.)

So, I’m looking for any possible solutions to the problem! Any suggestions are greatly appreciated. (or comments from Nvidia regarding why multiple transfers in this pattern won’t overlap nicely?)

Assuming you have access to the code that allocates memory for the images on the host, modify it: allocate a huge array for the images you will need beforehand, and initialize all your images into this array. That will eliminate the need for the host-to-host memcpy.
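A minimal sketch of this suggestion (the struct and function names are made up for illustration): reserve one contiguous, ideally pinned, block up front and hand out slice pointers into it, so the stack is already in an upload-ready layout and no host-to-host memcpy is needed later.

```cuda
#include <cuda_runtime.h>

// Illustrative: one contiguous (pinned) allocation backs the whole stack.
struct ImageStack {
    unsigned short* base;        // single allocation for all slices
    size_t          sliceElems;  // pixels per slice
    int             nSlices;
};

bool stack_init(ImageStack* s, int nSlices)
{
    s->sliceElems = 512 * 512;
    s->nSlices    = nSlices;
    // Pinned (page-locked) memory makes the later host->device copy faster.
    return cudaHostAlloc((void**)&s->base,
                         (size_t)nSlices * s->sliceElems * sizeof(unsigned short),
                         cudaHostAllocDefault) == cudaSuccess;
}

unsigned short* stack_slice(ImageStack* s, int z)
{
    return s->base + (size_t)z * s->sliceElems;  // image z lives here
}
```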

-Debdatta Basu.

Thanks for the reply, Debdatta. The tricky thing is, while we do have access to the image stack code (in fact it’s our code), it’s common to many upstream users who are (very) resistant to changes in the memory layout (some still use 32-bit and need to worry about memory fragmentation). From our point of view, allocating big contiguous chunks from the word go is a great solution - the memcpy is a cheap simulation of changing the image stack structure. While we will keep trying to persuade people in favour of an altered layout, an ideal solution would be something that doesn’t change the existing memory layout.

If you or anyone else have any alternative ideas, go ahead. (no matter how crazy they might be, if they’re faster than the current method, I’ll try them!)

How about making the memory layout change optional, depending on whether CUDA is being used or not? It certainly makes the image stack code more complicated, but at the same time more flexible for different usage scenarios.

@Bomadeno

Won’t allocating huge contiguous chunks of memory for the image stack REDUCE fragmentation? (Assuming, of course, that the images are never deallocated.)
And I don’t understand how it can be a problem for upstream users if it improves performance. They probably access the image stack through your API, and the contiguous memory change can be incorporated without any changes to it.

-Debdatta Basu

So it sounds like you want to eliminate a host-to-host memory transfer. The only way to avoid that is to have the data already where you want it, i.e. either transfer it from where it already is in memory, or have it already in a better memory layout.

I don’t understand how putting it in a better memory layout could generate so much friction. They should change.

Instead of cudaMemcpy-ing lots of times, write a kernel that gathers the scattered images into a contiguous block on the GPU using zero-copy memory. You only pay the kernel-launch overhead once, and afterwards you can just bind the texture to the GPU memory.
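A sketch of the zero-copy gather idea (illustrative only, with made-up names): it assumes each slice lives in mapped, page-locked memory (e.g. allocated with `cudaHostAlloc(..., cudaHostAllocMapped)` after `cudaSetDeviceFlags(cudaDeviceMapHost)`), and that `hostSlices` is a device-resident array of the device pointers obtained from `cudaHostGetDevicePointer`. The kernel then reads the scattered host slices directly over PCIe and writes one contiguous device volume in a single launch.

```cuda
#include <cuda_runtime.h>

// Gather scattered zero-copy (mapped host) slices into one contiguous
// device buffer. hostSlices must itself be an array in device memory whose
// entries are the mapped device pointers for each host slice.
__global__ void gather_slices(unsigned short* const* hostSlices,
                              unsigned short* volume,   // contiguous device buffer
                              int width, int height, int nSlices)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x < width && y < height && z < nSlices)
        volume[((size_t)z * height + y) * width + x] =
            hostSlices[z][y * width + x];   // read travels over PCIe
}
```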

A host-to-host memcpy should be faster than a host-to-device memcpy, so you can do the upload in parallel with the in-host memcpy:

Copy images 0…9 to chunk A
upload A & Copy images 10…19 to chunk B
upload B & Copy images 20…29 to chunk A
[repeat]

(Use pinned memory for chunks A and B for a faster upload.)

Edit: multithreaded host-to-host memcpy also seems to be faster than single-threaded.
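The pipeline above could be sketched like this (buffer and stream handling simplified for illustration; names are made up): while one pinned chunk is uploading asynchronously on its stream, the CPU packs the next batch of slices into the other chunk, then the roles swap.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Double-buffered upload: pack batch into chunk[b] while chunk[b^1] is
// still being transferred by cudaMemcpyAsync on its own stream.
void pipelined_upload(unsigned short** slices, int nSlices,
                      unsigned short* dVolume, size_t slicePixels, int batch)
{
    unsigned short* chunk[2];
    cudaStream_t    stream[2];
    size_t sliceBytes = slicePixels * sizeof(unsigned short);
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaHostAlloc((void**)&chunk[i], batch * sliceBytes, cudaHostAllocDefault);
    }
    for (int first = 0, b = 0; first < nSlices; first += batch, b ^= 1) {
        int n = (nSlices - first < batch) ? nSlices - first : batch;
        cudaStreamSynchronize(stream[b]);       // chunk[b]'s last upload is done
        for (int i = 0; i < n; ++i)             // host-to-host pack; overlaps the
            memcpy(chunk[b] + i * slicePixels,  // other chunk's async upload
                   slices[first + i], sliceBytes);
        cudaMemcpyAsync(dVolume + (size_t)first * slicePixels, chunk[b],
                        n * sliceBytes, cudaMemcpyHostToDevice, stream[b]);
    }
    cudaDeviceSynchronize();                    // wait for the final upload
}
```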

Wow, thanks for all the replies people.

@cbuchner1, that’s got the same problem as changing the image stack. While we own the code, our upstream users use it, and often they just make the image stack point to some previously allocated memory. (our API, their rules) Again, we can and will try pressuring them to change, if there is no alternative ‘fast way’ with the current memory layout.

@Debdatta So you’d think. But we’re at the bottom of the pile: while we are essential to everything, we get our memory allocation last, after the upstream users have (literally) filled the 32-bit memory space. We have to rely on slipping the image stack into what’s left. If the user has 200 MB free, but it’s in 10 chunks of 20 MB, we still need to be able to load a 190 MB image stack.

@E.D. Riedijk Wouldn’t that require a host-host memcpy to pinned memory first?

@Nighthawk13 Sounds like a plan… I’m going to give that a try and I’ll let you know how it goes. Thanks!

Yes, it would require that, but given that you own the API, you can allocate the memory as pinned in the first place, as I understand it? Or do they do the memory allocation themselves? That would completely invalidate the option ;)

Apparently on Linux it should not be required to allocate pinned memory to be able to do zero-copy, but I doubt NVIDIA has put a lot of effort into making that happen, and it would only be an option on Linux…
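One possible middle ground worth noting (a sketch, not something proposed in the thread): newer CUDA releases (4.0 and later) can page-lock memory the caller already allocated, via `cudaHostRegister`, so the upstream layout stays untouched while each slice still gets pinned-speed transfers.

```cuda
#include <cuda_runtime.h>

// Illustrative: pin existing, user-allocated slices in place. Registration
// can fail (e.g. resource limits), and each region must eventually be
// released with cudaHostUnregister.
bool pin_existing_slices(unsigned short** slices, int nSlices, size_t sliceBytes)
{
    for (int z = 0; z < nSlices; ++z)
        if (cudaHostRegister(slices[z], sliceBytes,
                             cudaHostRegisterDefault) != cudaSuccess)
            return false;
    return true;
}
```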
