Performance penalty for images with 'non-proper' start address?

Hi, I have an image I on the device (properly allocated via ‘cudaMallocPitch’).
Now, I want to create a ‘sub-image’ S for a rectangular part of the original image I by ‘shallow copy’, so that the buffer address in S simply refers to a specific offset in the original image I.
This way, I don’t have to copy anything, and any modification to the pixels in S (e.g. by calling an NPP function on S) actually changes the pixel values in I.
On the CPU, this can be done nicely (e.g. it works fine for applying IPP functions only to a specific rectangular part of the image).

So, my questions:

  • Can this be done? I suppose it can be done in the same way as for CPU image buffers, by doing some offset calculations (see the sketch after this list). Note that the pitch of S and I will be the same.
  • Does it incur any (significant) performance penalty when a kernel processes an image buffer that does not start at a ‘properly aligned’ address (as will be the case for the start address of the buffer in S)? I suppose there will be some coalescing issues, and it might depend on the compute capability (e.g. 1.3 vs. 2.0).
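
A minimal sketch of what I have in mind, assuming an 8-bit single-channel image; the image size, ROI origin/size and the trivial kernel are made-up placeholders for illustration. The sub-image shares the parent’s pitch and nothing is copied; the same pointer/pitch pair could then also be handed to an NPP ROI function:

```
// Sketch only: create a "view" S into a pitched image I by pointer arithmetic.
// Sizes, ROI offsets and the kernel are placeholder values for illustration.
#include <cuda_runtime.h>

__global__ void incrementRoi(unsigned char *roi, size_t pitch, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        roi[y * pitch + x] += 1;   // writes land directly in the parent image I
}

int main()
{
    const int width = 1024, height = 768;   // parent image I (8-bit, 1 channel)
    const int roiX = 100, roiY = 50;        // top-left corner of sub-image S
    const int roiW = 300, roiH = 200;

    unsigned char *dImage = NULL;
    size_t pitch = 0;
    cudaMallocPitch((void **)&dImage, &pitch, width, height);

    // 'Shallow copy': S is just an offset into I, same pitch, no data copied.
    unsigned char *dRoi = dImage + (size_t)roiY * pitch + roiX;

    dim3 block(16, 16);
    dim3 grid((roiW + block.x - 1) / block.x, (roiH + block.y - 1) / block.y);
    incrementRoi<<<grid, block>>>(dRoi, pitch, roiW, roiH);
    cudaDeviceSynchronize();

    cudaFree(dImage);
    return 0;
}
```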

Yes, that could potentially cause some coalescing issues on 1.x devices, which would show up as a pretty major performance penalty.

From my understanding, doing an offset copy doesn’t affect 2.x devices as much, thanks to the caches, provided the contiguous data segments you are copying are large enough. Some testing I saw put the penalty at about 5%, which isn’t too shabby. But maybe in your 2D dataset you will switch rows often, which means the caches will be able to do less to dampen the effect?

I hope I understood your problem correctly.

For clarification:
Actually, I don’t want to copy anything. My sub-image S refers to a rectangular part of the image buffer of I. It’s really a ‘reference’, meaning that changing the pixel values in S (by a CUDA kernel) actually changes the pixel values in I.

You want to read the image S, which is a subset of the larger image ‘I’, modify the data, and write back to the same address, yes?

I was referring to an example where they read from global memory with an offset (uncoalesced) and wrote to another address without an offset (coalesced).

1.x - Unaligned memory access should give you worse performance.
2.x - Should be better if reading long contiguous segments.

Cheers,

Actually, I would subdivide this further:

1.0, 1.1 - Unaligned gives you 1/16th throughput

1.2, 1.3 - Unaligned gives you 1/2 throughput

2.0, 2.1 - Unaligned penalty is small thanks to on-chip cache

Yep, that’s it :)

I’m curious though: is it really true that the unaligned penalty is always small on 2.x devices? If you’re, for example, accessing each row of a 128x128 submatrix (where each element is a 1-byte char) unaligned within a larger dataset, wouldn’t the unaligned penalty become greater? Wouldn’t that require two 128-byte accesses?

I guess if you set caching to use L2 only, it would be serviced with just an extra 32-byte access, so it would be 128+32 instead of 128+128?

  • Programming Guide 3.2, G.4.2 Global Memory
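
As far as I understand, the ‘L2 only’ behaviour on 2.x is selected at compile time rather than through an API call; this is an assumption on my part, sketched below (file name and kernel are placeholders):

```
// Assumption: on compute capability 2.x, global loads are normally cached in
// L1 in 128-byte lines (-Xptxas -dlcm=ca, the default). Compiling with
//
//     nvcc -arch=sm_20 -Xptxas -dlcm=cg roi_kernel.cu -c
//
// makes global loads bypass L1 and be serviced by L2 in 32-byte segments, so a
// misaligned 128-byte row read should cost 128+32 bytes instead of 128+128.
__global__ void readRow(const unsigned char *row, unsigned char *out, int w)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < w)
        out[x] = row[x];   // 'row' may start at an odd byte offset inside I
}
```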

Thanks!

I’m sure there’s some overhead from two trips to the cache due to misalignment, but it’s not as bad as two trips to global memory. This sounds like a good test for a microbenchmark. :)
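
A rough sketch of such a microbenchmark (my own illustration, with made-up sizes and offsets): time the same row-copy kernel once with the sub-image starting at the beginning of a pitched row and once at an odd byte offset, and compare:

```
// Rough microbenchmark sketch: copy the rows of a sub-image that starts either
// aligned or at an odd byte offset inside a pitched parent image, and time it.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void copyRoi(const unsigned char *src, unsigned char *dst,
                        size_t pitch, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        dst[y * (size_t)w + x] = src[y * pitch + x];
}

static float timeCopy(const unsigned char *roi, unsigned char *dst,
                      size_t pitch, int w, int h)
{
    dim3 block(32, 8);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    copyRoi<<<grid, block>>>(roi, dst, pitch, w, h);   // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < 100; ++i)          // repeat to get a stable timing
        copyRoi<<<grid, block>>>(roi, dst, pitch, w, h);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int width = 4096, height = 4096;   // parent image I
    const int roiW = 1024, roiH = 1024;      // sub-image S

    unsigned char *dImage = NULL, *dDst = NULL;
    size_t pitch = 0;
    cudaMallocPitch((void **)&dImage, &pitch, width, height);
    cudaMalloc((void **)&dDst, (size_t)roiW * roiH);

    // Aligned case: S starts at the beginning of a pitched row.
    float aligned = timeCopy(dImage, dDst, pitch, roiW, roiH);
    // Misaligned case: S starts 3 bytes into the row (arbitrary odd offset).
    float offset  = timeCopy(dImage + 3, dDst, pitch, roiW, roiH);

    printf("aligned: %.3f ms, offset: %.3f ms\n", aligned, offset);

    cudaFree(dImage);
    cudaFree(dDst);
    return 0;
}
```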
