Ignore this thread; `cudaMallocHost` appears to work as expected.

I had assumed that the post "cudaMallocHost and alignment question" (CUDA Programming and Performance, NVIDIA Developer Forums) was accurate because it made sense; however, it appears it is not actually a guarantee. I had written some code to use SSE (128-bit) RGB → {RGBA,BGRA} copies and vice versa, and it turns out the byte alignment is not guaranteed when using memory allocated with cudaMallocHost.

I’ve tested this by replacing the memory I allocated with cudaMallocHost with memory requested using _mm_malloc and a 16-byte alignment parameter, and the bus errors go away. The buffer in question is best allocated with cudaMallocHost, though, since I want to take advantage of cudaMemcpyAsync (new data gets copied every frame).

For my purposes I just introduced an intermediate buffer:

  1. Read data into mRGB
  2. Use my SSE copy from mRGB to mRGBA
  3. std::copy from mRGBA to the destination buffer allocated with cudaMallocHost

This has some overhead, but the benefit outweighs the overhead (I think, not thoroughly tested).
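The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the scalar loop stands in for the SSE copy, `rgb_to_rgba` and `stage_frame` are hypothetical names, and `pinned_dst` is just any writable buffer here (in the real program it would be the pointer returned by cudaMallocHost).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for the SSE RGB -> RGBA expansion (hypothetical helper).
// Reads 3 bytes per pixel from src, writes 4 bytes per pixel to dst,
// and sets alpha to 255.
void rgb_to_rgba(const std::uint8_t* src, std::uint8_t* dst, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 255;
    }
}

// Steps 2 and 3: convert into the intermediate buffer, then std::copy
// the result into the destination (assumed to hold 4 bytes per pixel).
void stage_frame(const std::vector<std::uint8_t>& rgb,
                 std::vector<std::uint8_t>& staging,
                 std::uint8_t* pinned_dst) {
    assert(rgb.size() % 3 == 0);           // whole pixels only
    const std::size_t pixels = rgb.size() / 3;
    staging.resize(pixels * 4);            // intermediate RGBA buffer
    rgb_to_rgba(rgb.data(), staging.data(), pixels);
    std::copy(staging.begin(), staging.end(), pinned_dst);
}
```

The extra std::copy is the overhead mentioned above; whether it matters depends on frame size and would need measuring.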

It would be really swell if we could specify the alignment needed with cudaMallocHost :)

Thanks for any consideration!

If you want to file a feature request, use the bug reporting portal at developer.nvidia.com (you have to register as a developer, then log in), and note in your bug report that it is a request for enhancement (RFE).

Note that cudaMallocHost can be replaced by cudaHostAlloc, although I’m not aware of any difference in this respect.

Frankly, I’m a little doubtful that cudaMallocHost returns a pointer that is not 16-byte aligned, as this is pretty much expected/mandatory for CUDA accesses to global space. However, you’ve provided no actual example supporting your claim. You might try specifying cudaHostAlloc with the cudaHostAllocMapped flag, although this should be the default for UVA (unified virtual addressing) setups.

Also note that you should be able to get pinned memory by using a system malloc allocator, and then calling cudaHostRegister. It might be worth a try on your _mm_malloc-allocated buffer. (Be sure to use proper CUDA error checking.)

My simple test sure looks like an aligned pointer to me:

$ cat t1265.cu
#include <stdio.h>

int main(){

  int4 *data;
  cudaHostAlloc(&data, 1024*sizeof(int4), cudaHostAllocDefault);
  printf("ptr: %p\n", data);
}
$ nvcc -o t1265 t1265.cu
$ cuda-memcheck ./t1265
========= CUDA-MEMCHECK
ptr: 0x2049e0000
========= ERROR SUMMARY: 0 errors
$

Any time you are having trouble with a CUDA code, I recommend doing proper CUDA error checking on all CUDA runtime API calls and all kernel calls.

Hi txbob,

Thanks for the response. I will need to investigate further, but I also did not realize there was an explicit cudaHostRegister function! I really didn’t want to figure out how to pin memory manually in a cross-platform way ;)

It may be an issue of what is requested, and when, with cudaMallocHost. One possible scenario would look like this:

  1. 640 x 480 x sizeof(uint16_t) allocated (depth feed)
  2. 640 x 480 x sizeof(uint8_t) allocated (infrared feed)
  3. 640 x 480 x 4 x sizeof(uint8_t) allocated (video feed)

I’m a little rusty on when things would go across page boundaries, and what that would mean for pinned memory, but maybe that is what the problem is?
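As a quick sanity check on those sizes, the sketch below computes each buffer's byte count and tests whether it is a whole multiple of 16 bytes and of a page. The 4096-byte page size is an assumption (it varies by platform), and all names are made up for illustration. The three sizes come out to 614400, 307200 and 1228800 bytes, each of which divides evenly by both 16 and 4096.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Frame dimensions from the scenario above.
constexpr std::size_t kWidth = 640;
constexpr std::size_t kHeight = 480;

// ASSUMPTION: 4096-byte pages; query the OS for the real value.
constexpr std::size_t kAssumedPageBytes = 4096;

// Byte counts of the three allocations.
constexpr std::size_t depth_bytes    = kWidth * kHeight * sizeof(std::uint16_t);
constexpr std::size_t infrared_bytes = kWidth * kHeight * sizeof(std::uint8_t);
constexpr std::size_t video_bytes    = kWidth * kHeight * 4 * sizeof(std::uint8_t);

// True when n is a whole multiple of m (m must be non-zero).
inline bool is_multiple(std::size_t n, std::size_t m) {
    assert(m != 0);
    return n % m == 0;
}
```

This only describes the request sizes; it says nothing about where the allocator actually places each buffer.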

I will try to create a much more concise version of the issue and submit an official report, assuming the trimmed-down version reproduces the error…

Thank you for your response, sorry for posting this in the wrong place!

I think it’s unlikely that cudaMallocHost is returning a pointer that is not 16-byte aligned (assuming it is returning a runtime API status of cudaSuccess).

If you can demonstrate that, I would just call it a bug and file it as such. It should be simple to demonstrate - just print out the pointer value as I have done. No need to drag SSE into it.
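For anyone who would rather test the pointer in code than read the printed hex value, a minimal host-side check could look like the sketch below. `is_aligned` is a hypothetical helper name, not a CUDA API; it works on any pointer, including one returned by cudaMallocHost or cudaHostAlloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the address of p is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two (e.g. 16 for SSE loads).
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return (addr & (alignment - 1)) == 0;
}
```

A call such as `is_aligned(data, 16)` right after the allocation would do. Reading the hex value works just as well: an address whose last hex digit is 0 is a multiple of 16.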

I think it’s likely that the bus errors are arising out of some mechanism that is not related to whether cudaMallocHost is returning a 16-byte aligned pointer or not.

Hmmmmm. You are (as always :D) correct. I went through and checked all of them, and they are all 16-byte aligned. I am left perplexed, but since this is now well beyond the scope of these forums (and clearly just a bug in my own program), I shall follow it no further here.

Thank you for your responses!