ignore this thread, `cudaMallocHost` appears to work as expected

svennevs · December 27, 2017, 9:43pm

I had assumed that the post "cudaMallocHost and alignment question" (CUDA Programming and Performance, NVIDIA Developer Forums) was accurate because it made sense; however, it appears it is not actually a guarantee. I had written some code to use SSE (128-bit) RGB → {RGBA,BGRA} copies and vice versa, and it turns out the byte alignment is not guaranteed when using memory allocated with cudaMallocHost.

I’ve tested this by replacing the memory I allocated with cudaMallocHost with memory requested using _mm_malloc and a 16-byte alignment parameter, and the bus errors go away. The buffer in question is best allocated with cudaMallocHost, though, since I want to take advantage of cudaMemcpyAsync (new data gets copied every frame).

For my purposes I just introduced an intermediate buffer:

Read data into mRGB
Use my SSE copy from mRGB to mRGBA
std::copy from mRGBA to the desired buffer being used with cudaMallocHost
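
The three steps above can be sketched roughly as follows. This is only an illustration: the scalar loop stands in for the SSE copy (which is not shown in this thread), `rgb_to_rgba` and `stage_frame` are hypothetical names, and `dst` stands for the buffer that would come from cudaMallocHost in the real program.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for the SSE copy: expands packed RGB to RGBA with alpha = 255.
// It only does byte accesses, so it has no alignment requirement of its own.
void rgb_to_rgba(const std::uint8_t* rgb, std::uint8_t* rgba, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        rgba[4 * i + 0] = rgb[3 * i + 0];
        rgba[4 * i + 1] = rgb[3 * i + 1];
        rgba[4 * i + 2] = rgb[3 * i + 2];
        rgba[4 * i + 3] = 255;
    }
}

// Steps 2 and 3: convert mRGB into the intermediate mRGBA buffer, then
// std::copy the result into dst (hypothetical: the pinned destination buffer,
// which must hold at least 4 bytes per pixel).
void stage_frame(const std::vector<std::uint8_t>& mRGB,
                 std::vector<std::uint8_t>& mRGBA,
                 std::uint8_t* dst) {
    assert(mRGB.size() % 3 == 0);  // packed RGB: 3 bytes per pixel
    const std::size_t pixels = mRGB.size() / 3;
    mRGBA.resize(pixels * 4);
    rgb_to_rgba(mRGB.data(), mRGBA.data(), pixels);
    std::copy(mRGBA.begin(), mRGBA.end(), dst);
}
```

The extra std::copy is the overhead in question: every frame is written twice on the host before it is handed to cudaMemcpyAsync.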

This has some overhead, but the benefit outweighs the overhead (I think, not thoroughly tested).

It would be really swell if we could specify the alignment needed with cudaMallocHost :)

Thanks for any consideration!

Robert_Crovella · December 27, 2017, 10:06pm

If you want to file a feature request, use the bug reporting portal at developer.nvidia.com (you have to register as a registered developer, then log in), and note in your bug report that it is a request for enhancement (RFE).

Note that cudaMallocHost can be replaced by cudaHostAlloc, although I’m not aware of any difference in this respect.

Frankly, I’m a little doubtful that cudaMallocHost returns a pointer that is not 16-byte aligned, as this is pretty much expected/mandatory for CUDA accesses to global space. However, you’ve provided no actual example supporting your claim. You might try calling cudaHostAlloc with the cudaHostAllocMapped flag, although this should be the default for UVM setups.

Also note that you should be able to achieve pinned memory by using a system malloc allocator, and then calling cudaHostRegister. It might be worth a try on your _mm_malloc allocated buffer. (Be sure to use proper CUDA error checking.)

My simple test sure looks like an aligned pointer to me:

$ cat t1265.cu
#include <stdio.h>

int main(){

  int4 *data;
  cudaHostAlloc(&data, 1024*sizeof(int4), cudaHostAllocDefault);
  printf("ptr: %p\n", data);
}
$ nvcc -o t1265 t1265.cu
$ cuda-memcheck ./t1265
========= CUDA-MEMCHECK
ptr: 0x2049e0000
========= ERROR SUMMARY: 0 errors
$

Any time you are having trouble with a CUDA code, I recommend doing proper CUDA error checking on all CUDA runtime API calls and all kernel calls.

svennevs · December 27, 2017, 10:30pm

Hi txbob,

Thanks for the response. I will need to investigate further, but I also did not realize there was an explicit cudaHostRegister function! I really didn’t want to figure out how to pin memory manually in a cross-platform way ;)

It may be an issue of what is requested, and when, with cudaMallocHost. One possible scenario would look like this:

640 x 480 x sizeof(uint16_t) allocated (depth feed)
640 x 480 x sizeof(uint8_t) allocated (infrared feed)
640 x 480 x 4 x sizeof(uint8_t) allocated (video feed)

I’m a little rusty on when things would go across page boundaries, and what that would mean for pinned memory, but maybe that is what the problem is?
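
For reference, the sizes of the three buffers above can be worked out with a small sketch like the one below. The 4096-byte page size is an assumption (it is not stated anywhere in this thread and varies by platform), and `frame_bytes` and `is_multiple_of` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a 4 KiB page size. Query the OS for the real value in production code.
constexpr std::size_t kAssumedPageSize = 4096;

// Bytes needed for one frame of width x height pixels.
std::size_t frame_bytes(std::size_t width, std::size_t height,
                        std::size_t channels, std::size_t bytes_per_channel) {
    return width * height * channels * bytes_per_channel;
}

// True when bytes is an exact multiple of unit (for example 16 or the page size).
bool is_multiple_of(std::size_t bytes, std::size_t unit) {
    assert(unit != 0);
    return bytes % unit == 0;
}
```

With these helpers the depth feed comes to 614400 bytes, the infrared feed to 307200 bytes, and the video feed to 1228800 bytes. Each is a multiple of 16 and, under the 4096-byte assumption, a whole number of pages (150, 75, and 300 respectively), so none of the sizes is unusual on its own.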

I will try to create a much more concise version of the issue and submit an official report, assuming the trimmed-down version reproduces the error…

Thank you for your response, sorry for posting this in the wrong place!

Robert_Crovella · December 27, 2017, 10:48pm

I think it’s unlikely that cudaMallocHost is returning a pointer that is not 16-byte aligned (assuming it is returning a runtime API status of cudaSuccess).

If you can demonstrate that, I would just call it a bug and file it as such. It should be simple to demonstrate - just print out the pointer value as I have done. No need to drag SSE into it.

I think it’s likely that the bus errors are arising out of some mechanism that is not related to whether cudaMallocHost is returning a 16-byte aligned pointer or not.
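
Besides printing the pointer, the alignment check can be done programmatically. A minimal sketch, assuming a hypothetical helper named `is_aligned` (it is not part of the CUDA runtime API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when ptr is a multiple of alignment.
// alignment must be a nonzero power of two (16 for 128-bit SSE loads and stores).
bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

Calling `is_aligned(p, 16)` on each pointer returned by cudaMallocHost or cudaHostAlloc answers the question directly. The address printed in the test earlier in this thread, 0x2049e0000, ends in four zero hex digits, so it is a multiple of 16 and also of 4096.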

svennevs · December 28, 2017, 12:05am

Hmmmmm. You are (as always :D) correct. I went through and checked all of them, and they are all 16-byte aligned. I am left perplexed, but since this is now well beyond the scope of these forums (and clearly a bug in my own program), I shall follow it no further here.

Thank you for your responses!
