CUDA 4.0: how to malloc page-aligned memory on Linux for cudaHostRegister

Hi guys,

jumping right into CUDA 4.0 with my project. I have a memory buffer (passed to me by OpenSSL, so I’m not allocating it myself) which I want to register for DMA / cudaMemcpyAsync with the new cudaHostRegister. Right now my call fails with CUDA error 11 (invalid argument):

cudaHostRegister((void *)out_arg, nbytes, 0);

I’m not quite sure why, and it might have some other cause, but I suspect that the memory pointed to is not page-aligned (4 KB). The question is how to check whether it is aligned, and if not, how best to proceed. One thing I tried was examining the pointer itself (0x7f7801ee0010 for one run, which is not a multiple of 4096), but I don’t know whether that check is valid, or whether this is actually the reason for my problem.

Furthermore I’d like to ask if there is online CUDA library documentation like there was for 3.2: http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/index.html

I found the documentation for cudaHostRegister in cuda_runtime_api.h, which is not convenient.

yep, couldn’t find the online lib documentation either :(

Hi,

I have the same problem. I’ve also tried to test the simpleZeroCopy example with the cudaHostRegister function (just change the pin_memory boolean to true) and I get the same error 11 (invalid argument). I don’t think it’s a memory alignment problem, since malloc returns an aligned memory address.

Sounds like a bug?

Whatever it is, I think we aren’t the only guys who have this error, at least I hope…

For information, I’m working on an IBM server with two Tesla M2070s running RHEL 5.5.

Thanks a lot for your answers

The 4.0rc SDK simpleStreams example uses cudaHostRegister to register an allocation obtained with mmap(). I haven’t gotten it to work either, but it must be possible, because that SDK code works…

Well, I got a little further. Doing something like this:

const size_t msize = 64 * 1024;
cudaComplex *vals, *_vals;

vals = (cudaComplex *) mmap(NULL, msize, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
assert(vals != MAP_FAILED);

cudaAssert( cudaHostRegister(vals, msize, cudaHostRegisterMapped) );
cudaAssert( cudaHostGetDevicePointer((void **)&_vals, vals, 0) );

for (int counter = 0; counter < 32; counter++) {
    vals[counter].r = (float)drand48();
    vals[counter].i = (float)drand48();
}

kernel<<<1, 32>>>(_vals);
cudaAssert( cudaGetLastError() );
cudaAssert( cudaThreadSynchronize() );

compiles and works as expected. However doing this:

cudaComplex *vals;

cudaAssert( cudaMallocHost((void **)&vals, msize) );
assert(vals != NULL);

cudaAssert( cudaHostRegister(vals, msize, 0) );

compiles fine, but generates a pretty severe kernel protection fault and a dead GPU:

[  194.924717] general protection fault: 0000 [#1] SMP 

[  194.924727] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

[  194.924733] Dumping ftrace buffer:

[  194.924738]    (ftrace buffer empty)

[  194.924741] CPU 3 

[  194.924745] Modules linked in: binfmt_misc bridge stp bnep video output input_polldev lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd ppdev soundcore serio_raw i2c_piix4 pcspkr joydev parport_pc parport snd_page_alloc nvidia(P) usbhid ohci1394 ieee1394 r8169 mii fbcon tileblit font bitblit softcursor

[  194.924805] Pid: 4004, comm: cudaComplex Tainted: P           2.6.28-19-generic #66-Ubuntu

[  194.924811] RIP: 0010:[<ffffffff80268a0c>]  [<ffffffff80268a0c>] __wake_up_bit+0xc/0x30

[  194.924827] RSP: 0018:ffff88021f1abc88  EFLAGS: 00010286

[  194.924832] RAX: ff2617d12a18a008 RBX: ffff880227165c00 RCX: 0000000000000040

[  194.924837] RDX: 0000000000000000 RSI: ffff880227165c00 RDI: ff2617d12a18a000

[  194.924841] RBP: ffff88021f1abc98 R08: e000000000000000 R09: 0000000000000000

[  194.924846] R10: 0000000000000000 R11: 0000000000000246 R12: ffff880227165c00

[  194.924851] R13: ffff88023ec2b080 R14: ffff88023ec2b000 R15: ffff880227165300

[  194.924856] FS:  00007f55801b5700(0000) GS:ffff88023f802f80(0000) knlGS:0000000000000000

[  194.924861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[  194.924866] CR2: 00007f557e2cd000 CR3: 000000021f05f000 CR4: 00000000000006a0

[  194.924871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[  194.924875] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[  194.924881] Process cudaComplex (pid: 4004, threadinfo ffff88021f1aa000, task ffff8802217d0000)

[  194.924885] Stack:

[  194.924888]  ffff88021f1abcb8 ffff880227165c00 ffff88021f1abcb8 ffffffff802b1052

[  194.924896]  ffff880227165c00 0000000000000001 ffff88021f1abcd8 ffffffff802b9260

[  194.924904]  ffff880227165300 0000000000000001 ffff88021f1abd28 ffffffffa05af156

[  194.924913] Call Trace:

[  194.924917]  [<ffffffff802b1052>] unlock_page+0x22/0x30

[  194.924926]  [<ffffffff802b9260>] set_page_dirty_lock+0x40/0x50

[  194.924935]  [<ffffffffa05af156>] nv_unlock_user_pages+0x76/0xe0 [nvidia]

[  194.925464]  [<ffffffffa058b6ed>] _nv022854rm+0x2c/0x3b [nvidia]

[  194.925971]  [<ffffffffa00b4f42>] ? _nv003722rm+0x1a4/0x2bf [nvidia]

[  194.926243]  [<ffffffffa00b2b9f>] ? _nv003699rm+0xc7/0xef [nvidia]

[  194.926243]  [<ffffffffa00b2c52>] ? _nv002369rm+0x52/0x74 [nvidia]

[  194.926243]  [<ffffffffa00abc7e>] ? _nv001995rm+0x26/0x4e [nvidia]

[  194.926243]  [<ffffffffa059977b>] ? _nv002408rm+0x29b/0x6a0 [nvidia]

[  194.926243]  [<ffffffffa05960a7>] ? rm_ioctl+0x2f/0x67 [nvidia]

[  194.926243]  [<ffffffffa05b24e3>] ? nv_kern_ioctl+0x1a3/0x480 [nvidia]

[  194.926243]  [<ffffffffa05b27fc>] ? nv_kern_unlocked_ioctl+0x1c/0x20 [nvidia]

[  194.926243]  [<ffffffff802f6871>] ? vfs_ioctl+0x31/0xa0

[  194.926243]  [<ffffffff802f6c25>] ? do_vfs_ioctl+0x75/0x230

[  194.926243]  [<ffffffff802f6e79>] ? sys_ioctl+0x99/0xa0

[  194.926243]  [<ffffffff8021252a>] ? system_call_fastpath+0x16/0x1b

[  194.926243] Code: c0 4c 01 c8 5b 48 d3 e8 48 8d 04 40 41 5c 48 c1 e0 03 48 03 82 00 d0 90 80 c9 c3 0f 1f 40 00 55 48 8d 47 08 48 89 e5 48 83 ec 10 <48> 39 47 08 89 55 f8 74 17 48 8d 4d f0 48 89 75 f0 ba 01 00 00 

[  194.926243] RIP  [<ffffffff80268a0c>] __wake_up_bit+0xc/0x30

[  194.926243]  RSP <ffff88021f1abc88>

[  194.929592] ---[ end trace 8174815e179b6ee7 ]---

I have no experience with CUDA 4.0 or cudaHostRegister(). But why would you call cudaHostRegister() on memory obtained by cudaMallocHost()? Isn’t that page-locked already?

It is, but I was working on the misapprehension that all allocations have to be explicitly registered before they are available in the unified address space. Re-reading the documentation, it says this:

EDIT: I’ll still open a bug report, because you’d hope that doing something as innocent as that (even though not correct) would cause a runtime error to be returned, rather than the driver assploding and taking out the entire machine.

Hi!

I got exactly the same trouble trying cudaHostRegister. I solved it by allocating my buffers with “valloc” instead of “malloc”. “valloc” makes sure the pointer is aligned to the page size, which is necessary to register the buffer.

See “http://www.delorie.com/gnu/docs/glibc/libc_31.html”

Concerning documentation, you can use the “CUDA_Toolkit_reference_Manual.pdf” in cuda/doc.

Regards,

Sam

Try posix_memalign():

size_t bytesRequired = 64 * 1024 * 1024;
unsigned char *ptr;

if (posix_memalign((void **)&ptr, 4096, bytesRequired) != 0)
    printf("posix_memalign(4096, %zu) failed\n", bytesRequired);

Thanks for the tip.

Tested it, and it only works when the host memory is aligned to 4 KB and the size is a multiple of 4 KB.