CUDA 4.0: how to malloc page-aligned memory on Linux for cudaHostRegister

Hi guys,

jumping right into CUDA 4.0 with my project. I have a memory buffer (passed to me by OpenSSL, so I’m not allocating it myself) which I want to register for DMA / cudaMemcpyAsync with the new cudaHostRegister. Right now my call fails with CUDA error 11 (invalid argument):

cudaHostRegister((void *)out_arg, nbytes, 0);

I’m not quite sure why, and it might have some other cause, but I suspect that the memory pointed to is not page-aligned (4 KB). The question is how to check whether it is aligned, and if not, how best to proceed. One thing I tried was examining the pointer itself (0x7f7801ee0010 for one run, which is not a multiple of 4096), but I don’t know whether that check is valid, or whether this is actually the reason for my problem.

Furthermore I’d like to ask if there is online CUDA library documentation like there was for 3.2: http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/index.html

I found the documentation for cudaHostRegister in cuda_runtime_api.h, which is not convenient.

yep, couldn’t find the online lib documentation either :(

Hi,

I have the same problem. I’ve also tried to test the simpleZeroCopy example with the cudaHostRegister function (just change the pin_memory boolean to true) and I get the same error 11 (invalid argument). I don’t think it’s a memory alignment problem, since malloc returns an aligned memory address.

Sounds like a bug?

Whatever it is, I think we aren’t the only guys who have this error, at least I hope…

For information, I’m working on an IBM server with two Tesla M2070s running RHEL 5.5.

Thanks a lot for your answers

The 4.0rc SDK simpleStreams example uses cudaHostRegister to register an allocation obtained with mmap(). I haven’t gotten it to work either, but it must be possible, because that SDK code works…

Well, I got a little further. Doing something like this:

const size_t msize = 64 * 1024;
cudaComplex *vals, *_vals;

vals = (cudaComplex *) mmap(NULL, msize, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
assert(vals != MAP_FAILED);

cudaAssert( cudaHostRegister(vals, msize, cudaHostRegisterMapped) );
cudaAssert( cudaHostGetDevicePointer((void **)&_vals, vals, 0) );

for (int counter = 0; counter < 32; counter++) {
    vals[counter].r = (float)drand48();
    vals[counter].i = (float)drand48();
}

kernel<<<1, 32>>>(_vals);
cudaAssert( cudaGetLastError() );
cudaAssert( cudaThreadSynchronize() );

compiles and works as expected. However doing this:

cudaComplex *vals;

cudaAssert( cudaMallocHost((void **)&vals, msize) );
assert(vals != NULL);

cudaAssert( cudaHostRegister(vals, msize, 0) );

compiles fine, but generates a pretty severe kernel protection fault and a dead GPU:

[  194.924717] general protection fault: 0000 [#1] SMP 

[  194.924727] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

[  194.924733] Dumping ftrace buffer:

[  194.924738]    (ftrace buffer empty)

[  194.924741] CPU 3 

[  194.924745] Modules linked in: binfmt_misc bridge stp bnep video output input_polldev lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd ppdev soundcore serio_raw i2c_piix4 pcspkr joydev parport_pc parport snd_page_alloc nvidia(P) usbhid ohci1394 ieee1394 r8169 mii fbcon tileblit font bitblit softcursor

[  194.924805] Pid: 4004, comm: cudaComplex Tainted: P           2.6.28-19-generic #66-Ubuntu

[  194.924811] RIP: 0010:[<ffffffff80268a0c>]  [<ffffffff80268a0c>] __wake_up_bit+0xc/0x30

[  194.924827] RSP: 0018:ffff88021f1abc88  EFLAGS: 00010286

[  194.924832] RAX: ff2617d12a18a008 RBX: ffff880227165c00 RCX: 0000000000000040

[  194.924837] RDX: 0000000000000000 RSI: ffff880227165c00 RDI: ff2617d12a18a000

[  194.924841] RBP: ffff88021f1abc98 R08: e000000000000000 R09: 0000000000000000

[  194.924846] R10: 0000000000000000 R11: 0000000000000246 R12: ffff880227165c00

[  194.924851] R13: ffff88023ec2b080 R14: ffff88023ec2b000 R15: ffff880227165300

[  194.924856] FS:  00007f55801b5700(0000) GS:ffff88023f802f80(0000) knlGS:0000000000000000

[  194.924861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[  194.924866] CR2: 00007f557e2cd000 CR3: 000000021f05f000 CR4: 00000000000006a0

[  194.924871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[  194.924875] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[  194.924881] Process cudaComplex (pid: 4004, threadinfo ffff88021f1aa000, task ffff8802217d0000)

[  194.924885] Stack:

[  194.924888]  ffff88021f1abcb8 ffff880227165c00 ffff88021f1abcb8 ffffffff802b1052

[  194.924896]  ffff880227165c00 0000000000000001 ffff88021f1abcd8 ffffffff802b9260

[  194.924904]  ffff880227165300 0000000000000001 ffff88021f1abd28 ffffffffa05af156

[  194.924913] Call Trace:

[  194.924917]  [<ffffffff802b1052>] unlock_page+0x22/0x30

[  194.924926]  [<ffffffff802b9260>] set_page_dirty_lock+0x40/0x50

[  194.924935]  [<ffffffffa05af156>] nv_unlock_user_pages+0x76/0xe0 [nvidia]

[  194.925464]  [<ffffffffa058b6ed>] _nv022854rm+0x2c/0x3b [nvidia]

[  194.925971]  [<ffffffffa00b4f42>] ? _nv003722rm+0x1a4/0x2bf [nvidia]

[  194.926243]  [<ffffffffa00b2b9f>] ? _nv003699rm+0xc7/0xef [nvidia]

[  194.926243]  [<ffffffffa00b2c52>] ? _nv002369rm+0x52/0x74 [nvidia]

[  194.926243]  [<ffffffffa00abc7e>] ? _nv001995rm+0x26/0x4e [nvidia]

[  194.926243]  [<ffffffffa059977b>] ? _nv002408rm+0x29b/0x6a0 [nvidia]

[  194.926243]  [<ffffffffa05960a7>] ? rm_ioctl+0x2f/0x67 [nvidia]

[  194.926243]  [<ffffffffa05b24e3>] ? nv_kern_ioctl+0x1a3/0x480 [nvidia]

[  194.926243]  [<ffffffffa05b27fc>] ? nv_kern_unlocked_ioctl+0x1c/0x20 [nvidia]

[  194.926243]  [<ffffffff802f6871>] ? vfs_ioctl+0x31/0xa0

[  194.926243]  [<ffffffff802f6c25>] ? do_vfs_ioctl+0x75/0x230

[  194.926243]  [<ffffffff802f6e79>] ? sys_ioctl+0x99/0xa0

[  194.926243]  [<ffffffff8021252a>] ? system_call_fastpath+0x16/0x1b

[  194.926243] Code: c0 4c 01 c8 5b 48 d3 e8 48 8d 04 40 41 5c 48 c1 e0 03 48 03 82 00 d0 90 80 c9 c3 0f 1f 40 00 55 48 8d 47 08 48 89 e5 48 83 ec 10 <48> 39 47 08 89 55 f8 74 17 48 8d 4d f0 48 89 75 f0 ba 01 00 00 

[  194.926243] RIP  [<ffffffff80268a0c>] __wake_up_bit+0xc/0x30

[  194.926243]  RSP <ffff88021f1abc88>

[  194.929592] ---[ end trace 8174815e179b6ee7 ]---

I have no experience with CUDA 4.0 or cudaHostRegister(). But why would you call cudaHostRegister() on memory obtained by cudaMallocHost()? Isn’t that page-locked already?

It is, but I was working on the misapprehension that all allocations have to be explicitly registered before they are available in the unified address space. Re-reading the documentation, it says this:

EDIT: I’ll still open a bug report, because you’d hope that doing something as innocent as that (even though not correct) would cause a runtime error to be returned, rather than the driver assploding and taking out the entire machine.

Hi!

I got exactly the same trouble trying cudaHostRegister. I solved it by allocating my buffers with “valloc” instead of “malloc”. “valloc” makes sure the pointer is aligned to the page size, which is necessary to register the buffer.

See “http://www.delorie.com/gnu/docs/glibc/libc_31.html”

Concerning documentation, you can use the “CUDA_Toolkit_reference_Manual.pdf” in cuda/doc.

Regards,

Sam

Try posix_memalign():

size_t bytesRequired = 64 * 1024 * 1024;
unsigned char *ptr;

if (posix_memalign((void **)&ptr, 4096, bytesRequired) != 0)
    printf("posix_memalign(4096, %zu) failed\n", bytesRequired);

Thanks for the tip.

Tested it, and it only works when the host memory is aligned to 4 KB and the size is a multiple of 4 KB.