heipei
March 4, 2011, 11:49am
1
Hi guys,
jumping right into CUDA 4.0 with my project. I have a memory buffer (passed to me by OpenSSL, so I'm not allocating it myself) which I want to register for DMA / cudaMemcpyAsync with the new cudaHostRegister. Right now my call fails with CUDA error 11 (invalid argument):
cudaHostRegister((void *)out_arg, nbytes, 0);
I'm not quite sure why, and it might have some other cause, but I suspect that the memory pointed to is not page-aligned (4 KB). The question is how to check whether it is aligned, and if not, how best to proceed. One way I tried to find out is by examining the pointer (0x7f7801ee0010 for one run, which is not a multiple of 4096), but I don't know whether that check is valid, or whether alignment is really the reason for my problem.
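For what it's worth, here is a minimal sketch of how one could check page alignment on Linux (out_arg stands in for the OpenSSL buffer from above; sysconf(_SC_PAGESIZE) reports the page size, typically 4096):
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if ptr starts on a page boundary, 0 otherwise. */
static int is_page_aligned(const void *ptr)
{
    long page = sysconf(_SC_PAGESIZE);   /* typically 4096 */
    return ((uintptr_t)ptr % (uintptr_t)page) == 0;
}

/* Usage with the buffer from above:
     if (!is_page_aligned(out_arg))
         fprintf(stderr, "buffer %p is not page-aligned\n", out_arg);
*/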
Furthermore I’d like to ask if there is an online CUDA library documentation like for 3.2: http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/index.html
I found the documentation for cudaHostRegister in cuda_runtime_api.h, which is not very convenient.
Yep, couldn't find the online library documentation either :(
Hi,
I have the same problem. I've also tried the simpleZeroCopy example with the cudaHostRegister function (just changing the pin_memory boolean to true) and I get the same error 11 (invalid argument). I don't think it's a memory alignment problem, since malloc returns an aligned memory address.
Could this be a bug?
Whatever it is, I don't think we're the only ones seeing this error, at least I hope not…
For information, I'm working on an IBM server with two Tesla 2070M cards running RHEL 5.5.
Thanks a lot for your answers
The 4.0rc SDK simpleStreams example uses cudaHostRegister to register an allocation obtained with mmap(). I haven't gotten it to work either, but it must be possible, because that SDK code works…
Well, I got a little further. Doing something like this:
const size_t msize = 64 * 1024;
cudaComplex * vals, * _vals;
vals = (cudaComplex *) mmap( NULL, msize, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0 );
assert( vals != MAP_FAILED );   // mmap signals failure with MAP_FAILED, not NULL
cudaAssert( cudaHostRegister(vals, msize, cudaHostRegisterMapped) );   // runtime-API flag (same value as the driver-API CU_MEMHOSTALLOC_DEVICEMAP)
cudaAssert( cudaHostGetDevicePointer((void **)&_vals, vals, 0) );
for(int counter=0; counter<32; counter++) {
    vals[counter].r = (float)drand48();
    vals[counter].i = (float)drand48();
}
kernel <<< 1, 32 >>> (_vals);
cudaAssert( cudaGetLastError() );
cudaAssert( cudaThreadSynchronize() );
compiles and works as expected. However doing this:
cudaComplex * vals;
cudaAssert( cudaMallocHost((void **)&vals, msize) );
assert( vals != NULL );
cudaAssert( cudaHostRegister(vals, msize, 0) );
compiles fine, but generates a pretty severe kernel protection fault and a dead GPU:
[ 194.924717] general protection fault: 0000 [#1] SMP
[ 194.924727] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
[ 194.924733] Dumping ftrace buffer:
[ 194.924738] (ftrace buffer empty)
[ 194.924741] CPU 3
[ 194.924745] Modules linked in: binfmt_misc bridge stp bnep video output input_polldev lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd ppdev soundcore serio_raw i2c_piix4 pcspkr joydev parport_pc parport snd_page_alloc nvidia(P) usbhid ohci1394 ieee1394 r8169 mii fbcon tileblit font bitblit softcursor
[ 194.924805] Pid: 4004, comm: cudaComplex Tainted: P 2.6.28-19-generic #66-Ubuntu
[ 194.924811] RIP: 0010:[<ffffffff80268a0c>] [<ffffffff80268a0c>] __wake_up_bit+0xc/0x30
[ 194.924827] RSP: 0018:ffff88021f1abc88 EFLAGS: 00010286
[ 194.924832] RAX: ff2617d12a18a008 RBX: ffff880227165c00 RCX: 0000000000000040
[ 194.924837] RDX: 0000000000000000 RSI: ffff880227165c00 RDI: ff2617d12a18a000
[ 194.924841] RBP: ffff88021f1abc98 R08: e000000000000000 R09: 0000000000000000
[ 194.924846] R10: 0000000000000000 R11: 0000000000000246 R12: ffff880227165c00
[ 194.924851] R13: ffff88023ec2b080 R14: ffff88023ec2b000 R15: ffff880227165300
[ 194.924856] FS: 00007f55801b5700(0000) GS:ffff88023f802f80(0000) knlGS:0000000000000000
[ 194.924861] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 194.924866] CR2: 00007f557e2cd000 CR3: 000000021f05f000 CR4: 00000000000006a0
[ 194.924871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 194.924875] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 194.924881] Process cudaComplex (pid: 4004, threadinfo ffff88021f1aa000, task ffff8802217d0000)
[ 194.924885] Stack:
[ 194.924888] ffff88021f1abcb8 ffff880227165c00 ffff88021f1abcb8 ffffffff802b1052
[ 194.924896] ffff880227165c00 0000000000000001 ffff88021f1abcd8 ffffffff802b9260
[ 194.924904] ffff880227165300 0000000000000001 ffff88021f1abd28 ffffffffa05af156
[ 194.924913] Call Trace:
[ 194.924917] [<ffffffff802b1052>] unlock_page+0x22/0x30
[ 194.924926] [<ffffffff802b9260>] set_page_dirty_lock+0x40/0x50
[ 194.924935] [<ffffffffa05af156>] nv_unlock_user_pages+0x76/0xe0 [nvidia]
[ 194.925464] [<ffffffffa058b6ed>] _nv022854rm+0x2c/0x3b [nvidia]
[ 194.925971] [<ffffffffa00b4f42>] ? _nv003722rm+0x1a4/0x2bf [nvidia]
[ 194.926243] [<ffffffffa00b2b9f>] ? _nv003699rm+0xc7/0xef [nvidia]
[ 194.926243] [<ffffffffa00b2c52>] ? _nv002369rm+0x52/0x74 [nvidia]
[ 194.926243] [<ffffffffa00abc7e>] ? _nv001995rm+0x26/0x4e [nvidia]
[ 194.926243] [<ffffffffa059977b>] ? _nv002408rm+0x29b/0x6a0 [nvidia]
[ 194.926243] [<ffffffffa05960a7>] ? rm_ioctl+0x2f/0x67 [nvidia]
[ 194.926243] [<ffffffffa05b24e3>] ? nv_kern_ioctl+0x1a3/0x480 [nvidia]
[ 194.926243] [<ffffffffa05b27fc>] ? nv_kern_unlocked_ioctl+0x1c/0x20 [nvidia]
[ 194.926243] [<ffffffff802f6871>] ? vfs_ioctl+0x31/0xa0
[ 194.926243] [<ffffffff802f6c25>] ? do_vfs_ioctl+0x75/0x230
[ 194.926243] [<ffffffff802f6e79>] ? sys_ioctl+0x99/0xa0
[ 194.926243] [<ffffffff8021252a>] ? system_call_fastpath+0x16/0x1b
[ 194.926243] Code: c0 4c 01 c8 5b 48 d3 e8 48 8d 04 40 41 5c 48 c1 e0 03 48 03 82 00 d0 90 80 c9 c3 0f 1f 40 00 55 48 8d 47 08 48 89 e5 48 83 ec 10 <48> 39 47 08 89 55 f8 74 17 48 8d 4d f0 48 89 75 f0 ba 01 00 00
[ 194.926243] RIP [<ffffffff80268a0c>] __wake_up_bit+0xc/0x30
[ 194.926243] RSP <ffff88021f1abc88>
[ 194.929592] ---[ end trace 8174815e179b6ee7 ]---
tera
March 6, 2011, 11:39am
6
However doing this:
cudaComplex * vals;
cudaAssert( cudaMallocHost((void **)&vals, msize) );
assert( vals != NULL );
cudaAssert( cudaHostRegister(vals, msize, 0) );
compiles fine, but generates a pretty severe kernel protection fault and a dead GPU:
I have no experience with CUDA 4.0 or cudaHostRegister(). But why would you call cudaHostRegister() on memory obtained by cudaMallocHost()? Isn’t that page-locked already?
It is, but I was working under the misapprehension that all allocations have to be explicitly registered before they are available in the unified address space. Re-reading the documentation, it says this:
All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified.
EDIT: I will still open a bug report, just because you would hope that doing something as innocent as that (even though not correct) would result in a runtime error being returned, rather than the driver blowing up and taking out the entire machine.
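In other words, memory from cudaMallocHost() is already page-locked and, on a UVA-capable setup (64-bit OS, Fermi-class GPU), already usable from the device without any registration. A minimal sketch of the corrected version of the snippet above, with the cudaHostRegister call simply dropped (cudaComplex, cudaAssert and kernel are the same placeholder names used earlier in the thread):
const size_t msize = 64 * 1024;
cudaComplex * vals;
cudaAssert( cudaMallocHost((void **)&vals, msize) );   // already page-locked
// No cudaHostRegister needed: with unified addressing the same pointer
// is valid on the device and can be passed straight to the kernel.
for(int counter=0; counter<32; counter++) {
    vals[counter].r = (float)drand48();
    vals[counter].i = (float)drand48();
}
kernel <<< 1, 32 >>> (vals);
cudaAssert( cudaGetLastError() );
cudaAssert( cudaThreadSynchronize() );
cudaAssert( cudaFreeHost(vals) );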
Hi!
I got exactly the same trouble trying cudaHostRegister. I solved it by allocating my buffers with “valloc” instead of “malloc”. “valloc” makes sure the pointer is aligned to the page size, which is necessary to register the buffer.
See “http://www.delorie.com/gnu/docs/glibc/libc_31.html ”
Concerning documentation, you can use the “CUDA_Toolkit_reference_Manual.pdf” in cuda/doc.
Regards,
Sam
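Following up on the valloc suggestion, here is a minimal sketch of that route (error handling kept to a bare minimum; msize is a placeholder for the real buffer size):
#include <stdio.h>
#include <stdlib.h>            /* valloc */
#include <cuda_runtime.h>

int main(void)
{
    const size_t msize = 64 * 1024;                      /* placeholder size        */
    void * buf = valloc(msize);                          /* page-aligned allocation */
    if (buf == NULL) { perror("valloc"); return 1; }

    cudaError_t err = cudaHostRegister(buf, msize, 0);   /* pin the existing buffer */
    if (err != cudaSuccess)
        printf("cudaHostRegister failed: %s\n", cudaGetErrorString(err));

    /* ... use buf with cudaMemcpyAsync etc. ... */

    cudaHostUnregister(buf);                             /* unpin before freeing    */
    free(buf);
    return 0;
}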
Try posix_memalign():
size_t bytesRequired = 64 * 1024 * 1024;
unsigned char* ptr;
if( posix_memalign( (void **)&ptr, 4096, bytesRequired ) != 0 )
    printf( "posix_memalign(4096, %zu) failed\n", bytesRequired );
gshi
March 11, 2011, 8:53pm
10
Thanks for the tip.
Tested it, and it only works when the host memory is aligned to 4k and the size is a multiple of 4k.
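If that size restriction is real (I haven't seen it documented, so treat it as an assumption from the test above), one way to cope is to pad the registered length up to the next page multiple, which only makes sense when you own the whole allocation, e.g. from posix_memalign. A sketch:
#include <stdlib.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* nbytes is the payload size actually needed; both the allocation and the
   registered length are rounded up to a multiple of the page size. */
void * register_padded(size_t nbytes)
{
    size_t page   = (size_t)sysconf(_SC_PAGESIZE);        /* usually 4096 */
    size_t padded = ((nbytes + page - 1) / page) * page;  /* round up     */
    void * ptr    = NULL;
    if (posix_memalign(&ptr, page, padded) != 0)
        return NULL;                                      /* allocation failed */
    if (cudaHostRegister(ptr, padded, 0) != cudaSuccess) {
        free(ptr);
        return NULL;
    }
    return ptr;   /* later: cudaHostUnregister(ptr); free(ptr); */
}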