I’ve been trying to implement a basic vector add using CUDA C, and I was playing around with using the cudaHostRegister() function to get a feel for how to use mapped pinned memory. However, it seems like there are either some race conditions I didn’t account for, or I don’t fully understand how the mapping works. The code typically works fine for N < 100, but for higher N, it consistently gives me c[i] = 0 for all i. Here’s a pastebin with the code in question. Thanks! http://pastebin.com/p92nbQZW
In case it’s relevant (I’m new to CUDA), my compiler commands are:
nvcc -std=c++11 -c -o vec.o vec.cu
nvcc -std=c++11 -o vec vec.o
GPU kernel launches are asynchronous. This means control is returned to the host thread immediately, without waiting for the kernel to begin executing, or completing. When using ordinary device memory, this would be less of an issue for this case, since you would immediately follow the kernel call with a cudaMemcpy operation, to copy the data back to the host, and this operation is blocking/synchronizing. That means that the operation will not begin until all previous CUDA activity (e.g. the kernel) is complete, and it will not return control to the host thread until the data copy operation is complete.
All that is out the window when you use host-mapped memory (since you have no requirement to use cudaMemcpy). So you need to synchronize the host thread, so that it does not begin checking the results until the kernel operation is complete.
So put a cudaDeviceSynchronize() call in right after the kernel call. I think you’ll get better results.
Thanks. I completely forgot that kernel launches were asynchronous, and that addition seems to address at least part of the problem. However, for higher N (N > 10k now, instead of N > 100), the assertion still fails (typically at i = 1 with c = 0 instead of 1). Any ideas?
first of all, please add proper cuda error checking to your kernel call. If you don’t know what proper cuda error checking is, please google “proper cuda error checking” and take the first hit. Then study how to properly check a kernel call for errors.
After that, learn to use cuda-memcheck, any time you are having trouble with a cuda code, before asking others for help. Even if you don’t understand the output, it will be useful for others trying to help you.
This computation could not possibly be correct:
int bytes = (N * sizeof(int) + 4095)/4096;
Suppose N is 10. I need 40 bytes (at least). 10*4 +4095 is 4135. 4135/4096 (integer division) is 1.
I’m not really sure what you’re doing with valloc, the adjustment of sizes and whatnot. I don’t intend to debug that any further for you.
If you use a simple modification of the cuda vectorAdd sample code (e.g. using malloc, and correctly computing sizes of allocations) I think you’ll have better results.
Also, on the subject of error checking, using an assert against cudaSuccess is throwing away useful information. If you study the proper cuda error checking information, you’ll see how to decode that into a text string. This will be useful later when you fix the other errors and you crank BLK up to a number greater than 65535. You’ll run into the next limit and the well-decoded error checking will give you a hint.
I appreciate the pointers regarding error-checking and cuda-memcheck. I will definitely use them in the future – again, I’m new to CUDA, and I didn’t know those tools existed until now.
A final question regarding cudaHostRegister(). You’re saying I should just be able to use malloc() to do my allocations, and that my mucking about with the number 4096 seemed dumb/unnecessary. However, I read in the documentation that the function asks for a page-aligned allocation of memory. Was I reading about an old standard, or does cudaHostRegister() just page-lock all the pages that the memory I malloc()'d is resident in?
The documentation for cudaHostRegister:
“Page-locks the memory range specified by ptr and size and maps it for the device(s) as specified by flags.”
cudaHostRegister doesn’t allocate anything, and never did to my knowledge. You use cudaHostRegister after performing an allocation, e.g. via malloc, or you combine the two into one step using cudaHostAlloc.
I’m also not sure why we would need to discuss alignment at this point. Sure alignment is important, but it’s not in any way the crux of the error or really that relevant to what is going on here.
Since you were passing cudaHostRegister a size much smaller (/4096) than what was actually needed, only a small portion of the necessary memory was actually mapped, so your kernel was accessing much of the data out-of-bounds.
If you’re trying to compute a necessary amount of bytes which is apparently what you were doing, and in fact what the function needs, I have no idea why you were dividing that quantity by 4096.