(I’ve never posted to this forum before, so if I’ve posted in the wrong place, please let me know.)
Logistics: I’m using CUDA 2.2, CUDA Programming Guide 2.2.1, GPU = Quadro NVS 295 (Compute Capability 1.1)
I have a question about the correct sequence of calls to use the texture cache / texture memory. I’m trying to write a simple example that writes an array of floats all set to 1 (eventually I want to make it more complicated, but a constant value is easiest to verify) into texture memory, then launches a kernel to read the values back out of the texture and write them to an output array of floats. I’ve followed the example in the Programming Guide and the projects/simpleTexture/ example in the SDK, and yet when I run the code I get all zeroes back from the texture fetch. I searched the forum and didn’t find this issue, probably because it’s too basic, but I was hoping someone might point out what I’m doing incorrectly. My guess is that I’m somehow loading the texture memory incorrectly, but I don’t see anything wrong with my code for that.
Additional, but unrelated question: in the projects/simpleTexture/ example in the SDK, they are using cudaSafeCall() around all of their calls in the host to malloc, etc. The programming guide makes no mention of this function though. Why should or shouldn’t I bother using this function? Is it simply a matter of “being safe” when allocating resources for the GPU?
There are several problems with that code, but the primary reason why you are getting zeros is that the kernel is probably not running at all, due to invalid execution parameters. If you add error checking to your code (that is what cudaSafeCall does, although I would discourage you from using anything from the cutil library in your code) you will much more readily be able to diagnose the problem.
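For what it’s worth, the error checking cudaSafeCall does can be reproduced with a few lines of your own, so you don’t need cutil at all. Here is a minimal sketch (CUDA_CHECK is my own name, not part of the runtime):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; on failure, print the file/line and
// the human-readable error string, then abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage (devPtr and n are placeholders for your own variables):
//   CUDA_CHECK(cudaMalloc((void**)&devPtr, n * sizeof(float)));
```

Unlike cutil, this keeps error checking enabled in release builds as well, which is usually what you want while debugging a problem like this.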
Thanks for your help. Sorry to sound stupid, but if you don’t recommend using the cutil library functions, can you tell me where I’m making such a dumb mistake? Are you referring to error-checking on malloc()? Something else?
If you call cudaGetLastError() after the kernel launch, I suspect you will find it is returning an error.
Also, the way you are allocating and copying hostArray on the device won’t work. The CUDA memory copy routines expect flat, linear memory (like their C standard library counterparts), and don’t do any kind of deep copying or pointer translation or anything fancy. So you will be copying over a bunch of host pointers and whatever garbage follows them to the device (in fact I am a little surprised it doesn’t cause a segfault on the host).
Thanks again for your help. If you look 3 lines down in my code from the code you pasted, you’ll see that I am calling cudaGetLastError() (it’s part of a print statement). When I run it, I am getting this (for a 1x1 array):
—Error from Kernel - no error—
i = 0 cpuOut = 1.000000 gpuOut = 0.000000
So, I’m not getting an error returned, at least as far as I can tell. Does putting cudaGetLastError() in the print statement cause an issue that would prevent me from seeing an error?
Also, what about my execution parameters are off? Granted, I don’t have as much experience as you do with execution parameters, but it seems to follow the convention / standards laid out in the programming guide. Perhaps I’m missing something…
Finally, as for the 2D host array not working because of the way CUDA expects you to transfer it, does this mean that you can’t transfer a 2D array to the device and that you would need to make a 1D array and index it as 2D? If that’s the case, that’s really unfortunate.
cudaGetLastError() returns the status of the last function call, which in this case is the cudaThreadSynchronize() call immediately above it, so the status of the kernel launch is being lost in the code as it is written. To check for a kernel launch failure, you need to call cudaGetLastError() directly after the kernel launch, then check the return status of the cudaThreadSynchronize(). That will tell you whether both the launch and the execution of the kernel were successful. Also, each of the other runtime library functions returns a status, which you should check to see where the code is failing.
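The sequence described above looks something like this (the kernel name and launch configuration are placeholders for your own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// myKernel, gridDim, blockDim and devOut stand in for your own code.
myKernel<<<gridDim, blockDim>>>(devOut);

// Checked immediately after the launch: catches invalid execution
// parameters and other launch-time failures.
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Checked after synchronizing: catches errors that occur while the
// kernel is actually executing on the device.
cudaError_t execErr = cudaThreadSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(execErr));
```

(cudaThreadSynchronize() is the CUDA 2.x name; later toolkits renamed it cudaDeviceSynchronize().)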
CUDA “arrays” for textures aren’t anything like the 2D array you are creating in your code. A cudaArray is an opaque type which uses an internal, undocumented space-filling layout on the GPU (something like a Hilbert curve), which is optimal for the cache and hardware. The source data needs to be in row- or column-major linear memory, and the driver/hardware packs that data into the internal representation that the texture hardware can work with. So whether you think it is unfortunate or not, that is how it works.
Yeah, you’re absolutely correct. As I mentioned in an additional post before yours but after the one you were referencing, I added error checking to all of the other cudaX() runtime library functions. The cudaFree() and cudaFreeArray() errors are still occurring even after the array fix.
I changed the hostArray to use a 1D allocation that is indexed by row major order, and that fixed the issue I was initially having (i.e. I can read back the values from the device now). That doesn’t explain why the cudaFree and cudaFreeArray functions are failing, but I’ll do some digging on those.