What is the most general form for accessing a thread element using gridDim, blockDim, incorporating x, y & z dimensions for grids blocks and threads? Is there such an expression that guarantees correct element access regardless of whether 1D, 2D or 3D information is passed in?
At the moment this (from my .cu file) is what I’m trying with a texture and it does not work:
// Global declarations at the top of .cu file
cudaArray* cuArray_eB;
texture<unsigned char, 2> tex;
float *d_sB;
float *d_sG; // unused in this instance, but if it works for d_sB then will be uncommented
float *d_sR; // unused in this instance, but if it works for d_sB then will be uncommented
// Wrapper to allocate device mem for cuArray_eB and d_sB
extern "C" void alcMem(unsigned int width, unsigned int height) {
unsigned int numCells = width * height;
cudaChannelFormatDesc description = cudaCreateChannelDesc<unsigned char>();
cudaMallocArray(&cuArray_eB, &description, width, height);
cudaMalloc((void**)&d_sB, numCells * sizeof(float));
return;
}
// The kernel itself
__global__ void myKern(unsigned int width, unsigned int height,
unsigned char *d_eB, float *d_sB) {
//int tid = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
unsigned char p = tex2D(tex, x, y);
d_sB[y*width+x] = p;
//__syncthreads();
return;
}
// Wrapper to start kernel execution
extern "C" void launchKernel(unsigned int width, unsigned int height,
unsigned char *h_eB, unsigned char *h_eG, unsigned char *h_eR) {
unsigned int numCells = width * height;
size_t sizeUchar = numCells * sizeof(unsigned char);
// Copy from host to device and bind texture
cudaMemcpy(cuArray_eB, h_eB, width*height*sizeof(unsigned char), cudaMemcpyHostToDevice);
cudaBindTextureToArray(tex, cuArray_eB);
// Launch kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
myKern<<<gridDim, blockDim>>>(width, height, d_eB);
// Unbind texture
cudaUnbindTexture(tex);
return;
}
extern "C" void copyResultsToHost(unsigned int numCells, float *h_sB, float *h_sG, float *h_sR) {
size_t sizeUint = numCells * sizeof(unsigned int);
size_t sizeFloat = numCells * sizeof(float);
// Copy from device to host
cudaMemcpy(h_sB, d_sB, sizeFloat, cudaMemcpyDeviceToHost);
return;
}
And then in my .cpp file I call the following:
// ... some code to read in image into host array
// Allocate device memory
alcMem(numCols, numRows);
// Copy host data to device and Start the kernel
launchKernel(numCols, numRows, eB, eG, eR);
// Copy results from device to host
copyResultsToHost(numCells, sB, sG, sR);
// ... and then some other code to display back the image in the B channel from sB values
What I get is the same B value being copied to every location in sB, but what I expect is the entire image to show up only in the B channel.
I’m obviously accessing/assigning something wrong, but I can’t seem to identify where the mistake is. Every variation of element access I try results in the same thing. (I am reading from a live camera feed, so I know that at least one pixel, probably the first?, is being read/assigned correctly since the output image changes to different shades of uniform blue as a alter the lighting available to the camera.
I think you will find your texture memory setup is incorrect. You must use cudaMemcpyToArray to copy data into the cudaArray you are using for your texture. The coordinates you provide to the texture read should be floating point, texel centred, and the result the texture returns is always a floating point number, so there is an implicit cast from 32 bit float to 8 bit char in your code which you should be aware of.
Thanks for your reply. Couldn’t get it to work by changing everything to floats and by copying with cudaMemcpyToArray. I guess I’ll just have to keep trying different strategies.
Is it the case that passing multiple arrays to the device to work on and then trying to work on them all using the same thread identifier is what is producing incorrect results? What I’m trying is as follows:
(I know there are implicit casts in certain places, but that should not affect the results in this case)
My kernel launch is like this:
// Do calculation on device:
unsigned int numCells = width*height;
int block_size = 512;
int n_blocks = numCells/block_size + (numCells%block_size == 0 ? 0:1);
myKern<<<n_blocks, block_size>>>(width,height,d_eLabel,d_eB,d_eG,d_eR,d_sSize,d_sB,d_sG,d
_sR);
The bit I find strange is that unless I pass in the globally declared device arrays into the kernel it doesn’t seem to do any work on them. Why is this? I would have thought globally declared means the kernel should be able to access it even if it’s not a passed in parameter.
The above code works on d_sB perfectly fine, the blue channel of the image is displayed back perfectly, having been copied over by the kernel itself from the d_eB (copied from host) to the d_sB array. But the green and red channels show up strangely reshaped and tiled 4 times. Is this an indication that my grid/block dimensions are wrong?
Another very strange thing is if I comment out the first two assignments within the kernel, d_eLabel[idx] = idx and d_sSize[idx] = 1, then the blue channel malfunctions as well. It’s like the third assignment needs the first two to be there in order to work, although all three should be completely independent of each other.
I am going to guess your device memory management is wrong, although that is just a guess because you haven’t shown the code. Having global scope global pointers like that doesn’t make much sense, and I am pretty certain that a combination of incorrect host side memory allocation code and device code scope issues and causing the problems you are seeing. All the code you have posted also lacks an error checking, which is also probably not a good idea. All of those API functions you are calling return a status which you should be checking. It is quite possible that one or more of them is failing silently, and you don’t know it.
OK, as I suspected, all of those cudaMalloc calls in alcMem() are wrong.
When you allocate device memory, the address of the allocation needs to be assigned to a host pointer. The host pointer can then be passed by value to a kernel call as an argument. Alternatively, the pointer value can be copied onto a device memory symbol (a constant memory pointer value makes much more sense in that case than a global memory pointer). If you do that, then there is no need to pass the value as a kernel argument.
There are a lot of examples of how memory management should work (and the earlier texture lookup you were trying) in the CUDA SDK. You might find it instructive to spend a bit of time reading some of that code to get an idea about how to get your own code working.
Thanks very much again avidday. I am rereading the SDK, but I would like to get the general direction that you recommend clear, I understand the following should be my new function definitions?
I remove all the device pointer declarations at global scope in the .cu file, creating them instead within main.cpp on the host
I declare device pointers d_sSize, d_sB, d_sG, d_sR on the host, pass these pointers to alcMem for device level memory allocation.
cudaMemcpy eB, eG, eR over to d_eB, d_eG, d_eR
Call doKernel to launch kernel execution, at which time the kernel is launched with the following parameters: eB, eG, eR, d_eLabel, d_sSize, d_sB, d_sG, d_sR
Actually I’ve taken enough of your time already. Please disregard previous post, I’ll go the rest of the way on my own. Thanks loads for the head start.
You could also define constant pointers at global scope, do the device memory allocations with cudaMalloc() on host pointers, and then use memcpyToSymbol() to write the addresses of the memory allocations to those constant pointers. Constant memory has cache and a broadcast mechanism which makes it fast for that sort of usage, and it eliminates the need for long argument lists in you device functions (those arguments also occupy shared memory, so it frees up some shared memory in the process).
Got it working finally! Thanks for your tip about the constant ptrs. Will implement that next. I have one more quick question but I should open a new thread for that.