most general form for thread access?

genisys · February 20, 2010, 4:19am

What is the most general form for accessing a thread element using gridDim, blockDim, incorporating x, y & z dimensions for grids blocks and threads? Is there such an expression that guarantees correct element access regardless of whether 1D, 2D or 3D information is passed in?

At the moment this (from my .cu file) is what I’m trying with a texture and it does not work:

// Global declarations at the top of .cu file

cudaArray* cuArray_eB;

texture<unsigned char, 2> tex;

float *d_sB;

float *d_sG; // unused in this instance, but if it works for d_sB then will be uncommented

float *d_sR; // unused in this instance, but if it works for d_sB then will be uncommented

// Wrapper to allocate device mem for cuArray_eB and d_sB

extern "C" void alcMem(unsigned int width, unsigned int height) {

	unsigned int numCells = width * height;

	cudaChannelFormatDesc description = cudaCreateChannelDesc<unsigned char>();

	cudaMallocArray(&cuArray_eB, &description, width, height);

	cudaMalloc((void**)&d_sB, numCells * sizeof(float));

	return;

}

// The kernel itself

__global__ void myKern(unsigned int width, unsigned int height,

					unsigned char *d_eB, float *d_sB) {

	//int tid = blockIdx.x * blockDim.x + threadIdx.x;

	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	unsigned char p = tex2D(tex, x, y);

	d_sB[y*width+x] = p;

	//__syncthreads();

	return;

}

// Wrapper to start kernel execution

extern "C" void launchKernel(unsigned int width, unsigned int height,

					unsigned char *h_eB, unsigned char *h_eG, unsigned char *h_eR) {

	unsigned int numCells = width * height;

	size_t sizeUchar = numCells * sizeof(unsigned char);

	// Copy from host to device and bind texture

	cudaMemcpy(cuArray_eB, h_eB, width*height*sizeof(unsigned char), cudaMemcpyHostToDevice);

	cudaBindTextureToArray(tex, cuArray_eB);

	// Launch kernel

	dim3 blockDim(16, 16, 1);

	dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);

	myKern<<<gridDim, blockDim>>>(width, height, d_eB);

	// Unbind texture

	cudaUnbindTexture(tex);

	return;

}

extern "C" void copyResultsToHost(unsigned int numCells, float *h_sB, float *h_sG, float *h_sR) {

		

	size_t sizeUint = numCells * sizeof(unsigned int);

	size_t sizeFloat = numCells * sizeof(float);

	// Copy from device to host

	cudaMemcpy(h_sB, d_sB, sizeFloat, cudaMemcpyDeviceToHost);

		return;

}

And then in my .cpp file I call the following:

// ... some code to read in image into host array

	// Allocate device memory

	alcMem(numCols, numRows);

	// Copy host data to device and Start the kernel

	launchKernel(numCols, numRows, eB, eG, eR);

	// Copy results from device to host

	copyResultsToHost(numCells, sB, sG, sR);

	// ... and then some other code to display back the image in the B channel from sB values

What I get is the same B value being copied to every location in sB, but what I expect is the entire image to show up only in the B channel.

I’m obviously accessing/assigning something wrong, but I can’t seem to identify where the mistake is. Every variation of element access I try results in the same thing. (I am reading from a live camera feed, so I know that at least one pixel, probably the first?, is being read/assigned correctly since the output image changes to different shades of uniform blue as a alter the lighting available to the camera.

avidday · February 20, 2010, 8:56am

I think you will find your texture memory setup is incorrect. You must use cudaMemcpyToArray to copy data into the cudaArray you are using for your texture. The coordinates you provide to the texture read should be floating point, texel centred, and the result the texture returns is always a floating point number, so there is an implicit cast from 32 bit float to 8 bit char in your code which you should be aware of.

genisys · February 20, 2010, 2:37pm

Thanks for your reply. Couldn’t get it to work by changing everything to floats and by copying with cudaMemcpyToArray. I guess I’ll just have to keep trying different strategies.

genisys · February 20, 2010, 4:37pm

Is it the case that passing multiple arrays to the device to work on and then trying to work on them all using the same thread identifier is what is producing incorrect results? What I’m trying is as follows:

__device__ unsigned int *d_eLabel;

__device__ unsigned char *d_eB;

__device__ unsigned char *d_eG;

__device__ unsigned char *d_eR;

__device__ float *d_sSize;

__device__ float *d_sB;

__device__ float *d_sG;

__device__ float *d_sR;

__global__ void myKern(unsigned int width, unsigned int height,

					unsigned int *d_eLabel, unsigned char *d_eB, unsigned char *d_eG, unsigned char *d_eR,

					float *d_sSize, float *d_sB, float *d_sG, float *d_sR) {

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	d_eLabel[idx] = idx;

	d_sSize[idx] = 1;

	d_sB[idx] = d_eB[idx];

	d_sG[idx] = d_eG[idx];

	d_sR[idx] = d_eR[idx];

	return;

}

(I know there are implicit casts in certain places, but that should not affect the results in this case)

My kernel launch is like this:

// Do calculation on device:

	unsigned int numCells = width*height;

	int block_size = 512;

	int n_blocks = numCells/block_size + (numCells%block_size == 0 ? 0:1);

	myKern<<<n_blocks, block_size>>>(width,height,d_eLabel,d_eB,d_eG,d_eR,d_sSize,d_sB,d_sG,d

_sR);

The bit I find strange is that unless I pass in the globally declared device arrays into the kernel it doesn’t seem to do any work on them. Why is this? I would have thought globally declared means the kernel should be able to access it even if it’s not a passed in parameter.

The above code works on d_sB perfectly fine, the blue channel of the image is displayed back perfectly, having been copied over by the kernel itself from the d_eB (copied from host) to the d_sB array. But the green and red channels show up strangely reshaped and tiled 4 times. Is this an indication that my grid/block dimensions are wrong?

Another very strange thing is if I comment out the first two assignments within the kernel, d_eLabel[idx] = idx and d_sSize[idx] = 1, then the blue channel malfunctions as well. It’s like the third assignment needs the first two to be there in order to work, although all three should be completely independent of each other.

(Apologies for staying on this topic)

avidday · February 20, 2010, 5:19pm

I am going to guess your device memory management is wrong, although that is just a guess because you haven’t shown the code. Having global scope global pointers like that doesn’t make much sense, and I am pretty certain that a combination of incorrect host side memory allocation code and device code scope issues and causing the problems you are seeing. All the code you have posted also lacks an error checking, which is also probably not a good idea. All of those API functions you are calling return a status which you should be checking. It is quite possible that one or more of them is failing silently, and you don’t know it.

genisys · February 20, 2010, 5:46pm

Sorry about that, I was trying not to fill the post with code, but I’m grateful you indicated you would rather see the code. Here it is:

// main.cpp

// includes

#include <iostream>

#include <cv.h>

#include <cxcore.h>

#include <highgui.h>

#include <sys/timeb.h>

#include <cutil.h>

#include <cutil_inline.h>

using namespace std;

// Forward declare the functions

extern "C" void alcMem(unsigned int width, unsigned int height);

extern "C" void frMem();

extern "C" void doKernel(unsigned int width, unsigned int height,

					  unsigned char *eB, unsigned char *eG, unsigned char *eR);

extern "C" void getKernelResults(unsigned int numCells,

					  unsigned int *eLabel, float *sSize, float *sB, float *sG, float *sR);

// Main

int main() {

	struct _timeb st_time_start;

	struct _timeb st_time_end;

	int key = 0;

	unsigned int count;

	unsigned int cycles = 0;

	unsigned int numRows, numCols;

	unsigned int numCells = 0;

	CvPoint tmpPoint;

	uchar *tmpPtr;

	CvCapture* capture = NULL;

	IplImage *inIm, *maskIm = 0;

	capture = cvCaptureFromCAM(0);

	if(!capture) {printf("ERROR: Couldn't use capture device."); return -1;}

	cvSetCaptureProperty(capture, CV_CAP_PROP_FRAME_WIDTH, 320);

	cvSetCaptureProperty(capture, CV_CAP_PROP_FRAME_HEIGHT, 240);

	//cvSetCaptureProperty(capture, CV_CAP_PROP_FPS, 7);

	inIm = cvQueryFrame(capture);

	if(!inIm) {cout << "Input error" << endl; return -1;}

	// Create masks

	maskIm = cvCreateImage(cvGetSize(inIm), IPL_DEPTH_8U, 3);

	numRows = inIm->height;

	numCols = inIm->width;

	numCells = numRows*numCols;

	// Allocate host memory

	unsigned char *eB=0, *eG=0, *eR=0;

	float *sB=0, *sG=0, *sR=0, *sSize=0;

	unsigned int *eLabel=0;

	eB = new unsigned char[numCells];

	eG = new unsigned char[numCells];

	eR = new unsigned char[numCells];

	eLabel = new unsigned int[numCells];

	sB = new float[numCells];

	sG = new float[numCells];

	sR = new float[numCells];

	sSize = new float[numCells];

	// Allocate device memory

	alcMem(numCols, numRows);

	// Open display windows

	cvNamedWindow("aConsole", 1);

	while(1) {

		// Start timer

		_ftime64_s(&st_time_start);

		++cycles;

		inIm = cvQueryFrame(capture);

		if(!inIm) {cvWaitKey(10); continue;}

		cvCopy(inIm, maskIm, NULL);

		// Copy data from image to host arrays

		for(int i=0;i<numCols;++i) {

			for(int j=0;j<numRows;++j) {

				tmpPoint = cvPoint(i, j);

				tmpPtr = &((uchar*)(inIm->imageData + inIm->widthStep*tmpPoint.y))[tmpPoint.x*3];

				eB[j*numCols+i] = tmpPtr[0];

				eG[j*numRows+i] = tmpPtr[1];

				eR[j*numRows+i] = tmpPtr[2];

			}

		}

		doKernel(numCols, numRows, eB, eG, eR);

		getKernelResults(numCells, eLabel, sSize, sB, sG, sR);

		// Draw output pixels on mask(s)

		unsigned int tmpIndex;

		for(int i=0;i<numCols;++i) {

			for(int j=0;j<numRows;++j) {

				tmpPtr = &((uchar*)(maskIm->imageData + maskIm->widthStep*j))[i*3];

				tmpPtr[0] = (unsigned char)sB[eLabel[j*numCols+i]];

				tmpPtr[1] = (unsigned char)sG[eLabel[j*numCols+i]];

				tmpPtr[2] = (unsigned char)sR[eLabel[j*numCols+i]];

			}

		}

		// Stop timer

		_ftime64_s(&st_time_end);

		printf("\nFPS: %f", (double) 1000/(1000*(st_time_end.time - st_time_start.time) + (st_time_end.millitm - st_time_start.millitm)));

		// Display

		cvShowImage("aConsole", maskIm);

		// Test for break key

		key=0; key = cvWaitKey(10);

		if((char)key == 27) {

			break;

		}

	}

	// Clear device memory

	frMem();

	

	// Clear host memory

	delete[] eB;

	delete[] eG;

	delete[] eR;

	delete[] eLabel;

	delete[] sB;

	delete[] sG;

	delete[] sR;

	delete[] sSize;

	// Return

	return 0;

}

… and …

// cppIntegration.cu

// includes

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cutil.h>

#include <cutil_inline.h>

//#include <cuda.h>

__device__ unsigned int *d_eLabel;

__device__ unsigned char *d_eB;

__device__ unsigned char *d_eG;

__device__ unsigned char *d_eR;

__device__ float *d_sSize;

__device__ float *d_sB;

__device__ float *d_sG;

__device__ float *d_sR;

__device__ cudaArray* cuArray_eB;

texture<float, 2> tex;

__global__ void myKern(unsigned int width, unsigned int height,

					unsigned int *d_eLabel, unsigned char *d_eB, unsigned char *d_eG, unsigned char *d_eR,

					float *d_sSize, float *d_sB, float *d_sG, float *d_sR) {

	//unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

	//unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	//float p = tex2D(tex, x, y);

	//d_sB[y*width+x] = p;

	unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

	d_eLabel[idx] = idx;

	d_sSize[idx] = 1;

	d_sB[idx] = d_eB[idx];

	d_sG[idx] = d_eG[idx];

	//d_sR[y*width+x] = d_eR[y*width+x];

	return;

}

extern "C" void alcMem(unsigned int width, unsigned int height) {

	unsigned int numCells = width * height;

	// Allocate array

	cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();

	//cudaChannelFormatDesc description = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

	cudaMallocArray(&cuArray_eB, &description, width, height);

	cudaMalloc((void**)&d_eLabel, numCells * sizeof(unsigned int));

	cudaMalloc((void**)&d_eB, numCells * sizeof(unsigned char));

	cudaMalloc((void**)&d_eG, numCells * sizeof(unsigned char));

	cudaMalloc((void**)&d_eR, numCells * sizeof(unsigned char));

	cudaMalloc((void**)&d_sSize, numCells * sizeof(float));

	cudaMalloc((void**)&d_sB, numCells * sizeof(float));

	cudaMalloc((void**)&d_sG, numCells * sizeof(float));

	cudaMalloc((void**)&d_sR, numCells * sizeof(float));

	return;

}

extern "C" void frMem() {

	CUDA_SAFE_CALL(cudaFree(d_eLabel));

	CUDA_SAFE_CALL(cudaFree(d_eB));

	CUDA_SAFE_CALL(cudaFree(d_eG));

	CUDA_SAFE_CALL(cudaFree(d_eR));

	CUDA_SAFE_CALL(cudaFree(d_sSize));

	CUDA_SAFE_CALL(cudaFree(d_sB));

	CUDA_SAFE_CALL(cudaFree(d_sG));

	CUDA_SAFE_CALL(cudaFree(d_sR));

	// Unbind texture

	//cudaUnbindTexture(tex);

	return;

}

extern "C" void getKernelResults(unsigned int numCells,

					  unsigned int *h_eLabel, float *h_sSize, float *h_sB, float *h_sG, float *h_sR) {

		

	// check if kernel execution generated and error

	cutilCheckMsg("Kernel execution failed");

	size_t sizeUint = numCells * sizeof(unsigned int);

	size_t sizeFloat = numCells * sizeof(float);

	// Copy from device to host

	cudaMemcpy(h_eLabel, d_eLabel, sizeUint, cudaMemcpyDeviceToHost);

	cudaMemcpy(h_sSize, d_sSize, sizeFloat, cudaMemcpyDeviceToHost);

	cudaMemcpy(h_sB, d_sB, sizeFloat, cudaMemcpyDeviceToHost);

	cudaMemcpy(h_sG, d_sG, sizeFloat, cudaMemcpyDeviceToHost);

	cudaMemcpy(h_sR, d_sR, sizeFloat, cudaMemcpyDeviceToHost);

}

extern "C" void doKernel(unsigned int width, unsigned int height,

					unsigned char *h_eB, unsigned char *h_eG, unsigned char *h_eR) {

		unsigned int numCells = width * height;

		size_t sizeUchar = numCells * sizeof(unsigned char);

		// Copy from host to device

		cudaMemcpy(d_eB, h_eB, sizeUchar, cudaMemcpyHostToDevice);

		cudaMemcpy(d_eG, h_eG, sizeUchar, cudaMemcpyHostToDevice);

		cudaMemcpy(d_eR, h_eR, sizeUchar, cudaMemcpyHostToDevice);

		// Copy image data to texture array

		//cudaMemcpyToArray(cuArray_eB, 0, 0, h_eB, width*height*sizeof(unsigned char), cudaMemcpyHostToDevice);

		// Bind the array to the texture

		//cudaBindTextureToArray(tex, cuArray_eB);

		// Launch kernel

		/*

		int blockSize = 256;

		int numBlocks = numCells/blockSize + (numCells%blockSize == 0?0:1);

		dim3 gridDim(numBlocks, 1);

		dim3 blockDim(blockSize, 1);

		//dim3 blockDim(16, 16, 1);

		//dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);

		*/

		// Do calculation on device:

		int block_size = 512;

		int n_blocks = numCells/block_size + (numCells%block_size == 0 ? 0:1);

		myKern<<<n_blocks, block_size>>>(width,height,d_eLabel,d_eB,d_eG,d_eR,d_sSize,d_sB,d_sG,d

_sR);

		return;

}

Thanks once again for looking through my problem.

avidday · February 20, 2010, 5:56pm

OK, as I suspected, all of those cudaMalloc calls in alcMem() are wrong.

When you allocate device memory, the address of the allocation needs to be assigned to a host pointer. The host pointer can then be passed by value to a kernel call as an argument. Alternatively, the pointer value can be copied onto a device memory symbol (a constant memory pointer value makes much more sense in that case than a global memory pointer). If you do that, then there is no need to pass the value as a kernel argument.

There are a lot of examples of how memory management should work (and the earlier texture lookup you were trying) in the CUDA SDK. You might find it instructive to spend a bit of time reading some of that code to get an idea about how to get your own code working.

genisys · February 20, 2010, 6:45pm

Thanks very much again avidday. I am rereading the SDK, but I would like to get the general direction that you recommend clear, I understand the following should be my new function definitions?

extern "C" void alcMem(unsigned int width, unsigned int height, float *d_sSize, float *d_sB, float *d_sG, float *d_sR);

extern "C" void frMem(float *d_sSize, float *d_sB, float *d_sG, float *d_sR);

extern "C" void doKernel(unsigned int width, unsigned int height, unsigned int *eLabel, unsigned char *eB, unsigned char *eG, unsigned char *eR,

					  float *sSize, float *sB, float *sG, float *sR);

extern "C" void getKernelResults(unsigned int numCells, unsigned int *eLabel, float *sSize, float *sB, float *sG, float *sR);

I remove all the device pointer declarations at global scope in the .cu file, creating them instead within main.cpp on the host
I declare device pointers d_sSize, d_sB, d_sG, d_sR on the host, pass these pointers to alcMem for device level memory allocation.
cudaMemcpy eB, eG, eR over to d_eB, d_eG, d_eR
Call doKernel to launch kernel execution, at which time the kernel is launched with the following parameters: eB, eG, eR, d_eLabel, d_sSize, d_sB, d_sG, d_sR
Retrieve (memcpy back) as I was doing before
Free mem as before

genisys · February 20, 2010, 6:49pm

Actually I’ve taken enough of your time already. Please disregard previous post, I’ll go the rest of the way on my own. Thanks loads for the head start.

avidday · February 21, 2010, 9:50am

That is pretty much how it should work.

You could also define constant pointers at global scope, do the device memory allocations with cudaMalloc() on host pointers, and then use memcpyToSymbol() to write the addresses of the memory allocations to those constant pointers. Constant memory has cache and a broadcast mechanism which makes it fast for that sort of usage, and it eliminates the need for long argument lists in you device functions (those arguments also occupy shared memory, so it frees up some shared memory in the process).

genisys · February 21, 2010, 1:20pm

Got it working finally! Thanks for your tip about the constant ptrs. Will implement that next. I have one more quick question but I should open a new thread for that.