Array Problem


I am developing my first big project in CUDA “Parallel morphological operations on images”. Everything was OK since i started running a cuda kernel for big picture (1600x1200 pixels) for a morphologic mask 125x125 elements.

I have 9600 GT video card

For 512 threads in 512 blocks my computing stopped, graphic card was reset and there were no changes in image. After three days I realized that there was a problem with too long computing of one thread. I increased number of blocks to 3750 and everything should goes fine but it isn’t operation steel freezes. I started to change range of pixels on which are morphologic operation is made ( for one pixel there are 125x125 compare operations), if i change range to 1200x1200 everything goes OK, so i tried to start kernel two times for a half of image with 3750 blocks and then error occurs again ! I don’t know what to do and days of debugging gives me nothing (algorithm is ok i ran it on CUDA emulator). I can also say that my morphologic algorithm works fine for image 800x600 with mask 125x125. Below is my source code:

My kernel end get_pixel device method:




device char * get_pixel(char * image, int i, int j, int image_width, int image_height)


if (i < 0 || j < 0 || i >= image_height || j >= image_width) {

	return NULL;


return image + (i*image_width + j);




device char get_pixel_value(char * image, int i, int j, int image_width, int image_height)


char * pixel = get_pixel(image, i, j, image_width, image_height);

if (pixel == NULL) {

	return GRAY;


return *pixel;




// Kernel that executes on the CUDA device

global void g_erode(char * image, char * out_image, int idx_start, int idx_end, int image_width, int image_height, int struct_element_size, int * change)


//Wyliczenie ilosci watkow

int threadNumb = BLOCK_SIZE * BLOCK_NUMBER;

change[0] = threadNumb;

int threadId = blockIdx.x * blockDim.x + threadIdx.x;

//int pixels = image_width*image_height;

int pixels = idx_end - idx_start;

int pixelStringSize = pixels/threadNumb;

int pixelMod =  pixels % threadNumb;

int idx = 0;

int i = 0;

int j = 0;

int c, d;

int add = 0;

bool bBreak = false;

int diff = (int)(SIZE/2);

if (threadId <= pixels) {

	if (threadId < pixelMod) {

		idx = idx_start + threadId*(pixelStringSize + 1);

		add = 1;

	} else {

		idx = idx_start + pixelMod*(pixelStringSize +1) + (threadId - pixelMod)*pixelStringSize;


	for (int k = idx; k < idx + pixelStringSize + add;k++) {

		i = (int) k / image_width;

		j = k % image_width;


		if (get_pixel(image, i, j, image_width, image_height) != NULL) {

			for (int x = 0; x < SIZE; x++) {

				for (int y = 0; y < SIZE; y++) {

					c = i+x-diff;

					d = j+y-diff;

					if (get_pixel(image, c, d, image_width, image_height) != NULL) {

						if (get_pixel_value(image, c, d, image_width, image_height) != BLACK) {

							*get_pixel(out_image, i, j, image_width, image_height) = WHITE;


							bBreak = true;











And the kernel call:


//src is OpenCV object with char* imageData where image is beeing kept

const int IMAGE_WIDTH = src->width;

const int IMAGE_HEIGHT = src->height;

size_t mem_size_image = IMAGE_WIDTH * IMAGE_HEIGHT * sizeof(char);

// Pointer to host & device arrays (src & dst)

char * d_src, * d_dst;

//Allocate memory

cudaMalloc((void **) &d_src, mem_size_image);

cudaMalloc((void **) &d_dst, mem_size_image);

// Copy image from host to device

cudaMemcpy(d_src, src->imageData, mem_size_image, cudaMemcpyHostToDevice);

// Initialize block size and no of blocks

int block_size = BLOCK_SIZE;

int n_blocks = BLOCK_NUMBER;

int ch_size = SIZE*SIZE;

//debug pointer

int * change = (int )malloc(ch_sizesizeof(int));

cudaMalloc((void **) &d_change, ch_size*sizeof(int));

cudaMemcpy(d_change, change, ch_size*sizeof(int), cudaMemcpyHostToDevice);

//Run erosion for first half of picture

g_erode <<< n_blocks, block_size >>> (d_src, d_dst, 0, 960000, IMAGE_WIDTH, IMAGE_HEIGHT, mask[i], d_change);

//Run erosion for second one

g_erode <<< n_blocks, block_size >>> (d_src, d_dst, 960000, 1920000, IMAGE_WIDTH, IMAGE_HEIGHT, mask[i], d_change);

//Copy mem from device to host

cudaMemcpy((*dst)->imageData, d_dst, mem_size_image, cudaMemcpyDeviceToHost);


As i said when I perform this call the monitor blinks (or even system freezes) and I get ‘The launch timed out and was terminated.’ error. When i comment second line of g_erode everything goes good but erosion goeos only for half of picture.

If anyone could help I will be very thankful.

I believe you are running it on Windows with only one GPU. Problem is that CUDA uses whole GPU for its purposes and the operating system does not like the fact that GPU is not responding for too long (like 5 seconds). If that crashes GPU is reset by OS and CUDA reports an error.
There are three options for you:

  • use linux, preferably without X window
  • use two GPUs
  • try to make your code run faster

The last element seems the easiest (cheapest) to do. For example, by fast skimming of your code and explanation I presume the following (please correct me if I am wrong):

  • You are performing some per-pixel operation which is localised, that is, for each pixel you access some of its neighbourhood (that morphologic mask 125x125 elements?)
  • You use global memory to access each pixel (the get_pixel_value function)

Since neighbouring threads will access simillar pixels (the masks do overlap, right?) you may want to use some caching mechanism. The easiest way to do it is to map your input image to a texture memory. You can only read from texture memory, but after all your kernel is not modyfying the input image right? Texture memory is cached (about 6-8KB per SM). I am not sure if texture memory may be used for chars or it has to be 32-bit int but still it should be faster… much faster!
You may also want to organise a block to be a square of your image rather than a line from one end to another to maximise the overlapping of their masks, increase locality of the block and as a result increase number of cache hits.

Remember that accessing global memory introduced latency of several hundreds of clock cycles! If you do that a lot, threads will stay idle waiting for data to come… I am not sure how long is texture cache access time but I would not be surpriced to see a value of few cycles only (assuming cache hit of course)

you can also disable the watch dog timer, which resets the display driver and hardware if it doesn’t finish some job after 5 sec. Form the size of the problem you are describing it makes sense. and remember that the 9600 is a very week card with only 64 cores and a 92 architecture.

Thank you for quick answer,

Yes You are right, each pixel uses global memory to access his neighborhood, I didn’t realized that global memory access is so slow. I will try to optimize my algorithm using your advices. Thanks a lot.

Also I tried to turn off watchdog using method: , but it isn’t working and I don’t know why.

So that you don’t have wrong impression about GPUs :) :

Global memory access has higher bandwidth than accessing RAM from CPU without using cache. However, on the CPU side, it is much easier to fetch data for one thread (or few threads) than to throw data for hundreds of threads as it is done on a GPU. Besides, on host side CPU features L1 and L2 caches which can really help in localised tasks, on GPU the global memory features no cache at all because of synchronisation problems between threads – when you write to given cell, you would have to make sure the value gets updated in all other blocks as well. Texture memory provides a cache though, hence my suggestion.


can you please explain “texture memory”? i’m currently using “shared memory” for the purpose described. is texture memory different? faster?