GTX460 Max of 1 Block?

I’m running Ubuntu 10.04 with 2 GB of RAM, gcc 4.3.3, and the CUDA toolkit release from March 2010. Furthermore, I checked to make sure each variable I use to invoke the kernel holds exactly the value I want, so there are no stupid mistakes there. I’ve worked with CUDA for a while, and I’ve never seen this before.

I’m writing a simple program where the kernel is invoked on a random-number array in global memory; the array is moved to shared memory, where its contents get overwritten with 0’s and 1’s, and the result is written back to global memory. I can easily tell from the results that one block writes 0’s and 1’s from its shared memory, but the rest of the output still contains the random numbers, as if the other blocks did nothing at all.
I’m not hitting my maximum blocks, maximum active blocks per MP, or shared memory limit at all when the array size is 32.

If the array size is 512, there’s an unspecified launch failure error.

What in the world is the problem? Is it because I’m running my program with the X server running, so not all the blocks I invoke the kernel with actually run?

How about posting some code?

The code has been chopped up. If you see variable names not being used, that’s because I have to be careful about what I can actually post here. Also, I’m trying to learn how to use these forums, so be patient with me.

Hope this works.

bitonicSort.cu (8.14 KB)

This is the output of the code:

Memory Allocation: SUCCESSMemory Allocation: SUCCESSHere is bubblesorted array:

-12 -5 0 2 2 2 5 10 71 75 80 80 120 129 136 138 174 176 179 182 218 221 222 227 266 271 277 277 309 309 310 319

Here is random array:

176 182 179 174 2 10 2 2 310 319 309 309 222 227 221 218 71 80 80 75 277 277 271 266 0 5 -5 -12 136 138 129 120 Memory Copy: SUCCESSMemory Copy: SUCCESSHere is sequence size: 4

Here is PowerOf2: 1

Here is Blocks: 8

Here is Threads: 2

Here is SHARED: 4

Total execution time for kernel: 0.099200

Invoke Kernel: SUCCESS

Here is an attempt to bitonically sort array:

1 1 0 0 2 10 2 2 310 319 309 309 222 227 221 218 71 80 80 75 277 277 271 266 0 5 -5 -12 136 138 129 120

Here is the sorted array

-12 -5 0 2 2 2 5 10 71 75 80 80 120 129 136 138 174 176 179 182 218 221 222 227 266 271 277 277 309 309 310 319

The array is not sorted correctly.

The code isn’t supposed to sort correctly yet. There are some formatting problems when I try to post here, so it was best to just attach the file.

I haven’t read all the code, but it seems you have an out-of-bounds access in shared memory. sh has a fixed size of 64 elements, while you access it as sh[threadIdx.x] and sh[threadIdx.x + Threads], where it seems both threadIdx.x and Threads may be larger than 63.
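If each block is meant to handle its own chunk of 2*Threads elements, I’d expect the shared index to use threadIdx.x only and the global index to carry the block offset — a rough sketch along those lines (untested, reusing your names; the sort step itself is elided):

[code]
__global__ void bitonicSort(int *NArray, int Threads) {
    __shared__ int sh[SHARED];        // SHARED only needs to cover 2*Threads
    int g = blockIdx.x * (2 * Threads) + threadIdx.x;
    sh[threadIdx.x]           = NArray[g];            // shared index is per-block
    sh[threadIdx.x + Threads] = NArray[g + Threads];
    __syncthreads();
    // ... sort the 2*Threads elements of sh here ...
    NArray[g]           = sh[threadIdx.x];
    NArray[g + Threads] = sh[threadIdx.x + Threads];
}
[/code]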

BTW you can nicely inline code in the forums using [code]…[/code] tags.

Thank you for the help. I’ll be able to try that this evening.

I made a huge mistake. At first I thought I was getting correct results. Here is my output:

Memory Allocation: SUCCESSMemory Allocation: SUCCESSHere is bubblesorted array:

31 31 40 41 60 66 68 75 161 164 168 172 275 277 277 280 281 281 283 284 285 287 288 288 295 295 299 300 303 303 303 309

Here is random array:

281 285 277 275 68 75 66 60 280 288 288 284 164 172 168 161 299 300 295 295 303 309 303 303 283 287 281 277 31 41 40 31 Memory Copy: SUCCESSMemory Copy: SUCCESSHere is sequence size: 4

Here is PowerOf2: 1

Here is Blocks: 8

Here is Threads: 2

Here is SHARED: 4

Total execution time for kernel: 0.105984

Invoke Kernel: SUCCESS

Here is an attempt to bitonically sort array:

275 277 281 285 68 75 66 60 280 288 288 284 164 172 168 161 299 300 295 295 303 309 303 303 283 287 281 277 31 41 40 31

Here is the sorted array

31 31 40 41 60 66 68 75 161 164 168 172 275 277 277 280 281 281 283 284 285 287 288 288 295 295 299 300 303 303 303 309

The array is not sorted correctly.

Free Memory: SUCCESSFree Memory: SUCCESS

Look under “Here is an attempt to bitonically sort array” at the first 4 numbers - those are sorted correctly. The rest aren’t, even though each block handles a bitonic sequence of 4 numbers. In fact, the rest were copied back from global memory without being sorted at all. I’d have been glad if they held garbage numbers, but doing nothing?!

This proves I’m only using one block even though I launched the kernel with 8 blocks.

Here’s further proof of this behavior based on the code I provided in the link:

Memory Allocation: SUCCESSMemory Allocation: SUCCESSHere is bubblesorted array:

4 11 14 17 41 47 53 55 92 94 94 95 96 96 101 103 141 147 147 157 161 168 168 175 189 193 196 200 282 288 293 295

Here is random array:

14 17 11 4 196 200 193 189 168 175 168 161 53 55 47 41 92 101 96 94 293 295 288 282 94 103 96 95 147 157 147 141 Memory Copy: SUCCESSMemory Copy: SUCCESSHere is sequence size: 4

Here is PowerOf2: 1

Here is Blocks: 8

Here is Threads: 2 -Maximum number of threads per block

Here is SHARED: 4

Total execution time for kernel: 0.036128

Invoke Kernel: SUCCESS

Here is an attempt to bitonically sort array:

1 1 0 0 196 200 193 189 168 175 168 161 53 55 47 41 92 101 96 94 293 295 288 282 94 103 96 95 147 157 147 141

Here is the sorted array

4 11 14 17 41 47 53 55 92 94 94 95 96 96 101 103 141 147 147 157 161 168 168 175 189 193 196 200 282 288 293 295

The array is not sorted correctly.

Again, the same behavior with these changes:

[code]
/* The following code is an experiment to see if Bitonic Sort can be done without recursion */
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <cuda.h>
#include <string.h>

/* For GTX460, using the occupancy calculator */
#define N 32
#define MAX_ACTIVE_BLOCK 7
#define MAX_THREAD_PER_BLOCK 512
#define SHARED 64

void initBArray(int *r, int SeqSize);         //Fill N array with bitonic sequences
void printArray(int BSortedArray[]);
void errorCheck(cudaError_t ErrorFlag, const char *msg);
void copyArray(int *NArray, int *CopyArray);
bool sortCheck(int *NArray, int *NArrayCopy); //Compare arrays for accuracy
void bubbleSort(int *NArrayCopy);
bool isPowerOf2();                            //For future implementation

__global__ void bitonicSort(int *NArray, int *BarrierVector, int SeqSize, int PowerOf2, int Blocks, int Threads) {
    register int AllThread = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int sh[SHARED];

    sh[(blockIdx.x * Threads) + threadIdx.x] = NArray[AllThread];
    sh[(blockIdx.x * Threads) + threadIdx.x + Threads] = NArray[AllThread + Threads];

    sh[threadIdx.x] = 1;
    sh[threadIdx.x + Threads] = 0;

    NArray[AllThread] = sh[(blockIdx.x * Threads) + threadIdx.x];
    NArray[Threads + AllThread] = sh[(blockIdx.x * Threads) + threadIdx.x + Threads];
}

int main() {
    /* General declarations, summations, and initialization */
    int NArray[N] = {0};     //random number array
    int NArrayCopy[N] = {0}; //copy of array to be checked for correct sorting
    int Blocks;              //Temp value for bug checking - it's supposed to be N / BATCHSIZE
    int Threads;
    int PowerOf2;
    int SeqSize;
    int MaxSharedPerBlock;
    int SharedPerBlock = 1;
    int Limit = N / MAX_ACTIVE_BLOCK;
    float time1;
    bool IsCorrect;

    MaxSharedPerBlock = ((16 / MAX_ACTIVE_BLOCK) * 1000) * 2; //max integers in shared per active block
    if (Limit < MaxSharedPerBlock) {
        while ((SharedPerBlock < Limit) && (SharedPerBlock <= MAX_THREAD_PER_BLOCK)) {
            SharedPerBlock = SharedPerBlock * 2;
        }
        #undef SHARED
        #define SHARED SharedPerBlock
        Blocks = N / SharedPerBlock;
        Threads = SharedPerBlock / 2;
        SeqSize = SharedPerBlock;
        PowerOf2 = 1;
    }
    else {
        PowerOf2 = 0;
        Blocks = N / MAX_THREAD_PER_BLOCK;
        Threads = MAX_THREAD_PER_BLOCK / 2;
    }
    int BarrierVector[Blocks];

    dim3 BlockGrid(Blocks, 1, 1);
    dim3 BlockDim(Threads, 1, 1);
    cudaEvent_t start;
    cudaEvent_t stop;
    cudaError_t Error;

    srand(time(0));

    //Initialize memory sizes for host and device
    size_t ArMemSize;
    size_t BlMemSize;
    int *HostNArray;       //memory for fully sorted array
    int *DevNArray;        //memory for full array for device
    int *DevBarrierVector; //memory for barrier vector for device
    //Get and allocate memory size for host
    ArMemSize = N * sizeof(int);
    BlMemSize = Blocks * sizeof(int);
    HostNArray = (int *)malloc(ArMemSize);
    //Allocate memory on the device
    Error = cudaMalloc((void **)&DevNArray, ArMemSize);
    errorCheck(Error, "Memory Allocation: ");
    Error = cudaMalloc((void **)&DevBarrierVector, BlMemSize);
    errorCheck(Error, "Memory Allocation: ");
    initBArray(NArray, SeqSize);
    copyArray(NArray, NArrayCopy);
    bubbleSort(NArrayCopy);

    printf("Here is bubblesorted array: \n");
    printArray(NArrayCopy);
    printf("\nHere is random array: \n");
    printArray(NArray);
    //Fill barrier array
    for (int counter = 0; counter < Blocks; counter++)
        BarrierVector[counter] = 0;
    //Copy memory from host to device
    Error = cudaMemcpy(DevNArray, (void *)NArray, ArMemSize, cudaMemcpyHostToDevice);
    errorCheck(Error, "Memory Copy: ");
    Error = cudaMemcpy(DevBarrierVector, BarrierVector, BlMemSize, cudaMemcpyHostToDevice);
    errorCheck(Error, "Memory Copy: ");
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    /* Value check */
    printf("Here is sequence size: %d\n", SeqSize);
    printf("Here is PowerOf2: %d\n", PowerOf2);
    printf("Here is Blocks: %d\n", Blocks);
    printf("Here is Threads: %d\n", Threads);
    printf("Here is SHARED: %d\n", SHARED);
    bitonicSort<<<BlockGrid, BlockDim>>>(DevNArray, DevBarrierVector, SeqSize, PowerOf2, Blocks, Threads);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time1, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("\n");
    printf("Total execution time for kernel: %f\n", time1);
    //Copy sorted list from device to host
    cudaMemcpy(HostNArray, DevNArray, ArMemSize, cudaMemcpyDeviceToHost);
    Error = cudaGetLastError();
    errorCheck(Error, "Invoke Kernel: ");
    printf("\nHere is an attempt to bitonically sort array: \n");
    printArray(HostNArray);
    printf("\nHere is the sorted array\n");
    printArray(NArrayCopy);
    IsCorrect = sortCheck(HostNArray, NArrayCopy);
    if (IsCorrect) {
        printf("\nThe array is correctly sorted.\n");
    }
    else {
        printf("\nThe array is not sorted correctly.\n");
    }
    //Free memory used
    Error = cudaFree(DevNArray);
    errorCheck(Error, "Free Memory: ");
    Error = cudaFree(DevBarrierVector);
    errorCheck(Error, "Free Memory: ");
    free(HostNArray);
}

//Pre-Conditions: receives an empty integer array of maximum size N
//Post-Conditions: fills the array with bitonic sequences up to N
void initBArray(int *r, int SeqSize) {
    volatile int LocalMax;
    volatile int LowerLimit = 0;
    volatile int UpperLimit = SeqSize;
    volatile int MiddleOfSeq = SeqSize / 2 - 1;
    volatile int Max = N - 1;
    volatile int AfterMidSeq = SeqSize / 2;
    volatile int Previous;
    //Get a local maximum
    LocalMax = N * 10;
    while (AfterMidSeq <= N) {
        for (int count = MiddleOfSeq; count >= LowerLimit; count--) {
            if (count == MiddleOfSeq) {
                r[count] = rand() % LocalMax;
            }
            else {
                Previous = r[count + 1];
                r[count] = Previous - (rand() % 11);
            }
        }
        for (int count = AfterMidSeq; count < UpperLimit; count++) {
            Previous = r[count - 1];
            r[count] = Previous - (rand() % 11);
        }
        AfterMidSeq = AfterMidSeq + SeqSize;
        MiddleOfSeq = MiddleOfSeq + SeqSize;
        LowerLimit = LowerLimit + SeqSize;
        UpperLimit = UpperLimit + SeqSize;
    }
}

//Pre-Conditions: receives an integer bitonically sorted array of maximum size N
//Post-Conditions: displays the contents of the bitonically sorted array
void printArray(int BSortedArray[]) {
    int count;
    for (count = 0; count < N; count++)
        printf("%d ", BSortedArray[count]);
}

//Pre-Conditions: receives an error flag of type cudaError_t and a string describing where in the program the CUDA error could have occurred
//Post-Conditions: displays an error message for the CUDA operation if there is one, or a success message if there isn't
void errorCheck(cudaError_t ErrorFlag, const char *msg) {
    if (ErrorFlag != cudaSuccess) {
        fprintf(stderr, "Cuda: error %s %s\n", msg, cudaGetErrorString(ErrorFlag));
        exit(-1);
    }
    else
        printf("%s SUCCESS", msg);
}

void bubbleSort(int *NArray) {
    volatile int Count;
    volatile int Pass = N - 1;
    volatile int InnerCount;
    volatile int HighValueIndex;
    volatile int Temp;
    volatile int InnerPass;
    for (Count = Pass; Count >= 0; Count--) {
        HighValueIndex = 0;
        for (InnerPass = 1; InnerPass <= Count; InnerPass++) {
            if (NArray[HighValueIndex] < NArray[InnerPass]) {
                HighValueIndex = InnerPass;
            }
        }
        //Swap
        Temp = NArray[HighValueIndex];
        NArray[HighValueIndex] = NArray[Count];
        NArray[Count] = Temp;
    }
}

bool sortCheck(int *NArray, int *NArrayCopy) {
    bool IsCorrectlySorted = true;
    int count;
    for (count = 0; count < N; count++) {
        if (NArray[count] != NArrayCopy[count]) {
            IsCorrectlySorted = false;
        }
    }
    return IsCorrectlySorted;
}

bool isPowerOf2() {
    int P2increment = 1;
    bool PowerOf2 = false;
    while (P2increment <= N) {
        if (P2increment == N) {
            PowerOf2 = true;
            //Exit loop
            P2increment = P2increment + (N + 1);
        }
        P2increment = P2increment * 2;
    }
    return PowerOf2;
}

void copyArray(int *NArray, int *CopyArray) {
    int count;
    for (count = 0; count < N; count++) {
        CopyArray[count] = NArray[count];
    }
}
[/code]

Respectfully, this is partially incorrect, but for a reason that is almost splitting hairs. The #undef and #define within main is my way of dynamically allocating shared memory: SHARED is supposed to take on the size held in SharedPerBlock. My variable names are a mess simply because I’ve been struggling with this problem for a while without any help and have been patching together experimental code.

Now, I think you are correct when you say I have an out-of-bounds access in shared memory, but it is not for the reason you gave. Below is new code for my kernel that confirms this theory:

[code]
register int AllThread = blockDim.x * blockIdx.x + threadIdx.x;
__shared__ int sh[SHARED];

sh[threadIdx.x] = NArray[AllThread];
sh[threadIdx.x] = 1;
NArray[AllThread] = sh[threadIdx.x];
[/code]

Here is output:

Memory Allocation: SUCCESSMemory Allocation: SUCCESSHere is bubblesorted array:

34 38 40 43 53 60 61 65 65 65 65 67 104 112 112 114 145 153 157 158 172 174 179 180 180 181 189 190 230 236 240 243

Here is random array:

189 190 180 179 174 181 180 172 104 114 112 112 240 243 236 230 61 65 60 53 34 43 40 38 65 67 65 65 157 158 153 145 Memory Copy: SUCCESSMemory Copy: SUCCESSHere is sequence size: 4

Here is PowerOf2: 1

Here is Blocks: 8

Here is Threads: 2

Here is SHARED: 4

Total execution time for kernel: 0.099328

Invoke Kernel: SUCCESS

Here is an attempt to bitonically sort array:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 61 65 60 53 34 43 40 38 65 67 65 65 157 158 153 145

Here is the sorted array

34 38 40 43 53 60 61 65 65 65 65 67 104 112 112 114 145 153 157 158 172 174 179 180 180 181 189 190 230 236 240 243

The array is not sorted correctly.

This is exactly what I want to see happen for right now. All blocks are working.

My questions from this experience are now these:

  1. I assumed I had 16k of shared memory across the active blocks - the maximum number of blocks that can run at the same time. I had also assumed 16k of shared memory is given per multiprocessor. Are these assumptions true?

  2. Would I be “safe” using AllThread to index global memory (so all of the threads are used) and threadIdx.x to index shared memory?

  3. Given 3 blocks with 8 threads each as an example, wouldn’t sh[threadIdx.x] have all blocks reading from locations 0 through 7 in shared memory? In other words, I thought of shared memory as one big linear array that all blocks “share”. It seems to me that this is not what is happening in the code at all, and each block has its own “__shared__ int sh[SHARED];”.

Please confirm whether these statements are true or false and I’ll consider this matter closed. Hopefully, if anybody can answer these questions, it will spare novices like me some pain trying to understand these concepts. Thank you very much in advance for helping me!

You cannot use preprocessor macros for dynamic memory allocation that way. Preprocessor macros are evaluated during a single pass at compile time, so inside your kernel, SHARED will always be replaced with 64. And below the

[code]
#undef SHARED
#define SHARED SharedPerBlock
[/code]

SHARED will always be replaced with SharedPerBlock, no matter what the conditional inside which the define was placed evaluates to at runtime. Preprocessor macros are simple textual replacements at compile time, unrelated to control flow during program execution. You might want to read up on that in a standard C textbook.
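To illustrate, here is a minimal host-only sketch (made up for this post, not taken from your code) — it prints the same value no matter which branch runs:

[code]
#include <stdio.h>

#define SIZE 64

int main(void) {
    int runtime_flag = 0;  /* evaluated at runtime... */
    if (runtime_flag) {
        /* ...but these two directives are processed at compile time,
           unconditionally, before the compiler ever sees the if */
        #undef SIZE
        #define SIZE 128
    }
    printf("SIZE = %d\n", SIZE);  /* prints 128 regardless of runtime_flag */
    return 0;
}
[/code]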

Dynamic allocation of shared memory is supported through the extern __shared__ keyword and a third argument to the kernel launch configuration. See appendix B.2.3 of the Programming Guide for details.
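Roughly like this (an untested sketch with a made-up kernel name, reusing your host variables):

[code]
__global__ void fillShared(int *data) {
    extern __shared__ int sh[];  // size is supplied at launch time
    int g = blockDim.x * blockIdx.x + threadIdx.x;
    sh[threadIdx.x] = data[g];
    __syncthreads();
    data[g] = sh[threadIdx.x];
}

// third launch-configuration argument = shared memory bytes per block
fillShared<<<Blocks, Threads, SharedPerBlock * sizeof(int)>>>(DevNArray);
[/code]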

I’m sorry, I don’t understand what 1. and 2. are supposed to mean. I get the feeling that you have some misconception about what shared memory is.

The answer to 3. is that the latter is true: each block gets its own set of variables in shared memory, which are shared only between the threads of that block.
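You can see this with a tiny kernel (a made-up example, not from your code):

[code]
__global__ void perBlockDemo(int *out) {
    __shared__ int sh[8];          // every block gets its own copy of sh
    sh[threadIdx.x] = blockIdx.x;  // blocks never overwrite each other
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = sh[threadIdx.x];
}
[/code]

Launched as perBlockDemo<<<3, 8>>>(devOut), the output is 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 - each block only ever sees its own sh.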