Isn't that coalesced?! Writing to global memory in a coalesced way

Hey,

I'm wondering whether my kernel is accessing global memory in a coalesced way. As I understand coalescing, the threads of an active warp have to read elements from a global array that are stored right next to each other (a minimal sketch follows the list):

thread 0: element N
thread 1: element N+1
thread 2: element N+2
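
As an illustration, a minimal kernel with exactly this access pattern could look like the following sketch (the names are hypothetical):

__global__ void copyCoalesced(const float* in, float* out)
{
	// thread 0 of a warp reads element N, thread 1 reads N+1, and so on,
	// so each warp touches one contiguous range of the array
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = in[i];
}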

I'm using the following grid:
blockDim.x = 256, blockDim.y = 0;
gridDim.x = gridDim.y = 256;

That makes it possible to access 256*256*256 elements in a "parallel" way.

My kernel is similar to this:

__global__ void someKernel(float* arrayInGlobalMemory)
{
	int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

	// doing some calculations (takes around 400 ms for 256*256*256 elements)

	// then I'm writing each result to the array in global memory:
	arrayInGlobalMemory[idx] = someResult; // this takes about 4200 ms (for 256*256*256 elements)!!!
}

Isn't that coalesced?!

There must be some serious bottleneck, but I don't really know how to solve it!

If someone can give me a hint, I will be very thankful!

Best regards, rob

blockDim.y = 0 is incorrect. You want the y dimension to be 1 to have a “row of 256 threads”.

The time it takes shows that it definitely is NOT coalesced. You're getting roughly 16 MB/s (256*256*256 elements × 4 bytes ≈ 64 MB, written in about 4 s)… That's three orders of magnitude less than you should get; this is much worse than simply having uncoalesced accesses. I presume it's either the y = 0 or the way you calculate the index that somehow results in a race condition.

  1. I meant blockDim.y = 1. (Sorry.)

  2. How do you calculate the index when you have 1-dimensional blocks (one row of 256 threads in my case) and a 2-dimensional grid (256 rows/columns of blocks in my case)?
    (I got this: int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x from another post in this forum; a worked example follows.)
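
For reference, that formula simply linearizes a 2D grid of 1D blocks: blocks are numbered row by row, and each block covers blockDim.x consecutive elements. A worked example with made-up values:

// gridDim.x = 256, blockDim.x = 256
// for block (blockIdx.x = 3, blockIdx.y = 2) and threadIdx.x = 5:
// idx = (2 * 256 + 3) * 256 + 5 = 515 * 256 + 5 = 131845
// threads 0..255 of that block write idx 131840..132095, a contiguous
// range, which is exactly the pattern needed for coalescing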

It's a long shot, but I can imagine a situation in which it would not be coalesced:

  • you are using a card of Compute Capability 1.1 or lower
  • the pointer arrayInGlobalMemory is not a direct result of cudaMalloc, but a computed value which happens not to be aligned to 64 bytes

If that scenario is true, aligning the pointer to 64 bytes should solve the problem.
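
A quick way to test that hypothesis, as a minimal sketch; it assumes the arrayInGlobalMemory pointer from the original post:

#include <stdint.h>  // for uintptr_t
#include <stdio.h>

// warns if a device pointer is not 64-byte aligned
void checkAlignment(const float* arrayInGlobalMemory)
{
	if (((uintptr_t)arrayInGlobalMemory % 64) != 0)
		printf("pointer %p is not 64-byte aligned\n", (const void*)arrayInGlobalMemory);
}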

Thanks for the replies!

I'm doing the standard allocation process:

float* arrayInGlobalMemory; 
CUDA_SAFE_CALL(cudaMalloc((void **) &arrayInGlobalMemory, 256*256*256*sizeof(float)));

and I'm compiling with '-arch sm_13', so it should be aligned to 64 bytes. (With a GTX 260 this should work.)

Can it be a serious problem if I use the 32-bit CUDA libraries and run them on a 64-bit CUDA driver (Vista 64)?

Okay, I just ran your code through the profiler and it says the writes are coalesced. It took around 1.6 ms to complete, giving me 40 GB/s of bandwidth (64 MB in 1.6 ms; bandwidthTest from the SDK maxes out at 50 GB/s for my card). Your indexing scheme is also correct.

Here's the code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

#if __DEVICE_EMULATION__
bool InitCUDA(void){return true;}
#else
bool InitCUDA(void)
{
	int count = 0;
	int i = 0;
	cudaGetDeviceCount(&count);
	if(count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}
	for(i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if(prop.major >= 1) {
				break;
			}
		}
	}
	if(i == count) {
		fprintf(stderr, "There is no device supporting CUDA.\n");
		return false;
	}
	cudaSetDevice(i);
	printf("CUDA initialized.\n");
	return true;
}
#endif

//using ints instead of floats for more robust correctness checking later,
//tested with floats as well and it made no difference in bandwidth
__global__ void coalescing2d(int arrayGlobal[])
{
	int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	arrayGlobal[idx] = idx;
}

int main(int argc, char* argv[])
{
	if(!InitCUDA()) {
		return 0;
	}

	dim3 blockDim;
	dim3 gridDim;
	blockDim.x = 256, blockDim.y = 1;
	gridDim.x = gridDim.y = 256;
	int arrayLength = gridDim.x*gridDim.y*blockDim.x*blockDim.y;

	int* arrayInGlobalMemory;
	//I can feel tmurray's wrath
	CUDA_SAFE_CALL(cudaMalloc((void **) &arrayInGlobalMemory, arrayLength*sizeof(int)));

	coalescing2d<<<gridDim,blockDim>>>(arrayInGlobalMemory);
	//no syncing, measure time with the profiler instead

	//test for correctness
	int *h_array = new int[arrayLength];
	CUDA_SAFE_CALL(cudaMemcpy(h_array,
		arrayInGlobalMemory,
		arrayLength*sizeof(int),
		cudaMemcpyDeviceToHost));

	//numbers written should be incrementing if index calculations went right
	//i.e. each element should be the previous element + 1
	int prev = -1;
	int errors = 0;
	for(int i = 0; i < arrayLength; ++i)
	{
		if(h_array[i] != prev+1)
			++errors;
		prev = h_array[i];
	}
	printf("errors: %d\n", errors);

	cudaFree(arrayInGlobalMemory);
	delete[] h_array;
	cudaThreadExit();

	return 0;
}

How did you measure that it's the store that takes 4000 ms?

First of all: thank you so much for putting so much effort into my problem! I really appreciate it!

I measured those 4000 ms by removing any kind of computation from my kernel and simply storing a fixed value to every array element:

e.g. arrayInGlobalMemory[idx] = 1.0f;

The problem must be somewhere else…
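
One measurement pitfall worth ruling out here: kernel launches are asynchronous, so a CPU timer without an explicit synchronization can attribute the kernel's time to whichever later call happens to block, such as a following memcpy. A minimal sketch of event-based timing, assuming the someKernel, gridDim, blockDim and arrayInGlobalMemory names from the earlier posts:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
someKernel<<<gridDim, blockDim>>>(arrayInGlobalMemory);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait until the kernel has actually finished

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);
printf("kernel time: %f ms\n", elapsedMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);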

For later deployment testing I compiled my code with the 32-bit libraries (cutil32.dll etc.) and took the time measurements on a PC with the 64-bit NVIDIA driver installed. So maybe that doesn't fit together.

32-bit libraries can work with 64-bit drivers. I have used them on an XP 64-bit machine.

There is a known problem with cudaMalloc in such setups: cudaMalloc fails after some number of repeated invocations of an application (no matter how little you allocate inside that application). But that should NOT cause abysmal turnaround times… as you can see.

By the way, I'm not sure whether this problem exists on Vista 64.

Purely out of curiosity: did you check that you are not running in device emulation mode?

No, but the time gets close to it.

I'm currently using CUDA v2.1. Are there any known bugs in that release that could cause this behavior?

(Copying 256 KB from device to host also takes around 1 sec… reminds me of my C64.)