Isn't that coalesced?! Writing to global memory in a coalesced way


i’m wondering if my kernel is accessing global memory in a coalesced way. as i understand coalescing, the threads of an active warp have to read elements from a global array that are stored right next to each other:

thread 0: element N
thread 1: element N+1
thread 2: element N+2

i’m using the following grid:
blockDim.x = 256, blockDim.y = 0;
gridDim.x = gridDim.y = 256;

that makes it possible to access 256×256×256 elements in a “parallel” way:

my kernel is similar to this:

__global__ void someKernel(float* arrayInGlobalMemory)
{
	int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

	//doing some calculations (takes around 400 ms for 256×256×256 elements)

	//then i’m writing each result to the array in global memory:
	arrayInGlobalMemory[idx] = someResult; // this takes about 4200 ms (for 256×256×256 elements)!!!
}


isn’t that coalesced??!!

there must be some serious bottleneck! but i don’t really know how to solve it!

if someone can give me a hint, i will be very thankful!

best regards rob

blockDim.y = 0 is incorrect. You want the y dimension to be 1 to have a “row of 256 threads”.

The time it takes shows that it definitely is NOT coalesced. You’re getting 16 MB/s… that’s three orders of magnitude less than you should get; this is much worse than simply having uncoalesced writes. I presume it’s either y=0 or the way you calculate the index that somehow results in a race condition.

  1. I meant blockDim.y = 1. (sry)

  2. how do you calculate the index when you have 1-dimensional blocks (one row of 256 threads in my case) and a 2-dimensional grid (256 rows/columns of blocks in my case)?
    (I got this: int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x from another post in this forum)

It is a long shot, but I can imagine a situation when it is not coalesced:

  • you are using a card of Compute Capability 1.1 or lower
  • the pointer arrayInGlobalMemory is not a direct result from cudaMalloc, but a computed value which happens not to be aligned to 64 bytes.

If that scenario is true, aligning the pointer to 64 should solve the problem.

thx for the replies!

i’m doing the standard allocation process:

float* arrayInGlobalMemory; 
CUDA_SAFE_CALL(cudaMalloc((void **) &arrayInGlobalMemory, 256*256*256*sizeof(float)));

and i’m compiling with ‘-arch sm_13’, so it should be aligned to 64 bytes. (with a GTX 260 this should work)

can it be a serious problem if i use the 32-bit cuda libraries and run them on a 64-bit cuda driver (vista64)?

Okay, I just ran your code through the profiler and it says the writes are coalesced. It took around 1.6 ms to complete, giving me 40 GB/s of bandwidth (bandwidthTest from the SDK maxes out at 50 GB/s for my card). Your indexing scheme is also correct.

Here’s the code

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

#if __DEVICE_EMULATION__

bool InitCUDA(void){return true;}

#else

bool InitCUDA(void)
{
	int count = 0;
	int i = 0;

	cudaGetDeviceCount(&count);
	if(count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}

	for(i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if(prop.major >= 1) {
				break;
			}
		}
	}

	if(i == count) {
		fprintf(stderr, "There is no device supporting CUDA.\n");
		return false;
	}

	cudaSetDevice(i);
	printf("CUDA initialized.\n");
	return true;
}

#endif

//using ints instead of floats for more robust correctness checking later,
//tested with floats as well and it made no difference in bandwidth
__global__ void coalescing2d(int arrayGlobal[])
{
	int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	arrayGlobal[idx] = idx;
}

int main(int argc, char* argv[])
{
	if(!InitCUDA()) {
		return 0;
	}

	dim3 blockDim;
	dim3 gridDim;
	blockDim.x = 256, blockDim.y = 1;
	gridDim.x = gridDim.y = 256;

	int arrayLength = gridDim.x*gridDim.y*blockDim.x*blockDim.y;
	int* arrayInGlobalMemory;
	//I can feel tmurray's wrath
	CUDA_SAFE_CALL(cudaMalloc((void **) &arrayInGlobalMemory, arrayLength*sizeof(int)));

	coalescing2d<<<gridDim, blockDim>>>(arrayInGlobalMemory);
	//no syncing, measure time with the profiler instead

	//test for correctness
	int *h_array = new int[arrayLength];
	CUDA_SAFE_CALL(cudaMemcpy(h_array, arrayInGlobalMemory, arrayLength*sizeof(int), cudaMemcpyDeviceToHost));

	//numbers written should be incrementing if index calculations went right
	//ie. each element should be previous element + 1
	int prev = -1;
	int errors = 0;
	for(int i = 0; i < arrayLength; ++i)
	{
		if(h_array[i] != prev+1)
			++errors;
		prev = h_array[i];
	}
	printf("errors: %d\n", errors);

	delete[] h_array;
	CUDA_SAFE_CALL(cudaFree(arrayInGlobalMemory));

	return 0;
}

How did you measure that it’s the store that takes 4000ms?

first of all: thank you so much for putting so much effort into my problem! i really appreciate that!

i measured these 4000 ms by simply removing any kind of computation from my kernel and just storing a fixed value to every array element:

e.g. arrayInGlobalMemory[idx] = 1.0f;

the problem must be somewhere else…

for later deploy-testing i compiled my code with the 32-bit libraries (cutil32.dll etc.) and took the time measurements on a pc with the 64-bit nvidia driver installed. so maybe that doesn’t fit together.

32-bit libraries can work with 64-bit drivers. I have used them on XP 64-bit machine.

There is a known problem with “cudaMalloc” in such setups: cudaMalloc fails after some repeated invocations of an application (no matter how little you allocate inside that application). But that should NOT cause abysmal turnaround times… as you see.

btw, not sure if this problem exists on Vista 64.

Purely out of curiosity. Did you check that you are not running in device emulation mode ?

No, but the time gets close to it.

I’m currently using CUDA v2.1. Are there any known bugs in that release that could cause this behavior?

(Copying 256 KB from device to host also takes around 1 sec… reminds me of my C64)