Need some help to understand how to coalesce memory access

Hi,

I’m currently trying to remove uncoalesced accesses from my kernels, but after spending long hours looking for understandable documentation or other resources on the topic, I have not managed to find anything that makes sense to me. For example, this simple kernel:

__global__ void cudaReplacement(unsigned char* population, int* populationValues, int populationSize, int populationIndex, int qapSize, unsigned char* childrenPopulation, int* childrenPopulationValues) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (childrenPopulationValues[i] > 0 && populationValues[populationIndex * populationSize + i] > childrenPopulationValues[i]) {
		populationValues[populationIndex * populationSize + i] = childrenPopulationValues[i];
		for (int k = 0; k < PB_SIZE; k++) {
			population[populationIndex * populationSize + i * PB_SIZE + k] = childrenPopulation[i * PB_SIZE + k];
		}
	}
}

does not seem to do well in this respect, and I really don’t understand why. This second kernel does not do well in this respect either:

__global__ void cudaCrossover(unsigned char* population, int populationIndex, int populationSize, int* index2, int* indexes, int indexesCount, unsigned char* childPopulation, int qapSize) {
	int ind = blockDim.x * blockIdx.x + threadIdx.x;

	for (int i = 0; i < PB_SIZE; i++) {
		childPopulation[ind * PB_SIZE + i] = population[populationIndex * populationSize + ind * PB_SIZE + i];
	}

	unsigned char* parent2 = &population[populationIndex * populationSize + (index2[ind] % populationSize) * qapSize];

	for (int i = 0; i < UPMX_MOD_COUNT; i++) {
		int rnd = indexes[ind + POP_SIZE * i] % qapSize;
		int pos1 = rnd;
		int pos2 = 0;

		while (childPopulation[ind * qapSize + pos2] != parent2[pos1]) {
			pos2++;
		}

		unsigned char tmp = childPopulation[ind * qapSize + pos2];
		childPopulation[ind * qapSize + pos2] = childPopulation[ind * qapSize + pos1];
		childPopulation[ind * qapSize + pos1] = tmp;
	}
}

I’m looking for any kind of advice and explanation on why this isn’t working properly. I’ve already tried a lot of things, but I think I don’t get it.

I’m working with GeForce 9600M GT hardware.

Thank you very much for your help in advance.

In both kernels, each thread loops over its own contiguous run of PB_SIZE bytes in global memory (for example, the index i * PB_SIZE + k in the first kernel, with k the loop variable). At any given iteration, adjacent threads therefore access addresses that are PB_SIZE bytes apart rather than adjacent ones, and that is what makes the accesses non-coalesced.
In simple terms, coalesced memory access means ‘adjacent CUDA threads access adjacent locations in global memory’, i.e. thread 0 accesses index 0 of the array, thread 1 accesses index 1, and so on.
If non-coalesced accesses are unavoidable in your logic, you are better off using texture memory: texture reads are cached, so they are far less sensitive to the access pattern. (Note that texture memory is read-only within a kernel, so you can use it for input buffers only.)
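To make the "adjacent threads, adjacent locations" rule concrete, here is a minimal sketch of the replacement step with the data stored transposed, so that element k of individual i sits at k * numIndividuals + i. This layout, the kernel name copyImprovedGeneMajor, the parameter names, and the host helper are all assumptions for illustration, not the original poster's code; the populationIndex offset is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical "gene-major" layout: element k of individual i is stored at
// k * numIndividuals + i (the original post uses i * PB_SIZE + k).
__global__ void copyImprovedGeneMajor(unsigned char* population,
                                      int* populationValues,
                                      const unsigned char* children,
                                      const int* childrenValues,
                                      int numIndividuals,
                                      int genesPerIndividual) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= numIndividuals) return;  // guard a partially filled last block

    int childValue = childrenValues[i];  // adjacent threads read adjacent ints
    if (childValue > 0 && populationValues[i] > childValue) {
        populationValues[i] = childValue;
        for (int k = 0; k < genesPerIndividual; k++) {
            // For a fixed k, threads i and i+1 touch neighbouring bytes.
            population[k * numIndividuals + i] = children[k * numIndividuals + i];
        }
    }
}

// Illustrative host-side check: every child has a better (lower, positive)
// value, so every individual should end up replaced by its child.
bool runGeneMajorCopyCheck() {
    const int n = 256, genes = 8;
    std::vector<unsigned char> hPop(n * genes, 0), hChild(n * genes);
    std::vector<int> hPopVal(n, 100), hChildVal(n, 50);
    for (int idx = 0; idx < n * genes; idx++) hChild[idx] = (unsigned char)(idx % 251);

    unsigned char *dPop = nullptr, *dChild = nullptr;
    int *dPopVal = nullptr, *dChildVal = nullptr;
    cudaMalloc(&dPop, n * genes);
    cudaMalloc(&dChild, n * genes);
    cudaMalloc(&dPopVal, n * sizeof(int));
    cudaMalloc(&dChildVal, n * sizeof(int));
    cudaMemcpy(dPop, hPop.data(), n * genes, cudaMemcpyHostToDevice);
    cudaMemcpy(dChild, hChild.data(), n * genes, cudaMemcpyHostToDevice);
    cudaMemcpy(dPopVal, hPopVal.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dChildVal, hChildVal.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    copyImprovedGeneMajor<<<(n + 127) / 128, 128>>>(dPop, dPopVal, dChild, dChildVal, n, genes);

    // These copies run after the kernel on the default stream.
    cudaError_t e1 = cudaMemcpy(hPop.data(), dPop, n * genes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hPopVal.data(), dPopVal, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dPop); cudaFree(dChild); cudaFree(dPopVal); cudaFree(dChildVal);

    return e1 == cudaSuccess && e2 == cudaSuccess && hPop == hChild && hPopVal == hChildVal;
}
```

The loop is still there; what changes is that, for each value of k, a group of neighbouring threads now reads and writes one contiguous stretch of memory. Whether transposing the data is practical depends on how the rest of the program uses it, so treat this as one option rather than the fix.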

Hi,
I have a question about memory coalescing.
I want to process a long row of numbers split into blocks. Each number needs to know the two previous numbers and the two following numbers, so I must make blocks of size n+4 (of which only n numbers are actually computed).

For example, I have a row of 12 numbers:
a b c d e f g h i j k l
and I split it into three blocks of 8 numbers:
block1: 0 0 a b c d e f
block2: c d e f g h i j
block3: g h i j k l 0 0

Is it possible to compute these blocks in shared memory without coalesced memory access?
Or, if that is not possible, what is the fastest way to do it?

Thanks for any answers,
and sorry for my English.
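
One common way to handle the overlapping blocks described above is to have each thread block load its n values plus a two-element halo on each side into shared memory, and then compute from there. The sketch below assumes float data, n = 128 values per block, and a placeholder computation (the sum of the five neighbouring values), since the actual operation is not given; the names stencilSum5, TILE, RADIUS, and runStencilCheck are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 128   // values computed per block (the "n" in n+4); also the block size
#define RADIUS 2   // two neighbours needed on each side

// Placeholder computation: out[g] = sum of in[g-2 .. g+2], with zeros outside the row.
__global__ void stencilSum5(const float* in, float* out, int length) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int g = blockIdx.x * TILE + threadIdx.x;  // global index of this thread's value
    int s = threadIdx.x + RADIUS;             // position of that value inside the tile

    // Main load: adjacent threads read adjacent elements.
    tile[s] = (g < length) ? in[g] : 0.0f;

    // The first RADIUS threads also fetch the left and right halo elements.
    if (threadIdx.x < RADIUS) {
        int left = g - RADIUS;
        tile[threadIdx.x] = (left >= 0 && left < length) ? in[left] : 0.0f;
        int right = g + TILE;
        tile[s + TILE] = (right < length) ? in[right] : 0.0f;
    }
    __syncthreads();  // every thread reaches this; the tile is complete afterwards

    if (g < length) {
        float sum = 0.0f;
        for (int d = -RADIUS; d <= RADIUS; d++) sum += tile[s + d];
        out[g] = sum;
    }
}

// Illustrative host-side check with an input of all ones: each output should
// equal the number of neighbours that fall inside the row (3, 4 or 5).
bool runStencilCheck() {
    const int length = 300;  // deliberately not a multiple of TILE
    std::vector<float> hIn(length, 1.0f), hOut(length, 0.0f);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, length * sizeof(float));
    cudaMalloc(&dOut, length * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), length * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (length + TILE - 1) / TILE;
    stencilSum5<<<blocks, TILE>>>(dIn, dOut, length);

    cudaError_t err = cudaMemcpy(hOut.data(), dOut, length * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < length; i++) {
        int lo = (i - RADIUS < 0) ? 0 : i - RADIUS;
        int hi = (i + RADIUS > length - 1) ? length - 1 : i + RADIUS;
        if (hOut[i] != (float)(hi - lo + 1)) return false;
    }
    return true;
}
```

With this arrangement the large load of n values per block follows the adjacent-thread, adjacent-address pattern, and only the four halo elements are fetched separately. How much those extra halo reads cost is likely to depend on the hardware generation, so it is worth measuring on the target device.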
