question on memory coalescing and alignment

Hi, gurus,

I wrote a small test program to understand coalescing of global memory accesses, and I reached the conclusion that the closer together the data is, the more speed we get. But I have run into one problem.

In the Programming Guide, chapter 5.1.2.1, on the performance considerations for global memory, there is a passage about aligning the row width of a 2D array.

I wrote a kernel code:

__global__ void test_pitch_kernel_col_path(test_pitch_data_s* d_data, int width)
{
	// In this case we have a matrix, and we treat each column as
	// the steps of one path, so paths = col_num and steps = row_num.
	// Each block has 256 threads, which process 256 paths.
	int T = blockDim.x * gridDim.x;
	for (int path = blockIdx.x * blockDim.x + threadIdx.x; path < TEST_PITCH_PATHS; path += T)
	{
		for (int step = 0; step < TEST_PITCH_STEPS; step++)
		{
			__shared__ test_pitch_data_s l_data[256];
			l_data[threadIdx.x] = d_data[step * width + path];
			test_pitch_calc(l_data[threadIdx.x]);
			d_data[step * width + path] = l_data[threadIdx.x];
		}
	}
}

The struct test_pitch_data_s is aligned with the align keyword, and I read each path vertically. The code reads many paths and processes them one step at a time, so from my tests looping vertically is more efficient, because at each step the threads read data that sits close together.

But notice that I have a width argument, which is used for alignment: the Guide says the row width should be aligned up to a multiple of 16 * sizeof(struct), so I have a function that allocates the memory and aligns the width depending on a switch argument:

void test_pitch_fill_col_path(test_pitch_data_s*& h_data, test_pitch_data_s*& d_data, bool alignWidth, int& width)
{
	assert(!h_data && !d_data);
	width = TEST_PITCH_PATHS;
	if (alignWidth) width = iAlignUp(width, 16);
	size_t size = width * TEST_PITCH_STEPS * sizeof(test_pitch_data_s);
	CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, size));
	assert(d_data);
	h_data = (test_pitch_data_s*)malloc(size);
	assert(h_data);
	for (int i = 0; i < TEST_PITCH_PATHS; i++)
		for (int j = 0; j < TEST_PITCH_STEPS; j++)
			test_pitch_data_fill(h_data[j * width + i]);
	CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
}

Then I run a number of test cases:

typedef struct
{
	const char* pattern;
	bool alignWidth;
	int blocks;
	char report[1024];
} test_pitch_job_s;

test_pitch_job_s test_pitch_jobs[] =
{
	{ "RowPath", false,   8, "N/A" },
	{ "RowPath", true,    8, "N/A" },
	{ "ColPath", false,   8, "N/A" },
	{ "ColPath", true,    8, "N/A" },
	{ "RowPath", false,  16, "N/A" },
	{ "RowPath", true,   16, "N/A" },
	{ "ColPath", false,  16, "N/A" },
	{ "ColPath", true,   16, "N/A" },
	{ "RowPath", false,  64, "N/A" },
	{ "RowPath", true,   64, "N/A" },
	{ "ColPath", false,  64, "N/A" },
	{ "ColPath", true,   64, "N/A" },
	{ "RowPath", false, 256, "N/A" },
	{ "RowPath", true,  256, "N/A" },
	{ "ColPath", false, 256, "N/A" },
	{ "ColPath", true,  256, "N/A" },
	{ 0 }
};

extern "C" void test_pitch(bool debug)
{
	CUT_DEVICE_INIT();
	for (test_pitch_job_s* job = test_pitch_jobs; job->pattern != 0; job++)
	{
		test_pitch_data_s* h_data = NULL;
		test_pitch_data_s* d_data = NULL;
		int width = 0;
		if (!strcmp(job->pattern, "RowPath"))
		{
			test_pitch_fill_row_path(h_data, d_data, job->alignWidth, width);
			dim3 grid(job->blocks);
			dim3 block(256);
			unsigned int timer = 0;
			float time = 0.0f;
			CUDA_SAFE_CALL(cudaThreadSynchronize());
			util_start_timer(timer);
			test_pitch_kernel_row_path<<< grid, block >>>(d_data, width);
			CUDA_SAFE_CALL(cudaThreadSynchronize());
			time = util_stop_timer(timer);
			bool result = test_pitch_verify_row_path(h_data, d_data, width);
			sprintf(job->report, "%s\t%s\t%s\t%d\t%.4f",
				result ? "[OK]" : "[FAIL]",
				job->pattern,
				job->alignWidth ? "yes" : "no",
				job->blocks,
				time);
		}
		else if (!strcmp(job->pattern, "ColPath"))
		{
			test_pitch_fill_col_path(h_data, d_data, job->alignWidth, width);
			dim3 grid(job->blocks);
			dim3 block(256);
			unsigned int timer = 0;
			float time = 0.0f;
			CUDA_SAFE_CALL(cudaThreadSynchronize());
			util_start_timer(timer);
			test_pitch_kernel_col_path<<< grid, block >>>(d_data, width);
			CUDA_SAFE_CALL(cudaThreadSynchronize());
			time = util_stop_timer(timer);
			bool result = test_pitch_verify_col_path(h_data, d_data, width);
			sprintf(job->report, "%s\t%s\t%s\t%d\t%.4f",
				result ? "[OK]" : "[FAIL]",
				job->pattern,
				job->alignWidth ? "yes" : "no",
				job->blocks,
				time);
		}
		else
		{
			assert(false);
		}
		// Release the buffers before the next test case.
		CUDA_SAFE_CALL(cudaFree(d_data));
		free(h_data);
	}
	printf("================= REPORT ================\n");
	printf("CHECK\tPATTERN\tALIGN\tBLOCKS\tTIME\n");
	for (test_pitch_job_s* job = test_pitch_jobs; job->pattern != 0; job++)
	{
		printf("%s\n", job->report);
	}
	printf("=========================================\n");
	printf("\nPRESS ENTER TO CONTINUE...\n");
	fflush(stdout);
	fflush(stderr);
	getchar();
}

But strangely, the aligned rows were not faster; in fact, alignment slowed things down a little:

D:\dev\mcgpu\trunk\debug>memtest.exe
================= REPORT ================
CHECK   PATTERN ALIGN   BLOCKS  TIME
[OK]    RowPath no      8       265.0158
[OK]    RowPath yes     8       426.6126
[OK]    ColPath no      8       129.0094
[OK]    ColPath yes     8       129.0114
[OK]    RowPath no      16      264.6627
[OK]    RowPath yes     16      426.3832
[OK]    ColPath no      16      129.0097
[OK]    ColPath yes     16      129.0066
[OK]    RowPath no      64      264.4906
[OK]    RowPath yes     64      426.0882
[OK]    ColPath no      64      129.0206
[OK]    ColPath yes     64      129.0217
[OK]    RowPath no      256     264.4951
[OK]    RowPath yes     256     425.9530
[OK]    ColPath no      256     129.0513
[OK]    ColPath yes     256     129.0611
=========================================
PRESS ENTER TO CONTINUE...

So I am puzzled. Any ideas? Why didn't the aligned rows improve performance? They actually slowed it down!

BTW, another question, about cudaMallocPitch(): it returns a pitch value, which is said to be the width of a row in bytes. I call it like this:

cudaMallocPitch((void**)&d_data, &pitch, 1613 * sizeof(s), 1611);

but, very strangely, the returned pitch is 1614 * sizeof(s); it is not 16 * n * sizeof(s), and I don't know what this pitch is for. Please help.

Thanks for any words,

Regards,

Xiang.