Hi gurus,
I wrote some test code to understand coalesced global memory access, and my conclusion so far is that the closer together the data read by neighboring threads is, the faster the code runs. But I have run into one problem.
The Programming Guide, chapter 5.1.2.1 (the performance considerations for global memory), discusses coalescing and row alignment.
I wrote a kernel:
__global__ void test_pitch_kernel_col_path(test_pitch_data_s* d_data, int width)
{
    // In this case we have a matrix and we treat each column as
    // the steps of one path, so paths = col_num and steps = row_num.
    // One block has 256 threads, which process 256 paths.
    int T = blockDim.x * gridDim.x;
    for (int path = blockIdx.x * blockDim.x + threadIdx.x; path < TEST_PITCH_PATHS; path += T)
    {
        for (int step = 0; step < TEST_PITCH_STEPS; step++)
        {
            __shared__ test_pitch_data_s l_data[256];
            l_data[threadIdx.x] = d_data[step * width + path];
            test_pitch_calc(l_data[threadIdx.x]);
            d_data[step * width + path] = l_data[threadIdx.x];
        }
    }
}
The struct test_pitch_data_s is aligned with the align keyword, and I read each path vertically (down a column). The code reads many paths and processes them one step at a time; in my tests, looping vertically is more efficient, because at each step neighboring threads read adjacent data.
But notice the width argument: it is used for alignment. The Guide says the row width should be rounded up to a multiple of 16 elements (16 * sizeof(struct) bytes), so I have a function that allocates the memory and aligns the width depending on a switch argument:
void test_pitch_fill_col_path(test_pitch_data_s*& h_data, test_pitch_data_s*& d_data, bool alignWidth, int& width)
{
    assert(!h_data && !d_data);
    width = TEST_PITCH_PATHS;
    if (alignWidth) width = iAlignUp(width, 16);
    size_t size = width * TEST_PITCH_STEPS * sizeof(test_pitch_data_s);
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, size));
    assert(d_data);
    h_data = (test_pitch_data_s*)malloc(size);
    assert(h_data);
    for (int i = 0; i < TEST_PITCH_PATHS; i++)
        for (int j = 0; j < TEST_PITCH_STEPS; j++)
            test_pitch_data_fill(h_data[j * width + i]);
    CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
}
Then I run a number of test cases:
typedef struct
{
    const char* pattern;
    bool alignWidth;
    int blocks;
    char report[1024];
} test_pitch_job_s;
test_pitch_job_s test_pitch_jobs[] =
{
    { "RowPath", false, 8,   "N/A" },
    { "RowPath", true,  8,   "N/A" },
    { "ColPath", false, 8,   "N/A" },
    { "ColPath", true,  8,   "N/A" },
    { "RowPath", false, 16,  "N/A" },
    { "RowPath", true,  16,  "N/A" },
    { "ColPath", false, 16,  "N/A" },
    { "ColPath", true,  16,  "N/A" },
    { "RowPath", false, 64,  "N/A" },
    { "RowPath", true,  64,  "N/A" },
    { "ColPath", false, 64,  "N/A" },
    { "ColPath", true,  64,  "N/A" },
    { "RowPath", false, 256, "N/A" },
    { "RowPath", true,  256, "N/A" },
    { "ColPath", false, 256, "N/A" },
    { "ColPath", true,  256, "N/A" },
    { 0 }
};
extern "C" void test_pitch(bool debug)
{
    CUT_DEVICE_INIT();
    for (test_pitch_job_s* job = test_pitch_jobs; job->pattern != 0; job++)
    {
        test_pitch_data_s* h_data = NULL;
        test_pitch_data_s* d_data = NULL;
        int width = 0;
        if (!strcmp(job->pattern, "RowPath"))
        {
            test_pitch_fill_row_path(h_data, d_data, job->alignWidth, width);
            dim3 grid(job->blocks);
            dim3 block(256);
            unsigned int timer = 0;
            float time = 0.0f;
            CUDA_SAFE_CALL(cudaThreadSynchronize());
            util_start_timer(timer);
            test_pitch_kernel_row_path<<< grid, block >>>(d_data, width);
            CUDA_SAFE_CALL(cudaThreadSynchronize());
            time = util_stop_timer(timer);
            bool result = test_pitch_verify_row_path(h_data, d_data, width);
            sprintf(job->report, "%s\t%s\t%s\t%d\t%.4f",
                result ? "[OK]" : "[FAIL]",
                job->pattern,
                job->alignWidth ? "yes" : "no",
                job->blocks,
                time
            );
        }
        else if (!strcmp(job->pattern, "ColPath"))
        {
            test_pitch_fill_col_path(h_data, d_data, job->alignWidth, width);
            dim3 grid(job->blocks);
            dim3 block(256);
            unsigned int timer = 0;
            float time = 0.0f;
            CUDA_SAFE_CALL(cudaThreadSynchronize());
            util_start_timer(timer);
            test_pitch_kernel_col_path<<< grid, block >>>(d_data, width);
            CUDA_SAFE_CALL(cudaThreadSynchronize());
            time = util_stop_timer(timer);
            bool result = test_pitch_verify_col_path(h_data, d_data, width);
            sprintf(job->report, "%s\t%s\t%s\t%d\t%.4f",
                result ? "[OK]" : "[FAIL]",
                job->pattern,
                job->alignWidth ? "yes" : "no",
                job->blocks,
                time
            );
        }
        else
        {
            assert(false);
        }
    }
    printf("================= REPORT ================\n");
    printf("CHECK\tPATTERN\tALIGN\tBLOCKS\tTIME\n");
    for (test_pitch_job_s* job = test_pitch_jobs; job->pattern != 0; job++)
    {
        printf("%s\n", job->report);
    }
    printf("=========================================\n");
    printf("\nPRESS ENTER TO CONTINUE...\n");
    fflush(stdout);
    fflush(stderr);
    getchar();
}
But strangely, the aligned rows were not faster; in fact, alignment slowed things down a little:
D:\dev\mcgpu\trunk\debug>memtest.exe
================= REPORT ================
CHECK PATTERN ALIGN BLOCKS TIME
[OK] RowPath no 8 265.0158
[OK] RowPath yes 8 426.6126
[OK] ColPath no 8 129.0094
[OK] ColPath yes 8 129.0114
[OK] RowPath no 16 264.6627
[OK] RowPath yes 16 426.3832
[OK] ColPath no 16 129.0097
[OK] ColPath yes 16 129.0066
[OK] RowPath no 64 264.4906
[OK] RowPath yes 64 426.0882
[OK] ColPath no 64 129.0206
[OK] ColPath yes 64 129.0217
[OK] RowPath no 256 264.4951
[OK] RowPath yes 256 425.9530
[OK] ColPath no 256 129.0513
[OK] ColPath yes 256 129.0611
=========================================
PRESS ENTER TO CONTINUE...
So I am puzzled; any ideas? Why doesn't the aligned row give any performance benefit? It actually slows things down!
BTW, another question about cudaMallocPitch(): it returns a pitch value, which is said to be the row width in bytes. I call it like this:
cudaMallocPitch((void**)&d_data, &pitch, 1613 * sizeof(s), 1611);
But strangely, the returned pitch is 1614 * sizeof(s), not 16 * n * sizeof(s), and I don't understand what the pitch is for. Please help.
Thanks for any words,
Regards,
Xiang.