Global memory reads optimization with texture cache

Hi there,

I have some troubles using texture cache on GPU.

My situation is: I have one square matrix of 16 unsigned char (4x4) and many threads. Each thread need to operate on one single item of this matrix and with other kind of data. In other words I have a number (possibly big) of blocks and each block has 4x4x32 threads (these are the three dimensions of one single block: x=4, y=4 and z=32). Assuming that the matrix on device memory is called “d_data” each thread need to operate on the unsigned char in position

[codebox]

#include <stdio.h>

#include <cutil_inline.h>

#define ROUNDS 100 // iterations to average

texture <unsigned char, 1, cudaReadModeElementType> tex;

dim3 block = dim3(4,4,32);

dim3 grid = dim3(2000);

global void kernel(); // kernel that uses texture cache

global void kernelStd(unsigned char *d_data); // kernel with “standard” (i.e. no texture cache) data copy

int main(int argc, char **argv) {

unsigned char *data = (unsigned char *) malloc(16 * sizeof(unsigned char));

for (unsigned char i = 0; i < 16; i++) data[i] = 0xff - i; // init of data

unsigned char *d_data;

cudaMalloc((void **) &d_data, 16);

cudaMemcpy(d_data, data, 16, cudaMemcpyHostToDevice);

double t_avg = 0;

unsigned int timer;



// measuring computation time with texture cache "disabled"

for (unsigned int round = 0; round < ROUNDS; round++)	{

	cutCreateTimer(&timer);

	cutStartTimer(timer);

	kernelStd<<<grid,block>>>(d_data);

	cudaThreadSynchronize();

	cutStopTimer(timer);

	t_avg += cutGetTimerValue(timer) / (double) ROUNDS;

	cutDeleteTimer(timer);

}

printf("%d rounds without texture cache. Average time: %f ms\n", ROUNDS, t_avg);

t_avg = 0;

// measuring computation time using texture cache

for (unsigned int round = 0; round < ROUNDS; round++)	{

	cutCreateTimer(&timer);

	cutStartTimer(timer);

	cudaBindTexture(0, tex, d_data, 16);

	kernel<<<grid,block>>>();

	cudaThreadSynchronize();

	cudaUnbindTexture(tex);

	cutStopTimer(timer);

	t_avg += cutGetTimerValue(timer) / (double) ROUNDS;

	cutDeleteTimer(timer);

}

printf("%d rounds using texture cache.   Average time: %f ms\n", ROUNDS, t_avg);	

cudaFree(d_data);

free(data);

return 0;

}

global void kernel() {

unsigned char myData = tex1Dfetch(tex, threadIdx.x + blockDim.x * threadIdx.y);

myData += 1;

}

global void kernelStd(unsigned char *d_data) {

unsigned char myData = d_data[threadIdx.x + blockDim.x * threadIdx.y];

myData += 1;

}

[/codebox]

thanks for your answers

M.M.

  1. I think the compiler will just optimize your kernels out since they do not do any writes to global memory…this could be the reason for the very quick execution time.
  2. The CUTIL timers are not very accurate (I think I’ve read around they have 10 ms resolution) – You should look into using CUDA hardware timers via events
  3. You should check for errors after your kernel launch
  4. If you are accessing 2D data, you should use a 2D texture…this will exploit 2D spatial locality rather than 1D and hopefully help speed things up once your kernel actually does something

You are right! I’ve checked the .ptx generated by nvcc: the kernel [font=“Courier New”]kernelStd[/font] has only two instructions [font=“Courier New”].loc[/font] and nothing else (except the [font=“Courier New”]exit;[/font] obviously :) ) so I think that no global to local copy was made after its execution. In the other kernel instead there is one texture load instruction: [font=“Courier New”]tex.1d.v4.u32.s32 {%r7,%r8,%r9,%r10},[tex,{%r3,%r4,%r5,%r6}];[/font], so the second kernel is slower than the first simply because the first exits immediately! :haha:

I’ve never used events or CUDA hardware timers, I’ll do a search on the web because I’ve no idea of their functionality.

You are right, but in this code I think that nothing can go wrong, it is written just for performance checking and there are only simple instructions, constant array sizes etc. So I’ve not inserted any error checking.

Ok, I’ll try this also.

Thanks very much for the answer, I’ll post the result of my test soon!

bye!

MM