cudaBindTexture2D problem

Hello

I’m having a problem with textures. I bind linear memory to a 2D texture and then run a kernel that copies the values from the texture back to device memory. I wrote a test program and everything went well, but when I use the same code in my simulation program it doesn’t work.

(*fluid).cx = 10; //the width of the array is then 12
(*fluid).cy = 10; //height is 12

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
texTest.addressMode[0] = cudaAddressModeClamp;
texTest.addressMode[1] = cudaAddressModeClamp;
texTest.filterMode     = cudaFilterModePoint;
texTest.normalized     = false;
cudaBindTexture2D(0, texTest, d_test, channelDesc, (*fluid).cx+2, (*fluid).cy+2, ((*fluid).cx+2)*sizeof(float) );

dim3 threadsPerBlock(16, 16);
dim3 numBlocks( (((*fluid).cx+2)+threadsPerBlock.x-1) / threadsPerBlock.x, (((*fluid).cy+2)+threadsPerBlock.y-1) / threadsPerBlock.y);
TESTKERNEL<<<numBlocks, threadsPerBlock>>>(d_testR, d_xco, d_yco, d_fluid);
//d_testR - array to store the output, d_xco and d_yco are arrays to store the thread indices, and d_fluid contains info about the simulation grid (size etc.)

//copy back to host and print
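The kernel itself isn’t shown here; it just reads each texel with tex2D and writes it to the output array. A minimal hypothetical sketch of what such a copy kernel could look like (the real TESTKERNEL may differ, all names besides texTest are made up):

// Hypothetical sketch of the copy kernel; the real TESTKERNEL is not shown in this post.
texture<float, 2, cudaReadModeElementType> texTest;   // assumed declared at file scope

__global__ void copyFromTexture(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texTest, x + 0.5f, y + 0.5f);   // +0.5 to hit texel centres
}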

and part of the output is:

index in linear memory = value from texture = right value   (thread x, thread y)
...
 7 =  7.000000 =  7.000000   (7, 0)
 8 =  8.000000 =  8.000000   (8, 0)
 9 =  9.000000 =  9.000000   (9, 0)
10 = 10.000000 = 10.000000   (10, 0)
11 = 11.000000 = 11.000000   (11, 0)
12 =  8.000000 = 12.000000   (0, 1)
13 =  9.000000 = 13.000000   (1, 1)
14 = 10.000000 = 14.000000   (2, 1)
15 = 11.000000 = 15.000000   (3, 1)
...

So the values in the first row are OK, but all the values in the second row are smaller by 4 than they should be, in the 3rd row they are smaller by 8, in the 4th row by 12, and so on.

These differences vary with the width and height of the array. If the width and height are something like 64, 16, etc., everything is OK.
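One way this kind of per-row offset can happen (an assumption, not confirmed in this thread) is when the pitch passed to cudaBindTexture2D doesn’t satisfy the device’s texture pitch alignment; 12 floats = 48 bytes per row generally doesn’t, while widths like 16 or 64 do. A minimal sketch, assuming that is the cause, would allocate the buffer with cudaMallocPitch and bind with the returned pitch:

// Sketch (assumption): allocate pitched memory so every row starts at an
// alignment the texture unit accepts, then bind with that pitch.
float* d_test = 0;
size_t pitchBytes = 0;
int w = (*fluid).cx + 2;   // 12 texels per row
int h = (*fluid).cy + 2;   // 12 rows

cudaMallocPitch((void**)&d_test, &pitchBytes, w * sizeof(float), h);

// copy the host data row by row, respecting the pitch
cudaMemcpy2D(d_test, pitchBytes, (*fluid).u, w * sizeof(float),
             w * sizeof(float), h, cudaMemcpyHostToDevice);

cudaBindTexture2D(0, texTest, d_test, channelDesc, w, h, pitchBytes);

Reads through tex2D are unaffected by the pitch; only direct linear-memory indexing in the kernel would need pitchBytes.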

The simulation itself runs without any problems, so the issue shouldn’t be something like not launching enough threads. Right now I’m using cudaArrays, and I found out that cudaMemcpy(deviceToDevice) is two times faster than cudaMemcpyToArray(deviceToDevice), which is why I want to use linear memory instead.

Does anyone have an idea what the problem might be? I would be most grateful for any help.

I just checked that it works with a cudaArray. So if I change the program a little bit to this:

(*fluid).cx = 10; //the width of the array is then 12
(*fluid).cy = 10; //height is 12

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
texTest.addressMode[0] = cudaAddressModeClamp;
texTest.addressMode[1] = cudaAddressModeClamp;
texTest.filterMode     = cudaFilterModePoint;
texTest.normalized     = false;

cudaArray* testArray;
cutilSafeCall( cudaMallocArray(&testArray, &channelDesc, cx+2, cy+2) );
cudaMemcpyToArray(testArray, 0, 0, (*fluid).u, N*sizeof(float), cudaMemcpyHostToDevice);
cudaBindTextureToArray(texTest, testArray, channelDesc);

dim3 threadsPerBlock(16, 16);
dim3 numBlocks( (((*fluid).cx+2)+threadsPerBlock.x-1) / threadsPerBlock.x, (((*fluid).cy+2)+threadsPerBlock.y-1) / threadsPerBlock.y);
TESTKERNEL<<<numBlocks, threadsPerBlock>>>(d_testR, d_xco, d_yco, d_fluid);
//d_testR - array to store the output, d_xco and d_yco are arrays to store the thread indices, and d_fluid contains info about the simulation grid (size etc.)

//copy back to host and print

everything is OK.

I would like to ask one more question: why is cudaMemcpyToArray( …, deviceToDevice) so much slower than cudaMemcpy( …, deviceToDevice) for arrays of the same size?

cudaArrays are internally organized using a space-filling curve (see Wikipedia), so that spatially localized access results in localized memory access. Copying into one therefore requires complicated shuffling of the original data.
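For illustration only, here is a minimal sketch of a Z-order (Morton) index, one common space-filling curve; it is not the actual cudaArray layout (which, as noted below, is undocumented), but it shows why texels that are neighbours in (x, y) can end up close together in memory:

// Illustration only: Z-order (Morton) indexing, one common space-filling curve.
// This is NOT the actual cudaArray layout, which is undocumented.
__host__ __device__ unsigned int part1By1(unsigned int v)   // spread the lower 16 bits apart
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

__host__ __device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    return part1By1(x) | (part1By1(y) << 1);   // interleave the bits of x and y
}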

The actual memory layout is kept shrouded in secrecy, probably for patent reasons and/or to keep people from relying on a layout that is subject to change across various generations of silicon.