Texture / Array Access

Hello everybody,

I already read the manual, used the search, and browsed the CUDA SDK files, but did not find an example where CUDA array or texture values are manipulated.

Are these values read-only?

I’m currently using 2D-Textures for images if that helps.

What I want to do, for example, is multiply each value in an image by 2, or set all entries of a cudaArray to 0. (I know how this works if I don't use textures or CUDA arrays.)

I’m really sad that something like the following (C-style) doesn’t work:

CUarray array;

.... //fill array 

array[0] = 10;

Thanks for your help,

greetings xlro.

CUDA arrays/textures are read-only from a kernel’s point of view.
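From the host, though, you can still overwrite the contents, e.g. by copying in a zeroed staging buffer. A minimal sketch, assuming a width x height array of float elements (zeroArray is a made-up helper name):

```cuda
#include <stdlib.h>

// Zero a width x height cudaArray of floats from the host by copying
// in a zero-filled staging buffer (sketch only).
void zeroArray(cudaArray *d_array, int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *h_zero = (float *)calloc((size_t)width * height, sizeof(float));
    cudaMemcpyToArray(d_array, 0, 0, h_zero, bytes, cudaMemcpyHostToDevice);
    free(h_zero);
}
```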

Paulius

thanks for the fast answer :)

If you use a 1D texture bound to device memory, you can write to it.

Since you mention 2D textures via arrays, you’ll need to write to a second buffer and then do a device-to-device memcpy, which is really fast at ~70 GiB/s.
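A sketch of that pattern, assuming a float 2D texture already bound to the cudaArray with cudaBindTextureToArray (names are illustrative):

```cuda
texture<float, 2, cudaReadModeElementType> tex;  // bound to the cudaArray on the host

// Read each texel through the texture unit, scale it, and write the
// result into a plain linear buffer.
__global__ void scale2x(float *d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_out[y * width + x] = 2.0f * tex2D(tex, x + 0.5f, y + 0.5f);
}

// Host side, after the kernel finishes: copy the buffer back into the array.
// cudaMemcpyToArray(d_array, 0, 0, d_out,
//                   width * height * sizeof(float), cudaMemcpyDeviceToDevice);
```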

Very true, but you have to be aware that the texture cache is “read-only”. Meaning that one thread can write to device memory while another thread fetches from the same location, hits the cache, and gets the “old” value. A texture cache entry is not invalidated when the corresponding memory location is written.

This is perfectly fine for some algorithms, like Warshall-Floyd for all-pairs shortest paths. Other algorithms may have a problem.
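For illustration, with a 1D texture bound to the same linear memory a kernel writes, a fetch can hit a stale cache line (a sketch, not code from this thread):

```cuda
texture<float, 1, cudaReadModeElementType> tex;  // bound to d_data with cudaBindTexture

// Each thread scales one element in place. The write goes to device
// memory, but tex1Dfetch() reads through the non-coherent texture
// cache, so another thread fetching the same location within this
// launch may still see the old value.
__global__ void scaleInPlace(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] = 2.0f * tex1Dfetch(tex, i);
}
```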

Paulius

I’m curious why 2D textures are not writeable under CUDA, particularly since 2D textures (declared as render targets) are writeable under DirectX.

Is there a mechanism for invalidating the cache? Can it be done by program, either host or kernel?

Then please help me understand what I’m doing wrong. Here’s my processing loop:

loop 8x {
    process 1k x 1k float4 texture (cudaArray) --> linear mem buffer
    // cudaMemcpyToArray(..., cudaMemcpyDeviceToDevice)
}

Addresses are as returned from cudaMallocArray() & cudaMalloc() so there should not be any alignment issues.

Run time: 21.63 ms

Uncomment the cudaMemcpyToArray():

Run time: 54 ms

32.37 ms to copy 128MB = 3.9 GB/sec, a far cry from 70 GB/sec

And the profiler reports timing info on cudaMemcpyToArray() when copying host->device, but not device->device. That seems very odd to me.

That seems really weird. What device-to-device copy performance do you get when you run the bandwidthTest from the CUDA SDK?

I get the same results as BonsaiScoot on cudaMemcpyToArray, but running the bandwidthTest from the CUDA SDK gives me about 70 GB/sec.

The only difference from the SDK example is that it’s a cudaMemcpyToArray() and not a cudaMemcpy().

It’s weird that cudaMemcpyToArray() is so much slower on a deviceToDevice copy.

You can also run the convolutionTexture example from the SDK.
There’s a deviceToDevice memcpy to a cudaArray, and I also get merely 4.3 GB/sec.

Scoot :)

The SDK’s bandwidth test reports 61.536 GB/sec on-board, and I’m running an MSI 8800 GTX.

VanDammage: thank you for the confirmation - I was worried that I was doing something wrong.

Now I’m worried that I cannot use CUDA and have to go back to DirectX - see this thread: http://forums.nvidia.com/index.php?showtopic=54691

I hope somebody from NVIDIA can explain this bad performance on memcpy to cudaArrays; my algorithm also relies on a fast deviceToDevice memcpy. :(

Nice find VanDammage. I can confirm your experiment on my hardware (gentoo linux w/ 8800 GTX, CUDA 1.1, 169.09 drivers):

"Average time: 0.913098 ms

Bandwidth: 4.278020 GiB/s"

I include my test application below so anyone can try.

This clearly seems to be a performance bug (I don’t recall anything in the release notes about it…). I’m 90% certain that the last time I used cudaMemcpyToArray with a device-to-device copy (way back in CUDA 0.8) I was getting ~70 GiB/s, so the bug must have been introduced since then.

#include <stdio.h>
#include <stdlib.h>

#define CUDA_SAFE_CALL(call) do {                                            \
    cudaError_t err = call;                                                  \
    if (cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
                __FILE__, __LINE__, cudaGetErrorString(err));                \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)

#ifdef NDEBUG
#define CUT_CHECK_ERROR(errorMessage)
#else
#define CUT_CHECK_ERROR(errorMessage) do {                                   \
    cudaThreadSynchronize();                                                 \
    cudaError_t err = cudaGetLastError();                                    \
    if (cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \
                errorMessage, __FILE__, __LINE__, cudaGetErrorString(err));  \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)
#endif

int main()
{
    const int width = 512;
    int len = width * width;

    float4 *d_data;
    CUDA_SAFE_CALL( cudaMalloc((void**)&d_data, sizeof(float4)*len) );

    cudaArray *d_array;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float4>();
    CUDA_SAFE_CALL( cudaMallocArray(&d_array, &channelDesc, width, width) );

    // warm-up copy
    CUDA_SAFE_CALL( cudaMemcpyToArray(d_array, 0, 0, (void*)d_data, len*sizeof(float4), cudaMemcpyDeviceToDevice) );

    cudaEvent_t start, end;
    CUDA_SAFE_CALL( cudaEventCreate(&start) );
    CUDA_SAFE_CALL( cudaEventCreate(&end) );

    CUDA_SAFE_CALL( cudaEventRecord(start, 0) );
    // time N device-to-device copies into the array
    int N = 100;
    for (int i = 0; i < N; ++i)
    {
        cudaMemcpyToArray(d_array, 0, 0, (void*)d_data, len*sizeof(float4), cudaMemcpyDeviceToDevice);
    }
    CUDA_SAFE_CALL( cudaEventRecord(end, 0) );
    CUDA_SAFE_CALL( cudaEventSynchronize(end) );

    float runTime;
    CUDA_SAFE_CALL( cudaEventElapsedTime(&runTime, start, end) );
    runTime /= float(N);
    printf("Average time: %f ms\n", runTime);
    printf("Bandwidth:    %f GiB/s\n\n", (len * sizeof(float4)) / (runTime * 1.0e-3 * 1024*1024*1024));
}

Swapping 169.09 --> 169.21 makes no appreciable difference.

I tried the test app from MisterAnderson on 3 different cards with different memory bus widths:

8600 GTS (128-bit)
8800 GTS (G80, 320-bit)
8800 GTX (384-bit)

The tests ran on WinXP with CUDA 1.1 and display driver 169.21.

The results differ only by a few MiB/sec and are all around 4.2 GiB/sec.

Oops, I made a mistake in the test program. The memory bandwidth calculated is low by a factor of 2 because len elements are read and len elements are written. I also checked on windows and obtain the same low performance. Using cudaMemcpy2DToArray doesn’t change anything. Just as a cross check, I also tried a standard device->device memcpy in global memory and obtain 66 GiB/s. Nobody’s responded to this yet, so I’m submitting a bug to the nvidia bug tracker.

Good idea! Hope this’ll get fixed soon.

I’m not surprised at all that cudaMemcpyToArray() has slower performance. I expect that the large performance difference between cudaMemcpy() and cudaMemcpyToArray() arises because the cudaMemcpyToArray() implementation has to deal with boundary padding and conversion of the data into the memory layout required by the texture unit, which may not be the same as the original data layout. I wouldn’t be surprised if the performance of cudaMemcpyToArray() varies significantly with the texturing mode, the data type, the memory strides, and so on.

Cheers,
John Stone

Just an update for those interested: The status on the bug I submitted for this problem changed to “fixed”.

I hope this means that the fix will be in the beta release expected this month, but I obviously don’t know for certain.

System: Sun Ultra 40 M2 / Tesla D870 / Centos 5.1 / CUDA 2.0 beta
I ran the test file posted above (with the len*2 fix) and get:
Bandwidth: 55.857959 GiB/s
In multiple runs, fluctuations are less than 0.1 GiB/s.

It seems this bug is definitely fixed in CUDA 2.0. Enjoy :)