Texture / Array Access

Hello everybody,

I already read the manual, used the search, and browsed the CUDA SDK files, but did not find an example where CUDA array or texture values are manipulated.

Are these values read-only?

I’m currently using 2D-Textures for images if that helps.

What I want to do, for example, is multiply each value in an image by 2, or set all entries of a cudaArray to 0. (I know how this works if I don't use textures or CUDA arrays.)

I’m really sad that something like the following (C-style) doesn’t work:

CUarray array;

.... //fill array 

array[0] = 10;

Thanks for your help,

greetings xlro.

CUDA arrays/textures are read-only from a kernel’s point of view.
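From the host, though, you can still overwrite the contents, e.g. by copying in a zeroed staging buffer. A minimal sketch, assuming a width x height array of float elements (zeroArray is a made-up helper name):

```cuda
#include <stdlib.h>

// Zero a width x height cudaArray of floats from the host by copying
// in a zero-filled staging buffer (sketch only).
void zeroArray(cudaArray *d_array, int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *h_zero = (float *)calloc((size_t)width * height, sizeof(float));
    cudaMemcpyToArray(d_array, 0, 0, h_zero, bytes, cudaMemcpyHostToDevice);
    free(h_zero);
}
```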

Paulius

thanks for the fast answer :)

If you use a 1D texture bound to device memory, you can write to it.

Since you mention 2D textures via arrays, you’ll need to write to a second buffer and then do a device-to-device memcpy, which is really fast at ~70 GiB/s.
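A sketch of that pattern, assuming a float 2D texture already bound to the cudaArray with cudaBindTextureToArray (names are illustrative):

```cuda
texture<float, 2, cudaReadModeElementType> tex;  // bound to the cudaArray on the host

// Read each texel through the texture unit, scale it, and write the
// result into a plain linear buffer.
__global__ void scale2x(float *d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_out[y * width + x] = 2.0f * tex2D(tex, x + 0.5f, y + 0.5f);
}

// Host side, after the kernel finishes: copy the buffer back into the array.
// cudaMemcpyToArray(d_array, 0, 0, d_out,
//                   width * height * sizeof(float), cudaMemcpyDeviceToDevice);
```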

Very true, but you have to be aware that the texture cache is “read-only”. Meaning that one thread can write to device memory while another thread fetches from the same location, hits the cache, and gets the “old” value. A texture cache entry is not invalidated when the corresponding memory location is written.

This is perfectly fine for some algorithms, like Warshall-Floyd for all-pairs shortest paths. Other algorithms may have a problem.
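For illustration, with a 1D texture bound to the same linear memory a kernel writes, a fetch can hit a stale cache line (a sketch, not code from this thread):

```cuda
texture<float, 1, cudaReadModeElementType> tex;  // bound to d_data with cudaBindTexture

// Each thread scales one element in place. The write goes to device
// memory, but tex1Dfetch() reads through the non-coherent texture
// cache, so another thread fetching the same location within this
// launch may still see the old value.
__global__ void scaleInPlace(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] = 2.0f * tex1Dfetch(tex, i);
}
```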

Paulius

I’m curious why 2D textures are not writeable under CUDA, particularly since 2D textures (declared as render targets) are writeable under DirectX.

Is there a mechanism for invalidating the cache? Can it be done by program, either host or kernel?

Then please help me understand what I’m doing wrong. Here’s my processing loop:

loop 8x {
    process 1k x 1k float4 texture (cudaArray) --> linear mem buffer
    // cudaMemcpyToArray(..., cudaMemcpyDeviceToDevice)
}

Addresses are as returned from cudaMallocArray() & cudaMalloc() so there should not be any alignment issues.

Run time: 21.63 ms

Uncomment the cudaMemcpyToArray():

Run time: 54 ms

32.37 ms to copy 128MB = 3.9 GB/sec, a far cry from 70 GB/sec

And the profiler reports timing info on cudaMemcpyToArray() when copying host->device, but not device->device. That seems very odd to me.

That seems really weird. What device-to-device copy performance do you get when you run the bandwidthTest from the CUDA SDK?

I get the same results as BonsaiScoot on cudaMemcpyToArray, but running the bandwidthTest from the CUDA SDK gives me about 70 GB/sec.

The only difference from the SDK example is that it’s a cudaMemcpyToArray() and not a cudaMemcpy().

It’s weird that cudaMemcpyToArray() is so much slower on a deviceToDevice copy.

You can also run the convolutionTexture example from the SDK.
There’s a deviceToDevice memcpy to a cudaArray, and I also get merely 4.3 GB/sec.

Scoot :)

The SDK’s bandwidth test reports 61.536 GB/sec on-board, and I’m running an MSI 8800 GTX.

VanDammage: thank you for the confirmation - I was worried that I was doing something wrong.

Now I’m worried that I cannot use CUDA and have to go back to DirectX - see this thread: http://forums.nvidia.com/index.php?showtopic=54691

I hope somebody from NVIDIA can explain this bad performance on memcpy to cudaArrays; my algorithm also relies on a fast deviceToDevice memcpy. :(

Nice find VanDammage. I can confirm your experiment on my hardware (gentoo linux w/ 8800 GTX, CUDA 1.1, 169.09 drivers):

"Average time: 0.913098 ms

Bandwidth: 4.278020 GiB/s"

I include my test application below so anyone can try.

This clearly seems to be a performance bug (I don’t recall anything in the release notes about it…). I’m 90% certain that the last time I used cudaMemcpyToArray with a device-to-device copy (way back in CUDA 0.8) I was getting ~70 GiB/s, so the bug must have been introduced since then.

#include <stdio.h>
#include <stdlib.h>

#define CUDA_SAFE_CALL(call) do {                                            \
    cudaError_t err = call;                                                  \
    if (cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
                __FILE__, __LINE__, cudaGetErrorString(err));                \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)

#ifdef NDEBUG
#define CUT_CHECK_ERROR(errorMessage)
#else
#define CUT_CHECK_ERROR(errorMessage) do {                                   \
    cudaThreadSynchronize();                                                 \
    cudaError_t err = cudaGetLastError();                                    \
    if (cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \
                errorMessage, __FILE__, __LINE__, cudaGetErrorString(err));  \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)
#endif

int main()
{
    const int width = 512;
    int len = width * width;

    float4 *d_data;
    CUDA_SAFE_CALL( cudaMalloc((void**)&d_data, sizeof(float4)*len) );

    cudaArray *d_array;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float4>();
    CUDA_SAFE_CALL( cudaMallocArray(&d_array, &channelDesc, width, width) );

    // warm-up copy
    CUDA_SAFE_CALL( cudaMemcpyToArray(d_array, 0, 0, (void*)d_data, len*sizeof(float4), cudaMemcpyDeviceToDevice) );

    cudaEvent_t start, end;
    CUDA_SAFE_CALL( cudaEventCreate(&start) );
    CUDA_SAFE_CALL( cudaEventCreate(&end) );

    CUDA_SAFE_CALL( cudaEventRecord(start, 0) );
    // time N device-to-device copies into the array
    int N = 100;
    for (int i = 0; i < N; ++i)
    {
        cudaMemcpyToArray(d_array, 0, 0, (void*)d_data, len*sizeof(float4), cudaMemcpyDeviceToDevice);
    }
    CUDA_SAFE_CALL( cudaEventRecord(end, 0) );
    CUDA_SAFE_CALL( cudaEventSynchronize(end) );

    float runTime;
    CUDA_SAFE_CALL( cudaEventElapsedTime(&runTime, start, end) );
    runTime /= float(N);
    printf("Average time: %f ms\n", runTime);
    printf("Bandwidth:    %f GiB/s\n\n", (len * sizeof(float4)) / (runTime * 1.0e-3 * 1024*1024*1024));
}

Swapping 169.09 --> 169.21 makes no appreciable difference.

I tried the test app from MisterAnderson on 3 different cards with different memory bus widths:

8600 GTS (128-bit)
8800 GTS (G80, 320-bit)
8800 GTX (384-bit)

The tests ran on WinXP with CUDA 1.1 and display driver 169.21.

The results differ only by a few MiB/sec and are all around 4.2 GiB/sec.

Oops, I made a mistake in the test program. The memory bandwidth calculated is low by a factor of 2 because len elements are read and len elements are written. I also checked on windows and obtain the same low performance. Using cudaMemcpy2DToArray doesn’t change anything. Just as a cross check, I also tried a standard device->device memcpy in global memory and obtain 66 GiB/s. Nobody’s responded to this yet, so I’m submitting a bug to the nvidia bug tracker.

Good idea! Hope this’ll get fixed soon.

I’m not surprised at all that cudaMemcpyToArray() has slower performance. I expect that the large performance difference between cudaMemcpy() and cudaMemcpyToArray() arises because the cudaMemcpyToArray() implementation has to deal with boundary padding and conversion of the data into the memory layout required by the texture unit, which may not be the same as the original data layout. I wouldn’t be surprised if the performance of cudaMemcpyToArray() varies significantly with the texturing mode, the data type, the memory strides, and so on.

Cheers,
John Stone

Just an update for those interested: The status on the bug I submitted for this problem changed to “fixed”.

I hope this means that the fix will be in the beta release expected this month, but I obviously don’t know for certain.

System: Sun Ultra 40 M2 / Tesla D870 / Centos 5.1 / CUDA 2.0 beta
I ran the test file posted above (with the len*2 fix) and get:
Bandwidth: 55.857959 GiB/s
In multiple runs, fluctuations are less than 0.1 GiB/s.

It seems this bug is definitely fixed in CUDA 2.0. Enjoy :)