cudaMallocArray bug: CUDA error: invalid argument

So here is the bug I ran into:

#include <cstdio>
#include <cutil.h>   /* for CUDA_SAFE_CALL from the SDK samples */

/* check the last CUDA error and bail out of the current function on failure */
#define CUERR { cudaError_t err; \
  if ((err = cudaGetLastError()) != cudaSuccess) { \
    printf("CUDA error: %s, line %d\n", cudaGetErrorString(err), __LINE__); \
    return 0; }}

texture<float, 2, cudaReadModeElementType> text;

extern "C"
int bug(unsigned int x, unsigned int y)
{
  cudaArray* cu_array;
  printf("malloc Array:(%d,%d)\n", x, y);
  /* allocate a 2D CUDA array matching the texture's channel format */
  CUDA_SAFE_CALL( cudaMallocArray( &cu_array, &text.channelDesc, x, y));
  CUERR
  CUDA_SAFE_CALL( cudaFreeArray(cu_array));
  printf("Success\n");
  return 1;
}
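
For reference, a minimal driver to reproduce it would look roughly like this (main here is hypothetical, it just calls bug() with two of the sizes mentioned in this thread):

int main()
{
  bug(128, 128);   // works
  bug(14400, 1);   // gives "invalid argument" for me
  return 0;
}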

Can anybody try to reproduce it, to see if it's just me?

I get an invalid argument error with x or y greater than 13345.

Is this a size limitation of cudaArray?

I have an 8800 GTX under Linux.

On an internal build, I cannot replicate your bug (so if it was there, it has been fixed).

[fatica@dhcp ~]$ ./a.out
malloc Array:(128,128)
Success
[fatica@dhcp ~]$ ./a.out
malloc Array:(12800,128)
Success
[fatica@dhcp ~]$ ./a.out
malloc Array:(22800,128)
Success
[fatica@dhcp ~]$ ./a.out
malloc Array:(22800,22800)
CUDA error: out of memory, line 15

There will be a new release in a few weeks.

OK, but if you do malloc Array:(14400,1)

you get: CUDA error: invalid argument, line 15.

Why is that? I would think the second dimension is allowed to be equal to 1, no?

Also, I found out why I was hitting that limit: another OpenGL process was running on the device at the same time and taking up memory.

Now my other question: what is the size limit for cudaMallocArray? Is it the 768 MB of the device minus some runtime memory usage?
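
For what it's worth, to check how much device memory is actually left while the OpenGL app is running, something like this should do it (a minimal sketch; cudaMemGetInfo may not be in the beta runtime, the driver API equivalent is cuMemGetInfo):

#include <cstdio>
#include <cuda_runtime.h>

// Print free vs. total device memory before attempting the big cudaMallocArray.
void report_free_memory()
{
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess)
    printf("free: %zu MB / total: %zu MB\n", free_bytes >> 20, total_bytes >> 20);
}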

I think I also have a memory problem.

If I create two textures:
texture<float4, 2, cudaReadModeElementType> t1;
texture<float, 2, cudaReadModeElementType> t2;

and allocate/bind them:
malloc t1:(248,31)
Success
malloc t2:(30752,4)
Success

All the elements of t2 are set up so that texfetch(t2, i, 0…3) = i.

However, when I do texfetch(t2, 0, 0…3)
I get 16384.0 (where I should get 0),
and texfetch(t2, 1, 0…3)
gives 16385.0 (where I should get 1).
Then if I do texfetch(t2, 16384, 0…3) I get something random, and the same for any value beyond that.

So why is that? Are my two textures too big or something? I really don't understand…

And of course, if I use smaller cudaArrays, like:
malloc textInput:(168,21)
malloc textFilter:(14112,4)

everything works fine…
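
For reference, this is roughly how I check the values: I dump one row of t2 back into global memory and compare it on the host (a sketch; dump_row is just an ad-hoc helper, and texfetch is the beta-era name of the 2D fetch, tex2D in later releases):

// Copy one row of t2 into a global buffer so the host can see where the
// values start to go wrong.
__global__ void dump_row(float* out, int width, int row)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < width)
    out[i] = texfetch(t2, i, row);
}

On the host I just compare out[i] against i; that is how I see the offset of 16384 described above.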

In the current release, if the second dimension is equal to 1, the first dimension has to be smaller than 8192. There is an 8K limit for 1D texture arrays, and CUDA views a 2D array of size Nx1 as a 1D array of length N.

We will improve the documentation.

OK, that makes more sense.

I guess if I want to use more than 8192 elements, I'll set y equal to 2, along the lines of the sketch below. Would that be OK?
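
Roughly this (untested sketch; host_data and n are placeholders, and I am assuming a plain row-major copy with n padded to an even count):

// Pack a 1D buffer of n floats into a (width x 2) cudaArray to get
// around the 8192 limit on Nx1 arrays.
unsigned int width = (n + 1) / 2;                 // ceil(n / 2)
cudaArray* cu_array;
cudaMallocArray(&cu_array, &t2.channelDesc, width, 2);
cudaMemcpy2DToArray(cu_array, 0, 0, host_data,
                    width * sizeof(float),        // source pitch
                    width * sizeof(float), 2,     // row width in bytes, height
                    cudaMemcpyHostToDevice);
cudaBindTextureToArray(t2, cu_array);

// In the kernel, element i is then fetched as:
//   float v = texfetch(t2, i % width, i / width);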

Yes, for now that is a good solution.

But actually it doesn't solve my problem.

I still can't use a texture where any dimension is bigger than 16384.

I have no idea where this number comes from, but if I have a 2D texture like this:

malloc Array:(16384,x) with x >= 1

my values get messed up in the texture, as I described above.

Why?

Yes, I have the same problem… As soon as I attempt to allocate an array larger than 32768, I get the invalid argument error. Even if I increase my second dimension to an arbitrary number greater than one, I cannot go above this limit.

What’s going on here?

Well, what I ended up doing was to use textures up to the max size, and if the data was bigger, split it across several…

I know CUDA is still beta, but for the next release they should definitely publish the specifications for textures/cudaArrays.

And if you do that, can you use up to the maximum device memory (theoretically)? And if you can use all that memory, I wonder what kind of overhead that implies, i.e. whether it's worth it in the end…

Well, yeah, you can malloc as many cudaArrays as you want (up to the memory available)

and then use your single texture reference to bind them, one by one.

Keep in mind, though, that you're going to have to run several kernels, since you can only bind the texture from the host.
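
Roughly like this (a sketch; chunks[], num_chunks, process_chunk and d_out are placeholders):

// One texture reference, several cudaArrays, one kernel launch per chunk.
for (int c = 0; c < num_chunks; ++c) {
  cudaBindTextureToArray(t2, chunks[c].cu_array);  // rebinding happens on the host
  process_chunk<<<grid, block>>>(d_out, chunks[c].offset, chunks[c].width);
  cudaUnbindTexture(t2);                           // optional before the next bind
}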

Couldn’t you bind them all up front and then just texfetch the appropriate one from within one kernel?