I am working on a multithreaded (pthreads) application in which I copy data from a malloc'd buffer on the host into a CUDA array on the device (bound to texture memory), using cudaMemcpy2DToArray. The array is allocated with cudaMallocArray(), which succeeds, at least judging by its return value. The host memory is allocated in a different thread than the one that calls the GPU; the GPU thread only holds a pointer to that buffer.
When I do the copy, cudaMemcpy2DToArray always returns cudaErrorInvalidValue, and after several days of debugging I am close to giving up. I have checked that all pointers are valid and point to the intended memory, and that the size, offset, and spitch values are correct. Different host allocation strategies, for example allocating the whole required buffer up front versus allocating parts of it at a time, give the same result (cudaErrorInvalidValue).
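To make the call pattern concrete, here is a minimal sketch of what my code does, reduced to a single thread; the names and sizes are illustrative, not my exact code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    const size_t width = 256, height = 256;   // element counts, illustrative

    // Host buffer; in my real code this is allocated and filled in another thread.
    float *hostBuf = (float *)malloc(width * height * sizeof(float));

    // CUDA array intended for texture memory; this call returns cudaSuccess for me.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *devArray = NULL;
    cudaError_t err = cudaMallocArray(&devArray, &desc, width, height);
    printf("cudaMallocArray: %s\n", cudaGetErrorString(err));

    // The copy that fails: spitch and the copied width are given in bytes,
    // which as far as I can tell I have set correctly.
    err = cudaMemcpy2DToArray(devArray, 0, 0, hostBuf,
                              width * sizeof(float),   // spitch (bytes)
                              width * sizeof(float),   // width of copy (bytes)
                              height,                  // number of rows
                              cudaMemcpyHostToDevice);
    printf("cudaMemcpy2DToArray: %s\n", cudaGetErrorString(err));

    cudaFreeArray(devArray);
    free(hostBuf);
    return 0;
}
```

In this single-threaded form the parameters look right to me; the failure only appears in the threaded setup described above.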
One more observation may be relevant, though I am not sure how to interpret it: if I try to cudaMemset the array created by cudaMallocArray, the call always returns cudaErrorInvalidDevicePointer. If I use plain cudaMemcpy instead of cudaMemcpy2DToArray, I get the same error value. Another strange thing: if I allocate the host memory immediately before calling the GPU code, from the same thread, everything works, but that is neither practical nor efficient for my application.
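For completeness, this is roughly how the work is split across threads (simplified, with synchronization and error handling omitted; names are illustrative):

```cuda
#include <pthread.h>
#include <cstdlib>
#include <cuda_runtime.h>

static float *sharedHostBuf;                   // written by the allocator thread
static const size_t kWidth = 256, kHeight = 256;

// Thread A: allocates and fills the host buffer.
void *allocThread(void *arg) {
    sharedHostBuf = (float *)malloc(kWidth * kHeight * sizeof(float));
    // ... fill sharedHostBuf ...
    return NULL;
}

// Thread B: does all CUDA work, using the pointer produced by thread A.
void *gpuThread(void *arg) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *devArray = NULL;
    cudaMallocArray(&devArray, &desc, kWidth, kHeight);  // reports cudaSuccess

    // This is where I get cudaErrorInvalidValue.
    cudaMemcpy2DToArray(devArray, 0, 0, sharedHostBuf,
                        kWidth * sizeof(float), kWidth * sizeof(float),
                        kHeight, cudaMemcpyHostToDevice);
    cudaFreeArray(devArray);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, allocThread, NULL);
    pthread_join(a, NULL);    // host buffer is ready before the GPU thread starts
    pthread_create(&b, NULL, gpuThread, NULL);
    pthread_join(b, NULL);
    free(sharedHostBuf);
    return 0;
}
```

The allocator thread always finishes before the GPU thread starts, so the host pointer should be fully valid by the time the copy is attempted.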
My question is: has anyone experienced anything similar, and does anyone have ideas or suggestions on how to fix this or how to proceed with debugging?
Thanks in advance for any help.