I have a bunch of arrays like this that only need to be calculated inside the kernel.
Is this legit?
This compiles fine, but as soon as I try to assign a value to vav[i][j][k], I get a seg fault.
This is strange. It seems that the malloc call itself does not generate the error, only the assignment does.
Why would that be? And why is there no error message at compile time?
If I do cudaMalloc outside the kernel, do I need to pass the array as an argument?
That shouldn’t be the case. nvcc should emit an "error: calling a host function from a device/global function is only allowed in device emulation mode" message and stop.
No. Alternatively, you can write the address of the allocation onto a constant memory symbol. But the kernel has to be made aware of it one way or the other.
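For concreteness, here is a minimal sketch of that approach, written against the current runtime API; the names (dev_ptr, kernel, N) are placeholders and not code from this thread:

// Hedged sketch: allocate with cudaMalloc on the host, then write the
// resulting address onto a constant-memory symbol with cudaMemcpyToSymbol
// so the kernel can use it without taking it as an argument.
// A plain __device__ pointer symbol works the same way.
#include <cuda_runtime.h>

__constant__ float *dev_ptr;          // device-visible copy of the pointer

__global__ void kernel(int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        dev_ptr[idx] = 2.0f * idx;    // pointer read from constant memory, data lives in global memory
}

int main()
{
    const int N = 256;
    float *buf = 0;
    cudaMalloc((void **)&buf, N * sizeof(float));        // host-side allocation call
    cudaMemcpyToSymbol(dev_ptr, &buf, sizeof(buf));      // write the address onto the symbol
    kernel<<<1, N>>>(N);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}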
There are two issues - not compiling and not working. The not-working part is almost certainly an error in your code. The not-compiling part is down to what features are and are not supported by the different compilation trajectories nvcc uses, depending on whether you are compiling for emulation or for the device.
I am trying to make this cudaMemcpyToSymbol call work, but no luck so far.
I looked at the examples in the SDK, but none fits my task exactly. For example, how do I declare the symbol?
I tried the following outside the functions:
float ***vav[3]; // it says invalid device symbol
Ok, I can see that the __device__ float is declared outside the function.
This example is not what I need to do. In the example, the values are passed with cudaMemcpy. In my case, some arrays in the kernel are only temporary and use the values from other arrays which are passed with cudaMemcpy. The problem lies with the declaration of these temporary arrays.
There is no problem with scalar values being temporary and local to the kernel, but it is not so for arrays, where some a-priori allocation seems to be needed.
I will play with this example to see if I can advance my case.
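A side note on those temporary arrays: if their dimensions are known at compile time, they can simply be declared per thread inside the kernel and need no allocation at all; only runtime-sized scratch space needs a prior device allocation. A minimal sketch (the 3x3x3 size and the names are illustrative only, not from this thread):

__global__ void scratch_example(const float *in, float *out, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= n) return;

    float tmp[3][3][3];                     // per-thread temporary, no malloc needed
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                tmp[i][j][k] = in[idx] * (i + j + k);   // fill from an array copied in with cudaMemcpy

    out[idx] = tmp[2][2][2];                // only the result goes back to global memory
}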
You asked how to use cudaMemcpyToSymbol to write the address of dynamically allocated memory onto a device side pointer. The code I linked to does exactly that. Sorry for not being more helpful…
That is normal. It compiles and runs otherwise (Ubuntu 9.04 x86_64, CUDA 2.3):
avid@cuda:~/code/cuda_struct$ nvcc -Xopencc="-Wall" example2.cu -o example2
In file included from example2.cu:13:
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_clock':
/opt/cuda-2.3/bin/../include/common_functions.h:72: warning: implicit declaration of function `clock'
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_memset':
/opt/cuda-2.3/bin/../include/common_functions.h:77: warning: implicit declaration of function `memset'
/opt/cuda-2.3/bin/../include/common_functions.h: In function `__cuda_memcpy':
/opt/cuda-2.3/bin/../include/common_functions.h:82: warning: implicit declaration of function `memcpy'
example2.cu: At top level:
/opt/cuda-2.3/bin/../include/common_functions.h:71: warning: `__cuda_clock' defined but not used
/opt/cuda-2.3/bin/../include/common_functions.h:76: warning: `__cuda_memset' defined but not used
/opt/cuda-2.3/bin/../include/common_functions.h:81: warning: `__cuda_memcpy' defined but not used
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
./example2.cu(18): Warning: Cannot tell what pointer points to, assuming global memory space
avid@cuda:~/code/cuda_struct$ ./example2
0 15.0
1 16.0
2 17.0
3 18.0
4 19.0
5 20.0
6 21.0
7 22.0
8 23.0
9 24.0
10 25.0
11 26.0
12 27.0
13 28.0
14 29.0
15 30.0
16 31.0
17 32.0
18 33.0
19 34.0
20 35.0
21 36.0
22 37.0
23 38.0
24 39.0
25 40.0
26 41.0
27 42.0
28 43.0
29 44.0
30 45.0
31 46.0
That is the NoDevice error code. Something in your CUDA installation is broken.
Got it to work finally. When I changed gpuAssert to cutilSafeCall I saw that no CUDA device was available. I restarted X and everything compiled and ran as expected for both emulation and release. I also modified the program to do what I want and it works too. I can assign values to the array ad[idx][0]. Back to regular programming I guess.
Thank you very much for your help.
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x + 1;
    ad[idx][idx][idx] = 23.0 * idx;   // this is what I need to do
    d[idx] = ad[idx][idx][idx];       // in my kernel
}

int main()
{
    int device = 0;
    cudaSetDevice(device);
    float *d, *_d;
    float ***_a;   // can I allocate 3D arrays with malloc .....?
I think you need to hit the books on C pointers and memory management - your problem is well out of the realm of CUDA itself.
If you really want a dynamically allocated 3D array (there are lots of reasons why it is neither necessary nor a good idea), then you are going to have to recursively allocate the memory for it using cudaMalloc, which will be a nightmare. Save yourself a lot of trouble and use 3D indexing into a 1D array.
But I just discovered that I cannot set _DSIZE to more than 31 in the previous code, i.e. the maximum 3D array is 31x31x31 (29,791 elements). Why?
It is not a function of the FTNREF3D macro; I tried without it.
I read in the reference manual that cudaMalloc3DArray can allocate an array up to (8192,0,0) for 1D and (2048,2048,2048) for 3D. But in my case I can only access up to index 29,791 for a 1D array (vector).
What I would like is to be able to set _DSIZE up to 128.
You can’t use cudaMalloc3DArray the way you are trying to. cudaMalloc3DArray is for allocating memory for storing 3D texture data. All cudaArrays are opaque memory types that are not intended to be manipulated by user code other than the texture access calls and the alloc/copy APIs.
As I said in my previous reply, if you want dynamically allocated 3D arrays, you will have to recursively call cudaMalloc - for an nxnxn 3D array, that means a total of 1 + n + n**2 malloc calls, and all sorts of convoluted, recursive copy to symbol calls or initialization kernels. And when you are done, you will have made everything slower because there are now two extra levels of pointer indirection required to get to your data. Just use a 1D linear memory space allocated with cudaMalloc and index into it using a 3D index.
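A minimal sketch of that last suggestion, with a hypothetical IDX3D macro standing in for whatever the poster's FTNREF3D macro does; n, fill, and d_a are illustrative names only:

// One linear allocation of n*n*n floats, addressed with a 3D -> 1D mapping,
// instead of a float*** structure built from 1 + n + n^2 separate allocations.
#include <cuda_runtime.h>

#define IDX3D(i, j, k, n) ((i) * (n) * (n) + (j) * (n) + (k))

__global__ void fill(float *a, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        a[IDX3D(i, i, i, n)] = 23.0f * i;   // same effect as ad[i][i][i] = 23.0*i
}

int main()
{
    const int n = 128;                      // the _DSIZE the poster wanted
    float *d_a = 0;
    cudaMalloc((void **)&d_a, (size_t)n * n * n * sizeof(float));  // single allocation, no pointer chasing
    fill<<<(n + 127) / 128, 128>>>(d_a, n);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}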