I'm taking my first steps in CUDA and, for testing, I want to make a kernel call that uses all the threads my graphics card (a 9600GT) can run in one block.
To get the thread count I use the runtime API in the host code:
cudaDeviceProp Dispositivo; // defined as a global variable
....
cudaSetDevice(i);
// get the device properties
CUDA_SAFE_CALL(cudaGetDeviceProperties(&Dispositivo, i));
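For reference, here is a self-contained sketch of the same query with plain error checking instead of the SDK's `CUDA_SAFE_CALL` macro (the variable name `prop` and device index `0` are my assumptions, not from the snippet above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query the properties of device 0 (assumed to be the 9600GT).
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device:             %s\n", prop.name);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("regsPerBlock:       %d\n", prop.regsPerBlock);
    return 0;
}
```

The `regsPerBlock` field is worth printing too, because the register file, not only `maxThreadsPerBlock`, limits how many threads a given kernel can launch per block.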
So the field Dispositivo.maxThreadsPerBlock gives me the number of threads I can use per block.
Using the DeviceInfor tool from the CUDA SDK I know that this number is 512.
If I make the call this way, the code doesn't work:
CUDASearch<<<1,Dispositivo.maxThreadsPerBlock,0>>>(StrCuda,lencad,CudaD,LastCaracter);
If I fix the number of threads per block to 128, everything works fine.
What am I doing wrong?
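One way to narrow this down is to read the error code right after the launch; a failed launch is silent unless you ask for it. This sketch reuses the call from the question and assumes the same variables are in scope (`cudaThreadSynchronize` is the era-appropriate sync call for this toolkit generation):

```cuda
CUDASearch<<<1, Dispositivo.maxThreadsPerBlock, 0>>>(StrCuda, lencad, CudaD, LastCaracter);

// Errors in the launch configuration itself (e.g. "too many resources
// requested for launch") show up here.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Errors raised while the kernel was running show up after a sync.
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

If the message is "too many resources requested for launch", the kernel's per-thread register usage times 512 exceeds the register file, which would explain why 128 threads per block works while 512 does not.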
Second question:
When I allocate device memory with cudaMalloc, is it stored in global or shared memory?
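As far as I understand it, `cudaMalloc` always allocates global (device DRAM) memory; shared memory is the on-chip per-block memory you declare with `__shared__` inside a kernel or request via the third launch-configuration parameter. A minimal sketch with hypothetical names:

```cuda
#include <cuda_runtime.h>

__global__ void example(int *g_data)      // g_data points into global memory
{
    __shared__ int s_tile[128];           // per-block shared memory, on-chip
    s_tile[threadIdx.x] = g_data[threadIdx.x];
}

int main(void)
{
    int *d_buf;                           // device pointer into global memory
    cudaMalloc((void**)&d_buf, 128 * sizeof(int));
    example<<<1, 128>>>(d_buf);
    cudaFree(d_buf);
    return 0;
}
```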
Yes, I know that, and I also know it is not good practice to use only one block in the grid, but I'm just running some tests to learn how this works.
The CUDA Programming Guide says CUDA can run at most 512 threads per block, and my device reports that same limit, so I don't understand why the call above does not work properly. Maybe I'm forgetting to take some condition into account.
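One condition that maxThreadsPerBlock does not capture: on a compute-capability 1.1 part like the 9600GT the whole block shares 8192 registers, so a kernel compiled to, say, 20 registers per thread can only launch about 409 threads per block even though the hardware ceiling is 512. A sketch of picking a block size that fits (REGS_PER_THREAD is a placeholder; the real number is printed when compiling with `nvcc --ptxas-options=-v`):

```cuda
// Hypothetical per-thread register count reported by ptxas for CUDASearch.
#define REGS_PER_THREAD 20

// Largest block size the register file allows for this kernel.
int maxByRegs = Dispositivo.regsPerBlock / REGS_PER_THREAD;  // e.g. 8192 / 20

int threads = Dispositivo.maxThreadsPerBlock;
if (maxByRegs < threads)
    threads = maxByRegs;
threads = (threads / 32) * 32;            // round down to a whole warp

CUDASearch<<<1, threads, 0>>>(StrCuda, lencad, CudaD, LastCaracter);
```

Shared-memory usage per block constrains the launch the same way, so both resources have to fit for the configuration to be valid.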