Max threads per block and device memory question

Hi everybody,

I have two questions to help improve my code.

First question:

I'm taking my first steps in CUDA and, for testing, I want to make a kernel call using all the threads my graphics card (9600GT) supports.
To get that number, I use the runtime API in the host code:

  cudaDeviceProp Dispositivo;  // defined as a global variable

// get the device properties for device i
CUDA_SAFE_CALL(cudaGetDeviceProperties(&Dispositivo, i));

   So the field Dispositivo.maxThreadsPerBlock gives me the number of threads I can use per block.

   Using DeviceInfo from the CUDA SDK, I can see that this number is 512.
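As a sketch, the kind of launch I mean looks roughly like this (the kernel name and buffer are placeholders, not my real code):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Placeholder kernel: each thread just writes its global index.
__global__ void fillKernel(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx;
}

int main(void)
{
    cudaDeviceProp Dispositivo;
    cudaGetDeviceProperties(&Dispositivo, 0);

    int threads = Dispositivo.maxThreadsPerBlock;  // 512 on a 9600GT
    int *d_out;
    cudaMalloc((void **)&d_out, threads * sizeof(int));

    // one block of maxThreadsPerBlock threads
    fillKernel<<<1, threads>>>(d_out);

    // Checking the launch result is the quickest way to see *why* it fails.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

A common cause of this symptom: if the error is "too many resources requested for launch", the kernel uses too many registers per thread to fit a 512-thread block, which would also explain why 128 threads per block works.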

   If I launch the kernel this way, the code doesn't work.


   If I fix the threads per block to 128, everything works fine.
  What am I doing wrong?

Second Question:

  When I allocate memory on the device with cudaMalloc, is it stored in global or shared memory?

Thanks in advance,


First, maxThreadsPerBlock is the maximum total number of threads in one block; the per-dimension limits are in maxThreadsDim.

Your device can run at most 512 threads per block.
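A quick way to see both limits for yourself (host-only code; device 0 assumed):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Total threads allowed in one block (512 on a 9600GT)
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);

    // Per-dimension limits of a block
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}
```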

Second, it is stored in global memory.
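That is, a pointer from cudaMalloc lives in device global memory; shared memory is declared inside a kernel with `__shared__`. A minimal sketch to show the difference (sizes and names are just for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void sumKernel(const float *in, float *out)
{
    // Shared memory is per-block and on-chip, declared inside the kernel.
    __shared__ float tile[128];
    int i = threadIdx.x;
    tile[i] = in[i];        // global -> shared
    __syncthreads();
    if (i == 0) {
        float s = 0.0f;
        for (int j = 0; j < 128; ++j) s += tile[j];
        *out = s;           // shared -> global
    }
}

int main(void)
{
    float *d_in, *d_out;    // both cudaMalloc allocations live in global memory
    cudaMalloc((void **)&d_in, 128 * sizeof(float));
    cudaMalloc((void **)&d_out, sizeof(float));

    sumKernel<<<1, 128>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```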

More detailed answers are in the CUDA Programming Guide. :rolleyes:



Yes, I know, and I also know that using only 1 block in the grid is not a good approach, but I'm just running some tests to learn how this works.

The CUDA Programming Guide says that CUDA can run at most 512 threads per block, and my device supports that same number, so

I don't understand why a call like the following does not work properly. Maybe I'm forgetting to take some condition into account.