CUDA Device #0
Major revision number: 2
Minor revision number: 1
Name: GeForce GT 425M
Total global memory: 1008271360
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1120000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: No
Kernel concurrent execution: Yes
The first problem I see is with your index calculation. With dimGrid(1,1024,1024) you are launching blocks along the Y and Z dimensions of the grid, but your tidx and tidy don’t use the block index in the Z dimension, so blocks that differ only in Z end up with the same tidx and tidy.
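To illustrate the indexing point, here is a minimal sketch of a 2D grid of 2D blocks that uses only the X and Y dimensions. The names `fillLinearIndex`, `checkLinearIndex`, `ROWS` and `COLS` are made up for this example, the 1024 x 400 shape is taken from the matrix size mentioned in this thread, and the 16 x 16 block size is just an assumption. It is not your kernel, only a way to verify that every element gets exactly one thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sizes, taken from the 1024 x 400 matrix mentioned in the thread.
const int ROWS = 1024;
const int COLS = 400;

// Each thread stores its own flattened index, so the mapping can be checked on the host.
__global__ void fillLinearIndex(int *out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // X covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Y covers rows
    if (row < rows && col < cols)                     // guard the partial blocks at the edges
        out[row * cols + col] = row * cols + col;
}

// Returns true if every element was written with its own flattened index.
bool checkLinearIndex()
{
    const size_t count = static_cast<size_t>(ROWS) * COLS;
    const size_t bytes = count * sizeof(int);

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;
    cudaMemset(d_out, 0xFF, bytes);                   // every int becomes -1 before the launch

    dim3 block(16, 16);                               // assumed block size: 256 threads
    dim3 grid((COLS + block.x - 1) / block.x,         // round up so the grid covers all columns
              (ROWS + block.y - 1) / block.y);        // and all rows; grid.z stays 1
    fillLinearIndex<<<grid, block>>>(d_out, ROWS, COLS);

    std::vector<int> h_out(count);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (size_t i = 0; i < count; ++i)
        if (h_out[i] != static_cast<int>(i)) return false;
    return true;
}
```

Calling `checkLinearIndex()` from your own `main` should return true if the mapping is right. If the kernel ignored one of the block indices it launches with, some elements would keep the initial -1 and the check would fail.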
I doubt that what you are trying to do with tidx * 400 is going to work; I don’t think you have that much memory to work with. How do you calculate sizeofA? And how do you use tidx, tidy, tid, row and col?
There is no need to use the Z dimension of the grid. But if you’re just trying to multiply matrices of size 1024 by 400, you’re going about it the wrong way. Please check the SDK’s matrix multiplication example and the whitepaper that comes with it for more information on how to implement matrix multiplication efficiently.
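For reference, a rough sketch of the usual shared-memory tiling approach is below. It is not a copy of the SDK sample, only the same general idea. The names `matMulTiled`, `checkMatMulTiled` and `TILE` are made up for this example, and the shape of the second matrix (400 x 1024) is an assumption, since the thread only gives 1024 by 400 for one operand.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile width; each block has TILE x TILE threads

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void matMulTiled(const float *A, const float *B, float *C,
                            int M, int K, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk over the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are loaded as zero so they do not affect the sum.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                 // both tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // finish with the tiles before overwriting them
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}

// Multiplies a 1024 x 400 matrix of ones by a 400 x 1024 matrix of twos
// (assumed shapes) and checks that every result element equals 400 * 2 = 800.
bool checkMatMulTiled()
{
    const int M = 1024, K = 400, N = 1024;
    std::vector<float> hA(static_cast<size_t>(M) * K, 1.0f);
    std::vector<float> hB(static_cast<size_t>(K) * N, 2.0f);
    std::vector<float> hC(static_cast<size_t>(M) * N, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, hA.size() * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dB, hB.size() * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dC, hC.size() * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

        dim3 block(TILE, TILE);
        dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);  // 2D grid, Z unused
        matMulTiled<<<grid, block>>>(dA, dB, dC, M, K, N);

        ok = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);  // cudaFree on a null pointer is a no-op
    if (!ok) return false;

    for (size_t i = 0; i < hC.size(); ++i)
        if (hC[i] != 800.0f) return false;
    return true;
}
```

The point of the tiles is that each element of A and B is read from global memory once per block instead of once per thread. Note that the grid is only 2D here: 64 x 64 blocks are enough to cover the whole 1024 x 1024 result, which is why the Z dimension isn’t needed.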