I am using a GTX Titan Black (2880 cores, 6 GB memory) and CUDA 6.0 for my matrix computation.
The input matrix has a size of:
200*190*500 (single type) * 4B ≈ 72.5 MB.
I also defined some variables and allocated memory for them. The size of all the variables (float type) defined with cudaMalloc() came up to:
(200*190*500*3 + 200*190*500*3*8) * 4B = 1.91 GB,
which is less than my total 6 GB of memory. The program crashed, so I reduced the matrix size to: 200*190*440*4B ≈ 63.8 MB,
and kept the other variables the same. These variables took
(200*190*440*3 + 200*190*440*3*8)*4B = 1.6817 GB of memory, and the program ran with no problem. cudaMemGetInfo() returned 5.97 GB total memory, 5.66 GB free before allocation, and 3.97 GB free after allocation. This confirms that all my variables took 5.66 - 3.97 ≈ 1.68 GB of memory, as I previously calculated. So far, there is no problem with my code.
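For reference, the before/after measurement was done roughly like this (a minimal host-side sketch, not my actual allocation code; the single big cudaMalloc() just stands in for my real buffers):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    size_t freeB, totalB;
    cudaMemGetInfo(&freeB, &totalB);            // free/total device memory in bytes
    printf("before: %.2f GB free of %.2f GB\n",
           freeB / 1073741824.0, totalB / 1073741824.0);

    // illustrative allocation, similar in total size to my real buffers
    float *d_buf = 0;
    size_t nBytes = (size_t)200 * 190 * 440 * 27 * sizeof(float);  // ~1.68 GB
    cudaMalloc(&d_buf, nBytes);

    cudaMemGetInfo(&freeB, &totalB);
    printf("after:  %.2f GB free\n", freeB / 1073741824.0);

    cudaFree(d_buf);
    return 0;
}
```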

Then I slightly increased my matrix to 200*190*441*4B ≈ 63.9 MB, keeping all the other variables the same, which took 1.6856 GB of memory (slightly more than 1.6817 GB). My program started to crash and gave me some weird results. I also tried changing the matrix to 200*190*{442,443,444,…}, and I had the same problem and results.

By the way, my block size was (8,8,8), and my grid size was (200/8, 190/8 * 440/8) = (25, 1320). I do not know what the problem is, and I really want to get this GPU to work, as I spent a lot of money on it. I would appreciate it if anyone could help me out.

Just because your program crashed when you changed the matrix dimensions slightly does not mean that it is a GPU problem or a GPU memory problem.

You could easily have an error in your code.

Do proper CUDA error checking and run your code with cuda-memcheck. If cuda-memcheck reports any errors in the failing case, that usually indicates a coding defect on your part.
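A minimal sketch of what "proper cuda error checking" means (the macro name and layout here are illustrative, not a canonical implementation):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// wrap every CUDA API call; after a kernel launch, check both the launch
// itself and the subsequent synchronization
#define cudaCheck(call)                                                   \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// usage:
//   cudaCheck(cudaMalloc(&d_ptr, nBytes));
//   myKernel<<<grid, block>>>(...);
//   cudaCheck(cudaGetLastError());        // launch errors
//   cudaCheck(cudaDeviceSynchronize());   // runtime errors, incl. timeouts
```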

It is possible that a simple kernel timeout is occurring even under Linux, but proper CUDA error checking usually gives you an indication of that (the error code will indicate a kernel timeout).

Thank you txbob. I have attached my demo code (runme.txt is actually runme.m, as .m files are not allowed here) and would appreciate it if you could take a look at it. I tried a matrix size of 700*700*690, which did not work. I reduced it to 700*700*689 and it worked, as did smaller sizes. I have no idea what is happening. testCuda.cu (4.18 KB) runme.txt (226 Bytes)

You say the program “crashed”. Crashed how? Is there a status check for every CUDA API call and every kernel invocation? Does any of them report an error? If so, what error is reported? What happens if you run the code under the control of cuda-memcheck?

Given the dimensions of the 3D matrix, I wonder whether the code might run into (signed) integer overflow in some intermediate computations, although the difference in behavior between dimensions of 689 and 690 would seem to suggest that that is probably not the issue.

I did error checking as you suggested, but forgot to change it back after testing the Check_CUDA_Error() function.

I have also provided a standalone test demo. I compiled it with nvcc -o cuda.out cuda.cu and ran cuda-memcheck, with the following output:

========= CUDA-MEMCHECK
========= Error: process didn’t terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= ERROR SUMMARY: 0 errors

Then I ran the code as ./cuda.out 5 5 to multiply two 5*5 matrices element by element. Below is a copy of the output:
iNx is: 5, iNy is: 5.
blocksInX is: 1, blocksInY is: 1
dimBlock.x is: 8, dimGrid.x is: 1
dimBlock.y is: 8, dimGrid.y is: 1
pf_o[0] is: 0.000000
pf_o[1] is: 1.000000
pf_o[2] is: 2.000000
pf_o[3] is: 3.000000
pf_o[4] is: 4.000000.

This is what I expected, but when I increased the size of the two matrices it gave me wrong results. For example, I ran the code as ./cuda.out 10000 10000 and got the following output:

iNx is: 10000, iNy is: 10000.
blocksInX is: 1250, blocksInY is: 1250
dimBlock.x is: 8, dimGrid.x is: 1250
dimBlock.y is: 8, dimGrid.y is: 1250
pf_o[99999990] is: 99999992.000000
pf_o[99999991] is: 99999992.000000
pf_o[99999992] is: 99999992.000000
pf_o[99999993] is: 99999992.000000
pf_o[99999994] is: 99999992.000000
pf_o[99999995] is: 99999992.000000
pf_o[99999996] is: 100000000.000000
pf_o[99999997] is: 100000000.000000
pf_o[99999998] is: 100000000.000000
pf_o[99999999] is: 100000000.000000

I noticed that the outputs for lower indices were correct, unlike the higher indices shown above. Any comments? cuda.cu (3.61 KB)

What is your code doing? If it is summing ‘float’ data, it looks like you are running into the limited granularity of the ‘float’ data type, which causes data that should mathematically sum to different results to round to the same floating-point number.

The resolution provided by ‘float’ is only about 7 decimal digits, and the numbers above exceed that length. Try switching your data to ‘double’.