PyCUDA: very big 1D array indexing

I am computing a very large 2D array (5000000 x 144) with PyCUDA and passing it to the GPU as a flattened 1D array.

The kernel is a very simple operation, but not every cell of the result array is updated. Elements dis_out[0,0] through dis_out[3333333,47] are filled with 4, but all the others remain 0. I used the usual 1D array indexing formula. Why are some elements never touched by a thread?
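
For reference, the 1D indexing formula meant here is the row-major mapping: element (row, col) of an array with NCOL columns sits at flat index row * NCOL + col. A minimal NumPy sketch, using a small hypothetical shape instead of the full 5000000 x 144 array:

```python
import numpy as np

# Small hypothetical shape for illustration; the real array is 5000000 x 144.
NROW, NCOL = 4, 144
a = np.arange(NROW * NCOL, dtype=np.float32).reshape(NROW, NCOL)
flat = a.ravel()  # C-order (row-major) flattening, as copied to the GPU

row, col = 3, 48
idx = row * NCOL + col          # 2D -> 1D
assert flat[idx] == a[row, col]

back = divmod(idx, NCOL)        # 1D -> 2D, gives (row, col) again
```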

import math
import numpy as np
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

len_of_index=16
Num_of_Conformer= 1000
BLOCK_SIZE = 1024 
Num_of_index = 5000
BLOCKSX = int(math.ceil(float(Num_of_Conformer*Num_of_index)*float(len_of_index*6)/float(BLOCK_SIZE)))

ligand_dis = np.zeros((5000000,144), dtype=np.float32)
dis_out = np.zeros((5000000,144), dtype=np.float32)

NROW=Num_of_Conformer*Num_of_index
NCOL=len_of_index*9

r_nx = np.int32(Num_of_index)
r_ny = np.int32(9)

# Total element count. Passed as int32: float32 cannot represent every
# index near 720,000,000 exactly, so a float bound can skip the last elements.
max_i = Num_of_index*Num_of_Conformer*9*len_of_index
va = np.int32(max_i)

ligand_dis_gpu = cuda.mem_alloc(ligand_dis.nbytes)
dis_out_gpu = cuda.mem_alloc(dis_out.nbytes)
cuda.memcpy_htod(ligand_dis_gpu, ligand_dis)
cuda.memcpy_htod(dis_out_gpu, dis_out)

mod = SourceModule("""
    // row is currently unused; max_i is the total number of elements
    __global__ void m_op(float *dis_out, float *ligand, int row, int max_i)
    {
        // one thread per element of the flattened array
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < max_i) {
            dis_out[idx] = fabsf((5 + dis_out[idx]) - (ligand[idx] + 1));
        }
    }
""")
func=mod.get_function("m_op")
func(dis_out_gpu, ligand_dis_gpu, r_nx, va, block=(BLOCK_SIZE,1,1),grid=(BLOCKSX,1,1))

cuda.memcpy_dtoh(dis_out, dis_out_gpu)

print dis_out[3333333,47]
print dis_out[3333333,48]

Final result:
dis_out[3333333,47] -> 4
dis_out[3333333,48] -> 0

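I solved the problem: I had miscalculated BLOCKSX.
The grid size was computed with len_of_index*6, but each row has len_of_index*9 = 144 columns, so only 480,000,000 of the 720,000,000 elements got a thread.
Sorry.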
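
A minimal sketch of the corrected grid-size calculation, assuming the mistake is the factor len_of_index*6 in BLOCKSX where the array actually has len_of_index*9 columns per row. The names total, old_total, old_blocks and first_untouched are introduced here for illustration only; the rest come from the code above.

```python
len_of_index = 16
Num_of_Conformer = 1000
Num_of_index = 5000
BLOCK_SIZE = 1024

NROW = Num_of_Conformer * Num_of_index
NCOL = len_of_index * 9
total = NROW * NCOL                      # elements the kernel must cover

# Integer ceiling division: enough blocks for one thread per element.
BLOCKSX = (total + BLOCK_SIZE - 1) // BLOCK_SIZE

# What the original expression launched (factor 6 instead of 9).
old_total = NROW * len_of_index * 6
old_blocks = (old_total + BLOCK_SIZE - 1) // BLOCK_SIZE

# First flat index with no thread, converted back to (row, col).
first_untouched = divmod(old_blocks * BLOCK_SIZE, NCOL)

print(BLOCKSX)          # 703125
print(first_untouched)  # (3333333, 48)
```

Deriving the block count from NROW * NCOL keeps the launch configuration tied to the array shape, so the two cannot drift apart. The first untouched element matches the output above, where dis_out[3333333,48] is still 0.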