Is memory management different between the TX1 and the TX2?

Hi,

I am running Python code on the TX2 that I had verified to run correctly on the TX1.
However, the TX2 seems to differ in how its kernel is built and how memory is used.
The code below is my Python (PyCUDA) code.
I have confirmed on the TX1 that it generates a Fresnel hologram correctly.
This is not the only error; overall, the memory management mechanism seems to behave differently on the TX2 than on the TX1.

Does anyone have a way to solve this?

Here is the code; you can ignore the other parts.
The arrays fx, fy, Gbp, and Gfp in this function are of size 5184x3888. The specific failure point is cuda.memcpy_dtoh().

def cuda_fd_hologram(fx, fy, Nx, Ny, k, dz, lamb):
    """
    The Fresnel Diffraction hologram making function.
    """
    gbp = np.zeros((Ny, Nx), dtype=np.complex64)
    gfp = np.zeros((Ny, Nx), dtype=np.complex64)

    y, x = fx.shape
    gy, gx = gbp.shape

    blkdim = 32
    grid_x = gx/blkdim+1
    grid_y = gy/blkdim+1

    fx  = np.reshape(fx, y*x)
    fy  = np.reshape(fy, y*x)
    gbp = np.reshape(gbp, gy*gx)
    gfp = np.reshape(gfp, gy*gx)

    _fx  = cuda.mem_alloc(fx.size * fx.dtype.itemsize)
    _fy  = cuda.mem_alloc(fy.size * fy.dtype.itemsize)
    _gbp = cuda.mem_alloc(gbp.size * gbp.dtype.itemsize)
    _gfp = cuda.mem_alloc(gfp.size * gfp.dtype.itemsize)

    cuda.memcpy_htod(_fx, np.float32(fx))
    cuda.memcpy_htod(_fy, np.float32(fy))
    cuda.memcpy_htod(_gbp, np.complex64(gbp))
    cuda.memcpy_htod(_gfp, np.complex64(gfp))

    func = mod.get_function("kernel_fd")
    func(_fx, _fy, _gbp, _gfp, np.float32(k), np.float32(dz), np.float32(lamb), \
        np.int32(x), np.int32(y), block=(blkdim,blkdim,1), grid=(grid_x,grid_y))

    cuda.memcpy_dtoh(gbp, _gbp)
    cuda.memcpy_dtoh(gfp, _gfp)
    gbp = np.reshape(gbp, (y, x))
    gfp = np.reshape(gfp, (y, x))

    del _fx, _fy, _gbp, _gfp, fx, fy
    return gbp, gfp

Here is the error:

Traceback (most recent call last):
  File "test.py", line 806, in <module>
    running()
  File "test.py", line 210, in running
    test(data, ref)
  File "test.py", line 483, in reconstruction
    Gbp, Gfp = cuda_fd_hologram(fx, fy, Nx, Ny, k, dz, lamb)
  File "test.py", line 254, in cuda_fd_hologram
    cuda.memcpy_dtoh(gbp, _gbp)
pycuda._driver.LaunchError: cuMemcpyDtoH failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: unspecified launch failure

Finally, the output of dmesg.

[ 1753.520306] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x2e3c6000, fsynr=0x20013, cb=19, sid=16(0x10 - GPU), pgd=267e63003, pud=267e63003, pmd=d5ae5003, pte=0
[ 1753.535534] gk20a 17000000.gp10b: gk20a_fifo_handle_pbdma_intr: pbdma_intr_0(0):0x00004000 PBH: 20040718 SHADOW: 200381c5 M0: 8000001d 8004041d 80040715 00000000
[ 1753.549953] gk20a 17000000.gp10b: gk20a_set_error_notifier: error notifier set to 32 for ch 505
[ 1753.559334] gk20a 17000000.gp10b: gk20a_fifo_set_ctx_mmu_error_tsg: TSG 4 generated a mmu fault
[ 1753.568052] ---- mlocks ----

[ 1753.568079] ---- syncpts ----
[ 1753.568091] id 4 (disp_d) min 8 max 8 refs 1 (previous client : )
[ 1753.568095] id 5 (disp_e) min 1 max 1 refs 1 (previous client : )
[ 1753.568101] id 6 (disp_f) min 1 max 1 refs 1 (previous client : )
[ 1753.568105] id 7 (vblank1) min 19065 max 0 refs 1 (previous client : )
[ 1753.568119] id 18 (17000000.gp10b_507) min 7134 max 7134 refs 1 (previous client : )
[ 1753.568123] id 19 (17000000.gp10b_506) min 4 max 4 refs 1 (previous client : )
[ 1753.568130] id 24 (17000000.gp10b_504) min 12 max 12 refs 1 (previous client : 17000000.gp10b_501)

[ 1753.568538] ---- channels ----
[ 1753.568548] 
               channel 0 - 15810000.se

[ 1753.568551] NvHost basic channel registers:
[ 1753.568555] CMDFIFO_STAT_0:  00002040
[ 1753.568576] CMDFIFO_RDATA_0: 02c08a00
[ 1753.568586] CMDP_OFFSET_0:   00000000
[ 1753.568593] CMDP_CLASS_0:    00000000
[ 1753.568598] CHANNELSTAT_0:   00000000
[ 1753.568600] The CDMA sync queue is empty.

[ 1753.568606] 
               channel 1 - 15820000.se

[ 1753.568610] NvHost basic channel registers:
[ 1753.568613] CMDFIFO_STAT_0:  00002040
[ 1753.568616] CMDFIFO_RDATA_0: 28200e11
[ 1753.568618] CMDP_OFFSET_0:   00000000
[ 1753.568621] CMDP_CLASS_0:    00000000
[ 1753.568623] CHANNELSTAT_0:   00000000
[ 1753.568625] The CDMA sync queue is empty.

[ 1753.568631] 
               channel 2 - 15830000.se

[ 1753.568634] NvHost basic channel registers:
[ 1753.568636] CMDFIFO_STAT_0:  00002040
[ 1753.568639] CMDFIFO_RDATA_0: 11a20210
[ 1753.568641] CMDP_OFFSET_0:   00000000
[ 1753.568644] CMDP_CLASS_0:    00000000
[ 1753.568646] CHANNELSTAT_0:   00000000
[ 1753.568648] The CDMA sync queue is empty.

[ 1753.568665] 
               channel 3 - 15840000.se

[ 1753.568674] NvHost basic channel registers:
[ 1753.568682] CMDFIFO_STAT_0:  00002040
[ 1753.568687] CMDFIFO_RDATA_0: 20470cc4
[ 1753.568689] CMDP_OFFSET_0:   00000000
[ 1753.568692] CMDP_CLASS_0:    00000000
[ 1753.568695] CHANNELSTAT_0:   00000000
[ 1753.568697] The CDMA sync queue is empty.

[ 1753.568703] 
               ---- host general irq ----

[ 1753.568707] sync_intc0mask = 0x00000001
[ 1753.568709] sync_intmask = 0x50000003
[ 1753.568712] 
               ---- host syncpt irq mask ----

[ 1753.568715] 
               ---- host syncpt irq status ----

[ 1753.568720] syncpt_thresh_cpu0_int_status(0) = 0x00000000
[ 1753.568722] syncpt_thresh_cpu0_int_status(1) = 0x00000000
[ 1753.568725] syncpt_thresh_cpu0_int_status(2) = 0x00000000
[ 1753.568728] syncpt_thresh_cpu0_int_status(3) = 0x00000000
[ 1753.568732] syncpt_thresh_cpu0_int_status(4) = 0x00000000
[ 1753.568734] syncpt_thresh_cpu0_int_status(5) = 0x00000000
[ 1753.568737] syncpt_thresh_cpu0_int_status(6) = 0x00000000
[ 1753.568740] syncpt_thresh_cpu0_int_status(7) = 0x00000000
[ 1753.568742] syncpt_thresh_cpu0_int_status(8) = 0x00000000
[ 1753.568745] syncpt_thresh_cpu0_int_status(9) = 0x00000000
[ 1753.568748] syncpt_thresh_cpu0_int_status(10) = 0x00000000
[ 1753.568752] syncpt_thresh_cpu0_int_status(11) = 0x00000000
[ 1753.568760] syncpt_thresh_cpu0_int_status(12) = 0x00000000
[ 1753.568769] syncpt_thresh_cpu0_int_status(13) = 0x00000000
[ 1753.568777] syncpt_thresh_cpu0_int_status(14) = 0x00000000
[ 1753.568784] syncpt_thresh_cpu0_int_status(15) = 0x00000000
[ 1753.568788] syncpt_thresh_cpu0_int_status(16) = 0x00000000
[ 1753.568792] syncpt_thresh_cpu0_int_status(17) = 0x00000000
[ 1753.568797] 17000000.gp10b pbdma 0: 
[ 1753.572191] id: 4 (tsg), next_id: 4 (tsg) status: valid
[ 1753.577619] PUT: 000000010180e1f4 GET: 000000010180e190 FETCH: 000001d3 HEADER: 20040718

[ 1753.587199] 17000000.gp10b eng 0: 
[ 1753.590439] id: 2 (tsg), next_id: 4 (tsg), ctx: invalid 

[ 1753.597257] 17000000.gp10b eng 1: 
[ 1753.600477] id: 4 (tsg), next_id: 4 (tsg), ctx: invalid 


[ 1753.608950] 500-17000000.gp10b, pid 1647, refs: 2: 
[ 1753.613660]  in use idle not busy
[ 1753.617156] TOP: 800000010047f4fc PUT: 000000010047f4fc GET: 000000010047f4fc FETCH: 000000010047f4fc
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000001 0000fb90 00000004 00001004
[ 1753.630365] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x249c6000, fsynr=0x20013, cb=19, sid=16(0x10 - GPU), pgd=267e63003, pud=267e63003, pmd=21ad53003, pte=0

[ 1753.655209] 501-17000000.gp10b, pid 1647, refs: 2: 
[ 1753.659924]  in use idle not busy
[ 1753.663425] TOP: 8000000100800294 PUT: 0000000100800294 GET: 0000000100800294 FETCH: 0000000100800294
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.686176] 502-17000000.gp10b, pid 1647, refs: 2: 
[ 1753.690879] not in use idle not busy
[ 1753.694632] TOP: 8000000100c00294 PUT: 0000000100c00294 GET: 0000000100c00294 FETCH: 0000000100c00294
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.717395] 503-17000000.gp10b, pid 1647, refs: 2: 
[ 1753.722094] not in use idle not busy
[ 1753.725847] TOP: 8000000101000294 PUT: 0000000101000294 GET: 0000000101000294 FETCH: 0000000101000294
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.748603] 504-17000000.gp10b, pid 1647, refs: 2: 
[ 1753.753302]  in use idle not busy
[ 1753.756795] TOP: 8000002000600018 PUT: 0000002000600018 GET: 0000002000600018 FETCH: 0000002000600018
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00001801 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.779542] 505-17000000.gp10b, pid 1647, refs: 2: 
[ 1753.784241]  in use on_pbdma busy
[ 1753.787737] TOP: 800000010180e190 PUT: 000000010180e1f4 GET: 000000010180e190 FETCH: 0000fc010180e1f4
               HEADER: 20040718 COUNT: 00110002
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000001 02207ffc 00000002 00081004

[ 1753.810481] 506-17000000.gp10b, pid 903, refs: 2: 
[ 1753.815097]  in use idle not busy
[ 1753.818590] TOP: 8000002000140030 PUT: 0000002000140030 GET: 0000002000140030 FETCH: 0000002000140030
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00001301 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.841343] 507-17000000.gp10b, pid 903, refs: 2: 
[ 1753.845958]  in use idle not busy
[ 1753.849450] TOP: 800000200005bde0 PUT: 000000200005bde0 GET: 000000200005bde0 FETCH: 000000200005bde0
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00001201 SEMAPHORE 00000003 fc000000 3dc0b704 01100002

[ 1753.872195] 508-17000000.gp10b, pid 764, refs: 2: 
[ 1753.876806]  in use idle not busy
[ 1753.880295] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.903042] 509-17000000.gp10b, pid 764, refs: 2: 
[ 1753.907655]  in use idle not busy
[ 1753.911150] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.933909] 510-17000000.gp10b, pid 764, refs: 2: 
[ 1753.938521]  in use idle not busy
[ 1753.942017] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.964775] 511-17000000.gp10b, pid 764, refs: 2: 
[ 1753.969387]  in use idle not busy
[ 1753.972883] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[ 1753.996327] gk20a 17000000.gp10b: gk20a_fifo_handle_pbdma_intr: pbdma_intr_0(0):0x00004000 PBH: 20040718 SHADOW: 200381c5 M0: 8000001d 8004041d 80040715 00000000
[ 1754.010747] gk20a 17000000.gp10b: gk20a_set_error_notifier: error notifier set to 32 for ch 505
[ 1754.020123] gk20a 17000000.gp10b: gk20a_fifo_set_ctx_mmu_error_tsg: TSG 4 generated a mmu fault
(The same mlocks / syncpts / channels dump then repeats with essentially the same values; the notable differences are that eng 0 now reports next_id: 2 and channel 505 is reported "not in use on_pbdma busy".)

P.S. I have run this code on the TX1 without any problem… please help. ToT

Hi,

We will check this issue and update you.
Thanks.

Nice to meet you; I was waiting. First of all, thank you for your answer.
I have tried to find and apply a few things to solve this problem.
After modifying the part below, I applied the change by recompiling the kernel.

Here’s how.

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 85a1bbe..79d5ab6 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -3044,7 +3044,7 @@ static int tegra_pcie_enable_msi(struct tegra_pcie *pcie, bool no_init)
        }

        /* setup AFI/FPCI range */
-       msi->pages = __get_free_pages(GFP_DMA32, 0);
+       msi->pages = __get_free_pages(GFP_DMA, 0);
    }
    base = virt_to_phys((void *)msi->pages);

I found the above method at the link below:
https://devtalk.nvidia.com/default/topic/1002486/jetson-tx2/iommu-unhandled-context-fault-on-pci-device-dma/

In my case, the problem seems to occur when copying data back from the CUDA device to the host.
Searching dmesg for the error produces the output below (first line).
The previous link says that the iova address is incorrect.

[ 4120.307744] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x249c6000, fsynr=0x20013, cb=19, sid=16(0x10 - GPU), pgd=267d53003, pud=267d53003, pmd=222ade003, pte=0

My guess is that pointers are not being handled correctly between 32-bit and 64-bit memory addresses,
and that this needs a proper patch. Maybe this is a problem that only happens with PyCUDA,
but PyCUDA basically just links against the existing C CUDA functions, so it should not itself be the cause.

Hi,

I tried converting your code into a C-based program, but I got out of memory on both the TX1 and TX2 (size = 5184x3888).
Both platforms run properly with a smaller memory size.

Could you run the program with a smaller size to check whether you still hit the error?

I will keep trying to reproduce this via a C++ program, since it is more convenient for us to debug without the Python wrapper.
If you can reproduce this issue via C++, please let us know as well.

Thanks.

Hi, Thank you for the reply. :)

First of all, to isolate the problem, I extracted just the hologram-generation part from the original program into a small standalone program.

The TX1 was able to create images of size 2592x1944, 5184x3888, and 10368x7666.
The TX2 succeeded at 2592x1944 but failed at the larger sizes.

I posted another question on this forum, which I think is an extension of this problem. Please refer to the link below.
https://devtalk.nvidia.com/default/topic/1013916/jetson-tx2/is-the-available-global-memory-in-tx2-less-than-tx1-/

I used Python because it is easy to implement and I need complex numbers.
Below is a minimized program.

from pycuda import autoinit, driver as cuda
from pycuda.compiler import SourceModule
from scipy import misc
import numpy as np

mod = SourceModule("""
#include <cuComplex.h>

__device__ static __inline__ cuComplex _cexp(cuComplex z)
{
	cuComplex res;
	float t = expf(z.x);
	sincosf(z.y, &res.y, &res.x);
	res.x *= t;
	res.y *= t;
	return res;
}

__global__ void kernel_fresnel(float *kfx, float *kfy, float2 *kGbp, float2 *kGfp, \
	float k, float dz, float lamb, const int width, const int height)
{
	int x = blockDim.x * blockIdx.x + threadIdx.x;
	int y = blockDim.y * blockIdx.y + threadIdx.y;
	int idx = x + y*width;

	if (idx <= width*height+1) {
		float t = k*dz*sqrtf(1-powf(lamb*kfx[idx],2.0)-powf(lamb*kfy[idx],2.0));

		cuComplex _kgbp, _kgfp;
		_kgbp.x = 0;
		_kgbp.y = t;
		_kgfp.x = 0;
		_kgfp.y = -t;

		cuComplex _res_b = _cexp(_kgbp);
		cuComplex _res_f = _cexp(_kgfp);
		kGbp[idx] = _res_b;
		kGfp[idx] = _res_f;
	}
}
""")

def cuda_fd_hologram(fx, fy, Nx, Ny, k, dz, lamb):
	"""
	The Fresnel Diffraction hologram making function.
	"""
	gbp = np.zeros((Ny, Nx), dtype=np.complex64)
	gfp = np.zeros((Ny, Nx), dtype=np.complex64)

	y, x = fx.shape
	gy, gx = gbp.shape

	blkdim = 32
	grid_x = gx/blkdim+1
	grid_y = gy/blkdim+1

	fx  = np.reshape(fx, y*x)
	fy  = np.reshape(fy, y*x)
	gbp = np.reshape(gbp, gy*gx)
	gfp = np.reshape(gfp, gy*gx)
	
	_fx  = cuda.mem_alloc(fx.size * fx.dtype.itemsize)
	_fy  = cuda.mem_alloc(fy.size * fy.dtype.itemsize)
	_gbp = cuda.mem_alloc(gbp.size * gbp.dtype.itemsize)
	_gfp = cuda.mem_alloc(gfp.size * gfp.dtype.itemsize)
	
	cuda.memcpy_htod(_fx, np.float32(fx))
	cuda.memcpy_htod(_fy, np.float32(fy))
	cuda.memcpy_htod(_gbp, np.complex64(gbp))
	cuda.memcpy_htod(_gfp, np.complex64(gfp))
	
	func = mod.get_function("kernel_fresnel")
	func(_fx, _fy, _gbp, _gfp, np.float32(k), np.float32(dz), np.float32(lamb), \
		np.int32(x), np.int32(y), block=(blkdim,blkdim,1), grid=(grid_x,grid_y))

	cuda.memcpy_dtoh(gbp, _gbp)
	cuda.memcpy_dtoh(gfp, _gfp)
	gbp = np.reshape(gbp, (y, x))
	gfp = np.reshape(gfp, (y, x))
	
	del _fx, _fy, _gbp, _gfp, fx, fy
	return gbp, gfp

def main():
	PI 		= 3.1415926535897932384626433832795
	lamb 	= 430 * (1e-9)
	k 		= 2 * PI/lamb
	dz 		= 5 * (1e-4)
	delta 	= 1.1e-06

	# Image size	2592x1944(o) 	5184x3888(x)
	Nx 		= 2592
	Ny 		= 1944
	dfx = 1/(Nx*delta)
	dfy = 1/(Ny*delta)
	
	fx, fy = np.meshgrid(np.arange(-Nx/2,Nx/2)*dfx, np.arange(-Ny/2,Ny/2)*dfy)
	Gbp, Gfp = cuda_fd_hologram(fx, fy, Nx, Ny, k, dz, lamb)

	misc.imsave("Gbp.png", Gbp.astype(float))
	misc.imsave("Gfp.png", Gfp.astype(float))


if __name__ == '__main__':
	cuda.init()
	main()
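A side note on the launch configuration in the program above: under Python 2, `gx/blkdim` is integer division, but under Python 3 `/` yields a float, which PyCUDA rejects as a grid dimension. A small sketch in plain Python (`grid_dims` is a hypothetical helper, not from the posted code) of sizing the grid with ceiling division instead of an unconditional `+1`:

```python
def grid_dims(width, height, blk=32):
    """Ceiling division: enough blocks to cover the image, with no extra
    row/column of blocks when the size divides evenly. Integer '//' also
    keeps the result an int under Python 3."""
    return ((width + blk - 1) // blk, (height + blk - 1) // blk)

# The sizes discussed in this thread:
print(grid_dims(2592, 1944))   # (81, 61)
print(grid_dims(5184, 3888))   # (162, 122)
```

With this sizing, a kernel guard of the form `if (x < width && y < height)` is still needed for the non-divisible case, but no launch ever has a whole extra block when the size divides evenly.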

Hi,

We are discussing the iova address error internally.

But when I tried to reproduce this issue via PyCUDA, I hit an error while building.
(Sorry, I am not familiar with PyCUDA.)

Did you build PyCUDA with the steps on this page?
https://github.com/CoDaS-Lab/jetson-for-vision/blob/master/README.md#install-pycuda
Or could you share the steps for installing PyCUDA on the TX2?

I reset and reflashed the board today, and PyCUDA installed well.

Below are my setup steps.

  1. Flash the TX2 (JetPack 3.0) from an Ubuntu 14.04 host PC
  2. apt-get update && apt-get upgrade
  3. apt-get install python-pip
  4. Add the CUDA paths to .bashrc:

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

  5. pip install pycuda (takes about 10 minutes; the 2017.1 version)
  6. pip install --upgrade pip (optional)
  7. apt-get install scipy

Finally, I ran the posted code and it executed well, but the same iova error still appears in dmesg.

Thank you for your team’s efforts…

Hi,

Thanks for sharing, but I still hit an error when installing. ToT
It looks like there is a problem in PyCUDA's build setup.
Have you seen this before?

In file included from src/cpp/cuda.cpp:1:0:
    src/cpp/cuda.hpp:14:18: fatal error: cuda.h: No such file or directory
    compilation terminated.
    error: command 'aarch64-linux-gnu-gcc' failed with exit status 1
    
    ----------------------------------------
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-qugALW/pycuda/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-lBkr8h-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-qugALW/pycuda/

Hi,

Yes. Please see step 4 in my setup, and afterwards run source ~/.bashrc.
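For reference, the step-4 additions look like this (a sketch assuming the default JetPack CUDA 8.0 install path; the CUDA_ROOT line is an extra, optional hint and was not part of the original steps):

```shell
# ~/.bashrc additions so the PyCUDA pip build can find nvcc and cuda.h.
# Run `source ~/.bashrc` afterwards, or open a new terminal.
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Optional: PyCUDA's setup script also honors CUDA_ROOT when locating headers.
export CUDA_ROOT=/usr/local/cuda-8.0
```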

Hi,

Sorry, I tried step 4 but still hit the same error.
I will check whether this issue can be reproduced directly from a C++ memcpy.

Could you also try whether the error occurs with the following source?

from pycuda import autoinit, driver as cuda
from pycuda.compiler import SourceModule
from scipy import misc
import numpy as np

def cuda_fd_hologram(fx, fy, Nx, Ny, k, dz, lamb):
	gbp = np.zeros((Ny, Nx), dtype=np.complex64)
	gy, gx = gbp.shape

	gbp = np.reshape(gbp, gy*gx)
	_gbp = cuda.mem_alloc(gbp.size * gbp.dtype.itemsize)

	cuda.memcpy_dtoh(gbp, _gbp)
	return np.reshape(gbp, (gy, gx))


def main():
	PI 		= 3.1415926535897932384626433832795
	lamb 	= 430 * (1e-9)
	k 		= 2 * PI/lamb
	dz 		= 5 * (1e-4)
	delta 	= 1.1e-06

	# Image size	2592x1944(o) 	5184x3888(x)
	Nx 		= 2592
	Ny 		= 1944
	dfx = 1/(Nx*delta)
	dfy = 1/(Ny*delta)

	fx, fy = np.meshgrid(np.arange(-Nx/2,Nx/2)*dfx, np.arange(-Ny/2,Ny/2)*dfy)
	Gbp = cuda_fd_hologram(fx, fy, Nx, Ny, k, dz, lamb)

	misc.imsave("Gbp.png", Gbp.astype(float))

if __name__ == '__main__':
	cuda.init()
	main()

Thank you for your hard work…

I thought there might be a problem with my CUDA kernel code.
So I allocated the memory as shown below, copied it from the host to the device, and then copied the data back from the device to the host without any kernel call in between.
The result: there was no problem processing a 10368x7666 image. This process used up to 4~5 GB of memory.

I will redesign my kernel function and its block and grid settings from scratch, and if a problem remains I will post a note.

def cuda_fdd(fx, fy, Nx, Ny, k, dz, lamb):
	gbp = np.zeros((Ny, Nx), dtype=np.complex64)
	gfp = np.zeros((Ny, Nx), dtype=np.complex64)
	y, x = fx.shape
	gy, gx = gbp.shape

	# transform 2D -> 1D array
	fx = np.reshape(fx, y*x)
	fy = np.reshape(fy, y*x)
	gbp = np.reshape(gbp, gy*gx)
	gfp = np.reshape(gfp, gy*gx)
	

	_fx = cuda.mem_alloc(fx.size * fx.dtype.itemsize)
	_fy = cuda.mem_alloc(fy.size * fy.dtype.itemsize)
	_gbp = cuda.mem_alloc(gbp.size * gbp.dtype.itemsize)
	_gfp = cuda.mem_alloc(gfp.size * gfp.dtype.itemsize)

	# Copying data from host to device
	cuda.memcpy_htod(_fx, np.float32(fx))
	cuda.memcpy_htod(_fy, np.float32(fy))
	cuda.memcpy_htod(_gbp, np.complex64(gbp))
	cuda.memcpy_htod(_gfp, np.complex64(gfp))

	# cuda kernel call

	# Copying data from device to host
	cuda.memcpy_dtoh(gbp, _gbp)
	cuda.memcpy_dtoh(gfp, _gfp)

	# transform 1D -> 2D array
	gbp = np.reshape(gbp, (y, x))
	gfp = np.reshape(gfp, (y, x))

	return gbp, gfp
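As a rough cross-check of the 4~5 GB figure (plain NumPy arithmetic, assuming the host fx/fy come from meshgrid as float64 and the device copies are float32/complex64, as in the code above):

```python
import numpy as np

nx, ny = 10368, 7666            # the size used in the test above
n = nx * ny

f64 = np.dtype(np.float64).itemsize    # meshgrid output on the host
f32 = np.dtype(np.float32).itemsize    # fx/fy as uploaded to the device
c64 = np.dtype(np.complex64).itemsize  # gbp/gfp on both sides

host_bytes = 2 * n * f64 + 2 * n * c64   # fx, fy, gbp, gfp on the host
dev_bytes  = 2 * n * f32 + 2 * n * c64   # the four device allocations

total_gb = (host_bytes + dev_bytes) / 1e9
print(round(total_gb, 2))   # → 4.45
```

So the 4~5 GB usage observed is accounted for by the arrays themselves, without any leak.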

This will probably be my last answer on this question.

The initial problem was whether memory was being referenced correctly.
Through the dmesg output we confirmed that there was a problem with the iova address allocation.
The iova issue was addressed and resolved in the other postings.
We also added some options when configuring the kernel for the TX2. As a result, all issues have been resolved.

The image size was 10368x7776, so I assigned 32x32 = 1024 threads to each block, and the grid size was (x = 10368/32 = 324, y = 7776/32 = 243).
In the code before the modification, the grid size had +1 added, to cover the case where the image width or height does not divide evenly by the block size.

I thought the index guard in the CUDA kernel was keeping the extra threads from touching memory outside the image, but that guard turned out to be the problem.
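The effect of that guard can be checked without a GPU: with the old `grid = dim/32 + 1` sizing, the launch covers more threads than pixels, and the original condition `idx <= width*height+1` still admits a few out-of-range flat indices. A plain-NumPy simulation (`oob_writes` is a hypothetical helper name):

```python
import numpy as np

def oob_writes(nx, ny, blk=32):
    """Count launched threads whose flat index passes the original guard
    (idx <= nx*ny + 1) yet still lies outside the valid buffer [0, nx*ny)."""
    gx = nx // blk + 1            # the pre-fix sizing: always one extra block
    gy = ny // blk + 1
    x, y = np.meshgrid(np.arange(gx * blk), np.arange(gy * blk))
    idx = x + y * nx              # flat index exactly as the kernel computes it
    n = nx * ny
    old_guard = idx <= n + 1      # the original bounds check
    return int(np.count_nonzero(old_guard & (idx >= n)))

print(oob_writes(2592, 1944))   # → 4: four threads still write past the end
```

A guard of the form `if (x < nx && y < ny)` admits no out-of-range index, and also stops threads in the extra column of blocks from aliasing into the wrong row.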

Thanks to @AastaLLL, I was able to think this through in more detail. Thank you for your hard work.
I am a graduate student in Korea and my English is not good, but I will answer any further inquiries.

I have installed PyCUDA on a Jetson TX1,
but I get an error when I try the given example in Python that computes the product of two arrays:

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

mod = comp.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1))

print (dest-a*b)

Here is the error:

$ python3 add_Matrix.py 
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    """)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pycuda/compiler.py", line 291, in __init__
    arch, code, cache_dir, include_dirs)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pycuda/compiler.py", line 255, in compile
    return compile_plain(source, options, keep, nvcc, cache_dir, target)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pycuda/compiler.py", line 137, in compile_plain
    stderr=stderr.decode("utf-8", "replace"))
pycuda.driver.CompileError: nvcc compilation of /tmp/tmp0w1noknf/kernel.cu failed
[command: nvcc --cubin -arch sm_53 -I/home/ubuntu/.local/lib/python3.5/site-packages/pycuda/cuda kernel.cu]
[stderr:
gcc: error trying to exec 'cc1plus': execvp: No such file or directory
]
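A note on this particular failure: `gcc: error trying to exec 'cc1plus'` is gcc reporting that its C++ front end is missing, which usually means g++ is not installed; nvcc invokes the host C++ compiler, so every nvcc compilation fails this way. A sketch of the usual fix (assuming the stock Ubuntu packages):

```shell
# cc1plus ships with the g++ package; install it, then verify before retrying.
sudo apt-get update
sudo apt-get install g++
g++ --version
```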

The output of dmesg:

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 4.4.38-tegra (buildbrain@mobile-u64-638) (gcc version 4.8.5 (GCC) ) #1 SMP PREEMPT Thu Mar 1 20:44:58 PST 2018
[    0.000000] Boot CPU: AArch64 Processor [411fd071]
[    0.000000] Found tegra_fbmem: 00800000@92c9d000
[    0.000000] Reserved memory: initialized node iram-carveout, compatible id nvidia,iram-carveout
[    0.000000] Reserved memory: initialized node ramoops_carveout, compatible id nvidia,ramoops
[    0.000000] cma: Reserved 16 MiB at 0x00000000fdc00000
[    0.000000] On node 0 totalpages: 1043456
[    0.000000]   DMA zone: 8192 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 519168 pages, LIFO batch:31
[    0.000000]   Normal zone: 8192 pages used for memmap
[    0.000000]   Normal zone: 524288 pages, LIFO batch:31
[    0.000000] psci: probing for conduit method from DT.
[    0.000000] psci: PSCIv1.0 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] PERCPU: Embedded 17 pages/cpu @ffffffc0ffe0f000 s31616 r8192 d29824 u69632
[    0.000000] pcpu-alloc: s31616 r8192 d29824 u69632 alloc=17*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] CPU features: enabling workaround for ARM erratum 832075
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 1027072
[    0.183788] tegra_powergate_init: DONE
[    0.183832] DTS File Name: /dvs/git/dirty/git-master_linux/kernel/kernel-4.4/arch/arm64/boot/dts/../../../../../../hardware/nvidia/platform/t210/jetson/kernel-dts/tegra210-jetson-tx1-p2597-2180-a01-devkit.dts
[    0.183881] DTB Build time: Mar  1 2018 20:46:06
[    0.186925] tegra_bpmp_of_clk_init: EMC proxy not found.
[    0.186968] Error: Driver 'tegra-mc' is already registered, aborting...
[    0.188468] platform tegra-carveouts: SMMU isn't ready yet
...
[    0.504240] tegra-pcie 1003000.pcie-controller: 4x1, 1x1 configuration
[    0.504312] tegra-pcie 1003000.pcie-controller: failed to get PHY: -517
[    0.504340] tegra-pcie 1003000.pcie-controller: failed to get PHYs: -517
[    0.508384] tsec 54500000.tsec: initialized
...
[    7.233330]   #1: tegra-snd-t210ref-mobile-rt565x
[    7.242923] EXT4-fs (mmcblk1p1): couldn't mount as ext3 due to feature incompatibilities
[    7.253545] EXT4-fs (mmcblk1p1): couldn't mount as ext2 due to feature incompatibilities
[    7.317372] EXT4-fs (mmcblk1p1): mounted filesystem with ordered data mode. Opts: (null)
[    7.327438] VFS: Mounted root (ext4 filesystem) on device 179:33.
[    7.338215] devtmpfs: mounted
...
[    7.343410] Freeing unused kernel memory: 1208K (ffffffc001168000 - ffffffc001296000)
[    7.353284] Freeing alternatives memory: 76K (ffffffc001296000 - ffffffc0012a9000)
[    7.378921] btb inv war enabled
[    7.426487] tegra-pcie 1003000.pcie-controller: link 0 down, retrying
[    7.507784] systemd[1]: System time before build time, advancing clock.
...
[    7.696613] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read, 115 bits of entropy available)
[    7.843828] tegra-pcie 1003000.pcie-controller: link 0 down, retrying
[    7.957198] systemd[1]: Listening on udev Kernel Socket.
...
[71336.247123] gvfsd-dnssd[1751]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000006
[71336.261432] pgd = ffffffc0069cd000
[71336.268704] [00000000] *pgd=00000000869c9003, *pud=00000000869c9003, *pmd=0000000000000000

[71336.282007] CPU: 0 PID: 1751 Comm: gvfsd-dnssd Not tainted 4.4.38-tegra #1
...
[71336.406478] Library at 0x7f9af97408: 0x7f9af21000 /lib/aarch64-linux-gnu/libc-2.23.so
[71336.414348] Library at 0x4037a0: 0x400000 /usr/lib/gvfs/gvfsd-dnssd
[71336.420657] vdso base = 0x7f9b46a000
[71858.126508] IPVS: Creating netns size=1424 id=9
...

Installing pycuda

Visit
https://wiki.tiker.net/PyCuda/Installation/Linux

Download latest pycuda
https://pypi.org/project/pycuda/#files

cd pycuda-VERSION
python configure.py --cuda-root=/where/ever/you/installed/cuda

To find CUDA on your system, use:

find / -type d -name cuda 2>/dev/null

Mine was

/usr/local/cuda-10.0

Run, for Python 2:

python configure.py --cuda-root=/usr/local/cuda-10.0

or for Python 3:

python3 configure.py --cuda-root=/usr/local/cuda-10.0

Ignore the nvcc error, then run:

sudo make install

Test the installation inside the test directory with python or python3:

python3 test_driver.py

I get this

============================================================== warnings summary ==============================================================
/home/rlpl123/.virtualenvs/test_tensorrt/lib/python3.6/site-packages/_pytest/mark/structures.py:337
/home/rlpl123/.virtualenvs/test_tensorrt/lib/python3.6/site-packages/_pytest/mark/structures.py:337: PytestUnknownMarkWarning: Unknown pytest.mark.cuda - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/latest/mark.html
PytestUnknownMarkWarning,

-- Docs: https://docs.pytest.org/en/latest/warnings.html
============================================== 23 failed, 6 passed, 1 warnings in 25.77 seconds ==============================================

I have raised this issue on GitHub: https://github.com/inducer/pycuda/issues/208. If anyone has a better method to install pycuda, please share.

I use the following script to install “pycuda” on Jetson Nano. I think it should work on Jetson TX2/TX1 as well. Let me know whether it works if you have a chance to try it.

https://github.com/jkjung-avt/tensorrt_demos/blob/master/ssd/install.sh


Subject: Installing pycuda (Nvidia TX1)

Thanks. The https://github.com/jkjung-avt/tensorrt_demos/blob/master/ssd/install.sh install script worked. I just had to modify one line to get around a compilation error:

#boost_pylib=$(basename /usr/lib/${arch}-linux-gnu/libboost_python3-py3?.so)
boost_pylib=/usr/lib/aarch64-linux-gnu/libboost_python-py35.so

Error:
aarch64-linux-gnu-g++ -pthread -shared -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-aarch64-3.5/src/cpp/cuda.o build/temp.linux-aarch64-3.5/src/cpp/bitlog.o build/temp.linux-aarch64-3.5/src/wrapper/wrap_cudadrv.o build/temp.linux-aarch64-3.5/src/wrapper/mempool.o build/temp.linux-aarch64-3.5/src/wrapper/wrap_curand.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib/stubs -L/usr/local/cuda/lib64/stubs -lboost_python3-py3? -lboost_thread -lcuda -lcurand -o build/lib.linux-aarch64-3.5/pycuda/_driver.cpython-35m-aarch64-linux-gnu.so
/usr/bin/ld: cannot find -lboost_python3-py3?
collect2: error: ld returned 1 exit status
error: command 'aarch64-linux-gnu-g++' failed with exit status 1
Makefile:6: recipe for target 'all' failed
make: *** [all] Error 1
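Rather than hard-coding the path as above, you can first list which Boost.Python shared objects your rootfs actually ships and use whichever one exists (the aarch64 path is from the Ubuntu 16.04 JetPack rootfs; adjust for other releases):

```shell
# Show every installed Boost.Python variant; link against whichever name is listed
find /usr/lib -name 'libboost_python*.so*' 2>/dev/null
```

On my image this printed `libboost_python-py35.so`, which is why the `-lboost_python3-py3?` wildcard in the script found nothing.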