Greetings everyone. Please allow me to explain my question step by step:
Step 1. Follow the generic routine for working with the TensorRT Python bindings:
Declare a class that pairs a host buffer with its device buffer:

```python
class HostDeviceMem:
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
```
The Python debugger tells me that self.device is actually a pycuda DeviceAllocation object.
So I think I can safely assume it resides in GPU memory.
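For context, the host and device buffers in this class get sized from the engine's binding shapes. The byte-count arithmetic behind those allocations is just the following (a pure-Python sketch; `buffer_nbytes`, `shape`, and `itemsize` are my own placeholder names standing in for the binding shape and the numpy dtype's item size, not TensorRT API calls):

```python
# Sketch of the size bookkeeping used when allocating a host/device pair.
# `shape` and `itemsize` are stand-ins for the engine's binding shape and
# the numpy dtype's item size; no TensorRT API is used here.
def buffer_nbytes(shape, itemsize):
    """Bytes occupied by a contiguous buffer of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize

print(buffer_nbytes((1, 3, 224, 224), 4))  # a float32 image batch
```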
Step 2. Perform TensorRT inference the usual way:
```python
# Transfer input data to the GPU.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
# Run inference.
context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()
output = outputs[0].device
```
According to Step 1, the output is a DeviceAllocation object.
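As an aside on the `_async` calls above: nothing is guaranteed to have completed until stream.synchronize() returns, which is why out.host must not be read earlier. A toy stdlib-only model of that ordering (`ToyStream` is my own illustration, not a pycuda class):

```python
# Toy stdlib-only model of CUDA stream semantics: work enqueued on the
# "stream" is deferred, so host buffers are only valid after synchronize().
class ToyStream:
    def __init__(self):
        self._queue = []

    def enqueue(self, fn):
        self._queue.append(fn)      # async call: just record the work

    def synchronize(self):
        for fn in self._queue:      # drain in submission order
            fn()
        self._queue.clear()

host_out = [0.0]
stream = ToyStream()
# Stands in for cuda.memcpy_dtoh_async(out.host, out.device, stream):
stream.enqueue(lambda: host_out.__setitem__(0, 3.14))
before_sync = host_out[0]           # still 0.0: the "copy" has not run yet
stream.synchronize()
after_sync = host_out[0]            # now 3.14: the queued work has drained
```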
Step 3. Create a PyTorch tensor on the GPU:
```python
my_tensor = torch.tensor([1, 2, 3, 4 …])  # HUGE tensor here
my_tensor = my_tensor.cuda()
```
I believe my_tensor now lives in GPU memory.
Step 4. Calculate the cross-entropy loss using the PyTorch API:
```python
criterion = nn.CrossEntropyLoss().cuda()
loss = criterion(output, labels)
```
These two lines of code fail due to incompatible types (DeviceAllocation vs. Tensor), which is expected.
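For reference, what nn.CrossEntropyLoss computes per sample is log-softmax followed by negative log-likelihood. A minimal pure-Python sketch of that formula (single sample, no batching; `cross_entropy` is my own helper name, not the PyTorch API):

```python
import math

def cross_entropy(logits, label):
    """-log(softmax(logits)[label]) for one sample, computed stably."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

print(cross_entropy([2.0, 1.0, 0.1], 0))
```

nn.CrossEntropyLoss then averages this quantity over the batch by default.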
Here comes the question: is there any way to manipulate the DeviceAllocation object, or convert it into a GPU Tensor, so that I can run my own logic on it?
Someone might suggest copying the DeviceAllocation back to host memory and continuing from there, but I suspect that would cause a terrible performance hit when the tensor is huge. After some fruitless online searching, I think this is the best place to ask.
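To be concrete about what I mean by "manipulate without copying": I am hoping for a reinterpretation of the same memory, not a transfer. On the host side, the analogous trick with only the stdlib looks like this (a CPU-only analogy of my own; DeviceAllocation itself exposes its raw address via int(dev_alloc), if I read the pycuda docs correctly):

```python
# CPU analogy of the zero-copy conversion I'm after: wrap an existing
# buffer's raw pointer in a new view object instead of copying the data.
import ctypes

buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
addr = ctypes.addressof(buf)                    # raw address, like int(DeviceAllocation)
view = (ctypes.c_float * 4).from_address(addr)  # new object over the same memory

view[0] = 42.0   # write through the view...
print(buf[0])    # ...and the original buffer sees it: no copy happened
```

What I am looking for is the GPU equivalent of `from_address`: some way to hand that raw device pointer to PyTorch as a Tensor.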
Thank you so much for any hint or help. :)