At the time of this writing, most modern ML inference workloads,
especially those in production, run on NVIDIA GPUs for performance reasons.
NVIDIA TensorRT is one such tool: it takes in a PyTorch or ONNX model,
applies model optimizations, and generates TensorRT binaries to be run on
compatible GPUs. Some of these model optimization techniques include:
- Weight and activation precision calibration (quantization)
- Operation (op) / layer fusion
- Auto-tuning of kernels, so only the best algorithms are selected to run on
  your specific target device (GPU)
- Multi-stream execution
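To build intuition for the first technique, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization, the core idea behind precision calibration. This is illustrative NumPy code, not TensorRT's actual implementation; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a symmetric per-tensor scale.

    Illustrative only -- TensorRT's calibrator chooses scales from
    activation statistics, not just the max weight magnitude.
    """
    scale = np.abs(weights).max() / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.02, -1.5, 0.7, 1.5], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Each reconstructed weight is within half a quantization step (s / 2)
# of the original -- the precision cost traded for 4x smaller weights.
```

Storing weights as INT8 (and running INT8 math on tensor cores) is where the speed and memory wins come from; the calibration step exists to pick scales that keep this reconstruction error from degrading accuracy.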
Cellulose currently integrates basic TensorRT features within our dashboard.
These include: