At the time of this writing, most modern ML inference workloads, especially those in production, run on NVIDIA GPUs for better performance. NVIDIA TensorRT is one such tool: it takes in a PyTorch or ONNX model, applies model optimizations, and generates TensorRT binaries to run on compatible GPUs. Some of the model optimization techniques applied here include:
- Weight and activation precision calibrations (quantization)
- Operation (op) / layer fusions
- Auto-tuning of kernels so only the best algorithms are selected to run on your specific target device (GPU)
- Multi-stream execution
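To make the first technique concrete, here is a minimal, illustrative sketch of symmetric per-tensor INT8 weight quantization, the basic idea behind TensorRT's precision calibration. This is not TensorRT code; the function names and the simple max-abs scale are assumptions for illustration (TensorRT's calibrators choose scales from activation statistics).

```python
# Illustrative sketch only, NOT TensorRT internals: symmetric
# per-tensor INT8 quantization with a max-abs scale.
def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    # Clamp to the int8 range after rounding.
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# q is [50, -127, 3]; restored is close to the original weights.
```

Calibration in practice is about choosing that scale well: a range that is too wide wastes int8 resolution, while one that is too narrow clips large values, and both show up as accuracy loss.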
Operator Annotations
Determine TensorRT compatibility on a per-operation / per-layer basis
at a glance in the model visualizer.
Quantization Workflows (coming soon)
Quantize your model with TensorRT and easily understand the tradeoffs
between accuracy and system performance.
Operator Fusion (coming soon)
Understand which operator / layer fusions are automatically applied by
TensorRT.
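To illustrate why fusion matters, here is a small, hypothetical sketch (not TensorRT's actual fusion logic): running scale, bias, and ReLU as three separate ops makes three passes over the data and materializes intermediate results, while the fused version does one pass with no intermediates. The op names here are invented for the example.

```python
# Illustrative sketch only: three unfused elementwise ops vs. one
# fused op. Each unfused op allocates an intermediate list and
# re-reads the data; the fused op touches each element once.
def scale_op(xs, s):
    return [x * s for x in xs]            # pass 1, intermediate 1

def bias_op(xs, b):
    return [x + b for x in xs]            # pass 2, intermediate 2

def relu_op(xs):
    return [max(0.0, x) for x in xs]      # pass 3

def fused_scale_bias_relu(xs, s, b):
    # Single pass: no intermediate tensors are written out.
    return [max(0.0, x * s + b) for x in xs]

data = [-2.0, 0.5, 3.0]
unfused = relu_op(bias_op(scale_op(data, 2.0), 1.0))
fused = fused_scale_bias_relu(data, 2.0, 1.0)
# Both yield [0.0, 2.0, 7.0]; the fused path avoids the extra
# memory traffic, which is what makes fusion a win on GPUs.
```

On a GPU the saving is kernel launches and global-memory round trips rather than Python lists, but the principle is the same.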

