TensorRT supports quantized floating point, where floating-point values are linearly compressed and rounded to 8-bit integers. This significantly increases arithmetic throughput while reducing storage requirements and memory bandwidth. When quantizing a floating-point tensor, TensorRT must know its dynamic range, that is, the range of values that is important to represent; values outside this range are clamped when quantizing.
Dynamic range information can be calculated by the builder (this is called calibration) based on representative input data. Alternatively, you can perform quantization-aware training in a framework and import the model into TensorRT with the necessary dynamic range information.
- from NVIDIA’s documentation
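The linear compression and clamping described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not TensorRT's implementation; the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_int8(x, dynamic_range):
    """Linearly quantize float values to INT8 given a symmetric dynamic range.

    Values outside [-dynamic_range, dynamic_range] are clamped, as the
    TensorRT documentation describes. (Illustrative sketch, not TensorRT code.)
    """
    scale = dynamic_range / 127.0  # size of one INT8 step in float units
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate float values."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -2.0, 3.7, 10.0], dtype=np.float32)
q, scale = quantize_int8(x, dynamic_range=4.0)
# 10.0 lies outside the dynamic range, so it clamps to 127,
# which dequantizes back to the range boundary 4.0
```

Note how the out-of-range value 10.0 is irrecoverable after quantization: choosing a good dynamic range is exactly the problem calibration solves.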
We plan to support NVIDIA TensorRT quantization workflows. In practice, that means the Cellulose SDK will let you define a calibration dataset for your model if you want to deploy in lower-precision formats such as INT8.
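As a rough illustration of what a calibrator computes from such a dataset (this is not the Cellulose SDK API, and `calibrate_dynamic_range` is a hypothetical helper), a tensor's dynamic range can be estimated from representative input batches, for example with a percentile of absolute values:

```python
import numpy as np

def calibrate_dynamic_range(batches, percentile=99.9):
    """Estimate a symmetric dynamic range from representative input data.

    A minimal percentile-based sketch of what a calibrator does; real
    calibrators (e.g. TensorRT's entropy calibrator) use more sophisticated
    criteria to pick the clipping threshold.
    """
    abs_vals = np.concatenate([np.abs(np.asarray(b).ravel()) for b in batches])
    return float(np.percentile(abs_vals, percentile))

# Representative batches stand in for a real calibration dataset
rng = np.random.default_rng(0)
batches = [rng.standard_normal(64).astype(np.float32) for _ in range(10)]
dynamic_range = calibrate_dynamic_range(batches)
```

Using a high percentile rather than the absolute maximum trades a small amount of clamping on outliers for finer resolution over the bulk of the values.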
Read more about NVIDIA TensorRT quantization workflows here.
Have questions / need help?
Please reach out to email@example.com, and we’ll get back to you as soon as possible.