Quantization is a common way to reduce the demand on hardware.
When the activations are quantized, the number of MAC operations is vastly reduced, resulting in better latency and energy consumption. Weight quantization, on the other hand, decreases both the memory footprint and the number of MAC operations, which also helps with area reduction.
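As a rough illustration of the memory side, consider the weights of a single dense layer; the layer size and bit widths below are made up for this sketch and are not taken from the experiments discussed later:

```python
# Illustrative only: weight storage for a hypothetical 256-to-128 dense layer
# at different precisions.
num_weights = 256 * 128

for name, bits in [("FP32", 32), ("8-bit", 8), ("4-bit", 4)]:
    kib = num_weights * bits / 8 / 1024  # bits -> bytes -> KiB
    print(f"{name:>5}: {kib:.1f} KiB")   # FP32: 128.0, 8-bit: 32.0, 4-bit: 16.0
```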
To obtain independent quantization of the trainable parameters, the QKeras library is used. Mathematically, the mantissa quantization for a given input x follows the definition in [3].
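A minimal QKeras sketch of this idea, assuming a toy fully connected model; the bit widths and quantizer settings here are placeholders chosen for illustration, not the configuration used in the experiments:

```python
import tensorflow as tf
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

# Toy model: weights and activations receive independent quantizers.
# The 4-bit/8-bit choices are placeholders, not the paper's settings.
inputs = tf.keras.Input(shape=(32 * 32 * 3,))
x = QDense(
    64,
    kernel_quantizer=quantized_bits(bits=4, integer=0, alpha=1),  # 4-bit weights
    bias_quantizer=quantized_bits(bits=4, integer=0, alpha=1),
)(inputs)
x = QActivation(quantized_relu(bits=4))(x)                        # 4-bit activations
outputs = QDense(
    10,
    kernel_quantizer=quantized_bits(bits=8, integer=0, alpha=1),  # 8-bit weights
    bias_quantizer=quantized_bits(bits=8, integer=0, alpha=1),
)(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

Passing separate kernel, bias, and activation quantizers is what lets each group of trainable parameters, and each activation, be quantized independently.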
Previous studies have examined 8-bit quantization schemes and other fixed, lower-precision levels [4]. Experiments have been conducted using a lightweight network on the CIFAR10 dataset [5].
Adapting an intra-layer mixed quantization training technique for both weights and activations, with respect to layer sensitivities, a memory reduction of 2x/8x and a MAC operation reduction of 2x/30x can be achieved compared to the 8-bit/FP32 counterparts, while sacrificing virtually no accuracy against the 8-bit model and around 2% against the FP32 model.
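One way such a sensitivity-driven bit-width assignment could be wired up is sketched below; the sensitivity scores, thresholds, and layer names are invented for illustration and are not the procedure used in the referenced work:

```python
from qkeras import quantized_bits

# Hypothetical per-layer sensitivity scores (higher = more accuracy-critical).
# In practice these would come from a sensitivity analysis, not be hard-coded.
layer_sensitivity = {"conv1": 0.9, "conv2": 0.4, "conv3": 0.2, "dense": 0.7}

def pick_bits(sensitivity):
    """Map a sensitivity score to a bit width (thresholds are illustrative)."""
    if sensitivity > 0.8:
        return 8
    if sensitivity > 0.3:
        return 4
    return 2

# Build a per-layer weight quantizer from the chosen bit widths.
quantizers = {
    name: quantized_bits(bits=pick_bits(s), integer=0, alpha=1)
    for name, s in layer_sensitivity.items()
}
# e.g. quantizers["conv1"] is an 8-bit quantizer, quantizers["conv3"] a 2-bit one.
```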