Paper Title
Differentiable Joint Pruning and Quantization for Hardware Efficiency
Paper Authors
Paper Abstract
We present a differentiable joint pruning and quantization (DJPQ) scheme. We frame neural network compression as a joint gradient-based optimization problem, trading off between model pruning and quantization automatically for hardware efficiency. DJPQ incorporates variational information bottleneck based structured pruning and mixed-bit precision quantization into a single differentiable loss function. In contrast to previous works which consider pruning and quantization separately, our method enables users to find the optimal trade-off between both in a single training procedure. To utilize the method for more efficient hardware inference, we extend DJPQ to integrate structured pruning with power-of-two bit-restricted quantization. We show that DJPQ significantly reduces the number of Bit-Operations (BOPs) for several networks while maintaining the top-1 accuracy of the original floating-point models (e.g., 53x BOPs reduction for ResNet18 on ImageNet, 43x for MobileNetV2). Compared to the conventional two-stage approach, which optimizes pruning and quantization independently, our scheme outperforms it in terms of both accuracy and BOPs. Even when considering bit-restricted quantization, DJPQ achieves larger compression ratios and better accuracy than the two-stage approach.
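The abstract reports compression in Bit-Operations (BOPs). As a rough illustration of how structured pruning and mixed-precision quantization jointly shrink this metric, the sketch below uses a common BOPs definition (MACs scaled by weight and activation bit-widths); the exact formulation in the paper may differ, and the function name, layer sizes, and bit-widths here are hypothetical, not taken from the paper.

```python
# Minimal sketch (not the paper's implementation): counting Bit-Operations (BOPs)
# for a single convolutional layer, assuming the common definition
#   BOPs = MACs * weight_bit_width * activation_bit_width.
# Structured pruning shrinks the channel counts, while mixed-precision
# quantization shrinks the bit-widths, so both act on the same BOP objective.

def conv_bops(c_in, c_out, k_h, k_w, out_h, out_w, b_w, b_a):
    """BOPs of one conv layer after pruning/quantization (illustrative)."""
    macs = c_in * c_out * k_h * k_w * out_h * out_w  # multiply-accumulate ops
    return macs * b_w * b_a                          # scale by bit-widths

# Hypothetical example: a 3x3 conv on a 56x56 feature map.
fp32_bops = conv_bops(64, 64, 3, 3, 56, 56, 32, 32)   # unpruned, 32-bit baseline
djpq_bops = conv_bops(48, 40, 3, 3, 56, 56, 4, 8)     # pruned channels, mixed precision
print(f"BOP reduction: {fp32_bops / djpq_bops:.1f}x")
```

Under this accounting, halving channel counts or bit-widths compounds multiplicatively, which is why a single differentiable objective over both pruning gates and bit-widths can reach large overall BOP reductions such as the 53x figure quoted for ResNet18.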