Paper Title
Simple means Faster: Real-Time Human Motion Forecasting in Monocular First Person Videos on CPU
Paper Authors
Paper Abstract
We present a simple, fast, and lightweight RNN-based framework for forecasting future locations of humans in first-person monocular videos. The primary motivation for this work was to design a network that can accurately predict future trajectories at a very high rate on a CPU. Typical applications of such a system would be a social robot or a visual assistance system for all, as neither can afford high compute power without becoming heavier, less power-efficient, and costlier. In contrast to many previous methods, which rely on multiple types of cues such as camera ego-motion or the 2D pose of the human, we show that a carefully designed network model relying solely on bounding boxes can not only perform better but also predict trajectories at a very high rate, while being quite small at approximately 17 MB. Specifically, we demonstrate that an auto-encoder in the encoding phase of the past information and a regularizing layer at the end boost the accuracy of predictions with negligible overhead. We experiment with three first-person video datasets: CityWalks, FPL, and JAAD. Our simple method trained on CityWalks surpasses the prediction accuracy of the state-of-the-art method (STED) while being 9.6x faster on a CPU (STED runs on a GPU). We also demonstrate that our model transfers zero-shot, or after fine-tuning on just 15% of the data, to other similar datasets and performs on par with the state-of-the-art methods on those datasets (FPL and DTP). To the best of our knowledge, we are the first to accurately forecast trajectories at a very high prediction rate of 78 trajectories per second on a CPU.
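To make the described architecture concrete, the following is a minimal, hedged sketch (not the authors' actual code) of the forward pass such a framework might use: a plain tanh-RNN encoder over past bounding boxes with an auxiliary auto-encoder (reconstruction) head, followed by an autoregressive decoder whose hidden state passes through a normalization layer, standing in for the paper's "regularizing layer", before the output head. All layer sizes, the cell type, and the choice of layer normalization are illustrative assumptions; weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # affine layer: x @ w + b
    return x @ w + b

def rnn_step(x, h, w_xh, w_hh, b_h):
    # simple tanh RNN cell (hypothetical; the paper only says "RNN-based")
    return np.tanh(x @ w_xh + h @ w_hh + b_h)

def layer_norm(x, eps=1e-5):
    # stand-in for the final "regularizing layer": normalize features
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# hypothetical sizes: past/future sequence lengths, box dim (cx, cy, w, h), hidden dim
T_in, T_out, D, H = 8, 12, 4, 32

# random, untrained parameters -- illustration of shapes only
w_xh = rng.normal(0, 0.1, (D, H))
w_hh = rng.normal(0, 0.1, (H, H))
b_h = np.zeros(H)
w_dec = rng.normal(0, 0.1, (H, D)); b_dec = np.zeros(D)   # future-box head
w_rec = rng.normal(0, 0.1, (H, D)); b_rec = np.zeros(D)   # auto-encoder reconstruction head

past = rng.normal(0, 1, (T_in, D))  # past bounding boxes

# encode the past; the reconstruction head gives the auto-encoder's auxiliary output
h = np.zeros(H)
recon = []
for t in range(T_in):
    h = rnn_step(past[t], h, w_xh, w_hh, b_h)
    recon.append(dense(h, w_rec, b_rec))

# decode future boxes autoregressively from the last observed box
x = past[-1]
future = []
for _ in range(T_out):
    h = rnn_step(x, h, w_xh, w_hh, b_h)
    h = layer_norm(h)          # regularize before the output head
    x = dense(h, w_dec, b_dec)
    future.append(x)

future = np.stack(future)
recon = np.stack(recon)
print(future.shape)  # (12, 4): T_out predicted boxes
```

Because the model consumes only four numbers per frame (a bounding box) rather than images or pose keypoints, the per-step cost is a handful of small matrix-vector products, which is consistent with the abstract's claim of high CPU throughput and a small model footprint.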