Until now there have been few formalized methods for conducting systematic benchmarking aiming at reproducible results when it comes to computer vision algorithms. This is evident from lists of algorithms submitted to prominent datasets, authors of a novel method in many cases primarily state the performance of their algorithms in relation to a shallow description of the hardware system where it was evaluated. There are significant problems linked to this non-systematic approach of reporting performance, especially when comparing different approaches and when it comes to the reproducibility of claimed results. Furthermore how to conduct retrospective performance analysis such as an algorithm’s suitability for embedded real-time systems over time with underlying hardware and software changes in place. This paper proposes and demonstrates a systematic way of addressing such challenges by adopting containerization of software aiming at formalization and reproducibility of benchmarks. Our results show maintainers of broadly accepted datasets in the computer vision community to strive for systematic comparison and reproducibility of submissions to increase the value and adoption of computer vision algorithms in the future.