Links
GitHub: repo
Thesis: Thesis
Abstract
There are many deep machine learning models that compute depth from binocular stereo
images, and many that compute optical flow from monocular video. With
calibrated cameras, both can give dense 3D scene descriptions. While these approaches
perform very well, they do not take advantage of the temporal coherence of stereo video.
We develop and explore eight architectures that take two consecutive
pairs of stereo images as input and produce a dense disparity image. We design these
architectures around two modality-combination paradigms, namely Early and Late Fusion
of binocular video frames. Additionally, we perform rigorous evaluation and statistical
analysis on the KITTI dataset (Mayer et al., 2016). We show that one of our Late Fusion
models performs at least as well as, and possibly better than, LEAStereo, a
state-of-the-art stereo network, albeit with weak statistical evidence. We also analyse
error maps on selected dataset samples and suggest improvements that could push
performance further, paving the way for future work.
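To make the two fusion paradigms concrete, here is a minimal numpy sketch (not the thesis code; array sizes, the `encode` stand-in, and channel counts are illustrative assumptions). Early Fusion stacks all four input frames at the channel level before any processing; Late Fusion encodes frames separately and merges the resulting features deeper in the pipeline.

```python
import numpy as np

# Hypothetical example: two consecutive stereo pairs, each frame H x W x 3.
H, W = 4, 8
rng = np.random.default_rng(0)
names = ("left_t0", "right_t0", "left_t1", "right_t1")
frames = {name: rng.random((H, W, 3)) for name in names}

# Early Fusion: concatenate all four frames along the channel axis,
# so a single network sees one 12-channel input from the very first layer.
early_input = np.concatenate([frames[n] for n in names], axis=-1)
# early_input.shape == (H, W, 12)

def encode(x, out_channels=16):
    # Stand-in for a learned per-frame encoder: a fixed random projection
    # mapping the 3 input channels to `out_channels` feature channels.
    w = np.random.default_rng(42).standard_normal((x.shape[-1], out_channels))
    return x @ w

# Late Fusion: run each frame through its own encoder first, then merge
# the feature maps; the combination happens at the feature level instead.
late_fused = np.concatenate([encode(frames[n]) for n in names], axis=-1)
# late_fused.shape == (H, W, 4 * 16)
```

The trade-off sketched here is the one the abstract's two paradigms explore: Early Fusion lets the network correlate raw pixels across views and time from the start, while Late Fusion keeps per-frame processing independent and defers the combination to learned features.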