Links

GitHub: repo

Thesis: Thesis

Abstract

There are many deep machine learning models that compute depth from binocular stereo images, and many that compute optical flow from monocular video. With calibrated cameras, both can give dense 3D scene descriptions. While these approaches perform very well, they do not take advantage of the temporal coherence of stereo video. We develop and explore eight different architectures which take as input two consecutive pairs of stereo images and produce a dense disparity image. We design these while exploring two different modality combination paradigms, namely Early and Late Fusion of binocular video frames. Additionally, we perform rigorous evaluation and statistical analysis on the KITTI dataset (Mayer et al., 2016). We show that one of our Late Fusion models is at least as performant as, and possibly more performant than (albeit with weak statistical evidence), LEAStereo, a state-of-the-art stereo network. We also analyse error maps on select dataset samples and suggest improvements which could push performance even further, paving the way for future work.
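The Early vs Late Fusion distinction can be sketched with a toy example. This is not the thesis code: the array shapes, the frame ordering, and the stand-in encoder are illustrative assumptions only. The idea is that Early Fusion stacks all four frames channel-wise before any processing, while Late Fusion encodes frames separately and merges the resulting feature maps.

```python
import numpy as np

# Hypothetical setup: two consecutive stereo pairs (L_t, R_t, L_t+1, R_t+1),
# each an H x W x 3 image. Shapes are arbitrary for illustration.
H, W = 4, 6
rng = np.random.default_rng(0)
frames = [rng.random((H, W, 3)) for _ in range(4)]

def early_fusion(frames):
    # Early Fusion: concatenate all frames along the channel axis,
    # producing a single H x W x 12 input for one network.
    return np.concatenate(frames, axis=-1)

def late_fusion(frames, encoder):
    # Late Fusion: run each frame through an encoder first, then
    # merge the per-frame feature maps along the channel axis.
    features = [encoder(f) for f in frames]
    return np.concatenate(features, axis=-1)

# Stand-in "encoder": collapses channels to one feature map.
# A real model would be a learned convolutional feature extractor.
toy_encoder = lambda x: x.mean(axis=-1, keepdims=True)

early = early_fusion(frames)              # shape (H, W, 12)
late = late_fusion(frames, toy_encoder)   # shape (H, W, 4)
```

The key design trade-off the sketch exposes: Early Fusion lets the very first layers mix information across time and viewpoint, while Late Fusion keeps per-frame processing independent (and weight-sharable) until a later merge point.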