The research web page has more info. I haven’t read the paper yet, but it seems to infer 3D information from just a single 2D image, which is a hard problem.
The video is pretty carefully navigated to avoid the black (or stretched-out) areas where there isn’t enough visual information to render anything (as you’d see if you tried to go around to the side of the building that wasn’t included in the photo). Otherwise that effect would happen quite a bit. Still, this is useful for navigating slightly around a 3D-ified image, and I imagine that if you had multiple images from multiple sides of an object, it might be enough to stitch together a sort of reasonable “convex hull” of the object.
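To make that last idea concrete: if each photo yielded even a rough 3D point cloud of the visible surface, you could merge the clouds and take their convex hull to get a watertight (if crude) envelope of the object. This is just a sketch of that notion using `scipy.spatial.ConvexHull`, with made-up point coordinates standing in for the per-view reconstructions; it is not what the paper actually does.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical point clouds "reconstructed" from two photos of the same
# unit cube, taken from opposite sides. The coordinates here are
# invented purely for illustration.
front_view = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
back_view = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)

# Merge the per-view clouds and take their convex hull: a rough,
# watertight envelope of the object (it fills in concavities, of course).
points = np.vstack([front_view, back_view])
hull = ConvexHull(points)

print(hull.volume)        # enclosed volume of the hull
print(len(hull.vertices)) # number of points on the hull boundary
```

A convex hull would over-fill any concave parts of a real object, so this would only ever be a coarse approximation, but it matches the “convex hull” intuition above.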