Imagine my disappointment in seeing that the latter had seemingly been solved already (assuming the former is relatively simple in comparison). I was recently directed to a copy of a paper called Virtual Viewpoint Replay for a Soccer Match by View Interpolation From Multiple Cameras by Naho Inamoto and Hideo Saito, where I found not only Inamoto and Saito's work on the topic, but of course, many references to related research, as well.
Ok, so maybe it's not really such a surprise that work in this area has been done in the past -- it would be naive to think otherwise :). At first glance, this paper seems to cover a method that might be well suited to the broadcast industry, which brings it much closer to the realm of my possible thesis topic. But as we will see, there is still much room for improvement, and therefore still work available if I do end up taking this direction.
What It's All About
The basic premise of this research is to be able to watch a sports game (soccer in the case of this particular paper) from angles that weren't explicitly captured on camera. Computer vision techniques would be employed to determine what an intermediate camera between two real ones would be seeing. A user of an on-demand video system could potentially guide his own camera and see the entire game from that viewpoint, or a broadcaster could simply present a new and interesting shot of a particular play for the audience.
In the past, the methods used to synthesize these new shots fit into three categories (according to Inamoto and Saito):
- Model Based Approach. Using the geometry of the scene, a 3-D reconstruction with a synthesized texture could be created and then reprojected into the desired novel view. The accuracy of the results depends on how well the model can be constructed, which in turn depends on having many calibrated cameras. This method is not well suited to large areas like a soccer stadium.
- Transfer Based Approach. Instead of a 3-D model, morphing techniques can be employed to obtain the new viewpoint between two images, or a trifocal tensor can be used for image transfer. Either of these requires dense correspondences between the known views, so the two known images must be static or only slightly varying.
- Approach Using the Plenoptic Function. The plenoptic function "describes all the radiant energy that is perceived by an observer at any point in space and time." With this function, it is possible to allow users to arbitrarily pan and tilt a virtual camera, and the resulting shots will be based on a collection of sample images. Since the plenoptic function is seven-dimensional, though, it requires data reduction or compression in practical use. This makes it less suitable for the soccer example because it would be impossible to describe all the radiant energy in the stadium scene.
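For reference, the seven dimensions of the plenoptic function are usually written out as below. This is the common textbook form, following Adelson and Bergen, who introduced the function; the symbol names are my assumption and may not match the notation used in Inamoto and Saito's paper.

```latex
P = P(\theta, \phi, \lambda, t, V_x, V_y, V_z)
```

Here (\theta, \phi) is the viewing direction, \lambda the wavelength of light, t the time, and (V_x, V_y, V_z) the position of the observer. Sampling all seven densely over a whole stadium is what makes this approach impractical.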
Inamoto and Saito's proposed method generates images of arbitrary viewpoints "by view interpolation among real camera images." Multiple cameras capture a soccer match, and when a user chooses a new virtual viewpoint, the two or three cameras closest to it are used to find correspondences and perform view interpolation. Because the field and the soccer stadium itself are fixed, they can be considered static and processed separately from the dynamic objects (players, the ball, etc). Furthermore, much of the processing can be done ahead of time, including the determination of the camera geometry.
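To give a feel for what interpolating between two cameras means, here is a minimal sketch in Python. It assumes the matching points between the two camera images are already known and simply blends their positions linearly by a weight. The function name `interpolate_points` and the sample coordinates are my own illustration, not the paper's actual pipeline, which also relies on the camera geometry.

```python
def interpolate_points(points_a, points_b, weight):
    """Blend matched 2-D points from camera A and camera B.

    weight = 0.0 reproduces camera A, weight = 1.0 reproduces camera B,
    and values in between approximate a virtual camera between the two.
    """
    if len(points_a) != len(points_b):
        raise ValueError("each point in A needs a matching point in B")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    blended = []
    for (xa, ya), (xb, yb) in zip(points_a, points_b):
        x = (1.0 - weight) * xa + weight * xb
        y = (1.0 - weight) * ya + weight * yb
        blended.append((x, y))
    return blended


# Hypothetical pixel positions of the same two players in each camera.
camera_a = [(100.0, 200.0), (300.0, 220.0)]
camera_b = [(140.0, 210.0), (260.0, 240.0)]

# A virtual camera halfway between the two real ones.
halfway = interpolate_points(camera_a, camera_b, 0.5)
print(halfway)
```

Sliding the weight from 0 to 1 over successive frames is, roughly speaking, what moves the virtual camera from one real camera toward the other.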
Several videos that show the results of this method can be found on this research web page. In addition to simply creating new viewpoints for the user, the videos demonstrate that the method allows for effects like that seen in The Matrix, where the camera spins around stationary players.
A system built on this method was tested on a Pentium 4 3.2 GHz desktop with 2 GB of memory and an ATI Radeon 9800 graphics card. On average, the system could process 3.7 frames every second. The processing time is linear in the number of dynamic objects contained in a scene at a given time.
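To put the 3.7 frames per second in perspective, here is a quick back-of-the-envelope calculation. The 30 frames per second target is my own assumption of a typical broadcast rate, not a figure from the paper.

```python
MEASURED_FPS = 3.7    # average rate reported in the paper
BROADCAST_FPS = 30.0  # assumption: a typical broadcast frame rate

seconds_per_frame = 1.0 / MEASURED_FPS
speedup_needed = BROADCAST_FPS / MEASURED_FPS

print(f"{seconds_per_frame * 1000:.0f} ms per frame")          # 270 ms per frame
print(f"{speedup_needed:.1f}x speedup needed for real time")   # 8.1x
```

Under that assumption, the system would need to be roughly eight times faster to keep up with a live feed, and busier scenes with more players would widen the gap.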
What It Means For Me
Here I'd like to discuss some aspects of the reported research that I'm not sure would be entirely suitable for the production industry, and could thus be interesting areas for me to explore for my thesis. Some of these I have not mentioned above, but they are still part of the discussed research, and more information can be found in the associated paper.
It's worth noting that, being Canadian, I wanted to explore the application of virtual viewpoints in the context of hockey games rather than soccer. How cool would it be to put yourself in Daniel Alfredsson's skates as he flies to the net and puts the puck top shelf?! Keep this context in mind as you read the issues I raise below.
First, the proposed method requires that several cameras be fixed and calibrated ahead of time. The first problem with this is not too serious: the stadium or arena has to invest in more cameras, because almost all cameras at a sports event are constantly moving to follow the action or to switch their focus to another point of interest. This could be a reasonable investment, provided that the results are worth the extra money. As will be discussed in a moment, that certainly isn't the case at this stage.
The second issue with fixed cameras could, in my opinion, be slightly more problematic. Calibrating these cameras is manual work: broadcasting technicians have to spend time identifying corresponding points between the cameras, and identifying the static regions (such as the field or stadium in the case of soccer) in the image captured by every camera. This work takes an hour for just four cameras, and you may need more cameras than that to get a high quality result. Now, this extra work may not matter after all if you consider that once the cameras are calibrated, they don't need to be calibrated again because they are fixed. I'm just wondering whether these cameras might be taken away at the end of each game, forcing the task to be redone when they are put back before the next one. The possibility of human error each time could seriously affect the results for that game; for example, being slightly off on the point correspondences throws off the whole geometry of the cameras!
Next, consider a requirement that allows soccer players to be identified on the field. It is essential that the color of the ball and the players (including the uniform) be different from the field. It would seem that most soccer teams have jerseys that avoid this problem, but this is certainly not the case in hockey. Even my favorite hockey team wears white for away games, so it would be difficult to distinguish between the jersey and the white ice surface. I also believe the reflectivity of the ice could raise new issues when using this method for hockey. Surely the light absorbing properties of ice and turf are different, so shadows might need to be handled differently, and reflections might get in the way.
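The color problem is easy to see with a toy example. The sketch below is not the paper's segmentation method; it is a simple distance threshold that I'm using as a stand-in, and all of the RGB values and the threshold are made-up illustrative numbers.

```python
def color_distance(rgb_a, rgb_b):
    """Euclidean distance between two RGB colors (0-255 per channel)."""
    return sum((a - b) ** 2 for a, b in zip(rgb_a, rgb_b)) ** 0.5


def is_foreground(pixel, background, threshold=60.0):
    """Treat a pixel as player/ball if it differs enough from the surface."""
    return color_distance(pixel, background) > threshold


# All colors below are made-up illustrative values.
GRASS = (40, 140, 50)
ICE = (235, 240, 245)
RED_JERSEY = (200, 30, 40)
WHITE_JERSEY = (245, 245, 245)

print(is_foreground(RED_JERSEY, GRASS))    # True: red stands out on grass
print(is_foreground(WHITE_JERSEY, GRASS))  # True: white stands out on grass
print(is_foreground(RED_JERSEY, ICE))      # True: red stands out on ice
print(is_foreground(WHITE_JERSEY, ICE))    # False: white blends into ice
```

Only the white jersey on ice fails the test, which is exactly the away-game situation in hockey. A real system would presumably need something smarter than color alone in that case.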
Finally, the critical issues outlined in the paper must be addressed before such a system could be used for real broadcasting. For example, players can occasionally disappear from the scene when they have not been filmed in at least two of the cameras. Problems also occur when four or five players overlap. While this may not happen quite as often on a wide open soccer field, the smaller hockey rink would have this happen all the time!
It's pretty clear that the issues outlined here, along with general improvements to the speed and quality of the resulting virtual video, provide a lot of potential research for an enthusiastic Master's student such as myself. I'll let you all know what I end up doing for my thesis as soon as I know!