
Turing Award | A Deep Dive Into Levoy and Hanrahan’s 1996 Paper on Light Field Rendering

In the seminal 1996 paper Light Field Rendering, Levoy and Hanrahan describe a representation for light fields that allows for both efficient creation and display.

First published: Proc. ACM SIGGRAPH ’96
Link: https://graphics.stanford.edu/papers/light/light-lores-corrected.pdf

Introduction

In March 2020, Pat Hanrahan and Edwin Catmull were honoured with the 2019 Turing Award. As founding members of Pixar Animation Studios, Hanrahan and Catmull helped lay the foundations of today’s computer graphics. In the seminal 1996 paper Light Field Rendering, Levoy and Hanrahan describe a representation for light fields that allows for both efficient creation and display. This representation provides a way of generating novel rendered views of a 3-D scene using a 4-D data structure. The paper also discusses practical issues associated with working with light fields, including measuring light fields from virtual and real scenes, tackling aliasing, storing light fields, and rendering novel views of a scene from a light field model.

The associated Wikipedia page describes a “light field” as “a vector function that describes the amount of light flowing in every direction through every point in space” and then introduces 5-D and 4-D light fields in this context. These examples, however, may not be the best foundation on which to build an understanding of light field research. Thus, in this article I will try to give an introduction to light fields that provides the proper prerequisites for understanding Levoy and Hanrahan’s paper on light field rendering.

Light Fields

In order to describe a single ray of light, we must account for the following properties:

  • spectral content (which determines what we perceive as its colour),
  • position in space,
  • the direction in which it is travelling, and
  • how it changes over time.

In full detail, we can describe a single, monochromatic ray of light in terms of seven quantities, three for position (x, y, z), two for direction (θ and φ), one for wavelength (λ), and one for time (t). This is known as the plenoptic function, which describes the intensity of light rays as a function of these seven quantities and is denoted as P(θ, φ, λ, t, x, y, z).

However, for the purpose of computer vision — or visual perception in this case — we need only measure and extract information from a subset of the plenoptic function. This is because it is impractical to measure the entire spectrum of light, over all space, in all directions and over all time.

To illustrate this concept, consider the following statement: A colour photograph measures the plenoptic function over the range of frequencies occupied by visible light, for a fixed position and time, and over a finite range of directions. Otherwise stated, this photograph is the function P(θ, φ, λ, t, x, y, z) with t, x, y, z fixed. Now, since only three of the parameters (θ, φ, λ) may vary, the colour photograph can be considered a 3-D slice of the plenoptic function.
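To make the idea of slicing the plenoptic function concrete, here is a minimal Python sketch. The function `plenoptic` below is a made-up toy stand-in (not anything from the paper); it only illustrates how fixing some arguments of a 7-D function leaves a lower-dimensional slice.

```python
import math
from functools import partial

def plenoptic(theta, phi, lam, t, x, y, z):
    """Toy stand-in for the 7-D plenoptic function P(theta, phi, lam, t, x, y, z).
    Returns a made-up radiance value; a real scene would supply measured
    or rendered data instead."""
    return max(0.0, math.cos(theta)) * math.exp(-((lam - 550e-9) / 50e-9) ** 2)

# A colour photograph fixes time and viewpoint, leaving direction
# (theta, phi) and wavelength lam free: a 3-D slice of P.
photograph = partial(plenoptic, t=0.0, x=1.0, y=2.0, z=0.5)

print(photograph(0.1, 0.3, 550e-9))  # radiance along one ray of the photograph
```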

As mentioned earlier, the two light field representations found on Wikipedia are the 5-D and 4-D light fields. I will discuss the two representations next.

The 5-D light field

A 5-D slice of the plenoptic function models the human visual system. The eyes may be very roughly modelled as two cameras, at separate locations in space (typically along the x-axis), recording a sequence of images over time. In addition, we can simplify the wavelength (λ) parameter in a way that approximates how human eyes perceive colour, e.g., using an RGB (red, green, blue) representation. Mathematically, at each point in time, the function then looks something like this

  • P_RGB(θ, φ, t’, x, y, z),

where the RGB subscript on the function indicates that its value is given by a colour triplet (replacing λ), t’ is a fixed value (one for each image in the sequence), and the remaining parameters are free. This yields the 5-D light field with parameters θ, φ, x, y, and z. Note that, in the case of two human eyes, x can only be sampled at two positions. In computer graphics, there may be many cameras, giving much more freedom in the x parameter.

The 4-D light field

As stated, there is no constraint on the number of “eyes” in computer vision. In fact, we can extend this idea and add cameras separated along, say, the y-axis! In computer graphics, we can think of having an arbitrarily dense array of cameras on the x-y plane, each capturing light in all directions. At a single instant in time, this array of cameras would measure the 4-D slice of the plenoptic function given by

  • P_RGB(θ, φ, t’, x, y, z’),

where t’ is a fixed time instance, and the parameter z’ is simply not used (as the cameras lie on the x-y plane). This 4-D slice of the plenoptic function is the subset that is represented in a light field. Without loss of generality, we will refer to this “coloured” plenoptic function simply as P(θ, φ, t’, x, y, z’), without the RGB subscript, to stay consistent with the paper’s notation. This representation contains sufficient information about the light in a scene to allow a rendering system to generate 2-D images of the scene from novel camera positions. With this concept in hand, we are now ready to take a closer look at the paper.

Rendering

Rendering, in its simplest terms, is the process of converting a 3-D scene into a 2-D image that can be displayed to audiences. Animated films are the product of playing the resulting 2-D images in order. Figure 1. illustrates an example of a rendering process.

Classical techniques in computer graphics are geometry-based. That is, they generate images of 3-D scenes based on geometric models of those scenes. The complexity of these techniques depends on the complexity of the geometry within the scenes. In addition, these techniques are limited by how accurately they are able to represent the interactions between the light and a surface. This is because these traditional techniques make use of lighting and surface models which are often an approximation of reality. These two factors limit the effectiveness of geometry-based modelling, particularly in real-time applications.

Figure 1. Rendering is the process of generating an image from a model. The model can be either 2-D or 3-D.

Another approach is image-based rendering. Using this approach, rendering an image of a scene involves calculating the value of an appropriate subset of the plenoptic function for that scene. The advantage of this technique is that if one can represent the plenoptic function reasonably well it is possible to render novel views of a scene without ever concerning oneself with geometric, lighting, or surface models. The cost of interactively viewing the scene is therefore independent of scene complexity.

Although not the highlight of the paper, I would like to point out the significance of image-based rendering. Its speed is the main advantage that initially drove its popularity: novel views can be rendered quickly, independently of scene complexity. Thus, image-based rendering was also used as a means of analysis. One thing to keep in mind is that, although the rendering cost is independent of scene geometry, scenes with complex geometry will typically still require many samples of the plenoptic function in order to be accurately represented.

Paper Contribution

A novel rendering technique is presented

In their paper, the authors propose a new technique that is robust and allows much more freedom in the range of possible views. The paper states:

“The major idea behind the technique is a representation of the light field, the radiance as a function of position and direction, in regions of space free of occluders (free space). In free space, the light field is a 4-D, not a 5-D function.”

Now, what does this actually mean? Recall the previous discussion about the plenoptic function. In the case of a 5-D slice of the plenoptic function, P(θ, φ, t’, x, y, z), there is enough information to allow completely free camera motion, i.e., a camera can translate and rotate in all possible directions. Allowing complete freedom of camera motion naturally introduces redundancy. Thus, we can simplify this 5-D representation into a 4-D subset without significant penalty. This is the subset represented by the light field described by Levoy and Hanrahan.

How is this accomplished? By imposing the constraint that, in free space, the radiance along a light ray is constant everywhere along its direction of propagation. Because a ray carries the same value at every point along the line it travels, there is no need to represent it at every point on that line: storing the value once per line is sufficient, and the dimensionality can be reduced by one.

The 4-D subset of the plenoptic function represented by a light field may be expressed as P(θ, φ, t’, x, y, z’). Note that z is now fixed as well. This represents each ray in terms of its direction of propagation, as two angles, and its position on the x-y plane, as two distances.

The major issue in choosing a representation of the 4-D light field is how to parameterize the space of oriented lines (propagation paths). Several factors must be considered when choosing this parameterization.

  • Computational efficiency: computing the position of a line from its parameters should be fast. In particular, the parameters of a line should be enough to recover every point on that line cheaply.
  • Control over the set of lines: since the space of all lines is infinite, we should only keep track of a finite subset of lines that are useful. For example, in the case of viewing an object we need only lines intersecting the convex hull of the object.
  • Uniform sampling: a light field representation should ideally allow the user to move freely around an object without noticing any resolution changes in the model. This requires the representation to be invariant under rotations and translations; hence the term uniform sampling.

The solution presented by Levoy and Hanrahan is as follows. Parameterize lines by their intersections with two planes in arbitrary position. Quoting the paper, “By convention, the coordinate system on the first plane is (u, v) and on the second plane is (s, t). An oriented line is defined by connecting a point on the u-v plane to a point on the s-t plane.”

What does this mean? Since each line is defined by connecting a point on the u-v plane to a point on the s-t plane, it requires the use of two sets of local coordinates (represented by the u-v and s-t planes). The reference planes are separated by a distance d, and light rays are assumed to travel from the u-v reference plane towards the s-t reference plane. This is illustrated in Figure 2.

Figure 2. Parameterization method to represent lines by a point and a direction.
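As a concrete illustration of this two-plane parameterization, the short sketch below (my own toy code, not from the paper) converts a ray given by an origin and a direction into light-slab coordinates (u, v, s, t), assuming the u-v plane sits at z = 0 and the s-t plane at z = d:

```python
import numpy as np

def ray_to_slab_coords(origin, direction, d=1.0):
    """Intersect a ray with the plane z = 0 (the u-v plane) and the plane
    z = d (the s-t plane), returning its light-slab coordinates (u, v, s, t).
    Assumes the ray is not parallel to the reference planes."""
    origin, direction = np.asarray(origin, float), np.asarray(direction, float)
    if abs(direction[2]) < 1e-12:
        raise ValueError("ray is parallel to the reference planes")
    t_uv = (0.0 - origin[2]) / direction[2]   # ray parameter at the u-v plane
    t_st = (d - origin[2]) / direction[2]     # ray parameter at the s-t plane
    u, v = (origin + t_uv * direction)[:2]
    s, t = (origin + t_st * direction)[:2]
    return u, v, s, t

# Example: a ray starting behind the u-v plane and travelling towards the s-t plane.
print(ray_to_slab_coords(origin=[0.2, -0.1, -1.0], direction=[0.1, 0.0, 1.0]))
```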

The sampled function L(u, v, s, t) resulting from this two-plane parameterization is therefore the 4-D light field parameterization. Levoy and Hanrahan use the term light slab to denote this representation. To recap, a light field is a representation of the light flowing through a scene, and this 4-D function depends only on the position and direction of the light rays.

One way we can think about the relationship between the u-v plane and the s-t plane is as follows.

  • The light field (in fact a subset of it) is represented by L(u, v, s, t) and discretized by all possible lines between the two planes.
  • The u-v plane is subdivided into a grid, with each grid point forming a centre of projection and a rectangular subset of the s-t plane acting as the view-plane window of the associated image.
  • There is thus an image associated with each grid point of the u-v plane, and a radiance associated with each (u, v, s, t) coordinate. This describes how to form one light slab.

Once the light field is constructed, it can be used to synthesize an image from a virtual camera that does not coincide with any of the cameras on the u-v plane. A new image is formed by sampling the set of lines passing through the required viewpoint in the required directions.

In the paper, Levoy and Hanrahan discuss the creation of both virtual light fields (from rendered images) and real light fields (from digitized images). We will look at this in the next section.

Creating light fields from rendered and digitized images

Light fields generated from rendered images
To generate a virtual light field, a light slab can be created simply by rendering a 2-D array of images. Each image represents a slice of the 4-D light slab at a fixed u-v value and is formed by placing the centre of projection of the virtual camera at the sample location on the u-v plane. One must keep in mind that the x-y samples of each 2-D image must map to the same s-t samples (a minimal generation loop is sketched below). In Figures 4. and 5., two ways of visualizing a light field are illustrated.
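In rough Python, the generation loop might look like the following sketch. The `render_view` function is a hypothetical renderer (here replaced by random data so the sketch runs), and the array layout is one reasonable choice rather than the paper's file format.

```python
import numpy as np

def render_view(u, v, st_resolution):
    """Hypothetical renderer: returns an (Ns, Nt, 3) image whose pixels are
    the rays from the point (u, v, 0) on the u-v plane through a fixed
    window on the s-t plane. Random data stands in for a real renderer."""
    return np.random.rand(*st_resolution, 3)

def build_light_slab(u_samples, v_samples, st_resolution=(64, 64)):
    """Assemble L(u, v, s, t) as a 5-D array of shape (Nu, Nv, Ns, Nt, 3).
    Every rendered image must map onto the same s-t samples."""
    slab = np.zeros((len(u_samples), len(v_samples), *st_resolution, 3))
    for i, u in enumerate(u_samples):
        for j, v in enumerate(v_samples):
            slab[i, j] = render_view(u, v, st_resolution)
    return slab

slab = build_light_slab(np.linspace(-1, 1, 8), np.linspace(-1, 1, 8))
print(slab.shape)  # (8, 8, 64, 64, 3)
```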

Figure 4. In the first visualization of the light field, each image in the array represents the rays arriving at one point on the u-v plane from all points on the s-t plane.

In this visualization (Fig 4), we can think of each image individually. Each image represents one “perspective” of the object; extending this idea to an array of images gives us a series of discrete perspectives. The entire array of images makes up the light field, and each individual image records the s-t samples seen from one sample point on the u-v plane.

Figure 5. In the second visualization of the light field, each image represents the rays leaving one point on the s-t plane bound for all points on the u-v plane.

The second visualization displays an s-t array of u-v images (Fig 5). Quoting the paper, this “occurs because the object has been placed astride the focal plane, making sets of rays leaving points on the focal plane similar in character to sets of rays leaving points on the object”. This can be understood visually by considering the example in Figure 6.

Figure 6. Visualization in Figure 5. results in images that look like reflectance maps. The purple triangles represent the sample points on the camera plane used to sample the light coming from the focal plane. In this figure, the bottom ray is perturbed to a different colour than intended.

A ray bounces back from the focal plane placed at distance d from the camera plane. Because the object straddles the focal plane, the sampled ray is not exact: it is perturbed by rays leaving nearby points on the object. This is what gives the peculiar-looking images in Figure 5. To fix this, we simply move the focal plane to a position where the object of interest is in focus. This is illustrated in Figure 7.

Figure 7. By moving the focal plane to distance d’, the object is now in focus and the reflected light ray is sampled correctly.

The question now is, how is rendering done from a constructed light field? Rendering an arbitrary view of a scene from a light field model is extremely simple. First, a model of the virtual camera is used to determine the set of rays corresponding to the image pixels. Then, each ray is parameterized in terms of its points of intersection with the two reference planes. Finally, the radiance of each ray is looked up (and interpolated) in the stored light slab, giving a virtual view of the scene. A nearest-neighbour version of this lookup is sketched below.
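The sketch below reuses `slab` and `ray_to_slab_coords` from the earlier sketches and is only a minimal nearest-neighbour version; real implementations interpolate (e.g., quadrilinearly) over the four slab coordinates, and the plane extents used here are assumptions.

```python
import numpy as np

def sample_slab(slab, u, v, s, t, uv_extent=1.0, st_extent=1.0):
    """Nearest-neighbour lookup of L(u, v, s, t) in a slab of shape
    (Nu, Nv, Ns, Nt, 3); coordinates outside the slab return black."""
    Nu, Nv, Ns, Nt, _ = slab.shape

    def to_index(x, extent, n):
        idx = int(round((x + extent) / (2 * extent) * (n - 1)))
        return idx if 0 <= idx < n else None

    iu, iv = to_index(u, uv_extent, Nu), to_index(v, uv_extent, Nv)
    i_s, i_t = to_index(s, st_extent, Ns), to_index(t, st_extent, Nt)
    if None in (iu, iv, i_s, i_t):
        return np.zeros(3)
    return slab[iu, iv, i_s, i_t]

def render_novel_view(slab, eye, pixel_dirs, d=1.0):
    """For each pixel ray (eye, direction), convert to (u, v, s, t) with
    ray_to_slab_coords (defined earlier) and look up its radiance."""
    image = np.zeros((*pixel_dirs.shape[:2], 3))
    for iy in range(pixel_dirs.shape[0]):
        for ix in range(pixel_dirs.shape[1]):
            u, v, s, t = ray_to_slab_coords(eye, pixel_dirs[iy, ix], d)
            image[iy, ix] = sample_slab(slab, u, v, s, t)
    return image

# Tiny example: a 4x4 image looking towards +z from just behind the u-v plane.
ys, xs = np.meshgrid(np.linspace(-0.2, 0.2, 4), np.linspace(-0.2, 0.2, 4), indexing="ij")
dirs = np.dstack([xs, ys, np.ones_like(xs)])
print(render_novel_view(slab, eye=np.array([0.0, 0.0, -0.5]), pixel_dirs=dirs).shape)
```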

Aliasing
As mentioned, rendering from a light field is independent of the complexity of the scene, and images can therefore be generated quickly. However, it is not true that a light field with a fixed number of samples will represent all scenes equally well. A light field essentially samples the continuous-domain plenoptic function, and so the resulting sampled data may suffer from aliasing.

The effects of aliasing may be alleviated by pre-filtering using a synthetic aperture before sampling. This is done by initially oversampling along the camera-spacing dimensions, and then applying a discrete low-pass filter (this models a synthetic aperture). Levoy and Hanrahan recognized that the sampling density of the camera plane must be relatively high to avoid excessive blurriness in the reconstructed images.
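As a rough numpy illustration of this idea, the sketch below uses a simple box average over neighbouring camera positions as a crude stand-in for the synthetic-aperture filter described in the paper; the oversampling factor and array layout are assumptions carried over from the earlier sketches.

```python
import numpy as np

def prefilter_camera_plane(oversampled_slab, factor=2):
    """Average factor x factor blocks of camera positions (the first two
    axes of an (Nu, Nv, Ns, Nt, 3) slab). This box filter is a crude
    stand-in for integrating over a synthetic aperture."""
    Nu, Nv = oversampled_slab.shape[:2]
    Nu2, Nv2 = Nu // factor, Nv // factor
    trimmed = oversampled_slab[:Nu2 * factor, :Nv2 * factor]
    blocks = trimmed.reshape(Nu2, factor, Nv2, factor, *trimmed.shape[2:])
    return blocks.mean(axis=(1, 3))

# An 8x8 camera grid, oversampled 2x in each direction relative to the target
# 4x4 grid, is low-pass filtered down to 4x4.
oversampled = np.random.rand(8, 8, 16, 16, 3)
print(prefilter_camera_plane(oversampled).shape)  # (4, 4, 16, 16, 3)
```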

Pre-filtering using an aperture can be understood through the following example. Consider a camera placed on the u-v plane and in focus on the s-t plane. In this case, the filtering process corresponds to integrating over a pixel corresponding to an s-t sample, as well as integrating over an aperture equal in size to a u-v sample, as shown in Figure 9.

Figure 9. Filtering using an aperture. The centre camera is focused on the s-t plane with an aperture on the u-v plane whose size is equal to the u-v sample spacing. A film plane is placed behind the aperture. Thus, integrating over a pixel on the film plane is equivalent to integrating over an s-t region bounded by the pixel.

Some final points regarding virtual light fields must be noted. Measuring a light field from a virtual environment is essentially the process of building a database of light rays emanating from a scene. The set of rays represented by the database is determined by the parameters of the two reference planes: their size and separation determine the subset of light rays that will be represented in the light field, and thus the range of motion allowed to the virtual camera when rendering. The plane separation affects the range of ray directions (roughly, the field of view of the light field), while larger reference planes cover a larger area and thus allow more perspectives (and, in turn, more camera motion).

Light fields generated from digitized images
A great strength of light fields is their ability to represent digitized, real-world images. However, as the paper acknowledges, “digitizing the imagery required to build a light field of a physical scene is a formidable engineering problem.” This is due to the large number of images required (hundreds or thousands). Moreover, the lighting must be controlled to ensure a static light field, yet flexible enough to properly illuminate the scene, all the while staying clear of the camera to avoid unwanted shadows. This process must be automated or at least computer-assisted. Therefore, the goal is to build an apparatus capable of measuring light fields of such scenes.

Since the light field represents rays with varying positions and angles, it is necessary to measure images at multiple positions relative to the scene. Levoy and Hanrahan accomplish this by using various combinations of moving scene elements and a moving camera.

Data Compression and Decompression

Once a light field has been created, it must be stored. The most intuitive method is to generate a single binary file containing all the data in the 4-D light field array. However, this often results in large memory requirements to store these light fields. Therefore, compression techniques must be considered.

Choice of compression scheme
Light field arrays are large, with the largest example used in the paper being 1.6 GB! This means that compression techniques must be considered when creating, transmitting and displaying light fields. Several characteristics of light fields inform the selection of an appropriate compression technique.

  • Data redundancy: Compression techniques aim to reduce redundancy of data. Light fields exhibit redundancy in all 4 dimensions. We can visualize this redundancy in Figure 4., where the individual camera views representing different perspectives share much of the same information about the object.
  • Random access: Compression techniques typically introduce constraints on random access to data. For example, variable-bitrate coders may require a frame to be decoded fully before moving on to the next frame. This is a problem for light fields in the sense that the samples (e.g., individual images in the u-v plane) are dispersed in memory. An objective of the chosen compression technique therefore is that it should support low-cost random access to the individual samples.
  • Asymmetry: In the context of data compression, symmetric and asymmetric refer to the time spent on compression versus the time spent on decompression. A compression technique is considered symmetric if it takes the same amount of time to compress data as to decompress it. Levoy and Hanrahan assume that light fields are assembled and compressed ahead of time, making this an asymmetric application.
  • Computational expense: Computational cost is always a factor in engineering decisions. Levoy and Hanrahan sought a compression scheme whose decoding could be performed with low computational cost and without hardware assistance.

The compression scheme chosen by Levoy and Hanrahan is a two-stage pipeline consisting of fixed-rate vector quantization followed by entropy coding (Lempel-Ziv). This compression pipeline is displayed in Figure 11.

Figure 11. Two-stage compression pipeline. The light field is partitioned into tiles, which are encoded using vector quantization to form an array of codebook indices. The codebook and the array of indices are further compressed using Lempel-Ziv coding. Typical file sizes are shown beside each stage.

Compression pipeline
The compression pipeline is shown in Figure 11. The components of the pipeline are all well-known compression techniques, and we will briefly discuss each of them; for further details, readers may refer to standard references on data compression. First, some terminology must be clarified. A reproduction vector is called a codeword, and the set of codewords available to encode a source is called the codebook. Each encoded sample is stored as an index into the codebook; decoding simply looks up the index and outputs the corresponding codeword.

Vector quantization: The first stage of the compression pipeline is vector quantization (VQ). Before we talk about VQ, we must first understand sampling.

Sampling converts a continuous-time (band-limited) signal to a discrete-time sequence of sample values. However, since an infinite number of bits is in general required to specify each sample value, sampling alone is insufficient to create a practical digital representation of the source signal. The quantization process is a means to represent the sample values with some (finite) specified precision.

Then what is VQ? Instead of quantizing samples of a waveform individually, the samples can be gathered into vectors, which are then quantized as units. The basic idea is as follows. We start with a codebook C of reproduction vectors and define a mapping from the space of all possible input vectors to C. The quality of the resulting approximation is typically measured using mean-squared error (MSE). A visual representation of a 2-D vector quantizer is shown in Figure 12.

Figure 12. A 2-D vector quantizer. A point in R^2 is represented by the nearest reproduction vector, resulting in the bin boundaries shown.
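At its core, a vector quantizer is just a nearest-codeword search under MSE. The minimal numpy sketch below uses a random codebook and random data purely for illustration; the 48-dimensional vectors are meant to evoke the small 4-D tiles of RGB samples used in the paper, but the codebook size and training procedure here are my own assumptions, not the paper's.

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Map each input vector to the index of its nearest codeword (MSE)."""
    # Squared distances between every vector and every codeword: shape (N, K).
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def vq_decode(indices, codebook):
    """Replace each index by its codeword (the dequantization step)."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.random((256, 48))      # 256 codewords of dimension 48 (illustrative)
tiles = rng.random((1_000, 48))       # flattened light-field tiles (random here)
indices = vq_encode(tiles, codebook)  # one index per tile; fits in a byte here
approx = vq_decode(indices, codebook)
print(indices.shape, approx.shape)    # (1000,) (1000, 48)
```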

Entropy coding: The second stage of the compression pipeline is a Lempel-Ziv entropy coder. Entropy coders are designed to decrease the cost of representing high probability symbols.

To gain an intuition of entropy coding, I encourage the interested reader to look up Huffman Codes. Here I will just provide a very simple example. Consider a binary prefix code (codes over the binary field). We wish to encode 4 possible messages A, B, C, and D. What are the possible ways to encode this? Intuitively, we can use the following encoding ENC = {A↦00, B↦01, C↦10, D↦11}, where each symbol is coded into 2 bits.
Let’s assume that each symbol is equally likely, e.g., p(A) = p(B) = p(C) = p(D) = 0.25. Then, on average, the length of the code is 2 bits, since L(ENC) = 2×p(A) + 2×p(B) + 2×p(C) + 2×p(D) = 2. However, if the probability distribution of A, B, C, and D is not uniform, this encoding may not be optimal. For example, let {0.8, 0.1, 0.05, 0.05} be the probability distribution of A, B, C, and D, respectively. The average length is still 2 bits if we use the same encoding ENC = {A↦00, B↦01, C↦10, D↦11}. It turns out we can do much better. A Huffman tree is provided in Figure 13. to illustrate the idea of “decreasing the cost of representing high probability symbols.”

Figure 13. Huffman coding of the example. The symbols with the two smallest probabilities (C and D) are connected to form a node with a larger probability (0.05 + 0.05 = 0.1). The two smallest remaining probabilities (B and the merged (C, D) node) are then combined in the same way. Finally, A and the merged (B, C, D) node are connected in the final node to reach probability 1.

The coding in Figure 13. gives ENC’ = {A↦0, B↦10, C↦110, D↦111}. The average length is now L(ENC’) = 0.8×1 + 0.1×2 + 0.05×3 + 0.05×3 = 1.3 bits, which is less than simply encoding each symbol with 2 bits.
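A few lines of Python using the standard heapq module reproduce this construction and confirm the 1.3-bit average (a generic sketch of Huffman code lengths, not code from the paper):

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Return the Huffman code length of each symbol given its probability."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), [sym]) for sym, p in probs.items()]
    lengths = {sym: 0 for sym in probs}
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)   # two least probable nodes
        p2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:            # merging adds one bit to each symbol
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return lengths

probs = {"A": 0.8, "B": 0.1, "C": 0.05, "D": 0.05}
lengths = huffman_lengths(probs)
print(lengths)                                              # {'A': 1, 'B': 2, 'C': 3, 'D': 3}
print(round(sum(probs[s] * lengths[s] for s in probs), 2))  # 1.3 bits per symbol on average
```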

As the astute reader may notice, this encoding requires prior knowledge of the symbol probabilities! So-called universal codes avoid this problem and can asymptotically achieve the source entropy (under certain assumptions about the nature of the source) without knowing the symbol probabilities in advance. Lempel-Ziv codes are such codes. Readers may refer to the topic of source and entropy coding for further understanding.
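To give a flavour of how a dictionary coder works without prior statistics, here is a compact LZW encoder. LZW is just one member of the Lempel-Ziv family and is shown here only because it fits in a few lines; it is not necessarily the exact variant used by Levoy and Hanrahan.

```python
def lzw_encode(data: bytes):
    """Minimal LZW encoder: it builds its dictionary on the fly, so no prior
    knowledge of symbol probabilities is needed."""
    dictionary = {bytes([i]): i for i in range(256)}  # start with all single bytes
    phrase, output = b"", []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                  # keep extending the current match
        else:
            output.append(dictionary[phrase])   # emit the code of the longest match
            dictionary[candidate] = len(dictionary)
            phrase = bytes([byte])
    if phrase:
        output.append(dictionary[phrase])
    return output

codes = lzw_encode(b"abababababababab")
print(len(codes), codes)  # repetitive input compresses to far fewer codes than bytes
```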

Putting it all together
The data compression pipeline in Figure 11. is a two-stage procedure. The first stage uses VQ to construct a codebook together with an array of code indices. In the second stage, entropy coding is used to compress the two and combine them into a final compressed bitstream file. Decompression naturally also happens in two stages. The first stage decodes the compressed file, producing a codebook and an array of code indices. The second stage dequantizes the samples of the light field on the fly as the observer moves through the scene (a minimal sketch of this two-stage decode is given below). Sample compression results are provided by Levoy and Hanrahan in Table 1.
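The sketch below reuses the earlier `vq_decode` idea and assumes a gzip-compressed file laid out as a raw codebook followed by raw indices; this file layout, the dtypes and the shapes are my own assumptions for illustration, not the paper's actual format.

```python
import gzip
import numpy as np

def load_compressed_light_field(path, codebook_shape, index_shape):
    """Stage 1: undo the entropy coding for the whole file up front.
    Assumed layout: float32 codebook followed by uint16 tile indices."""
    raw = gzip.open(path, "rb").read()
    cb_bytes = int(np.prod(codebook_shape)) * 4
    codebook = np.frombuffer(raw[:cb_bytes], dtype=np.float32).reshape(codebook_shape)
    indices = np.frombuffer(raw[cb_bytes:], dtype=np.uint16).reshape(index_shape)
    return codebook, indices

def fetch_tile(codebook, indices, tile_coord):
    """Stage 2: dequantize lazily -- only the tiles needed for the current
    view are looked up while the observer moves through the scene."""
    return codebook[indices[tile_coord]]
```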

Table 1. Compression statistics for two light fields.

Levoy and Hanrahan note that during interactive viewing, the compressed Buddha light field is indistinguishable from the original, while the compressed lion exhibits some artifacts, but only at high magnifications. As a general rule, compression artifacts become objectionable only above ratios of about 200:1.

Conclusion

The goal of this article was to provide the reader with the background needed to understand rendering images from light fields.

Levoy and Hanrahan introduced light fields as a means of quickly rendering images of 3-D scenes from novel camera positions. A light field models the light rays permeating a scene, rather than modelling the geometry of that scene, making the cost of rendering images from a light field independent of scene complexity, and therefore extremely fast.


Author: Joshua Chou | Editor: H4O & Michael Sarazen
