🎬 Neural Radiance Field

CS 280A Project - UC Berkeley

Part 0: Camera Calibration & 3D Scanning

This section documents the camera calibration and 3D object scanning pipeline using ArUco markers for robust pose estimation.

Camera Calibration Results

Calibration Images:
80+ images with varying angles and distances
ArUco Tags:
4×4 dictionary (DICT_4X4_50)
Calibration Method:
cv2.calibrateCamera() with multiple views

3D Object Scan

Captured 80+ images of the target object (Pou) with consistent zoom and lighting.

Camera Frustum Visualization 1
Camera Frustums - View 1
Camera Frustum Visualization 2
Camera Frustums - View 2

Pose Estimation

Used PnP (Perspective-n-Point) to solve for camera poses from detected ArUco markers.

Key Implementation Details

✓ Robust detection with frame skipping for failed detections

✓ Camera-to-world matrix inversion: c2w = inv(w2c)

✓ Undistortion using cv2.undistort() with optimal camera matrix

✓ Principal point adjustment for cropped regions

Part 1: Neural Field for 2D Images

Training a neural field to fit 2D images using sinusoidal positional encoding and MLPs.

Model Architecture

Input Dimension:
2D pixel coordinates (normalized to [0, 1])
Positional Encoding:
L_freq = 10, Output: 42D vector (2 + 2×10×2)
Hidden Layers:
4 layers with varying widths (128, 256, 256, 128)
Activation:
ReLU, final layer: Sigmoid for [0, 1] output
Optimizer:
Adam with lr = 1e-2
Training Iterations:
3000 iterations, batch size = 10,000 pixels
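The sinusoidal encoding with the dimensions listed above can be sketched as follows: the raw 2D coordinate is concatenated with sin/cos features at L = 10 frequencies, giving the 42D vector (2 + 2 × 10 × 2).

```python
import torch

def positional_encoding(x, L=10):
    """x: (N, d) coords in [0, 1] -> (N, d + 2*L*d): raw coords plus
    sin/cos features at L octave-spaced frequencies."""
    feats = [x]
    for i in range(L):
        feats.append(torch.sin((2.0 ** i) * torch.pi * x))
        feats.append(torch.cos((2.0 ** i) * torch.pi * x))
    return torch.cat(feats, dim=-1)

coords = torch.rand(4, 2)
enc = positional_encoding(coords)   # (4, 42) for d = 2, L = 10
```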

Training Progression - Fox

Iteration 0
Iter 0
Iteration 300
Iter 300
Iteration 600
Iter 600
Iteration 999
Iter 999 (Final)

Training Progression - My Cat!

Results on cat

Cat - Iteration 0
Iter 0
Cat - Iteration 999
Iter 999 (Final)

Hyperparameter Study: 2×2 Grid

L=4, Width=128
L=4, Width=128
Limited frequency, small model
L=4, Width=256
L=4, Width=256
Limited frequency, larger model
L=10, Width=128
L=10, Width=128
High frequency, small model
L=10, Width=256
L=10, Width=256
High frequency, large model (Best)

PSNR Training Curve

2D Training PSNR Curve
Peak PSNR: ~33 dB at convergence

Part 2: Neural Radiance Field

Training a full 3D NeRF on the provided multi-view Lego images and on Pou (my own captures).

Implementation Details

2.1 - Ray Generation

Pixel to Ray:
Convert 2D image coordinates to 3D rays in world space
Formula:
ray_origin = c2w[:3, 3], ray_direction from pixel coordinates through intrinsic matrix K
Offset:
Added 0.5 to pixel coordinates for pixel center sampling
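The three items above can be combined into one routine. A sketch, assuming the OpenCV convention (camera looks down +z); datasets in the Blender convention flip the sign of the y/z axes, so treat the directions here as convention-dependent. The check below uses an identity pose and a ray through the principal point.

```python
import numpy as np

def pixels_to_rays(uv, K, c2w):
    """uv: (N, 2) pixel coords -> world-space ray origins and unit directions."""
    uv = uv + 0.5                                    # sample pixel centers
    pix = np.concatenate([uv, np.ones((len(uv), 1))], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T              # back-project through K
    dirs_world = dirs_cam @ c2w[:3, :3].T            # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    return origins, dirs_world

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
# Pixel (49.5, 49.5) + 0.5 lands on the principal point (50, 50).
o, d = pixels_to_rays(np.array([[49.5, 49.5]]), K, np.eye(4))
```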

2.2 - Point Sampling

Uniform Sampling:
t = np.linspace(near, far, n_samples) with near = 2.0, far = 6.0, n_samples = 64
Perturbation:
Added random jitter to sample depths during training so the network is queried at continuous depths rather than a fixed grid
3D Points:
points = ray_origin + ray_direction * t
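Concretely, the sampling step above looks like this (two dummy rays along +z stand in for real ones):

```python
import numpy as np

near, far, n_samples = 2.0, 6.0, 64
t = np.linspace(near, far, n_samples)              # fixed depths, (64,)

# Training-time perturbation: shift each sample within its bin width.
rng = np.random.default_rng(0)
t_train = t + rng.uniform(0.0, (far - near) / n_samples, size=t.shape)

# points: (n_rays, n_samples, 3) = origin + direction * t
origins = np.zeros((2, 3))
dirs = np.broadcast_to(np.array([0.0, 0.0, 1.0]), (2, 3))
points = origins[:, None, :] + dirs[:, None, :] * t_train[None, :, None]
```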

2.3 - Data Loading

Ray Sampling:
10,240 rays per batch from ~100 training images
Multi-image Sampling:
Global random sampling across all images
Validation:
10 validation images for PSNR monitoring

2.4 - NeRF Network Architecture

Position Encoding:
L_pos = 10 (63D vector)
Direction Encoding:
L_dir = 4 (27D vector)
Hidden Dimension:
256 neurons per layer
Output:
Density (σ) + RGB Color (c)

Network Architecture Diagram

NeRF Network Architecture
NeRF MLP architecture with positional encoding and dense connections
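A simplified sketch of the two-headed network with the dimensions from the table (63D position encoding, 27D direction encoding, 256 hidden units). This omits the skip connection shown in the diagram and collapses the trunk to two layers; it only illustrates how density depends on position alone while color also sees the view direction.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Simplified NeRF MLP: density head from position features only,
    color head conditioned on the encoded view direction."""
    def __init__(self, pos_dim=63, dir_dim=27, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)
        self.feat = nn.Linear(hidden, hidden)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())   # RGB in [0, 1]

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))         # density >= 0
        rgb = self.rgb_head(torch.cat([self.feat(h), d_enc], dim=-1))
        return sigma, rgb

sigma, rgb = TinyNeRF()(torch.rand(5, 63), torch.rand(5, 27))
```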

2.5 - Volume Rendering

Rendering Equation:
C(r) = Σ T_i * α_i * c_i
Transmittance (T_i):
Probability ray survives to sample i
Alpha (α_i):
α_i = 1 - exp(-σ_i * Δt)
Color (c_i):
RGB color predicted by network
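The rendering equation above can be written directly as a batched compositing function. A sketch assuming per-ray tensors of densities, colors, and sample spacings; the exclusive cumulative product implements T_i, and the small epsilon guards against a zero product.

```python
import torch

def volume_render(sigma, rgb, deltas):
    """sigma: (R, S), rgb: (R, S, 3), deltas: (R, S) sample spacings Δt."""
    alpha = 1.0 - torch.exp(-sigma * deltas)           # α_i per sample
    # T_i: probability the ray reaches sample i unoccluded (exclusive cumprod).
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                            # T_i * α_i, (R, S)
    return (weights[..., None] * rgb).sum(dim=-2)      # C(r), (R, 3)

# Sanity check: an opaque first sample dominates the composited color.
sigma = torch.tensor([[1e4, 1e4]])
rgb = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
c = volume_render(sigma, rgb, torch.ones(1, 2))
```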

Visualization: Rays & Samples

Rays and Samples Visualization
Rays from training cameras with sample points (black dots)

Training Progression

Lego Iteration 0
Iter 0 (Random)
Lego Iteration 500
Iter 500
Lego Iteration 1000
Iter 1000
Lego Iteration 4999
Iter 4999 (Final)

Validation PSNR Curve

Lego PSNR Curve
Peak PSNR: ~27.5 dB after 5000 iterations

Novel View Synthesis - Spherical Video

Novel views rendered from unseen camera poses on a circular trajectory:

Lego Novel Views GIF
Rendered from 60 unseen test camera poses
NeRF Quality Metrics
27.5 dB PSNR

Successfully exceeded 23 dB target for full credit

Part 2.6: Training with Your Own Data

Custom NeRF training on "Pou" using the dataset captured in Part 0.

Dataset Information

Object:
Pou
Training Images:
~80 images (undistorted)
Image Resolution:
1280×1707 (resized to 320×426 for training)
Near/Far Planes:
near=0.35, far=1.55 (adjusted for smaller object)

Hyperparameter Changes

Parameter             Lego (Reference)   Pou (Our Object)   Reason
near / far            2.0 / 6.0          0.35 / 1.55        Different camera positions
n_samples             64                 64                 Same for quality
Image Resolution      200×200            320×426            Better detail capture
Learning Rate         5e-4               5e-4               Standard NeRF settings
Training Iterations   5000               5000               Same for detail learning

Key Implementation Changes

1. Adaptive Near/Far Planes

Computed optimal near/far from actual camera pose distribution rather than hardcoding
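One way this can be done is sketched below, under the assumption that the scanned object sits near the world origin (reasonable for this marker-based setup); the `margin` slack term is hypothetical, not a value from the writeup.

```python
import numpy as np

def adaptive_near_far(c2ws, margin=0.25):
    """Estimate near/far from each camera's distance to the world origin,
    assuming the object is centered there. `margin` is illustrative slack."""
    dists = np.linalg.norm(np.asarray(c2ws)[:, :3, 3], axis=-1)
    return max(dists.min() - margin, 0.05), dists.max() + margin

# Two example poses at distances 1.0 and 1.2 from the origin.
p1, p2 = np.eye(4), np.eye(4)
p1[:3, 3] = [1.0, 0.0, 0.0]
p2[:3, 3] = [0.0, 0.0, 1.2]
near, far = adaptive_near_far(np.stack([p1, p2]))
```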

2. Increased Image Resolution

Trained at higher resolution (320×426) for better detail preservation

3. Better Undistortion

Applied optimal camera matrix with ROI cropping to handle lens distortion

4. Training Loss Tracking

Added loss history to checkpoint files for debugging and visualization

Training Loss Over Iterations

Pou Training Loss Curve
Converges after ~4000 iterations; MSE Loss: 0.05 → 0.005

Intermediate Renders During Training

Iteration 0
Iter 0 (Random)
Iteration 400
Iter 400
Iteration 600
Iter 600
Iteration 4999
Iter 4999 (Final)

Novel View GIF - Camera Circling Object

Pou Novel Views GIF
60 frames of unseen camera poses circling the object

Training Discussion

The custom object training presented unique challenges compared to the lego scene:

  • Scale Adaptation: The dramatic difference in object size required retuning near/far planes by ~4×
  • Lighting Variability: Small variations in capture lighting resulted in artifacts
  • Background Noise: Initial training failed with cluttered backgrounds. Using a plain background was essential
  • Camera Calibration: Precise undistortion was critical—errors propagated through the entire pipeline

Bells & Whistles: Depth Map Rendering

Rendered depth maps for the Lego scene by compositing per-point depths instead of colors in the volume rendering equation.

Depth Rendering Implementation

Modified Volume Rendering:

Instead of: C = Σ T_i * α_i * c_i

We compute: D = Σ T_i * α_i * t_i

Where t_i is the sample distance along the ray.
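As a sketch, the change amounts to reusing the color-compositing weights on the sample depths:

```python
import torch

def render_depth(sigma, t_vals, deltas):
    """Same weights T_i * α_i as color rendering, applied to depths t_i."""
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    return (trans * alpha * t_vals).sum(dim=-1)        # D(r), (R,)

# Sanity check: an opaque surface at t = 2.5 yields depth ~2.5.
depth = render_depth(torch.tensor([[1e4, 1e4]]),
                     torch.tensor([[2.5, 3.0]]),
                     torch.ones(1, 2))
```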

Depth Video Results

Visualized in grayscale:

Lego Depth Map GIF
Depth map video showing geometric structure of Lego scene

Depth Map Details

Depth Range:
2.0 to 6.0 units
Colormap:
Grayscale
Frames:
60 test camera poses

Project Summary

🎯 Objectives Achieved

✅ 2D Image Fitting

Successfully trained neural fields on 2D images with sinusoidal positional encoding, achieving 30+ dB PSNR.

✅ 3D NeRF Training

Implemented full multi-view NeRF pipeline on Lego dataset, reaching 27.5 dB PSNR target with novel view synthesis.

✅ Custom Object Capture

Created complete camera calibration pipeline and trained NeRF on personal object dataset with novel view rendering.

📊 Technical Skills

• PyTorch neural network implementation

• Camera calibration and pose estimation

• 3D coordinate transformations

• Volume rendering equations

• GPU optimization and batching

🎬 Results

• 2D PSNR: ~33 dB

• Lego PSNR: 27.5 dB

• Novel view quality: Excellent

• Depth rendering: ✓ Complete

• Training time: ~40 minutes for 5000 iters