IL-NeRF: Incremental Learning for Neural Radiance Fields with Camera Pose Alignment

Motivation

Existing incremental learning methods for NeRF assume that camera poses are estimated in advance from the complete dataset. This contradicts the incremental learning setting itself, in which data arrives sequentially. IL-NeRF addresses the more practical scenario in which pre-estimated camera poses are unavailable for each incoming chunk of training data.

Challenge

Because the previous training data has been discarded, camera poses cannot simply be estimated from the incoming data alone: poses estimated in isolation will not lie in the same coordinate system as the previous camera poses, which misaligns NeRF training and causes rendering of the 3D scene to fail. Accurately estimating the camera poses of sequentially arriving data within a single shared coordinate system is therefore a crucial problem in incremental NeRF training.

Framework

First, the network \( \Theta_{t-1}^* \) of the previous NeRF is frozen. Then, incremental camera pose alignment estimates the current camera poses \( \mathcal{P}^c \) by (a) selecting optimal camera poses from the previous camera poses; (b) estimating camera poses for the incoming image data and for the images rendered from the selected camera poses; and (c) aligning the current camera poses to the previous camera coordinate system. Finally, the network \( \Theta_{t} \), the current estimated poses \( \mathcal{P}^c \), and the previous poses \( \mathcal{P}^p \) are jointly trained on the current image rays \( C^c \) and the distilled past rays \( C^p \).
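
Step (c) can be illustrated with a minimal sketch. The text above does not specify how the alignment is computed, so the following is an assumption rather than the IL-NeRF implementation: the images rendered from the selected previous poses act as anchors whose camera centers are known in both coordinate systems, and a closed-form least-squares similarity transform (Umeyama's method) maps the newly estimated poses into the previous system. The function names `umeyama_alignment` and `align_poses`, and the use of 4x4 camera-to-world matrices, are hypothetical choices for illustration.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (here, anchor camera
    centers in the new and previous coordinate systems). Assumes N >= 3
    and that the points are not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def align_poses(c2w, s, R, t):
    """Map (N, 4, 4) camera-to-world poses into the target coordinate system."""
    c2w = np.asarray(c2w, dtype=float)
    out = c2w.copy()
    out[:, :3, :3] = R @ c2w[:, :3, :3]             # rotate camera orientations
    out[:, :3, 3] = s * (c2w[:, :3, 3] @ R.T) + t   # scale, rotate, shift centers
    return out
```

In this sketch, `src` would hold the anchor camera centers as estimated alongside the incoming images, and `dst` the same anchors' centers taken from \( \mathcal{P}^p \). Applying `align_poses` to the newly estimated poses then yields an initial \( \mathcal{P}^c \) in the previous coordinate system, which the joint training stage can refine further.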

Results

Qualitative Comparison

The original NeRF demonstrates severe catastrophic forgetting, leading to the loss of early-task scene information.
In contrast, IL-NeRF is able to preserve the scene of interest throughout the training process.

    (a) Kitchen and Garden scenes in the Mip-NeRF360 dataset.

    (b) Fortress and Horns scenes in the LLFF dataset.

    (c) Pinecone and Vasedeck scenes in the NeRF-real360 dataset.

Quantitative Comparison

    Performance comparison with the baselines on PSNR, SSIM, and LPIPS. IL-NeRF outperforms the original NeRF, EWC, and NeRF-SLAM, and achieves results comparable to CLNeRF. Note that CLNeRF, NeRF, and EWC require ground-truth camera poses pre-estimated from the entire image data, whereas IL-NeRF estimates and aligns camera poses with the proposed incremental camera pose alignment module.
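
For reference, PSNR is the simplest of the three metrics to reproduce. The sketch below uses the standard definition and assumes images stored as floating-point arrays in the range [0, 1]; the function name `psnr` and the `max_val` parameter are illustrative and not taken from the IL-NeRF code. SSIM and LPIPS need a structural comparison and a learned network respectively, and are not sketched here.

```python
import numpy as np


def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Higher PSNR indicates a rendering closer to the held-out reference view.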