Chinese Researchers Propose a Novel Context-Aware Vision Transformer (CA-ViT) for Ghost-Free High Dynamic Range Imaging


Multi-frame high dynamic range (HDR) imaging merges several low dynamic range (LDR) images captured at different exposures to produce an image with a wider dynamic range and more realistic detail. This works well when the LDR frames are perfectly aligned, but in practice camera motion and moving foreground objects often violate that assumption, causing ghosting artifacts in the reconstructed HDR result. A range of techniques, known as HDR deghosting algorithms, have been proposed to produce high-quality, ghost-free HDR images. Traditional approaches typically either align the input LDR images or reject misaligned pixels before blending in order to reduce ghosting.
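To make the ghosting problem concrete, here is a minimal pure-Python sketch of a naive per-pixel exposure merge. It is an illustration only, not the method from the paper: the gamma value of 2.2, the triangle weighting function, the exposure times, and every function name are assumptions chosen for the example.

```python
# Illustrative only: a naive per-pixel weighted merge of LDR exposures.
# GAMMA, the triangle weight, and all names below are assumptions.

GAMMA = 2.2  # assumed gamma used to linearize LDR values


def to_radiance(ldr_value, exposure_time):
    """Map an LDR value in [0, 1] to an estimate of linear scene radiance."""
    return (ldr_value ** GAMMA) / exposure_time


def weight(ldr_value):
    """Triangle weight: trust mid-tones, distrust very dark or very bright values."""
    return 1.0 - abs(2.0 * ldr_value - 1.0)


def merge_pixel(ldr_values, exposure_times):
    """Weighted average of the per-exposure radiance estimates for one pixel."""
    weights = [weight(v) for v in ldr_values]
    total = sum(weights)
    if total == 0.0:
        # Every sample is fully clipped, so fall back to a plain average.
        weights, total = [1.0] * len(ldr_values), float(len(ldr_values))
    estimates = [to_radiance(v, t) for v, t in zip(ldr_values, exposure_times)]
    return sum(w * e for w, e in zip(weights, estimates)) / total


def simulate_ldr(radiance, exposure_time):
    """Fake a camera capture: scale by exposure, clip at 1.0, apply gamma."""
    return min(1.0, radiance * exposure_time) ** (1.0 / GAMMA)


times = [0.25, 1.0, 4.0]  # hypothetical short, medium, and long exposures

# Static pixel: all three frames observe the same radiance (0.2).
static = [simulate_ldr(0.2, t) for t in times]
static_hdr = merge_pixel(static, times)

# Moving object: the middle frame observes a different radiance (0.05) here.
moving = [simulate_ldr(r, t) for r, t in zip([0.2, 0.05, 0.2], times)]
ghosted_hdr = merge_pixel(moving, times)

print(static_hdr, ghosted_hdr)
```

For the static pixel the merge recovers the true radiance. For the pixel covered by a moving object, the result is a blend that matches neither the object nor the background, which is the semi-transparent "ghost" that deghosting methods try to avoid.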

However, accurate alignment is difficult, and the benefit of HDR is reduced when useful information is discarded by imprecise pixel rejection. CNN-based learning algorithms that learn deep features in a data-driven manner have therefore been developed to tackle ghosting. Current CNN-based deghosting methods fall broadly into two groups. In the first, homography or optical flow is used to pre-align the LDR images, and a CNN then performs multi-image fusion and HDR reconstruction. However, optical flow is unreliable in the presence of occlusions and saturation, and homography cannot align moving foreground objects. The second group proposes end-to-end networks with implicit alignment modules or novel learning strategies to handle ghosting artifacts, and these achieve state-of-the-art performance.

Even so, these methods show their limits under large object motion and strong intensity changes. The cause is the inherent locality of convolution. A CNN is poorly suited to modeling long-range dependencies (such as ghosting caused by large motion) because it must stack many layers to obtain a large receptive field. In addition, because the same kernels are applied across the whole image, convolutions are content-independent and ignore long-range intensity variation between image regions. Further gains therefore call for content-dependent algorithms with long-range modeling capability.
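The receptive-field argument can be checked with a little arithmetic. The sketch below uses the standard formula for the theoretical receptive field of a stack of convolution layers; the function names and the 128-pixel displacement are hypothetical values chosen for illustration and do not come from the paper.

```python
def receptive_field(kernel_sizes, strides=None):
    """Theoretical receptive field (pixels along one axis) of stacked conv layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf = 1    # a single output pixel sees one input pixel before any layer
    jump = 1  # spacing between adjacent outputs, measured in input pixels
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


def layers_needed(target, kernel=3):
    """Smallest number of stride-1 layers whose receptive field reaches target."""
    layers = 0
    while receptive_field([kernel] * layers) < target:
        layers += 1
    return layers


# Ten stride-1 3x3 layers see 21 pixels along each axis.
print(receptive_field([3] * 10))

# Hypothetical case: an object that moved 128 pixels between exposures.
print(layers_needed(128))
```

With stride-1 3x3 kernels the receptive field grows by only two pixels per layer, so covering a displacement of this size would take dozens of layers. Self-attention, by contrast, relates every pair of positions inside its attention window in a single layer.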

Because of its stronger long-range modeling capability, the Vision Transformer (ViT) has recently attracted growing research interest. However, experimental results reveal two significant issues that prevent its direct use in HDR deghosting. First, transformers lack the inductive biases inherent in CNNs and therefore generalize poorly when trained on insufficient data, and the datasets available for HDR deghosting are limited because collecting large numbers of realistically labeled samples is prohibitively expensive.

Second, the relationships between neighboring pixels, both within a frame and across frames, are crucial for recovering local detail from multiple images, yet a pure transformer is ineffective at extracting such local context. To address both issues, the researchers propose a novel Context-Aware Vision Transformer (CA-ViT), designed with a dual-branch architecture that captures global and local dependencies simultaneously.

For the global branch, they use a window-based multi-head Transformer encoder to capture long-range context. For the local branch, they design a local context extractor (LCE) that extracts local feature maps through a convolutional block and selects the most useful features across multiple frames using a channel attention mechanism. The proposed CA-ViT thus lets local and global context work together. Building on CA-ViT, they present a new Transformer-based architecture (dubbed HDR-Transformer) for ghost-free HDR imaging. HDR-Transformer consists mainly of a feature extraction network and an HDR reconstruction network. The feature extraction network extracts shallow features and coarsely fuses them using a spatial attention module.
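The dual-branch idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the official implementation: it uses single-head attention over non-overlapping windows with no window shifting, a depthwise 3x3 filter standing in for the convolutional block, random weights, no layer normalization or MLP, and fusion of the two branches by simple addition. All function and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(x, window, wq, wk, wv):
    """Global branch (simplified): self-attention inside each window. x: (H, W, C)."""
    h, w, c = x.shape
    # Partition into (num_windows, window * window, C) groups of tokens.
    t = x.reshape(h // window, window, w // window, window, c)
    t = t.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)
    q, k, v = t @ wq, t @ wk, t @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))
    out = attn @ v
    # Undo the window partition to get back an (H, W, C) map.
    out = out.reshape(h // window, w // window, window, window, c)
    return out.transpose(0, 2, 1, 3, 4).reshape(h, w, c)


def local_context(x, kernel, w1, w2):
    """Local branch (simplified): 3x3 depthwise filter, then channel attention."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding keeps H and W
    local = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            local += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    # Channel attention: global average pool -> two linear maps -> sigmoid gate.
    pooled = local.mean(axis=(0, 1))                # shape (C,)
    hidden = np.maximum(pooled @ w1, 0.0)           # ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # shape (C,), values in (0, 1)
    return local * gate


def ca_vit_block(x, p, window=4):
    """Residual input plus both contexts; fusion by addition is an assumption."""
    g = window_attention(x, window, p["wq"], p["wk"], p["wv"])
    l = local_context(x, p["kernel"], p["w1"], p["w2"])
    return x + g + l


c = 8
params = {
    "wq": rng.normal(size=(c, c)) * 0.1,
    "wk": rng.normal(size=(c, c)) * 0.1,
    "wv": rng.normal(size=(c, c)) * 0.1,
    "kernel": rng.normal(size=(3, 3, c)) * 0.1,
    "w1": rng.normal(size=(c, c // 2)),
    "w2": rng.normal(size=(c // 2, c)),
}
features = rng.normal(size=(8, 8, c))  # stand-in for fused multi-frame features
out = ca_vit_block(features, params)
print(out.shape)
```

The attention branch relates every pair of positions inside a window in one step, while the gated convolution branch keeps the neighborhood structure that a pure transformer tends to miss. The sigmoid gate is what lets the block emphasize some feature channels over others.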

The HDR reconstruction network is built hierarchically, with the proposed CA-ViT as its basic building block. The CA-ViTs model both long-range ghosting artifacts and local pixel relationships, which helps reconstruct high-quality, ghost-free HDR images without stacking very deep convolution blocks.


The main contributions of this study can be summarized as follows:

  • They propose a new vision transformer, CA-ViT, that exploits both global and local image context dependencies and, according to the authors, significantly outperforms prior counterparts.
  • They introduce a novel HDR-Transformer that removes ghosting artifacts and reconstructs high-quality HDR images at a lower computational cost. According to the authors, it is the first Transformer-based framework for HDR deghosting.
  • They conduct extensive experiments on three representative benchmark HDR datasets, comparing HDR-Transformer against existing state-of-the-art methods.

The official code implementation of the paper is available on GitHub.

This article is a research summary written by Marktechpost Staff based on the research paper 'Ghost-free High Dynamic Range Imaging with Context-aware Transformer'. All credit for this research goes to the researchers on this project. Check out the paper and the GitHub link.

