Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild (CVPR 2020 Best Paper)

Introduction

This paper reconstructs deformable 3D objects from single raw images by exploiting a symmetry cue, with no supervision.

At its core is an autoencoder that decomposes the input image into depth, albedo, viewpoint, and illumination, trained end to end.

It works from the premise that "most objects have a symmetric structure, at least in principle".

Illumination is used as a reasoning cue, so the underlying symmetry can reportedly be recovered even when the appearance is not perfectly symmetric because of shading and similar effects. Exactly how this is made possible is something I still need to look up.

Deformable 3D Objects?

Conventional (or classical) DNNs treat an image as a 2D texture.

Interpreting an image through 3D modeling can explain far more natural images, and it also supplies much more of the high-level information that image understanding requires.

Experimental Conditions

No GT. There is no ground truth for any 2D or 3D information such as viewpoint, keypoints, segmentation, or depth maps (without external supervision), in order to remove the annotation obstacle.

An unconstrained collection of single-view input images.

>> Estimating a 3D object model from unconstrained single images with no supervision at all?? >> Ill-posed. The minimal assumption that makes the problem tractable → 'symmetry'

What can symmetry give us?

>> If an object is assumed to be perfectly symmetric, a virtual second view can be obtained simply by mirroring the single view. If the correspondences between this mirrored pair are known, 3D reconstruction becomes possible through stereo vision.

>> Building on this property, symmetry is used as a kind of geometric cue for the decomposition.

The problem is that for real objects symmetry is broken by many factors such as albedo, pose, shape, and illumination. To address this:

explicitly model illumination to exploit the underlying symmetry - illumination is used as an additional cue

the model is made to learn the potential lack of symmetry. As a result, the model learns (along with the other factors) a dense map that expresses, for each pixel of the object, the probability that it has a symmetric counterpart in the image.

Related Work

>> First, related works that predict 3D models in an unsupervised manner

Learning single-image 3D reconstruction by generative modelling of shape, pose and shading (IJCV 2019)

Lifting autoencoders: Unsupervised learning of a fully-disentangled 3D morphable model using deep non-rigid structure from motion (ICCVW 2019)

Unsupervised generative 3D shape learning from natural images (arXiv 2019)

Methods

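Photo-geometric autoencoding

The starting point is to define the image I as a function on a grid $\Omega$:

$I: \Omega \to \mathbb{R}^3$, $\Omega = \{0, \dots, W-1\} \times \{0, \dots, H-1\}$

The image is assumed to be roughly centered on the object of interest.

The goal, in the end, is to find a function $\Phi$ that maps I to four factors: a depth map d, an albedo image a, a global light direction l, and a viewpoint $w \in \mathbb{R}^6$.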

Simple inverse rendering formulation. The image I is reconstructed from the four estimated factors via the lighting function $\Lambda$ and the reprojection function $\Pi$ as follows:

$\hat{I} = \Pi(\Lambda(a, d, l), d, w)$

** The global lighting direction and the lighting are defined separately, but how exactly they differ has not been explained yet. Let's wait and see.

Note that the reconstruction goes through a 'canonical view image'.

The viewpoint w is defined as the transformation between the canonical view and the actual image view.

In the equation above, what the lighting operation $\Lambda$ generates is the object as observed from the canonical view (i.e. w = 0), given the albedo, depth, and light direction,

and the reprojection operation $\Pi$ takes this canonical image and applies the appropriate viewpoint change to form the original image. The loss is set up to push $\hat{I}$ toward I.

Discussion?

The effect of lighting could be incorporated in the albedo a by interpreting the latter as a texture rather than as the object's albedo

Why keep lighting and albedo separate instead of merging them into a single texture term?

>> The paper says there are several benefits.

Even when the albedo is symmetric, the appearance change caused by illumination can make the object look asymmetric. Separating the two therefore makes the symmetry constraint much more effective.

Treating illumination and albedo separately yields shading information, which in turn is an important cue for recovering the underlying 3D shape. (In this paper, shading and 3D shape mutually constrain each other.)

Probably Symmetric Objects

>> This is the most important part, but I feel I have not fully understood it.

Assumption

(to identify symmetric object points implicitly)

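The depth and albedo defined in the canonical frame are assumed to be symmetric about a fixed vertical plane.

>> This reportedly has the side effect of helping the model discover the object's canonical view, but I do not understand why.

Define an operation flip that mirrors a map about the symmetry plane (along the horizontal axis): $[\mathrm{flip}\, a]_{c,u,v} = a_{c,W-1-u,v}$.

The requirement is then $d \approx \mathrm{flip}\, d'$, $a \approx \mathrm{flip}\, a'$.

These two constraints could be enforced with a separate loss on each quantity, but those losses would be hard to balance and unlikely to converge well. Instead, a second reconstruction (the flipped reconstruction) $\hat{I}'$ is defined from the flipped depth and albedo, and the constraints are enforced through it.

We now end up with two reconstruction losses. The first enforces original image = reconstructed image, the second original image = flipped reconstructed image.

Defining the losses this way makes balancing easy because the two losses have the same form, and it also enables the probabilistic modelling of symmetry that the paper is after.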
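As a rough sketch of the flip operator and of what "symmetric about a vertical plane" means for a canonical map, the snippet below mirrors a small depth map stored as a list of rows and measures how far it is from its mirror image. The names `flip` and `asymmetry` and the toy values are mine, not the paper's, and a single-channel map is assumed for brevity.

```python
def flip(m):
    """Mirror a 2D map (list of rows) along the horizontal axis.

    Rows are indexed by v and columns by u, so out[v][u] = m[v][W-1-u].
    """
    return [row[::-1] for row in m]


def asymmetry(m):
    """Mean absolute difference between a map and its mirrored copy.

    Returns 0.0 for a perfectly left-right symmetric map.
    """
    mirrored = flip(m)
    n = sum(len(row) for row in m)
    total = sum(abs(x - y)
                for row, mrow in zip(m, mirrored)
                for x, y in zip(row, mrow))
    return total / n


# Toy canonical depth maps (hypothetical values).
symmetric_depth = [[1.0, 0.8, 1.0],
                   [1.0, 0.7, 1.0]]
skewed_depth = [[1.0, 0.8, 0.9],
                [1.0, 0.7, 0.9]]

print(asymmetry(symmetric_depth))  # zero: the map equals its mirror image
print(asymmetry(skewed_depth))     # positive: the right column differs from the left
```

In the paper's setup this difference is not penalized directly. The flipped maps are rendered into a second reconstruction, so the symmetry pressure arrives through an image-space loss.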

Symmetry Probabilistic Model

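The loss between the source image and the reconstructed image can be defined as follows:

$\mathcal{L}(\hat{I}, I, \sigma) = -\frac{1}{|\Omega|} \sum_{uv \in \Omega} \ln \left[ \frac{1}{\sqrt{2}\,\sigma_{uv}} \exp\left( -\frac{\sqrt{2}\,\ell_{1,uv}}{\sigma_{uv}} \right) \right]$

$\ell_{1,uv} = |\hat{I}_{uv} - I_{uv}|$ is the L1 distance between the reconstructed and original images at pixel location uv. $\sigma$ is a confidence map estimated jointly by the network = aleatoric uncertainty.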
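To get a feel for how the confidence map interacts with the L1 residual, here is a minimal sketch of a confidence-weighted L1 loss written as a per-pixel Laplacian negative log-likelihood. The function name `laplacian_nll`, the grayscale (single-channel) images, and the toy numbers are assumptions made for illustration.

```python
import math


def laplacian_nll(recon, target, sigma):
    """Mean over pixels of ln(sqrt(2) * sigma_uv) + sqrt(2) * |recon_uv - target_uv| / sigma_uv.

    All three arguments are 2D lists of the same shape; sigma must be positive.
    """
    root2 = math.sqrt(2.0)
    total, count = 0.0, 0
    for r_row, t_row, s_row in zip(recon, target, sigma):
        for r, t, s in zip(r_row, t_row, s_row):
            l1 = abs(r - t)
            total += math.log(root2 * s) + root2 * l1 / s
            count += 1
    return total / count


target = [[0.5, 0.5]]
recon = [[0.5, 1.0]]       # the second pixel is badly reconstructed
confident = [[0.1, 0.1]]   # low uncertainty everywhere
hedged = [[0.1, 1.0]]      # high uncertainty only where the residual is large

print(laplacian_nll(recon, target, confident))
print(laplacian_nll(recon, target, hedged))  # lower than the line above
```

Raising $\sigma$ at a pixel shrinks the residual term but pays a $\ln \sigma$ penalty, so the network is only rewarded for flagging pixels it really cannot explain.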

** To understand the notion of uncertainty properly...

one first needs to know Bayesian deep learning.

The work the paper refers to is

What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? (NIPS 2017)

I think I have understood as much as is needed?

>> Aleatoric uncertainty refers to uncertainty caused by defects or noise inherent in the data itself, as opposed to underfitting of the model. Why is it called aleatoric uncertainty here? Because the confidence map labelled as aleatoric uncertainty is used when computing the losses for the symmetric reconstruction and the original reconstruction. In other words, the confidence map lets the model learn which parts of the given input are asymmetric, and to what degree (that is, it finds where the symmetry assumption does not hold). Which instances / objects are asymmetric is learned entirely by the model.

The loss can be interpreted as the negative log-likelihood of a factorized Laplacian distribution on the reconstruction residuals

>> Factorized Laplacian distribution? → It means that the per-pixel Laplacian marginals of the reconstruction are independent, so the joint density is a product over pixels. In other words, the covariance matrix of the joint distribution is diagonal.

A quick review of the Laplacian and Gaussian distributions

PDF of the Laplacian distribution
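For reference, a short worked version of the two densities. Writing both in terms of the standard deviation $\sigma$ is my choice, made so that the Laplacian lines up with a confidence-weighted L1 loss.

$$p_{\text{Laplace}}(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x-\mu|}{b}\right), \qquad \mathrm{Var}[x] = 2b^2$$

Substituting $\sigma = \sqrt{2}\, b$ and placing the Gaussian next to it:

$$p_{\text{Laplace}}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2}\,\sigma} \exp\left(-\frac{\sqrt{2}\,|x-\mu|}{\sigma}\right), \qquad p_{\text{Gauss}}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Taking negative logs:

$$-\ln p_{\text{Laplace}} = \ln\left(\sqrt{2}\,\sigma\right) + \frac{\sqrt{2}\,|x-\mu|}{\sigma}, \qquad -\ln p_{\text{Gauss}} = \ln\left(\sqrt{2\pi}\,\sigma\right) + \frac{(x-\mu)^2}{2\sigma^2}$$

So a Laplacian likelihood corresponds to an L1 residual scaled by $1/\sigma$, and a Gaussian likelihood to an L2 residual scaled by $1/\sigma^2$. With $x = \hat{I}_{uv}$, $\mu = I_{uv}$, and independent pixels, summing the Laplacian term over $uv$ gives a per-pixel confidence-weighted L1 loss.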

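As a result, the final learning objective can be defined as the sum of the two reconstruction losses:

$\mathcal{E}(\Phi; I) = \mathcal{L}(\hat{I}, I, \sigma) + \lambda_f \, \mathcal{L}(\hat{I}', I, \sigma')$

where $(d, a, w, l, \sigma, \sigma') = \Phi(I)$. The left term is the loss comparing the reconstructed image with the original image. The right term is the weight factor $\lambda_f$ times the loss between the symmetric (flipped) reconstruction and the original image.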
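The way the two reconstruction losses combine can be sketched as below. This is only an illustration under my own assumptions: images are flattened to 1D lists, `nll` is a confidence-weighted L1 loss in Laplacian negative log-likelihood form, and `lambda_f=0.5` is an illustrative default rather than a value taken from these notes.

```python
import math


def nll(recon, target, sigma):
    """Mean Laplacian negative log-likelihood over flattened pixels."""
    root2 = math.sqrt(2.0)
    return sum(math.log(root2 * s) + root2 * abs(r - t) / s
               for r, t, s in zip(recon, target, sigma)) / len(target)


def objective(recon, recon_flipped, target, sigma, sigma_flipped, lambda_f=0.5):
    """Loss of the direct reconstruction plus lambda_f times the loss of the flipped one.

    Each reconstruction has its own confidence map, so asymmetric regions can be
    down-weighted in the flipped term without hurting the direct term.
    """
    return (nll(recon, target, sigma)
            + lambda_f * nll(recon_flipped, target, sigma_flipped))


# Toy values (hypothetical): the flipped reconstruction misses one asymmetric pixel.
target = [0.2, 0.5, 0.9]
recon = [0.2, 0.5, 0.9]
recon_flipped = [0.2, 0.5, 0.4]
sigma = [0.1, 0.1, 0.1]
sigma_flipped = [0.1, 0.1, 0.7]  # low confidence where symmetry is violated

print(objective(recon, recon_flipped, target, sigma, sigma_flipped))
```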

Image Formation Model

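The lighting and reprojection functions $\Lambda$ and $\Pi$.

A 3D point P in the reference frame is mapped to the image pixel p = (u, v, 1) by the projection $p \propto KP$, where K is the camera intrinsics matrix.

A perspective camera is assumed. The object-to-camera distance is assumed to be roughly 1 m, and since the images are cropped around the object, the field of view of the perspective camera is set to be narrow (about 10°).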
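A small sketch of such a narrow-FOV pinhole projection follows. It assumes a standard intrinsics matrix with the principal point at the image center and a focal length derived from the field of view. The helper names and the 64×64 image size are hypothetical.

```python
import math


def intrinsics(width, height, fov_deg=10.0):
    """Pinhole intrinsics K with the principal point at the image center."""
    f = (width - 1) / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    cu = (width - 1) / 2.0
    cv = (height - 1) / 2.0
    return [[f, 0.0, cu],
            [0.0, f, cv],
            [0.0, 0.0, 1.0]]


def project(K, P):
    """Map a 3D point P = (x, y, z) with z > 0 to homogeneous pixel p = (u, v, 1), p ∝ K P."""
    x, y, z = P
    u = (K[0][0] * x + K[0][2] * z) / z
    v = (K[1][1] * y + K[1][2] * z) / z
    return (u, v, 1.0)


K = intrinsics(64, 64, fov_deg=10.0)
print(K[0][0])                       # focal length in pixels, large because the FOV is narrow
print(project(K, (0.0, 0.0, 1.0)))   # a point on the optical axis lands on the image center
```

With a 10° field of view the focal length is several times the image width, so the projection is close to orthographic for an object about 1 m away.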

The depth map d holds a depth value $d_{uv}$ for each image pixel (u, v), assumed to be observed from the canonical view. In other words, the 3D point can be defined as $P = d_{uv} \cdot K^{-1} p$ (back-project the point p on the uv plane, then scale it by the depth value).
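The back-projection can be sketched as follows, assuming a pinhole K with focal length `f` and principal point `(cu, cv)`, for which $K^{-1}p$ has a closed form. The function names and numeric values are illustrative only.

```python
def backproject(u, v, depth, f, cu, cv):
    """Lift pixel (u, v) with depth d_uv to the 3D point P = d_uv * K^{-1} p.

    For K = [[f, 0, cu], [0, f, cv], [0, 0, 1]] and p = (u, v, 1),
    K^{-1} p = ((u - cu) / f, (v - cv) / f, 1).
    """
    ray = ((u - cu) / f, (v - cv) / f, 1.0)
    return tuple(depth * r for r in ray)


def reproject(P, f, cu, cv):
    """Project P back to pixel coordinates, used here as a round-trip check."""
    x, y, z = P
    return (f * x / z + cu, f * y / z + cv)


f, cu, cv = 360.0, 31.5, 31.5           # hypothetical intrinsics for a 64x64 image
P = backproject(40.0, 20.0, 1.1, f, cu, cv)
print(P)                                # the z component equals the depth value
print(reproject(P, f, cu, cv))          # recovers (40, 20) up to rounding
```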

The viewpoint w is a vector made up of the 6 rotation and translation values. It encodes the transformation that takes the canonical view to the actual view ( $(u,v) \to (u',v')$ ).
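One way to read a 6-vector viewpoint is as three rotation angles followed by three translations. The sketch below turns such a vector into a rotation matrix and a translation and applies them to a point. The angle convention (radians, rotations about x, then y, then z) and all function names are assumptions of mine, since the notes do not pin them down.

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def viewpoint_to_rt(w):
    """Split w = (rx, ry, rz, tx, ty, tz) into a rotation matrix R and translation T."""
    rx, ry, rz, tx, ty, tz = w
    R = matmul(rot_z(rz), matmul(rot_y(ry), rot_x(rx)))  # composition order is an assumption
    return R, (tx, ty, tz)


def transform(w, P):
    """Apply the rigid transform encoded by w to the 3D point P: R P + T."""
    R, T = viewpoint_to_rt(w)
    return tuple(sum(R[i][j] * P[j] for j in range(3)) + T[i] for i in range(3))


print(transform([0.0] * 6, (1.0, 2.0, 3.0)))                              # w = 0 leaves the point unchanged
print(transform([0.0, math.pi / 2, 0.0, 0.0, 0.0, 0.5], (0.0, 0.0, 1.0)))  # quarter turn about y, then shift in z
```

Setting w = 0 gives the identity transform, which matches the reading that the canonical view is the case w = 0.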
