# DTN : Image Translation with GAN (3)

## 2. Unsupervised Cross-Domain Image Generation (DTN)

Published at **ICLR 2017** by Yaniv Taigman, Adam Polyak, and Lior Wolf.

Learn a mapping $ G: S \rightarrow T $ between two related domains $ S $ and $ T $ without labels! (Labels for images are usually expensive.)

## Baseline model

$ D $ : discriminator, $ G $ : generator,

$ f $ : context encoder; outputs a 128-dim feature vector.

\begin{equation}

R_{GAN} = \max_D \mathbb{E}_{x\sim\mathcal{D}_S} \log[1-D(G(x))] + \mathbb{E}_{x\sim\mathcal{D}_T} \log[D(x)]

\end{equation}

\begin{equation}

R_{CONST} = \mathbb{E}_{x\sim\mathcal{D}_S} d(f(x),f(G(x)))

\end{equation}

$f$-constancy : do $x$ and $G(x)$ have similar context?

$ d $ : a distance metric, e.g. MSE.

$ f $ : *pretrained* context encoder with *fixed parameters*.

$f$ can be pretrained with a classification task on $S$.

Minimize the two risks $ R_{GAN} $ and $ R_{CONST} $.
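As a concrete illustration, here is a minimal numpy sketch of the two baseline risks. `f`, `G`, and `D` are hypothetical callables standing in for the real networks, and `D` is assumed to return a scalar probability:

```python
import numpy as np

def mse(a, b):
    """d(a, b): mean squared error, the example distance metric."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def r_const(f, G, xs_source):
    """R_CONST: f-constancy risk, averaged over source samples."""
    return float(np.mean([mse(f(x), f(G(x))) for x in xs_source]))

def r_gan_d(D, G, xs_source, xs_target, eps=1e-8):
    """The discriminator objective inside R_GAN (maximized over D)."""
    fake = np.mean([np.log(1.0 - D(G(x)) + eps) for x in xs_source])
    real = np.mean([np.log(D(x) + eps) for x in xs_target])
    return float(fake + real)
```

With a perfect identity mapping, $R_{CONST}$ is zero; an undecided discriminator ($D \equiv 0.5$) scores $2\log 0.5$.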

Experimentally, the baseline model did not produce desirable results, so a similar but more elaborate architecture is proposed.

## Proposed "Domain Transfer Network (DTN)"

First, **$ f $, the context encoder**, now encodes the input as $f(x)$, and a separate generator $g$ produces the image from it: $ G(x) = g(f(x)) $.

$g$ focuses on generating from the given context $f(x)$.

Second, for $x \in \mathbf{t}$ (samples from $T$), $x$ is also encoded by $f$ and passed through $g$.

An $f$ pretrained on $S$ will not be as good on $T$, but it is sufficient for the context-encoding purpose.

$ L_{TID} $ : for $x \in \mathbf{t}$, $G(x)$ should be similar to $x$.

Also, $D$ takes $G(x)$ (or real samples) and performs ternary (3-class) classification: one real class and two fake classes.

## Losses

### Discriminator loss : $L_D$

\begin{equation}

L_D = -\mathbb{E}_{x \in \mathbf{s}} \log D_1 (G(x)) - \mathbb{E}_{x \in \mathbf{t}} \log D_2 (G(x)) - \mathbb{E}_{x \in \mathbf{t}} \log D_3 (x)

\end{equation}

$D_i(x)$ : probability of class $i$.

$D_1(x)$ : fake, generated from $S$? / $D_2(x)$ : fake, generated from $T$? / $D_3(x)$ : real sample from $T$?
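A minimal sketch of this ternary loss, under the assumption that the hypothetical `D` returns unnormalized logits over the three classes:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def l_d(D, G, xs_s, xs_t, eps=1e-8):
    """L_D = -E_s log D_1(G(x)) - E_t log D_2(G(x)) - E_t log D_3(x)."""
    t1 = -np.mean([np.log(softmax(D(G(x)))[0] + eps) for x in xs_s])
    t2 = -np.mean([np.log(softmax(D(G(x)))[1] + eps) for x in xs_t])
    t3 = -np.mean([np.log(softmax(D(x))[2] + eps) for x in xs_t])
    return float(t1 + t2 + t3)
```

A maximally confused discriminator (uniform logits) pays $3\log 3$ in total, one $\log 3$ per term.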

### Generator : Adversarial Loss $L_{GANG}$

\begin{equation}

L_{GANG} = - \mathbb{E}_{x \in \mathbf{s}} \log D_3 (G(x)) - \mathbb{E}_{x \in \mathbf{t}} \log D_3(G(x))

\end{equation}

Fool $D$ into classifying $G(x)$ as a real sample from $T$.

### Generator : $L_{CONST}$ and Identity preserving $ L_{TID}$

\begin{equation}

L_{CONST} = \sum_{x \in \mathbf{s}} d(f(x),f(g(f(x))))

\end{equation}

$L_{CONST}$ operates at the feature level.

\begin{equation}

L_{TID} = \sum_{x \in \mathbf{t}} d_2(x,G(x))

\end{equation}

$L_{TID}$ operates at the pixel level.

Both $d$ and $d_2$ are MSE in this work.
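The two regularizers can be sketched directly from the definitions, with $d = d_2 =$ MSE and hypothetical callables `f` and `g`:

```python
import numpy as np

def mse(a, b):
    """Shared distance metric d = d2 (mean squared error)."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def l_const(f, g, xs_s):
    """Feature level: f should be invariant under g∘f on source samples."""
    return sum(mse(f(x), f(g(f(x)))) for x in xs_s)

def l_tid(G, xs_t):
    """Pixel level: G should act as the identity on target samples."""
    return sum(mse(x, G(x)) for x in xs_t)
```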

\begin{equation}

L_{G} = L_{GANG} + \alpha L_{CONST}+ \beta L_{TID} + \gamma L_{TV}

\end{equation}

$L_{TV}$ is a total-variation term that smooths the output.

$L_G$ is minimized over $g$.

$L_D$ is minimized over $D$.
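Putting the weighted sum together, with an anisotropic total-variation term as one common formulation (an assumption here, not necessarily the paper's exact variant):

```python
import numpy as np

def total_variation(img):
    """Sum of absolute differences between neighboring pixels."""
    img = np.asarray(img, dtype=float)
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbors
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbors
    return float(dh + dw)

def l_g(l_gang, l_const, l_tid, l_tv, alpha, beta, gamma):
    """L_G = L_GANG + alpha*L_CONST + beta*L_TID + gamma*L_TV."""
    return l_gang + alpha * l_const + beta * l_tid + gamma * l_tv
```

A constant image has zero total variation, so $\gamma L_{TV}$ penalizes only non-smooth outputs.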

## Experiments

- Street View House Numbers (SVHN) $\rightarrow$ MNIST
- Face $\rightarrow$ Emoji

In both cases, the $S$ and $T$ domains differ considerably.

SVHN $\rightarrow$ MNIST

Pretrain $f$ on $SVHN_{f\_TRAIN}$.

Learn $G: SVHN_{DTN\_TRAIN} \rightarrow MNIST_{TEST}$.

Train an MNIST classifier on $MNIST_{TRAIN}$; it will be used later for evaluation.

Domain transfer on $SVHN_{TEST}$ : $G(SVHN_{TEST})$.

### $f$

- 4 conv layers (64, 128, 256, 128 filters) with max pooling and ReLU
- input: $32 \times 32$ RGB / output: 128-dim vector
- $f$ does not need to be a very powerful classifier
- achieves 4.95% error on the SVHN test set
- weaker on $T$: 23.92% error on MNIST
- learns analogies from unlabeled examples

### $g$

- Inspired by DCGAN
- maps the SVHN-trained $f$'s 128-dim representation to a $32\times32$ image
- four blocks of deconvolution, batch norm, and ReLU; Tanh at the final layer
- $$ L_{G} = L_{GANG} + \alpha L_{CONST}+ \beta L_{TID} + \gamma L_{TV} $$

with $\alpha=\beta=15$, $\gamma=0$.

### Evaluate DTN

Train a classifier on $MNIST_{TRAIN}$.

Its architecture is the same as $f$'s.

It reaches 99.4% accuracy on the MNIST test set.

Evaluate by testing the MNIST classifier on $ G(\mathbf{s}_{TEST}) = \{ G(x) \mid x \in \mathbf{s}_{TEST} \} $, using $Y$, the labels of $\mathbf{s}_{TEST}$.
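The evaluation protocol reduces to simple classification accuracy on transferred samples; a sketch with hypothetical `classifier` and `G` callables:

```python
import numpy as np

def transfer_accuracy(classifier, G, xs_test, ys_test):
    """Run the MNIST classifier on transferred SVHN test images and
    compare its predictions against the original SVHN labels."""
    preds = [classifier(G(x)) for x in xs_test]
    return float(np.mean([p == y for p, y in zip(preds, ys_test)]))
```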

## Experiments: Unseen Digits

Study the ability of DTN to cope with the omission of a class from the training samples.

For example, ablate class '3' from:

- the DTN training data in domain $S$
- the DTN training data in domain $T$
- the training data of $f$

But '3' still appears when testing DTN! Compare the results.

Figure: (a) input images; (b) DTN results; (c) '3' absent from SVHN; (d) '3' absent from MNIST; (e) '3' absent from both SVHN and MNIST; (f) '3' absent from SVHN, MNIST, and the training of $f$.

## Domain Adaptation

$S$ is labeled, $T$ is unlabeled; we want to train a classifier for $T$.

Train a k-NN classifier.
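As a minimal 1-NN sketch in feature space (the labeled reference set and distance choice here are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def knn_predict(query, ref_feats, ref_labels):
    """1-NN: return the label of the closest reference feature vector."""
    dists = [np.sum((np.asarray(query) - np.asarray(r)) ** 2)
             for r in ref_feats]
    return ref_labels[int(np.argmin(dists))]
```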

### Face $\rightarrow$Emoji

- faces from Facescrub/CelebA
- emoji obtained from bitmoji.com; not publicized
- preprocess the emoji with heuristics and align the faces
- $f$ from a pretrained DeepFace network (Taigman et al., 2014), the authors' previous work
- $f(x)$ is 256-dim
- $g$ outputs $64 \times 64$
- super-resolution (Dong et al., 2015) is used to upscale the final output

### Results

Choose $\alpha=100$, $\beta=1$, $\gamma=0.05$ via validation.

Classic style transfer cannot solve this task.

DTN can also perform style transfer, so DTN is more general than style-transfer methods.

## Limitations

- $f$ usually can be trained on only one domain, so the method is asymmetric.
- The two domains are handled differently.
- $T \rightarrow S$ works poorly.

- Bounded by $f$; a pretrained context encoder is needed.
- Is there a better way to learn context without pretraining?
- Are there more $S \rightarrow T$ tasks?

## Conclusion

- Demonstrates domain transfer as an unsupervised method.
- Can be generalized to various $S \rightarrow T$ problems.
- $f$-constancy maintains context across domains $S$ and $T$.
- Simple domain adaptation with good performance.
- Inspiring work for future domain-adaptation research.

More open reviews are available at OpenReview.net.