Blind-Source Separation using Generative Adversarial Network

Introduction

The goal of this experiment is to see whether blind-source separation can be solved in an unsupervised fashion with the aid of pre-trained GANs.

If correctly implemented, this model can be extended to image separation and audio processing. For image processing: given a mixture of two images, can we recover the two original images? For audio processing: given a recording of a piano and guitar duet, can we build a model that separates the piano from the guitar in an unsupervised manner?

Generative Adversarial Network

A GAN is a generative model that performs sophisticated high-dimensional density estimation using a learning framework that mimics a two-player adversarial game.

Density estimation is done by $G$, a generative model, usually constructed from neural networks. Specifically, we consider $G: z \rightarrow x$, where $z$ is drawn from some known distribution (usually a uniform distribution) and $x$ is a point in a high-dimensional space. We can view $x$ as a realization of a random variable $X$ drawn from a learned probability distribution (modeled by $G$) over that high-dimensional space. As an example, suppose we want to model the underlying probability distribution of images. We can take the entire image space as the support of $X$ and choose $G$ to be a sophisticated CNN that upsamples a simple uniformly random vector $z$ (which usually lives in a low-dimensional space) to an image $x$ (which usually lives in a very high-dimensional space). By using the adversarial-game learning framework, we can learn a model that represents the probability distribution over the entire image space.
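To make the mapping $G: z \rightarrow x$ concrete, here is a minimal sketch of a generator in PyTorch, assuming 28x28 grayscale images (e.g., MNIST) and a 64-dimensional uniform latent vector; the layer sizes are illustrative assumptions, not the architecture used in this project.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a low-dimensional latent code z to a 28x28 image."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 28 * 28), nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

# Draw z from a known (here uniform) distribution and generate "fake" images.
z = torch.rand(16, 64) * 2 - 1        # uniform on [-1, 1]
x_fake = Generator()(z)               # shape: (16, 1, 28, 28)
```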

Now the question is how exactly we construct such a function that can model a complex probability distribution. The phrase “two-player adversarial game” suggests that we should consider another model besides $G$ and somehow create an adversarial game between the two. Specifically, we introduce a discriminative model $D: x \rightarrow [0,1]$, where $x$ is a point in the same high-dimensional space as the range of $G$. The adversarial game is realized by interpreting the output of $D$ as the probability that $x$ comes from the true distribution. So if the output of $D$ is low, $D$ thinks that $x$ comes from the “fake” distribution modeled by $G$. $G$ tries to “fool” $D$ by generating samples that look like they come from the true distribution, whereas $D$ attempts to detect the “fake” samples generated by $G$. We iteratively update $G$ and $D$ until we are satisfied with the quality of the samples from $G$.
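A matching sketch of the discriminator $D: x \rightarrow [0,1]$, under the same illustrative assumptions as the generator above; the sigmoid output is read as the probability that $x$ comes from the true data distribution.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Outputs the probability that an input image is real rather than generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),  # probability of "real"
        )

    def forward(self, x):
        return self.net(x)
```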

How can we design an objective function that realizes this adversarial-game learning framework? The simplest way is a min-max objective. Recall that we want $D(x)$ to produce high values when $x \sim p_{data}$ and low values when $x \sim p_{fake}$. This is realized by $\max_{D} E_{x \sim p_{data}} [ \log D(x) ] + E_{z \sim p_{predefined}} [ \log( 1 - D(G(z))) ]$. We also want $G$ to make $D(G(z))$ as high as possible, which is realized by applying a $\min_{G}$ operation to the same terms. This amounts to the following objective function:

$J(\theta_G, \theta_D) = \min_{G} \max_{D} \; E_{x \sim p_{data}} [ \log D(x) ] + E_{z \sim p_{predefined}} [ \log( 1 - D(G(z))) ]$
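The objective above can be optimized by alternating gradient steps on $D$ and $G$. Below is a sketch of such a training loop, assuming the Generator and Discriminator classes sketched earlier and a DataLoader `real_loader` of real images (both are assumptions); it also uses the common non-saturating generator loss rather than the literal $\log(1 - D(G(z)))$ term.

```python
import torch
import torch.nn as nn

G, D = Generator(), Discriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for x_real in real_loader:
    b = x_real.size(0)
    z = torch.rand(b, 64) * 2 - 1

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    d_loss = bce(D(x_real), torch.ones(b, 1)) + \
             bce(D(G(z).detach()), torch.zeros(b, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: fool D, i.e. push D(G(z)) toward 1.
    g_loss = bce(D(G(z)), torch.ones(b, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```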

Blind-Source Separation

The central question of BSS is this: given an observation that is a mixture of several different sources, can we recover both the underlying mixing mechanism and the sources, having access only to the observation?

In general, the answer is “no”: the problem is ill-posed, since many combinations of sources and mixing mechanisms can explain the same observation. But if we impose conditions on the sources (e.g., statistical independence, as in classical ICA), both can be recovered when the mixing mechanism is a linear system.

Here, we consider a non-linear version of the BSS problem. That is, given an observation $x$, we assume $x$ is generated by an unknown non-linear function $F$ applied to an unknown input $z$.

To make this concrete for our image experiment, suppose we mix two different images, say, an MNIST digit and a synthesized image. We assume these two images live on two different data manifolds. The sources $z_1$ and $z_2$ are points on these two manifolds, and the non-linear function $F$ is the process that generates the mixture image $x$ from them. The question is: when we observe only $x$, the mixture of the two images, can we recover $F$, $z_1$, and $z_2$ (and thus the two original images) by modeling part of $F$ with GANs? (We do not have access to the underlying manifolds, so in our experiment the recovery of $z_1$ and $z_2$ is done implicitly.)
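For illustration, one hypothetical choice of the mixing process $F$ in the image experiment is a pixel-wise convex combination of the two source images; the function and the placeholder tensors below are assumptions made only to show the setup.

```python
import torch

def mix(img_a, img_b, alpha=0.5):
    """Pixel-wise convex combination of two images: one possible choice of F."""
    return alpha * img_a + (1 - alpha) * img_b

# Placeholders standing in for an MNIST digit and a synthesized image.
img_mnist = torch.rand(1, 28, 28)
img_synth = torch.rand(1, 28, 28)
x = mix(img_mnist, img_synth)   # the only observation available to the model
```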

Preliminary Model

Essentially, we attempt to solve the problem using an autoencoder whose decoder consists of two pre-trained GAN generators plus an additional decoder that converts the outputs of the two generators into one image, an estimate of the original input (the mixture of the two images).

  1. Pre-training: Let G’(z) and G(z) be the two generative models learned in the GAN framework using MNIST and synthesized data, respectively.
  2. Build an autoencoder using G’(z) and G(z) as part of the decoder.

We can model $G'(z)$ and $G(z)$ with DCGAN. For implementation, I should be careful about how to align the encoder output with the two latent codes (two codes because we have two images); a sketch of the full model follows.
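Below is a sketch of the proposed autoencoder, assuming two pre-trained (and frozen) generators `G_mnist` and `G_synth` with 64-dimensional latent codes; the encoder, the final decoder, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SeparationAutoencoder(nn.Module):
    """Encodes a mixture image into two latent codes, decodes each with a
    pre-trained generator, and recombines the results into a reconstruction."""
    def __init__(self, G_mnist, G_synth, z_dim=64):
        super().__init__()
        self.G_mnist, self.G_synth, self.z_dim = G_mnist, G_synth, z_dim
        for p in list(G_mnist.parameters()) + list(G_synth.parameters()):
            p.requires_grad = False          # keep the pre-trained GANs fixed
        self.encoder = nn.Sequential(        # mixture image -> (z1, z2)
            nn.Flatten(),
            nn.Linear(28 * 28, 512), nn.ReLU(),
            nn.Linear(512, 2 * z_dim),
        )
        self.decoder = nn.Sequential(        # combine the two generated images
            nn.Flatten(),
            nn.Linear(2 * 28 * 28, 28 * 28), nn.Tanh(),
        )

    def forward(self, x_mix):
        z = self.encoder(x_mix)
        z1, z2 = z[:, :self.z_dim], z[:, self.z_dim:]
        s1, s2 = self.G_mnist(z1), self.G_synth(z2)      # estimated sources
        x_hat = self.decoder(torch.cat([s1, s2], dim=1)).view(-1, 1, 28, 28)
        return x_hat, (s1, s2)

# Training would minimize a reconstruction loss between x_hat and the observed
# mixture, e.g. nn.MSELoss()(x_hat, x_mix), updating only the encoder/decoder.
```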

Experiments

(To follow.)