Generative Adversarial Networks (Original Formulation)

“Generative Adversarial Networks” (Goodfellow et al.) Paper overview.


Author: Andrea Bonvini

Published: Dec. 13, 2020

Citation: Bonvini, 2020

Let’s start with a question: what is Generative Modeling?

Generative modeling is an unsupervised learning task in machine learning: it involves automatically discovering and learning the regularities or patterns in input data, in such a way that the model can be used to generate new examples that plausibly could have been drawn from the original dataset.

GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated).

Formulation

Both D and G are conveniently chosen as MLPs. The generative process depends on two networks:

$\theta_g$ and $\theta_d$ are the network parameters, $x \in \mathbb{R}^n$ is an input image (either real or generated by $G$) and $z \in \mathbb{R}^d$ is some random noise to be fed to the generator. We suppose that $x$ is sampled from a distribution $p_{data}$ (i.e. $x \sim p_{data}(x)$) and $z$ is sampled from a distribution $p_z$ (i.e. $z \sim p_z(z)$). Our discriminator's output is to be seen as the probability that the input image comes from the data and not from the generator:

$$D(\cdot, \theta_d) : \mathbb{R}^n \rightarrow [0, 1]$$

The generator gives as output a generated image:

$$G(\cdot, \theta_g) : \mathbb{R}^d \rightarrow \mathbb{R}^n$$
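To make the two mappings concrete, here is a minimal PyTorch sketch of $D$ and $G$ as MLPs. The image size $n = 784$ (a flattened $28 \times 28$ image), the noise dimension $d = 100$, and the hidden widths are illustrative choices of mine, not values prescribed by the paper.

```python
import torch.nn as nn

n, d = 784, 100  # flattened image size and noise dimension (illustrative choices)

# D(·, θd): R^n -> [0, 1], probability that the input image is real.
D = nn.Sequential(
    nn.Linear(n, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

# G(·, θg): R^d -> R^n, maps a noise vector to a generated image.
G = nn.Sequential(
    nn.Linear(d, 256), nn.ReLU(),
    nn.Linear(256, n), nn.Tanh(),
)
```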

A good discriminator is such that:

$$D(x, \theta_d) \approx 1 \quad \text{when } x \sim p_{data}(x), \qquad D(G(z, \theta_g), \theta_d) \approx 0 \quad \text{when } z \sim p_z(z)$$

Training $D$ consists in maximizing the following objective (the log-likelihood of a binary classifier, i.e. the negative binary cross-entropy):

$$\max_{\theta_d} \Big( \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x, \theta_d)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z, \theta_g), \theta_d)\big)\big] \Big)$$

where the first term rewards $D$ for assigning high probability to real samples, and the second term rewards $D$ for assigning low probability to generated samples.
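In practice the two expectations are estimated on minibatches. Below is a minimal sketch of this Monte Carlo estimate, reusing the hypothetical `D`, `G`, `n` and `d` defined above; the "real" batch here is just random noise standing in for actual data.

```python
import torch

batch_size = 64
x_real = torch.randn(batch_size, n)   # stand-in for a minibatch drawn from p_data
z = torch.randn(batch_size, d)        # noise z ~ p_z
x_fake = G(z).detach()                # generated samples (no gradient flows to G here)

eps = 1e-8  # numerical stability inside the logarithms
d_objective = (torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(x_fake) + eps).mean())
# Gradient ascent on θd maximizes this quantity; equivalently, one minimizes the
# binary cross-entropy with real samples labeled 1 and generated samples labeled 0.
```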

A good generator G is one that makes D fail:

$$\min_{\theta_g} \max_{\theta_d} \Big( \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x, \theta_d)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z, \theta_g), \theta_d)\big)\big] \Big)$$

Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, as long as G changes slowly enough.

Let’s schematize it:

We need to solve the min-max game above by an iterative numerical approach. In order to do so we alternate (see the sketch after this list):

- $k$ steps of stochastic gradient ascent on $\theta_d$ (with $\theta_g$ fixed), improving the discriminator objective;
- one step of stochastic gradient descent on $\theta_g$ (with $\theta_d$ fixed), improving the generator objective.
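A minimal sketch of this alternation, continuing with the hypothetical `D` and `G` above; the optimizer, the learning rate, `k = 1`, and the placeholder `sample_real_batch` are my own assumptions, not prescriptions from the paper.

```python
import torch

opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)
k, eps = 1, 1e-8

def sample_real_batch(m=64):
    # Placeholder: in a real experiment this would return a minibatch from p_data.
    return torch.randn(m, n)

for step in range(10_000):
    # k steps of gradient ascent on θd (θg kept fixed).
    for _ in range(k):
        x_real, z = sample_real_batch(), torch.randn(64, d)
        d_loss = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1 - D(G(z).detach()) + eps).mean())  # ascend by minimizing the negative
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # One step of gradient descent on θg (θd kept fixed).
    z = torch.randn(64, d)
    g_loss = torch.log(1 - D(G(z)) + eps).mean()  # saturating loss from the minimax game
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```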

There is a reason why Goodfellow proposed to have $G$ maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$. At the beginning of the training process, when the generated samples are easily classified as "fake" (i.e. $D(G(z)) \approx 0$), the loss $\log(1 - D(G(z)))$ saturates and provides too little gradient to learn properly, while $-\log D(G(z))$ still provides a strong gradient.
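One way to see this is to look at the gradient with respect to the discriminator's logit $a$, writing $D = \sigma(a)$: $\frac{d}{da}\log(1 - \sigma(a)) = -\sigma(a)$, which vanishes as $D(G(z)) \to 0$, whereas $\frac{d}{da}\big(-\log \sigma(a)\big) = \sigma(a) - 1 \to -1$. A tiny numerical check (the logit value $-5$ is just an illustrative choice):

```python
import torch

# Logit of the discriminator on a fake sample early in training:
# D(G(z)) = sigmoid(a) is close to 0 because a is very negative.
a = torch.tensor(-5.0, requires_grad=True)

# Saturating generator loss: log(1 - D(G(z)))
torch.log(1 - torch.sigmoid(a)).backward()
print(a.grad)  # ~ -0.0067: almost no gradient signal

a.grad = None
# Non-saturating alternative: -log(D(G(z)))
(-torch.log(torch.sigmoid(a))).backward()
print(a.grad)  # ~ -0.9933: strong gradient signal
```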

We have the following value function for our min-max problem:

$$V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log(1 - D(G(z)))\big]$$

$$= \int_x p_{data}(x)\log(D(x))\,dx + \int_z p_z(z)\log(1 - D(G(z)))\,dz$$

$$= \int_x \Big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \Big)\,dx$$

The last equality comes from a change of variables, formally justified by measure theory (the Radon-Nikodym theorem) and sometimes referred to as the Law Of The Unconscious Statistician (LOTUS), since students have been accused of using the identity without realizing that it must be treated as the result of a rigorously proved theorem, not merely a definition.
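Concretely, with $f(x) = \log(1 - D(x))$ and $p_g$ denoting the distribution of $G(z)$ when $z \sim p_z$, the identity used above is:

$$\mathbb{E}_{z \sim p_z(z)}\big[\log(1 - D(G(z)))\big] = \mathbb{E}_{x \sim p_g(x)}\big[\log(1 - D(x))\big] = \int_x p_g(x)\log(1 - D(x))\,dx$$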

$$V(G, D) = \int_x \Big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \Big)\,dx$$

Let's first consider the optimal discriminator $D^*$ for any given generator $G$: the training criterion for the discriminator is to maximize the quantity defined below.

$$\arg\max_D V(G, D) = \arg\max_D \int_x \Big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \Big)\,dx$$

For an individual sample $x$ we differentiate the integrand w.r.t. $D(x)$ and set this quantity to $0$ in order to find the optimal discriminator $D^*(x)$:

$$\frac{d}{dD(x)}\Big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \Big) = 0$$

$$\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0$$

$$\frac{p_{data}(x)(1 - D(x)) - p_g(x)D(x)}{D(x)(1 - D(x))} = 0$$

$$\frac{p_{data}(x) - p_{data}(x)D(x) - p_g(x)D(x)}{D(x)(1 - D(x))} = 0$$

$$p_{data}(x) - p_{data}(x)D(x) - p_g(x)D(x) = 0$$

$$p_{data}(x) - D(x)\big(p_{data}(x) + p_g(x)\big) = 0$$

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

Does this point represent a maximum? We have to check whether the second derivative, evaluated at $D^*$, is negative.

$$\frac{d}{dD(x)}\Big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \Big) = \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)}$$

$$\frac{d^2}{dD(x)^2}\Big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \Big) = \frac{d}{dD(x)}\left( \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} \right)$$

$$\frac{d}{dD(x)}\left( \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} \right) = -\frac{p_{data}(x)}{D(x)^2} - \frac{p_g(x)}{(1 - D(x))^2} < 0$$

The quantity above is negative for every $D$ ($D^*$ included), since $p_{data}(x)$ and $p_g(x)$ are non-negative and $D(x) \in (0, 1)$, so $D^*$ is indeed a maximum.
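As a small numerical illustration of $D^*$ (the two Gaussian densities below are arbitrary choices of mine, not anything from the paper): $D^*$ is close to $1$ where only the data density has mass, close to $0$ where only the generator density has mass, and exactly $\tfrac{1}{2}$ where the two densities coincide.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-4.0, 6.0, 5)
p_data = norm.pdf(x, loc=0.0, scale=1.0)  # hypothetical data density
p_g = norm.pdf(x, loc=2.0, scale=1.0)     # hypothetical generator density

# Optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))
d_star = p_data / (p_data + p_g)
print(np.round(d_star, 3))  # -> approximately [1., 0.993, 0.5, 0.007, 0.]
```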

We can then plug $D^*$ into $V(G, D)$ and find the optimal generator $G^*$ as:

$$V(G, D^*) = \int_x \Big( p_{data}(x)\log(D^*(x)) + p_g(x)\log(1 - D^*(x)) \Big)\,dx$$

$$G^* = \arg\min_G \int_x \left( p_{data}(x)\log\!\left(\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) + p_g(x)\log\!\left(1 - \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) \right) dx$$

$$G^* = \arg\min_G \int_x \left( p_{data}(x)\log\!\left(\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) + p_g(x)\log\!\left(\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right) \right) dx$$

We now add and subtract $\log 2$ inside each term:

$$G^* = \arg\min_G \int_x \left( p_{data}(x)(\log 2 - \log 2) + p_{data}(x)\log\!\left(\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) + p_g(x)(\log 2 - \log 2) + p_g(x)\log\!\left(\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right) \right) dx$$

$$G^* = \arg\min_G \left( -\log 2 \int_x \big(p_g(x) + p_{data}(x)\big)\,dx + \int_x p_{data}(x)\left(\log 2 + \log\!\left(\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right)\right) dx + \int_x p_g(x)\left(\log 2 + \log\!\left(\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right)\right) dx \right)$$

Since both densities integrate to $1$, the first term equals $-2\log 2$:

$$G^* = \arg\min_G \left( -\log 2 \cdot (2) + \int_x p_{data}(x)\log\!\left(\frac{2\,p_{data}(x)}{p_{data}(x) + p_g(x)}\right) dx + \int_x p_g(x)\log\!\left(\frac{2\,p_g(x)}{p_{data}(x) + p_g(x)}\right) dx \right)$$

$$G^* = \arg\min_G \left( -\log 4 + \int_x p_{data}(x)\log\!\left(\frac{p_{data}(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right) dx + \int_x p_g(x)\log\!\left(\frac{p_g(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right) dx \right)$$

$$G^* = \arg\min_G \left( -\log 4 + KL\!\left(p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2}\right) + KL\!\left(p_g \,\Big\|\, \frac{p_{data} + p_g}{2}\right) \right)$$

$$G^* = \arg\min_G \Big( -\log 4 + 2\,JSD(p_{data} \,\|\, p_g) \Big)$$

where the Kullback-Leibler divergence (KL) and the Jensen-Shannon divergence (JSD) are quantities that measure the difference between two distributions, and we know that $JSD(p_{data} \,\|\, p_g) \geq 0$, with equality if and only if $p_{data} = p_g$!
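As a quick illustration (with two made-up discrete distributions), the JSD is strictly positive when the distributions differ and exactly zero when they coincide:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.1, 0.4, 0.5]
print(jsd(p_data, [0.5, 0.3, 0.2]))  # > 0: the distributions differ
print(jsd(p_data, p_data))           # 0.0: identical distributions
```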

Plugging $p_g = p_{data}$ back in, the JSD term vanishes and the criterion attains its global minimum:

$$V(D^*_G, G) = -\log 4$$

Theorem 1:

The global minimum of the virtual training criterion $V(D^*_G, G)$ is achieved if and only if $p_g = p_{data}$. At that point, $V(D^*_G, G)$ achieves the value $-\log 4$.

And that is exactly what we expected! We wanted our generator to learn the very distribution that generated the data. If $p_{data} = p_g$, it's easy to see that at the end of the training process the optimal discriminator is forced to output $0.5$, since it can no longer distinguish between real and fake samples:

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = \frac{1}{2}$$

But does this converge?

Well, as stated in the original Paper:

If $G$ and $D$ have enough capacity, and at each step of our algorithm the discriminator is allowed to reach its optimum given $G$, and $p_g$ is updated so as to improve the criterion

$$\mathbb{E}_{x \sim p_{data}}\big[\log D^*_G(x)\big] + \mathbb{E}_{x \sim p_g}\big[\log(1 - D^*_G(x))\big]$$

then $p_g$ converges to $p_{data}$.

Proof:

Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$, as done in the above criterion. Note that $U(p_g, D)$ is convex in $p_g$.

The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \subseteq \partial f(x)$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. This is equivalent to computing a gradient descent update for $p_g$ at the optimal $D$ given the corresponding $G$. $\sup_D U(p_g, D)$ is convex in $p_g$ with a unique global optimum, as proven in Theorem 1; therefore, with sufficiently small updates of $p_g$, $p_g$ converges to $p_{data}$, concluding the proof.

In practice, adversarial nets represent a limited family of pg distributions via the function G(z;θg), and we optimise θg rather than pg itself. Using a multilayer perceptron to define G introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

Footnotes

    Citation

    For attribution, please cite this work as

    Bonvini (2020, Dec. 14). Last Week's Potatoes: Generative Adversarial Networks (Original Formulation). Retrieved from https://lastweekspotatoes.com/posts/2021-07-22-generative-adversarial-networks-original-formulation/

    BibTeX citation

    @misc{bonvini2020generative,
      author = {Bonvini, Andrea},
      title = {Last Week's Potatoes: Generative Adversarial Networks (Original Formulation)},
      url = {https://lastweekspotatoes.com/posts/2021-07-22-generative-adversarial-networks-original-formulation/},
      year = {2020}
    }