“Generative Adversarial Networks” (Goodfellow et al.) Paper overview.
Let’s start with a question: what is Generative Modeling?
Generative modeling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated).
Generator $G$: produces realistic samples, taking as input some random noise. $G$ tries to fool the discriminator.
Discriminator $D$: takes as input an image and assesses whether it is real or generated by $G$.
Both D and G are conveniently chosen as MLPs. The generative process depends on two networks:
$$D = D(x, \theta_d)$$
$$G = G(z, \theta_g)$$
$\theta_d$ and $\theta_g$ are the network parameters, $x \in \mathbb{R}^n$ is an input image (either real or generated by $G$) and $z \in \mathbb{R}^d$ is some random noise to be fed to the generator.
We suppose that $x$ is sampled from a distribution $p_{data}$ (i.e. $x \sim p_{data}(x)$) and $z$ is sampled from a distribution $p_z$ (i.e. $z \sim p_z(z)$).
Our Discriminator’s output is to be seen as the probability that the input image comes from the data and not from the generator:
$$D(\cdot, \theta_d) : \mathbb{R}^n \to [0,1]$$
The generator gives as output a generated image:
$$G(\cdot, \theta_g) : \mathbb{R}^d \to \mathbb{R}^n$$
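As a concrete sketch of such a pair of MLPs (PyTorch assumed; the sizes $n$, $d$, the layer widths and the activations are hypothetical choices, not from the paper):

```python
import torch
import torch.nn as nn

n, d = 784, 100  # hypothetical sizes: e.g. 28x28 flattened images, 100-dim noise

# Discriminator D(., theta_d): R^n -> [0, 1]
D = nn.Sequential(
    nn.Linear(n, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)

# Generator G(., theta_g): R^d -> R^n
# (Tanh assumes images are rescaled to [-1, 1])
G = nn.Sequential(
    nn.Linear(d, 128), nn.ReLU(),
    nn.Linear(128, n), nn.Tanh(),
)
```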
$D(x, \theta_d)$ is maximal when $x \in X$ (i.e. $x$ is sampled from the real images dataset $X$)
$1 - D(x, \theta_d)$ is maximal when $x$ was generated by $G$
$1 - D(G(z, \theta_g), \theta_d)$ is maximal when $z \sim p_z(z)$
Training $D$ consists in maximizing the following objective, which is just the negated binary cross-entropy:
$$\max_{\theta_d} \Big( \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, \theta_g), \theta_d))] \Big)$$
where the first term rewards $D$ for assigning high probability to real samples, and the second rewards it for assigning low probability to generated ones.
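Concretely, maximizing this quantity over $\theta_d$ is the same as minimizing the standard binary cross-entropy with label $1$ for real samples and label $0$ for generated ones. A minimal sketch (PyTorch assumed; `discriminator_loss` is a hypothetical helper, not from the paper):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # d_real = D(x), d_fake = D(G(z)); both in (0, 1).
    # Descending this loss is ascending the objective above.
    real_term = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake_term = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real_term + fake_term
```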
A good generator G is one that makes D fail:
$$\min_{\theta_g} \max_{\theta_d} \Big( \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, \theta_g), \theta_d))] \Big)$$
Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, as long as G changes slowly enough.
Let’s schematize it. We need to solve the min-max game above by an iterative numerical approach. In order to do so we alternate:
$k$ steps of Stochastic Gradient Ascent w.r.t. $\theta_d$ to solve
$$\max_{\theta_d} \Big( \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, \theta_g), \theta_d))] \Big)$$
$1$ step of Stochastic Gradient Descent w.r.t. $\theta_g$, keeping $\theta_d$ fixed:
$$\min_{\theta_g} \Big( \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, \theta_g), \theta_d))] \Big)$$
i.e. (the first term can be dropped since it does not depend on $\theta_g$)
$$\min_{\theta_g} \Big( \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, \theta_g), \theta_d))] \Big)$$
In practice, this is replaced by the non-saturating objective, which results in the same fixed point of the dynamics of $G$ and $D$:
$$\max_{\theta_g} \Big( \mathbb{E}_{z \sim p_z(z)}[\log D(G(z, \theta_g), \theta_d)] \Big)$$
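Here is a minimal sketch of this alternating scheme, reusing the hypothetical `D`, `G`, `n`, `d` defined above (`dataloader`, `k` and the learning rates are placeholders; the generator step already uses the non-saturating loss, for the reason explained right after this sketch):

```python
import torch

k = 1  # discriminator steps per generator step (placeholder)
d_opt = torch.optim.SGD(D.parameters(), lr=1e-3)
g_opt = torch.optim.SGD(G.parameters(), lr=1e-3)

for real in dataloader:  # assumed to yield (B, n) batches of real samples
    # --- k steps of gradient ascent on theta_d (descend the negated objective) ---
    for _ in range(k):
        z = torch.randn(real.size(0), d)  # z ~ p_z
        # detach() so this step does not backpropagate into G
        d_loss = -(torch.log(D(real)) + torch.log(1 - D(G(z).detach()))).mean()
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

    # --- 1 step on theta_g, non-saturating form: maximize E[log D(G(z))] ---
    z = torch.randn(real.size(0), d)
    g_loss = -torch.log(D(G(z))).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```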
There is a reason why Goodfellow proposed to maximize $\log D(\cdot)$ instead of minimizing $\log(1 - D(\cdot))$: at the beginning of the training process, when the generated samples are easily classified as “fake” (i.e. $D(G(z)) \approx 0$), the gradient of $\log(1 - D(G(z)))$ is too small to learn properly!
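A quick way to make this precise (a small check, not spelled out in the paper): write the discriminator output as $D = \sigma(a)$, where $a$ is the pre-sigmoid logit. Then
$$\frac{\partial}{\partial a}\log\big(1 - \sigma(a)\big) = -\sigma(a) \qquad\text{and}\qquad \frac{\partial}{\partial a}\log\sigma(a) = 1 - \sigma(a)$$
so when $\sigma(a) \approx 0$ (the discriminator is confidently calling the sample “fake”), the saturating loss yields a gradient $\approx 0$ while the non-saturating one yields a gradient $\approx 1$.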
We have the following value function for our min-max problem:
$$V(G,D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
$$= \int_x p_{data}(x)\log(D(x))\,dx + \int_z p_z(z)\log(1 - D(G(z)))\,dz$$
$$= \int_x \big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \big)\,dx$$
This last equality comes from the Radon-Nikodym theorem of measure theory, and it’s sometimes referred to as the Law Of The Unconscious Statistician (or LOTUS theorem), since students have been accused of using the identity without realizing that it must be treated as the result of a rigorously proved theorem, not merely a definition.
$$V(G,D) = \int_x \big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \big)\,dx$$
Let’s first consider the optimal discriminator D for any given generator G. The training criterion for the discriminator D, given any generator G, is to maximize the quantity defined below:
$$\arg\max_D V(G,D) = \arg\max_D \int_x \big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \big)\,dx$$
For an individual sample $x$, we differentiate the integrand w.r.t. $D(x)$ and set this quantity to $0$ in order to find the optimal discriminator $D^*(x)$:
$$\frac{d}{dD(x)}\big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \big) = 0$$
$$\frac{p_{data}(x)}{D^*(x)} - \frac{p_g(x)}{1 - D^*(x)} = 0$$
$$\frac{p_{data}(x)(1 - D^*(x)) - p_g(x)D^*(x)}{D^*(x)(1 - D^*(x))} = 0$$
$$\frac{p_{data}(x) - p_{data}(x)D^*(x) - p_g(x)D^*(x)}{D^*(x)(1 - D^*(x))} = 0$$
$$p_{data}(x) - p_{data}(x)D^*(x) - p_g(x)D^*(x) = 0$$
$$p_{data}(x) - D^*(x)\big(p_{data}(x) + p_g(x)\big) = 0$$
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$
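We can sanity-check this result numerically at a single point $x$ (a toy check with made-up density values):

```python
import numpy as np

# Hypothetical density values at a fixed x: the integrand
# p_data(x) * log(D) + p_g(x) * log(1 - D) should peak at p_data / (p_data + p_g).
p_data, p_g = 0.7, 0.2
D = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
integrand = p_data * np.log(D) + p_g * np.log(1 - D)

print(D[np.argmax(integrand)])   # ~0.7778
print(p_data / (p_data + p_g))   # 0.7777...
```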
Does this point represent a maximum? We have to check whether the second derivative evaluated at $D^\star$ is negative.
$$\frac{d}{dD(x)}\big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \big) = \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)}$$
$$\frac{d^2}{dD(x)^2}\big( p_{data}(x)\log(D(x)) + p_g(x)\log(1 - D(x)) \big) = \frac{d}{dD(x)}\left( \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} \right)$$
$$\frac{d}{dD(x)}\left( \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} \right) = -\frac{p_{data}(x)}{D(x)^2} - \frac{p_g(x)}{(1 - D(x))^2} < 0$$
The quantity above is negative for every $D$, $D^*$ included, since $p_{data}(x)$ and $p_g(x)$ are non-negative densities (and not both zero).
We can then plug $D^\star$ into $V(G,D)$ and find the optimal generator $G^\star$:
$$\int_x \big( p_{data}(x)\log(D^\star(x)) + p_g(x)\log(1 - D^\star(x)) \big)\,dx$$
$$G^\star = \arg\min_G \int_x \left( p_{data}(x)\log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)} + p_g(x)\log\left(1 - \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) \right) dx$$
$$G^\star = \arg\min_G \int_x \left( p_{data}(x)\log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)} + p_g(x)\log\frac{p_g(x)}{p_{data}(x) + p_g(x)} \right) dx$$
Adding and subtracting $\log 2$ in each term:
$$G^\star = \arg\min_G \int_x \left( p_{data}(x)(\log 2 - \log 2) + p_{data}(x)\log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)} + p_g(x)(\log 2 - \log 2) + p_g(x)\log\frac{p_g(x)}{p_{data}(x) + p_g(x)} \right) dx$$
$$G^\star = \arg\min_G \left( -\log 2\int_x \big(p_g(x) + p_{data}(x)\big)\,dx + \int_x p_{data}(x)\left(\log 2 + \log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right) dx + \int_x p_g(x)\left(\log 2 + \log\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right) dx \right)$$
Since each density integrates to $1$:
$$G^\star = \arg\min_G \left( -\log 2 \cdot 2 + \int_x p_{data}(x)\log\frac{2\,p_{data}(x)}{p_{data}(x) + p_g(x)}\,dx + \int_x p_g(x)\log\frac{2\,p_g(x)}{p_{data}(x) + p_g(x)}\,dx \right)$$
$$G^\star = \arg\min_G \left( -\log 4 + \int_x p_{data}(x)\log\frac{p_{data}(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\,dx + \int_x p_g(x)\log\frac{p_g(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\,dx \right)$$
$$G^\star = \arg\min_G \left( -\log 4 + KL\left(p_{data}\,\Big\|\,\frac{p_{data} + p_g}{2}\right) + KL\left(p_g\,\Big\|\,\frac{p_{data} + p_g}{2}\right) \right)$$
$$G^\star = \arg\min_G \Big( -\log 4 + 2\cdot JSD(p_{data}\,\|\,p_g) \Big)$$
where the Kullback-Leibler divergence ($KL$) and the Jensen-Shannon divergence ($JSD$) are quantities that measure the difference between two distributions, and we know that $JSD(p_{data}\,\|\,p_g) = 0$ if and only if $p_{data} = p_g$!
$$\implies V(D^\star_G, G^\star) = -\log 4$$
Theorem 1:
The global minimum of the virtual training criterion $V(D^\star_G, G)$ is achieved if and only if $p_g = p_{data}$. At that point, it achieves the value $-\log 4$.
Besides, that is what we expected! We wanted our generator to learn the very distribution that generated the data. And if $p_{data} = p_g$, it’s trivial to observe that at the end of the training process the optimal discriminator is forced to output $0.5$, since it can no longer distinguish between real and fake samples:
$$D^\star(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} = \frac{1}{2}$$
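As a toy numerical check of the identity $V(G, D^\star) = -\log 4 + 2\cdot JSD(p_{data}\,\|\,p_g)$ (made-up discrete distributions, just to see the formula in action):

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence for discrete distributions with full support
    return float(np.sum(p * np.log(p / q)))

p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

m = (p_data + p_g) / 2
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

d_star = p_data / (p_data + p_g)
v = float(np.sum(p_data * np.log(d_star) + p_g * np.log(1 - d_star)))

print(np.isclose(v, -np.log(4) + 2 * jsd))  # True
print(v > -np.log(4))                       # True: -log(4) is hit only at p_g == p_data
```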
But does this converge?
Well, as stated in the original Paper:
If G and D have enough capacity, and at each step of our algorithm, the discriminator is allowed to reach its optimum given G and pg is updated so as to improve the criterion
$$\mathbb{E}_{x \sim p_{data}}[\log D^\star_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^\star_G(x))]$$
then $p_g$ converges to $p_{data}$.
Proof:
Consider $V(G,D) = U(p_g, D)$ as a function of $p_g$, as done in the above criterion. Note that $U(p_g, D)$ is convex in $p_g$.
The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. This is equivalent to computing a gradient descent update for $p_g$ at the optimal $D$ given the corresponding $G$. $\sup_D U(p_g, D)$ is convex in $p_g$ with a unique global optimum, as proven in Theorem 1; therefore, with sufficiently small updates of $p_g$, $p_g$ converges to $p_{data}$, concluding the proof.
In practice, adversarial nets represent a limited family of pg distributions via the function G(z;θg), and we optimise θg rather than pg itself. Using a multilayer perceptron to define G introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.