Learning by Competition: Dual Discriminator Generative Adversarial Networks

There has been a resurgence of interest in generative adversarial networks (GANs) in recent years. The overall performance of the generator depends on how well the discriminator is trained. In this study, we use two discriminators and one generator in the adversarial architecture. Each discriminator has a different perspective on how to evaluate the generated data, which makes the dual discriminators compete with each other to improve the quality of the generator output. The competition between the discriminators is formulated as a zero-sum game so that the optimizer converges to the Nash equilibrium. Experimental results show that the proposed approach outperforms several well-known GAN models, including DCGAN, BEGAN, WGAN, and WGAN-GP, in both visual quality and the EM-distance metric.


Introduction
GANs have many potential applications and have been applied to various areas, which makes them an important research topic in deep learning. Successful applications include data augmentation [1], art generation [2], image-to-image translation [3]-[4], super-resolution [5], and image completion [6].
The idea behind GANs is to develop two neural networks, a generator and a discriminator, that compete with each other [7]. The generator is trained to produce synthetic data that is likely to come from the sample space, taking as input a latent variable drawn from a distribution such as the normal or uniform distribution. The discriminator then evaluates the generated data and outputs a scalar value representing the probability that the sample comes from the real dataset. In other words, the discriminator is trained to distinguish the data produced by the generator from the real data.
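To make this concrete, the following is a minimal NumPy sketch of the two roles, with untrained placeholder weights and illustrative dimensions (latent dimension 4, data dimension 8) that are our own assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer generator: maps latent z ~ N(0, I) to a point in data space.
# The weights here are illustrative placeholders, not trained parameters.
W_g = rng.normal(size=(4, 8))          # latent dim 4 -> data dim 8

def generator(z):
    return np.tanh(z @ W_g)            # fake sample x = G(z)

# Toy discriminator: maps a sample to a scalar "probability of being real".
W_d = rng.normal(size=(8, 1))

def discriminator(x):
    return 1.0 / (1.0 + np.exp(-(x @ W_d)))   # sigmoid output in (0, 1)

z = rng.normal(size=(16, 4))           # batch of 16 latent vectors
fake = generator(z)
scores = discriminator(fake)
print(fake.shape, scores.shape)        # (16, 8) (16, 1)
```

In a real GAN both mappings are deep networks and the weights are learned; the sketch only fixes the shapes of the two functions involved.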
The optimization of a GAN can be formulated as a minimax game. Given an optimal discriminator, the generator learns to minimize the difference between the distribution of the generated data and that of the real sample data. Eventually, both the generator and the discriminator improve their performance, and in theory, with sufficient network capacity, the generator learns the underlying data distribution and generates samples that cannot be distinguished from real data even by an optimal discriminator.
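For reference, this minimax game is formalized in the original GAN paper [7] as:

```latex
\min_{G}\max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

where $p_{\mathrm{data}}$ is the real data distribution and $p_z$ the latent distribution.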
However, the optimization of the traditional GAN algorithm sometimes tends to diverge: it is not easy to train and can suffer from mode collapse, in which the mode-seeking process concentrates the model distribution on a single mode of the data distribution [8]. A theoretical analysis in [9] shows that training GANs tends to be unstable. Therefore, the network parameters must be tuned carefully. Many improvements have been proposed to cope with these problems; some of the theoretical breakthroughs can be found in [10]-[11].

Method description
In the original GAN paper, there are only two players in the competition: a generator and a discriminator. Both are parameterized by neural networks. As described above, the generator tries to counterfeit original data, for instance, a specific class of images from a particular dataset. The discriminator aims to discern which samples come from the real dataset and which are fake. The generator uses only a single piece of information, a scalar value provided by the discriminator, to improve its performance. This value represents how strongly the discriminator believes that the sample produced by the generator comes from the real dataset.
At the beginning, due to the random initialization of the network parameters, the results are poor. As the objective function is optimized over successive iterations, the generator and the discriminator update their parameters and improve their abilities. It has been shown that, given sufficient network capacity and enough training [7], the samples created by the generator will look identical to those from the real data. No matter how good the discriminator is, it will then have only a 50% chance of correctly determining the class of both real and fake samples. This means that the generator has learned the underlying distribution of the real dataset.
Following this line of thought, we use two discriminators to evaluate, and thereby improve, the performance of the generator. The authors of [12] also adopted multiple discriminators, but the difference between [12] and our approach lies in how the networks are trained. The discriminators in [12] cooperate to produce a better output, and the generator receives information from multiple sources, which lets it improve in fewer epochs. The main issue with the approach in [12] is that it requires more memory to store the information of the different discriminators, and the training time per epoch increases, as multiple neural networks must be trained. To cope with this issue, we instead make the discriminators compete with each other, improving overall performance while using less memory.
The generator is a parametric function that maps a random variable z, drawn from a fixed probability distribution p(z), to a corresponding element x ∈ X in the same space as the real data. We define the generator as the function Gθ(z), where θ represents the set of parameters of the mapping function, i.e., the weights and biases of a neural network. The discriminators are similarly defined as parametric functions Dθi(x), i ∈ {1, 2}, that assign a real value to every element of X representing their confidence about the origin of the sample x (i.e., real or generated). The generator competes against the two discriminators.
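A small sketch of these definitions, with independently initialized (hypothetical, untrained) parameter sets so that the two discriminators score the same samples differently:

```python
import numpy as np

rng = np.random.default_rng(42)

LATENT_DIM, DATA_DIM = 4, 8            # illustrative sizes, not the paper's

# G_theta: maps z ~ p(z) to an element of the data space X.
theta_g = rng.normal(size=(LATENT_DIM, DATA_DIM))

def g(z):
    return np.tanh(z @ theta_g)

# Two discriminators D_theta_i, i in {1, 2}, with independent parameters:
# each assigns a real-valued confidence to every x in X.
thetas = [rng.normal(size=(DATA_DIM, 1)) for _ in range(2)]

def d(i, x):
    return (x @ thetas[i]).ravel()     # raw real-valued score per sample

z = rng.normal(size=(3, LATENT_DIM))
x = g(z)
s1, s2 = d(0, x), d(1, x)
# Independent parameters give the two discriminators different
# "perspectives" on the same generated samples.
print(np.allclose(s1, s2))             # False
```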
To train the generator, we maximize the expected value of Dθ1(Gθ(z)) + Dθ2(Gθ(z)). To train the discriminators, we take a different approach by observing the symmetry of the discriminators. If both discriminators could be trained to optimality, their outputs would always be equal for any given input.
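As a loss to be minimized by a gradient-based optimizer, the generator objective is simply the negated mean of the summed discriminator scores. A minimal sketch (the score values are made up for illustration):

```python
import numpy as np

def generator_loss(d1_scores, d2_scores):
    # The generator ascends E[D1(G(z)) + D2(G(z))]; as a loss to
    # minimize, we negate the mean of the summed scores.
    return -np.mean(d1_scores + d2_scores)

# Hypothetical discriminator outputs on a batch of three generated samples.
d1 = np.array([0.2, 0.6, 0.4])
d2 = np.array([0.3, 0.5, 0.7])
print(round(generator_loss(d1, d2), 4))   # -0.9
```

Raising either discriminator's score on the fakes lowers this loss, so the generator benefits from fooling both discriminators at once.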
We make the dual discriminators compete in a min-max game. Moreover, since the pay-off function is symmetric, the game is zero-sum, and by the min-max theorem the optimal point is guaranteed to be a Nash equilibrium. This allows us to train only one discriminator per epoch; to keep the performance of both discriminators at a similar level, we train only the one with the lower performance. This avoids optimizing all of the neural networks in every epoch, reducing the execution time. The min-max game between the discriminators and the generator is expressed in (1).
To train the discriminators, we use a different approach. Notice that taking the gradient of (1) with respect to the parameters of Dθ1 gives (2), which is clearly independent of Dθ2; similarly, taking the partial derivatives with respect to the parameters of Dθ2 gives a symmetric expression that is independent of Dθ1. If we used this rule to train the discriminators, they would not use any information about each other's performance; it would be equivalent to running the training algorithm twice, independently, and we would not be able to exploit the symmetry of the two discriminators. To avoid this, we define a different cost function for the dual discriminators. First, we define the expected performance of the discriminators in (3), where fk(x) is the logistic function 1/(1 + e^(−kx)). We can then define an appropriate cost function C(E1, E2). Because of the symmetry of the discriminators, we use a zero-sum game to guarantee that the equilibrium point is a Nash equilibrium. For this, C(E1, E2) must be anti-symmetric with respect to its arguments, i.e., C(a, b) = −C(b, a). Note that this function should not be a linear combination of E1 and E2, because then the gradients with respect to the parameters of one discriminator would be independent of the other. By the min-max theorem, the Nash equilibrium of the game is given by the optimization defined in (4).
To avoid falling into local optima of the cost function C(a, b), we require it to be monotonically increasing with respect to the difference between a and b and to satisfy (5). In this way, the cost increasingly penalizes the competitor that is losing, so that gradually more effort is needed to surpass the leading one. The proposed cost function for the discriminators is given in (6), where atanh(x) is the inverse hyperbolic tangent of x, and a, b ∈ I, I = [0, 1]. The limiting value η ∈ (0, 1) guarantees that the cost function C is well-defined. Substituting (3) into (6) gives the final training objective for the discriminators. If we trained both discriminators at the same time, then depending on the initial values, one of them would outperform the other. The algorithm therefore only updates the discriminator with the lower value; otherwise, one discriminator improves much faster than the other, and the result is similar to having only one discriminator. In our experiments, the discriminators are trained more frequently than the generator. The procedure is summarized in Algorithm 1.
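The following sketch instantiates these ideas numerically. Since the equation bodies are not reproduced here, the concrete form C(a, b) = atanh(η(a − b)) is our own assumption: it is anti-symmetric, monotonically increasing in a − b, uses atanh, and is kept well-defined on [0, 1] × [0, 1] by η ∈ (0, 1), consistent with the stated requirements. The score values and the constants k and η are hypothetical:

```python
import numpy as np

ETA = 0.9  # limiting value in (0, 1); keeps atanh's argument inside (-1, 1)
K = 5.0    # steepness of the logistic squashing f_k (illustrative)

def f_k(x):
    # Logistic function f_k(x) = 1 / (1 + e^(-k x)).
    return 1.0 / (1.0 + np.exp(-K * x))

def cost(e1, e2):
    # An anti-symmetric candidate consistent with the stated properties:
    # cost(a, b) == -cost(b, a), monotone in (a - b), steep near |a - b| = 1.
    return np.arctanh(ETA * (e1 - e2))

# Expected performances E1, E2 as batch means of the squashed scores
# (hypothetical discriminator outputs on the same batch).
d1_out = np.array([0.4, 0.9, 0.1])
d2_out = np.array([0.2, 0.3, 0.5])
e1, e2 = f_k(d1_out).mean(), f_k(d2_out).mean()

# Update only the discriminator that is currently losing the zero-sum game.
weaker = 1 if e1 < e2 else 2
print(weaker)
```

With these particular scores the second discriminator trails the first, so it is the one selected for the next update step.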

Experimental Results
We trained several well-known models and compared their visual results with those of our method on the LSUN-Bedroom dataset [13]. The LSUN-Bedroom dataset consists of 3,033,042 images, which were center-cropped and resized to 64x64 pixels.
To quantitatively evaluate the performance of the generator, we choose the EM-distance as the evaluation metric [16]. The EM-distance, or Wasserstein-1 distance, is a measure between probability distributions and has some advantages over the KL-divergence and JS-divergence. The KL-divergence is well known for not being symmetric with respect to its arguments. Although the JS-divergence is symmetric, i.e., JS(P, Q) = JS(Q, P), it does not satisfy the triangle inequality and has discontinuity problems. The EM-distance is symmetric, satisfies the triangle inequality, and is continuous almost everywhere. For the computations in the experiment, we used the open-source code from [17].
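For intuition, in one dimension the empirical Wasserstein-1 distance between two equal-sized samples reduces to the mean absolute difference of their matched order statistics. A small sketch (the sample values are made up; the estimator in [17] works on high-dimensional image features, not this closed form):

```python
import numpy as np

def em_distance_1d(a, b):
    # Empirical Wasserstein-1 between two equal-sized 1-D samples:
    # sort both and average the absolute differences of matched
    # order statistics (the optimal transport plan in 1-D).
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

real = np.array([0.0, 1.0, 2.0, 3.0])
fake = np.array([1.0, 2.0, 3.0, 4.0])
print(em_distance_1d(real, fake))   # 1.0
```

Note the symmetry mentioned in the text: swapping the two samples leaves the distance unchanged.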
According to our experiments, BEGAN obtained the worst results. After a certain number of iterations, the samples degenerated and failed to improve further, a phenomenon known as mode collapse. BEGAN is known to perform well on face datasets [15] such as CelebA [18]. This indicates that the tuning of the hyper-parameters is critical across different datasets.
DCGAN and WGAN performed better than BEGAN in our experiments. Figure 1 shows the samples generated by DCGAN after training. The results are clearly better than those generated by BEGAN in Fig. 2, but there are some collapsed samples. The results of WGAN are shown in Fig. 3. WGAN exhibits problems similar to those of DCGAN: some samples are clearly unrealistic, but in general, the samples are comparable to those of DCGAN.
The samples generated by WGAN-GP are shown in Fig. 4. WGAN-GP and our method show the best performance. Clearly, the shapes in the images are more realistic than those from DCGAN and WGAN. This indicates that the gradient penalty used in the algorithm, in place of the weight clipping in WGAN, makes training more stable.
The samples generated by the proposed approach, shown in Fig. 5, are at least as good as those generated by WGAN-GP. The estimated EM-distance between our algorithm's output and the real data is smaller than that of WGAN-GP, indicating that our algorithm performs better on the LSUN dataset.
The EM-distance between the data produced by the generator and the real data is listed in Table 1. The distance for BEGAN was not considered due to mode collapse. A smaller distance means that the generated distribution is closer to that of the original dataset.

Conclusions
In this study, we have implemented a dual discriminator GAN based on a competitive architecture. The visual results and the EM-distance of the experiments on the LSUN-Bedroom dataset showed that the proposed approach obtains good results compared with several well-known methods. This indicates that using competing discriminators can lead to better performance than the traditional GAN framework. We are currently applying the proposed approach to various datasets to verify its adaptability and scalability.

Table 1. EM-distance between the real and the generated data.