Finally managed to make some time to write a blog post this term. Looking forward to getting back into blogging more frequently.
Ian Goodfellow introduced GANs in 2014, and since then they have shown incredible quality in the generation of images.
How do GANs work?
GANs consist of two competing networks: a generator (G) and a discriminator (D). G generates synthetic data from random noise with the goal of fooling D into thinking it’s real data. D has to decide whether a given sample is real or fake. It is this simple tussle between the two networks that makes GANs so powerful.
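To make the tussle concrete, here is a minimal sketch of the two objectives from the original minimax formulation, written in plain Python so it runs without any dependencies. The function names and the example probabilities are made up for illustration; `d_real` and `d_fake` stand for D's estimated probability that real and generated samples are real.

```python
import math

EPS = 1e-12  # guards against log(0)

def discriminator_loss(d_real, d_fake):
    """Loss D minimizes: -[mean log D(x) + mean log(1 - D(G(z)))]."""
    real_term = sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Minimax objective G minimizes: mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.7, 0.9])

print(f"D loss when D spots the fakes: {confident_d:.3f}")
print(f"D loss when G fools D:         {fooled_d:.3f}")
print(f"G loss when G fools D:         {generator_loss([0.7, 0.9]):.3f}")
```

D's loss goes up exactly when G's loss goes down, which is the whole competition in two functions. In practice, many implementations swap the generator objective for the non-saturating variant (maximize log D(G(z))) because it gives stronger gradients early in training.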
Why do we care about GANs?
The beauty of GANs lies in the fact that you don’t have to hand-design a loss function for what good output looks like: the network critiques itself, and given enough training time, G can learn to generate the kind of data you want from it.
The advantage of this min-max setup is that this generative approach is not dependent on a stringent loss function – the goal of each network is to fool the other, not to imitate some pre-defined dataset which can introduce unwanted biases.
You can also train conditional GANs to produce a specific kind of data, e.g., pictures of men under 40 wearing a fedora, or transcripts of people with Alzheimer’s talking to their caregivers.
GANs for text
GANs (or any networks trained with backpropagation) struggle when it comes to working with discrete observed or latent structures like text. Because you can’t differentiate through discrete elements, you either have to use a gradient estimator like REINFORCE to backpropagate, or represent your discrete components in some continuous space (our approach).
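A tiny numerical experiment shows why discrete outputs are a problem. This is only an illustrative sketch: the logits, the three-token vocabulary, and the helper names are invented here. It nudges one logit and measures how much the output changes, once for a hard token choice (argmax) and once for a softmax distribution.

```python
import math

def softmax(logits):
    """Continuous distribution over tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_one_hot(logits):
    """Discrete token choice, encoded as a 1-hot vector."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]

def finite_difference(fn, logits, index, step=1e-4):
    """Approximate d fn(logits)[index] / d logits[index]."""
    nudged = list(logits)
    nudged[index] += step
    return (fn(nudged)[index] - fn(logits)[index]) / step

logits = [2.0, 1.0, 0.1]  # hypothetical scores over a 3-token vocabulary
hard_grad = finite_difference(argmax_one_hot, logits, index=0)
soft_grad = finite_difference(softmax, logits, index=0)

print("gradient through argmax: ", hard_grad)
print("gradient through softmax:", round(soft_grad, 4))
```

The hard choice does not move at all under a small nudge, so no learning signal reaches G, while the softmax output changes smoothly. That is the gap that REINFORCE-style estimators or continuous representations try to bridge.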
One of the current state-of-the-art papers on GANs for text generation (based on BLEU scores), Adversarial Generation of Natural Language, uses the probability distribution over text tokens (the Softmax approximation) to represent the output of its G and 1-hot vectors to represent the real data. The figure below sums up their approach succinctly:
We hypothesize that training GANs to generate word2vec vectors instead of discrete tokens can produce better text because:
- Semantic and syntactic information is embedded in this real-valued space itself.
- Our structure will be vocabulary-size agnostic as the GAN structure can be static when new words are added – so you don’t need to fiddle with the dimensions of your network with your ever-adapting vocabulary.
- Given the nature of the word2vec space, we can expect an interesting variety of possibly similar generated sentences.
- No approximation needed in the GAN training phase as the output of G is a sequence of word2vec vectors that are fed directly to D.
As seen in the figure above, we feed random Gaussian noise to our G, which outputs a sequence of word2vec vectors. The real data is mapped to a sequence of vectors using a pre-trained word2vec model. These vectors are stacked on top of each other, normalized, and then treated the way you would treat images. The samples are sent to D, which determines whether a given sample is real or fake. Because the primary GAN structure stays the same, we use the same loss function as Goodfellow’s original paper.
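As a rough sketch of the real-data side of this pipeline, the snippet below turns a tokenized sentence into an image-like matrix. Everything in it is an assumption made for illustration: the 3-dimensional `toy_embeddings` table stands in for a pre-trained word2vec model, and the normalization is a simple min-max scaling to [-1, 1] (a common choice for DCGAN inputs), since the post does not spell out which normalization is used.

```python
# Hypothetical 3-dimensional embeddings standing in for a pre-trained word2vec model.
toy_embeddings = {
    "<s>": [0.0, 0.5, -0.5],
    "can": [1.2, -0.3, 0.4],
    "i": [-0.7, 0.9, 0.1],
    "have": [0.3, 0.3, -1.1],
    "a": [0.2, -0.6, 0.8],
    "tripod": [1.5, 0.7, -0.2],
    "?": [-0.4, -0.9, 0.6],
    "</s>": [0.0, -0.5, 0.5],
}

def sentence_to_matrix(tokens, embeddings):
    """Stack one embedding per token into a (sentence_length x dim) matrix."""
    return [list(embeddings[token]) for token in tokens]

def scale_to_unit_range(matrix):
    """Min-max scale every entry to [-1, 1] (assumed normalization)."""
    flat = [value for row in matrix for value in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant matrix
    return [[2.0 * (value - lo) / span - 1.0 for value in row] for row in matrix]

tokens = ["<s>", "can", "i", "have", "a", "tripod", "?", "</s>"]
matrix = scale_to_unit_range(sentence_to_matrix(tokens, toy_embeddings))

for token, row in zip(tokens, matrix):
    print(f"{token:>7}", [round(value, 2) for value in row])
```

With real word2vec vectors, each row would have a few hundred dimensions, and the scaling constants would more sensibly come from the whole training set rather than a single sentence, so that generated samples can be mapped back consistently.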
During inference, and intermittently during training, we map these samples of generated word2vec vectors to their closest neighbor using cosine similarity on the pre-trained word2vec vocab-dictionary. This mapping is the crux of the network – it’s where the AI creates text.
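The mapping step can be sketched in a few lines. This is a brute-force illustration under stated assumptions: `toy_embeddings` is a made-up 3-word vocabulary with 3-dimensional vectors, and `decode` is a hypothetical helper name. A real word2vec vocabulary would call for a vectorized or approximate nearest-neighbor search instead of this loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def decode(generated_vectors, embeddings):
    """Replace each generated vector with its most similar vocabulary word."""
    return [
        max(embeddings, key=lambda word: cosine_similarity(vector, embeddings[word]))
        for vector in generated_vectors
    ]

# Hypothetical vocabulary and generator output.
toy_embeddings = {
    "can": [1.0, 0.0, 0.0],
    "i": [0.0, 1.0, 0.0],
    "eat": [0.0, 0.0, 1.0],
}
generated = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.3, 0.7]]

decoded = decode(generated, toy_embeddings)
print(" ".join(decoded))
```

Note that this step is only used to read out text. It sits outside the training loop, so its non-differentiability does not get in the way of backpropagation.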
Turns out training GANs for text is notoriously tricky, and the field is still at a relatively crude research stage. Below are some of the sentences generated on the CMU Chinese Poetry text translation dataset (it consists of straightforward English sentences) using the Softmax approximation discussed earlier:
Our GAN (we used the standard DCGAN architecture) learns to start and stop every sentence with the same tokens (<s> and </s>, respectively). We trained our GAN to spit out 5- and 7-word sentences on the CMU dataset (a benchmark requirement for the text GAN papers), and these are some of the sentences it generates:
- <s> i ‘m probably rich . </s>
- <s> can you background anything cream ? </s>
- <s> where ‘s the lens . </s>
- <s> can i have a tripod ? </s>
- <s> can i eat a pillow ? </s>
- <s> you can hold the cheeseburger fried </s>
- <s> could you take me to a manager ? </s>
Turns out our GAN learns to generate bi-grams and some tri-grams. The output looks to be of similar quality to that of Rajeswar et al. and the BLEU-2 score is comparable too.
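For readers who haven't met BLEU-2 before, here is a simplified sentence-level sketch: the geometric mean of clipped unigram and bigram precision, times a brevity penalty. It is not the exact evaluation script behind the numbers above (the post doesn't say which implementation or reference set was used), and the function names and example sentences are chosen just for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with clipped counts."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())

def bleu2(candidate, references):
    """Sentence-level BLEU-2 with equal weights and the usual brevity penalty."""
    p1 = modified_precision(candidate, references, 1)
    p2 = modified_precision(candidate, references, 2)
    if p1 == 0.0 or p2 == 0.0:
        return 0.0
    cand_len = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    ref_len = min((abs(len(ref) - cand_len), len(ref)) for ref in references)[1]
    brevity = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return brevity * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))

reference = "can i have a tripod ?".split()
candidate = "can i eat a pillow ?".split()
score = bleu2(candidate, [reference])
print(round(score, 4))
```

Here four of six unigrams and one of five bigrams match, so the score is sqrt(4/6 × 1/5). A sentence can pick up a decent BLEU-2 from locally plausible word pairs alone, which fits the bi-gram observation above.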
However, our current set-up suffers from mode collapse, a common problem in GAN training where G figures out a way to fool a (weak) D with just a few examples. In our case, G manages to create 15-20 different real-looking sentences, which are enough to confuse D into thinking that the samples from G are always real. Mode collapse is a known problem in the computer vision world, and our next step is to try the remedies proposed there in our model.
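One cheap way to keep an eye on this during training is to decode a batch of samples and count how many distinct sentences come out. The sketch below is a hypothetical diagnostic, not part of our training code: `diversity_report` and the example batch are invented for illustration.

```python
from collections import Counter

def diversity_report(samples):
    """Summarize how many distinct sentences appear in a batch of decoded samples."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(" ".join(tokens) for tokens in samples)
    sentence, frequency = counts.most_common(1)[0]
    return {
        "total": len(samples),
        "unique": len(counts),
        "unique_ratio": len(counts) / len(samples),
        "most_common": sentence,
        "most_common_share": frequency / len(samples),
    }

# Hypothetical batch in which G keeps repeating a handful of sentences.
batch = (
    [["<s>", "can", "i", "have", "a", "tripod", "?", "</s>"]] * 6
    + [["<s>", "where", "'s", "the", "lens", ".", "</s>"]] * 3
    + [["<s>", "can", "i", "eat", "a", "pillow", "?", "</s>"]]
)
report = diversity_report(batch)
for key, value in report.items():
    print(f"{key}: {value}")
```

A unique ratio that stays low (or keeps shrinking) across large batches is a strong hint that G has settled on a few modes, even if each individual sentence looks fine.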
We also want to try our model on datasets with longer and more realistic sentences, like Dementia Bank (people with and without dementia describing pictures to a nurse) and Newsgroup20 (a lot of news articles). We also want to use some improved metrics from the text translation world to measure the quality of our text, as the BLEU score has known problems and is not well suited to our task (its main appeal is that it is very easy to calculate and is widely used as a benchmark).
Finally, we want to study how changing one of the latent variables affects the structure of the generated sentence. It would be interesting if we could control features like the sentiment, the plurality, or the tense of a sentence.
We have some more experiments lined up with conditional variants of our GANs, but more on that in Part 2. Let me know what you guys think of this, and if you have any suggestions for changes or kinds of data you would like to generate synthetically!