Video

0:00 Problem description, 1:00 Baseline, demos 1:24 Proposed method, demos

Project Summary

We propose Context Aware Language Generator for Minecraft(CALM), a framework for generating environment aware, style specific text snippets in Minecraft. Specifically, our agent will recognize the surrounding environment, which we refer to as context, to generate short stories that are coherent to the received environment. For example, given a river flowing near the agent as context and a user defined style such as horror story style, the agent should be able to generate short text snippets that contain string “river” in the style of horror stories. Concretely, we decided that our system will support generation of the following six styles: [superhero, action, drama, horror, thriller, and sci_fi]. Meanwhile, our system could take into account discrete blocks, composite blocks, weather and time of day when generating text snippets.

To be able to generate diverse and realistic stories, we believe it is important to incorporate AI/ML algorithms in the generation process. Existing works on generating text with given string usually involves training customized model or changing the generation process to deviate from auto-regressive manner, thus making it not convenient to leverage these methods(7,10,13). On the other hand, modification-based methods that swapps the words into templates/existing texts fails to generate diverse samples, and the generated texts depends on its template(8). With these issues in mind, we propose to make use of pre-trained style-aware language model with our modified search-based decoding algorithm that ensures the output text contains the target string. We show our method could generate results that are style-coherent and sntactically-sound when compared to human-written texts, and does not require the user to train a customized model.

Approaches

To achieve our goal of generating text that is coherent to the environment in Minecraft, our work involves the following two modules:

The rest of this section is organized as follows: we first cover the background concepts related to our implementation, then introduces both the baseline and proposed version of language generation approach. Finally, we introduce how we retrieve string from the Malmo environment.

1. Background

To make the report self-contained, we briefly cover the concept of language modeling, text generation, decoding methods, part-of-speech tagging and GPT-2 in this subsection.

2. Text generation methods

We introduce the implementaion of our baseline and proposed approach, and discuss their advantages and disadvantages in this subsection.

2.1 baseline generation approach

In comparison to our proposed generation method, we implemented a generator that swaps the target string into pre-stored human-written story snippets. Upon initialization, the generator keeps a publically available pool of stories with genre labels, originally used to train neural models(4). In the generation process, we first randomly sample a story snippet of desired genre from the story pool, then swaps the given string into the sampled story.

To determine the position to swap the target string in, we make use of part-of-speech labels. Specifically, we view a position in the sampled text as potential position if the part-of-speech tag of the original token at such position is the same as the desired string. For example, the swap in the above picture happened because ‘pince’ and ‘cow’ are both Nouns.

2.2 proposed generation approach

our proposed approach

We choose pretrained GPT-2 model fine tuned on style specific corpus such as horror story and super hero stories for language generation(4, 6). During the training process, the pre-trained model was given samples that starts with a genre token, thus the model will learn to generate different texts based on the given genre token. A typical sample in training data looks like the following:

Thus, we would be able to control the generation behaviour of the model by conditioning the model on the same special style tokens. At each language generation step, the pretrained model ouputs a logits of corpus size that specifies the probability of the corresponding words to appear at the current position. Traditionally, the word chosen at each generation step is achived by approaches such as nucleus sampling and top-k sampling(14).

To achieve context aware langauge generation, our system retrieves description of surrounding blocks from adjacent blocks of the agent(such as ‘diamond_block’) and returns a string that is common in natural language(‘diamond’). We then try to make the target word appear in our output text snippet by finding a position where the target word is most likely to appear. Formally, the context aware language generation task is defined as finding a sequence \((w_1,w_2,w_3... w_n)\) with an victim word \(w_v\) such that \(abs(p(target word \mid w_0, ... , w_{v-1})-p(w_v \mid w_0, ... , w_{v_1}))\) is minimized, while the raw probability of a certain word is sampled at each generation step after the word swapping happened is given by \(p(w_i \mid w_0, w_1, ... , target word, ... , w_{i-1})\)(prior to applying topk/nucleus sampling).

a more detailed description

To find the optimal sequence, our system generates candidate sequence by computing the potential target word swapping loss \(abs(p(target word \mid h_{t-1})-p(chosen word \mid h_{t-1}))\) at each generation step t, and view the current sequence as a candidate hypothesis if such loss is smaller than a pre-defined threshhold. Upon generation of a hypothesis, the word swapping loss is assigned to that branch as a score. Meanwhile, the system will keep on the generation process with other possible cases where the word swap did not take place, untill a new position where the target word could be swapped in or end of generation steps is reached. After the exploration of candicate hypothesises is done, we choose the word sequence with lowest swapping loss as output.

2.3 discussion of generation approaches

The baseline method had the advantage of being easy to implement and having fast inference time. However, the swap-based nature makes it unable to generate completely orginal text snippets. Meanwhile, while POS tagger most gerentees the generated text is grammartically correct, this approach does not take into consideration the meaning of the target word. Compared to the baseline, our proposed method could generate more diverse and unseen text snippets. Meanwhile, the model could consider the semantics of the target words in generation process, thus could produce more natural texts with given string. However, due to the large search space that is explored in the decoding process, our proposed method have longer inference time when compared to the baseline model.

3. The environment perception module

3.1 Environment perception methods

We cover our methods for retrieving strings from surrounding environment in this section.

3.2 Interaction between environment perception and text generation modules

We designed the two modules to be stand alone, and both could be used as-is. We include relevant implementaion that is required to replicate our results in the demo video in this section, but both modules should be easily customizable.

To make generated text related to environment perception, we use if statement to check the environment condition. When the time is night, we prefer to use ‘horror’ or ‘thriller’ as genre to generate word.Specificly, we will give both ‘horror’ or ‘thriller’ a 0.4 out of 1 probability which are than the other genres when we generate. Also, when we detect specific items, such as lava, we prefer to use ‘sci-fi’ or ‘superhero’ as genre, which use the same method describe upper.

Evaluation

To evaluate our proposed generation method, we compare results of our method against our baseline and/or human written short story snippets in various setups. We evaluate our results by the following characteristics:

For all samples from our proposed method, we used nucleus sampling option and constraint that the model returns the best searched solution in 30 words. We show our method could generate text that is generally on par with human written story snippets in terms of style and grammar.

1. Style coherency

To ensure the results we generated are style coherent, we choose to use human evaluation metrics to ensure the ouputs are indeed of the intended style. We use two methods to check this feature: letting human evaluators guess the style of a given text without knowing its genre, and letting human evaluators rate the style coherency of a text given its genre. In both settings, we evaluated our generated text against human written text snippets. Note that our swap-based generator uses samples from the same human written text dataset, thus we omit its results from this section in particular.

1.1 Measurement by style classification

Specifically, we randomly sample sentences with different styles from our system, and let human readers guess the genre of the generated sentence. Upon evaluation, the human rater is allowed to pick two genres out of the six possible genres. We mark a text snippet is correctly recognized if one of the guessed genre by human evaluator is the true genre of that text. We report the averaged recognized examples among human evaluators as follows:

Note that the Random Selection row is the result one would get if genre labels were randomly assigned to 60 samples with balancely distributed genres. Interestingly, upon evaluation we found that it is harder than expected for a human evaluator to guess the genre of a text snippets. In conclusion, we show that our generated result could achieve similar performance as human-produced texts.

1.2 Measurement by score-based evaluation

We then let human evaluators rate the coherence score of a piece of text given its genre from 0 to 5. More precisely, 0 means not coherent at all and 5 means the genre is extremely clear. We show that our generated text generally achieves result as good as human written texts.

2. Syntactical soundness

To ensure our method only impose minimal damage to the ability of language model to generate syntactically correct text, we use Grammarly, an open source grammar checker to check the suspected grammar errors in our generated text. For a given docment, grammarly will reports the count of suspected errors and an overall score for writting(description of grammarly score).

Specifically, we pasted 120 generated samples for both the proposed method, the baseline method and human written texts into grammarly. The number of flagged errors and overall writing scores is shown as above. We conclude that our proposed method could generate text that are as grammarly sound as humans.

In general, all methods of generation will generate grammartically correct text snippets, and the difference in error count and scores is within the tolerant level for compensating the stochasity of sampling. Note that grammarly assigns different weights for different errors, so a document with more errors could still get a higher overall score - the Grammarly Score should be treated more of a qualititive metric.

3. How well does target string(from blocks) blend in?

Since both our basedline and proposed approach works by replacing a victim word by the target word at some point in the generation process, a natural question to ask is how naturally does the target string fit into the produced text. As naturalness is generally a subjective term, we use human ratings for this metric and treat this as a qualitative feature. Conceretly, we sample sentences pairs containing the same string from both generators 60 times to construct evaluation dataset for this subsection. We then let a human evaluator pick the sentence in which the target string looks more natural without knowling the producer of each samples.

Out of 60 samples, our proposed method outperforms the basedline method in 52 cases. Interestingly, some of the text generated by swap-base generator is clearly problematic, but will look banal to grammar check such as Grammarly. For example, our baseline method will generate the followin text:

It is clear that diamond is a inserted word, and the grammar is also not completely coherent here. However, both Grammarly and GoogleDoc grammar checker fails to recognize such incongruity. Compared to the baseline, our model could put the target word into context. For example:

When recognized that pig is inserted in the generation process, out model will went on to talk about related terms like “farming”, thus the generated text looks more natural. In summary, we show that our model uses target strings better than the baseline model.

4.1 Ability for our system to swap at least one target word

A basic requirement for our language generation task is the ability to make the target string appear in the output. To evaluate this, we queried both our baseline generator and proposed generator for 120 samples, and both of the generators succeeded in all cases.

4.2 Target swapping loss and quality trade-off

A natural question that comes with our generation task is whether there is a limit with respect to how many target strings we could make the output contain. As our system keeps track of all possible choices of outputs, the answer to such question will depend on how much quality of the generated text is the user willing to compensate, and whether there is any length constraint of the output.

As a general suggestion, we show the target swapping loss(defined in Approach-2.2) for the model’s best result with respect to the length constraint of output text(for each target word). In the table, each data point \((X=N,Y=S)\) is obtained by letting the model find the best sentences of length N for 10 different combination of genre tokens and input phrases, and take the average of the swapping losses of these best sentences as S.

Naturally, the longer the output can be, the better the quality of swapping will be. For example, if the user accepts the quality of a search loop of 5, and want to make 5 words appear in the output, the model should be able to find the solution in 25 words, excluding remaining words that the model need to sample before the EOS token.

We already demonstrated the quality of text generated by 30 search loops in previous evaluation experiments in this section. We now provide some additional examples here for a taste of the quality of generated text at different swap loss levels.

References

  1. Attention Is All You Need

  2. Better Language Models and Their Implications

  3. Grammarly

  4. GPT2 Genre Based Story Generator

  5. Hierarchical Neural Story Generation #aka topk samping

  6. Huggingface: On a mission to solve NLP, one commit at a time.

  7. Insertion Transformer: Flexible Sequence Generation via Insertion Operations

  8. Masterpiece Generator

  9. Natural Language Toolkit

  10. Progressive Generation of Long Texts

  11. The Curious Case of Neural Text Degeneration

  12. The Stanford CoreNLP Toolkit

  13. Topic Aware Neural Response Generation

  14. Neural Language Generation: Formulation, Methods, and Evaluation

group meeting took place every Thursday 10:00 pm