Sunday, July 23, 2017

Some advice for journalists writing about artificial intelligence

Dear Journalists,

I'd like to offer some advice on how to write better and more truthfully when you write articles about artificial intelligence. The reason I'm writing this is that there are a whole lot of very bad articles on AI (news articles and public interest articles) being published in newspapers and magazines. Some of them are utter nonsense, bordering on misinformation, some of them capture the gist of what goes on but are riddled with misunderstandings. No, I will not provide examples, but anyone working in AI and following the news can provide plenty. There are of course also many good articles about AI, but the good/bad ratio could certainly be improved.

First off, I understand. You're writing about an extremely fast-moving field full of jargon and enthusiastic people with grand visions. Given all this excitement, there must be plenty to write about, but you don't know much (or even anything) about the field. You probably know as little about AI as I know about, say, tannery. But where tannery evolves only very slowly and involves very concrete materials and mechanics, AI moves at breakneck speed and few of those words that get thrown around seem to refer to anything you can touch or see. There's a feeling that you need to write about the latest developments NOW before they are superseded, but it's hard to see where to even begin to decipher the strange things those AI researchers say. And of course you want to write something readable, and clickable, and you don't have much time. It can't be easy.

So here's a few things to keep in mind, and some concrete recommendations, for more critical and higher-quality reporting on AI. Some of this is based on my experience with being interviewed by journalists of varying technical proficiency, and with varying inclination to buy the story I was trying to sell them. Yes, we're all trying to sell something, even we curmudgeons in the ivory tower are trying to sell you something. More about this below.

Keep in mind: AI is a big field, and very diverse in terms of topics and methods used. (True, it's not as diverse as it should be in some other senses.) The main AI conferences (such as IJCAI, AAAI, ICML and NIPS) have thousands of attendees, and most of them only understand a small part of what goes on in the conference. When I go to one of these conferences, I can perhaps follow maybe 20% of the talks and get something out of them. While I might be a bit dim myself, it's rare to find anyone who can keep up to date with sub-fields as diverse as constraint propagation, deep learning and stochastic search.

Recommendation: Do not assume that researchers you talk to knows "what's going on right now in AI". Even more importantly, if someone says they know what's going on right now in AI, assume that they only know a small part of the big picture. Double-check with someone working in another field of AI.

Keep in mind: There is no such thing as "an artificial intelligence". AI is a collection of methods and ideas for building software that can do some of the things that humans can do with their brains. Researchers and developers develop new AI methods (and use existing AI methods) to build software (and sometimes also hardware) that can do something impressive, such as playing a game or drawing pictures of cats. However, you can safely assume that the same system cannot both play games and draw pictures of cats. In fact, no AI-based system that I've ever heard of can do more than a few different tasks. Even when the same researchers develop systems for different tasks based on the same idea, they will build different software systems. When journalists write that "Company X's AI could already drive a car, but it can now also write a poem", they obscure the fact that these are different systems and make it seem like there are machines with general intelligence out there. There are not.

Recommendation: Don't use the term "an AI" or "an artificial intelligence". Always ask what the limitations of a system is. Ask if it really is the same neural network that can play both Space Invaders and Montezuma's Revenge (hint: it isn't).

Keep in mind: AI is an old field, and few ideas are truly new. The current, awesome but a tad over-hyped, advances in deep learning have their roots in neural network research from the 1980s and 1990s, and that research in turn was based on ideas and experiments from all the way back in the 1940s. In many cases, cutting edge research consists of minor variations and improvements on methods that were devised before the researchers doing these advances were born. Backpropagation, the algorithm powering most of today's deep learning, is several decades old and was invented independently by multiple individuals. When IBM's Deep Blue computer won over Garry Kasparov and showed that computers could play Chess better than humans, the very core of the software was the Minimax algorithm, first implemented by Alan Turing in the 1940s. Turing, one of the fathers of both artificial intelligence and the wider field of computer science, also wrote the paper "On Computing Machinery and Intelligence" which was published in 1950. While that paper is most famous for introducing what is now called the Turing Test, it also contains the seeds of many of the key ideas in artificial intelligence.

Recommendations: Read Turing's 1950 paper. It's surprisingly easy and enjoyable to read, free from mathematical notation, and any technical terms can easily be glossed over. Marvel at how many of the key ideas of artificial intelligence were already in place, if only in embryonic form. When writing stories about exciting new developments, also consult an AI researcher that is old, or at least middle aged. Someone who was doing AI research before it was cool, or perhaps even before it was uncool, and so has seen a full cycle of AI hype. Chances are that person can tell you about which old idea this new advance is a (slight?) improvement on.

Keep in mind: Researchers always have something to sell. Obviously, those working in some kind of startup are looking to increase the valuation of their company and their chances of investment or acquisition. Those working in academia are looking for talk invitations, citations, promotions and so on. Those working in a large company will want to create interest in some product which might or might not be related to their actual results.

Recommendations: Don't believe the hype. Approach another researcher, who the people you're writing about did not forward you to, and ask if that person believes their claims.

Keep in mind: Much of "artificial intelligence" is actually human ingenuity. There's a reason why researchers and developers specialize in applications of AI to specific domains, such as robotics, games or translation: when building a system to solve a problem, lots of knowledge about the actual problem ("domain knowledge") is included in the system. This might take the role of providing special inputs to the system, using specially prepared training data, hand-coding parts of the system or even reformulating the problem so as to make it easier.

Recommendation: A good way of understanding which part of an "AI solution" are automatic and which are due to niftily encoded human domain knowledge is to ask how this system would work on a slightly different problem.

I'd better stop writing here, as this text probably already sounds far too grumpy. Look, I'm not grumpy, I'm barely even old. And I don't want to give the impression that there isn't a lot of exciting progress in AI these days. In fact, there are enough genuine advances to report on that we don't need to pad out the reporting with derivate research that's being sold as new. Let's all try to be honest, critical and accurate, shall we?

Friday, November 04, 2016

How Darwin plays StarCraft

StarCraft is perhaps the single hardest game for computers to play well. At least if you only count games that people care about; you could of course construct games that were harder, but there's no guarantee anyone would play those games. When doing AI research, working on games that people care about means you are working on relevant problems. This is because games are designed to challenge the human brain and successful games are typically good at this. StarCraft (and its successor StarCraft 2) are played and loved by millions of people all over the world, with a very active competition scene where pro players are well-paid stars.

And there's no question that the game is hard; there is a series of AI StarCraft competitions that has been running since 2010, but the best AI players are still at the level of human novices. In other words, roughly where the best AI Go players were 15 years ago, or the best AI Chess players were 50 years ago. As computers are now able to play Chess and Go better than the best humans, the question is when we can surpass human ability for StarCraft as well.

It's not just me thinking this. Google DeepMind recently announced that StarCraft 2 will be one of their major new testbeds, after their success at training deep networks to play Atari games in the ALE framework. Facebook AI Research recently published their first paper on using machine learning to learn to play StarCraft and just today submitted another, showing that they take this challenge seriously. In academia, there is already a rich body of work on algorithms for playing (parts of) StarCraft, or generating maps for it. Given the game's complexity, it is unlikely we will conquer all of it soon; we have our work cut out for us.

A screenshot from the original StarCraft game

One of the reasons the game is so hard is that playing it well requires thinking and acting on different levels of abstraction. The game requires resource collection management, build order scheduling, prioritizing technology development, exploration, micro-management of troops as well as overall strategy and ways of deducing and countering the adversary's strategy. Trying to build an AI that can do all this well is very very hard. It is therefore prudent to approach the various parts of the problem separately.

In a new paper, we propose a new algorithm for playing StarCraft micro, given a forward model. "Micro" is the second-to-second, sometimes frame-to-frame, business of managing armies of StarCraft units in combat. The difficulty of playing micro is the reason professional (human) StarCraft players often average several hundred mouse-clicks per minute. To an unprepared onlooker good micro play tends to look chaotic, while it is in reality a highly complex affair with certain maneuvers requiring extreme skill.

A StarCraft battle with no micro tactics.
The green troops to the left don't move at all, and lose the battle.

The same battle with active micro tactics.
By moving around units depending on their health and cooldown level, a much better results is achieved.

So, what AI methods can we use to play StarCraft micro? There have been a number of attempts to use various forms of tree search including Monte Carlo Tree Search (MCTS), a core component in AlphaGo, the software that recently beat Lee Sedol to become world champion at the game of Go. The problem with using tree search to play StarCraft is the extremely high branching factor, meaning the extremely high number of possible actions that could be taken at any time. Where Chess has an average branching factor of around 35, and Go has an average branching factor of about 300, StarCraft micro often reaches branching factors of millions. This is because you don't just move one piece, you often have 10 to 50 different units to control at the same time. And the number of possible actions increases exponentially with the number of units that can act at the same time. Complex indeed.

Standard tree search algorithms, including MCTS, perform very badly when faced with such enormous numbers of actions to choose from. Basically, there are so many actions to consider that they run out of time before even considering a single step forward. So we need to approach this problem some other way. In some work presented earlier this year, and which concerned another strategy game, we attempted to use evolutionary computation instead of tree search to play the game. This worked very well - I wrote a separate blog post about that work.

Portfolio Online Evolution (2 scripts) in the JarCraft Simulator versus script-based UCT

The basic idea is to run an evolutionary algorithm every time step to select what to do next. Each "chromosome" (or "solution" or "individual") is a set of actions - one or more actions for each unit in the game. All the chromosomes are then scored based on how good results they achieve in simulation, and then the good chrosomes are kept, the less good ones thrown away and replaced with mutated copies of the good ones, again and again. Essentially Darwinian evolution in a computer program. Well, actually it's a bit more complicated, but that's the gist of it. We call this method Online Evolution, because it uses evolution, but not to tune a controller ("offline") as is often done, instead evolution is used as an action selection mechanism ("online").

For StarCraft we wanted to combine this very effective method with a way of incorporating domain knowledge about StarCraft playing. Fortunately, Dave Churchill at Memorial University of Newfoundland had already come up with a clever idea here, in the form of Portfolio Greedy Search. The core idea here is to not select directly among the different actions (for example move to a particular place, or attack a particular enemy). Instead, his algorithm uses a number of existing "scripts", which are simple rules for what units should do in different situations. Churchill's method uses a simple greedy search algorithm to search for what script to use to control each unit each time step.

Which finally brings us to the new algorithm we introduce in our paper: Portfolio Online Evolution. As the name suggests, this is a combination of the ideas of Online Evolution and Portfolio Greedy Search. You might already have figured this out by now, but what it does is to evolve plans for what script each unit should use each time step. Each chromosome contains a sequence of scripts for each unit, and they are evaluated by simulating a number of steps forward in simulation and seeing the results of using this sequence of scripts. (Quite simply, we use the difference in total hitpoints between our team and the adversary as the score.)

Portfolio Online Evolution (6 scripts) in the JarCraft Simulator versus script-based UCT

So does Portfolio Online Evolution work in StarCraft? Hell yes! It kicks ass. Protoss ass. We tested the algorithm using the JarCraft simulator, which is very convenient to work in as the real StarCraft game itself lacks a forward model. JarCraft comes with a several tree search methods implemented. It turns out Portfolio Online Evolution beats all of them soundly. What's more, its margin of victory gets bigger the larger the battle size (number of units on each side), and the number of scripts supplied to the algorithm. We were of course very happy with this result.

So where did this leave us? We can't play a full StarCraft yet, can we? No, because the full StarCraft game does not have a forward model, meaning that it cannot be simulated much faster than real-time. Portfolio Online Evolution, like other methods that search the space of future game states, require a fast forward model. It seems that we will have to concentrate on creating methods for learning such forward models in games such as StarCraft, to allow effective AI methods to be used.

In the nearer term, one of our ongoing projects is to learn the scripts themselves, to expand the repertoire of scripts available for the evolutionary script selection mechanism to choose from.

Finally, a note on who did what: When I say "we", I mean mostly a team of students led by Che "Watcher" Wang from NYU Shanghai. The other participants were Pan Chen and Yuanda Li, and the work was supervised by Christoffer Holmgård and myself. The project started as a course project in my course on AI for Games, and Watcher then wrote most of the paper. The paper was presented at the AIIDE conference a few weeks ago.

Tuesday, November 01, 2016

Overcoming the limits of pre-AI designs

In a VentureBeat article I argue that most fundamental game designs were made at a point in time were many AI algorithms were not invented, and the computers of the time were too weak to run those algorithms that existed. Therefore games were designed not to need AI.

Nowadays, we have much more sophisticated AI methods and better computers to run them. However, game design has not kept up. Our games are still designed not to need AI, probably because our game designs are evolutions of the same old designs. This is why many game developers argue that their games don't need AI.

We need to go beyond this, and redesign games with AI capabilities in mind. Perhaps design the games around the AI. There are some more examples from academia here.

Of course, this argument applies to many other things than games. Perhaps most things we do.

Thursday, September 29, 2016

How to organize CIG

When you run an annual conference series, it is important to maintain continuity and make sure that best practices live on from year to year. For many conferences, this seems to happen organically and/or imperfectly. For the IEEE Conference on Computational Intelligence and Games, we have a Games Technical Committee to oversee the conference, and a habit of always keeping a few people from previous years' conference organizing committee on the next year's committee. Now, we also have a set of of organizers' guidelines.

I took the initiative to formalize the rules we have informally and gradually agreed on for the conference back in 2014. I wrote a first draft, and then circulated it to all previous chairs of the conference, receiving much useful feedback. A number of people provided useful feedback, additions and/or edits of the text; among those who contributed substantially are Phil Hingston, Simon Lucas and Antonio Fernández Leiva (there are probably more, but I can't find them in the mail chain).

The complete guidelines can be found here, and also here. Please note that this is nothing like a final document with rules set in stone (and who would be to have the authority to do that anyway?). Rather, it's a starting point for future discussions about rules and practices. Our idea is that it can be very useful for future CIG organizers to have the way the conference has been organized written down in a single place. It could also be useful for people organizing other conferences, inside and outside AI and games.

While we're at it, I'd like to point out that I've also written about how to organize game-based AI competitions. This could be a useful resource for anyone who's into organizing competitions.

Tuesday, August 09, 2016

Algorithms that select which algorithm should play which game

This was meant to be a short post announcing some new papers of ours, but it grew. So what you get is a post announcing some new papers, with plenty of context.

Video games are possibly the best way of testing artificial intelligence algorithms. At least I've argued this (at some length) before. One of the things that video games offer that for example robot problems don't offer, is to easily and quickly test the same algorithm on a large number of games. That is one of the guiding principles of General Video Game AI Competition (GVGAI), which has certain advantages over other game-based AI testbeds.

Some games implemented in General Video Game AI Framework.

Now in practice we don't yet have any algorithm that can play all different video games well. (If we had, we would pretty much have solved AI. Really. Remember that "video games" is a very broad category, including Sokoban and Zork as well as Witcher 3 and Ikaruga. The GVGAI competition is more limited, but still covers a broad range of games.) For some games, we have no algorithms that can play the game well at all. For other games, we have algorithms that can play those games at human level, or sometimes even better. But an algorithm which can play Space Invaders well might not be very good at Boulder Dash, and suck completely at all a puzzle game such as A Good Snowman is Hard to Build. Conversely, a good snowman-building AI might not be very adept at fending off invading space aliens. We like to say that game-playing performance is intransitive.

But what we want is a single algorithm that can play any game well. Because intelligence is not about being able to perform a single task, but to perform all (or most) tasks (from within some reasonably general domain) well. Given that what we have are different algorithms that can each only perform some minority of tasks well, how do we get there?

One answer is to build an algorithm that includes a number of different game-playing algorithms, the "champions" of each type of game. When faced with new game, it would choose one of its constituent algorithms to play the game. If for each game the algorithm selects the best sub-algorithm to play that game, it would perform much better than any one algorithm on its own. Ex pluribus unum, stronger together, the whole is greater than the sum of its parts and all that. The question then is of course how to select the right algorithm for each game.

How well a number of different algorithms play different games in the GVGAI framework. Lighter=better. Several different categories or clusters of games appear based on the performance of the algorithms.

Luckily, people have studied the problem of how to automatically select algorithms to solve problems for some time under the names algorithm selection (obviously) and hyper-heuristics (less obviously). Various methods for selecting which algorithms (or heuristics, which means the same in this context) have been explored. But not for video games. Only for more abstract and... I can't say boring, can I? Whatever. Hyper-heuristics and algorithm selection techniques have mostly been applied to more abstract (and perhaps boring) problems such as satisfiability and combinatorial optimization. These are important problems for sure, but they are not all there is. In terms of problems that map to relevant challenges for human intelligence, general video game playing is arguably more important.

My coworkers and I therefore set out investigate how hyper-heuristics can be applied to video games, more specifically video games in the GVGAI framework. Our first result is a paper published in the IEEE Conference on Computational Intelligence and Games this September - read it here. We show that using simple features easily available to the agent, we can predict which of several algorithms will play an unseen game best with relatively high accuracy. This allows us to build an agent based on a handful of those agents that did best in the 2014 and 2015 GVGAI competitions, and verify that by cleverly choosing which one of these to use for each new game it can outperform all of them on games that it has not been trained on. The classifier at the heart of this agent is a simple decision tree, which also allows us to inspect on what basis it selects certain agents over others.

A decision tree used in one of our hyper-heuristic agents. Based on features of the game, it decides which algorithm to use to play it.

However, we felt that these results were not enough - ideally, the agent should select the best-performing algorithm for a new game every time, not just most of the time. Maybe we could find better features to feed our algorithm selection algorithm, or maybe train it differently?

In our next paper, which will be presented at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, we set out to investigate this question in more depth. Read the paper here. To begin with, we tested the performance of a very broad range of algorithms on all known GVGAI games. We identified a number of groups of games where particular algorithms do well. We also investigated a broader set of features for selecting algorithms, hoping to find better ways of identifying classes of games that would be susceptible to specific types of game-playing algorithms. One idea we had was that those classes would somehow correspond to game genres as identified by humans (e.g. some algorithms would be better at shooters, others at puzzle games etc). While we still don't have enough features to reliably classify games into genres as seen by humans, some results support this notion. In particular we found that we could divide games into four different classes, where different algorithms play well, by looking only at two features: whether the avatar in the game can shoot, and whether it can die.

This decision tree looks only at whether the avatar can shoot and whether it can die.

Of course, there's a lot of research left to do here. We still need better algorithm selection methods, and trying to (automatically?) construct features is a promising direction here. We could also probably do better if we learn to switch algorithms during gameplay - maybe some algorithm is more useful in the initial phases of a game, and another algorithm is more useful for the endgame? And let's not forget that we have many games which no existing algorithm plays well.

Sunday, July 17, 2016

Which games are useful for testing artificial general intelligence?

It is very hard to make progress on artificial intelligence without having a good AI problem to work on. And it is impossible to verify that your software is intelligent without testing it on a relevant problem. For those who work on artificial general intelligence, the attempt to make AI that is generally intelligent as opposed to "narrow AI" for specific tasks, it is crucial to have reliable and accurate benchmarks of general intelligence.

I have previously written about why games are ideal as intelligence tests for AI. Here'd I like to go into more depth about what sort of games we would like to use to test AI, specifically AGI (artificial general intelligence). These are the properties I think games used to test AGI should have:

  • They should be good games. Well-designed games are more entertaining and/or immersive because they challenge our brains better; according to converging theories from game design, developmental psychology and machine learning the fun in playing largely comes from learning the game while playing. A game that is well-designed for humans is therefore probably a better AI benchmark.
  • They should challenge a broad range of cognitive skills. Classical board games largely focus on a rather narrow set of reasoning and planning skills. Video games can challenge a broader set of cognitive skills, including not only reasoning and planning but also e.g. perception, timing, coordination, attention, and even language and social skills.
  • * Most importantly, they should not be one game. Developing AI for a single game has limited value for general AI, as it is very easy to "overfit" your solution to particular game by implementing various domain-specific solutions (or, as they're usually called, hacks). In the past, we've seen this development over and over with AI developed for particular games (though occasionally something of great general value appears out of research on a particular game, such as the Monte Carlo Tree Search (MCTS) algorithm being invented to play Go). Therefore it is important that AI agents are tested on many different games as part of the same benchmark. Preferably these would games that the AI developer does not even know about when developing the AI.

So let's look at how the main game-based AI benchmarks stack up against these criteria.

To begin with, there are a number of game-based AI benchmarks based on individual games. A pretty big number, in fact. The annual IEEE Conference on Computational Intelligence and Games hosts a number of game-based AI competitions, where the software can also be used offline as a benchmark. And of course, classic board games such as Chess, Checkers and Go have long been used as AI benchmarks. An interesting recent addition is Microsoft's Project Malmo, which uses Minecraft as the base for an AI sandbox/benchmark.

But these are all focused on individual games, and therefore not well suited to benchmark general intelligence. Let's talk about general game playing frameworks.

General Game Playing Competition

First we have the General Game Playing Competition and its associated software. This competition has been going since 2005, initiated by Michael Genesereth. For the competition, a game description language was developed for encoding the games; this language is a logic programming language similar to Prolog, and allows the definition of in theory any turn-based game with discrete world state. (Initially these games could not have any hidden information, but that restriction has since been overcome with new versions of the language.) In practice, almost all games defined in this language are fairly simple in scope, and could very broadly be described as board games.

To compete in the General Game Playing competition, you submit an agent that can play any game defined in this language. The agents have access to the full game description, and typically a large part of the agent development goes into analyzing the game to find useful ways of playing it. The actual game-playing typically uses MCTS or some closely related algorithm. New games (or variations of old games) are used for each competition, so that competitors cannot tune their AIs to specific games. However, the complexity of developing games in the very verbose game description language limits the number and novelty of these games.

Arcade Learning Environment

The second entry in our list is the Arcade Learning Environment (ALE). This is a framework built on an emulator of the classic Atari 2600 game console from 1977 (though there are plans to include emulation of other platforms in the future). Marc Bellemare and Michael Bowling developed the first version of this framework in 2012, but opted to not organize a competition based on it. Agents can interface to the ALE framework directly and play any of several dozen games; in principle, any of the several hundred released Atari 2600 games can be adapted to work with the framework. Agents are only given a raw image feed for input, plus the score of the game. To play a game in the ALE framework, your agent therefore has to decipher the screen in some way to find out what all the colorful pixels mean.

Three different games in the ALE framework.

Most famously, the ALE framework was used in Google DeepMind's Nature paper from last year where they showed that they could train convolutional deep networks to play many of the classic Atari games. Based only on rewards (score and winning/losing) these neural networks taught themselves to play games as complex as Breakout and Space Invaders. This was undeniably an impressive feat. Figuring out what action to take based only on the screen input is far from a trivial transform, and the analogue to the human visuomotor loop suggests itself. However, each neural network was trained using more than a month of game time, which is clearly more than would be expected of e.g. a human learner to learn to play a single game. It should also be pointed out that the Atari 2600 is a simple machine with only 128 bytes of RAM, typically 2 kilobytes of ROM per game and no random number generator (because it has no system clock). Why does it take so long time to learn to play such simple games?

Also note that the networks trained for each of these games was only capable of playing the specific game it was trained on. To play another game, a new network needs to be trained. In other words, we are not talking about general intelligence here, more like a way of easily creating task-specific narrow AI. Unfortunately the ALE benchmark is mostly used in this way; researchers train on a specific game and test their trained AI's performance on the same game, instead of on some other game. Overfitting, in machine learning terms. As only a fixed number of games are available (and developing new games for the Atari 2600 is anything but a walk in the park) it is very hard to counter this by enforcing that researchers test their agents on new games.

General Video Game AI Competition

Which brings us to the third and final entry on my list, the General Video Game AI Competition (GVGAI) and its associated software. Let me start by admitting that I am biased when discussing GVGAI. I was part of the group of researchers that defined the structure of the Video Game Description Language (VGDL) that is used in the competition, and I'm also part of the steering committee for the competition. After the original concepts were defined at a Dagstuhl meeting in 2012, the actual implementation of the language and software was done first by Tom Schaul and then mostly by Diego Perez. The actual competition ran for the first year in 2014. A team centered at the University of Essex (but also including members of my group at NYU) now contributes to the software, game library and competition organization.

The "Freeway" game in the GVGAI framework.

The basic idea of GVGAI is that agents should always be tested on games they were not developed for. Therefore we develop ten new games each time the competition is run; we currently have a set of 60 public games, and after every competition we release ten new games into the public set. Most of these games are similar to (or directly based on) early eighties-style arcade games, though some are puzzle games and some have more similarities to modern 2D indie games.

In contrast to ALE, an agent developed for the GVGAI framework gets access to the game state in a nicely parsed format, so that it does not need to spend resources understanding the screen capture. It also gets access to a simulator, so it can explore the future consequences of each move. However, in contrast to both ALE and GGP, agents do not currently get any preparation time, but needs to start playing new games immediately. In contrast to GGP, GVGAI bots do also not currently get access to the actual game description - they must explore the dynamics of the game by attempting to play it. This setup advantages different agents than the ALE framework. While the best ALE-playing agents are based on neural networks, the best GVGAI agents tend to be based on MCTS and similar statistical tree search approaches.

The game "Run" in GVGAI.

The GVGAI competition and framework is very much under active development, and in addition to the planning track of the competition (with the rules described above), there is now a two-player track and a learning track is in the works, where agents get time to adapt to a particular game. We also just ran the level generation track for the first time, where competitors submit level generators rather than game-playing agents, and more tracks are being discussed. Eventually, we want to be able to automatically generate new games for the framework, but this research yet has some way to go.

To sum up, is any of these three frameworks a useful benchmark for artificial general intelligence? Well, let us acknowledge the limitations first. None of them test things skills such as natural language understanding, story comprehension, emotional reasoning etc. However, for the skills they test, I think they each offer something unique. GGP is squarely focused on logical reasoning and planning in a somewhat limited game domain. ALE focuses on perception and to some extent planning in a very different game domain, and benefits from using the original video games developed for human players. I would like to argue that GVGAI tests the broadest range of cognitive skills through having the broadest range of different games, and also the best way of preventing overfitting through the simplicity of creating new games for the framework. But you should maybe take this statement with a pinch of salt as I am clearly biased, being heavily involved in the GVGAI competition. In any case, I think it is fair to say that using any of these frameworks clearly beats working on a single game if you are interested in making progress on AI in general, as opposed to a solution for a particular problem. (But by all means, go on working on the individual games as well - it's a lot of fun.)

Tuesday, July 05, 2016

Your blog post is on another website

You can find my most recent blog post, "Is actual artificial intelligence the next frontier for AI in games?" on the Mobius AI website.

Mobius is a startup focused on providing state-of-the-art AI as a service for games. Products currently in development include natural language interface technology that lets you speak to the game, abuse monitoring tools and personality models, and much more is in the pipeline. The idea is to take state of the art AI methods and build tools around them that anyone can easily integrate into their own game as easily as an API call.

I agreed to be an advisor to the company because I think this is a great idea and just might revolutionize how the game industry thinks about advanced AI methods – if such methods are available out of the box, maybe designers dare design with such methods in mind, and maybe developers dare trust these ideas?

Thursday, June 16, 2016

On level generation in general, and a new competition

Procedural content generation has been a thing in games at least since Rogue and Elite in the early 80s. Plenty of games feature some kind of procedural generation, for example of levels, maps or dungeons. There's also lots of generation of more auxiliary things, such as skyboxes, trees and other kinds of vegetation, and patterns in general. While Spelunky relit the interest in PCG among indie game developers, AAA developers are increasingly looking at generating various parts of their games. There is also a very active research community on procedural content generation, with lots of papers published every year on new ways of generating things in games. We even wrote a book on this.

Anyway. You probably already knew this, given that you somehow found your way to this blog. What I'm going to say next is also rather obvious:

Essentially all PCG systems are limited to a single game. The level generator for Spelunky only generates Spelunky levels, and will not work with Civilization, or Diablo, or even Super Mario Bros. The Civilization map generator will not work with The Binding of Isaac, or Phoenix, or FTL, or... well, you get my point. In many cases the general algorithm (be it L-systems, binary space partition, diamond-square or something else) could be made to work on other games, but various amounts of game-specific engineering (hacking?) would be necessary; in other cases, the PCG method itself is mostly game-specific engineering and it's hard to discern a general algorithm.

Now, this is actually a problem. It is a problem because we want reusability. We don't want every game designer/developer to have to develop their own PCG method for each new game. Wouldn't it be great if there was software we could just grab, off the shelf to do level generation (or generation of some other kind) in your game? Even for those designers who see PCG as a canvas for creative expression, wouldn't it be great to have something to start with? Most game developers now use some kind of game engine, where physics, collision detection, rendering etc are available out of the box. Even some kinds of PCG is available this way, in particular vegetation through SpeedTree and similar packages. Why couldn't this be the case for level generation?

Let's make another analogy. In research on game-playing AI, there is a growing realization that working only on a single game has its limits. Trying to create champion-level players of Go, Poker, StarCraft or Unreal Tournament is in each case a worthy endeavor and the results are valuable and interesting, but at the same time the resulting solution tends be pretty domain-specific. The world's best Go AI is worthless at any other game than Go, and the same goes for all the other games in the list. There's simply a lot of game-specific engineering.

This is the problem that some recent competitions and frameworks are aiming to overcome. The General Game Playing Competition, the Arcade Learning Learning Environment and the General Video Game AI Competition (GVGAI) each focus on testing game-playing agents on multiple different games. There are many differences between their respective approaches, but also strong similarities.

Tying these threads together, what would it mean to create level generators (or other kinds of game content generators) that without modifications would create good content for a large number of different games? In other words, what would general level generation look like? This is not only a question of making game designers' lives easier (I am of course always interested in that; making game designers' lives easier, or replacing them), but also a very interesting AI and computational creativity problem in its own right.

In a new paper, General Video Game Level Generation, we explore this question. We design a framework based on the GVGAI framework, that allows level generators to connect to any game that is defined in the Video Game Description Language (the basis of GVGAI). The interface gives the level generator information about how the game works, and the level generator then returns a game level in a specified format. In the paper, we describe three different generators based on different approaches, and we test them using both computational means (which agents can play these levels) and through user studies with human players.

Better yet, we are not content with doing this ourselves. We want you to get involved too. That is why we are setting up a level generation track of the General Video Game AI Competition. The competition will run at the International Joint Conference on Artificial Intelligence this year, and we have extensive documentation on how to participate. It is easy to get started, and given how important the question of general level generation is, participating in the first ever competition on general level generation could be a very good investment of your efforts.

Looking forward to seeing what you will do with our framework and competition!

Monday, April 18, 2016

The differences between tinkering and research

Some of us have academic degrees and fancy university jobs, and publish peer-reviewed papers in prestigious journals. Let's call these people researchers. Some (many) others publish bots, hacks, experimental games or apps on blogs, web pages or Twitter while having day jobs that have little to do with their digital creative endeavors. Let's call those people tinkerers.

So what's the difference between researchers and tinkerers?

This is a valid question to ask, given that there are quite a few things that can be - and are - done by both researchers and tinkerers. Like creating deep neural nets for visual style transfer, creating funny Twitter bots, inventing languages for game description and generation, writing interactive fiction or developing Mario-playing AIs. These things have been done by people with PhDs and university affiliations, and they have been done by people who do this for a hobby. Anyone can download the latest deep learning toolkit, game engine or interactive fiction library and get cracking. So why is this called research when the academic does it, and just a curious thing on the Internet when a non-academic does it?

Let me start by outlining some factors that are not defining the difference between tinkering and research.

First of all, it's not about whether you work at a university and have a PhD. People can do excellent research without a PhD, and certainly not everything that a PhD-holder does deserves to called research.

Second, it's not because research always is more principled or has a body of theory supporting it. Nor is there typically a mathematical proof that the software will work or something like that. It's true that there are areas of computer science (and some other disciplines) where research progresses trough painstakingly proving theorems building on other theorems, and I have a lot of respect for such undertakings, but this has little to do with the more applied AI and computer science research I and most of my colleagues do. On a technical level, much of what we do is really not very different from tinkering. Some of it is good code, some of it bad, but typically there is a good (or at least interesting) idea in there.

Third, it's not really the publication venue either. It's true that most of us would trust a peer-reviewed paper in a good conference or journal more than something we find on a blog, and peer review has an important role to fulfill here. But what was once a sharp boundary is now a diffuse continuum, with a myriad of publication venues with different focus and different degrees of stringency. Quite a few papers make it through peer review even though they really shouldn't. Also, the traditional publication process is agonizingly slow, and many of us might just put something online right away instead of waiting until next year when the paper comes out. (I personally think it's prudent to always publish a peer-reviewed paper on substantial projects/artifacts I contribute to, but I sometimes put the thing itself online first.) It is also becoming more common to post preprints of papers on places such as arXiv as soon as they are done, and update them when/if the paper gets accepted into a journal or conference.

So we are back to square one. What, then, is the actual difference between tinkering and research? Let me list four differences, in order of decreasing importance: scholarship, testing, goals and persistence.


Probably the most importance difference between tinkering and research is scholarship. Researchers go out and find out about what other people have done, and then they build on that so they don't have to reinvent the wheel. Or if they do reinvent the wheel, they explain why they have to reinvent the wheel and how and why their wheel is different from all the other wheels out there. In other words, researchers put the thing they have done into context.

For example, almost seven years ago I made some experiments with evolving neural networks to play Super Mario Bros, and published a paper on this. The work (and paper) became fairly well-known in a smallish community, and a bunch of people built on this work in their own research (many of them managed to get better results than I did). Last year, some guy made an experiment with evolving neural networks for Super Mario Bros and made a YouTube video out of it. The video certainly reached more people on the internet than my work did; it makes no mention of any previous work. Seen as tinkering, that work and video is good work; seen as research, it is atrocious because of the complete lack of scholarship. The guy didn't even know he was reinventing the wheel, and didn't care to look it up. Which is fine, as it was probably not meant as research in the first place, and not sold as such.

Good scholarship is hard. It is easy to miss that someone tackled the same problem as you (or had the same idea as you) last year, or 5 years ago, or 50. People use different words to describe the same things, and publish in out-of-the-way places. Once you found the literature you must read it and actually understand it in order to see how it is similar to or differs from the idea you had. Because good scholarship is not just listing a number of previous works that are vaguely related to what you, but rather telling a believable, coherent and true story where all those previous works fits in, and your own work makes a logical conclusion. Therefore good scholarship takes a lot of searching and reading, a lot of time and effort. It's no wonder that lots of people don't want to spend the time and effort, and would rather get on with the tinkering.

It's common for incoming PhD students and other students to question the need for the scholarship when they could spend the time writing code. So let me go through some reasons for doing good scholarship, in order increasing importance. (Beyond wanting to get a PhD, of course.)

The perhaps most immediately important reason is common civility and courtesy. If you do a thing and tell the world about it, but "forget" to tell the world that person X did something very similar before you did it, then you are being rude to person X. You are insulting person X by not acknowledging the work she or he did. Academics are very sensitive to this, as proper attribution is their bread and butter. In fact, they will generally get offended even if the one you fail to cite other people than themselves. Therefore, the easiest way to get your papers rejected is to not do your related works section.

What about someone who doesn't care what academics think, or about getting published in peer-reviewed journals and conferences? Any point in spending all that time in front of Google Scholar and reading all that technical text written by academics with widely varying writing skills? Yes, obviously. Knowing what others have done, you can build on their work. Stand on the shoulders of giants, or at least atop a gang of midgets who have painstakingly crawled up on the shoulders of taller-than-average people. The more you know, the better your tinkering.

But to see the primary reason to do our scholarship before (or during, or after) tinkering we must lift our eyes beyond our own little fragile egos (yours and mine). It is about the accumulation of knowledge and progress on the scale of the human species. If we learn from each other, we can ultimately push the boundaries of what we collectively know forward and outward; if we don't learn from each other, we are bound to do the same things over and over. And it's way more likely that others will learn from you if you make it clear how what you are doing is different from (and maybe better than) what was done before.

So if you want your little hack or bot or whatnot to contribute to science, in other words to the evolution of humanity, you should do your scholarship.


Here's another big thing. A tinkerer makes a thing and puts it out there. A researcher also tests the thing in some way, and writes up what happens. Tests can take many shapes, as there are many things that can be tested - it depends on what you want to test. Generally the test is about characterizing the thing you made in some way. It could be performance on some sort of benchmark. Or a quantitative characterization with statistics from running your thing multiple times. Or maybe a user study. Or why not a qualitative study, where you really take your time to interact with your software and describe it in detail. The point is that if something is worth making, it's also worth studying and describing. If you don't study it when you're done, you're not learning as much as you could. And if you don't describe it well, nobody else will learn from it.

Interestingly, the tinkering and testing can sometimes be done by different people. There are quite a few academic papers out there that systematically study software that other people built but did not care to study in detail. This ranges from performance analysis of someone else's sorting algorithm, to large parts of the academic field of game studies.


Why do you tinker? Because of the thrill of trying something new? To hone your skills with some tool or programming language? To build useful tools for yourself or others? To get attention? To annoy people? Because you had an idea one night when you couldn't sleep? All of these are perfectly valid reasons, and I must confess to having had all those motivations at point or another.

However, if you read a scientific paper those are usually not the stated reasons for embarking on the research work presented in the paper. Usually, the work is said to be motivated by some scientific problem (e.g. optimizing real-value vectors in high-dimensional spaces, identifying faces in a crowd, generating fun game levels for Super Mario Bros). And that is often the truth, or at least part of the truth, from a certain angle.

While tinkering can be (and often is) done for the hell of it, research is meant to have some kind of goal. Now, it is not always the case that the goal was to get the result that was eventually reported. A key characteristic of research is that we don't really know what the results will be (which is why most grant applications are lies). Sometimes the result comes first, and the goal afterwards. Fleming did not set out to discover Penicillin, but once he did it was very easy to describe his research as solving an urgent problem. Also, he had been working on antibacterial compounds for a long time following different leads, so he recognized the importance of his discovery quickly.

Usually, goals in research are not just goals, but ambitious goals. The reason we don't know what the results of a research project will be is that the project is ambitious; no-one (as far we know) has attempted what we do before so our best guesses at what will happen are just that: guesses. If we understand the system so well that we can predict the results with high accuracy, chances are we are tinkering. Or maybe doing engineering.

Of the papers I've written, I think most of them started with some kind of problem I wanted to solve, so in other words a goal. But many others have been more opportunistic; we had a technology and an idea, and wanted to see what happened because... well, it sounded like a cool thing to do. Interestingly, I have never found it a problem to describe the research as if we had a particular goal in mind when we did it. This is probably because I always keep a number of high-level goals in mind, which implicitly or explicitly help me shape my research ides. This brings us to the next difference between research and tinkering.


You know Einstein's paper that established the special theory of relativity? A guy in his twenties, never having published anything before, publishing a single paper that revolutionized physics? Most papers are not like that.

Most papers report tiny steps towards grand goals. Results that are not in themselves very exciting, but hopefully will help us sometime in the future solve some problem which would be very exciting to solve. Like generating good video games from scratch, curing cancer or algorithms that understand natural language. The vast majority of such breakthroughs don't just happen - they are the results of sustained efforts over years or decades.  Recent progress we have seen in Go-playing builds on decades of research, even though it is sometimes reported as a sudden move by DeepMind.

Tinkerers are content to release something and then forget about it. Researchers carry out sustained efforts over a long time, where individual experiments and papers are part of the puzzle.

Doing research therefore requires having goals on different time scales in mind at any given time, and being able to extract high-level goals from level-goals, and seeing where new results fit into the bigger scheme of things. That is why I consider my attempts (together with colleagues) to chart out the research field and establish grand challenges as some of my most important. See, for example, our paper on challenges for procedural content generation, or on challenges for neuroevolution in games, or our attempt to structure all of research on AI in games.

Interestingly, when I started doing research I did not think I had much persistence at all. I also did not understand how much it was needed. Both of these realizations came later.


This post was inspired by my recent reading of Matti Tedre's "The Science of Computing", a history of the debates about what computer science actually is. He argues that computer science has variously been seen as mathematics, engineering and science, and that this debate has gone back and forth for as long as there has been computing researchers, with no clear end in sight. Reading the book, I felt that most of my research is not science, barely engineering and absolutely not mathematics. But I still think I do valuable and interesting research, so I set out to explain what I am doing.

The post was also inspired by discussions and arguments with a number of my colleagues, some of which have rather exclusionary ideas of what Science is and how Truth should be attained; and others who don't seem to think there's much difference between a blog post and a journal paper and who question the need to do all of the boring parts of research. I hope I have been able to make a good case for doing good research.

Friday, March 25, 2016

A way to deal with enormous branching factors

When you play Chess, there are on average something like 30 different moves you can make each turn. In other words, the branching factor is around 30. Go, on the other hand, has a branching of 300 or so. The standard Minimax algorithm, traditionally used to play Chess and similar board games, can't handle this branching factor very well. This is one of the two reasons why Go is so much harder for computers to play than Chess is, and why it took 19 years between the victory of AI over humans in Chess and the same victory in Go. (The other reason is the difficulty of state evaluation in Go.)

Let's think a bit about that number: 300. Apparently that's enough to stop Minimax (and Xerxes). Ultimately, Go was conquered with the help of Monte Carlo Tree Search (MCTS), which handles higher branching factors better than Minimax. But still, 300 is not a terribly big number.

Now consider this. Chess and Go are both games where you only move a single piece each turn. Many, perhaps most, turn based games are not like this. In board games such as Risk or Diplomacy you can move a number of units each turn. The same goes for computer strategy games such as Civilization, Advance Wars or XCOM. In trading card games such as Magic and Hearthstone you can play out a large number of cards each turn. Let's call these games multi-action games. (And let's not even get started on a game such as StarCraft, where you move lots of units in tandem but have the added complication of continuous time and space.) One might point out that such games model a number of real-world scenarios with varying degrees of accuracy, so an AI method for playing such games could be useful for non-game applications.

Hero Academy, a multi-action turn-based game with a characteristically huge branching factor.
What's the branching factor of a multi-action game? If you have 6 units to move, and you can move each unit to one of ten possible positions, you have a branching factor of 10 to the power of 6, or one million! Suddenly 300 does not feel like a very big number at all. And this is even simplifying things a bit, assuming that the order in which we move the units doesn't matter.

Obviously, Minimax is incapable of handling a branching factor of a million. A two-ply search means that a trillion positions will need to be examined. Even MCTS runs into very serious problems with this kind of branching factor. The basic problem is that you need to explore all the actions from the root node before doing anything else, and that might already take more time than you have that turn. (There are some clever ways of using MCTS in this space by considering individual unit actions, but that will have to be a topic for another time.)

At EvoStar 2016, we will present a paper that takes a fresh look at this problem and comes with a somewhat surprising solution. The paper is written by Niels Justesen, Tobias Mahlmann and myself; Niels was one of our star masters students at ITU Copenhagen, and Tobias and me supervised his masters thesis. We used a re-implementation of the strategy game Hero Academy as our testbed; Hero Academy is fairly representative of computer strategy games: each turn, a number of action points are available, and any action point can be used to move any character.

The basic idea of the paper is that the branching factor is so huge that we cannot do any tree search. Instead, we will have to concentrate on searching the space of possible action sequences that make up a single turn of the game. A simple state evaluation function will have to suffice when judging the quality of a turn–even though we cannot simulate the actions of the opponent, a good execution of each single turn might be all we need to play well. But even then, the branching factor is too high to search the possibilities of a single turn exhaustively. So we need to search smartly.

Therefore, we turned to evolution. We use an evolutionary algorithm to evolve sequences of actions that could make up a single turn. The fitness function is the quality of the board state of the end of the turn, as measured by some straightforward metrics. Both mutation and crossover are used to provide variation, as is common in evolutionary computation. We call the use of evolution to search actions within a turn Online Evolution.

The crossover scheme for Online Evolution in Hero Academy.
What about the results? Simply put, Online Evolution kicks ass. Evolution outperforms both a standard MCTS and two different greedy search methods by a wide margins. While it would certainly be possible to find MCTS versions that would handle this particular problem better, Online Evolution is at the very least highly competitive on this problem.

This is significant because the use of evolution for planning actions is very underexplored. While evolutionary algorithms have been used extensively to play games, it is almost always through evolving a program of some kind to play the game (for example through neuroevolution). The use of evolution in the actual playing phase seems to be almost virgin territory. We took some inspiration from the only other example of this general idea we know, Rolling Horizon Evolution for the PTSP game. But ultimately, there is no good reason that evolution could not be used to solve all the same problems that MCTS can. In fact, in his recent PhD thesis Cameron McGuinness showed that MCTS can be used to solve many of the same problems that evolution can. There seems to be some deeper connection here, which is well worth exploring.

Are you still with me? Good! Then you can go read our paper to get all the details!