[MUD-Dev] GDC 2002: Beyond Psychological Theory: Getting Data that Improves Games

J C Lawrence claw at kanga.nu
Sat Jun 15 23:17:06 CEST 2002


From:

  http://www.gamasutra.com/gdc2002/features/fulton/fulton_pfv.htm

--<cut>--
GDC 2002: Beyond Psychological Theory: Getting Data that Improves Games 

By Bill Fulton
Gamasutra
March 21, 2002
URL: http://www.gamasutra.com/gdc2002/features/fulton/fulton_01.htm

How can I make my game more fun for more gamers?

This is the question for those who want to make games that are popular,
not just critically acclaimed. One (glib) response is to "design the
games better." Recently, the idea of applying psychological theories
as a way of improving game design has become an increasingly popular
topic in various industry publications and conferences. Given the
potential of applying psychological theory to game design, I expect
these ideas to become more frequent and more developed. While
psychological theories are certainly useful aids for thinking about
games and gamers, psychology has much more to offer than theory. An
enormous part of the value of psychology to games lies in
psychological research methods (collecting data), not in the theories
themselves.

I should clarify some terms here. When I talk about "psychology," I do
not mean the common perception of psychology--talking to counselors,
lying on the Freudian couch, mental illness, etc. In academia, this kind
of psychology is called "clinical psychology." In this paper,
"psychology" refers to experimental psychology, which employs the
scientific method in studying "normal populations functioning normally."

But before I talk about how psychological research methods can help
improve games, I need to first explain more about how psychological
theories are helpful, and the limitations they have with respect to game
design.

Psychological theories can be useful, but data are more useful

All designers think about what people like, hate, and want. Some
designers may be consciously using theories from psychology as part of
the process of evaluating what people want, but most designers
probably just rely on their intuitive theories about what gamers want.

The risks of relying on intuitive psychology. What I call "intuitive
psychology" is the collection of thoughts, world views, 'folk wisdom,'
etc. that people use to try to understand and predict others. Some
examples might make this clearer--one common intuitive psychological
belief about attraction is that "opposites attract." However, many
people (and many of the same people) also believe the opposite, that
"birds of a feather flock together." Both of these ideas have some
merit and are probably true in some ways for most people. But given
that they are clearly conflicting statements, it is unclear which to
believe and act on--which statement is true? Or, more likely, when is
each statement more likely to be true? Does the degree of truth of
these statements vary by person? By situation? By both? The problem
with intuitive psychology is that many intuitions disagree with each
other, and it is unclear which world view is more likely to be right,
if either of them is at all. You're just trusting that the designer's
theories are close enough to reality that the design will be
compelling.

The insufficiency of formal theories of psychology. Formal theories of
psychology have been subjected to rigorous testing to see when they map
onto reality, and when they do not. In order for a theory of psychology
to gain any kind of acceptance, the advocates have to have battled with
some success against peers who are actively attempting to show it to be
incorrect or limited. This adversarial system of determining "truth" and
reliable knowledge employs the scientific method of running experiments
and collecting data. Because of this adversarial system, formal theories
of psychology are more trustworthy than intuitive theories of
psychology--you know that they are more than just one person's
unsubstantiated opinion about what people want.

But while theories of psychology from academia can be quite useful as
a lens to examine your game, their limitation is that they are
typically too abstract to provide concrete action items at the level
designers need. This lack of specificity in psychological theories
hasn't really hurt designers too much, because for the most part
designers (and people in general) have a decent enough idea of how to
please people without needing formal theories. I think very few people
had light bulbs go on when they learned that Skinner's theory of
conditioning stipulates that people will do stuff for rewards. The
work that Skinner and others did on how to use rewards and punishers
well in terms of acquisition and maintenance of behaviors can be
enlightening. But academic theories of psychology don't get granular
enough to tell us whether gamers find the handling of the Ferrari a
bit too sensitive.

An example of why academic theories of psychology aren't enough is in
order. Skinner's Behaviorism is probably one of the most well-defined
and supported theories, and the easiest to apply to games. (In fact,
John Hopson wrote an excellent article in Gamasutra in April 2001
demonstrating how to analyze your game through behaviorism's lens.) One
of Hopson's examples is about how players in an RPG behave differently
depending upon how close they are to reinforcement (e.g., going up a
level, getting a new item, etc.). He talks about how if reinforcers are
too infrequent, the player may lose motivation to get that next
level. However, how often is not often enough? Or too often? (Who wants
to level up every five seconds?) Both "too often" and "not often enough"
will de-motivate the player. Designers need to find a 'sweet spot'
between too often and not often enough that provides the optimal (or at
least a sufficient) level of motivation for the player to keep trying to
level up. Theory may help designers begin to ask the more pertinent
questions, but no theory will tell you exactly how often a player should
level up in three hours of play in a particular RPG.
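
To make that 'sweet spot' concrete, here is a minimal sketch (my own
illustration, not from Hopson's article or any shipped game) of how a
designer might eyeball level-up pacing under a hypothetical
exponential XP curve. Every number in it is invented; the point is
that pacing becomes a tunable quantity you can inspect and test rather
than guess at.

    # Hypothetical pacing model: all constants are invented examples.
    XP_RATE_PER_MIN = 10   # assumed average XP a player earns per minute
    BASE_XP = 100          # assumed XP needed to reach level 2
    GROWTH = 1.5           # assumed requirement multiplier per level

    def xp_for_level(level: int) -> float:
        """Total XP required to reach `level`, starting from level 1."""
        return sum(BASE_XP * GROWTH ** (lvl - 2)
                   for lvl in range(2, level + 1))

    # Print when each level-up lands in a three-hour session.
    session_minutes = 180
    level = 2
    while xp_for_level(level) <= XP_RATE_PER_MIN * session_minutes:
        print(f"Level {level:2d} at minute "
              f"{xp_for_level(level) / XP_RATE_PER_MIN:6.1f}")
        level += 1
    # Raising GROWTH stretches the gaps toward "not often enough";
    # lowering it squeezes them toward "too often."

Running this shows level-ups at roughly minutes 10, 25, 48, 81, and
132: the gaps widen by the GROWTH factor each time, and whether that
schedule hits the sweet spot is exactly the kind of question only data
from real players can answer.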

Beyond Theory: the value of collecting data with psychological methods

So I've argued that psychological theories (both intuitive and
academic) have limitations that prevent them from being either
trustworthy or sufficiently detailed. Now I'm going to talk about what
IS sufficiently trustworthy AND detailed--collecting data with
psychological methods. Feedback gleaned via psychological testing
methods can be an invaluable asset in refining game design.

As I said at the beginning of this paper, the central question for a
designer who wants to make popular games is "how do I make my game
more fun for more gamers?" and a glib response is to "design the games
better." Taking the glib answer seriously for a moment, how do you go
about doing that? Presumably, designers are doing the best they can
already. The Dilbertian "work smarter, not harder" is funny, but not
helpful. The way to help designers is the same way you help people
improve their work in all other disciplines--you provide them feedback
that helps them learn what is good and not so good about their work, so
that they can improve it.

Of course, designers get feedback all the time. In fact, I'm sure that
many designers sometimes feel that they get too much feedback--it seems
that everyone has an opinion about the design, that everyone is a
"wannabe" designer (disguised as artists, programmers, publishing execs,
etc.), as well as everyone's brother. But the opinions from others often
contradict each other, and sometimes go against the opinions of the
designer. So the designer is put in the difficult situation of knowing
that their design isn't perfect, wanting to get feedback to improve it,
and encountering feedback that makes sense, yet is often contradictory
both with itself and with the designer's own judgment. This makes it
difficult to know what feedback to act on. So the problem for many
designers is not a lack of feedback, but an epistemological
problem--whose opinion warrants overruling the designer's own
judgment? Whose opinion really represents what more gamers want?

Criteria for good feedback and a good feedback delivery system

Before launching into a more detailed analysis of common feedback loops
and my proposed "better" one, I need to make my criteria explicit for
what I consider "good" feedback and a good feedback delivery system. The
addition of "delivery system" is necessary to provide context for the
value (not just accuracy) of the feedback. The criteria are:

   1.  The feedback should accurately represent the opinions of the
   target gamers. By "target gamers," I mean the group of gamers that
   the game is trying to appeal to (e.g., driving gamers, RTS gamers,
   etc.). If your feedback doesn't represent the opinion of the right
   group of users, then it may be misleading. This is absolutely
   critical. Misleading feedback is worse than no feedback, the same way
   misleading road signs are worse than no signs at all. Misleading
   signs can send folks a long way down the wrong road.

   2.  The feedback should arrive in time for the designer to use it. If
   the feedback is perfect, but arrives too late (e.g., post RTM, or
   after that feature is locked down), the feedback isn't that helpful.
     
   3.  The feedback should be sufficiently granular for the designer to
   take action on it. The information that "gamers hate dumb-sounding
   weapons" or that "some of the weapons sound dumb" isn't nearly as
   helpful as "Weapon A sounds dumb, but Weapons B, C, and D sound
   great."

   4.  The feedback should be relatively easy to get. This is a
   pragmatic issue--teams won't seek information that is too costly or
   too difficult to get. Teams don't want to pay more money or time than
   the information is worth ($100k and 20 person hours to learn that
   people slightly prefer the fire-orange Alpha paint job to the bright
   red one is hardly a good use of resources.)

The first criterion is about the accuracy of the feedback, which is
critical; the rest are about how that feedback needs to be delivered
if it is going to be useful, not merely true.

Common game design feedback systems and their limitations

There are many feedback systems that designers use (or, in some cases,
have been subjected to). Most designers, like authors, recognize that
they need feedback on their work in order to improve it--few authors
have reason to believe that their work is of publishable quality
without some revision based on feedback. I'm going to list the
feedback systems of which I am aware, and discuss how good a feedback
delivery system each is. There are two main categories of feedback
loops: feedback from professionals in the games industry, and feedback
from non-professionals (i.e., gamers). While these sources obviously
affect each other, it is easier to talk about them separately.

Feedback from Professionals in the games industry

There are two main sources of this kind of feedback:

   1.  Feedback from those on the development team. This is the primary
   source of feedback for the designer--people working on the game say
   things like "that character sucks" or "That weapon is way too
   powerful." This system is useful because it handily meets criteria
   two through four (the feedback is very timely, granular enough, and
   easy to get), but still leaves the designer with a question mark on
   criterion one--how many gamers will agree that that weapon is way
   too powerful?

   2.  Feedback from gaming industry experts. Game design consultants
   ("gurus"), management at publishers, game journalists, etc. can also
   provide useful feedback. While their feedback can often meet
   criterion three (sufficiently granular), criterion two (timely) is
   sometimes a problem--long periods can go by between feedback, and
   recommendations can arrive after you can use them. And the designer
   is still left with questions about criterion one (accurately
   represents gamers), although some could argue that these experts may
   represent gamers more accurately because they have greater exposure
   to more games in development.

So while feedback from professionals is the current bread and butter
for most teams and definitely nails criteria two, three, and four, it
relies a great deal on faith and hope for criterion one--that the
feedback from industry professionals accurately maps onto gamers'
opinions. The reason this assumption is questionable is perhaps best
illuminated by a simple thought experiment--how many games do you
think a typical gamer tries or sees in a year? How many do you think a
gaming industry professional tries or sees? The two probably differ by
a factor of ten or more. Gaming industry professionals are in the top
1 percent in knowledge about games, and their tastes may simply be far
more developed (and esoteric) than typical gamers' tastes. While some
professionals in the industry are probably amazingly good at
predicting what gamers will like, which ones are they? How many think
they are great at it, when others disagree?

So while feedback from industry professionals is necessary when
designing the game, those professionals may not be the best judges of
whether gamers will like something. In the end, they can only speak
for themselves.

Feedback from Non-professionals

Game teams are not unaware of the problem that their judgment does not
always map onto what most gamers really want. Because of this, they
often try to get feedback from those who are more likely to give them
accurate feedback, and the obvious people to talk to are the gamers
themselves. Some common ways this is done are listed below, along with
some analysis of how good a feedback system each is according to the
four criteria.

   1.  Newsgroup postings/beta testing/fan mail. This is reading the
   message boards to see what people say about the game. The main
   problem with this as a feedback system is criterion two (timely):
   the game has to be fairly far along (at least beta, if not shipped)
   before you can get it into people's hands, and by then the feedback
   typically arrives too late to make any but the most cosmetic of
   changes. The feedback also often isn't sufficiently granular to take
   action on. ("The character sucks!") But at least this kind of
   feedback is relatively cheap in both time and money.
     
   2.  Acquaintance testing. This is where you try to get people
   (typically relatives, neighbor's kids, etc.) from outside the
   industry to play your game and give you feedback. This feedback is
   often sufficiently granular and may be relatively accurate, but it is
   often not that timely due to scheduling problems, and can be costly
   in time.

   3.  Focus groups/focus testing. This kind of feedback system is
   typically run by the publisher, and involves talking to small groups
   (usually four to eight gamers) in a room about the game. They may
   get to see or play demos of the game, but not always. One typical
   problem with focus groups is that they tend to happen very late in
   the process, when feedback is hard to act on (not timely) and not
   sufficiently granular. The costs for focus groups can also be quite
   high.

This approach has the potential to be useful, in that it involves
listening to gamers who aren't in the industry. However, there are
many pitfalls: it is often dubious how accurately the feedback
represents gamers, both because of the situations themselves (only
certain kinds of people post messages, people feel pressured to say
positive things, the people running the test often lack sufficient
training in how to avoid biasing the participants, etc.) and because
of the relatively small number of people involved. How to minimize
these concerns and create a feedback system that works on all four
criteria is discussed in the next section.

Designing a better feedback system

Up to this point, I've mostly been criticizing what is done. Now I need
to show that I have a better solution. I'm going to outline some of the
key factors that have allowed Microsoft to develop a feedback system
that we think meets the four criteria that I set up for a "good"
feedback system. We call this process of providing designers with
feedback from real users on their designs "user-testing," and the people
who do this job "user-testing specialists."

The importance of using principles of psychological
testing. Experimental psychology has been studying how to get
meaningful, representative data from people for over 70 years, and the
process we use adheres to the main principles of good research. This is
not to say that all psychological research is good research, any more
than all code is good code; researchers vary in their ability to do
good research just as programmers vary in their ability to write good
code. But there are accepted tenets of research methodology that have
been shown to yield information worth relying on, and our processes have
been designed with those in mind. (For the sake of not boring you
senseless, I'm not going to attempt to summarize 70 years of research on
how to do research in this paper.) What I'm going to do instead is
describe the day-to-day work that the user-testing group at Microsoft
does for its dev teams (both first and third party).

The actual testing methods we use. The user-testing group provides three
major services: usability testing, playtesting, and reviews. These
services are described in detail below.

  1. Usability research is typically associated with small-sample
  observational studies. Over the course of 2-3 days, 6-9 participants
  come to Microsoft for individual 2-hour sessions. In a typical study,
  each participant spends some unstructured time exploring the game
  prior to attempting a set of very specific tasks. Common measures
  include comments, behaviors, task times, and error rates. (A
  hypothetical sketch of how such measures might be summarized appears
  after this list.) Usability is an excellent method for discovering
  problems that the dev team was unaware of, and for understanding the
  thoughts and beliefs of the participant and how they affect their
  interaction with the game. This form of testing has been a part of
  the software industry for years and is a staple of the HCI
  (Human-Computer Interaction) field more so than psychology. However,
  the methods used in HCI can be traced back to psychological research
  methods, and HCI can essentially be characterized as a field of
  applied psychology.

  2. Playtest research is typically associated with large, structured
  questionnaire studies that focus on the first hour of game play. The
  sample sizes are relatively large (25-35 people) in order to be able
  to compute reliable percentages (the second sketch after this list
  illustrates why samples of this size are enough). Each person gets
  just over 60 minutes to play the game and answer questions
  individually on a highly structured questionnaire. Participants rate
  the quality of the game and provide open-ended feedback on a wide
  variety of general and genre-specific questions. Playtest methods are
  best used to gauge participants' attitudes, preferences, and some
  kinds of behavior, such as how players fare against difficulty
  levels. This form of testing has a long history in psychology in the
  fields of attitudinal research and judgment and decision-making.

  3. Reviews are just another version of feedback from a games industry
  professional. However, these reviews are potentially more valuable
  because the reviewers are user-testing specialists, who arguably have
  more direct contact with real gamers playing games than other game
  professionals. Their entire job is to watch users play games and
  listen to their complaints and praises. Furthermore, teams often
  repeat mistakes that other games have made, and experienced
  user-testing specialists can help teams avoid such "known" mistakes.
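
Below is a minimal sketch (my own illustration; the tasks, timings,
and script are invented and are not Microsoft's actual tools) of the
kind of per-task summary a small-sample usability study might produce
from observed task times and give-up rates:

    # Hypothetical usability-session summary; all data are invented.
    from statistics import median

    # Seconds each participant took per task; None means they gave up.
    observations = {
        "start a quick race": [35, 42, 28, 51, 39, 44],
        "change car color":   [180, None, 150, 210, None, 165],
    }

    for task, times in observations.items():
        completed = [t for t in times if t is not None]
        gave_up = 1 - len(completed) / len(times)
        print(f"{task}: median {median(completed)}s to complete, "
              f"{gave_up:.0%} gave up (n={len(times)})")

With 6-9 participants, the point of such numbers isn't statistical
precision; it's flagging which specific tasks cause trouble.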
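
And here is a second sketch (again my own, with hypothetical numbers,
using the standard normal-approximation confidence interval for a
proportion) of why a playtest sample of 25-35 people supports coarse
but usable percentages:

    # Textbook margin-of-error formula; nothing here is specific to
    # Microsoft's playtest process.
    import math

    def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
        """95% margin of error for proportion p_hat observed over n people."""
        return z * math.sqrt(p_hat * (1 - p_hat) / n)

    # Suppose 24 of 30 playtesters rate the driving model "fun" (80%).
    print(f"Observed: {24/30:.0%} +/- {margin_of_error(24/30, 30):.0%}")

    for n in (10, 30, 100, 300):
        print(f"n={n:3d}: worst-case margin +/- "
              f"{margin_of_error(0.5, n):.0%}")

Thirty people gives roughly 80% +/- 14% for that question: coarse, but
enough to separate "most players liked it" from "players were split,"
which is usually the decision a designer actually faces. Because the
margin shrinks only with the square root of the sample size, cutting
it by a factor of three would take nine times as many participants,
which is why 25-35 is a sensible balance between cost and reliability.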

The result of each of these services is a report sent to the team that
meticulously documents the problems, along with recommendations on how
to fix them. Our stance is that the development teams are the ones who
decide if and how to fix the problems.

One noticeable absence in our services is "focus groups." Our belief
(supported by research on focus groups) is that focus groups are
excellent tools for generation (e.g., coming up with new ideas,
processes, etc.), but are not very good for evaluation (e.g.,
determining whether people like something or not). The group nature of
the task interferes with getting individual opinions, which is
essential for the ability to quantify the evaluations.

How this feedback system fares on the four criteria for a good feedback
system. So, how does the way we do user-testing at Microsoft stack up to
the four criteria? Pretty well (in my humble opinion). A recap of the
criteria, and my evaluation of how we do on each, is given below.

   1.  The feedback should accurately represent the opinions of the
   target gamers. We supply reasonably accurate, trustworthy feedback to
   teams, because:
     
     a. We have a large database of gamers (~12,000) in the Seattle
     metro area, who play every kind of game. So we can almost always
     bring the right kind of gamers for each kind of game.
     
     b. We hire only people with strong backgrounds in experimental or
     applied psychology in order to minimize the biases of the
     user-testing specialist. We also have a rigid review process for
     all materials that get presented to the user.
     
     c. We thoroughly document our findings and recommendations, and
     test each product repeatedly, which allows us to check the validity
     of both our work and the team's fixes over multiple tests and
     multiple participants.
     
   2.  The feedback should arrive in time for the designer to use it. We
   are relatively fast at supplying feedback. The entire process takes
   about six days to get some initial feedback, and about 11-14 days for
   a full report. If the tests are well planned, they can happen at key
   milestones to maximize the timeliness of the feedback.
          
   3.  The feedback should be sufficiently granular for the designer to
   take action on it. The level of feedback in the reports is extremely
   granular, because the tests are designed to yield granular,
   actionable findings. The user-testing specialist typically comments
   at the level of which cars or which tracks caused problems, or what
   wording in the UI caused problems. The recommendations are similarly
   specific. Usability tests typically yield more than 40
   recommendations, whereas playtests tend to have anywhere from 10 to
   30 items to address.
     
   4.  The feedback should be relatively easy to get. The feedback is
   relatively easy for the dev team to get--they have a user-testing
   lead on their game, and that person sets up tests for them and
   funnels them the results. The feedback is also relatively
   inexpensive when compared to the multi-million dollar budgets of
   modern games. The total cost of our operation is "substantial," but
   economies of scale make the cost per game relatively small.

Vital statistics on the user-testing group at Microsoft

Group history: the usability portion of the user-testing group has been
around in a limited fashion since Microsoft entered the games business
in earnest, in 1995. Funding was at a very low level (one usability
contractor and 30+ titles to support) until the Games Group began
investing more heavily in 1998 with the introduction of the Playtest
group. The usability and playtest groups merged to form the
user-testing group in 2000. The current user-testing processes have
been relatively
stable since 1997 (usability) and 1998 (playtest).

Current composition of user-testing group: 15 FT user-testing
specialists, 3-5 contract specialists, 3 FT support staff. Almost all
user-testing specialists have either two or more years of graduate
training in experimental psychology or equivalent experience in
applied psychology, and all are gamers. All four founding members of
the user-testing group are still with the group.

Amount of work: In 2001, we tested approximately 6,500 participants in
235 different tests, on about 70 different games. 23 of those games
were non-Microsoft products. In 2002, we expect to produce about 50
percent more than we did in 2001. From 1997 to January 2002, the group
produced 658 reports on 114 products (53 Microsoft and 61
non-Microsoft) representing more than 15,000 hours of consumer
reactions to games prior to their release.

Special thanks to Randy Pagulayan and Ramon Romero for their help
editing this article.
--<cut>--

--
J C Lawrence                
---------(*)                Satan, oscillate my metallic sonatas. 
claw at kanga.nu               He lived as a devil, eh?		  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.
_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev