# DRAFT: Review: “Ego Depletion”, JDW 2010

Michael Stone, April 20, 2013

Earlier this weekend, I read a fun paper (SAGE) by Job, Dweck, and Walton questioning the “strength” model of willpower, I think as part of a much larger research program on whether and how our “self-theories” influence, e.g., our ability to perform or to persist in performing difficult, frustrating, or tiring tasks. Several things about the paper appealed to me:

• it asks a valuable research question,
• it uses an initial descriptive study to motivate a randomized experiment, and
• it considers some alternate explanations of the available data.

However, there were also some parts that bothered or confused me. As a result, in the hope that my bothers and confusions may be of some use to others, I’ve written up several “bug reports” below.

Okay, here we go:

• Selection bias: “People” is a great and accessible gender-neutral plural noun, but I nevertheless wish it were clearer in the abstract and conclusion that the “people” whom the results most directly describe are mostly (all?) undergraduate university students.

• Ethical considerations: I think it much more likely than not that the described research was approved by the Stanford or University of Zurich IRBs as minimal-risk human subjects research but I still wish that the paper made it clear one way or the other, e.g., with a link to a record of the approved protocol. (Note: the 2012 PSS submission guidelines now request this information but maybe they didn’t in 2009-2010?)

• Modeling equations: the JDW authors report that they used logistic hierarchical linear models to analyze their data, and they provide some useful coding information, but the paper doesn’t contain the modeling equations for any of the models they fit, which makes it needlessly difficult to understand the meaning of the regression coefficients they report. (Note: one of my favorite books, ALDA, has lots of really nice examples of how to write about hierarchical linear models like the ones that I imagine were used here.)
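
For example (a hypothetical specification of my own, not something taken from the paper), an ALDA-style write-up of a two-level logistic model for the Stroop data might read: let $Y_{ij} = 1$ if participant $j$ made a mistake on trial $i$, and model

$$\operatorname{logit} \Pr(Y_{ij} = 1) = \beta_0 + \beta_1 \text{Depleting}_j + \beta_2 \text{Theory}_j + \beta_3 (\text{Depleting}_j \times \text{Theory}_j) + u_j, \qquad u_j \sim N(0, \sigma_u^2),$$

where $\text{Depleting}_j$ indicates the depletion condition, $\text{Theory}_j$ is the (centered) implicit-theories score, and $u_j$ is a participant-level random intercept. With the equations written out like this, it would be immediately clear that, e.g., $\beta_3$ is the interaction coefficient that the whole “limited” vs. “nonlimited” debate turns on.
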

• Effect size: the research agenda underlying all four reported study designs seems to posit that particular patterns of differences between treatment- and control-group Stroop-test accuracy statistics can falsify (or at least cast doubt on) the Baumeister et al. “strength model of self-control”, but JDW do not explain how their measured effect size should influence our belief in the “strength” model. (Note: the 2012 PSS submission guidelines now also require this information.)

• Graphical integrity: the JDW paper contains five plots, numbered “Figure 1”, “Figure 2”, “Figure 3-A”, “Figure 3-B”, and “Figure 3-C”. Examining just plots “1” and “3-B”:

• Plot 1: According to the legend and axis labels for this plot, each record in the underlying dataset has been labeled with one of five conditions; namely: “Nondepleting + Nonlimited-Resource Theory”, “Nondepleting + Limited-Resource Theory”, “Depleting + Nonlimited-Resource Theory”, “Depleting + Limited-Resource Theory”, or “Not Labeled”, but how was this labeling done?

According to the fine print in the figure caption,

“The limited-resource-theory group represents participants 1 standard deviation above the mean on the implicit-theories measure. The nonlimited-resource-theory group represents participants 1 standard deviation below the mean on the implicit-theories measure.”

There are a couple of problems here:

1. Ambiguity: I think the claim in the figure caption is intended to mean something like “The limited-resource-theory group represents participants whose score on the implicit-theories measure was at least one standard deviation above the mean” but I can’t tell for sure.

2. Distributional assumptions: Labeling participants by z-scores implicitly assumes that the underlying distribution of scores is at least approximately normal, but no evidence is given that it is. Why should I believe it to be true?

3. Power: Assuming that I read the caption correctly, how many observations were thrown out as a result of being unlabeled?
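
For concreteness, here is a back-of-the-envelope check in Python. The numbers are hypothetical (the paper doesn’t report them): if the scores really were normal, and if the caption means “at least 1 SD above/below the mean,” then roughly two-thirds of the participants would land in neither group.

```python
# How many participants would the caption's labeling rule discard,
# ASSUMING normally distributed implicit-theories scores?
# (Hypothetical: the paper does not report these figures.)
from statistics import NormalDist

z = NormalDist()  # standard normal

# Under the "at least 1 SD above/below the mean" reading:
labeled_high = 1 - z.cdf(1)   # limited-resource-theory group
labeled_low = z.cdf(-1)       # nonlimited-resource-theory group
unlabeled = 1 - labeled_high - labeled_low

print(f"labeled high: {labeled_high:.1%}")  # ~15.9%
print(f"labeled low:  {labeled_low:.1%}")   # ~15.9%
print(f"unlabeled:    {unlabeled:.1%}")     # ~68.3%
```

If that reading is right, the plot summarizes only about a third of the sample, which seems worth a sentence in the caption.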

Zooming out, though, there are even bigger problems:

1. Bad summarization: This plot shows only one number for each condition, yet each condition ostensibly labels many records. In short: where are the box plots?

2. Unnecessary summarization: why group the participants at all? Why not just draw a scatterplot of every participant’s mistake frequency as a function of their implicit-theories-measure score, perhaps faceted or colored by depletion condition? Then you could plot the fitted models as density heat-maps in the background, thereby revealing outliers or other model-fitting problems! (Note: Hadley Wickham’s ggplot2 package makes this kind of plotting super fun and easy!)

• Plot 3-B: All five plots have dependent measures whose labels begin with the prefix “Probability of a Mistake”, and four of the five use ratio scales for these measures: that is, their scales cover intervals ranging from $[0, 0.08]$ to $[0, 0.12]$. Unlike all the other plots, Plot 3-B’s scale is presented as an interval scale, covering $[0.20, 0.45]$. Why? (Just to devote more ink to showing the measured between-group differences? Or is there some deeper confusion about what scale matters for measuring effect sizes?)

• Traceability: We’ve seen some otherwise interesting research brought down recently by simple slips, e.g., in calculation, model-fitting, and data entry. In the software world, we try to control for this sort of problem in a bunch of ways, most notably with open source. Anyway, as many others have requested, perhaps it’s time to start providing links to the raw data and to the intermediate analysis results as part of the published supplemental materials? (Also, if the data are already up and I just couldn’t see them because of the SAGE paywall, then maybe the issue is the need for more open access, perhaps in the style of the Episciences Project (intro) or of PLOS ONE (which I see that the JDW authors are already exploring; yay!)?)

• Next, a few smaller issues:

• Reproducibility: What font + text was on the pages used for the “stimulus detection” task? (It would only take a few words to say…)

• Validity: how does color-blindness affect the results derived from the Stroop task performance measurements?

• Blinding: Were the randomized controlled trials also blinded?

• Finally, a review of the paper against Simmons et al.’s researcher-degrees-of-freedom checklist (note 1: introduced to me by Shauna; thanks, Shauna!; note 2: also, amusingly, published in PSS!):

• Simmons #1: “Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.”

• No data collection stopping rules were included in the paper.
• Simmons #2: “Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.”

• Per-cell observation counts were not included in the paper.
• Simmons #3: “Authors must list all variables collected in a study.”

• I believe that all of the collected variables may be reported in the supplemental material published alongside the original paper, but only a subset were reported in the paper itself.
• Simmons #4: “Authors must report all experimental conditions, including failed manipulations.”

• I don’t see a claim that all experimental conditions have been reported.
• Simmons #5: “If observations are eliminated, authors must also report what the statistical results are if those observations are included.”

• I see some effort here, e.g., the authors observe that the 59% of participants who did not complete their longitudinal study were demographically similar to those who continued.
• Simmons #6: “If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.”

• Some effort is also made here, particularly in Note #1. (That being said, Note #2 and the “speed/accuracy tradeoff” covariates seem just like what Simmons et al. are asking about…)