This is the subject of a recently uploaded ArXiv preprint (filed under Physics and Society). The author, physicist and statistician Sherry Towers, carries out a detailed analysis of the productivity and career paths of a sample of 57 postdoctoral researchers (48 males and 9 females) in high-energy physics, working on the Run II D0 experiment at the Fermi National Accelerator Laboratory (Fermilab) during the 1998-2006 period.
Upon comparing the relative productivities (assessed through co-authorship of internal progress reports) of female and male postdoctoral researchers, as well as the rates at which researchers in both groups were invited to speak at conferences and eventually landed university faculty appointments, the paper claims evidence of systematic gender bias.
The above is quite an indictment, one to be taken very seriously, especially for a field of scientific inquiry, namely experimental high-energy physics, that prides itself on being as egalitarian and gender-blind as it gets.
The science enterprise can ill afford to exclude from participation more than half of the population, if it is to survive and thrive. Although I cannot really say that I have ever witnessed a blatant case of gender discrimination (GD) on the job, I have certainly no trouble believing that it exists. I have heard competent, brilliant female colleagues lament it; and there is really no reason why my discipline, physics, should be immune from something that is otherwise so pervasive throughout society.
But is GD really practiced so openly that it can be detected quantitatively as easily as suggested in this paper, based on a sample of nine? Is such a gloomy assessment of the current state of affairs truly warranted by the data?
I am not convinced in the least by the case expounded by the author. Some key assumptions underlying her analysis of the data are overly simplistic; I am puzzled by her claim of statistical significance of the findings, given the small size of the sample; most importantly, though, I find the charge of GD, based upon the evidence at hand, exceedingly weak.
I may well be wrong, of course, in which case I thank in advance those who will set me straight. In the meantime, I am going to state my reasons for being skeptical.
Of the 48 males in the sample, sixteen went on to take faculty jobs at the end of their postdoctoral appointments; four of the nine female researchers did the same.
The author contends that female researchers in the sample proved, on average, significantly more productive than their male counterparts, during their postdoctoral tenure. Yet, they reaped significantly fewer faculty jobs than those to which their high productivity would have entitled them. Towers points to this as evidence of GD.
She also identifies conference presentations as a mechanism whereby Fermilab exercises its bias against women, unfairly enhancing the chances of less productive male researchers of landing a faculty position, as they allegedly enjoy a far greater share of allotted speaking slots than their female colleagues. Towers maintains that a more equitable allocation of conference presentations among researchers in the sample (i.e., one commensurate with their individual productivities) would likely have resulted in additional female hires. A consequence of this reasoning is that fair hiring practices should ultimately reflect differences in productivity among researchers.
The connection between conference presentations and faculty hires may be worth exploring in and of itself, but is conceptually separate from the (more important) one of female career advancement. In order to assess the merit of any charge of GD vis-a-vis hiring, the first question that one has to ask is the following:
Has the productivity of female researchers in Towers’ sample been insufficiently rewarded, based on the rate at which they landed faculty jobs? Is a gender-blind hiring scenario (i.e., one in which researchers are preferentially hired based on productivity alone) statistically irreconcilable with the evidence presented in her paper?
The first, obvious consideration is that, making allowance for the small number of female researchers in Towers’ sample, female researchers were hired as faculty at about a 33% higher rate than their male colleagues (4/9 ≈ 44% versus 16/48 ≈ 33%). Towers acknowledges this fact on page 7 of her paper, quickly pointing out that there is no “statistically significant difference” between the fractions 4/9 and 16/48 (as we shall see, Towers’ case of GD in hiring rests on the alleged “statistical significance” of the difference between 4/9 and 6/9). Be that as it may, Towers claims that their noticeably higher productivity should have been rewarded by an even greater number of female faculty hires.
It seems fair to state that her case is not self-evident. It necessarily requires a careful analysis of the data, as the raw numbers do not immediately raise suspicions of discrimination against women.
Towers introduces a measure of research productivity, P_l, equal to the sum of all Fermilab internal progress reports (papers) co-authored by the l-th researcher (male or female). The author reports the average values of P for male and female researchers (Table 1):
P_F = 1.70 +/- 0.39 (average yearly productivity for female scientists)
P_M = 1.38 +/- 0.17 (average yearly productivity for male scientists)
I have no idea what those reports are, and have never read (and likely never will read) any of them; it seems at least debatable whether a measure of this type is a reliable indicator of talent (or even actual productivity). But hey, what do I know? I am no high-energy physicist after all, and the above measure is likely to be just as imperfect as any other. Moreover, it satisfies a basic criterion, namely that it has no obvious intrinsic built-in bias (much less a gender-specific one). As we shall see, this issue turns out to matter very little, in the end.
The above average values hardly point to a “statistically significant” difference given their relatively large uncertainties, especially if the small female sample size is taken into account. More cogent information must therefore be contained in the actual productivity distributions which, as Towers maintains, are very different for males and females.
No actual data are shown in the paper; only a schematic description of such distributions is offered (page 9). Specifically, Towers divides the sample into four distinct groups:
a) One consisting of half of all males (24 researchers) producing “almost nothing”
b) A “moderately productive” one, comprising “slightly less than half” of all males
c) A group comprising all the women in the sample, of roughly equal, “high productivity”
d) A small group of “extremely productive” males
(incidentally: how do you pick your guys, high energy physicists?).
The author contends, on the one hand, that the 24 “productive males” are at least as productive as the least productive female (page 7), and that the productivity distribution for female researchers is relatively narrow (page 9); on the other hand, she clearly makes the point that the productivity of the nine female researchers is, as a whole, “significantly above” that of the group of “moderately productive” males. Of course, it would have been a heck of a lot easier if she had actually shown the data, but it seems reasonable to conclude that, aside from a few “extremely productive” male outliers, the productivity distributions of productive male and female researchers largely overlap, with the female one slightly but noticeably displaced toward higher values.
Towers’ claim of GD boils down to the following observation: four hires among women is significantly fewer than what strict application of productivity criteria would dictate. According to her, a gender-blind hiring process ought to have resulted in two additional positions for female researchers (page 15), a difference that the statistician Towers evidently deems too large to be accounted for by a mere statistical fluctuation.
Now, if GD were mostly at the root of the above outcome, wouldn’t you generally expect that every “extremely productive” male would get his faculty position, and that those positions (two, or however many) which did not go to deserving women would go preferentially to male researchers in the “moderately productive” group? After all, the pool of productive male researchers was large enough to fill, in principle, all twenty available positions; even if the hiring process is indeed biased in favor of male applicants, should the “productive” ones not retain an edge over their “unproductive” colleagues?
This is where Towers’ case begins to crack. From her paper we learn that just eleven of the twenty-four productive males landed faculty appointments (page 7). It is not even clear from Towers’ text whether all the “extremely productive” ones were successful (I suspect not). What is clear is that in almost a third of all male hires, “unproductive” individuals were selected. In other words, not only five out of nine “very productive” female researchers, but as many as thirteen “moderately” to “extremely” productive male researchers, and for that matter even nineteen “unproductive” ones, appear to have been discarded from consideration, for reasons ostensibly unrelated to what Towers refers to as “productivity”. If we restrict ourselves to “productive” researchers, then equal proportions of women and men (five out of nine women and thirteen out of twenty-four men) may rightfully allege to have been victims of (gender-unrelated) “discrimination”, based on Towers’ productivity argument.
Who should have got those jobs?
Obviously, there exists another explanation for the above outcome, one that does not involve any discrimination. In general, raw productivity, no matter how measured, will be but one of many factors intervening in a hiring decision, and quite likely not the most important one. This is why, time and again, an “extremely productive” short-listed candidate walks away empty-handed, and one “less productive” (on paper, that is) gets the job. Given that Towers’ argument rests on statistical considerations, is it possible to make a simple model of the hiring process, in which the different productivities of the various candidates are taken into account probabilistically?
Let us stick with Towers’ sample. A very simple model distribution, consistent with the information given by Towers, is the one shown here:
The following simplifying assumptions are made:
1) All individuals in the same group are equally productive.
2) A clear-cut difference in productivity exists among all groups; in particular, every female researcher is assumed to be 25% more productive than every “moderately productive” male (an assumption manifestly at odds with Towers’ admission of substantial overlap between the productivity distributions of “moderately productive” male, and female researchers, but one which we shall make nonetheless, for simplicity). The productivity of “unproductive” males is set to zero.
3) The “moderately productive” group includes 20 individuals.
4) The average female productivity is 2.0, that of the males 1.0 (in other words, we are taking a value on the high end of the uncertainty range for the females, and on the low end for the males). As a result, the few “extremely productive” males are twice as productive as the females.
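As a quick sanity check on assumptions 1–4 (a few lines of Python; all numbers come directly from the assumptions above, while the variable names are mine), the implied group productivities are mutually consistent:

```python
# group sizes: unproductive males, "moderately productive" males,
# females, "extremely productive" males (48 males + 9 females in all)
n_unprod, n_moderate, n_female, n_extreme = 24, 20, 9, 4

p_female = 2.0                 # assumed average female productivity
p_moderate = p_female / 1.25   # each female 25% more productive (assumption 2)
p_extreme = 2.0 * p_female     # twice as productive as the females

# male average over all 48 males, with the unproductive ones at zero
male_avg = (n_unprod * 0.0 + n_moderate * p_moderate
            + n_extreme * p_extreme) / (n_unprod + n_moderate + n_extreme)
assert abs(p_moderate - 1.6) < 1e-9
assert abs(male_avg - 1.0) < 1e-9   # matches the assumed male average of 1.0
```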
In the absence of actual data, the above, crude model distribution is purposefully constructed to be most favorable to the author’s thesis, i.e., all of the aspects upon which Towers’ argument hinges are deliberately amplified. For example, female researchers are considered as a whole twice as productive, and each individually more productive than over 91% of males; the “one-sided significance test probability” P(female > male) (see page 8 of Towers’ paper; incidentally, this is not particularly informative a quantity, as it is insensitive to the magnitudes of relative differences in productivity) takes on a value of 0.99999994 for the above distribution, as opposed to 0.98 for Towers’ sample (Table I of her paper).
Yet, it is easy to show that, even with the above distribution, on purely probabilistic grounds there is no reason to attribute to anything other than a statistical fluctuation an outcome in which four of the twenty available positions go to women.
One way to gain some additional quantitative insight consists of “simulating the hiring season” on a computer. The only discriminating factor among individuals is assumed to be their productivity, i.e., the process is “blind” to anything else (to phrase it differently, any other factor only exerts a random influence). We begin by making the simplest possible assumption, namely that a difference in productivity proportionally translates into a different probability of being hired. That is, a researcher who is, say, 20% more productive than a colleague is also 20% more likely than the latter to land a faculty job. We thus assign an a priori hiring probability to each researcher proportional to his/her productivity, based on the above model distribution, namely 0 for the unproductive researchers, 1.6 for each of the twenty members of the “moderately productive” group, 2.0 for each of the nine “very productive” females, and 4.0 for each of the four “extremely productive” males. Note how, by making a priori probabilities proportional to productivity, we build into the model a “fairness assumption”, i.e., none of the 24 “slackers” will get a job.
We then perform a Monte Carlo simulation, i.e., use a common random number generator to sample 20 individuals from the pool, each according to his/her own probability; any one candidate may be drawn more than once (multiple offers), but only one offer is accepted, i.e., repeat draws of an already-hired candidate are simply ignored. The sampling stops as soon as all twenty positions are filled.
On repeating the procedure illustrated above many times, one can keep track of how many women land jobs each time, and construct a frequency histogram. The numerical implementation is straightforward.
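Indeed, the whole procedure fits in a few lines of Python. The sketch below is mine, not Towers’ (the function name, trial count, and random seed are arbitrary choices); the group sizes and weights are those of the model distribution above. It estimates the cumulative probability f(n) that n or fewer of the nine women are hired:

```python
import random
from itertools import accumulate
from collections import Counter

def simulate_hiring(n_trials=50_000, n_jobs=20, seed=1):
    """Monte Carlo "hiring season": repeated weighted draws from the
    candidate pool; drawing an already-hired candidate models a
    declined duplicate offer and is simply ignored."""
    rng = random.Random(seed)
    # (is_female, weight): 20 "moderately productive" males at 1.6,
    # 9 females at 2.0, 4 "extremely productive" males at 4.0; the
    # 24 unproductive males have weight 0 and can never be hired
    pool = [(False, 1.6)] * 20 + [(True, 2.0)] * 9 + [(False, 4.0)] * 4
    cum_w = list(accumulate(w for _, w in pool))
    counts = Counter()
    for _ in range(n_trials):
        hired = set()
        while len(hired) < n_jobs:
            hired.add(rng.choices(range(len(pool)), cum_weights=cum_w)[0])
        counts[sum(1 for i in hired if pool[i][0])] += 1
    # f(n) = P(n or fewer of the nine women land a job)
    f, running = {}, 0
    for n in range(10):
        running += counts[n]
        f[n] = running / n_trials
    return f
```

With these inputs, f(4) and f(5) can be compared directly with the figures discussed in the text.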
Before discussing the results, it is worth restating that the above distribution, utilized to carry out the simulation, is just a model one. The only reason for using it, is that no raw data are provided in Towers’ paper. Such a distribution, however, is based on all the (few) numbers, and the qualitative description given by Towers, with simplifying assumptions biasing it in favor of female applicants. It thus seems very unlikely that the results and conclusions would be drastically altered, if the actual distribution were used. If anything, the use of real data can be expected to weaken Towers’ case.
What is shown in the above figure is the computed probability f(n) that a hiring season will conclude with the hire of n or fewer of the nine women on the job market. Obviously, n can be as low as zero and as high as 9, with f(9)=1. As we can see, based on productivity alone, as much as 44% of the time only five or fewer women can be expected to be hired, and four or fewer 18% of the time (in case you are wondering: the probability that at least one of the four “geniuses” will not be hired is 46%; 9% of the time at least two of them will not be hired… so much for “publish or perish”…).
An event that occurs nearly one in five times (i.e., with an 18% frequency) cannot in fairness be regarded as “rare”; more importantly, if only slightly more sensible assumptions are built into the model, the predicted frequency of occurrence of what Towers regards as a suspiciously disappointing hiring outcome for female researchers (i.e., four or fewer) rises considerably. Thus, there really seems to be no statistical basis whatsoever to allege discrimination toward women in the hiring process, at least based on Towers’ description of her own data, which do not appear statistically inconsistent with a gender-blind operation. Based on this conclusion, one can infer that any imbalance in the allocation among researchers of conference presentations ostensibly had, at least on average, no measurable effect on the careers of the researchers.
Before exploring the (narrower) issue of equity in the allocation of conference presentations, it is worth clarifying and/or restating the following:
1) Conference presentations, in and of themselves, are only a modest “reward” for one’s scientific accomplishments. The main objective of a postdoctoral researcher is not speaking at conferences, but rather landing a university faculty position. Towers’ case of GD is ultimately about jobs, not talks. That is, talks are only relevant insofar as they can be shown to affect the likelihood of being hired. A charge of GD based on conference presentations alone is not likely to be taken very seriously.
2) As usual, correlation does not imply causation. Very often, young researchers who speak at conferences are on their way to faculty jobs anyway, typically as a result of strong support from senior mentors (which is also likely why they speak at the conference in the first place). Thus, conference presentations merely reaffirm an existing state of affairs. Conversely, a conference presentation is unlikely to reverse the fortunes of anyone enjoying only lukewarm support from his/her postdoctoral advisor.
Towers’ data would appear to indicate that opportunities to speak at conferences were mostly granted to male researchers, but the validity of this contention is difficult to assess independently, as no raw numbers of conference presentations for female and male researchers are provided. Towers’ own “conference reward ratio” is misleading, as it assigns disproportionate weight to talks given by “unproductive” researchers. A more in-depth analysis may well show that, much like in the case of hires, “productive” male researchers were put at an equal or even greater disadvantage, thereby undermining the claim of GD.
But the most striking result by far is that the correlation between conference presentations and faculty appointments is virtually non-existent for male researchers (Table 1). In other words, while the lion’s share of speaking slots may have gone to male researchers, they derived no measurable benefit from conference presentations (Towers herself is at a loss to explain it).
This suggests that the issue of conference presentations is likely just a “red herring”, largely irrelevant to hiring. If one is to pursue this route any further, badly needed are a clearer sense of the importance attributed to presentations by postdoctoral researchers themselves, as well as a better understanding of the process by which conference slots are allocated.
It is not at all implausible, for example, that a productive, confident researcher with a strong CV, may not feel particularly urgent a need to speak at a conference. In fact, it is not at all rare for speaking opportunities to go to postdoctoral scientists with weaker research records, in an attempt to help them jump-start difficult job searches (this normally happens with the consensus of the group).
Generally speaking, it is clear that any action (deliberate or not) whose effect is that of depriving a researcher of the proper recognition for the work accomplished (including a chance to showcase in public his/her speaking ability) is unacceptable, and should therefore be prevented and remedied. At the same time, given that the impact of conference presentations on career advancement is unclear (to say the least), serious allegations such as “gender discrimination” and/or possible violation of Title IX regulations seem unwarranted, if solely or primarily based on conference presentations.
At the end of her paper (appendix), Towers laments the unwillingness of the administration of Fermilab to follow up her charge of discrimination against women, based on her statistical analysis, with an investigation. With all due respect for the academic and the statistician, I myself would not find in her data, and her analysis thereof, sufficient, convincing evidence of gender discrimination, even after accepting her many questionable premises and extrapolations (e.g., the notion that an “effective”, “productive” researcher is one who co-authors many reports, the ensuing inference that almost half of Fermilab postdocs spend their day surfing the web, the suggestion that, despite their lackluster research record, quite a few of these “parasites” will still manage to impress gullible search committees just by giving a conference presentation or two… hey, don’t get me wrong, I am all for making fun of high energy physicists but, in fairness, this borders on the ridiculous… it reads more like Dilbert than academia, or science).
Naturally, search committees, being composed of humans, are fallible. It appears from the data, however, that if mistakes were made (i.e., the possible hires of unproductive researchers), both the female and male populations in the sample were equally affected.
Now, the absence of evidence of hiring bias in this case is obviously not the same as saying that there is no gender discrimination in particle physics, or in all of physics for that matter. Simply, it is not nearly as blatant as Towers suggests. In any case, a statistically convincing case has, in my opinion, not been built this time, or at least not yet. Quite possibly, as the size of the sample at the disposal of researchers grows, clearer evidence will emerge and Towers’ thesis will be vindicated. However, I find her contention that a measure of productivity based on raw paper count should directly correlate with hiring incredibly naive. Most reasonable scientists (male and female) would readily agree that evaluating a faculty candidate is a much more complex and multi-faceted proposition than that. Her apparent eagerness to build a case of discrimination upon such a shaky foundation is surprising.
The reason for the current under-representation of women at the faculty level, as it emerges from the above examination, seems to be merely the sheer outnumbering of women by men, a well-known, long-recognized problem of modern science. Doubtless, discrimination of various forms has a lot to do with that; but I suspect that the most devastating kind is at work long before women reach college age. I am afraid it is very hard to find a remedy that late in the game.
 No information is offered in the study, as to what fractions of female and male researchers in the sample actually did seek faculty appointments in the first place, how aggressively, how many offers were turned down, etc (only a rather superficial comment is made, in a footnote at the bottom of page 7). It is not even clear to me whether the author regards any of that as relevant. These are clearly highly non-trivial aspects, if one is intent on building a case of GD within a certain professional sector. Everything else being equal, various societal pressures are widely believed to affect professional choices of women to a greater degree than men. Family reasons, for example, may induce a woman not to apply for a job, or turn down a job offer, more often than a man. The paper comes across as tacitly making the assumption that both male and female researchers ought to be regarded as equally free to pursue their career ambitions, and that any female under-representation at the faculty level, is largely attributable to more or less deliberate discrimination at hiring time, on the part of the scientific community.
 An invitation to speak at a conference constitutes an explicit acknowledgment of one’s leading contribution to a scientific project. Conference presentations are believed to be important for young physicists, typically more so than mere co-authorship of the article describing the research presented, especially in a field like experimental high energy physics, where collaborations involve hundreds of researchers, and spotting talent and creativity can be a complex proposition. By speaking at a conference, a researcher gains exposure to the broader community, conceivably strengthening his/her chances of a successful faculty job search. However, the actual impact of conference presentations on one’s job search is far from clear. Towers’ own data, for example (Table I), show at best a weak correlation.
 These numbers mean that most of the researchers in the sample co-authored between one and two internal reports per year. On page 11, Towers states that it is “not unusual” for a report to have “in excess of 20 to 30 authors”. Towers seemingly attributes no importance to the number of co-authors, i.e., each report is counted as one paper, for the purpose of assessing the productivity of a researcher.
Also, a distinction is made by the author, between “physics” papers, as opposed to “service” papers, the former focusing on scientific advances (presumably of greater prestige and impact on one’s career), the latter having to do instead with the overall operation of the facility (i.e., the laboratory), and therefore generally less qualifying from a professional standpoint. The author carries out separate statistical analyses for the two different types of papers (and related conference presentations), as well as a “global” one, in which all papers and conferences are counted equally (this is the one given in the text). The average “physics” productivity of males and females is identical within statistical fluctuations (0.78+/-0.22 for females and 0.72+/-0.10 for males).
 That’s right, Towers’ case of GD is based on a deviation of two from a target of six in a sample of nine… and what is the basis for Towers’ estimate of six, anyway? One would expect her to have simply ranked all researchers according to her measure of productivity (after all, her entire case is one of discrimination against productive individuals), and that six would be just the number of women ranked among the top twenty most productive researchers. Surprisingly, however, that is not how Towers arrived at her estimate (in fact, that information is not given at all in her paper). Instead, she relies on her own parametric model of career advancement, which includes, besides productivity, variables such as “socialization”, and is different for males and females (page 12). Aside from the doubtful reliability of such an approach, why use a model when actual productivity data are at hand?
 My personal take is that Towers’ “productivity measure” does not really measure much of anything relevant to faculty hiring (e.g., overall scientific ability). This is in line with its relatively weak correlation with actual hiring (Table 1 of Towers’ paper), and explains the “anomaly” of so many supposedly “unproductive” researchers, who were deemed suitable faculty candidates by (reputedly competent) hiring committees. More generally, the only lesson to be learned from Towers’ work may be that the evaluation of the activity of a scientist through a mere paper count is misguided and misleading. This is, of course, neither novel nor surprising a concept.
 The simplest, most obvious correction consists of abandoning the draconian assumption of zero productivity for the twenty-four less productive researchers, assuming instead a small but non-zero value (say, ten times smaller than that of the “moderately productive” males). Adjusting the weight of the “extremely productive” males to 3.04 (in order to keep the male average at 1.0), and leaving the weights of the remaining two groups unchanged, immediately brings to 27% the probability that four or fewer women are hired, i.e., over one in four times (note that we are still assuming female researchers to be on average twice as productive as male researchers; on taking a less severe 1.7:1.3 female vs male average productivity ratio, the four-or-fewer outcome probability goes up even further, to 34%, i.e., over one in three times). The relatively large size of the “unproductive” group has a significant effect on the outcome, even if the individual hiring probability is small.
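 This corrected scenario is easy to check with a slight generalization of the Monte Carlo procedure described in the main text (the code below is my own sketch, not the author’s; the function name and its defaults are arbitrary). Each group is passed as a (count, weight, is_female) triple:

```python
import random
from itertools import accumulate

def prob_at_most(k_women, groups, n_jobs=20, n_trials=20_000, seed=2):
    """Estimate P(at most k_women women are hired). `groups` is a list
    of (count, weight, is_female) triples; hires are successive weighted
    draws, ignoring repeat draws of an already-hired candidate."""
    rng = random.Random(seed)
    pool = [(is_f, w) for count, w, is_f in groups for _ in range(count)]
    cum_w = list(accumulate(w for _, w in pool))
    hits = 0
    for _ in range(n_trials):
        hired = set()
        while len(hired) < n_jobs:
            hired.add(rng.choices(range(len(pool)), cum_weights=cum_w)[0])
        if sum(1 for i in hired if pool[i][0]) <= k_women:
            hits += 1
    return hits / n_trials

# unproductive males at 1/10 of the "moderately productive" weight;
# "extremely productive" weight rescaled to 3.04 so that the male
# average stays at 1.0, as described above
assert abs((24 * 0.16 + 20 * 1.6 + 4 * 3.04) / 48 - 1.0) < 1e-9
groups = [(24, 0.16, False), (20, 1.6, False), (4, 3.04, False), (9, 2.0, True)]
p4 = prob_at_most(4, groups)   # to be compared with the ~27% quoted above
```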
Now, as anyone who has served on a physics search committee knows, the odds of landing a position do not increase linearly with one’s productivity. In fact, “number of papers” essentially ceases to be relevant above a minimal threshold, below which a candidate is regarded as not productive enough. Other qualities, such as effectiveness in communicating (a faculty member will spend much of his/her time teaching), a well-defined research plan, and strong letters of recommendation, will play a much more important role. Thus, a more realistic model is one that assigns essentially the same likelihood of being hired to all productive candidates. In that scenario, the probability that four or fewer women will be hired is 36%. One does not see how a credible statistically-based case of “gender discrimination” can ever stand on such odds, which are, of course, a direct consequence of the very small size of the sample.
 The numerical experiment described in the preceding note yields additional interesting information. The simulation in which each member of the “unproductive” group is assumed to be ten times less likely to be hired than any of the “moderately productive” males yields a probability of approximately 27% for four or fewer women to be hired, but of as little as 0.1% for five or more unproductive male hires. This certainly points to a statistically significant deviation of the actual male hiring pattern from what one would expect based on productivity alone; that is, “unproductive” researchers were rewarded well beyond their “merit”. However, in our model this happens almost exclusively at the expense of other men. Obviously, this may be a sign of questionable, scarcely transparent hiring practices (but not of GD); more likely, though, it may merely reflect the dubious value of Towers’ productivity metric, i.e., the scant importance attributed by search committees to the number of internal reports co-authored.