For the past decade, social scientists have been unpacking a “replication crisis” that has revealed that an alarming number of scientific findings are difficult or impossible to repeat. Efforts are underway to improve the reliability of findings, but cognitive psychology researchers at the University of Massachusetts Amherst say that not enough attention has been paid to the validity of the theoretical inferences drawn from research findings.
Using an example from their own field of memory research, they designed a test for the accuracy of theoretical conclusions made by researchers. The study was spearheaded by associate professor Jeffrey Starns, professor Caren Rotello, and doctoral student Andrea Cataldo, who has now completed her Ph.D. They shared authorship with 27 teams or individual cognitive psychology researchers who volunteered to submit their expert research conclusions for data sets sent to them by the UMass researchers.
“Our results reveal substantial variability in experts’ judgments on the very same data,” the authors state, suggesting a serious inference problem. Details are newly released in the journal Advancing Methods and Practices in Psychological Science.
Starns says that objectively testing whether scientists can make valid theoretical inferences by analyzing data is just as important as making sure they are working with replicable data patterns. “We want to ensure that we are doing good science. If we want people to be able to trust our conclusions, then we have an obligation to earn that trust by showing that we can make the right conclusions in a public test.”
For this work, the researchers first conducted an online study testing recognition memory for words, “a very standard task” in which people decide whether or not they saw a word on a previous list. The researchers manipulated memory strength by presenting items once, twice, or three times, and they manipulated bias – the overall willingness to say things are remembered – by instructing participants to be extra careful to avoid certain types of errors, such as failing to identify a previously studied item.
Starns and colleagues were interested in a tricky interpretation problem that arises in many recognition studies: the need to correct for differences in bias when comparing memory performance across populations or conditions. Unfortunately, the same observed response pattern can arise whether memory for the population of interest is equal to, better than, or worse than that of controls. Recognition researchers use a number of analysis tools to distinguish these possibilities, some of which have been around since the 1950s.
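The classic family of tools the article alludes to is signal detection theory, which separates memory strength (sensitivity, d′) from response bias (criterion, c) using hit and false-alarm rates. A minimal sketch of that computation in Python follows; the specific rates are hypothetical illustrations, not data from the study:

```python
from statistics import NormalDist

def dprime_and_criterion(hit_rate: float, fa_rate: float):
    """Equal-variance signal detection theory: convert hit and
    false-alarm rates into sensitivity (d') and criterion (c)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(fa_rate)
    d_prime = z_hit - z_fa              # memory strength
    criterion = -0.5 * (z_hit + z_fa)   # willingness to say "old"
    return d_prime, criterion

# Hypothetical example: two conditions with nearly the same
# sensitivity but different bias -- raw hit rates alone mislead.
d1, c1 = dprime_and_criterion(0.80, 0.20)  # neutral bias
d2, c2 = dprime_and_criterion(0.90, 0.35)  # liberal bias, similar d'
```

Here the second condition has a much higher hit rate, yet its estimated d′ is almost identical to the first; only the criterion differs, which is exactly the equal/better/worse ambiguity such corrections are meant to resolve.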
To determine if researchers can use these tools to accurately distinguish memory and bias, the UMass researchers created seven two-condition data sets and sent them to contributors without labels, asking them to indicate whether the two conditions came from the same or different levels of the memory strength or response bias manipulations. Rotello explains, “These are the same sort of data they’d be confronted with in an experiment in their own labs, but in this case we knew the answers. We asked, ‘did we vary memory strength, response bias, both or neither?’”
The volunteer cognitive psychology researchers could use any analyses they thought were appropriate, Starns adds, and “some applied multiple techniques, or very complex, cutting-edge techniques. We wanted to see if they could make accurate inferences and whether they could accurately gauge uncertainty. Could they say, ‘I think there’s a 20 percent chance that you only manipulated memory in this experiment,’ for example?”
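One way an analyst might attach a probability like “20 percent” to a strength manipulation is to resample the data and ask how often the estimated d′ difference between conditions survives. The bootstrap sketch below is a hypothetical illustration of that idea, not the contributors' actual methods; all trial counts are invented:

```python
import random
from statistics import NormalDist

def estimate_dprime(hits, misses, fas, crs):
    """Estimate d' from raw trial counts (equal-variance SDT),
    with a 0.5 correction to keep rates away from 0 and 1."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (fas + 0.5) / (fas + crs + 1)
    return z(hit_rate) - z(fa_rate)

def prob_strength_differs(cond_a, cond_b, n_boot=2000, seed=1):
    """Bootstrap the probability that condition A's d' exceeds B's.
    Each condition is (hits, misses, false_alarms, correct_rejections)."""
    rng = random.Random(seed)
    def resample(cond):
        h, m, f, cr = cond
        n_old, n_new = h + m, f + cr
        hr, fr = h / n_old, f / n_new
        h2 = sum(rng.random() < hr for _ in range(n_old))
        f2 = sum(rng.random() < fr for _ in range(n_new))
        return estimate_dprime(h2, n_old - h2, f2, n_new - f2)
    wins = sum(resample(cond_a) > resample(cond_b) for _ in range(n_boot))
    return wins / n_boot

# Hypothetical counts: condition A looks genuinely stronger than B.
p = prob_strength_differs((160, 40, 30, 170), (130, 70, 45, 155))
```

The study's shocking finding is precisely that experts applying tools of this kind to the same data sets returned probabilities spanning nearly the full 0–100 percent range.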
Starns, Rotello and Cataldo were mainly interested in the reported probability that memory strength was manipulated between the two conditions. What they found was “enormous variability between researchers in what they inferred from the same sets of data,” Starns says. “For most data sets, the answers ranged from 0 to 100 percent across the 27 responders,” he adds, “that was the most shocking.”
Rotello reports that about one-third of responders “seemed to be doing OK,” one-third did a bit better than pure guessing, and one-third “made misleading conclusions.” She adds, “Our jaws dropped when we saw that. How is it that researchers who have used these tools for years could come to completely different conclusions about what’s going on?”
Starns notes, “Some people made a lot more incorrect calls than they should have. Some incorrect conclusions are unavoidable with noisy data, but they made those incorrect inferences with way too much confidence. But some groups did as well as can be expected. That was somewhat encouraging.”
In the end, the UMass Amherst researchers “had a big reveal party” and gave participants the option of removing their responses or removing their names from the paper, but none did. Rotello comments, “I am so impressed that they were willing to put everything on the line, even though the results were not that good in some cases.” She and colleagues note that this shows a strong commitment to improving research quality among their peers.
Rotello adds, “The message here is not that memory researchers are bad, but that this general tool can assess the quality of our inferences in any field. It requires teamwork and openness. It’s tremendously brave what these scientists did, to be publicly wrong. I’m sure it was humbling for many, but if we’re not willing to be wrong we’re not good scientists.” Further, “We’d be stunned if the inference problems that we observed are unique. We assume that other disciplines and research areas are at risk for this problem.”