Coaching researchers: Why should biomedical researchers learn R?

02 October 2013

Coaching researchers: Why should biomedical researchers learn R?

R is the language of choice for a large number of data scientists. And for good reasons, including the outstanding group of core developers, a huge number of packages covering almost everything imaginable, and a language combining object orientation and functions that is easy to learn if you have some previous programming experience.

When it comes to biomedical researchers, however, the prerequisite of having some previous programming experience is missing in most of the cases, and that creates a problem. The end result is that every time I mention R as a possible language, the reaction is almost invariably that (a) R is hard to learn and (b) X is what I am using now since it is easy. X will usually translate into SPSS, Stata, or SAS-JMP, and easy usually means that there is a graphical user interface for them to point & click and do their own analyses.

In contrast with most people in what I call the old statistics school, I am absolutely in favor of biomedical researchers directly working with their data. The concerns from the old school stem, in their words, from a concern of biomedical researchers getting results results that they don't understand and, worse, propagating those results as if they were valid. I share that concern and think that it is legitimate. In most cases, however, the true, vested concern from the old school is actually simpler: Turf protection. I disagree entirely, and think that biomedical researchers not playing with their data are simply shifting the errors in interpretation to the statistician or data scientist since the latter do not have the biomedical insights to explore the data sets as they should be explored. Where I am trying to get with this is that if the biomedical researcher works alone or the statistician/data scientist is not supported by the full insights from the biomedical researcher, the odds of having incomplete or misleading information is significantly increased.

Now, what does all this have to do with R? Well, if we agree that the way to go is to have teams where biomedical researchers have insights from their field and data scientists know their methods in depth, then biomedical researchers need to understand what data scientists are saying and doing. This is where R comes in. So, in an ideal research team, biomedical researchers would know at least the following:

  1. Read an R source code: in other words, which tests and models are being run and with which variables
  2. Interpret whether the choice of tests and models were appropriate given the type of variables serving as input to them
  3. Interpret the general results of each test and model

In other words, it's not even to write code, but instead to critically read code. I also think that the ability to explore the data set is incredibly useful, and this can certainly be done with a graphical user interface, in R or any other software.

by Ricardo Pietrobon