sampler" Computational Imagination


Profile in Computational Imagination: Jonah Gabry

Romanced by the Mathematics of Uncertainty


Jonah Gabry was recommended to me by Rob Trangucci for his work on developing ShinyStan. In this interview we discuss the development of ShinyStan, Jonah's ongoing work on the Stan team, and his professional journey from music, Spanish literature, and Latin American history to statistics and software development.


Developing ShinyStan

(Mike): Jonah, tell us about the genesis and development of ShinyStan.

(Jonah): ShinyStan is a graphical user interface for exploring Markov chain Monte Carlo (MCMC) simulations. Most applied Bayesian data analysis relies on MCMC algorithms to generate draws from the probability distribution of interest (the posterior distribution). After fitting a model using MCMC, ShinyStan can be used to check various diagnostics and create plots and tables of the quantities of interest. There are some extra features implemented for models fit using Stan, but ShinyStan can also be used with models fit using other MCMC implementations.


ShinyStan

I started working on ShinyStan while I was taking a graduate course on Statistical Communication and Graphics with Andrew Gelman at Columbia. There was an open project to be worked on throughout the semester. It coincided nicely with a time when I was working on some research of my own using Stan, but I wanted some sort of immediate visual feedback that wasn't available. At the time, we were essentially just using some very long scripts that would generate the summaries and plots we wanted to look at, but then we had to have a system for saving everything in some organized fashion, and that was never very satisfying. I had started playing around with some of the ideas behind ShinyStan on my own, but Andrew's class gave me the chance to work on it more formally and with the help of some other students in the class. By the end of the semester we had a working prototype. It was student work, primitive, but there was interest from some of the other Stan developers, and some users were already starting to suggest interesting new features, so I just kept working on it (and continue to work on it).

The ShinyStan interface is part of the shinystan R package, so the easiest way to use it is to fit a model using the rstan package (the R interface to Stan) or the new rstanarm package. If you use Stan via the command line, Python, or one of the other interfaces (we also have MATLAB, Julia, and Stata interfaces), you can export your results to R and then use ShinyStan. You can then easily pass your fitted model to ShinyStan with a single function call and it launches in a web browser.
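For a concrete sense of how little ceremony is involved, here is a minimal sketch assuming the rstan and shinystan packages; the model file and data are hypothetical placeholders.

```r
# A minimal sketch: fit a model with rstan, then open it in ShinyStan.
library(rstan)
library(shinystan)

# "my_model.stan" and my_data are hypothetical placeholders.
fit <- stan(file = "my_model.stan", data = my_data)

# A single function call launches the interface in a web browser.
launch_shinystan(fit)
```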

In addition to generating a variety of customizable (and exportable) plots and tables, it lets you immediately view what we've found to be the most useful diagnostics to check after fitting a model with Stan's algorithms.

Another feature is that it will store all the results needed to recreate these plots and diagnostics in a single object. I find this to be so much more convenient than having to repeatedly generate each plot and table or keep them stored in some labyrinth of directories. This single object can also be shared with collaborators who can then easily view all of the content. There’s also a function for uploading to RStudio’s ShinyApps service, which means you can easily publish a model online and share it simply by sending a URL.
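As a rough sketch of that workflow, assuming the shinystan package's as.shinystan and deploy_shinystan functions (the object, file, and app names here are hypothetical):

```r
library(shinystan)

# Bundle everything needed to recreate the plots, tables, and diagnostics
# into a single shinystan object (fit is, e.g., a stanfit object from rstan).
sso <- as.shinystan(fit)

# One file to save or share with collaborators.
saveRDS(sso, "my_model_sso.rds")

# Publish to RStudio's ShinyApps service and share the resulting URL.
deploy_shinystan(sso, appName = "my-model")
```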

Part of the motivation for creating ShinyStan was my own struggle with keeping my projects organized. I wasn't happy with the chaotic system I had for storing results and code, especially when working on multiple projects at the same time and with various sets of collaborators. So it kind of grew out of my own needs and frustrations, but I thought other people might find it useful, so why not create a more general tool?

(M): Isn't that how so many tools start? A developer gets frustrated and starts to think there has to be a better way to do this. Which leads me to my next question. Not every statistician is a tool builder, a programmer. How do you think about yourself professionally? How did you pick up the skills to build tools and not just use them?

(J): That evolved for me pretty rapidly and recently. I wasn't one of those kids who was always interested in computers, my friends weren’t into programming, etc. I didn’t know I enjoyed programming until I started doing statistics and quickly realized that anything I wanted to do was going to require some minimal level of programming. And it turned out that I actually found it quite enjoyable. I have always enjoyed learning languages, spoken languages that is. So maybe it makes sense that I would enjoy learning programming languages as well. I ended up spending a lot of time doing it without intending to. What I mean is spending a lot of time following a particular thread, like when you read something online, click on a link and then click on another link and find yourself somewhere you hadn't anticipated (perhaps reading this interview). It was like that for me with programming. I would teach myself something I needed to learn for a particular project I was working on and, in doing that, I would come across something else that intrigued me and I would follow up on that and end up spending a lot of time programming and learning. If I had an idea for writing more efficient code I would take the time to do it just for the practice, even if I was working on something without efficiency concerns. Anyway, it turns out to be something I enjoy, but I didn’t know that until I started doing it out of necessity.

(M): Talk to us a little about doing development in R.

(J): I'm programming mostly in R because it's so ubiquitous in applied statistics, but there can be advantages to working in other programming languages (even some commercial statistical software packages, which have people paid to work on testing and developing documentation). The peculiarities of R as a programming language tend to drive many experienced programmers mad (at least the ones I know). R is simultaneously endlessly vexing and amazingly flexible. It may be the best tool for doing statistical analyses, but for someone creating a tool it can be quite frustrating. The documentation for R (both the core language and user-developed packages) is also not very good in my opinion, although there are certainly exceptions. I try to put a lot of time into writing documentation because I've found it to be one of the best ways to check my own understanding of what I'm working on. Oddly enough, I have come to really enjoy writing documentation.

Regarding R, I would say that I very much appreciate it (particularly all of the work done by volunteers to maintain it), but I'm not an R "fundamentalist". I know several people who can accurately be described that way (that is, they have an R-or-nothing mentality). If better tools come along I am willing to learn them. As a team I suppose we will continue to use R as long as it remains popular among statisticians.

I'd also like to say that I've become a much better programmer and software developer since joining the Stan development team. In particular, Bob Carpenter, who has a background as both a computer scientist and a linguist, has been incredibly helpful in teaching me how to think about software design and the development process.

(M): What are you working on next?

(J): Ben Goodrich and I are working on rstanarm (Applied Regression Modeling via RStan), an R package designed to bring the power of Stan to users familiar with regression modeling in R. In particular, it can be a good introduction to Bayesian data analysis for people who are used to classical methods. The rstanarm package lets you specify many of the most common applied regression models in R (using the familiar R modeling syntax), but then uses Stan as the back end to do full Bayesian inference without the user having to program in the Stan language directly. Our hope is that it will help bring Stan to a wider audience: if you can use the glm function in R to estimate a generalized linear model, you will be able to use the stan_glm function to estimate a Bayesian version of the same model.
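A minimal sketch of that parallel, assuming the rstanarm package (the formula, family, and data here are hypothetical):

```r
library(rstanarm)

# The classical fit and its Bayesian counterpart share the same syntax.
fit_classical <- glm(y ~ x1 + x2, family = binomial(), data = mydata)
fit_bayesian  <- stan_glm(y ~ x1 + x2, family = binomial(), data = mydata)
```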

(M): Is this just about accessibility or is it targeted toward teaching Bayesian statistics?

(J): The package is not going to teach Bayesian statistics by itself, but we think it will be useful in educational settings to aid in teaching Bayesian data analysis. We are putting a lot of emphasis on the documentation, including vignettes, examples, and explanations of how each type of model is particularly useful, and emphasizing the four steps of a Bayesian analysis (sketched in code after this list), which are:

  1. specifying a joint probability distribution for the outcome variable and all of the model parameters and unknown quantities of interest
  2. drawing from the relevant posterior distribution, probably using Markov chain Monte Carlo (MCMC)
  3. evaluating how well the model fits the data and possibly revising the model
  4. drawing from the posterior predictive distribution of the outcome to see how manipulations of predictors affect (various functions of) the outcome
One of our challenges is making it simple enough to reduce the typical user's learning curve without sacrificing what we think are best practices.
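As a sketch of how those four steps can look in rstanarm (the data, formula, and prior here are hypothetical choices, not a recommendation):

```r
library(rstanarm)

# Steps 1 and 2: the formula, family, and prior arguments specify the joint
# distribution, and stan_glm draws from the posterior via MCMC (Stan runs
# under the hood).
fit <- stan_glm(y ~ x, family = gaussian(), data = mydata,
                prior = normal(0, 5))

# Step 3: evaluate model fit, e.g. with graphical posterior predictive checks.
pp_check(fit)

# Step 4: draw from the posterior predictive distribution at chosen predictor
# values to see how manipulating x affects the outcome.
preds <- posterior_predict(fit, newdata = data.frame(x = c(0, 1)))
```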


Professional Journey

(M): When I looked at your LinkedIn profile what really struck me was your transition from literature and history to statistics and programming. Share with us some of that journey.

(J): My undergraduate studies were in music, Spanish literature, Latin American history... very humanities-oriented. Not the typical background you find among statisticians, although I do know a number who are musicians.

(M): That combination I see a lot - there seems to be something deep connecting music to computer science and math.

(J): Agreed... I guess the transition from literature, history, and languages is not nearly as common. When I started college I didn't have any plans, so I took a lot of different subjects and ended up being drawn to those topics in the humanities. In hindsight I think the professors I had in humanities courses during my freshman year were much better teachers than the professors in my math and science courses (I'm not claiming that's true in general). Essentially, I just continued taking classes with the professors I liked and ended up with those majors. After graduating I worked for a while doing Spanish-English translation of fiction and poetry, spending long stretches of time in Spain and South America. I also worked for a company that imports and distributes foreign-language literature and language-learning materials. And I was on tour for a little while with a band.

(M): Tell us a bit more about the band and your role in it.

(J): Well, I used to play primarily jazz guitar, but the band I went on tour with was more folk music. It was a collaboration with a folk singer from Mississippi, Munny Townsend. She has quite an enchanting, deep voice that I really love. We started performing together in college but the tour didn't happen until a few years after we graduated. We recorded an album (mostly her songs with my arrangements) in the summer of 2010 under the name Munny & the Cameraman, and we spent the following summer traveling from Maine to New Orleans, playing about 30 shows along the way. On the album I play five or six different instruments, but live I played mostly guitar, and we had other musicians join us for many of the shows to fill out the sound. (If anyone is curious, the album is available on iTunes, Amazon, and other similar sites, but we're not trying to make money off of it, so I recommend checking it out on Bandcamp here, where you can listen to all of it for free.)


Munny and the Cameraman


It was actually while I was on tour and focusing on playing music that I first started to get interested in probability. I had printed off a bunch of articles to read on the road while we traveled, and I remember one of them was one of those pop-sciency articles trying to explain some complicated biological phenomenon to people without any subject matter knowledge (like me). I don’t remember the precise topic unfortunately (it’s been a while), but the author made a probabilistic argument that bugged me. I had a sense that the argument was fishy. But I just wasn't equipped with the relevant knowledge to be able to follow up on that intuition and figure out the flaw in the argument. So I started reading up on probability theory and I soon realized that I needed to brush up on my math. So I started re-teaching myself calculus (it had been many years) and then it just spiraled from there.

I started thinking a lot about uncertainty and how to measure and reason about uncertainty. I realized that there were principled ways to do that but I didn't know how any of it worked and I wanted to find out. So I started to read a lot on my own, teaching myself, and then after a year or so I realized that it was something that I wanted to keep doing. I was living in Philadelphia at the time, so I enrolled in some math courses at Penn as a way to gauge how serious I was about it. I eventually decided to go to graduate school and it worked out nicely because I had been acquiring a lot of the prerequisite knowledge just out of my own curiosity.

(M): So it sounds like this journey arose more out of intellectual curiosity than a planned career move.

(J): Yeah, I certainly did not get into it because of the data science craze or because it was useful in the job market. The truth is that I was oblivious to all that (probably too much so). I wasn't thinking of any of those things, although I suppose it was a nice surprise to find out that these skills are increasingly in demand. Otherwise I probably would have ended up in the same situation as when I studied literature and languages.

Also, I did immediately find connections to things I had studied in the past, so not everything was unfamiliar. For example, the history of probability and statistics is itself a fascinating topic, and there are also a lot of connections to philosophy. I spent a lot of time reading about the philosophy of statistics (which is an entire field of study), and there are many philosophical issues related to probability (for example, the most basic question of what the hell probability actually is). But there are also intersections on topics like how to have precision in the face of uncertainty and how to learn in a principled way from the information available to you. And these are all questions that already interested me in other contexts.

(M): A different framework for inquiry?

(J): Right, so in some ways this didn't seem like as big a jump to me as it did to family and friends who saw me moving into heavy math. Although I had been a good math student, I don't think I had ever expressed much excitement about it. But I was finding that I needed to rely on math to answer the questions I was interested in. It wasn't so much that I wanted to do math... although after spending so much time and effort on math in order to do statistics, I started becoming interested in math for its own sake. So along the way I ended up taking some less applied math courses, e.g. number theory, which I don't think I've ever really used... actually that's not true, I have used some of what I learned in that course in a few cases. And number theory underlies a lot of the math behind cryptography (e.g., many encryption methods rely on the fact that it's easy to multiply large prime numbers but very hard to factor the product).

(M): I sort of expected that you might have taken statistical approaches and applied them back to your roots in literature or translation, using natural language processing, semantic analysis, corpus analysis, or maybe machine translation.

(J): I’ve never done any of that myself, but I work with someone who has that kind of background. Bob Carpenter was a professor at Carnegie Mellon in computer science before he went into private industry and then came to Columbia to work with Andrew Gelman on Stan. Bob is an expert in NLP and computational linguistics and after hanging around with him I am glad I didn't try to go that route (it’s really hard), but I do find it interesting.

(M): A question about your path. You went to Columbia as a grad student but now you are working there?

(J): Yeah, now I am just working there. I was a graduate student and I got very lucky. I was at the right place at the right time. I finished school at a time when they happened to have some grant money available to fund me.

(M): It let you keep working on something of deep interest to you and get paid to show up every day.

(J): Yeah, it's really fantastic and I don't take it for granted. There is a group of five or six of us based at Columbia working on Stan. Andrew Gelman and Ben Goodrich also teach, but the rest of us (myself, Bob Carpenter, Daniel Lee) are just doing research and every once in a while filling in to give a lecture. We also work with researchers in other departments, many of whom are using Stan in their work. For example, part of the funding for my position comes from the Columbia Population Research Center, where we are working on a series of surveys to study poverty in New York City. I am working with them on analyses of survey data, dealing with missing data (a huge and under-appreciated issue when working with survey data), survey weighting, and various other topics. Along with some of the other Stan developers, I've also collaborated with researchers in political science, sociology, bioinformatics, ecology, and some other fields I'm forgetting at the moment.

(M): What would you do if you were not constrained by resources?

(J): The truth is that I really like my job. I am certainly paid less in academia than I would make working in industry, but if I weren’t constrained by resources I would probably be doing something very similar to what I am already doing. I guess if I really had the resources I would pay to move the Stan operation somewhere outside of New York. Maybe out west somewhere. I find New York to be a bit overwhelming.


Tools

(M): What kinds of tools beside R are you working with?

(J): In my larger workflow I rely a lot on Git and GitHub for version control. We couldn't do our work as a team without these tools or something comparable. And even when I am working on an analysis or code of some sort that is not going to be shown to anyone else, I still use them. The ability to branch off your main repository and experiment with new ideas without worrying about corrupting your main files is so important. I know many people who store dozens of copies of the same file, each with tiny modifications, and then struggle to piece things back together to make it all work. Git makes this trivial (well, mostly). When I started programming I would only use version control when I was working on something important for an audience. Now I use it for everything. It has become such an essential tool for me and has saved me from many code nightmares.

Through GitHub we use Travis for continuous integration. We write unit tests for all our software, and any time we make a change we get a notification if that change caused any of our tests to fail.

In addition to R, I write some code in JavaScript and HTML (I need them for ShinyStan but don't use them regularly otherwise) as well as a bit of Python and C++. Stan is written in C++ so I need to be at least somewhat familiar with it, although my C++ programming leaves much to be desired. I will sometimes write C++ functions to use in R packages when speed is an issue, but otherwise I don't use it much myself.
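As a sketch of what calling C++ from R can look like, here is a hypothetical example using the Rcpp package (a common way to do this, though the interview doesn't specify Jonah's approach):

```r
library(Rcpp)

# Compile a small C++ function and make it callable from R. Tight loops
# like this one can be much faster in C++ than in interpreted R code.
cppFunction("
double sum_sq(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
  return total;
}
")

sum_sq(c(1, 2, 3))  # returns 14
```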

(M): Are you doing much work outside of academia?

(J): Individually, a number of us on the Stan team do workshops and training sessions on Stan, as well as some more general statistical consulting.

(M): What kinds of companies and use cases are you finding in industry?

(J): I think Stan is still mainly used by academics. We know of research using Stan in psychology, ecology, physics, sociology, medicine, political science, astronomy, and many other fields.

But Stan also certainly has a growing presence in various industries. There are professional sports teams, pharmaceutical companies, publishing companies, all sorts of different kinds of businesses using Stan. Stan is agnostic to the domain; it doesn’t care what the topic is. If you have data and a statistical model then Stan will let you take a Bayesian approach to your analysis.

(M): What in your professional life surprises you?

(J): I am still surprised by just how knowledgeable and insightful the people I work with are. And not because I have low expectations, but rather because they consistently exceed my high expectations. I'm also surprised by how different everyone's backgrounds are. For example, Bob Carpenter, whom I already mentioned, is a computer scientist and linguist; Daniel Lee, I believe, has a formal background in math; Michael Betancourt is a physicist-turned-statistician; Ben Goodrich is a political scientist but knows more statistics than most statisticians I know; and Andrew Gelman is a statistician and political scientist. We are all drawn together by a (mostly) shared perspective on applied statistics and the Stan project in particular.

(M): Jonah, thank you for sharing your work and professional journey with us.