M: This is Profiles in Computational Imagination with John Cook. John’s career started with a PhD in applied mathematics; he moved into software development for a number of years, and then got back into a more academic kind of environment at MD Anderson as a research statistician. Over the years he has done a lot of consulting, and I believe that’s his main gig right now.
I first found John on Hacker News, started following him and reading some of his writing. So it’s my pleasure to welcome John.
J: Good morning.
M: Could you describe and put some boundaries around the domain that you work in?
J: Sure. Well, I’ve had several careers. I’ve worked in several domains. My first career was in partial differential equations. That’s what I did in graduate school and as a postdoc. And then I left academia and worked as a software developer for a while. Then I worked in biostatistics at MD Anderson and now I’m a consultant and entrepreneur. The common thread through these things is working in the overlap of applied math and computation.
M: You mentioned biostatistics at MD Anderson. What are some of the other fields or domains that you applied that knowledge to?
J: When I started working as a software developer, I was working for an oil field services company. There I was doing some digital signal processing. Now I’m working with a variety of clients, sometimes doing business applications, sometimes doing mathematical modeling.
M: You’ve been doing this for quite a number of years. What are some of the things that still interest you about this domain and this intersection between computation and mathematics?
J: It’s interesting to see how the math maps onto the real world. That’s the hard part. That’s usually more challenging than the mathematics itself. And it’s interesting to see when that works well; sometimes it doesn’t, but when it does it’s very satisfying.
M: Maybe you could give our listeners and myself a sense of the kinds of problems that you’ve solved. Maybe give us a little sampling like in the biostatistics area. What are the kinds of things that you would be solving in a day-to-day working environment?
J: Well, one of the biggest things I did in biostatistics was working on adaptive clinical trial design. So looking at ways to make decisions more intelligently by taking advantage of the data as it accumulates. Can you treat the tenth patient with a slightly higher probability of success given what you’ve learned from the first nine patients? So that’s very satisfying when it can work. More recently, I’ve been doing some work with modeling the biological processes, not so much the clinical trial but at a deeper level getting into some of the physiology.
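The learn-as-you-go updating John describes can be sketched in a few lines of Python. This is a toy beta-binomial illustration with made-up outcomes, not his actual trial software:

```python
# Toy sketch of Bayesian updating in an adaptive design (illustrative
# numbers, not from a real trial). Start with a uniform Beta(1, 1)
# prior on the treatment's response probability.
alpha, beta = 1, 1

# Hypothetical outcomes for the first nine patients: 1 = response, 0 = none.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1]
for y in outcomes:
    alpha += y      # responses update the first shape parameter
    beta += 1 - y   # non-responses update the second

# Posterior mean: the updated success probability going into patient ten.
posterior_mean = alpha / (alpha + beta)
print(f"Posterior: Beta({alpha}, {beta}), mean = {posterior_mean:.3f}")
```

With six responses among the first nine patients, the prior mean of 1/2 moves up to 7/11, and an adaptive design could use that posterior to weight the next treatment assignment.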
M: So thinking a little bit about your professional path, you kind of gave us the arc of it, how early in your life did you realize you wanted to focus on this domain, and what attracted you to it?
J: Well I’ve been interested in math most of my life. Pretty early on, that’s what I decided I wanted to go into. As far as how that took shape and led to a more specific career path, that’s been evolving ever since. I started out in fairly pure math and kept moving in the applied direction, more and more applied, so applied that maybe I’ve moved into some areas that aren’t called math anymore.
M: What are some of the milestones along the way in your career?
J: Well, one milestone was discovering numerical analysis. I had a friend in college who worked for the Center for Numerical Analysis at UT, and that sounded mysterious and/or boring. I thought, what in the world is this about? And it turns out that it’s a fascinating field. You take what’s ostensibly a simple problem, solving a linear system of equations, something that in principle is well understood; it’s something you learn within a week or two of taking a linear algebra course. But there’s been a tremendous amount of work on actually computing these solutions: computing them robustly, accurately, quickly. And it’s interesting how such a simple problem from one perspective turns out to be so rich when you look at it in detail.
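The richness John mentions shows up quickly in practice. A small NumPy sketch (my example, using the classically ill-conditioned Hilbert matrix) shows how a textbook-simple problem pushes against the limits of floating point:

```python
import numpy as np

# A "simple" problem: solve H x = b. The Hilbert matrix is a classic
# ill-conditioned example, so even a good solver loses accuracy.
n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

x_true = np.ones(n)        # pick a known solution
b = H @ x_true             # and build the right-hand side from it

x = np.linalg.solve(H, b)  # LAPACK-backed dense solve
print("condition number:", np.linalg.cond(H))
print("max error:", np.max(np.abs(x - x_true)))
```

The condition number here is on the order of 10^13, so roughly thirteen of double precision’s sixteen digits can be lost; producing robust, accurate answers in the face of this is what the field is about.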
I would say another milestone is the decision to leave academia. That was a pretty big decision, one that I’m glad I made, but it wasn’t easy.
M: So you had a lot invested at that point. … all the way through a PhD in mathematics. What prompted you to head in the direction of industry at that point?
J: I decided I would rather have a good job in industry than a bad job in academia. The job market was pretty bad when I finished my postdoc. It looked like if I were to stay in academia I would be taking a job that I really didn’t want. And so I decided that I would move into the private sector. That’s something I had in mind as a possibility all along. Moving from applied math into industry is not such a leap compared to, say, moving from Renaissance literature to industry.
M: What was your biggest adjustment? If we end up having some postdocs who are frustrated and thinking about applying their research skills to industry, what are a couple of cautions or insights you could share to help them make that decision?
J: For one thing, don’t expect people to be impressed by academic credentials. In fact, they may be put off by them. I had one recruiter suggest that I leave my PhD off my resume when I was looking for a job. It’s because people assume that someone with a PhD is going to be theoretical and impractical, and if they’re a programmer they’re probably a terrible programmer. These assumptions are not without some basis. So you have to overcome some of the prejudice and say no, I’m not one of those kinds of people. Yes, I’m an academic, or have been, but I can actually program and write clean code. That sort of thing.
M: One of the things I noticed about you was that you spent a lot of years developing some really solid software development skills. You’ve been kind of doing the combination before data science was a term. Do you have any advice for people who are trying to hit that intersection of mathematics, statistics, and programming?
J: One thing that I recommend to everyone who asks for advice in software development is that they read the book Code Complete by Steve McConnell. It’s filled with things you just don’t learn in school, sort of the ninety percent of what you need to know that is not programming language syntax, or algorithms. Just really practical things: how to name things well, what sort of practices are more likely to lead to errors or less likely. Just really good, solid advice. That’s something I would recommend as far as improving your software development skills. On the mathematical side, I would say the most important things to know are probability, linear algebra, and calculus. If you really know those things well, then you’re in good shape to learn what else you need as needed.
M: As opposed to starting with some of the endgame things like machine learning. I see some of these people beginning and they want to jump straight to machine learning, neural networks and all of that. What you just said is really the foundations that people need to put in place.
J: Well, it’s good to jump into what you want to do, but I’d say at least in parallel, it’s good to have some deeper understanding of what’s going on.
M: I’ve used the phrase “cost-per-observation” --it’s costing us less and less to get one atomistic piece of data--and in many fields, it’s trending towards zero and causing data explosions. But this trend is unevenly distributed. If you think about the areas that you work with, what kind of data is still too expensive to get?
J: I guess I would want more direct data. We’re usually looking at some sort of proxy for what we really want to know. We’re looking at something that is correlated with what we’re interested in, or at least we hope it is, or there’s some relationship there, but it might be weak. If you have a few dozen highly relevant data points, that’s actually very useful. But when people are trying to strain something out of millions of data points, it’s because these data points are not very directly related to what they want to know. They’re trying to pick up on a very weak signal. It doesn’t take a whole lot of data if you have relevant data, but sometimes that’s not ethically or logistically possible.
M: The genomic tool chain has gone through enormous changes. They’re one of the poster children for big data. The cost of genome sequencing has gone from roughly $3 billion for the first one to, depending on who you believe about the quality of their technology, maybe down around $1,000 to sequence a genome. That just boggles the mind. Are there other areas in biostatistics or in biology research where you say, well, that’s great, but we still can’t measure X? We still don’t really know how to measure X? Or it’s too expensive?
J: Sure. For example, in cancer treatment trials, you really want to extend someone’s life expectancy and their quality of life. That is what people are hoping for in cancer treatment, but that’s not always practical to measure. So you measure some surrogate like tumor shrinkage in the hopes that if we can shrink somebody’s tumor, we can improve their chances of survival, which is plausible but it’s not always true.
Ironically, things have gotten harder in some areas as survival has improved; so like in breast cancer, it’s common for women to live many years with breast cancer--like maybe seven years--which is good, it’s great that there’s been an improvement, but it makes things harder for the clinical trial because you can’t observe each patient for seven years before you treat the next patient. Instead, you have to measure something that’s more convenient than survival.
M: So I want to ask you an “imagination” question. What’s the crazy idea that you would pursue, if, for example you get the MacArthur Grant or some person gives you $10 million to work on whatever you want? What’s the crazy idea that’s in the back of your head, and what intrigues you about it?
J: Well, one thing that comes to mind is a paper I saw maybe a year ago; I think the title was “What Is a Statistical Model?” The author was looking at using category theory as a sort of sanity check on statistical models, the idea being that you could provide some sort of check for a statistical model analogous to dimensional analysis. In a physical problem where you have some solution, does the answer at least come out in the right units? Something like that is just spookily powerful sometimes. It’s amazing what you can conclude just from getting the units right.
So for example, if you know that there’s some relationship between mass and energy, Einstein’s equation is the only possibility up to a constant. If E equals anything involving mass and the speed of light, it’s got to be proportional to mc². So would it be possible to do something like that for statistics, where you have some sort of type checking or dimension checking? This paper was the first attempt to do that using category theory, which is really interesting because category theory is not the most practical applied subject. In fact, it’s probably about as far from that as you can get. And yet this would be a really useful application if it panned out. Other than this one paper, I don’t know what else has been done. I don’t know if anyone else has looked into it, because the intersection of people who know statistics and category theory has got to be really small.
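The dimensional-analysis side of this is easy to mechanize. Here is a toy Python sketch (my illustration, not the paper’s category-theoretic machinery) that represents a physical dimension as exponents of mass, length, and time:

```python
# Represent a physical dimension as exponents of (mass, length, time).
def dim(mass=0, length=0, time=0):
    return (mass, length, time)

def mul(a, b):
    # Multiplying quantities adds their dimension exponents.
    return tuple(x + y for x, y in zip(a, b))

M = dim(mass=1)                     # a mass m
C = dim(length=1, time=-1)          # c, a velocity
E = dim(mass=1, length=2, time=-2)  # an energy

# E = m * c^2 passes the dimension check; E = m * c does not.
assert mul(M, mul(C, C)) == E
assert mul(M, C) != E
```

The hope John describes is a check of this flavor for statistical models: a cheap structural test that rules out whole classes of wrong answers before any data is touched.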
M: Well you know, if we end up with some rich benefactor listening to this, maybe he’ll give you a call and give you the grant to work on it. Wouldn’t that be funny?
J: If I got the grant, the first thing I would do is hire somebody who knows more about these things than I do.
M: I think the world is becoming more and more multidisciplinary. It’s really hard either to do big science or to do really compelling industry startups without becoming multidisciplinary. When you’re daydreaming about your perfect professional team, what does your multidisciplinary team look like? Who would you be hiring? I’m not looking for names, but what are the kind of people you’d like on your team?
J: Well, I guess a lot of the things are fairly traditional; it’s good to have one person who’s a system administrator, someone who keeps all the machines running, someone who’s anticipating problems; it’s good to have someone who’s an expert in the tools that you’re using, maybe someone who’s a toolsmith who makes custom tools for everybody else.
Within mathematics itself, one combination I’ve worked with that was really productive: I was on a project that had a Russian-trained and a French-trained mathematician, and that was really interesting because these are sort of opposite approaches to mathematics. The Russian school tends to be very concrete, you know, calculating things with special functions and classical mathematics. The French / Bourbaki school is much more abstract. When you have people approaching things from these two opposite ends and working toward the middle, it can be very interesting.
M: What does your tool chain look like? I guess now that you’re on your own it’s a somewhat different tool chain than you were using at MD Anderson.
J: Sure. It varies depending on what kind of problem I’m working on, but my go-to tools, the things I’m most likely to use, would be Python and the Python scientific libraries like NumPy, SciPy, and Pandas. That sort of thing. Emacs and LaTeX. Those are the basic things. Sometimes I work with C++ for efficiency if Python is not fast enough for something I need to do, though that doesn’t happen so often. I used to use Visual C++, the Microsoft C++ compiler. I haven’t used it much lately. It’s a great tool for a certain kind of software development, not the kind I’m doing at the moment, but it’s a great tool and I imagine I’ll use it again.
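As a flavor of that stack (a generic illustration, not any particular client project), NumPy can simulate some data and Pandas can summarize it:

```python
import numpy as np
import pandas as pd

# Simulate two groups of measurements and summarize them by group.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 50),
    "value": rng.normal(loc=[0.0] * 50 + [1.0] * 50, scale=1.0),
})

summary = df.groupby("group")["value"].agg(["mean", "std"])
print(summary)
```

A few lines like this, wrapped in Emacs and written up in LaTeX, cover a lot of day-to-day consulting work.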
M: Have you looked at Julia at all? It’s a new language coming out of MIT?
J: Yeah, I’ve looked at Julia. I’ve met a few of the guys who are working on it, and they’re really sharp. I expect great things from it. I’m not using it right now. Maybe in a few years it is what I’ll be using.
M: Now do you end up building any of your own custom tooling or libraries? Say you’re working in Python, have you built some of your own tooling in terms of actually doing your own packages or libraries for that, or other kinds of custom software?
J: Yeah, I’m always writing little things for myself, mostly fairly small tools these days. When I was at MD Anderson, I worked on a fairly large numerical library in C++. These days, working with Python, I can usually find the things I need in an existing library. On the C++ side it was harder to find things, and especially harder to find things that fit together. Here’s a library to do this task and here’s a library to do that task, but you can’t pipe the output of one into the next one because they’re using different ways to represent a matrix or whatever. So a lot of the effort went into creating a C++ library where all the pieces would fit together, and especially into things that were customized for the work we were doing in biostatistics. There are some things that are common in that world that are not really that common in the wider world of scientific computing, so we had to do those ourselves.
M: I know there are some vendors with highly optimized numerical libraries, but that wasn’t going to work in your environment apparently. So you guys built your own?
J: Yeah. When this came up, I said the last thing I want to do is develop my own library, and I was pretty adamant about that. Let’s not do this; let’s use what’s out there. But it ended up being easier to do our own. It’s not something I recommend in general, but because we were developing something specific to our needs it worked better, especially once we were using C#. At first we were using C++, and other people were putting a user interface on top of that using C#, a web interface or a desktop interface. Then around the time I left MD Anderson, we were moving toward doing the numerical analysis in C# as well, so the whole project could be in one language. In the C# and .NET world, it’s even harder to find numerical libraries than on the C++ side. So we ended up writing our own library in C# as well, mostly a port of the C++ library.
J: Sure. I was just going to say, it’s all about tradeoffs. How valuable is it to you to stay in one language, for example. If you’re hiring frontend developers, they’re not going to know C++ these days. Do you want them to be able to debug into the backend code? If they hit a binary that’s opaque to them, they can’t do anything about it. There’s some advantage to having everything in one language.
M: There absolutely is. I wasn’t meaning to suggest otherwise. I guess I was just thinking that there’s also tools that are really appropriate and optimized for certain kinds of tasks, and there’s reasons those tools exist. You are always doing a balancing act between accessibility to your whole team and taking advantage of those past optimizations.
J: Oh, absolutely. You don’t want to write your own linear algebra library, for example, you want to use LAPACK.
M: As you think about the area of tools, are there tools that you have high hopes for? I had mentioned Julia, but are there other tools on the horizon that you think could be big?
J: Well I guess the tool I’ve been most excited about lately has been Docker. It’s a lightweight version of a virtual machine. It serves much the same purpose as the virtual machine, but it’s a lot lighter weight. And it’s just magical how it can work when it works well.
M: I may have to dig into that. I’ve seen some of the articles about it, but I just haven’t touched it. Excellent, you have high hopes for Docker.
J: Yeah. The name Docker is a metaphor for a shipping container. They’re hoping to do for software deployment what shipping containers did for shipping. And to some extent, I think they can.
M: And they can get around the performance issues, or do you just think that the accessibility and the cross-platform will be important enough that people will tolerate another layer in between for performance?
J: I haven’t seen a performance problem. I’ve been on a project where, over and over we were bitten by “Well it works on my machine,” and Docker eliminated that. There are a lot of things that purport to eliminate the problem. This or that kind of sandbox and virtual machine, and none of them really did, but Docker did.
M: Okay. Good, good. So as you look around the domain that you’ve been working in more broadly, has there been anything that surprised you? Maybe Docker surprised you in the last couple years. Is there anything else that’s surprised you in the last couple of years?
J: I guess one thing that’s surprised me is the way people gush about data as if data is this new discovery. That no one has ever used data before. Like Kepler was auguring chickens, or something.
M: Yeah, I remember I was sitting at a small conference at the University of California San Diego, and Hal Varian, Google’s chief economist, was speaking, and he said statistics is going to be the next sexy profession. And I was thinking, thank goodness it’s finally coming around my way! To have people say that the kind of stuff I do is exciting is sort of a weird feeling.
So as you look ahead, put on your forecasting hat for the area of statistics and programming and mathematics. What are the big advancements you see coming in the next few years?
J: I don’t really know, but I think one possibility would be some sort of synthesis of traditional statistics and machine learning. They’re pretty much carried out by separate practitioners, with different vocabulary and different philosophies. And I think there’s a lot that each side could learn from the other, if there were some sort of synthesis bringing the two together. Statistics can be sort of anal-retentive sometimes, creating perfect models that maybe depend on unrealistic assumptions. In that sense, machine learning is much more pragmatic. But at the same time, machine learning may not have that sense of accountability: coming up with error distributions and some of the traditional restraint that you get from statistics. I think a synthesis of those two would be really valuable.
M: I come from the statistics side but play around with machine learning here and there. It probably reveals my biases, but I tend to think of machine learning as just another model; in other words, it’s another way to map input and output.
So do you see dangers ahead? You know, the popular press has all this discussion of the AI winter being over. And, we have people as diverse as Stephen Hawking and Elon Musk warning us about the dangers of general AI. It’s a curious question. I don’t have the expertise to have a really clear opinion myself, but I’m curious if you have any opinions or thoughts on that.
J: One danger that I see is algorithms without a human failsafe. So you could have false positives, for example, in anti-terrorist algorithms. And then there’s some twelve-year-old girl that’s arrested for being a terrorist because of some set of coincidences that set off an algorithm, which is ridiculous. Something more plausible would be more dangerous, right? I think the danger could increase as the algorithms get better.
M: Because we start to trust them so much, because they’ve been right so often?
J: Right. If an algorithm is right half the time, it’s easy to say well, that was a false positive. If an algorithm is usually right--if it’s right ninety-nine percent of the time--that makes it harder when you’re in the one percent. And the cost of these false positives is not zero. If you’re falsely accused of being a terrorist, it’s not as simple as just saying oh no, that’s not me. Move along nothing to see here. It might take you months or years to get your life back.
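The arithmetic behind this is the classic base-rate calculation. A short Python sketch with made-up numbers shows why even a "99% accurate" screen flags mostly innocent people when the thing being screened for is rare:

```python
# Bayes' rule with illustrative numbers (not from the interview).
prevalence = 1e-6           # P(threat): one in a million
sensitivity = 0.99          # P(flagged | threat)
false_positive_rate = 0.01  # P(flagged | no threat)

# Total probability of being flagged, then invert with Bayes' rule.
p_flagged = (sensitivity * prevalence
             + false_positive_rate * (1 - prevalence))
p_threat_given_flag = sensitivity * prevalence / p_flagged
print(f"P(threat | flagged) = {p_threat_given_flag:.2e}")
```

With these numbers, fewer than one in ten thousand flagged people is an actual threat, which is why a human failsafe matters.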
M: Or if there’s not a human in the loop, and a drone has weapons, you might be dead.
J: Well, yeah. In the less high-stakes sense, I think about phone trees. When I call tech support on something, I’m calling because I can’t solve my problem online. There’s something unusual about my situation, that’s when I most need to speak to a human. And when instead I get another computer program, one that’s clunkier than the one on the website, that’s very frustrating.
M: As we wrap up here, you mentioned one book already--Code Complete. Are there other books that you find yourself turning to fairly often?
J: Well, at different times I’ve had different books at my fingertips. For years I had Abramowitz and Stegun’s Handbook of Mathematical Functions. I always had it at my side; I kept a copy at the office and a copy at home. I haven’t looked at that one in a while, but at one point I was wearing it out. At times I’ve gotten a lot of good out of Knuth’s Art of Computer Programming. Again, something I haven’t touched as much recently, but at one time I was in it more often. Concrete Mathematics by Graham, Knuth, and Patashnik is a great book on discrete math that I turn to once in a while.
M: I found you because I was on HN and started reading your blog and some of the things that you write. So are there any blogs that you recommend?
J: Sure. Well, for statistics there’s Andrew Gelman’s blog; he always has good things to say. I subscribe to Terry Tao’s blog, not that I can always understand it. I seldom understand it all, but when I can, it’s exciting. And Mike Croucher’s blog, Walking Randomly, has some good articles on high-performance computing and scientific computing, that sort of thing.
M: Thanks so much for your time today. Make sure you follow John at www.johndcook.com