Computational Imagination

Profile in Computational Imagination: Rob Trangucci

Creating Trading Strategies from Sentiment

Rob Trangucci, the focus of our profile today, comes recommended by Dr. Jon Krohn, profiled here. Rob particularly impressed Jon with his work on the R implementation of the Stan language coming out of Andy Gelman's lab at Columbia. We will get to that work and the great team he worked alongside in a moment, but let's start with Rob's current work as a data scientist at iSENTIUM.

Trading Models at iSENTIUM

Mike(M): Rob, start us off with a quick description of what you are working on today in your role as Senior Data Scientist at iSENTIUM.

Rob(R): I look for connections between what people say on Twitter today and how stock prices move in the future. We quantify the strongest and most stable of these connections, and then we develop trading strategies.

M: What interests you, even excites you, about this analytical domain?

R: Developing statistical models is the most fun part of my job. It's sort of like clarifying your thoughts by writing something down. Modeling forces you to formalize your beliefs about how your data is being generated. You learn so much by codifying your hypotheses, testing these hypotheses against data with a model, and then iteratively honing your model.

M: Tell us a little more about the models you build and the kinds of problems you solve.

R: The models involve using Twitter sentiment data to predict equity returns. That means that I do a lot of regression and time-series modeling. I wrote most of the modeling code to enable easier backtesting and to ensure that I know exactly what is going on internally. I'm the type that needs to write code and get into the guts of an algorithm before I really understand the output. Luckily it turns out that this code has been usable both in research and production.

I also spend a lot of time writing Python code to put models into production, where they generate trading signals on the fly. This involves understanding all of the idiosyncrasies of financial data and how to systematically and methodically handle those idiosyncrasies.
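The regression of future returns on current sentiment that Rob describes can be sketched roughly as follows. This is only an illustration, not iSENTIUM's model: the data are synthetic, the relationship between sentiment and returns is chosen by hand, and the names `lagged_ols`, `sentiment`, and `returns` are hypothetical. The point is the alignment step, pairing a signal observed at time t with a return realized at time t + 1.

```python
import random

def lagged_ols(signal, returns, lag=1):
    """Fit returns[t + lag] = alpha + beta * signal[t] by ordinary least squares."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    x = signal[:-lag]   # signal observed at time t
    y = returns[lag:]   # return realized at time t + lag
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Synthetic data only: the "true" effect below is an assumption made for illustration.
rng = random.Random(0)
true_beta = 0.5
n_obs = 2000
sentiment = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
# The return at time t depends on sentiment at time t - 1.
returns = [noise[0]] + [true_beta * sentiment[t - 1] + noise[t] for t in range(1, n_obs)]

alpha_hat, beta_hat = lagged_ols(sentiment, returns, lag=1)
```

Slicing the two series against each other, rather than regressing same-period values, keeps the fit from using information that would not have been available when the trade was placed.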

Professional Pathway

M: Rob, you studied physics as an undergraduate. How did you end up in data science and quantitative trading model development?

R: My initial excitement about physics was about modeling physical processes with math. I thought the idea of reducing a seemingly complex system to an interpretable set of equations was just so cool. Later on in college I took a regression course in the math department and optimistically and naively thought the same could be done for social/economic processes. Given the complexity of the world, there are many more assumptions that have to be made, but this is a good thing because it makes data science and stats jobs indispensable.

M: What were some key milestones along the journey to where you are today professionally?

R: Figuring out what I didn't like to do day-to-day was one of the most important milestones for me. I worked as a management consultant after college. After working there for 3 years, I knew that my day-to-day didn't have enough formal mathematical modeling to be fulfilling. I wasn't sure precisely what I wanted to do, but I knew I needed a deliberate step towards quant modeling, so I decided to go back to school. Luckily, I was able to find the Quantitative Methods in the Social Sciences program at Columbia, which offered a flexible curriculum; this was perfectly suited for someone who wanted lots of exposure to disparate fields, and the ability to home in on a particular field later on in the program. QMSS also introduced me to Stan and Andrew Gelman.

M: What are some of the most gratifying outcomes from your work?

R: The most gratifying outcome from the work I did for Stan, which is a statistical modeling language developed at Columbia, was making certain processes run faster. For example, I was able to speed up the computation of the gradient of the Cholesky decomposition (a matrix factorization function) by about 50% and decrease the memory requirements for the computation from O(D^3) to O(D^2), where D is the dimension of the matrix, by implementing Mike Giles's algorithm. This makes sampling from Gaussian process latent variable models in Stan much more feasible than it was before. Any incremental speed-up in inference translates to more modeling, which is good for everyone.

At iSENTIUM, the most gratifying outcomes are identifying strong trading signals and helping our clients capitalize on those signals.

R Libraries for Stan

M: I want to dig into your work on Stan. Tell us about that project.

R: For the uninitiated, Stan is a statistical modeling language that allows for Bayesian inference and maximum likelihood on any user-defined model in the language. It could be a regression model, a time series model, a survival model, etc. Stan is named after Stanislaw Ulam, who invented Monte Carlo methods while he worked on the Manhattan Project.

The Bayesian inference (meaning, we want to sample from the posterior distribution over the model parameters instead of finding the values of the parameters that maximize the likelihood function) is done through a special Markov Chain Monte Carlo algorithm called the No-U-Turn Sampler, which adaptively tunes the parameters of a Euclidean Hamiltonian Monte Carlo sampler.

Most of what I was involved with while I was at Stan full-time was writing C++ code to enable automatic calculation of higher-order derivatives, which would allow for Stan to use maximum marginal likelihood and Riemannian Hamiltonian Monte Carlo as inference methods, and adding more specialized math functions along with pre-computed gradient functions. I also worked on building an equivalent to lme4, which is a mixed-effects modeling language and package in R. It's widely used to do inference on these types of models because of the succinct formulas used to build these relatively complex models. However, because of the inference procedure (maximum marginal likelihood) it tends to understate the uncertainty in group-level variance parameters. We wanted to provide an option for Stan code to be generated by the same formula, but for the inference to be done in Stan.

However, the version I built generated fairly inefficient, though easily interpretable, Stan code. We didn't end up releasing this version as an R package, but Jonah Gabry and Ben Goodrich, two core Stan developers, are putting together rstanarm, which has pre-compiled, very efficient Stan models, and wraps a bunch of useful post-processing functions, like posterior predictive checks and approximate leave-one-out cross validation. I'm excited for them to release this!

M: What kinds of problem sets is Stan particularly well suited to solving?

R: Hierarchical models, latent variable models, and models with discrete parameters that can be integrated out (finite mixture models, change-point models).

Really, Stan will be able to sample from any model where you can define a log-probability density over a set of continuous parameters. Stan's forte is exploring high-dimensional probability spaces with its specialized implementation of a Euclidean Hamiltonian Monte Carlo sampler. All of this is implemented in efficient C++ code, so it's extremely speedy compared to other samplers, and allows for proper Bayesian inference on very large models.
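"Integrating out" a discrete parameter means summing the density over its possible values, so that only continuous parameters remain. For a finite mixture, the component label is summed out as log p(y) = log sum_k w_k Normal(y | mu_k, sigma_k). The sketch below shows that calculation in plain Python rather than in Stan's modeling language; the function names are hypothetical, and the weights are assumed to be strictly positive.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_lpdf(y, weights, mus, sigmas):
    """Log density of y with the discrete component label summed out."""
    terms = [
        math.log(w) + normal_lpdf(y, mu, sigma)
        for w, mu, sigma in zip(weights, mus, sigmas)
    ]
    return log_sum_exp(terms)
```

Working on the log scale and subtracting the maximum before exponentiating avoids underflow when a data point is far from every component.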

gif illustrating Stan

Euclidean Hamiltonian Monte Carlo involves simulating the trajectory of a particle by solving discretized differential equations (Hamilton's equations) that relate the particle's kinetic and potential energy functions to the particle's momentum and position. If you simulate the trajectory for a fixed number of time steps with a fixed time step size, the particle's position at the end of the trajectory will represent a valid draw from the distribution represented by the potential energy function (with a slight correction for discretization error).

This shows a particle's trajectory with a potential energy function equal to the negative of the log-density of a bivariate normal distribution with correlation of 0.95.

Animation by Bob Carpenter, 2015.
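The trajectory simulation described above can be sketched in a few lines. This is a minimal fixed-step Hamiltonian Monte Carlo sampler for the same target as the animation, a bivariate normal with correlation 0.95 (unit variances are assumed here). It is not Stan's implementation and does none of the adaptive tuning that the No-U-Turn Sampler performs; the step size, number of steps, and function names are illustrative choices.

```python
import math
import random

RHO = 0.95  # correlation of the target bivariate normal (unit variances assumed)

def potential(q):
    """Potential energy: negative log-density of the target, up to a constant."""
    x, y = q
    return (x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO ** 2))

def grad_potential(q):
    """Gradient of the potential energy with respect to position."""
    x, y = q
    scale = 1.0 / (1.0 - RHO ** 2)
    return [scale * (x - RHO * y), scale * (y - RHO * x)]

def leapfrog(q, p, step_size, n_steps):
    """Simulate the particle's trajectory with the leapfrog integrator (n_steps >= 1)."""
    q, p = list(q), list(p)
    g = grad_potential(q)
    p = [pi - 0.5 * step_size * gi for pi, gi in zip(p, g)]    # half step for momentum
    for step in range(n_steps):
        q = [qi + step_size * pi for qi, pi in zip(q, p)]      # full step for position
        g = grad_potential(q)
        if step < n_steps - 1:
            p = [pi - step_size * gi for pi, gi in zip(p, g)]  # full step for momentum
    p = [pi - 0.5 * step_size * gi for pi, gi in zip(p, g)]    # final half step
    return q, p

def hmc_step(q, step_size, n_steps, rng):
    """One transition: draw a momentum, simulate, then accept or reject."""
    p0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    h0 = potential(q) + 0.5 * sum(pi * pi for pi in p0)
    q_new, p_new = leapfrog(q, p0, step_size, n_steps)
    h1 = potential(q_new) + 0.5 * sum(pi * pi for pi in p_new)
    # Metropolis correction for the integrator's discretization error.
    if rng.random() < math.exp(min(0.0, h0 - h1)):
        return q_new
    return q

def sample(n_draws, step_size=0.15, n_steps=20, seed=1):
    """Run the chain from the origin and return the list of positions."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    draws = []
    for _ in range(n_draws):
        q = hmc_step(q, step_size, n_steps, rng)
        draws.append(q)
    return draws

draws = sample(2000)
```

The accept/reject step is the "slight correction for discretization error": the leapfrog integrator does not conserve energy exactly, so proposals that gain energy are accepted only with probability exp(h0 - h1).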

The power of Stan is the flexibility of its modeling language, which is connected to this high-powered sampler and is tightly coupled to an extensive math library. EHMC requires gradient information about the log-density, and the math library implements reverse-mode automatic differentiation to compute all the gradients of any density defined in Stan's modeling language. Furthermore, the team has implemented pre-computed gradient functions for many specialized functions that come up all the time in stats, like the normal density, the gamma density, etc.
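To make the idea of reverse-mode automatic differentiation concrete, here is a toy scalar version in Python. It is only a sketch of the general technique, not the design of Stan's C++ math library: each operation records its inputs and local partial derivatives, and a single backward sweep applies the chain rule from the output to every input. The `Var` class and helper names are hypothetical.

```python
import math

class Var:
    """A scalar node that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = _lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __truediv__(self, other):
        other = _lift(other)
        return Var(self.value / other.value,
                   ((self, 1.0 / other.value),
                    (other, -self.value / other.value ** 2)))

    def __neg__(self):
        return Var(-self.value, ((self, -1.0),))

    def __sub__(self, other):
        return self + (-_lift(other))

    __radd__ = __add__
    __rmul__ = __mul__

def _lift(x):
    """Wrap plain numbers as constant nodes."""
    return x if isinstance(x, Var) else Var(float(x))

def log(x):
    return Var(math.log(x.value), ((x, 1.0 / x.value),))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# Normal log-density in mu and sigma (up to a constant), for one observation y.
y = 1.0
mu, sigma = Var(0.5), Var(2.0)
z = (mu - y) / sigma
lp = -log(sigma) - 0.5 * z * z
backward(lp)
```

After the backward sweep, `mu.grad` and `sigma.grad` hold the partial derivatives of the log-density, which is the gradient information the sampler needs. A pre-computed gradient function, in this picture, is a single node whose local partials are written out analytically instead of being assembled from many small operations.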

Given Stan's flexibility, and its ability to sample from arbitrary densities, researchers aren't forced to make the tradeoff between tractable models that don't accurately reflect the researcher's beliefs and models that do reflect the researcher's beliefs but are hard to sample from. For example, if researchers didn't want to write their own Markov Chain Monte Carlo sampler, they were forced to use an inverse Wishart distribution as the prior for covariance matrices, but inverse Wishart distributions have bad properties (see this paper).

This is pernicious because covariance matrices can be parts of a statistical model where priors have strong effects; they're hard to estimate when they get quite large, or when they're describing the relationship between latent variables. For covariance matrix modeling, Stan allows for a class of priors called LKJ distributions over the correlation matrix and marginal priors for the variance parameters. This is really useful from a modeling perspective because this mirrors how we think about multivariate relationships, which is typically in terms of correlations and univariate standard deviations. It allows the researcher to encode their beliefs for each separately, rather than requiring the researcher to model both with the same distribution.
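The separation described above rests on the identity Sigma = diag(sd) * Omega * diag(sd), where Omega is the correlation matrix and sd holds the standard deviations. The sketch below shows that decomposition in plain Python, together with the unnormalized LKJ log-density for the 2 x 2 case, where the density is proportional to det(Omega)^(eta - 1) and det(Omega) = 1 - rho^2. The function names are hypothetical, and this is an illustration of the idea rather than Stan code.

```python
import math

def covariance_from_parts(sds, corr):
    """Build a covariance matrix from standard deviations and a correlation matrix."""
    n = len(sds)
    return [[sds[i] * corr[i][j] * sds[j] for j in range(n)] for i in range(n)]

def parts_from_covariance(cov):
    """Split a covariance matrix into standard deviations and a correlation matrix."""
    n = len(cov)
    sds = [math.sqrt(cov[i][i]) for i in range(n)]
    corr = [[cov[i][j] / (sds[i] * sds[j]) for j in range(n)] for i in range(n)]
    return sds, corr

def lkj_2d_unnormalized_lpdf(rho, eta):
    """Unnormalized LKJ log-density for a 2 x 2 correlation matrix with correlation rho."""
    return (eta - 1.0) * math.log(1.0 - rho * rho)

# Beliefs about scale and about correlation are specified separately.
sds = [2.0, 0.5]
corr = [[1.0, 0.3], [0.3, 1.0]]
cov = covariance_from_parts(sds, corr)
```

With eta = 1 the LKJ density is flat over correlations, and larger values of eta concentrate mass near zero correlation, so the prior on the correlation structure can be set without touching the priors on the standard deviations.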

M: How did you initially get involved in that project?

R: I took a class from Ben Goodrich at Columbia called Missing Data (fascinating, really, and a bit disconcerting that more isn't done to formally model missing data in many studies). After I graduated from the QMSS program, I wanted to continue learning, and I loved Stan, so he suggested I talk to Andrew about working on the Stan team.

M: What was your biggest challenge during the implementation?

R: The biggest challenge in working on Stan, and really any big, technical project with multiple contributors, is that knowledge is hard to transfer. I spent a lot of time working with Bob Carpenter and Daniel Lee (two of the core developers) just trying to run up the learning curve on the algorithms and data structures that undergird Stan. Before I could contribute anything, I needed to understand what was going on. Luckily, Bob and Daniel were infinitely patient, and answered all my questions. But Stan is open-source with developers from all over the world, and I'm not sure how to enable this type of learning when you're not sitting next to the core developers looking over their shoulders as they code. Stan depends on regular contributions from non-core developers (there's just too much to do and maintain to expect the current core team to do everything). We have to be able to teach people how to contribute and how the algorithms work, which is key if you're planning to make a contribution to the specialized math library. Bob just put up a paper on how the automatic differentiation code in Stan works, which will go a long way towards helping people understand what's going on under the hood.

Computational Imagination

M: What would you like to be able to measure that you currently don't?

R: I'm not certain that we need more data or more measurement; we're already getting unprecedented amounts of data. We need the capability to appropriately model the complexity of the data we're generating.

I don't usually find myself saying: "I wish we could measure/had measured this process", but I do find myself saying: "I know data exist on this somewhere, why is it so hard to access/why isn't it collected by a central agency more often, etc." Data on police shootings is a good example. It turns out there's not a great dataset on police shootings. This isn't because these aren't measured, but because there are incentives not to share the information, and because it would take a Herculean effort and lots of money to go precinct by precinct to collect all the cases and manually code quantitative variables for each case. There's a guy out of Bowling Green who was profiled on FiveThirtyEight recently who uses Google News alerts to record media-reported police shootings and brutality cases, but there's a clear selection bias problem with that dataset and it's unclear how it would impact conclusions drawn from the dataset. It's such an important issue, but our lack of data on the phenomenon prevents us from getting a coherent picture.

That said, personally and selfishly, I can't believe the New York subway lines don't all have precise train timing clocks that update in real time in response to delays. A friend of mine, James Somers, wrote an article about this that just came out, so it's gotten me thinking about these "countdown clocks". Some lines do have this, like the 2/3 and 4/5/6 so I know there are ways to measure the location of trains. I happen to be dependent on two subway lines that don't have countdown clocks :)

M: If you had substantially greater independent resources what "crazy" idea would you pursue? Why? What intrigues you about this idea? What are the constraints or risks that would cause some to view it as "crazy"?

R: Hmm...I'd focus on increasing the statistical literacy of the general population. There are problems with the way statistics is taught (often not at all) at the undergraduate level. Introductory courses are too often focused on teaching students formulas as if they were handed down from on high rather than arising naturally from probability models. Understanding the connection between the probability model and the statistic you're calculating brings to the forefront the assumptions that go into the model, which can then elucidate what happens when your data violates the assumptions of the model.

I think this is "crazy" because stats has been taught this way for years, so changing the curriculum would be hard.


M: What kind of tool-chain are you working with at iSENTIUM?

R: The current tool chain is extremely Python oriented; I work in a Jupyter notebook all day long that's connected to an ipython session that's running on one of our servers. Most of what we've done wouldn't be possible without pandas. PyStan (Python's Stan interface) enables fitting Stan models in Python. Jupyter has really been transformative (thank you to our VP of Engineering, Gaurav Hariani, for making this possible and for introducing me to the whole Jupyter workflow) because much of my analysis is of the embarrassingly-parallel sort, and Jupyter allows you to set up a cluster and run code in parallel very easily.


M: What is on your list to learn next?

R: I'd like to integrate Gaussian process regression into my analytical toolset, but it turns out that learning the hyperparameters can be tricky. The GPstuff Toolbox developed at Aalto University handles all of the approximations that are sometimes necessary to run GP regression on larger datasets, such as expectation propagation and maximum marginal likelihood.

M: What books do you find yourself turning to repeatedly?

R: I'm a Bayesian and I run a lot of regression models, so the two references I'm always turning to are Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill (also known as ARM for Applied Regression Modeling), and Bayesian Data Analysis by Gelman et al.

M: What blogs do you follow regularly?

R: I follow Andrew Gelman's blog pretty religiously, along with Christian Robert's blog.

M: Rob, thanks for sharing your professional journey and insights to this point. I am confident that there are many good pathways ahead for you.

R: My pleasure.