Model Baselines Are Important

written by Eric J. Ma on 2018-05-06

For any problem that we think is machine learnable, having a sane baseline is really important, and it is even more important to establish that baseline early.

Today at ODSC, I had a chance to meet both Andreas Mueller and Randy Olson. Andreas leads scikit-learn development, while Randy was the lead developer of TPOT, an AutoML tool. To both of them, I told a variation of the following story:

I had spent about 1.5 months building and testing a graph convolutional neural network model to predict RNA cleavage by an enzyme. I was suffering from a generalization problem: this model class would never generalize beyond the training samples for the problem at hand, even though I had seen the same model class perform admirably well for small molecules and proteins.

Together with an engineer at NIBR, we brainstormed a baseline with some simple features and threw a random forest model at it. Three minutes after implementing everything, we had a model that generalized and outperformed my implementation of graph CNNs. Three days later, we had an AutoML (TPOT) model that beat the random forest. After further discussion, we realized that the work we had done was sufficiently publishable even without the fancy graph CNNs.
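
To make the pattern concrete, here's a minimal sketch of that kind of baseline. The random data, the number of features, and the choice of a regressor are all illustrative stand-ins, not the actual project:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-ins for simple, hand-crafted features and the measured outcome.
X = np.random.normal(size=(500, 16))
y = np.random.normal(size=(500,))

# A sane baseline: no fancy architecture, just a random forest with defaults.
baseline = RandomForestRegressor(n_estimators=100, random_state=42)
print(cross_val_score(baseline, X, y, cv=5).mean())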

I think there’s a lesson in establishing baselines and MVPs early on!



Consolidate your scripts using click

written by Eric J. Ma on 2018-03-30

Overview

click is amazing! It's a Python package that lets us easily add a command-line interface (CLI) to our Python scripts. This is a data scientist-oriented post on how we can use click to build useful tools for ourselves; in particular, I want to focus on how we can better organize our scripts.

I sometimes find myself writing custom scripts to deal with custom data transforms. Refactoring them into a library of modular functions can really help with maintenance. However, I still end up with multiple scripts that have no naturally logical organization... except for the fact that they are scripts I run from time to time! Rather than leave them scattered in multiple places, why not put them together into a single .py file, with options that are callable from the command line?

Template

Here's a template for organizing all those messy scripts using click.

#!/usr/bin/env python
import click


@click.group()
def main():
    pass


@main.command()
def script1():
    """
    Makes stuff happen.
    """
    # do stuff that was originally in script 1
    click.echo('script 1 was run!')  # click.echo is recommended by the click authors.


@main.command()
def script2():
    """Makes more stuff happen."""
    # do stuff that was originally in script 2.
    print('script 2 was run!')  # we can use print instead of click.echo as well!

if __name__ == '__main__':
    main()

How to use

Let's call this new meta-script jobs.py, and make it executable.

$ chmod +x jobs.py

When we execute it at the command line, we now get a help command for free:

$ ./jobs.py --help
Usage: jobs.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  script1  Makes stuff happen.
  script2  Makes more stuff happen.

We can also use just one script with varying commands to control the execution of what was originally two different .py files.

$ ./jobs.py script1
script 1 was run!
$ ./jobs.py script2
script 2 was run!

Instead of versioning multiple .py files, we now only have to keep track of one file where all non-standard custom stuff goes!

Details

Here's what's going on under the hood.

With the decorator @click.group(), we expose the main() function as a "group" of commands callable from the command line. Under the hood, @click.group() wraps main() in a click Group object, which is what lets us use the decorator syntax @main.command() to register other functions (in our case, script1 and script2) as sub-commands of that group.
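
Sub-commands can also take their own options and arguments, which makes it easy to parameterize what used to be hard-coded values inside the scripts. Here's a sketch; the option and argument names are made up for illustration:

@main.command()
@click.option('--n-rows', default=10, help='Number of rows to process.')
@click.argument('filename')
def script3(n_rows, filename):
    """Processes the first N_ROWS rows of FILENAME."""
    click.echo(f'processing {n_rows} rows of {filename}!')

Running ./jobs.py script3 --help then documents the option and argument for us, just like the top-level --help does for the group.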

Recap

With click, we can consolidate scattered one-off scripts into a single .py file, get a --help menu and sub-commands for free, and keep just one file under version control.


Lessons learned and reinforced from writing my own deep learning package

written by Eric J. Ma on 2018-02-28

At work, I've been rolling my own deep learning package to experiment with graph convolutional neural networks. I did this because in graph-centric deep learning (an idea I picked up from this paper), the inputs, convolution kernels, and much more are still being actively developed, and the standard APIs don't fit this kind of data.

Here are the lessons I learned (and reinforced) while doing this.

autograd is an amazing package

I am using autograd to write my neural networks. autograd provides a way to automatically differentiate numpy code. As long as I write the forward computation up to the loss function, autograd will be able to differentiate the loss function w.r.t. all of the parameters used, thus providing the direction to move parameters to minimize the loss function.
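
To make that concrete, here's a minimal sketch of the workflow with a toy linear model; the data and parameter names are made up for illustration:

import autograd.numpy as np  # thinly-wrapped numpy
from autograd import grad

def predict(params, x):
    """Forward computation: a plain linear model."""
    return np.dot(x, params['w']) + params['b']

def loss(params, x, y):
    """Mean squared error, written as ordinary numpy code."""
    return np.mean((predict(params, x) - y) ** 2)

# grad() returns a new function that computes d(loss)/d(params),
# differentiating w.r.t. the first argument by default.
dloss = grad(loss)

params = {'w': np.random.normal(size=(3,)), 'b': 0.0}
x = np.random.normal(size=(100, 3))
y = np.random.normal(size=(100,))

gradients = dloss(params, x, y)  # same structure as params: {'w': ..., 'b': ...}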

Deep learning is nothing more than chaining elementary differentiable functions

Linear regression is nothing more than a dot product of features with weights, plus a bias term. Logistic regression just chains the logistic function on top of that. Anything deeper than that is what we might call a neural network.
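
Continuing the toy sketch above, the chain looks something like this; the function and parameter names are just for illustration:

import autograd.numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear(params, x):
    # dot product of features with weights, plus a bias term
    return np.dot(x, params['w']) + params['b']

def logistic_regression(params, x):
    # chain the logistic function on top of the linear model
    return sigmoid(linear(params, x))

def one_hidden_layer(params, x):
    # chain one more differentiable transform, and we have a "neural network"
    h = np.tanh(np.dot(x, params['w1']) + params['b1'])
    return sigmoid(np.dot(h, params['w2']) + params['b2'])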

One interesting thing that I've begun to ponder is the shape of the loss function, and how it changes when I change the model architecture, activation functions, and more. I can't speak intelligently about it right now, but from observing the training performance live (I update a plot of predictions vs. actual values every x training epochs), different combinations of activation functions seem to cause different behaviours in the outputs, and there's no first-principles reason for this that I can think of. All told, pretty interesting :).

Defining a good API is hard work

There are many design choices that go into an API. First off, I wanted to build something familiar, so I chose to emulate the functional APIs of Keras, PyTorch, and Chainer. I also wanted composability, where I can define modules of layers and chain them together, so I opted to use Python objects and take advantage of their __call__ method to achieve both goals. At the same time, autograd imposes a constraint: functions must be differentiable with respect to their first argument, which holds the parameters. Thus, I had to make sure the weights and biases are transparently available for autograd to differentiate. As a positive side effect, this means I can inspect the parameters dictionary quite transparently.
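
Here's a hedged sketch of what that kind of API might look like; the class, method, and layer names are hypothetical, not the actual package:

import autograd.numpy as np
from autograd import grad

class Dense:
    """A layer object whose parameters live in an external dict keyed by
    layer name, so that autograd can differentiate the loss w.r.t. that dict."""

    def __init__(self, name, n_in, n_out):
        self.name = name
        self.n_in, self.n_out = n_in, n_out

    def init_params(self, params):
        params[self.name] = {
            'w': 0.1 * np.random.normal(size=(self.n_in, self.n_out)),
            'b': np.zeros(self.n_out),
        }
        return params

    def __call__(self, params, x):
        p = params[self.name]
        return np.tanh(np.dot(x, p['w']) + p['b'])

# Composability: a model is just layers called in sequence.
layers = [Dense('dense1', n_in=10, n_out=32), Dense('dense2', n_in=32, n_out=1)]

params = {}
for layer in layers:
    params = layer.init_params(params)

def model(params, x):
    for layer in layers:
        x = layer(params, x)
    return x

def loss(params, x, y):
    return np.mean((model(params, x) - y) ** 2)

dloss = grad(loss)  # gradient w.r.t. the first argument: the params dict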

Optimizing for speed is a very hard thing

Even though I'm already doing my best with matrix math (and hopefully getting better at mastering 3-dimensional and higher matrix algebra), keeping my API clean and compatible with autograd (meaning no sparse arrays) has meant opting for lists of numpy arrays, and that choice gives up some speed.

Graph convolutions have a connection to network propagation

I will probably explore this a bit more deeply in another blog post, but yes, as I explore the math involved in doing graph convolutions, I'm noticing that there's a deep connection there. The short story is basically "convolutions propagate information across nodes" in almost exactly the same way as "network propagation methods share information across nodes", through the use of a kernel defined by the adjacency matrix of a graph.

Ok, that's a lot of jargon, but I promise I will explore this topic at a later time.
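
In the meantime, here's a minimal sketch of a single propagation step under that view, using a toy graph rather than the actual package:

import autograd.numpy as np

# Toy graph: 4 nodes in a chain, with self-loops added to the adjacency matrix.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]]) + np.eye(4)

X = np.random.normal(size=(4, 8))  # node feature matrix
W = np.random.normal(size=(8, 8))  # learnable weights

# One graph convolution step: each node's new features are an aggregate of its
# neighbours' features, i.e. information is propagated across the graph using
# a kernel defined by the adjacency matrix.
H = np.tanh(np.dot(np.dot(A, X), W))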

Open Sourcing

I'm an avid open source fan, and lots of my work builds on open source software. However, because this "neural networks on graphs" work was developed on company time and for company use, it will very likely be the first software project that I send to Legal to evaluate for open sourcing or publication. I'll naturally make my strongest case for open sourcing the code base (e.g. ensuring that no competitive intelligence is leaked), but will ultimately have to defer to them for the final decision.



Joy from teaching

written by Eric J. Ma on 2018-02-26

It always brings me joy to see others benefit from what I can offer.

Thanks for sharing the fruits of your journey on LinkedIn, Umar!

Also a big thanks to the others who have finished the course! I hope you have enjoyed the learning journey and were able to find problems to which you can apply your newly-gained knowledge!

Big thanks to DataCamp as well for developing their platform, which enables us to teach even outside of the academy! (Special shout-out to Hugo Bowne-Anderson and Yashas Roy, with whom I've personally partnered to make the content go live.)



Annotating code tests and selectively running tests

written by Eric J. Ma on 2018-02-25

I just learned about a neat trick when using pytest - the ability to "mark" tests with metadata, and the ability to selectively run groups of marked tests.

Here's an example:

import pytest

@pytest.mark.slow  # annotate it as a "slow" test
def test_that_runs_slowly():
    ...

@pytest.mark.slow  # annotate test as a "slow" test.
@pytest.mark.integration  # annotate test as being an "integration" test
def test_that_does_integration():
    ...

What's really cool here is that I can selectively run slow tests or selectively run integration tests:

$ py.test -m "slow"   # only runs "slow" tests
$ py.test -m "integration"  # only runs "integration" tests
$ py.test -m "not integration"  # only runs tests that are not "integration" tests.
