Overfitting

This will be a fairly technical post, but it shouldn’t be too bad.

I’m currently working at a job where we do a ton of machine learning. Machine learning is basically doing statistics with a lot of computers. There’s really no magic there, despite warnings from people with a lot of money and brainpower.

Machine learning is really a set of tricks that we’ve learned to solve a particularly hard kind of problem. The problem is summarized here.

  • Go get a bunch of data.
  • Look at the data really, really hard.
  • Try to figure out what the data is telling you about how the universe works.

This is pretty much a summary of every scientific discipline ever.

Note that the “try to figure out” part is often expressed in different ways, the other two common forms being: “make a prediction that is accurate” or “decide what we should do differently to get different results.” Both of these are just restatements of the problem of figuring out what the data tells you about the universe.

Now, there are a number of ways people go wrong when they do this kind of thing. I won’t bore you with all the tiny details of how to even begin to understand what you are looking at and how to make sense of it.

One common problem is called “overfitting”. The way it manifests itself is you propose a theory about the data that explains the data really, really well. In fact, remarkably well. But then, when you try out the idea in reality, it is horrible.

In order to understand this, imagine a scatterplot of data points. You want to predict what value you should get depending on where you are along the X-axis. What you could do is just draw lines connecting all the points together. This graph will accurately predict every value you’ve seen in the data, but it will not be a very good predictor of how reality behaves (in the vast majority of cases).
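Here is a minimal sketch of that scatterplot in Python, just to make the idea concrete. Everything in it is an assumption made up for illustration: the "real" relationship is a straight line (y = 2x + 1), the measurements carry some random noise, and the function name compare_fits is mine. It compares connecting the dots against fitting one boring straight line, first on the points we already saw and then on fresh points from the same process.

```python
import numpy as np


def compare_fits(seed=0, n_train=30, n_test=1000, noise=1.0):
    """Compare connect-the-dots with a straight-line fit on made-up data."""
    rng = np.random.default_rng(seed)

    # Assumed "reality" for this sketch: a straight line.
    def truth(x):
        return 2.0 * x + 1.0

    # The data we collected: evenly spaced x values, noisy y values.
    x_train = np.linspace(0.0, 10.0, n_train)
    y_train = truth(x_train) + rng.normal(0.0, noise, n_train)

    # Fresh observations from the same process, never seen while fitting.
    x_test = rng.uniform(0.0, 10.0, n_test)
    y_test = truth(x_test) + rng.normal(0.0, noise, n_test)

    # "Theory" 1: draw lines connecting all the points together.
    def dots(x):
        return np.interp(x, x_train, y_train)

    # "Theory" 2: a single least-squares straight line.
    slope, intercept = np.polyfit(x_train, y_train, 1)

    def line(x):
        return slope * x + intercept

    def mse(pred, actual):
        return float(np.mean((pred - actual) ** 2))

    return {
        "dots_train_mse": mse(dots(x_train), y_train),
        "line_train_mse": mse(line(x_train), y_train),
        "dots_test_mse": mse(dots(x_test), y_test),
        "line_test_mse": mse(line(x_test), y_test),
    }


if __name__ == "__main__":
    for name, value in compare_fits().items():
        print(f"{name}: {value:.3f}")
```

Connecting the dots reproduces the collected data exactly, so its error on that data is zero. On the fresh points it should do worse than the plain line, because it has memorized the noise along with the signal.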

In physics, we sometimes do the same things. We have common patterns we follow to try and avoid this. One of the patterns is “Don’t look at the data before making your theories.” That is, try to make your theories out of previous theories and new assumptions. This is like trying to hit a bullseye wearing a blindfold. The problem with this method is it is very inefficient, and there are only so many ideas we can come up with. However, when you do find that needle in the haystack, the theory that does a good job at matching the data, then we think we’re pretty close to reality. The best part is we know how that theory was put together, and we can think about it and reason about it.

This is what Newton did. Or really, it would’ve been what he did except he was familiar with the data, and he was looking for a reason why things moved the way they did. So really, it didn’t work that way in practice. And it never does. Theoretical physicists do look at data. They get inspired by it.

The issue is when you look at the data, and you see shapes, you propose math that explains those shapes, and then you try to figure out what it all means. The truth is that there are a lot of shapes that will fit that data. Some of them are worse than others. And you really have no way of knowing that the shape that fits best is really the shape that represents reality. True, the more points of data you have, the more certain you are about that shape being the right shape, but you can never reach a point where you can say “This is the only shape that works well.” It gets even worse when you consider the fact that the data you have collected is not and never can be 100% accurate.
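To see how the best-fitting shape can differ from the right shape, here is a rough Python sketch. Again, the setup is entirely assumed for illustration: the true shape is a straight line (y = 3x - 1), the data is that line plus noise, and fit_shapes is a name I made up. It fits polynomials of a few different degrees and reports two numbers for each: how closely the shape matches the noisy data, and how closely it matches the true line at those same x positions.

```python
import numpy as np


def fit_shapes(seed=1, n=40, noise=0.5, degrees=(1, 3, 6)):
    """Fit polynomials of several degrees to noisy data from a known line."""
    rng = np.random.default_rng(seed)

    x = np.linspace(-1.0, 1.0, n)
    truth = 3.0 * x - 1.0                    # assumed "reality"
    y = truth + rng.normal(0.0, noise, n)    # what we actually measured

    results = {}
    for degree in degrees:
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        fitted = np.polyval(coeffs, x)
        results[degree] = {
            # How well the shape matches the data we collected.
            "fit_to_data": float(np.mean((fitted - y) ** 2)),
            # How well the shape matches the underlying truth.
            "fit_to_truth": float(np.mean((fitted - truth) ** 2)),
        }
    return results


if __name__ == "__main__":
    for degree, scores in fit_shapes().items():
        print(degree, scores)
```

The wigglier polynomials can always match the collected data at least as well as the straight line, since a straight line is just a special case of them. That extra flexibility goes into chasing the noise, so in this setup they end up no closer to the true line, and usually farther from it.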

What does this have to do with conservatism?

Conservatism is one of those “inside out” philosophies on par with Newtonian Mechanics. It is a collection of ideas, “shapes” if you will, about how the world works. Philosophers and logicians have argued about these ideas for a very long time. They’ve been around for such a long time that they aren’t new anymore.

Granted, sometimes the shapes fit the data really well, and sometimes they don’t. There are other shapes you can find that fit the data better than conservatism. That isn’t really the problem we’re trying to solve, though. Focusing on what fits the data best gets you shapes that fit the data well but don’t have much power in understanding what is really going on.

The other type of philosophy when it comes to these sorts of things is the “outside in” philosophy. In these philosophies, you look really, really hard at the data, find a really good shape that fits, and then declare that to be the ultimate truth. Then from that newly discovered ultimate truth, you make predictions and take courses of action. This seems very reasonable, but as I said earlier, it has the fundamental flaw of overfitting.

The way this philosophy pops up is in comments like, “There are poor people. We have to do something!” or “The rich make a lot of money! We have to do something!”

And if that something is aimed at changing the metric, and you make proposals based on conclusions drawn solely from the data, you’re going to get some really bad ideas. For instance, we could kill all the rich people and the poor people and that would certainly eliminate the problem of poverty and wealth disparity. Hopefully, something is telling you that this proposal is fundamentally wrong. But don’t you also feel like there is something wrong with the idea of taxing and giving money away?

I can’t tell you the number of times I’ve personally seen data-driven thinking lead people astray. When I worked at Amazon, we were almost cult-like in our devotion to data, but we knew about overfitting and we tried to avoid it. This was years ago, but we used to run A/B testing on various changes to the website. We developed ideas about why people clicked on some things and not other things. Some of our ideas were pretty fantastic, and entire teams were formed to pursue them. In the end, we discovered that the data seemed to be telling us people click on new things but they really don’t like things to change very much. That’s why Amazon.com doesn’t really look very much different than it did a decade ago.

In your own life, be very careful about making decisions based on data. Be careful about overfitting. Be careful about reading too much into the data you see. Remember that the shape you think fits the data isn’t necessarily true even if it is the best shape.

If you really want to understand something, you have to develop theories in an almost clean-room environment free from data. Once those ideas are developed, then you can test them to see if they pan out. But be careful about reading the data too closely, since it can lead you astray.

Also, this is why I don’t like string theory. It’s a plain and simple case of overfitting. They literally pick and choose which versions of string theory they like based on how well it fits the data.

And this is why I don’t like political decisions being made with graphs in the background.

3 Responses to “Overfitting”

  1. Jason Gardner Says:

    Good post.

    I think one of the things wrong when data-driven analysis is applied to politics is that humans are not at all introspective. I mean not at all.

    Quite the opposite. We seem to devote an inordinate amount of time and effort to the denial of the reality that surrounds us. All of us do that. I’ve never met someone, including myself, who doesn’t lie to themselves in one way or another.

    I believe it is a very human trait.

    For example, an alien Anthropologist would have exactly no problem classifying us as a mildly polygamous (due to the ratio of male to female size) sexually reproducing, sexually dimorphic species of mammal.

    However, we act as if we are some bizarre new entity (we are not a species of animal, of course) that has nothing in common with the other mildly polygamous, sexually reproducing, sexually dimorphic species of mammals.

    Example: Feminism, which is wildly popular, talks of a “double standard” in our culture. Well, duh, of course there is a double standard. We are a sexually reproducing, sexually dimorphic species. All such species have double standards for the two sexes. Duuuuhhhh. Let me repeat. Duuuuuuhhhhhh. (You mean a male elephant seal acts different from a female elephant seal? Blown away!)

    Of course, most of us believe these sexual facts do not apply to humans. We are a magical separate entity, removed from nature. In fact, we teach the philosophy that the rules of sexual dimorphism and sexual reproduction don’t apply at all to us humans. In fact, if I were to suggest such rules do apply to humans that would make me ignorant and evil.

    Of course, the academics then trot out the graphs and curves to show how evil we are for, like most people, acting like we are sexually dimorphic and sexually reproducing. Any evidence of a view contrary to this truth is considered a great conspiracy and cause for alarm.

    So much for being self aware.

    “Conservatism is one of those “inside out” philosophies on par with Newtonian Mechanics. It is a collection of ideas, “shapes” if you will, about how the world works.”

    This is misleading, I think. Conservatism works to a certain extent but I don’t believe conservative people understand why. They evidently don’t understand well enough to defend conservatism and win new converts en masse. In fact, conservatism has been on the retreat so long I don’t recognize it anymore.

    To have a solid political foundation you need a solid understanding of life, a solid understanding of biology and a solid understanding of actual human psychology.

    I’m mystified that conservatives haven’t picked up what has been going on in the last 10 years or so in regards to our understanding of ourselves and nature. All of the data has come in on their side. For some reason conservatives can’t recognize the win they have been given by science and academia.

    I’m talking much, much broader than economics. (The conservatives have won that one, even though they don’t notice the win and keep fighting.) I’m talking about the conservative values that they gave up 25 or so years ago. (Nuclear families, father in charge, different roles for women and men, homosexuality being an aberration, nationalism, the modern world being a good thing, etc, to minor things like hunting.)

    It’s weird how conservatives have all the data and all of science and academia behind them and yet they don’t recognize it or capitalize on those facts.

    It’s like fighting the Romans 2000 years ago and not noticing that you have nuclear weapons. And then losing to the Romans having never used your good weapons. Truly bizarre stuff.

    • Jonathan Gardner Says:

      Shower thought: Since conservatism is focused on preserving the past, and since we know things will always change, conservatism is always fighting a losing battle. We’re doomed to repeat the mistakes of history time and again.

      The point about lying to oneself: Scientists who are extraordinarily careful have to lie to themselves too. I remember in my senior year I took a class on particle physics and on the first day the professor pulled up charts and graphs and said, “Can any of you explain why there are bumps here? If you can, you’ve earned a Nobel Prize. The bottom line is this: What you are going to learn is simply not true. It works for a lot of cases, but for a lot of other cases it simply doesn’t work. Now that you’re out of the textbook and closer to the actual research, you’re going to see that nothing is cut and dried, and there is no prevailing theory that explains everything.”

      I wonder if such an attitude towards all political philosophies, including conservatism, is healthy. It’s good to know that you’re wrong and always will be wrong, and the moment you’ve got something that actually works just a little better than what worked before, it’s worth investigating. Knowing that new information is likely just as wrong as the old means you render deference to the old.

  2. Jason Says:

    I think conservatism generally means the collective wisdom of the English people. Broadly, what we in the English culture have learned since the Romans left. It’s a set of “rules of thumb” that have proven to work without anybody really understanding why.

    The Marxists of the last century challenged the collective wisdom and the conservatives couldn’t fight back. Totally understandable.

    It’s like knowing that you plant wheat when the temperature is above 50 in April for four straight days. A farmer knows it generally works, even though he has no understanding of climatology, planetary mechanics or plant biology. If the farmer is challenged he can’t really effectively defend why he does it other than it works.

    I think the big change is that we are understanding why those rules of thumb are right. Marxist/leftist ideology is so under siege that they actually create safe spaces where Marxist students can go so that the now provable facts don’t confront them too harshly.

    Think about that.

    Why not re-mold conservatism in light of the new facts?
