What factors are most predictive of a game's outcome for the Rockets? A multiple/logistic regression

Discussion in 'Houston Rockets: Game Action & Roster Moves' started by hollywoodMarine, Feb 16, 2014.

hollywoodMarine Member

Hey guys. So we're going over regression in my stats class now, and I thought, what a great way to get some practical application practice by building a couple of models based on the Rockets' games so far! Since people (rightfully) requested a TL;DR section the first time I posted a similar analysis thread, I'll post a short conclusion at the end with the main points in bullet point format. If you are pressed for time but still want to read more than a couple of bullet points, I've underlined the significant portions for you to skim through.

As always, I will probably make quite a few mistakes, so constructive criticism is appreciated! I am also open to suggestions for more suitable variables to analyze.

Goals of this Post

This post aims to provide some insight into how much different variables (different parts of the box score) contribute to or predict wins/losses for the Rockets. Multiple regression and logistic regression models are created to help answer these questions. Additionally, regression models can identify outlier games, such as games the Rockets “should” have won but lost (and vice versa).

A Quick Summary of Multiple Regression

Regression analysis is a statistical process for explaining (or trying to explain) relationships among variables. For basketball, it's a fancy way of saying “what would this game look like if I were to put it in the form of y=mx+b,” where “x” is the thing you want to examine (e.g., number of turnovers in a game), “y” is the outcome you care about (e.g., point differential / margin of victory), and “m” describes the relationship (so if TO's and margin of victory have a negative relationship, then m would be a negative number). “b” is the intercept/constant, which is what y would be if x were zero. If the model says the Rockets would win by 25 points on average in a game with zero turnovers, then “b” here would be 25, and each additional turnover would change that 25 by “m” points.
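To make the y=mx+b idea concrete, here is a minimal Python sketch that fits a single line by ordinary least squares. The turnover and margin numbers are made up for illustration (they are NOT real Rockets data) and were chosen so the intercept comes out at 25 like in the example above; `fit_line` is just a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with a single x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means
    b = mean_y - m * mean_x
    return m, b

# Hypothetical games: margin drops 1.5 points per turnover from a base of 25
turnovers = [10, 12, 14, 16, 18, 20]
margins = [25 - 1.5 * t for t in turnovers]

m, b = fit_line(turnovers, margins)
print(m, b)  # -1.5 25.0
```

With real box scores the points would not sit exactly on a line, so the fit would only approximate them. The slope and intercept are read the same way.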

What's great about multiple regression is that you don't have to limit your model to just y=mx+b. You can add a lot more factors, such as rebounds, steals, shooting percentage, etc. So the model would look something like y = m1x1 + m2x2 + m3x3 + … + b, with each “x” (x1, x2, x3, etc.) contributing to y by its own amount.

What exactly does the model do? As said earlier, it explains the relationship between all the independent variables (the “x's”) and the dependent variable (“y”). This relationship can be used to predict “y” given whatever “x's” you put into the model. However, the predictive aspect of the model is not too useful; you can't really know how many turnovers, or rebounds, or whatever the Rockets will have next Wednesday to plug into your model and predict if they'll win or not (unless you can see into the future, in which case this predictive aspect would be even more useless). What is useful about this model is that it can suggest which games are anomalies, i.e., which ones had outcomes that are really, really different from what the model predicts. It can suggest whether a game was closer than it should have been, or whether, given how well the team played, it should have won a game that it lost.

However, the most important thing this model tells us is how much an independent variable (which I'll refer to as IV from now on) contributes to the dependent variable (which I'll refer to as DV from now on) compared to the other IV's of interest. In other words, the model helps us answer questions like: Do turnovers hurt the team more than poor three point shooting? Is defense really more important than offense? Are there other factors that contribute significantly to winning/losing that maybe many of us have overlooked?

We determine how much effect an IV has on the DV by looking at the coefficients, i.e., the slopes (the “m”s in front of the “x”s). In general, the greater the magnitude of “m” (positive or negative), the greater the effect, although this can be misleading too (more on this later).

A Quick Summary of Logistic Regression

While multiple regression can be a great tool for the reasons listed above, it requires a DV that is continuous or quantitative, in other words, a DV with numerical value that is measured on a continuum. That is why I listed “point differential” or “margin of victory” as the DV, rather than “Win” vs “Loss.” The latter is a categorical DV, which does not fit into a multiple regression analysis.

Why does it matter? Doesn't point differential basically tell us if you won or lost (with the added benefit of providing information on how close/blowout a win/loss was)? Wouldn't multiple regression measuring point differential be sufficient then?

The problem with measuring win/loss on a continuous scale is that it kind of devalues the effects of IV's that don't make huge differences in point differential but can still often mean the difference between a win and a loss. A good example, again, is turnovers. A couple more turnovers may not actually lower your margin of victory by as much as poor shooting would, but there are times when those turnovers may really mean the difference between winning and losing. Therefore, measuring DV on a categorical level can help identify those “small difference makers” that can still significantly predict a win/loss (even if they don't make much difference in terms of DV on a continuous scale like point differential). This is where logistic regression comes into play. It has a formula that is similar to multiple regression (logit=a+bX, kinda like y=mx+b) except the “y” is a logit, a natural log of the odds. Don't worry about exactly what that is, just know that logit can be converted into a probability. So basically logistic regression is similar to multiple regression but it tells you what the odds are that the DV belongs to some category (such as Win) vs some other category (such as Loss) given the IV's.
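For anyone curious what a logit actually is, here is a small Python sketch of the conversions between probability, odds, and logit. These are the standard textbook definitions; the function names are illustrative only and not tied to SPSS.

```python
import math

def odds_from_probability(p):
    """Odds of an event, e.g. p = 0.75 gives odds of 3 (3-to-1)."""
    return p / (1 - p)

def logit_from_probability(p):
    """The logit is the natural log of the odds."""
    return math.log(odds_from_probability(p))

def probability_from_logit(logit):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-logit))

p = 0.75
logit = logit_from_probability(p)      # ln(3), a bit above 1
round_trip = probability_from_logit(logit)
print(round(round_trip, 6))  # 0.75
```

A logit of 0 corresponds to a 50/50 game, positive logits to better-than-even chances, and negative logits to worse-than-even chances.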

Note: This does not mean you should completely ignore the continuous aspect of winning/losing. Logistic regression and multiple regression both give us important pieces of information.

Multiple Regression IV's of Interest

All Rockets games were included in this SPSS analysis.

The variables of interest are:
1) Two-point Field Goal Percentage
2) Three-point Field Goal Percentage
3) Free Throw Percentage
4) Offensive Rebounds
5) Steals
6) Turnovers
7) Personal Fouls
8) Opponent Offensive Rebounds
9) Opponent Two-point Field Goal Percentage
10) Opponent Three-point Field Goal Percentage
11) Opponent Free Throw Percentage
12) Opponent Personal Fouls

(If you're wondering why assists and blocks were not included, this will be touched upon shortly)

My reasoning behind choosing these variables:

I think many of these are obvious, so I will go over the ones you may be wondering about. To begin, why did I choose not to include defensive rebounds? Defensive rebounding IS significant. We've seen quite a few times where the Rockets just kept giving the opponents golden opportunities to catch up by slacking in this department. However, IMO, how well our Rockets were rebounding defensively was best reflected by the number of opponent offensive rebounds (a lower OppOREB suggests that the Rockets were rebounding well defensively). Initial analysis showed that it was difficult to even see an effect of defensive rebounding (the number of defensive rebounds simply does not do a good job of predicting point differential), but the effect of opponent offensive rebounding was quite substantial. Therefore, I decided to leave opponent offensive rebounding in and take defensive rebounding out. What about opponent steals and turnovers? I felt that the Rockets' steals and turnovers can, to a degree, give you the same information (the more steals, the more opponent turnovers, and the more turnovers, the more opponent steals). Of course I know steals and TO's do not give you the full picture of what is going on on the other side, but including opponent steals and opponent TO's would have messed up the analysis (see the collinearity paragraph later in this section).

Why did I split up shooting into two point, three point, and free throws? For sure, comparing overall shooting (e.g., team TS% and opponent TS%) created a model that better predicted the actual outcomes, but such an analysis would provide much less useful information. We already know that if the Rockets shoot better then they have a higher chance of winning. It's obvious. Splitting the shooting up, however, can help us identify how much effect different kinds of shooting have on winning/losing. Additionally, splitting the opponents' shooting from their free throws can help us identify areas of significance where the Rockets can actually MAKE A DIFFERENCE. Knowing, for example, that opponents' two-point and three-point shooting percentages make a huge difference means that our Rockets need to play better defense. However, knowing that the opponents' free throw percentage also makes a huge difference is not really consequential here, because the Rockets cannot really control how opponents shoot at the free throw line. If opponent TS% was used for this analysis, you would have an all-encompassing IV that you cannot separate to get the above pieces of information.

Why not just include TS%, FG% (twos and threes combined) in addition to the shooting variables listed earlier to get a good picture of everything? This relates to a huge confound in multiple regression that has to do with collinearity. In a nutshell, if you have explanatory IV's in your equation that are closely related to each other, it really screws up the model (and TS% or FG% is obviously related to two-pointFG%, three-pointFG%, and free-throw%). If you were wondering why assists, blocks, opponent TO's, and opponent steals were not included, that also has to do with concerns about collinearity (assists and FG% are highly related, blocks and opponent FG% are also highly related, opponent TO's and Rockets' steals are highly related, and opponent steals and Rockets' TO's are highly related).

What about offensive rebounds? Surely that should be related to twos and threes FG% (poorer shooting by the rockets/opponent would mean higher chances for getting offensive rebounds for each respective team)? That is true, but collinearity diagnostics suggest that the degree to which offensive rebounds are related to other IV's such as shooting related variables is not high enough to be too significant a problem. You will see more on this in the opponent offensive rebound section.

Multiple Regression Results

Using the 12 variables above, the multiple regression model is:

Point_Differential = 0.849*TwosFGP + 0.643*ThreesFGP + 0.191*FTP + 0.524*OREB + 0.837*STL - 1.01*TOV - 0.278*PF - 0.643*OppOREB - 1.257*OppTwosFGP - 0.456*OppThreesFGP - 0.19*OppFTP + 0.138*OppPF + 25.974

(All FGP and FTP's are percentages, so you would plug in 40 for 40%, rather than .40)

By this model, given how our Rockets played against the Wizards recently, the point differential should have been 0.849*57.14 + 0.643*45.45 + 0.191*74.47 + 0.524*13 + 0.837*4 - 1.01*24 - 0.278*19 - 0.643*12 - 1.257*43.08 - 0.456*50 - 0.19*50 + 0.138*32 + 25.974 = 8.82 !!
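The arithmetic above is easy to reproduce. Here is a short Python sketch that plugs the Wizards box score into the model. The coefficients and game values are copied straight from the post; the dictionary layout and the `predict_margin` name are just one hypothetical way to organize it.

```python
# Coefficients of the multiple regression model quoted above
COEF = {
    "TwosFGP": 0.849, "ThreesFGP": 0.643, "FTP": 0.191, "OREB": 0.524,
    "STL": 0.837, "TOV": -1.01, "PF": -0.278, "OppOREB": -0.643,
    "OppTwosFGP": -1.257, "OppThreesFGP": -0.456, "OppFTP": -0.19,
    "OppPF": 0.138,
}
INTERCEPT = 25.974

# Wizards game values as plugged in above (percentages as 0-100 numbers)
wizards = {
    "TwosFGP": 57.14, "ThreesFGP": 45.45, "FTP": 74.47, "OREB": 13,
    "STL": 4, "TOV": 24, "PF": 19, "OppOREB": 12,
    "OppTwosFGP": 43.08, "OppThreesFGP": 50, "OppFTP": 50, "OppPF": 32,
}

def predict_margin(game):
    """Predicted point differential: intercept plus sum of coefficient * value."""
    return INTERCEPT + sum(COEF[name] * game[name] for name in COEF)

predicted = predict_margin(wizards)
print(round(predicted, 2))  # 8.82
```

Swapping in another game's box score gives that game's predicted margin, as long as the percentages are entered as 40 rather than .40.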

What does this mean? Given how well our Rockets played, despite the turnovers and poor defense (mainly on Ariza), we should have actually won that game by 8 or 9 points rather than one. The way I personally like to interpret this is that, rather than treating that game as evidence that our Rockets are not that good, we should regard it as a fluke where the Rockets should have won by more, but the Wizards simply got lucky and brought the statistically predicted differential of 9 points down to one.

This model is decent, and for the most part accurately predicts wins vs losses (I think there were 4 games where the predicted outcome was wrong, but two of those were very, very close games that could have gone either way). Point differential wise, the model's prediction was within 4 points (+/- 4) of the actual outcome 70% of the time. Games like the one vs Washington touched upon earlier, where the actual point differential was more than 7 points different from the prediction, may be perceived as atypical (or the alternative explanation is that this model is flawed, but I like to be optimistic).

Other games that were predicted to be significantly different from the outcome:

Jan 28 HOU vs SAS was predicted to be a loss by 1 point instead of actual win by 7.
Jan 24 HOU vs MEM was predicted to be a much worse loss by 12 points rather than 1.
Dec 12 HOU @ POR was predicted to be a win by one point instead of the actual loss by 8.
Nov 16 HOU vs DEN was predicted to be a close win by 3 points instead of 11.

There was also that terrible loss in Indiana which the model predicted to be a 25 point blowout instead of a 33 point blowout... but I don't think anyone cares.
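The "which games were atypical" idea boils down to looking at residuals (actual margin minus predicted margin). Below is a rough Python sketch using only the games quoted above, with losses written as negative margins. The 7-point cutoff is the one mentioned in the post; the short team labels and variable names are just illustrative.

```python
THRESHOLD = 7  # points; misses bigger than this are treated as atypical

# (opponent, predicted margin, actual margin) as quoted in the post
games = [
    ("WAS", 8.82, 1),
    ("SAS", -1, 7),
    ("MEM", -12, -1),
    ("POR", 1, -8),
    ("DEN", 3, 11),
    ("IND", -25, -33),
]

# Positive residual: the Rockets did better than the model predicted
residuals = [(name, actual - predicted) for name, predicted, actual in games]

# Keep only the big misses, largest first
outliers = sorted(
    (r for r in residuals if abs(r[1]) > THRESHOLD),
    key=lambda r: abs(r[1]),
    reverse=True,
)

for name, resid in outliers:
    print(f"{name}: actual minus predicted = {resid:+.2f}")
```

All six games clear the 7-point cutoff, with the Memphis game being the biggest miss (the Rockets did 11 points better than predicted).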

So I've noticed that just like with my previous post on Lin's consistency, this post has gotten very long, so I've decided against posting the remaining list of predictions vis-a-vis the actual outcome, but if anyone wants to see them, let me know!

Comparing the Effects of IV's

Here we get to the main point of the post. So which variables seem more important than others in terms of predicting point differential? And how does one determine this information? One's initial guess would be to look at the B column (these are the coefficients that represent the slopes, analogous to the “m” in y=mx+b), and come to the conclusion that the largest coefficient represents the greatest effect. In a way, that's kind of correct. For instance, an increase in the opponent two point field goal percentage by 5 percentage points would drop the point differential by around 6 points (5*-1.257), while an increase in opponent free throw percentage by 5 percentage points would only drop the point differential by about a point (5*-0.19).

However, looking at the B column to determine IV effect is flawed because it does not take into account the IV's scale. For example, if I had represented the field goal percentage IV's as decimals instead of percentages (e.g., 0.48 instead of 48%), then the corresponding coefficients would have been much larger. Additionally, even if the IV's were on the same scale, a one unit change in one IV does not mean the same thing as a one unit change in another IV (for example, raising FGP by one percentage point is not as big a change for that IV as adding one turnover is for turnovers). The standardized coefficients column is a much better representation, because (according to my understanding) it tells you how much point differential changes, in standard deviations, when an IV increases by one standard deviation, rather than by one unit of whatever scale the IV uses.
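Here is a small Python sketch of why the standardized coefficient does not care about the IV's scale. It uses the textbook relation beta = B * SD(x) / SD(y). The 0.643 is the ThreesFGP coefficient from the model above, but the two standard deviations are ASSUMED round numbers for illustration only, not values taken from the SPSS output.

```python
def standardized_beta(b, sd_x, sd_y):
    """Standardized coefficient from a raw coefficient and the two SDs."""
    return b * sd_x / sd_y

b_percent = 0.643      # ThreesFGP coefficient, points per percentage point
sd_x_percent = 10.0    # ASSUMED SD of three point %, in percentage points
sd_y = 12.4            # ASSUMED SD of point differential, in points

beta_percent = standardized_beta(b_percent, sd_x_percent, sd_y)

# Re-express the same IV as a decimal (0.40 instead of 40):
# the raw coefficient becomes 100x larger, the SD becomes 100x smaller.
b_decimal = b_percent * 100
sd_x_decimal = sd_x_percent / 100

beta_decimal = standardized_beta(b_decimal, sd_x_decimal, sd_y)

print(round(b_decimal, 1))                              # 64.3
print(round(beta_percent, 3) == round(beta_decimal, 3)) # True
```

The raw coefficient jumps from 0.643 to 64.3 just from changing units, while the standardized value stays put. That is the reason to rank IV's by the standardized column.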

So what does that column show?

The following is the list of IV's in order of greatness of effect:

OppTwosFGP (-0.665)
ThreesFGP (0.519)
TwosFGP (0.439)
OppThreesFGP (-0.372)
TOV (-0.303)
OppOREB (-0.193)
STL (0.182)
OppFTP (-0.149)
FTP (0.143)
OREB (0.134)
PF (-0.086)
OppPF (0.064)

Defense vs Offense, and Importance of Three Point shooting

What is obvious (and probably most expected) is that the group of 4 shooting percentage IV's outside of free throws has the greatest effect on the margin of victory/defeat.

What is interesting, though, is that opponent two-point shooting has a significantly greater effect than the Rockets' two point or three point shooting. And while opponent three point shooting has less of an effect, as a whole, opponent shooting better predicts the outcome than Rockets' shooting! (I don't know if you're allowed to just add up the coefficients like that, but not to worry, I already created a multiple regression model using field goal percentages as a whole, and the standardized coefficient for opponent field goal percentage was of a greater magnitude than the Rockets' field goal percentage. If anyone wants the data I can show them).

What this suggests is that perhaps defense IS more important than offense. James Harden may be one of the best and most efficient scorers in the NBA, but this model suggests that trying to overcome defensive deficiencies or lapses by simply being more brilliant on the offensive end may not actually be a viable strategy in the long run. This may also be a possible explanation for why McHale has not used Brooks extensively this season. He is a dynamic scorer but his defensive problems may not be worth it. But then as a counter-argument to that, Brooks is also a great 3 point shooter, which leads us to the second most influential factor...

Three point shooting for the Rockets appears to be a better predictor of the outcome than two point shooting! We all know how the Rockets' 3 point shooting has not been brilliant this season. Poor 3 point shooting causes a huge problem because the Rockets' overall strategy is to rely on basically that and scoring in the paint. If the 3's don't fall to spread the floor, the defenders have a much, much easier time guarding Howard by doubling him, or making sure players like Lin, who makes a living off driving into the paint, have a difficult time scoring.

It's thus no surprise that the majority of games where the Rockets shot above 40% from three were wins (and the Rockets did not have a single loss when they shot above 50%). The last game vs the Suns, where the Rockets shot almost 70% from 3 point range, unsurprisingly resulted in a blowout. This is one reason why Beverley's 3 point shooting is so highly valued, why the bench's problems in this area (e.g., Francisco Garcia) are really hurting this team, and why we need Harden's 3 point shot to come back. And even though Lin's 3 point shooting is decent now, if he could just raise that percentage to the high 30's like earlier in the season, along with everyone else getting their 3 point shot back, this Rockets team could be unstoppable.

Turnovers Are Bad

Outside of non-free throw shooting, turnovers are the most significant factor. What this hopefully shows is that, yes, turnovers are bad. They are not some necessary evil that somehow allows the Rockets to play better. Turnovers are so bad, in fact, that an increase of two standard deviations in TO's (that would be 7 or 8 more TO's in a game) has almost the same negative effect on the outcome as a one standard deviation increase in the opponent's two point field goal percentage (an increase of about 7 percentage points), and would be more damaging than a one standard deviation decrease in the Rockets' three point shooting (that's a whole 10 percentage points). In very simplified terms (and correct me if I'm wrong, stats experts!), it's like saying: committing 7 or 8 more turnovers negatively affects the outcome at least as much as lowering the Rockets' 3 point shooting from a respectable 40% to a dismal 30%!
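The comparison above can be double-checked with the numbers already posted. This Python sketch compares two standard deviations of turnovers against one standard deviation of shooting using the standardized coefficients, then cross-checks in plain points using the raw coefficients from the model equation. The variable names are just illustrative.

```python
# Standardized coefficients from the list above
BETA = {"TOV": -0.303, "OppTwosFGP": -0.665, "ThreesFGP": 0.519}
# Raw (unstandardized) coefficients from the model equation
B = {"TOV": -1.01, "OppTwosFGP": -1.257, "ThreesFGP": 0.643}

# Effect sizes in standard deviations of point differential
two_sd_turnovers = 2 * abs(BETA["TOV"])
one_sd_opp_twos = abs(BETA["OppTwosFGP"])
one_sd_threes = abs(BETA["ThreesFGP"])

print(round(two_sd_turnovers, 3))          # 0.606
print(two_sd_turnovers > one_sd_threes)    # True
print(two_sd_turnovers < one_sd_opp_twos)  # True

# Same comparison in points: 7 extra turnovers vs. threes dropping 40% -> 30%
points_from_7_turnovers = 7 * abs(B["TOV"])
points_from_10pct_threes = 10 * abs(B["ThreesFGP"])
print(points_from_7_turnovers > points_from_10pct_threes)  # True
```

So by either yardstick, 7 or 8 extra turnovers cost a bit more than a 10 percentage point drop in three point shooting, and a bit less than a 7 percentage point rise in opponent two point shooting.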

(Random rant from a Lin fan...)

This is why I really really REALLY want Lin to improve his decision making in terms of passing. I know that playing aggressively is worth the increase in TO's, but there are some of his TO's that are COMPLETELY preventable and unrelated to aggressive attempts at play-making. Just take that game vs Washington, Lin was playing excellent, but in the 4th he would make these completely inexplicable passes that had no chance of making it to any of his teammates. These weren't from him driving and trying to make things happen.. these were passes from where he just picked up his dribble without any pressure from defenders and then for some reason made completely unexplainable bad passes. The good news is that these are completely fixable problems, and Lin has already slightly improved from last year in terms of TO's per 36. IMO, given his improvement in shooting and defense (to at least average), TO's are really the last significant obstacle he needs to overcome. Once he limits those turnovers he could have an even more positive effect on the team.

(And yes, Lin's not the only one making TO's, everyone needs to improve in that aspect!)

Opponent Offensive Rebounds and Steals

Outside of non-free throw shooting variables, opponent offensive rebounds were the second most influential factor in terms of predicting outcome. It's no surprise that giving up offensive rebounds is harmful, but this model gives us an idea of just how bad it is. Every offensive rebound given up is about 2/3 as bad as committing a turnover. The SPSS output also has a column labeled VIF (variance inflation factor). Remember the collinearity concept that I mentioned earlier? This column displays numbers that give you the degree of “relatedness” that one IV has with the other IV's. Generally, VIF's greater than 5 are problematic (OppOREB is well below that, which is why I decided to keep it in this analysis). However, since the VIF for OppOREB (at 2.008) is still large compared to the other variables, the estimate for opponent offensive rebounds is less precise than the others (a higher VIF inflates the variance of a coefficient's estimate, so the actual standardized effect may differ somewhat from -0.193 in either direction). Regardless, the negative effect of opponent offensive rebounds is quite significant.
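One way to read a VIF, sketched in Python below: for a given IV, VIF = 1/(1 - R²), where R² measures how well that IV is predicted by all the other IV's in the model. Inverting the formula turns the 2.008 and the rule-of-thumb cutoff of 5 mentioned above into R² values. This is the standard definition, not anything specific to this data set, and the function names are illustrative.

```python
def vif_from_r2(r2):
    """Variance inflation factor for an IV whose R^2 against the other IV's is r2."""
    return 1 / (1 - r2)

def r2_from_vif(vif):
    """Invert the VIF formula: share of an IV's variance explained by the other IV's."""
    return 1 - 1 / vif

opp_oreb_r2 = r2_from_vif(2.008)  # VIF reported for OppOREB
cutoff_r2 = r2_from_vif(5)        # rule-of-thumb "problematic" level

print(round(opp_oreb_r2, 3))  # 0.502
print(round(cutoff_r2, 3))    # 0.8
```

In other words, about half of the game-to-game variation in opponent offensive rebounds can be predicted from the other eleven IV's, versus 80% at the usual trouble threshold.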

Steals also have a positive effect, but not nearly enough to counter the negative effect of turnovers. This can partially be explained by the fact that Harden (and a few other Rockets) sometimes gamble for steals, which does not really have a very positive effect on the game's outcome. Playing solid defense and lowering opponents' field goal percentage is a much greater contributing factor to the outcome (and this model is good evidence for this). This is why, for example, despite Lin's decrease in steals compared to last year, I actually consider him to be playing better overall defense this year. I'd prefer that he maintain his improved D and not gamble for steals. This is not to downplay steals that are the result of good defense of course (e.g., many of Beverley's steals). But this model simply suggests that overall, the positive effect of steals is not that high, or that steals are not very predictive of the outcome.

The Rest

Opponent free throw shooting, Rockets' free throw shooting, Rockets' offensive rebounding, and personal fouls from either side have the smallest effects. Personal fouls not only have a very small effect, but their p-values are also high (p = 0.182 and 0.342 for Rockets and opponent fouls respectively), i.e., not statistically significant at the usual 0.05 level. This means their estimated effect on the outcome cannot be reliably distinguished from zero.

What is interesting is that the effect of Rockets' free throw shooting percentage on point differential is as low as it is. As suggested earlier, this may simply be because free throw shooting does not contribute as many points compared to the other factors, but as you will see in the logistic regression analysis, this is probably not a good explanation.

Logistic Regression Analysis and Results

Using the 12 variables above, the logistic regression model is:

Win/Loss_Logit = 5.057*TwosFGP + 3.777*ThreesFGP - 0.057*FTP + 0.32*OREB + 1.538*STL - 5.888*TOV - 1.122*PF - 2.173*OppOREB - 5.459*OppTwosFGP - 3.926*OppThreesFGP - 2.4*OppFTP + 0.818*OppPF + 309.961

Similar to the multiple regression model, you simply plug in the relevant numbers, but the outcome is a win/loss logit rather than a point differential. To convert the logit into a probability (of being a win), use the equation Winchance = 1/(1+e^(-Win/Loss_Logit)). SPSS simulation using this model was able to predict with 100% accuracy whether the Rockets won or lost given the relevant information, but I would not really use this as a good predictive model. It's hard to really explain the reasons, so I will just leave it at that for now (if anyone wants clarification I can try to explain).
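As a rough illustration, here is a Python sketch that plugs the Wizards game from the multiple regression section into the logistic model. It uses the standard logistic conversion 1/(1+e^(-logit)), which assumes a positive logit means a win (consistent with the positive coefficients on the Rockets' own shooting). Coefficients and box score values are copied from the post; the names are hypothetical.

```python
import math

# Logistic regression coefficients quoted above
LOGIT_COEF = {
    "TwosFGP": 5.057, "ThreesFGP": 3.777, "FTP": -0.057, "OREB": 0.32,
    "STL": 1.538, "TOV": -5.888, "PF": -1.122, "OppOREB": -2.173,
    "OppTwosFGP": -5.459, "OppThreesFGP": -3.926, "OppFTP": -2.4,
    "OppPF": 0.818,
}
LOGIT_INTERCEPT = 309.961

# Wizards game values, same as in the point differential example
wizards = {
    "TwosFGP": 57.14, "ThreesFGP": 45.45, "FTP": 74.47, "OREB": 13,
    "STL": 4, "TOV": 24, "PF": 19, "OppOREB": 12,
    "OppTwosFGP": 43.08, "OppThreesFGP": 50, "OppFTP": 50, "OppPF": 32,
}

def win_logit(game):
    """Linear part of the model: intercept plus sum of coefficient * value."""
    return LOGIT_INTERCEPT + sum(LOGIT_COEF[k] * game[k] for k in LOGIT_COEF)

def win_probability(game):
    """Standard logistic transform of the logit (assumes positive = win)."""
    return 1 / (1 + math.exp(-win_logit(game)))

logit = win_logit(wizards)
prob = win_probability(wizards)
print(logit > 0, prob > 0.99)  # True True
```

The logit comes out strongly positive, so the model calls this game a near-certain win, which matches the actual (one point) win. Probabilities this close to 0 or 1 are what you would expect from a model that classifies every game in the sample correctly.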

Logistic Regression and IV's Effects

The purpose of this model is more to show the significance of the effect of IV's on WIN vs LOSS, rather than on point differential. As explained earlier, this would increase the effect of IV's that may have smaller contributions to victory margins, but still have a solid contribution to the outcome regardless of how close/blowout a win or loss was. Because SPSS does not standardize the coefficients like it does for multiple regression, I manually standardized the IV's to produce the values below.

And what does it show?
The following is the list of IV's in order of greatness of effect (on Win/Loss group membership rather than point differential!):

OppThreesFGP (-41.498)
ThreesFGP (39.493)
OppTwosFGP (-37.451)
TwosFGP (33.885)
OppFTP (-24.382)
TOV (-22.934)
OppOREB (-8.465)
OppPF (4.878)
PF (-4.49)
STL (4.342)
OREB (1.062)
FTP (-0.559)
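A quick sanity check on the manual standardization, sketched in Python. ASSUMING each standardized value is simply the raw logistic coefficient multiplied by that IV's standard deviation (the usual approach; the post does not spell out the exact method), dividing one by the other recovers the implied standard deviation of each IV. The numbers are copied from the two lists in the post.

```python
# Raw logistic coefficients from the model equation
RAW = {
    "TwosFGP": 5.057, "ThreesFGP": 3.777, "FTP": -0.057, "OREB": 0.32,
    "STL": 1.538, "TOV": -5.888, "PF": -1.122, "OppOREB": -2.173,
    "OppTwosFGP": -5.459, "OppThreesFGP": -3.926, "OppFTP": -2.4,
    "OppPF": 0.818,
}
# Manually standardized values from the list above
STANDARDIZED = {
    "TwosFGP": 33.885, "ThreesFGP": 39.493, "FTP": -0.559, "OREB": 1.062,
    "STL": 4.342, "TOV": -22.934, "PF": -4.49, "OppOREB": -8.465,
    "OppTwosFGP": -37.451, "OppThreesFGP": -41.498, "OppFTP": -24.382,
    "OppPF": 4.878,
}

# Under the stated assumption: standardized = raw * SD, so SD = standardized / raw
implied_sd = {name: STANDARDIZED[name] / RAW[name] for name in RAW}

for name, sd in sorted(implied_sd.items()):
    print(f"{name}: implied SD of about {sd:.1f}")
```

The implied values look plausible: turnovers come out at roughly 3.9 per game (so two SD's is 7 or 8 turnovers) and the Rockets' three point percentage at roughly 10 percentage points, in line with the figures used in the turnover discussion earlier.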

Unlike in the multiple regression results, here the primary factor in predicting wins vs losses is opponent three point shooting! Additionally, opponent non-free throw shooting now accounts for the 1st and 3rd most important variables in determining wins/losses, instead of 1st and 4th in the multiple regression analysis. This further emphasizes the importance of DEFENSE over OFFENSE. Rockets' three point shooting, on the other hand, is still in second place in terms of effect on the outcome, consistent with the multiple regression model, which again re-emphasizes 3 point shooting as the 2nd most significant thing that the Rockets need to improve on.

Curiously, turnovers, while still incredibly significant in predicting wins/losses, did not increase in effect compared to the multiple regression model (what I mean by that is comparing turnover's effect to other IV's, such as that of twos or threes field goal percentage, does not suggest a greater relative effect in this model than the same comparison would suggest in the multiple regression model). My interpretation of this is that turnovers are just as good at predicting wins/losses as predicting point differential. In any case, turnovers are obviously still problematic, and outside of shooting related variables, it is the most significant factor in determining the outcome of a game.

Finally, opponent offensive rebounding, just like in the multiple regression model, is the 2nd most important non-shooting related variable in terms of predicting the game's outcome.
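To compare the two models side by side, here is a small Python sketch that ranks the IV's by the absolute size of their standardized coefficients in each model. All values are copied from the two lists in the post; `rank_by_magnitude` is just an illustrative helper name.

```python
# Standardized coefficients, multiple regression (point differential)
LINEAR = {
    "OppTwosFGP": -0.665, "ThreesFGP": 0.519, "TwosFGP": 0.439,
    "OppThreesFGP": -0.372, "TOV": -0.303, "OppOREB": -0.193,
    "STL": 0.182, "OppFTP": -0.149, "FTP": 0.143, "OREB": 0.134,
    "PF": -0.086, "OppPF": 0.064,
}
# Manually standardized coefficients, logistic regression (win vs loss)
LOGISTIC = {
    "OppThreesFGP": -41.498, "ThreesFGP": 39.493, "OppTwosFGP": -37.451,
    "TwosFGP": 33.885, "OppFTP": -24.382, "TOV": -22.934,
    "OppOREB": -8.465, "OppPF": 4.878, "STL": 4.342, "PF": -4.49,
    "OREB": 1.062, "FTP": -0.559,
}

def rank_by_magnitude(coefs):
    """Map each IV to its rank (1 = largest absolute coefficient)."""
    ordered = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

linear_rank = rank_by_magnitude(LINEAR)
logistic_rank = rank_by_magnitude(LOGISTIC)

# Print the IV's in point differential order, with both ranks
for name in sorted(LINEAR, key=linear_rank.get):
    print(f"{name:13s} point diff #{linear_rank[name]:<2d} win/loss #{logistic_rank[name]}")
```

The biggest movers are opponent three point shooting (4th to 1st) and opponent free throw shooting (8th to 5th), while turnovers slip from 5th to 6th.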

Free Throw Percentages in the Logistic Regression Model

For many of the other variables, the magnitude of the effects is similar to that in the multiple regression, with two major exceptions: opponent and Rockets' free throw shooting. Standardized opponent free throw shooting percentage has a huge effect of -24.382, slightly beating out turnovers in terms of importance! However, given that the Rockets cannot really control how well the opponent shoots free throws, this does not really point to anything that the team needs to change or improve on (except, maybe, try not to let them go to the line so much... but then the low significance of PF's in this model seems to contradict that statement).

Lastly, Rockets' free throw shooting percentage has almost zero influence on the win/loss outcome. In fact, it is slightly negative! How is this possible? I am honestly not very certain, but my best guess is that because earlier this season the Rockets had many poor FT shooting games (from Hack-a-Howard) but were often good enough offensively to overcome it and win, the logistic regression analysis interpreted low FT shooting as a small predictor of a win. The fact that there have not been enough games where the Rockets shot a high FT percentage probably also skews this model (and the two games where the Rockets shot insanely high, almost 90%, on FT's were split evenly, with the recent game vs the Timberwolves being a win and the game vs the Suns on December 4th being a loss). I have no doubt that as Dwight Howard continues to improve his FT shooting and the sample size of high FT shooting games increases, logistic regression analyses will begin to show FT percentage as having not only a positive effect, but a significantly large positive effect on the game outcome. Of course, we'll just have to see.

Some Caveats

Just a reminder that statistical models are not perfect. As demonstrated by the strange result from the logistic regression model, which suggests FT percentage has a zero to slightly negative effect on the outcome of a game (which is absurd!), these models and coefficients must be placed in the proper context to be understood. The conclusions that I drew from these results are my interpretations, but it is very possible that I placed them in the wrong context or misunderstood them. That is why I made the data available to everyone so that everyone can come to their own conclusions. Additionally, my IV selection is probably not perfect either. I understand not everyone has access to SPSS (or has the time to learn how to use it), so if anyone has a strong case for analyzing a different group of IV's that may better explain what factors are good/bad for the Rockets, I am all ears!

Lastly, remember that some IV's like assists were not included in this analysis. Thus, when I make conclusions about what is significant or most important, or 2nd most important, etc., that is only with regards to the variables that were examined here, not a general statement about some variable's importance out of everything. Also, not being able to include assists in this analysis is really, really unfortunate. I really want to know how much they help the team. If any stats experts here (stats, new, torocan, cn0gd, and many others) have any suggestions on how to include that variable in a way that does not produce collinearity problems, PLEASE speak up!

Conclusion

Thanks for your patience in reading this longgg post! (at least for those of you who didn't skip the entire thing just to read this section )

Multiple and Logistic Regression analyses suggest that for the Rockets:

-Opponent shooting percentages are the most significant predictor of game outcome

-(Opponent 2 point shooting is most significant in predicting point differential, opponent 3 point shooting is most significant in predicting wins vs losses)

-Three point shooting is the second most significant predictor of game outcome

-Two point shooting is the 3rd or 4th most significant predictor of game outcome

-Turnovers are the 5th or 6th most significant predictor of game outcome

-Opponent offensive rebounds are the 6th or 7th most significant predictor of game outcome.

-Steals follow opponent offensive rebounds in significance in predicting point differential, but are less significant in terms of predicting wins vs losses. Overall, steals are of middling importance.

-Rockets' offensive rebounds have a small positive effect, but do not seem to be a significant factor.

-Personal fouls do not seem to be a factor

-Free throw shooting for the Rockets or their opponents does not seem to be a significant factor, however:

-Opponent free throw shooting, while not a significant predictor of point differential, is incredibly significant as a predictor of wins vs losses (more significant than turnovers even)

-The significance of the Rockets' free throw shooting is probably distorted in both the multiple and logistic regression models, because the team has won so many games with low FT percentages and there are not enough games with high FT percentages to provide a large enough sample

-The above should change as Dwight Howard continues to improve on his FT's (fingers crossed!)

The above relationships suggest that for the Rockets:

-Number 1 priority should be for the team to play better defense, as relying on offensive brilliance to offset poor defense is likely not a viable long term strategy

-Number 2 priority is for the team to figure out how to start getting their 3's to consistently fall

-Number 3 priority is to improve 2 point shooting, although since this is usually not as much of a problem (and the fact that an improvement on shooting 3's may also have a positive effect on shooting 2's) maybe the Rockets team ought to focus on the other areas while simply maintaining the status quo in this area.

-Number 4 priority (although really it should be number 3) is to cut down on turnovers

-Number 5 priority is to decrease the number of offensive rebounds given up to the opposing team

Note: if we begin seeing more games where wins are correlated with high FT percentages, then an updated model may suggest that maintaining a good FT shooting percentage should be a top priority as well

Caveats:

-Assists and blocks were not included in this analysis

-If someone can figure out a way to include assists without messing up the model, we may very well see “moving the ball better” as another top priority as well.

#1 hollywoodMarine, Feb 16, 2014

15 people like this.

Carl Herrera Contributing Member

Joined:

Feb 16, 2007

Messages:

45,153

Likes Received:

21,570

With 3 pt shooting and FTs, the number of attempts per possession is likely more important than what % the teams shoot.

Also, OReb%, which adjusts for the number of rebounds available, is a more relevant measurement than # of ORebs.

#2 Carl Herrera, Feb 16, 2014
Codman Contributing Member

Joined:

Jun 24, 2001

Messages:

6,765

Likes Received:

11,710

Oh my damn. Props for the effort. I will need to go back and read all of this.

Where do you find the time?

#3 Codman, Feb 16, 2014
iconoclastic Member

Joined:

Oct 10, 2007

Messages:

6,100

Likes Received:

422

Basketball is a continuous game, so a team's offense affects its defense and vice versa, so it's almost pointless to try to separate stats of different parts of the game out, except free throw shooting or something like that which is separate from the rest of the game. Also, you may want to run these analyses with other teams' season data or Rockets data from other seasons to give yourself a baseline of significant predictors, rather than just significant predictors for the Houston Rockets for this season.

#4 iconoclastic, Feb 16, 2014
hollywoodMarine Member

Joined:

Jan 15, 2014

Messages:

246

Likes Received:

32

Carl Herrera said: ↑

With 3 pt shooting and FTs, the number of attempts per possession is likely more important than what % the teams shoot.

Also, OReb%, which adjusts for the number of rebounds available, is a more relevant measurement than # of ORebs.

Thanks for your input. But how would I separate 3 point shooting vs 2 point shooting per possession? Is there a way I can get the stats for that?

Good point on the OReb%. I ran the analysis using that measurement instead, and if we assume it is indeed the better measure, then opponent offensive rebounding drops in significance (below steals!). This means opponent OREB is no longer the 6th or 7th most important factor in predicting game outcome!
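
For anyone who wants to reproduce this, here is a minimal sketch of how OReb% is commonly computed: a team's offensive rebounds divided by the rebounds available at that end (its own offensive rebounds plus the other team's defensive rebounds). The function name and the box score numbers are made up for illustration; they are not from an actual game.

```python
def oreb_pct(team_oreb, opp_dreb):
    """Share of available offensive rebounds a team grabbed, as a percentage."""
    available = team_oreb + opp_dreb
    if available == 0:
        return 0.0  # no missed shots were rebounded at this end
    return 100.0 * team_oreb / available

# Hypothetical box score: the opponent grabs 12 offensive boards,
# and the Rockets grab 36 defensive boards at the same end.
opp_oreb_pct = oreb_pct(12, 36)
print(round(opp_oreb_pct, 1))  # 25.0
```

Unlike the raw count, this does not go up just because the opponent missed more shots.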

Codman said: ↑

Oh my damn. Props for the effort. I will need to go back and read all of this.

Where do you find the time?

Thanks! After my last post on Lin's scoring consistency, I breezed through the examination section on descriptive statistics on my midterm. So I treat these as good review sessions for my stats exams (which is why I make myself summarize what each statistical analysis means and how they are properly used). Also, I am a nerd.

iconoclastic said: ↑

Basketball is a continuous game, so a team's offense affects its defense and vice versa, so it's almost pointless to try to separate stats of different parts of the game out

But that is why there is a VIF column in the regression results table: it gives you an idea of how much one variable may be affecting or affected by another. I kept all of those variables in because the VIF's did not suggest high amounts of collinearity.

I am quite sure people have done logistic and multiple regression analyses in a similar way (with the IV's separated like this). Of course, it is true that even seemingly unrelated IV's can sometimes affect one another even if the collinearity diagnostics don't say so, which is why I listed in the caveats that these results are not 100% proof of anything and must be interpreted in the proper context. For example, opponent offensive rebounding may be shown to have a significant effect, but since we know that opponent FG% also affects that IV, the oppOREB effect is probably inflated (and in fact a re-test using the measures Carl Herrera suggested now suggests that opponent OREB may be insignificant overall).

iconoclastic said: ↑

Also, you may want to run these analyses with other teams' season data or Rockets data from other seasons to give yourself a baseline of significant predictors, rather than just significant predictors for the Houston Rockets for this season.

Now that is a good plan. But this process can be time-consuming, so I can't do one for like 10 different teams. Which team do you suggest would be a good "baseline" to compare this team to?

#5 hollywoodMarine, Feb 16, 2014
The Jabberwock Member

Joined:

Jul 8, 2013

Messages:

331

Likes Received:

26

hollywoodMarine said: ↑

Hey guys. So we're going over regression in my stats class now, and I thought, what a great way to get some prac app practice here by forming a couple models based on rockets games so far! Since the first time I posted a similar analysis type thread people (rightfully) were requesting a TL;DR section, I'll post a short conclusion at the end with the main points in bullet point format If you are pressed for time but still want to read more than a couple bullet points, I've underlined the significant portions for you to skim through.

Great post, that needs (and deserves) some practical follow-up,
so as not to be too academic.

(Perhaps some future predictions before games, based on Rockets' & opponents' stat sheets?
Or maybe some player/lineup analysis using the measures that were found to be most predictive?)

As a stat-minded guy - I loved this.

#6 The Jabberwock, Feb 16, 2014
Voice of Aus Contributing Member

Joined:

Jun 28, 2013

Messages:

5,157

Likes Received:

410

Like codman, I'll get around to reading it and letting you know my thoughts

#7 Voice of Aus, Feb 16, 2014

do work son Member

Joined:

Oct 24, 2013

Messages:

342

Likes Received:

22

"This post aims to provide some insight on how much do different variables (different parts of the box score) contribute to or predict wins/losses for the Rockets."

In my non scientific research, I've found that the rockets win when their points scored is greater than their points allowed, and lose when the opposite is true. I'll post a thesis on it later.

#8 do work son, Feb 16, 2014
do work son Member

Joined:

Oct 24, 2013

Messages:

342

Likes Received:

22

Sarcasm aside, awesome post.

#9 do work son, Feb 16, 2014
hollywoodMarine Member

Joined:

Jan 15, 2014

Messages:

246

Likes Received:

32

hollywoodMarine said: ↑

If the rockets never made a single turnover at all this season and produced a margin of victory of 25 points most of the time, then “b” here would be 25. Each additional turnover would lower 25 by the amount of “m” amount.

For those of you who are puzzled by my explanation of the b intercept or constant, here is a quick clarification, because the above makes absolutely zero sense lol. If the regression model is y=mx + b, where y is margin of victory, x is turnovers, and m is the relationship between x and y (let's say it is -1), then b represents the baseline point differential for the Rockets if they were to have zero turnovers. This model would be derived from a large sample of games, and b is the amount the model says the Rockets would win by in a game with zero turnovers. (The earlier explanation, that this would describe a season where the Rockets never made a single turnover, makes absolutely no sense, because in that case you would not have an "m" slope at all.)

#10 hollywoodMarine, Feb 16, 2014
hollywoodMarine Member

Joined:

Jan 15, 2014

Messages:

246

Likes Received:

32

do work son said: ↑

In my non scientific research, I've found that the rockets win when their points scored is greater than their points allowed, and lose when the opposite is true. I'll post a thesis on it later.

That's a pretty bold theory. I'm going to have to run some tests to see if I can get some statistical evidence in favor of your hypothesis.

#11 hollywoodMarine, Feb 16, 2014
wizkid83 Contributing Member

Joined:

May 20, 2002

Messages:

6,335

Likes Received:

847

Disclaimer: I'm not a statistician, so I don't claim expertise; just trying to learn.

If, for example, the Rockets' offense is extremely consistent (let's just say we never shoot more than 55% and never less than 51%) while our defensive FG% varies greatly between 40% and 60%, then defensive FG% will likely be a much better predictor of our wins/losses, while offensive FG% likely won't even register as significant in the model.

However, that does not necessarily mean it's more important; it would just mean that it's the more inconsistent part of the Rockets' game (or even of just one Rockets game). I think the data is also pretty thin to build a good model, no? And we would probably need to normalize/clean the base data quite a bit before just throwing it into SPSS and seeing what comes out. Any stats PhDs want to jump in on this discussion?

Edit:

Additional thought: wouldn't the opponent FG% just be more a factor of the quality of the opponent's offense rather than anything in the Rockets' control? (And likely highly correlated with the opponent's win-loss record.)

I'd be interested to see what the model would say if we used defensive FG% difference (delta of what the opponent shot vs. their season average) and offensive FG% difference (delta of what we shot vs. our season average) in the models instead, and see which one comes out the better predictor.

#12 wizkid83, Feb 16, 2014
Last edited: Feb 16, 2014
don grahamleone Contributing Member

Joined:

Aug 11, 2001

Messages:

23,376

Likes Received:

33,525

Buddy Love, can you put that in rich dummy terms?

I wish I could understand the symbols and at least try to figure it out on my own.

#13 don grahamleone, Feb 16, 2014

haoafu Contributing Member

Joined:

Jun 29, 2006

Messages:

2,021

Likes Received:

56

Applaud the effort. Aside from small sample size, the strength of schedule and lineup may need to be accounted for.

There's a multicollinearity issue in the model as well, with highly correlated predictor variables.

#14 haoafu, Feb 16, 2014
gene18 Rookie

Joined:

Dec 29, 2012

Messages:

990

Likes Received:

23

hollywoodMarine said: ↑

Hey guys. So we're going over regression in my stats class now, and I thought, what a great way to get some prac app practice here by forming a couple models based on rockets games so far! Since the first time I posted a similar analysis type thread people (rightfully) were requesting a TL;DR section, I'll post a short conclusion at the end with the main points in bullet point format If you are pressed for time but still want to read more than a couple bullet points, I've underlined the significant portions for you to skim through.

As always, I will probably make quite a few mistakes, so constructive criticism is appreciated! I am also open to suggestions for more suitable variables to analyze.

Goals of this Post

This post aims to provide some insight on how much do different variables (different parts of the box score) contribute to or predict wins/losses for the Rockets. Multiple regression and logistic regression models are created to help answer these questions. Additionally, regression models can also identify outlier games, such as games where Rockets “should” have won, but lost (and vice versa).

A Quick Summary of Multiple Regression

Regression analysis is a statistical process for explaining (or trying to explain) relationships among variables. For basketball, it's a fancy way of saying “what would this game look like if I were to put it in the form of y=mx+b,” where “x” would be the thing you want to examine (e.g., number of turnovers in a game), “y” is the outcome of the thing you want to examine (e.g., point differential / margin of victory) and m describes the relationship (so if TO's and margin of victory have a negative relationship, then m would be a negative number). “b” is the intercept/constant, which is what y would be if x were to be zero. If the rockets never made a single turnover at all this season and produced a margin of victory of 25 points most of the time, then “b” here would be 25. Each additional turnover would lower 25 by the amount of “m” amount.

What's great about multiple regression is that you don't have to limit your model to just y=mx+b. You can add a lot more factors, such as rebounds, steals, shooting percentage, etc. So the model would look something like y = m1x1 + m2x2 + m3x3 + … + b, with each “x” (x1, x2, x3, etc.) contributing to y by its own amount.

What exactly does the model do? As said earlier, it explains the relationship between all the independent variables (the “x's”) and the dependent variable (“y”). This relationship can be used to predict “y” given whatever “x's” you put into the model. However, the predictive aspect of the model is not too useful; you can't really know how many turnovers, or rebounds, or whatever the rockets will have next Wednesday to plug into your model and predict if they'll win or not (unless you can see into the future, in which case this predictive aspect would be even more useless). What is useful about this model is that it can suggest which games are anomalies, i.e., which ones had outcomes that are really different from what the model predicts. It can suggest whether a game was closer than it should have been, or whether, given how well the team played, it should have won a game that it lost.

However, the most important thing this model tells us is how much each independent variable (which I'll refer to as IV from now on) contributes to the dependent variable (which I'll refer to as DV from now on) compared to the other IV's of interest. In other words, the model helps us in answering questions like: Do turnovers hurt the team more than poor three point shooting? Is defense really more important than offense? Are there other factors that contribute significantly to winning/losing that maybe many of us have overlooked?

We determine how much effect an IV has on the DV by looking at the coefficients, i.e., the “m” slopes in front of the “x”s. In general, the greater the magnitude of “m” (positive or negative), the greater the effect, although this can be misleading too (more on this later).

A Quick Summary of Logistic Regression

While multiple regression can be a great tool for the reasons listed above, it requires a DV that is continuous or quantitative, in other words, a DV with numerical value that is measured on a continuum. That is why I listed “point differential” or “margin of victory” as the DV, rather than “Win” vs “Loss.” The latter is a categorical DV, which does not fit into a multiple regression analysis.

Why does it matter? Doesn't point differential basically tell us if you won or lost (with the added benefit of providing information on how close/blowout a win/loss was)? Wouldn't multiple regression measuring point differential be sufficient then?

The problem with measuring win/loss on a continuous scale is that it kind of devalues the effects of IV's that don't make huge differences in point differential but can still often mean the difference between a win and a loss. A good example, again, is turnovers. A couple more turnovers may not lower your margin of victory by as much as poor shooting would, but there are times when those turnovers really do mean the difference between winning and losing. Therefore, measuring the DV on a categorical level can help identify those “small difference makers” that can still significantly predict a win/loss (even if they don't make much difference on a continuous scale like point differential). This is where logistic regression comes into play. It has a formula that is similar to multiple regression (logit=a+bX, kinda like y=mx+b) except the “y” is a logit, the natural log of the odds. Don't worry about exactly what that is; just know that a logit can be converted into a probability. So basically logistic regression is similar to multiple regression, but it tells you the odds that the DV belongs to one category (such as Win) vs another (such as Loss) given the IV's.

Note: This does not mean you should completely ignore the continuous aspect of winning/losing. Logistic regression and multiple regression both give us important pieces of information.
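
To make the logit idea a bit more concrete, here is a small sketch of how a logit converts into a probability. The intercept and slope are invented numbers purely for illustration (they are not fitted SPSS values), and the function names are my own.

```python
import math

def logit_to_probability(logit):
    """Convert log-odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def win_probability(a, b, x):
    """One-variable logistic model: logit = a + b*x."""
    return logit_to_probability(a + b * x)

# Hypothetical coefficients: a = 3.0, b = -0.2 per turnover.
for turnovers in (10, 15, 20):
    print(turnovers, round(win_probability(3.0, -0.2, turnovers), 3))
```

A logit of 0 corresponds to even odds (probability 0.5); positive logits favor a win and negative logits favor a loss.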

Multiple Regression IV's of Interest

All Rockets games were included in this SPSS analysis.

The variables of interest are:
1) Two-point Field Goal Percentage
2) Three-point Field Goal Percentage
3) Free Throws Percentage
4) Offensive Rebounds
5) Steals
6) Turnovers
7) Personal Fouls
8) Opponent Offensive Rebounds
9) Opponent Two-point Field Goal Percentage
10) Opponent Three-point Field Goal Percentage
11) Opponent Free Throws Percentage
12) Opponent Personal Fouls

(If you're wondering why assists and blocks were not included, this will be touched upon shortly)

My reasoning behind choosing these variables:

I think many of these are obvious, so I will go over the ones you may be wondering about. To begin, why did I choose not to include defensive rebounds? Defensive rebounding IS significant. We've seen quite a few times where the Rockets just kept giving the opponents golden opportunities to catch up by slacking in this department. However, IMO, how well our rockets were rebounding defensively was best reflected by the number of opponent offensive rebounds (a lower OppOREB suggests that the Rockets were rebounding well defensively). Initial analysis showed that it was difficult to even see an effect of defensive rebounding (the number of defensive rebounds simply does not do a good job of predicting point differential), but the effect of opponent offensive rebounding was quite substantial. Therefore, I decided to leave opponent offensive rebounding in, and take defensive rebounding out. What about opponent steals and turnovers? I felt that the Rockets' steals and turnovers can, to a degree, give you the same information (the more steals, the more opponent turnovers, and the more turnovers, the more opponent steals). Of course I know steals and TO's do not give you the full picture of what is going on on the other side, but including opponent steals and opponent TO's would have messed up the analysis (see underlined paragraph later in this section).

Why did I split up shooting into two point, three point, and free throws? For sure, comparing overall shooting (e.g., team TS% and opponent TS%) created a model that better predicted the actual outcomes, but such an analysis would provide much less useful information. We already know that if the Rockets shoot better then they have a higher chance of winning. It's obvious. Splitting the shooting up, however, can help us identify how much effect different kinds of shooting have on winning/losing. Additionally, splitting the opponents' field goal shooting from their free throws can help us identify areas of significance where the Rockets can actually MAKE A DIFFERENCE. Knowing, for example, that opponents' two-point and three-point shooting percentages make a huge difference means that our Rockets need to play better defense. However, knowing that the opponents' free throw percentage also makes a huge difference is not really consequential here, because the Rockets cannot really control how opponents shoot at the free throw line. If opponent TS% were used for this analysis, you would have an all-encompassing IV that you cannot separate to get the above pieces of information.

Why not just include TS%, FG% (twos and threes combined) in addition to the shooting variables listed earlier to get a good picture of everything? This relates to a huge confound in multiple regression that has to do with collinearity. In a nutshell, if you have explanatory IV's in your equation that are closely related to each other, it really screws up the model (and TS% or FG% is obviously related to two-pointFG%, three-pointFG%, and free-throw%). If you were wondering why assists, blocks, opponent TO's, and opponent steals were not included, that also has to do with concerns about collinearity (assists and FG% are highly related, blocks and opponent FG% are also highly related, opponent TO's and Rockets' steals are highly related, and opponent steals and Rockets' TO's are highly related).

What about offensive rebounds? Surely that should be related to twos and threes FG% (poorer shooting by the rockets/opponent would mean higher chances for getting offensive rebounds for each respective team)? That is true, but collinearity diagnostics suggest that the degree to which offensive rebounds are related to other IV's such as shooting related variables is not high enough to be too significant a problem. You will see more on this in the opponent offensive rebound section.

Multiple Regression Results

Using the 12 variables above, the multiple regression model is:

Point_Differential = 0.849*TwosFGP + 0.643*ThreesFGP + 0.191*FTP + 0.524*OREB + 0.837*STL - 1.01*TOV - 0.278*PF - 0.643*OppOREB - 1.257*OppTwosFGP - 0.456*OppThreesFGP - 0.19*OppFTP + 0.138*OppPF + 25.974

(All FGP and FTP's are percentages, so you would plug in 40 for 40%, rather than .40)

By this model, given how our Rockets played against the Wizards recently, the point differential should have been 0.849*57.14 + 0.643*45.45 + 0.191*74.47 + 0.524*13 + 0.837*4 - 1.01*24 - 0.278*19 - 0.643*12 - 1.257*43.08 - 0.456*50 - 0.19*50 + 0.138*32 + 25.974 = 8.82 !!
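
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the coefficients and the Wizards box score values are copied from the equation above, while the dictionary layout and function name are my own.

```python
# Unstandardized coefficients (B) from the multiple regression model above.
COEFFICIENTS = {
    "TwosFGP": 0.849, "ThreesFGP": 0.643, "FTP": 0.191,
    "OREB": 0.524, "STL": 0.837, "TOV": -1.01, "PF": -0.278,
    "OppOREB": -0.643, "OppTwosFGP": -1.257, "OppThreesFGP": -0.456,
    "OppFTP": -0.19, "OppPF": 0.138,
}
INTERCEPT = 25.974

def predict_point_differential(box_score):
    """Plug one game's box score into the regression equation."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * box_score[name] for name in COEFFICIENTS
    )

# Box score values for the Wizards game, as plugged in above
# (percentages entered as 40 for 40%, not 0.40).
wizards_game = {
    "TwosFGP": 57.14, "ThreesFGP": 45.45, "FTP": 74.47,
    "OREB": 13, "STL": 4, "TOV": 24, "PF": 19,
    "OppOREB": 12, "OppTwosFGP": 43.08, "OppThreesFGP": 50,
    "OppFTP": 50, "OppPF": 32,
}

print(round(predict_point_differential(wizards_game), 2))  # 8.82
```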

What does this mean? Given how well our rockets played, despite the turnovers and poor defense (mainly on Ariza), we should have actually won that game by 8 or 9 points rather than one. The way I personally like to interpret this is that, rather than thinking of that game as evidence that our Rockets are not that good, we should regard it as a fluke game where the Rockets should have won by more points but the Wizards simply got lucky and brought the statistically predicted differential of 9 points down to one.

This model is decent, and for the most part accurately predicts wins vs losses (I think there were 4 games where the predicted outcome was wrong, but two of those were very, very close games that could have gone either way). In terms of point differential, the model's prediction was within 4 points (+/- 4) of the actual outcome 70% of the time. Games like the one vs Washington touched upon earlier, where the actual point differential was more than 7 points off from the prediction, may be perceived as atypical (or the alternative explanation is that this model is flawed, but I like to be optimistic).

Other games that were predicted to be significantly different from the outcome:

Jan 28 HOU vs SAS was predicted to be a loss by 1 point instead of actual win by 7.
Jan 24 HOU vs MEM was predicted to be a much worse loss by 12 points rather than 1.
Dec 12 HOU @ POR was predicted to be a win by one point instead of the actual loss by 8.
Nov 16 HOU vs DEN was predicted to be a close win by 3 points instead of 11.

There was also that terrible loss in Indiana which the model predicted to be a 25 point blowout instead of a 33 point blowout... but I don't think anyone cares.

So I've noticed that just like with my previous post on Lin's consistency, this post has gotten very long, so I've decided against posting the remaining list of predictions vis-a-vis the actual outcome, but if anyone wants to see them, let me know!

Comparing the Effects of IV's

Here we get to the main point of the post. Which variables seem more important than others in terms of predicting point differential? And how does one determine this? One's initial guess would be to look at the B column (these are the coefficients that represent the slopes, analogous to the “m” in y=mx + b) and conclude that the largest coefficient represents the greatest effect. In a way, that's kind of correct. For instance, an increase in the opponent two point field goal percentage by "5" would drop the point differential by around 6 points (5*-1.257), while an increase in opponent free throw percentage by "5" would only drop the point differential by about a point (5*-0.19).

However, looking at the B column to determine IV effect is flawed because it does not take into account the IV's scale. For example, if I had represented the field goal percentage IV's as decimals instead of percentages (e.g., 0.48 instead of 48), then the corresponding coefficients would have been 100 times larger. Additionally, even if the IV's were on the same scale, a change of one unit in one IV does not mean the same thing as a one-unit change in another (for example, raising FGP by one percentage point is not as big a change for that IV as adding one turnover is for turnovers). The standardized coefficients column is a much better representation, because (according to my understanding) it tells you how much effect an IV has on point differential if it increases by one standard deviation, rather than by one unit of whatever scale the IV uses.
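
As a rough sketch of what the standardized column does: a standardized coefficient is the raw B multiplied by the standard deviation of the IV and divided by the standard deviation of the DV. The standard deviations below are placeholder values I picked for illustration (SPSS reports the real ones), so only the scaling behavior matters here, not the exact outputs.

```python
def standardized_coefficient(b, sd_x, sd_y):
    """Beta = B * (SD of the IV) / (SD of the DV)."""
    return b * sd_x / sd_y

SD_POINT_DIFF = 13.0   # hypothetical SD of point differential
SD_FGP_PERCENT = 7.0   # hypothetical SD of a shooting % on the 0-100 scale

b_percent = -1.257                 # B when the IV is entered as 40 for 40%
b_decimal = b_percent * 100        # B if the IV were entered as 0.40 instead
sd_fgp_decimal = SD_FGP_PERCENT / 100

beta_percent = standardized_coefficient(b_percent, SD_FGP_PERCENT, SD_POINT_DIFF)
beta_decimal = standardized_coefficient(b_decimal, sd_fgp_decimal, SD_POINT_DIFF)

# Rescaling the IV changes B by a factor of 100 but leaves beta unchanged.
print(round(beta_percent, 3), round(beta_decimal, 3))
```

This is why the standardized column can be compared across IV's measured in different units, while the B column cannot.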

So what does that column show?

The following is the list of IV's in order of size of effect (standardized coefficients):

OppTwosFGP (-0.665)
ThreesFGP (0.519)
TwosFGP (0.439)
OppThreesFGP (-0.372)
TOV (-0.303)
OppOREB (-0.193)
STL (0.182)
OppFTP (-0.149)
FTP (0.143)
OREB (0.134)
PF (-0.086)
OppPF (0.064)

Defense vs Offense, and Importance of Three Point shooting

What is obvious (and probably most expected) is that the group of 4 shooting percentage IV's outside of free throws has the greatest effect on the margin of victory/defeat.

What is interesting, though, is that opponent two-point shooting has a noticeably greater effect than the Rockets' two point or three point shooting. And while opponent three point shooting has less of an effect, as a whole, opponent shooting better predicts the outcome than Rockets' shooting! (I don't know if you're allowed to just add up the coefficients like that, but not to worry: I also created a multiple regression model using field goal percentages as a whole, and the standardized coefficient for opponent field goal percentage was of a greater magnitude than that for the Rockets' field goal percentage. If anyone wants the data, I can share it.)

What this suggests is that perhaps defense IS more important than offense. James Harden may be one of the best and most efficient scorers in the NBA, but this model suggests that trying to overcome defensive deficiencies or lapses by simply being more brilliant on the offensive end may not actually be a viable strategy in the long run. This may also be a possible explanation for why McHale has not used Brooks extensively this season. He is a dynamic scorer but his defensive problems may not be worth it. But then as a counter-argument to that, Brooks is also a great 3 point shooter, which leads us to the second most influential factor...

Three point shooting for the Rockets appears to be a better predictor of the outcome than two point shooting! We all know that the Rockets' 3 point shooting has not been brilliant this season. Poor 3 point shooting causes a huge problem because the Rockets' overall strategy is to rely on basically that and scoring in the paint. If the 3's don't fall to spread the floor, the defenders have a much, much easier time guarding Howard by doubling him, or making sure players like Lin, who make a living off driving into the paint, have a difficult time scoring.

It's thus no surprise that the majority of games where the Rockets shot above 40% on threes were wins (and the Rockets did not have one loss where they shot above 50%). The last game vs the Suns, where the Rockets shot almost 70% from 3 point range, unsurprisingly resulted in a blowout. This is one reason why Beverley's 3 point shooting is so highly valued, why the bench's problems in this area (e.g., Francisco Garcia) are really hurting this team, and why we need Harden's 3 point shot to come back. And even though Lin's 3 point shooting is decent now, if he could just raise that percentage to the high 30's like earlier in the season, along with everyone else getting their 3 point shot back, this Rockets team could be unstoppable.

Turnovers Are Bad

Outside of the non-free-throw shooting variables, turnovers are the most significant factor. What this hopefully shows is that, yes, turnovers are bad. They are not some necessary evil that somehow allows the Rockets to play better. Turnovers are so bad, in fact, that an increase of two standard deviations in TO's (that would be 7 or 8 more TO's in a game) has almost the same negative effect on the outcome as an increase of one standard deviation in the opponent's two point field goal percentage (about 7 percentage points), and would be more damaging than a decrease of one standard deviation in the Rockets' three point shooting (a whole 10 percentage points). In very simplified terms (and correct me if I'm wrong, stats experts!), it's like saying: committing 7 or 8 more turnovers negatively affects the outcome as much as lowering the Rockets' 3 point shooting from a respectable 40% to a dismal 30%!
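
The comparison above is just arithmetic on the standardized coefficients, sketched below. The coefficients come from the standardized list in this post; the variable names are mine.

```python
# Standardized coefficients from the list in this post.
BETA_TOV = -0.303
BETA_OPP_TWOS = -0.665
BETA_THREES = 0.519

# Effect (in SDs of point differential) of +2 SD in turnovers.
two_sd_more_turnovers = 2 * BETA_TOV          # about -0.606
# Effect of +1 SD in opponent two-point FG%.
one_sd_better_opp_twos = 1 * BETA_OPP_TWOS    # -0.665
# Effect of -1 SD in Rockets' three-point FG%.
one_sd_worse_threes = -1 * BETA_THREES        # -0.519

print(two_sd_more_turnovers, one_sd_better_opp_twos, one_sd_worse_threes)
```

So two extra standard deviations of turnovers land between the other two effects: a bit less damaging than the opponent two-point shooting change, a bit more damaging than the three-point shooting drop.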

(Random rant from a Lin fan...)

This is why I really really REALLY want Lin to improve his decision making in terms of passing. I know that playing aggressively is worth the increase in TO's, but some of his TO's are COMPLETELY preventable and unrelated to aggressive attempts at play-making. Just take that game vs Washington: Lin was playing excellently, but in the 4th he would make these completely inexplicable passes that had no chance of making it to any of his teammates. These weren't from him driving and trying to make things happen. These were passes where he just picked up his dribble without any pressure from defenders and then for some reason threw the ball away. The good news is that these are completely fixable problems, and Lin has already slightly improved from last year in terms of TO's per 36. IMO, given his improvement in shooting and defense (to at least average), TO's are really the last significant obstacle he needs to overcome. Once he limits those turnovers he could have an even more positive effect on the team.

(And yes, Lin's not the only one making TO's, everyone needs to improve in that aspect!)

Opponent Offensive Rebounds and Steals

OppTwosFGP (-0.665)
ThreesFGP (0.519)
TwosFGP (0.439)
OppThreesFGP (-0.372)
TOV (-0.303)
OppOREB (-0.193)
STL (0.182)
OppFTP (-0.149)
FTP (0.143)
OREB (0.134)
PF (-0.086)
OppPF (0.064)

Outside of non-free throw shooting related variables, opponent offensive rebounds were the second most influential factor in terms of predicting outcome. It's no surprise that giving up offensive rebounds is harmful, but this model gives us an idea of just how bad it is. Giving up offensive rebounds is about 2/3 as bad as committing turnovers (going by the standardized coefficients, -0.193 vs -0.303). Note that in the far right is a column labeled VIF (variance inflation factors). Remember the collinearity concept that I mentioned earlier? This column displays numbers that give you the degree of “relatedness” that one IV may have with another IV (or a couple IV's). Generally, only VIF's greater than 5 are considered problematic (which is why I decided to keep OppOREB in this analysis). However, since the VIF for OppOREB (at 2.008) is still large compared to the other variables, this suggests that the effect of opponent offensive rebounds may be slightly inflated in this model (so the actual standardized effect may be a little bit less than -0.193). Regardless, the negative effect of opponent offensive rebounds is quite significant.
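To put the VIF number in more familiar terms, here is a minimal Python sketch that uses the standard definition VIF = 1/(1 - R^2), where R^2 comes from regressing that one IV on all the other IV's. The function name is just for illustration.

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to get the R^2 of one IV on the others."""
    return 1.0 - 1.0 / vif

opp_oreb_vif = 2.008  # value reported above for OppOREB
print(round(r_squared_from_vif(opp_oreb_vif), 3))  # 0.502
print(round(r_squared_from_vif(5.0), 3))           # 0.8
```

In other words, a VIF of 2.008 means roughly half of the variation in opponent offensive rebounds can be explained by the other IV's, while the rule-of-thumb cutoff of 5 corresponds to 80%.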

Steals also have a positive effect, but not nearly enough to counter the negative effect of turnovers. This can partially be explained by the fact that Harden (and a few other Rockets) sometimes gamble for steals, which does not really have a very positive effect on the game's outcome. Playing solid defense and lowering opponents' field goal percentage is a much greater contributing factor to the outcome (and this model is good evidence for this). This is why, for example, despite Lin's decrease in steals compared to last year, I actually consider him to be playing better overall defense this year. I'd prefer that he maintain his improved D and not gamble for steals. This is not to downplay steals that are the result of good defense of course (e.g., many of Beverley's steals). But this model suggests that overall, the positive effect of steals is simply not that high, or that steals are not very predictive of the outcome.

The Rest

OppTwosFGP (-0.665)
ThreesFGP (0.519)
TwosFGP (0.439)
OppThreesFGP (-0.372)
TOV (-0.303)
OppOREB (-0.193)
STL (0.182)
OppFTP (-0.149)
FTP (0.143)
OREB (0.134)
PF (-0.086)
OppPF (0.064)

Opponent free throw shooting, Rockets' free throw shooting, Rockets' offensive rebounding, and personal fouls from either side have the least significant effect. For personal fouls, they not only have a very small effect, but their significance values are also high (p= 0.182 and 0.342 for Rockets and opponent fouls respectively). This means they also affect the outcome very inconsistently.

What is interesting is that the effect of Rockets' free throw shooting percentage on point differential is as low as it is. As suggested earlier, this may simply be because free throw shooting does not contribute as many points compared to the other factors, but as you will see in the logistic regression analysis, this is probably not a good explanation.

Logistic Regression Analysis and Results

Using the 12 variables above, the logistic regression model is:

Win/Loss_Logit = 5.057*TwosFGP + 3.777*ThreesFGP - 0.057*FTP + 0.32*OREB + 1.538*STL - 5.888*TOV - 1.122*PF - 2.173*OppOREB - 5.459*OppTwosFGP - 3.926*OppThreesFGP - 2.4*OppFTP + 0.818*OppPF + 309.961

Similar to the regression model, you simply plug in the relevant numbers, but the outcome is a win/loss logit rather than point differential. To convert the logit into a probability (of being a win), use the equation Winchance = 1/(1+e^(-Win/Loss_logit)). SPSS simulation using this model was able to predict with 100% accuracy whether the Rockets win or lose given the relevant information, but I would not really use this as a good predictive model. It's hard to really explain the reasons, so I will just leave it at that for now (if anyone wants clarification I can try to explain).
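Here is a minimal Python sketch of that plug-in step, using the coefficients as posted and the standard logistic function p = 1/(1+e^(-logit)). The function names are made up for illustration, and the units of each input (for example whether a shooting percentage goes in as 45 or 0.45) are not stated in the post, so that part is left to whoever has the data.

```python
import math

# Raw coefficients and constant copied from the logistic model above.
COEF = {
    "TwosFGP": 5.057, "ThreesFGP": 3.777, "FTP": -0.057, "OREB": 0.32,
    "STL": 1.538, "TOV": -5.888, "PF": -1.122, "OppOREB": -2.173,
    "OppTwosFGP": -5.459, "OppThreesFGP": -3.926, "OppFTP": -2.4,
    "OppPF": 0.818,
}
INTERCEPT = 309.961

def win_logit(box):
    """Plug one game's stats (dict keyed like COEF) into the model."""
    return INTERCEPT + sum(COEF[name] * box[name] for name in COEF)

def win_probability(logit):
    """Logistic function, written so large |logit| does not overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

# A logit of 0 is a coin flip; positive logits favor a win.
for logit in (-5.0, 0.0, 5.0):
    print(logit, round(win_probability(logit), 3))
```

With coefficients this large, almost any real game will produce a logit far from zero, so the predicted probability will be essentially 0 or 1.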

Logistic Regression and IV's Effects

The purpose of this model is more to show the significance of effect of IV's on WIN vs LOSS, rather than point differential. As explained earlier, this would increase the effect of IV's that may have smaller contributions to victory margins, but still have solid contribution to the outcome regardless of how close/blowout a win or loss was. Because SPSS does not standardize the coefficients like it does for multiple regression, I manually standardized the IV's to produce the above table.
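The post does not spell out how the manual standardization was done. One common approach (an assumption here, not confirmed by the post) is to multiply each raw coefficient by the standard deviation of its IV. If that is what happened, dividing each standardized value listed in this section by the matching raw coefficient from the model equation should give back each IV's SD. A quick Python check:

```python
# Raw coefficients from the logistic model equation.
raw = {
    "TwosFGP": 5.057, "ThreesFGP": 3.777, "FTP": -0.057, "OREB": 0.32,
    "STL": 1.538, "TOV": -5.888, "PF": -1.122, "OppOREB": -2.173,
    "OppTwosFGP": -5.459, "OppThreesFGP": -3.926, "OppFTP": -2.4,
    "OppPF": 0.818,
}
# Standardized values from the ranked list in this section.
standardized = {
    "TwosFGP": 33.885, "ThreesFGP": 39.493, "FTP": -0.559, "OREB": 1.062,
    "STL": 4.342, "TOV": -22.934, "PF": -4.49, "OppOREB": -8.465,
    "OppTwosFGP": -37.451, "OppThreesFGP": -41.498, "OppFTP": -24.382,
    "OppPF": 4.878,
}

# Assumption: standardized = raw * SD, so SD = standardized / raw.
implied_sd = {name: standardized[name] / raw[name] for name in raw}

for name, sd in implied_sd.items():
    print(f"{name:13s} {sd:6.2f}")
```

The implied SDs come out close to the ones quoted earlier in the post: a little under 4 for turnovers (so two SDs is 7 or 8), about 7 for opponent two point percentage, and a bit over 10 for three point percentage. That is consistent with the raw-times-SD reading, though it does not prove it.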

And what does it show?
The following is the list of IV's in order of greatness of effect (on Win/Loss group membership rather than point differential!):

OppThreesFGP (-41.498)
ThreesFGP (39.493)
OppTwosFGP (-37.451)
TwosFGP (33.885)
OppFTP (-24.382)
TOV (-22.934)
OppOREB (-8.465)
OppPF (4.878)
PF (-4.49)
STL (4.342)
OREB (1.062)
FTP (-0.559)

Unlike the multiple regression analysis results, here the primary factor in predicting wins vs losses is opponent three point shooting! Additionally, opponent non-free throw related shooting now accounts for the 1st and 3rd most important variables in determining wins/losses instead of 1st and 4th in the multiple regression analysis. This further emphasizes the importance of DEFENCE over OFFENCE. Rockets' three point shooting, on the other hand, is still in second place in terms of effect on the outcome, consistent with the multiple regression model, which again re-emphasizes 3 point shooting as the 2nd most significant thing that the Rockets need to improve on.

Curiously, turnovers, while still incredibly significant in predicting wins/losses, did not increase in effect compared to the multiple regression model (what I mean by that is comparing turnovers' effect to other IV's, such as that of twos or threes field goal percentage, does not suggest a greater relative effect in this model than the same comparison would suggest in the multiple regression model). My interpretation of this is that turnovers are just as good at predicting wins/losses as predicting point differential. In any case, turnovers are obviously still problematic, and outside of shooting related variables, they are the most significant factor in determining the outcome of a game.
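The "relative effect" comparison can be made concrete with a few lines of Python. This sketch just divides the size of the turnover coefficient by the size of a shooting coefficient within each model, using the standardized values posted for both models. The helper name is made up for illustration.

```python
# Standardized coefficients from the two models in this post.
linear = {"TwosFGP": 0.439, "ThreesFGP": 0.519, "TOV": -0.303}
logistic = {"TwosFGP": 33.885, "ThreesFGP": 39.493, "TOV": -22.934}

def relative_to(model, name, reference):
    """Size of one IV's effect as a fraction of another's, same model."""
    return abs(model[name]) / abs(model[reference])

for ref in ("TwosFGP", "ThreesFGP"):
    in_linear = relative_to(linear, "TOV", ref)
    in_logistic = relative_to(logistic, "TOV", ref)
    print(ref, round(in_linear, 2), round(in_logistic, 2))
```

In both models turnovers come out at roughly 0.7 of the two point shooting effect and roughly 0.6 of the three point shooting effect, which is what "did not increase in effect" means here.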

Finally, opponent offensive rebounding, just like in the multiple regression model, is the 2nd most important non-shooting related variable in terms of predicting the game's outcome.

Free Throw Percentages in the Logistic Regression Model

For most other variables, the magnitudes of the effects are similar to those in the multiple regression, with two major exceptions: opponent and Rockets' free throw shooting. Standardized opponent free throw shooting percentage has a huge -24.382 in magnitude of effect, slightly beating out turnovers in terms of importance! However, given that the Rockets cannot really control how well the opponent shoots free throws, this does not really point to anything that the team needs to change or improve on (except, maybe try not to let them go to the line so much... but then the low significance of PF's in this model seems to contradict that statement).

Lastly, Rockets' free throw shooting percentage has almost zero influence on win/loss outcome. In fact, it is slightly negative! How is this possible? I am honestly not very certain, but my best guess is that because earlier this season the Rockets had many poor FT shooting games (from Hack a Howard) but were often good enough offensively to overcome it and win, the logistic regression analysis interpreted low FT shooting as being a small predictor of a win. The fact that there have not been enough games where the Rockets had high FT shooting to analyze probably also skews this model (and the two games where the Rockets shot insanely high -almost 90%!- in FT's were split evenly, with the recent game vs the Timberwolves being a win, and the game vs the Suns on December 4th being a loss). I have no doubt that as Dwight Howard continues to improve his FT shooting and the sample size of high FT shooting games increases, logistic regression analyses will likely begin to show FT percentage as not only having a positive effect, but having a significantly large positive effect on the game outcome. Of course, we'll just have to see.

Some Caveats

Just a reminder that statistical models are not perfect. As demonstrated by the strange result from the logistic regression model that suggests FT percentage has zero to slightly negative effect on the outcome of a game (which is absurd!), these models and coefficients must be placed in the proper context to better understand them. The conclusions that I drew from these results are my interpretations, but it is very possible that I placed them in the wrong context or misunderstood them. That is why I made the data available so that everyone can come to their own conclusions. Additionally, my choice in IV selection is probably not perfect either. I understand not everyone has access to SPSS (or has the time to understand how to use it), so if anyone has a strong case for analyzing a different group of IV's that may better explain what factors are good/bad for the Rockets, I am all ears!

Lastly, remember that some IV's like assists were not included in this analysis. Thus, when I make conclusions about what is significant or most important, or 2nd most important, etc., that is only with regards to the variables that were examined here, not a general statement about some variable's importance out of everything. Also, not being able to include assists in this analysis is really, really unfortunate. I really want to know how much they help the team. If any stats experts here (stats, new, torocan, cn0gd, and many others) have any suggestions on how to include that variable in a way that does not produce collinearity confounds, PLEASE speak up!

Conclusion

Thanks for your patience in reading this longgg post! (at least for those of you who didn't skip the entire thing just to read this section)

Multiple and Logistic Regression analyses suggest that for the Rockets:

-Opponent shooting percentages are the most significant predictor of game outcome

-(Opponent 2 point shooting is most significant in predicting point differential, opponent 3 point shooting is most significant in predicting wins vs losses)

-Three point shooting is the second most significant predictor of game outcome

-Two point shooting is the 3rd or 4th most significant predictor of game outcome

-Turnovers are the 5th or 6th most significant predictor of game outcome

-Opponent offensive rebounds are the 6th or 7th most significant predictor of game outcome.

-Steals follow opponent offensive rebounds in significance in predicting point differential, but are less significant in terms of predicting wins vs losses. Overall, steals are of middling importance.

-Rockets' offensive rebounds have a small positive effect, but do not seem to be a significant factor.

-Personal fouls do not seem to be a factor

-Free throw shooting for the Rockets or opponents does not seem to be a significant factor, however:

-Opponent free throw shooting, while not a significant predictor of point differential, is incredibly significant as a predictor of wins vs losses (more significant than turnovers even)

-Rockets' free throw shooting significance is probably messed up in both multiple and logistic regression models because the team has won so many games with low FT percentages and has not provided a large enough sample of games with high FT percentages

-The above should change as Dwight Howard continues to improve on his FT's (fingers crossed!)

The above relationships suggest that for the Rockets:

-Number 1 priority should be for the team to play better defense, as relying on offensive brilliance to offset poor defense is likely not a viable long term strategy

-Number 2 priority is for the team to figure out how to start getting their 3's to consistently fall

-Number 3 priority is to improve 2 point shooting, although since this is usually not as much of a problem (and the fact that an improvement on shooting 3's may also have a positive effect on shooting 2's) maybe the Rockets team ought to focus on the other areas while simply maintaining the status quo in this area.

-Number 4 priority (although really it should be number 3) is to cut down on turnovers

-Number 5 priority is to decrease the number of offensive rebounds given up to the opposing team

Note: if we begin seeing more games where wins are correlated with high FT percentages, then an updated model may suggest that maintaining a good FT shooting percentage should be a top priority as well

Caveats:

-Assists and blocks were not included in this analysis

-If someone can figure out a way to include assists without messing up the model, we may very well see “moving the ball better” as another top priority as well.

I could not read the whole post because of the length. However, I did not see the R-squared or a significance level. If the R-squared is small then your IV's do not predict the DV well. If the significance level is above .05 then one must question the outcome. Also, stat programs that run logistic regressions will produce a confusion matrix that classifies each prediction as a hit, miss, false positive, or true negative. They also give you the sensitivity of the model. These stats are important when one discusses the quality of the model. Could you please post them? Also, wouldn't a stepwise regression be more appropriate, as it would eliminate variables that do not contribute substantially to the multiple R? It can be done using forward or backward steps. The stepwise will also give you the increase in R-squared from a particular variable. A variable can be statistically significant but contribute very little to the predictive ability of the model. A good job, but I am used to seeing the data that I listed above when I evaluate the quality of a multiple regression of any sort.
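For reference, the confusion matrix and sensitivity mentioned above can be computed from any list of actual and predicted outcomes. Here is a minimal Python sketch. The outcomes below are made up purely for illustration (1 = win, 0 = loss) and are not Rockets data, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (hits, false positives, true negatives, misses)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # hit
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positive
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negative
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # miss
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Share of actual wins that the model called a win."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual losses that the model called a loss."""
    return tn / (tn + fp)

# Hypothetical outcomes, for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)        # 4 1 2 1
print(sensitivity(tp, fn))   # 0.8
```

A model that classifies every game correctly would show zero false positives and zero misses, with sensitivity and specificity both equal to 1.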

#15 gene18, Feb 16, 2014
Hakeemtheking Member

Joined:

Feb 26, 2009

Messages:

9,193

Likes Received:

6,059

Didn't read. Too Damn short.:grin:

Btw, ask Morey for a job.

#16 Hakeemtheking, Feb 16, 2014
Noob Cake Member

Joined:

Mar 10, 2008

Messages:

3,541

Likes Received:

699

gene18 said: ↑

I could not read the whole post because of the length. However, I did not see the R-squared or a significance level. If the R-squared is small then your IV's do not predict the DV well. If the significance level is above .05 then one must question the outcome. Also, stat programs that run logistic regressions will produce a confusion matrix that classifies each prediction as a hit, miss, false positive, or true negative. They also give you the sensitivity of the model. These stats are important when one discusses the quality of the model. Could you please post them? Also, wouldn't a stepwise regression be more appropriate, as it would eliminate variables that do not contribute substantially to the multiple R? It can be done using forward or backward steps. The stepwise will also give you the increase in R-squared from a particular variable. A variable can be statistically significant but contribute very little to the predictive ability of the model. A good job, but I am used to seeing the data that I listed above when I evaluate the quality of a multiple regression of any sort.

1. R2 is never indicative of generalized linear model fit or performance.
2. Significance tests are listed in the tables.
3. A confusion matrix does not apply in this case. OP is attempting to do inference instead of classification. OP is not doing statistical machine learning here.
4. Stepwise is only useful for variable selection. All covariates have been shown to be statistically significant. Stepwise is therefore a moot point. There are no issues with multicollinearity.

To OP:
1. If you are trying to throw in sections on logit link, VIF and multicollinearity, you really should go back to the basics and present the diagnostic plots. I'm highly doubtful that the linearity, homoscedasticity and normality assumptions are satisfied.
2. Your logistic regression section is straight up garbage in its current state. There is something wrong with the logistic regression. Check out your standard errors and Wald test statistics. There is either a numeric instability issue or something is seriously wrong with the methods/model/data of this analysis. You can't do any inference with betas that are not statistically significant. You seem to be trying to interpret all the 0 betas in the logistic regression.
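One well-known way to get this kind of numeric instability is complete separation, where some combination of IV's splits wins from losses perfectly. The OP's report of 100% classification accuracy would be consistent with that, but nothing posted in the thread confirms it, so treat this as one possible explanation only. The Python sketch below uses a tiny made-up data set (not Rockets data) to show the symptom: the log-likelihood keeps improving as the coefficient grows, so the fit never settles on a finite value.

```python
import math

# Made-up, perfectly separated data: every x < 0 is a loss, every x > 0 a win.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

def log_sigmoid(z):
    """log(1 / (1 + e^-z)), written to stay stable for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(beta):
    """Log-likelihood of a no-intercept logistic model p = sigmoid(beta * x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = beta * xi if yi == 1 else -beta * xi
        total += log_sigmoid(z)
    return total

for beta in (1.0, 10.0, 100.0):
    print(beta, log_likelihood(beta))
```

Each larger coefficient gives a log-likelihood closer to zero, so an optimizer will keep pushing the coefficient up until it hits an iteration limit. That typically shows up as very large coefficients with enormous standard errors.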

#17 Noob Cake, Feb 16, 2014
jtr Contributing Member

Joined:

Dec 4, 2011

Messages:

7,470

Likes Received:

275

After a quick scan:

Free throw percentage is pretty much a meaningless stat. It does not affect the outcome of games per se. However, free throws made have an enormous effect on games. Rather than go on about which stats are actually meaningful, you should just look at the four factors. These are metrics that have basically been vetted by the basketball community for importance. The four factors have been through rigorous statistical investigations in the past. But your post certainly is an interesting exercise.
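For anyone unfamiliar with the four factors (shooting, turnovers, offensive rebounding, free throws), here is a minimal Python sketch using the commonly published definitions: effective field goal percentage, turnover percentage, offensive rebound percentage, and free throws made per field goal attempt. The box score numbers are made up for illustration and the function name is hypothetical.

```python
def four_factors(fg, fga, fg3, ft, fta, tov, orb, opp_drb):
    """Compute the four factors for one team from basic box score totals."""
    return {
        # Made threes count as 1.5 field goals.
        "eFG%": (fg + 0.5 * fg3) / fga,
        # Turnovers per possession-ending play (0.44 is the usual FTA weight).
        "TOV%": tov / (fga + 0.44 * fta + tov),
        # Share of available offensive rebounds the team collected.
        "ORB%": orb / (orb + opp_drb),
        # Free throws MADE per field goal attempt, not free throw percentage.
        "FT rate": ft / fga,
    }

# Hypothetical box score, for illustration only.
factors = four_factors(fg=40, fga=80, fg3=10, ft=20, fta=25,
                       tov=15, orb=10, opp_drb=30)
for name, value in factors.items():
    print(f"{name:8s} {value:.3f}")
```

Note that the free throw factor rewards getting to the line and converting, which is the distinction between free throw percentage and free throws made.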

#18 jtr, Feb 16, 2014

gene18 Rookie

Joined:

Dec 29, 2012

Messages:

990

Likes Received:

23

Noob Cake said: ↑

1. R2 is never indicative of generalized linear model fit or performance.
2. Significance tests are listed in the tables.
3. A confusion matrix does not apply in this case. OP is attempting to do inference instead of classification. OP is not doing statistical machine learning here.
4. Stepwise is only useful for variable selection. All covariates have been shown to be statistically significant. Stepwise is therefore a moot point. There are no issues with multicollinearity.

To OP:
1. If you are trying to throw in sections on logit link, VIF and multicollinearity, you really should go back to the basics and present the diagnostic plots. I'm highly doubtful that the linearity, homoscedasticity and normality assumptions are satisfied.
2. Your logistic regression section is straight up garbage in its current state. There is something wrong with the logistic regression. Check out your standard errors and Wald test statistics. There is either a numeric instability issue or something is seriously wrong with the methods/model/data of this analysis. You can't do any inference with betas that are not statistically significant. You seem to be trying to interpret all the 0 betas in the logistic regression.

A logistic regression can be used to predict what category something might fall in. Breast Cancer/No Breast Cancer predicts categories and provides a confusion matrix. My stat programs produce them (SYSTAT, NCSS, SPSS), unless all the stat programs I have used are wrong. The OP developed two models. R-squared applies to the continuous dependent variable, not the logistic model. The R-squared is THE measure of how well the model predicts in a multiple linear regression. Where did you get the info that the R-squared is not a measure of the accuracy of a model? I have been involved with multivariable stats for 25 years and it has always been that way. Perhaps it has changed. Could you provide me with a reference? I am talking about the significance of the R-squared. The higher the R-squared, the less error in prediction of the continuous dependent variable. There is less variance. Also, the increase in R-squared is very helpful in eliminating or reducing the number of variables. Win/Loss is a classification. I have used this type of classification many times with Ph.D. students who needed help with their dissertations. It was never criticized. But perhaps things have recently changed. Medical and psychological literature uses it in this way. The only machine learning that I have used is neural networks and genetic algorithms, and I have not seen logistic regression termed machine learning.
Perhaps I am totally wrong. I will research my understanding.
Your post is a bit confusing to me.
Have you read Elazar Pedhazur's book, Multiple Regression in Behavioral Research? He was the stat guru in this area when I took most of my courses in stat. Here is a quote from his book about R-squared: "There are several tests of significance one may apply to the results of multiple regression. Three of them are: (1) R-squared; tests of regression coefficients; and tests of increments in the proportion of variance accounted for by a given variable." (Stepwise regression.) "The test of R-squared indicates whether the regression of Y on the independent variable(s) is significant." (Pg. 57-58 in the above-mentioned book.)
Your post is very confusing to me. Can you see why after the quote from Pedhazur?
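For readers following the R-squared discussion, here is a minimal Python sketch of the two quantities being talked about for an ordinary linear regression: R-squared as the proportion of variance in the DV accounted for by the model, and the usual F statistic for testing whether R-squared is significantly different from zero. The function names and the numbers are made up for illustration and are not the OP's data.

```python
def r_squared(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot

def f_statistic(r2, k, n):
    """F test of R-squared with k predictors and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Hypothetical point differentials and model predictions.
actual_margin = [10, -4, 7, -12, 3]
predicted_margin = [8, -2, 6, -10, 2]

r2 = r_squared(actual_margin, predicted_margin)
print(round(r2, 3))
print(round(f_statistic(r2, k=1, n=len(actual_margin)), 2))
```

Neither quantity carries over directly to logistic regression, which is fit by maximum likelihood rather than by minimizing squared error, so both sides of this exchange may be describing different models.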

#19 gene18, Feb 16, 2014
qiantom1999 Rookie

Joined:

Jan 6, 2014

Messages:

223

Likes Received:

1

You are talking about linear regression. He is talking about generalized linear models, logistic regression in this case.

gene18 said: ↑

A logistic regression can be used to predict what category something might fall in. Breast Cancer/No Breast Cancer predicts categories and provides a confusion matrix. My stat programs produce them (SYSTAT, NCSS, SPSS), unless all the stat programs I have used are wrong. The OP developed two models. R-squared applies to the continuous dependent variable, not the logistic model. The R-squared is THE measure of how well the model predicts in a multiple linear regression. Where did you get the info that the R-squared is not a measure of the accuracy of a model? I have been involved with multivariable stats for 25 years and it has always been that way. Perhaps it has changed. Could you provide me with a reference? I am talking about the significance of the R-squared. The higher the R-squared, the less error in prediction of the continuous dependent variable. There is less variance. Also, the increase in R-squared is very helpful in eliminating or reducing the number of variables. Win/Loss is a classification. I have used this type of classification many times with Ph.D. students who needed help with their dissertations. It was never criticized. But perhaps things have recently changed. Medical and psychological literature uses it in this way. The only machine learning that I have used is neural networks and genetic algorithms, and I have not seen logistic regression termed machine learning.
Perhaps I am totally wrong. I will research my understanding.
Your post is a bit confusing to me.
Have you read Elazar Pedhazur's book, Multiple Regression in Behavioral Research? He was the stat guru in this area when I took most of my courses in stat. Here is a quote from his book about R-squared: "There are several tests of significance one may apply to the results of multiple regression. Three of them are: (1) R-squared; tests of regression coefficients; and tests of increments in the proportion of variance accounted for by a given variable." (Stepwise regression.) "The test of R-squared indicates whether the regression of Y on the independent variable(s) is significant." (Pg. 57-58 in the above-mentioned book.)
Your post is very confusing to me. Can you see why after the quote from Pedhazur?

#20 qiantom1999, Feb 16, 2014
