What factors are most predictive of a game's outcome for the Rockets? A multiple/logistic regression

heypartner · Feb 16, 2014

The way I personally like to interpret this is that, rather than thinking of that game as evidence that our Rockets are not that good, we should regard it as a fluke game: the Rockets should have won by more points, but the Wizards simply got lucky and brought the statistically predicted differential of 9 points down to one.

Fun post. I'm on a phone, so I haven't read it all yet, but I wanted to stop at your first example, the Wizards game.

So when the real games don't match the stats, you explain that away as "luck"? Did your professor teach you that?

hollywoodMarine · Feb 16, 2014

Noob Cake said: ↑

2. Your logistic regression section is straight up garbage in its current state. There is something wrong with the logistic regression. Check out your standard errors and Wald test statistics. There is either a numeric instability issue or something is seriously wrong with the methods/model/data of this analysis. You can't do any inference with betas that are not statistically significant. You seem to be trying to interpret all the 0 betas in the logistic regression.

Hi, thanks! I just facepalmed because I derped real bad on this one. Basically, when I ran logistic regression initially (with fewer variables, around 4), the significance values of the coefficients were more like how you would expect them to be (not f*ed up like 1 and .9999 as you see in this table). As I added more and more variables, the omnibus test of model coefficients remained below alpha, which suggested that the model was still significant. The problem I had totally forgotten about is that when too many variables are added in a logistic regression analysis (esp. with a smallish 50+ sample size of games like in this case), the model can remain significant while the coefficients become insignificant. This is why, like I said, SPSS was still able to predict the outcome between wins and losses with 100% accuracy.
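One plausible reading of those symptoms (100% classification accuracy alongside p-values near 1) is complete separation: with enough predictors and only 50-odd games, some combination of them splits wins from losses perfectly, so the fitted coefficients drift toward infinity and their standard errors blow up. That diagnosis is an assumption, not something confirmed in this thread. A minimal pure-Python sketch with made-up data (one predictor, no intercept) shows the pattern:

```python
import math

# Made-up, perfectly separated data: every win has x > 0, every loss has x < 0.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_slope(steps, lr=0.5):
    """Gradient ascent on the logistic log-likelihood (slope only, no intercept)."""
    beta = 0.0
    for _ in range(steps):
        grad = sum((yi - sigmoid(beta * xi)) * xi for xi, yi in zip(x, y))
        beta += lr * grad
    return beta


def accuracy(beta):
    """Share of games classified correctly at a 0.5 probability threshold."""
    preds = [1 if sigmoid(beta * xi) >= 0.5 else 0 for xi in x]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)


def wald_se(beta):
    """Standard error of the slope from the inverse Fisher information."""
    info = sum(sigmoid(beta * xi) * (1 - sigmoid(beta * xi)) * xi * xi for xi in x)
    return 1.0 / math.sqrt(info)


# The longer the optimizer runs, the larger the slope and its standard error get,
# while accuracy stays at 100% the whole time.
for steps in (10, 100, 1000):
    b = fit_slope(steps)
    print(steps, round(b, 2), round(wald_se(b), 2), accuracy(b))
```

The slope never converges because the likelihood keeps improving as it grows, which is why a Wald test on such a coefficient is uninformative even though the model classifies every game correctly.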

Anyways I reran logistic regression with fewer variables (took out OREB's and FT's) and got a much better result:

Noob Cake said: ↑

To OP:
1. If you are trying to throw in sections on logit link, VIF and multicollinearity, you really should go back to the basics and present the diagnostic plots. I'm highly doubtful that the linearity, homoscedasticity and normality assumptions are satisfied.

The only reason I talked about VIF and multicollinearity was because it was relevant to how to properly interpret OppOREB (I said that its effect was probably inflated, which I felt ppl here in this forum may have been interested in). Given that this post had already gotten a bit too technical, I felt that including information about things like linearity, homoscedasticity, and normality assumptions would have been unnecessary.

But if you truly are curious, they are satisfied (for the most part). The Breusch-Pagan test of heteroscedasticity was 4.868, p = .9623, the KS test of normality on residuals had p = .200, and in terms of linearity, I eyeballed all scatter-plots of the individual IV's against the DV (they all appeared linear with the exception of PF's for both sides and OppFTP's, which were too scattered to really detect what the relationship was, but I chose to include them anyway, and I highly doubt that would have messed up the model or anything).

wizkid83 · Feb 16, 2014

hollywoodMarine said: ↑

Hi, thanks! I just facepalmed because I derped real bad on this one. Basically, when I ran logistic regression initially (with fewer variables, around 4), the significance values of the coefficients were more like how you would expect them to be (not f*ed up like 1 and .9999 as you see in this table). As I added more and more variables, the omnibus test of model coefficients remained below alpha, which suggested that the model was still significant. The problem I had totally forgotten about is that when too many variables are added in a logistic regression analysis (esp. with a smallish 50+ sample size of games like in this case), the model can remain significant while the coefficients become insignificant. This is why, like I said, SPSS was still able to predict the outcome between wins and losses with 100% accuracy.

Anyways I reran logistic regression with fewer variables (took out OREB's and FT's) and got a much better result:

The only reason I talked about VIF and multicollinearity was because it was relevant to how to properly interpret OppOREB (I said that its effect was probably inflated, which I felt ppl here in this forum may have been interested in). Given that this post had already gotten a bit too technical, I felt that including information about things like linearity, homoscedasticity, and normality assumptions would have been unnecessary.

But if you truly are curious, they are satisfied (for the most part). The Breusch-Pagan test of heteroscedasticity was 4.868, p = .9623, the KS test of normality on residuals had p = .200, and in terms of linearity, I eyeballed all scatter-plots of the individual IV's against the DV (they all appeared linear with the exception of PF's for both sides and OppFTP's, which were too scattered to really detect what the relationship was, but I chose to include them anyway, and I highly doubt that would have messed up the model or anything).

Have you had a chance to run a model with my suggested changes to the FG% variables? I really think you'd draw the wrong conclusion without normalizing for competitors' performance. Otherwise you might just be saying the most important factor in the Rockets winning or losing is the quality of the opponents faced rather than what's actually controllable by the Rockets.

hollywoodMarine · Feb 16, 2014

Updated conclusions :grin:

New order of magnitude of effect (in terms of predicting wins vs losses)

ThreepointFGP (3.932)
OppTwosFGP (-3.649)
OppThreesFGP (-3.52)
TwosFGP (3.398)
Turnovers (-1.471)
Steals (1.054)

I had to remove a couple of variables because a logistic regression model for this sample size of games simply could not be properly created with more than 6 IV's.

According to these stats, the main thing the Rockets should improve on is the three-point shot, followed by defense, followed by maintaining a good two-point FG%, followed by cutting down on turnovers, followed by increasing steals (although given the high p-value of .177, I don't think steals are a good predictor of outcome). TO's also have a high p-value of .107 here, which may suggest they are not consistently predictive of outcome either. This is different from the multiple regression model, in which steals and TO's are significant predictors of point differential.

In terms of the multiple regression model, as Carl Herrera pointed out, OREB% is a much better indicator of performance. The new multiple regression model using that variable instead shows that OREB for both sides is insignificant, but the importance of the other factors (defense, three-point shooting, turnovers) remains more or less the same.

hollywoodMarine · Feb 16, 2014

wizkid83 said: ↑

Have you had a chance to run a model with my suggested changes to the FG% variables? I really think you'd draw the wrong conclusion without normalizing for competitors' performance. Otherwise you might just be saying the most important factor in the Rockets winning or losing is the quality of the opponents faced rather than what's actually controllable by the Rockets.

Hi, I'm still reading everyone's posts. I'll get to your suggestions shortly

Noob Cake · Feb 16, 2014

hollywoodMarine said: ↑

Updated conclusions :grin:

New order of magnitude of effect (in terms of predicting wins vs losses)

ThreepointFGP (3.932)
OppTwosFGP (-3.649)
OppThreesFGP (-3.52)
TwosFGP (3.398)
Turnovers (-1.471)
Steals (1.054)

I had to remove a couple of variables because a logistic regression model for this sample size of games simply could not be properly created with more than 6 IV's.

According to these stats, the main thing the Rockets should improve on is the three-point shot, followed by defense, followed by maintaining a good two-point FG%, followed by cutting down on turnovers, followed by increasing steals (although given the high p-value of .177, I don't think steals are a good predictor of outcome). TO's also have a high p-value of .107 here, which may suggest they are not consistently predictive of outcome either. This is different from the multiple regression model, in which steals and TO's are significant predictors of point differential.

In terms of the multiple regression model, as Carl Herrera pointed out, OREB% is a much better indicator of performance. The new multiple regression model using that variable instead shows that OREB for both sides is insignificant, but the importance of the other factors (defense, three-point shooting, turnovers) remains more or less the same.

Looks much better. The omnibus F-test is pretty much "garbage" in this case, since the alternative is only that at least one of the betas is not equal to 0.

Since you seem to have time and are interested in doing regression:

1. Run an exhaustive MLR fit with selection by BIC while ADDING in an indicator for home court.
2. Estimate the density of your predictor variables for each of the 30 teams (i.e. 30 teams * 12 significant predictors). This is a lot of data collection. Not sure where you are sourcing your data from. If you have the data in any raw form, i.e. scraped or downloaded, I can help you reformat and clean it.
3. Essentially do a Monte Carlo simulation next: simulate N samples and plug them into your MLR model to estimate the point spread and win % (i.e. without having to resort to logistic regression; if my understanding is correct, you are getting inverse-logit transformed probabilities of either 0 or 1).
4. You have the start of an NBA prediction system.

If you are doing all of this, you should probably move away from SPSS and use something more numerical like R/Python/Matlab (Octave).
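As a rough illustration of steps 2 and 3, here is a minimal Python sketch using only the standard library. Everything numeric in it is hypothetical: the coefficients, intercept, predictor distributions and residual noise are placeholders rather than values fitted in this thread, and treating each predictor as an independent normal is a simplifying assumption.

```python
import random

# Hypothetical MLR coefficients (points of margin per unit); NOT fitted values.
COEFS = {"fg3_pct": 0.6, "opp_fg2_pct": -0.8, "turnovers": -1.0, "home": 3.0}
INTERCEPT = 33.4  # hypothetical

# Hypothetical (mean, SD) per predictor; step 2 would estimate these per team.
DISTS = {"fg3_pct": (35.0, 10.0), "opp_fg2_pct": (48.0, 6.0), "turnovers": (15.0, 4.0)}
RESIDUAL_SD = 5.0  # assumed game-to-game noise not explained by the regression


def simulate_margin(rng, home):
    """Draw one game's predictors and return the simulated point margin."""
    margin = INTERCEPT + COEFS["home"] * home
    for name, (mean, sd) in DISTS.items():
        margin += COEFS[name] * rng.gauss(mean, sd)
    return margin + rng.gauss(0.0, RESIDUAL_SD)


def estimate(n, home, seed=0):
    """Return (average simulated margin, share of simulated games won)."""
    rng = random.Random(seed)
    margins = [simulate_margin(rng, home) for _ in range(n)]
    mean_margin = sum(margins) / n
    win_prob = sum(m > 0 for m in margins) / n
    return mean_margin, win_prob


if __name__ == "__main__":
    for home in (1, 0):
        spread, win = estimate(20000, home)
        print("home" if home else "away", round(spread, 1), round(win, 3))
```

The win probability here comes straight from the share of simulated margins above zero, so no logistic model is needed. Correlated predictors would need a joint distribution instead of independent draws.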

hollywoodMarine · Feb 16, 2014

wizkid83 said: ↑

If, for example, the Rockets' offense is extremely consistent (let's just say we never shoot more than 55% and never less than 51%) while our defensive FG% varies greatly between 40% and 60%, then defensive FG% will likely be a much better predictor of our win/loss while the offensive FG% likely won't even register as significant in modeling.

Sorry for the late reply. Just wanted to say that is an excellent point. Although the hypothetical example is not actually what is happening (opponent FG% and Rockets FG% have roughly the same standard deviation and thus consistency), I understand your general logic and concern.

Since opponent FG% and Rockets FG% are roughly equally consistent, I will use 2-point FG% and 3-point FG% as the example instead (those two differ in standard deviation: 1 SD for 2-point FG% is 6 percent, while 1 SD for 3-point FG% is 10 percent). You are right that the standardized effect of 3-point shooting may look inflated because it varies so much. So why bother identifying standardized coefficients?

I guess a way to look at it is that SDs tell you what is likely and what is unlikely. For example, anything within 1 SD of the center of a distribution (the middle of the bell curve) is likely (almost 70% of cases), while going past that in either direction you begin to enter unlikely territory. "Likelihood" may also suggest how "hard" it may be to improve by a given amount (this is MY interpretation). If the Rockets only had to improve by, let's say, 0.15 SD in 2-point shooting, it should not be too difficult: they are currently around 54% and would just need to improve by about 1 percentage point, to 55%, which they have probably achieved before in quite a few games. However, if they were to improve by 2 SD, that would be like improving by 12 percentage points to 66%, which, because of how rare such shooting performances were in the past (how far they are from the center of the bell curve), we assume is much more difficult. So the assumption here is that the greater the improvement you want in SD terms, the more difficult or unlikely it is.
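The arithmetic in that paragraph can be checked with a few lines of Python. The SD and average values are the ones quoted in the post; the helper names are just for illustration, and the "almost 70%" figure assumes the game-by-game percentages are roughly normally distributed.

```python
import math

SD_2PT = 6.0    # percentage points, from the post
AVG_2PT = 54.0  # current 2-point FG%, from the post


def share_within(k):
    """Fraction of a normal distribution lying within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


def improved(avg, sd, k):
    """New average after improving by k standard deviations."""
    return avg + k * sd


print(round(share_within(1), 4))                  # about 0.6827, i.e. "almost 70%"
print(round(improved(AVG_2PT, SD_2PT, 0.15), 1))  # about 54.9, roughly 55%
print(round(improved(AVG_2PT, SD_2PT, 2), 1))     # 66.0
```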

We already know from the standardized coefficients that improving by 1 SD in 3-point shooting gives the Rockets a greater positive impact than improving by 1 SD in 2-point shooting. For the Rockets to get the same amount of positive impact from their 2-point shooting as from their 3-point shooting, then, they would need to improve their 2-point shooting percentage by MORE than 1 SD, to a point where their new average performance would represent rarer occurrences in the past, compared to improving "only" 1 SD above the current average for 3-point shooting (which statistically would represent less rare occurrences in the past). Even if the actual percentage increase required to improve by more than 1 SD in 2-point shooting is not as large as the amount required to improve by 1 SD in 3-point shooting, we still assume it is more difficult to improve the 2-point shooting by > 1 SD, because being at a point > 1 SD above average is rarer than being at exactly 1 SD above average.

I don't know if I am making sense, and the assumption that improving 1 SD in one area is as hard as improving 1 SD in another area is probably oversimplified and not completely correct either. I guess the question really comes down to this: do you think it would require the same amount of effort for the Rockets to improve their 2-point shooting by 1 SD (6%) from 54% to 60% as for them to improve their 3-point shooting by 1 SD (10%) from 35% to 45%? If you do, then the model still holds, and it would still be much more worth it to focus on 3-point shooting as the area for improvement. If you don't, then this model is useless.

Clarinetmonster · Feb 16, 2014

Didn't read yet, but just for fun used the find function to find out that in your post, the key Rox players are mentioned this many times:
Harden - 2
Dwight - 2
Bev - 2
Jones - 0
Parsons - 0
Lin - 9

hollywoodMarine · Feb 16, 2014

wizkid83 said: ↑

Additional thought: wouldn't the opponent FG% just be more a factor of the quality of the opponent's offense rather than anything in the Rockets' control? (and likely highly correlated with opponent win-loss record).

I'd be interested to see what the model would say if we used defensive FG% difference (delta of what the opponent shot vs. their season average) and offensive FG% difference (delta of what we shot vs. our season average) in the models instead, and see which one comes out the better predictor.

That is a great idea, but where can I get such stats? (that is, defensive FG% delta and offensive FG% delta as broken down by game?)

jtr said: ↑

After a quick scan:

Free throw percentage is pretty much a meaningless stat. It does not affect the outcome of games per se. However, free throws made have an enormous effect on games. Rather than go on about which stats are actually meaningful, you should just look at the four factors. These are metrics that have basically been vetted by the basketball community for importance. The four factors have been through rigorous statistical investigations in the past. But your post certainly is an interesting exercise.

I believe the four factors are included in the model (shooting, turnovers, rebounding, and free throws). As you said, FT may not have as much of an effect on point differential, but should have an effect on wins/losses. Logistic regression was done to investigate that but for some reason our free throws had very little effect, even in my later re-test (the first one was fail lol).

Also, splitting up the shooting section of the 4 factors allowed for examination of which was more important for the Rockets winning, 3-point shooting or 2-point shooting. The data suggests maybe 3-point shooting is more important for this team. Thx for the heads up on the 4 factors tho, gonna take a look at them (and the weight assigned to each of the 4) later this week if I get the chance.

haoafu said: ↑

Applaud the effort. Aside from small sample size, the strength of schedule and lineup may need to be accounted for.

There's a multicollinearity issue in the model as well, with highly correlated predictor variables.

There is definitely some collinearity (it would have been even worse if I had included assists and blocks), but my understanding was that a VIF below 5 is acceptable. Is that cutoff too liberal?
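For context, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors; 5 and 10 are both commonly cited rule-of-thumb cutoffs. A minimal sketch for the two-predictor case, where that R^2 is just the squared correlation between the two. The numbers are made up for illustration, not taken from the Rockets data:

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)


# Made-up predictor values with correlation 0.8.
steals = [1, 2, 3, 4, 5]
opp_turnovers = [2, 1, 4, 3, 5]
print(round(vif_two_predictors(steals, opp_turnovers), 2))  # about 2.78
```

Even a correlation of 0.8 between two predictors stays under a cutoff of 5 here, which gives a feel for how permissive that threshold is.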

don grahamleone said: ↑

Buddy Love, can you put that in rich dummy terms?

I wish I could understand the symbols and at least try to figure it out on my own.

Sure. Just look at the second table in the post, look at the numbers that are circled, the greater the absolute value, the more "important" the corresponding part of the box score (in the first column) is in predicting point differential in a hypothetical game. Negative = bad, positive = good

Opponent two point shooting is most important and very bad. Rockets' three point shooting is 2nd most important and very good. Outside of shooting, turnovers are the most important and very bad. Steals are good but do not make up for bad effects of turnovers.

Ignore logistic regression model, it is f*ed up. The new one is at the top of the second page (with summary in that post)

heypartner said: ↑

Fun post. I'm on a phone, so I haven't read it all yet, but I wanted to stop at your first example, the Wizards game.

So when the real games don't match the stats, you explain that away as "luck"? Did your professor teach you that?

I was just trying to be optimistic :grin:

Clarinetmonster said: ↑

Didn't read yet, but just for fun used the find function to find out that in your post, the key Rox players are mentioned this many times:
Harden - 2
Dwight - 2
Bev - 2
Jones - 0
Parsons - 0
Lin - 9

That's because I had a long paragraph where I ranted about his TO's

gene18 said: ↑

I did not see the R-squared or a significance level.

Thanks for pointing that out, don't know why I forgot to include that. R square for the multiple regression was .888, p<.001, and the significance for the new logistic regression model I believe is on the relevant post

durvasa · Feb 16, 2014

hollywoodMarine said: ↑

That is a great idea, but where can I get such stats? (that is, defensive FG% delta and offensive FG% delta as broken down by game?)

That's just the team game log together with the season team statistics, right?

hollywoodMarine · Feb 16, 2014

durvasa said: ↑

That's just the team game log together with the season team statistics, right?

Wow, I had been trying to find out how to get team stats like that broken down by game for awhile now, and just could not figure it out. This is awesome, thx

wizkid83 said: ↑

Additional thought: wouldn't the opponent FG% just be more a factor of the quality of the opponent's offense rather than anything in the Rockets' control? (and likely highly correlated with opponent win-loss record).

I'd be interested to see what the model would say if we used defensive FG% difference (delta of what the opponent shot vs. their season average) and offensive FG% difference (delta of what we shot vs. our season average) in the models instead, and see which one comes out the better predictor.

...

Have you had a chance to run a model with my suggested changes to the FG% variables? I really think you'd draw the wrong conclusion without normalizing for competitors' performance. Otherwise you might just be saying the most important factor in the Rockets winning or losing is the quality of the opponents faced rather than what's actually controllable by the Rockets.

I realized there is a slight problem with the original suggestion of using delta FG% for opponents (their FG% minus their season average) and delta FG% for the Rockets (Rockets' FG% minus Rockets' season average) to determine the Rockets' level of offense or defense in a way that accounts for quality of opponents. That only accounts for quality of opponents on their offensive end, but not on their defensive end. Subtracting the Rockets' season average from the Rockets' FG% for each game does not account for the opponents' defense, because the Rockets' season average is the same constant for every game. All you're really doing is shifting the FG% distribution to the left (so that the center is at zero), while the variance and SD are unchanged.
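A quick way to see the point about shifting: subtracting one constant from every game leaves the spread untouched, while subtracting a value that differs by opponent does change it. A minimal sketch with made-up numbers (both lists are hypothetical, including the per-opponent "FG% allowed" figures):

```python
import statistics

# Made-up per-game Rockets FG% values, for illustration only.
rockets_fg = [44.0, 51.5, 47.2, 53.8, 40.1, 49.4]
season_avg = statistics.mean(rockets_fg)

# Delta vs. own season average: the same constant is subtracted from every game.
delta_own = [g - season_avg for g in rockets_fg]

# Delta vs. what each opponent usually allows (hypothetical opponent FG% allowed).
opp_allowed = [45.0, 47.5, 43.0, 48.0, 44.5, 46.0]
delta_vs_opp = [g - a for g, a in zip(rockets_fg, opp_allowed)]

print(statistics.stdev(rockets_fg), statistics.stdev(delta_own))  # same spread
print(statistics.mean(delta_own))                                 # approximately 0
print(statistics.stdev(delta_vs_opp))                             # spread changes
```

Because a regression coefficient is unaffected by subtracting a constant from a predictor (only the intercept moves), the first adjustment cannot change the model's conclusions; only a game-specific baseline can.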

To determine how good the Rockets' offense is in a way that accounts for variance in the quality of opponents' defense, the difference must be taken between the Rockets' FG% for each game and that opponent's opponent FG% (the FG% that team usually allows). If I leave that part out, and only account for the fluctuation in opponents' offense in this model, then the magnitude of effect of the defense variables (opponent FG%) would be "unfairly" reduced compared to the offense variables (Rockets' FG%). If I were to also account for the fluctuation in quality of opponents in terms of defense, that would balance things out between the effects of the Rockets' and opponent FG% variables, but now their effects would be unfairly reduced as a whole compared to turnovers, steals and other variables (because the quality of opponents can also fluctuate in terms of how well they take care of the ball, how good they are at making steals, rebounding, etc.).

Basically, it comes down to your initial concern, in which you asked "wouldn't the opponent FG% just be more a factor of the quality of the opponent's offense rather than anything in the Rockets' control?" The answer is a partial yes, but then so are the Rockets' FG%, Rockets rebounds, steals, turnovers, and all these other things on the box score, which are partially affected by quality of opponents as well. In a way, it actually balances out, because all these things in the model, besides FT's, are affected by quality of opponents.

Of course, opponent quality probably does not affect these things equally. Obviously different teams have different strengths and weaknesses (e.g., one team may have great offense but poor defense). So the best model would account for ALL of these aspects of opponent quality fluctuation, but a model that does not account for any of them is, IMO, still decent (at least better than a model that only partially accounts for opponent quality).

In any case, I can either make a more complicated model that accounts for all of that, or not do anything.

And... frankly, I think I've done enough stats for one day

PS: I will remember your suggestion and finish a better regression model next time I have a huge chunk of free time

glimmertwins · Feb 16, 2014

That's a ton of statistical evidence to point out what I thought was fairly obvious about this team already.

...not criticizing, I love the stats too but just had a little chuckle to myself.

Old Man Rock · Feb 16, 2014

I love the effort but in the words of McHale, "all that analytics does is tell me stuff I already knew."

Great effort though you lost me early on.

burlesk · Feb 16, 2014

Warning: This post adds nothing of substance to this discussion.

Suggestion: a Statistical Analysis section for the bbs...

... solely so that I, burlesk, know where not to go if I want to avoid feeling stupider than I normally do.

JK -- though a section like that does seem like it might be kinda cool.

It's kinda funny, too, because I'm pretty good in most areas of math, but have always had a huge hole in my brain where statistics and probability should go. My brain goes into emergency shutdown mode almost immediately when faced with posts like this.

I don't object to the existence of advanced stats; I enjoy reading popular treatments on what such things can tell us (really enjoyed reading The Drunkard's Walk by Leonard Mlodinow, for a tangential example); I just don't want to see the inner workings. Kind of a don't ask, don't tell thing...

Seriously, though, it's truly impressive stuff, and I hope maybe you guys will inspire me to tackle it again someday...

krmclaughlin · Feb 16, 2014

Me after reading this post:

New · Feb 16, 2014

Did you use all the games to do the regression? Can you show your R2 values and also add a 2D plot of predicted vs. actual? It is important to assess the predictive power of your model first, before we draw any conclusions.

hollywoodMarine · Feb 17, 2014

New said: ↑

Did you use all the games to do the regression? Can you show your R2 values and also add a 2D plot of predicted vs. actual? It is important to assess the predictive power of your model first, before we draw any conclusions.

Here ya go

And yea, someone else pointed out I had forgotten to include R2 and p values (facepalm!). For multiple regression, R square was .888, p<.001. For logistic regression, see the earlier post at the beginning of page 2 (I had to re-test because the first one had too many IV's).
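For anyone wanting to reproduce the predicted-vs-actual check by hand, R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up point differentials (these are not the Rockets' games, and the function name is just for illustration):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up actual and model-predicted point differentials for six games.
actual = [9, -4, 12, 1, -7, 5]
predicted = [7, -2, 10, 3, -6, 4]
print(round(r_squared(actual, predicted), 3))  # about 0.934
```

Note that an R squared computed on the same games used to fit the model tends to overstate how well it would predict new games; holding out some games for testing gives a more honest picture.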

Yes all games were included. Do note someone earlier suggested using OREB% rather than OREB, and that did change some things (namely oppOREB no longer significant)

glimmertwins said: ↑

That's a ton of statistical evidence to point out what I thought was fairly obvious about this team already.
...not criticizing, I love the stats too but just had a little chuckle to myself.

Old Man Rock said: ↑

I love the effort but in the words of McHale, "all that analytics does is tell me stuff I already knew."

No problem. I'm not so intuitive when it comes to actual common sense basketball knowledge, so I need numbers to help me out a little bit :grin:

Although, even though we all know defense is important, as is getting the 3's to fall, and cutting down on TO's, etc., if the regression models can be polished up some more they have the potential to tell us which is more important than the others, which can be insightful

(currently one model suggests defending and lowering opponent 2 pnt FG% is most important, while the other suggests improving the Rockets' 3 point shot is most important)

burlesk said: ↑

My brain goes into emergency shutdown mode almost immediately when faced with posts like this.

That's my bad. Maybe went a little overboard on the technical stuff for this post lol..

burlesk · Feb 17, 2014

hollywoodMarine said: ↑

That's my bad. Maybe went a little overboard on the technical stuff for this post lol..

Naw, hM, it's not you -- I just have some weird but serious block in my brain about statistics. ANY statistical analysis at any depth causes me to mentally melt down. It's kind of why I can't play chess, either... I'm a fairly smart feller in many ways, though...

wizkid83 · Feb 17, 2014

hollywoodMarine said: ↑

To determine how good the Rockets' offense is in a way that accounts for variance in the quality of opponents' defense, the difference must be taken between the Rockets' FG% for each game and that opponent's opponent FG% (the FG% that team usually allows). If I leave that part out, and only account for the fluctuation in opponents' offense in this model, then the magnitude of effect of the defense variables (opponent FG%) would be "unfairly" reduced compared to the offense variables (Rockets' FG%). If I were to also account for the fluctuation in quality of opponents in terms of defense, that would balance things out between the effects of the Rockets' and opponent FG% variables, but now their effects would be unfairly reduced as a whole compared to turnovers, steals and other variables (because the quality of opponents can also fluctuate in terms of how well they take care of the ball, how good they are at making steals, rebounding, etc.).

PS: I will remember your suggestion and finish a better regression model next time I have a huge chunk of free time

Yeah, had a total brain fart moment. You should be using the opponent's FG% and our FG% in each game vs. the opponent's season offensive and defensive FG% as a predictor.

Also, instead of using 3 pt% and 2 pt%, why not just use TS%, which adjusts for all of that in one variable?

TS% = (PTS * 100) / (2 * (FGA + 0.44 * FTA))
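As a small worked example of that formula, here is a Python helper. The function name and the box-score line are made up for illustration:

```python
def true_shooting_pct(pts, fga, fta):
    """TS% = 100 * PTS / (2 * (FGA + 0.44 * FTA))."""
    attempts = 2 * (fga + 0.44 * fta)
    if attempts == 0:
        raise ValueError("no shooting attempts")
    return 100.0 * pts / attempts


# Made-up team line: 110 points on 80 field goal attempts and 25 free throw attempts.
print(round(true_shooting_pct(110, 80, 25), 1))  # about 60.4
```

Since TS% folds twos, threes and free throws into one number, using it would trade away the ability to compare 2-point and 3-point shooting separately.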

FV Santiago · Feb 17, 2014

I am a big fan of multiple regression and backward regression models and used them extensively (StatTools) when I was involved in gambling on sports. When it comes to predicting over/unders on NBA games I came to the same conclusion as you -- the single most important independent variable was always related to defense. This creates a lot of betting opportunities because the public at large is trained to look for offense in an NBA game. So if you pit the Steve Nash Phoenix Suns against the Ben Wallace Detroit Pistons, it will consistently be the Pistons' defense that has the bigger input into the pace of play and overall points scored.

The hole however with regression analysis in sports is that you can't normalize your sample for injuries, lineup changes and trades. Back-to-backs also impact performance, as do psychological things like coming off a big win or being on a streak. That said, it's still a great tool and creates a lot of interesting insights.
