When it comes to nitty-gritty hockey discussions, ones on zone starts are among the most interesting because they touch on just about every aspect of the game. From on-ice strategy, to what goes on behind the bench, to front office management and statistical analysis, how to treat zone starts provides ample discussion for just about any hardcore hockey fan.
Part of what makes the discussion so interesting is the difficulty of getting a handle on what impact zone starts have on individual player performance or what conclusions we can draw from the data, the latter of which we've looked at here at Japers' Rink in the past.
On Wednesday, The Post's Capitals Insider blog ran another look at zone starts - specifically how they impact scoring as it pertains to the Capitals - by noted local stat-head Neil Greenberg. Greenberg notes that, as we might expect, offensive zone starts and even-strength point production are positively correlated. He also notes that using zone starts to predict even-strength points yields an R-Squared value of .70, which means that "offensive-zone starts explain 70.3 percent of the variation in even-strength points scored," and predicts "an extra offensive start per game could lead to 9-10 more points scored over the course of an 82-game season." While this may be true from a strictly mathematical standpoint (and don't worry if your eyes glazed over on that last sentence - we'll explain), we have to be careful about drawing such a conclusion when it comes to what's happened - or is likely to happen - on the ice.
Backing up for a moment, let's have a quick explanation of just what the heck an R-Squared value is. As noted in the Capitals Insider piece, R-Squared is a measure of model accuracy. More specifically, and in very basic terms, it is the percentage of variation in the dependent variable that can be explained by the model which has been developed. For example, consider the following dataset, with the goal of explaining 'X' using 'a' and 'b':
In this case 'X' can be completely explained by a relatively simple algebraic formula: X = 2a + b. Because the dependent variable ('X') can be completely explained by the independent variables ('a' and 'b'), this formula's R-Squared would be 1.00, or 100 percent. Of course, real-world data is never this clean, and an R-Squared of .70 is generally very solid.
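To make the R-Squared arithmetic concrete, here is a minimal Python sketch. The values for 'a' and 'b' are made up for illustration (they are an assumption, not the dataset referred to above), and the function name `r_squared` is likewise just a label chosen here. It computes R-Squared as one minus the ratio of unexplained variation to total variation.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that `predicted` accounts for."""
    mean_actual = sum(actual) / len(actual)
    # Variation the formula fails to explain
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total variation of the actual values around their mean
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_residual / ss_total


# Hypothetical values, invented for this example only
a = [1, 2, 3, 4, 5]
b = [3, 1, 4, 1, 5]
x = [2 * ai + bi for ai, bi in zip(a, b)]  # X = 2a + b holds exactly

# Predictions from the full formula leave nothing unexplained
full_formula = [2 * ai + bi for ai, bi in zip(a, b)]
perfect_fit = r_squared(x, full_formula)

# A cruder formula that ignores 'b' leaves part of the variation unexplained
a_only_formula = [2 * ai for ai in a]
partial_fit = r_squared(x, a_only_formula)

print(perfect_fit, partial_fit)
```

With these invented numbers the full formula scores exactly 1.00, while the formula that drops 'b' scores well below it.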
However, creating a model with only one explanatory variable inherently changes things. Rather than thinking of it in terms of one variable having predictive power, it is perhaps best to think of the R-Squared value as the extent to which the two variables move together.
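One way to see this: in a straight-line regression with a single explanatory variable, R-Squared equals the square of the ordinary (Pearson) correlation between the two variables, and it comes out the same whichever variable is treated as the "predictor." The sketch below checks that on a handful of invented numbers; the data and function names are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def one_variable_r_squared(xs, ys):
    """Fit ys = slope * xs + intercept by least squares, return R-Squared."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_residual = sum((y - (slope * x + intercept)) ** 2
                      for x, y in zip(xs, ys))
    ss_total = sum((y - my) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Hypothetical paired observations, invented for this example
first = [1, 2, 3, 4, 5, 6]
second = [2, 3, 7, 6, 10, 11]

r = pearson_r(first, second)
forward = one_variable_r_squared(first, second)   # second explained by first
backward = one_variable_r_squared(second, first)  # first explained by second

print(r ** 2, forward, backward)
```

All three printed values agree up to rounding, which is why a one-variable R-Squared on its own cannot say which variable is doing the driving.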
Take retail as an example. Say I own two hockey equipment stores, one in Toronto and one in Mississauga, that Toronto has many more people than Mississauga, and that my monthly sales for the Toronto store are always exactly four times greater than the sales at the Mississauga store. We'll stipulate further that the equation T = 4M completely explains the relationship between the two - yielding an R-Squared of 1.00.
What's critically important to note is that even an R-Squared value of 1.00 does not mean that sales in the Mississauga store are driving sales in the Toronto store (or vice versa). Instead, the correlation is the result of the fact that the two stores are selling identical products in markets with similar demographics, going up against similar competitors, etc. Sales in Toronto don't go way up in the early fall and Christmas season because the Mississauga store is selling more; they go way up because that's when people are buying hockey equipment. Sales in both stores are being driven by a third variable, total demand for hockey equipment.
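The store example can be sketched in a few lines of Python. Everything here is hypothetical: the monthly demand figures, the `simulate_sales` helper and the `mississauga_boost` parameter are invented to illustrate the point, under the assumption that each store's sales depend only on demand.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_sales(monthly_demand, mississauga_boost=0):
    """Both stores respond to demand; neither store's sales feed the other."""
    mississauga = [d + mississauga_boost for d in monthly_demand]
    toronto = [4 * d for d in monthly_demand]
    return mississauga, toronto


# Hypothetical monthly demand index, peaking in early fall and at Christmas
monthly_demand = [40, 35, 30, 25, 20, 20, 25, 45, 80, 70, 60, 95]

mississauga, toronto = simulate_sales(monthly_demand)
r_squared = pearson_r(mississauga, toronto) ** 2  # T = 4M, so this is 1.00

# Now lift Mississauga sales directly (say, with a local promotion)
boosted_mississauga, toronto_after = simulate_sales(
    monthly_demand, mississauga_boost=10
)

print(r_squared, toronto_after == toronto)
```

Despite the perfect R-Squared, pushing Mississauga's sales up leaves Toronto's numbers exactly where they were, because demand is the only thing moving Toronto in this toy setup.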
Back to hockey on the ice, and the same kind of exogenous third variable effect (also known as confounding) is likely in play when it comes to zone starts and point production. Underlying each are two very significant factors: offensive ability and quality of teammates. Offensive ability is a major factor simply because good offensive players are more likely to start in the offensive zone than in the defensive zone. This is not always the case, of course, and it varies in degrees (think of Nicklas Backstrom getting defensive zone starts because he is one of the Capitals' better face-off men or Ryan Kesler seeing a lot of draws in his own end in order to free up the Sedin twins for offensive zone starts), but for the most part, the guys getting offensive zone starts are solid scorers whose offensive upside is their most significant contribution to the team. Thus offensive ability is driving both high offensive zone start totals and high point totals.
Teammates can have a similar effect. Generally speaking, good offensive players play with good offensive players - Nicklas Backstrom generally plays with Alexander Ovechkin and both likely see an increase in their point totals as a result. At the same time, having both of those players on the same line serves as an even more powerful incentive to try and get that line offensive zone starts. Again an exogenous variable, offensive ability of teammates, is driving both points and offensive zone starts.
Similarly, there may be bidirectional causation. Offensive zone starts probably increase point production, because common sense says they would. At the same time, as a player becomes a more dangerous scoring threat, there is more incentive to get him offensive zone starts. Consequently, while zone starts may result in increased scoring numbers, increased scoring numbers may result in more offensive zone starts, and teasing out which is truly driving the other may be difficult, if not impossible.
All of these issues become even bigger concerns when using a one-variable model. The goal of a regression is to account for anything that could affect your dependent variable in order to isolate the effect of each. A one-variable model, especially one with an independent variable as hairy as the one we've discussed here, has a great deal of difficulty doing this because it's simply impossible to isolate the effect of one variable without considering any others.
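A small simulation can show how leaving out a variable inflates the apparent effect of the one that remains. This is a sketch under invented assumptions: the "ability" scale, every coefficient, and the premise that ability can be measured at all are made up for illustration and are not estimates from real NHL data.

```python
import random

TRUE_ZONE_START_EFFECT = 1.0  # assumed points gained per extra zone start


def simulate_players(n_players, seed=0):
    """Invented league in which ability drives both zone starts and points."""
    rng = random.Random(seed)
    ability, zone_starts, points = [], [], []
    for _ in range(n_players):
        skill = rng.random()  # hypothetical offensive ability, 0 to 1
        # Better scorers are handed more offensive zone starts
        starts = 5 + 5 * skill + rng.gauss(0, 1)
        # Points depend heavily on ability and only a little on zone starts
        pts = 20 + 40 * skill + TRUE_ZONE_START_EFFECT * starts + rng.gauss(0, 3)
        ability.append(skill)
        zone_starts.append(starts)
        points.append(pts)
    return ability, zone_starts, points


def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def one_variable_slope(x, y):
    """Least-squares slope of y on x alone."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def two_variable_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (2x2 normal equations)."""
    a, b, c = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


ability, zone_starts, points = simulate_players(500)

# One-variable model: zone starts get credit for what ability is doing
naive_slope = one_variable_slope(zone_starts, points)

# Two-variable model: ability is accounted for separately
adjusted_slope, ability_slope = two_variable_slopes(zone_starts, ability, points)

print(naive_slope, adjusted_slope)
```

In this toy league the one-variable slope lands several times higher than the effect built into the simulation, while the two-variable slope lands close to it.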
Thus, while the regression may suggest that an additional offensive zone start each game may yield nine or ten more points over the course of the season, it's hard to believe this would be the case in practice, given that the effect of an offensive zone start has not been effectively isolated and may be overstated as a result. Put another way, it's difficult to accept the supposition that an additional zone start could yield a double-digit increase in points. Taken to its logical extreme, it would suggest that simply adding five more offensive zone starts per game would net Alex Ovechkin an additional 50 points over the course of a season, and that another dozen O-zone starts would put him in contention for the greatest season ever. In the other direction, it would suggest that seven fewer offensive-zone starts per game last year would have left Nick Backstrom virtually point-less on the season.
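The extrapolation is simple multiplication, sketched below. The 9.5 figure is just the midpoint of the quoted "9-10 more points" and is an assumption for illustration, as is the function name.

```python
# Midpoint of the quoted "9-10 more points" per extra start per game (assumed)
POINTS_PER_EXTRA_START = 9.5


def projected_point_change(extra_starts_per_game,
                           points_per_start=POINTS_PER_EXTRA_START):
    """Season point change implied by reading the regression slope literally."""
    return extra_starts_per_game * points_per_start


five_more = projected_point_change(5)     # roughly the 50 points cited above
seven_fewer = projected_point_change(-7)  # a swing of more than 60 points

print(five_more, seven_fewer)
```

Nothing in the straight-line reading caps these totals, which is exactly why the implied swings look implausible.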
It is this complexity and interrelation that make sports analytics so challenging and, consequently, what should make strategists pause when they're attempting to use data to make decisions about what happens on the ice. Getting the big guns out for more offensive-zone starts will certainly help the Caps' offensive production... but right now it's impossible to say by how much.