2018-2019 NBA Champion Toronto Raptors Analysis

A Kaggle version is here

The Toronto Raptors won the 2018-2019 NBA Championship, which was really exciting. I was still rewatching the playoffs months after the season was over. As a data analyst, I collected data on the Raptors and did this analysis. Since tree methods show the relationship between the feature variables and the response variable explicitly, they are an appropriate way to reveal the factors behind winning.

Figure 1: Kawhi Leonard versus Kevin Durant (KD), who was injured during Game 5

library(readr)
library(dplyr)
library(tidyr)
library(magrittr)
library(party)
library(rpart)
library(rpart.plot)
library(ggplot2)
library(corrplot)

Regular Season

First, I analyze the Raptors' team-level data.

regular<- raptorsreg %>% 
  arrange(DATE) %>% 
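  mutate_if(is.character, ~ na_if(., '-')) %>%  # '-' marks a missing value; applied per column so it also works with dplyr >= 1.1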
  rename('TPP'='3P%','FTP'='FT%',
         'FGP'='FG%','win'='W/L') %>%
  mutate_if(sapply(.,is.character),as.factor)

corrplot(cor(regular %>% 
               select(., PTS:PF), 
             use = 'complete.obs'), 
         method = 'circle')

Figure 2: Correlation Plot

# tree <- ctree(formula, regular,
#               controls = ctree_control(mincriterion = 0.01, minsplit = 3))
# plot(tree)
# plot(tree, type = "simple")
formula <- win~PTS+FGM+FGA+FTM+FTA+OREB+DREB+REB+AST+STL+BLK+TOV+PF+FTP+TPP+FGP
treept <- rpart(formula,regular,
                control = rpart.control(cp=0.05,minsplit=3))

rpart.plot(treept, branch = 0.4)

Figure 3: When the Raptors score more than 111 points and attempt more than 81 field goals, they are very likely to win. When they score fewer than 111 points and attempt more than 82 field goals, the tree predicts a loss.
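
The thresholds quoted in the caption can also be read from the fitted object rather than from the plot. Here is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions, not the Raptors data); it relies on the `splits` matrix that `rpart` stores in every fit.

```r
library(rpart)

# Made-up games: a win whenever the team scores more than 110 points.
toy <- data.frame(
  PTS = seq(90, 130, by = 2),
  FGA = rep(c(80, 85, 90), 7)
)
toy$win <- factor(ifelse(toy$PTS > 110, "W", "L"))

fit <- rpart(win ~ PTS + FGA, data = toy,
             control = rpart.control(cp = 0.05, minsplit = 3))

# The first row of the splits matrix is the primary split at the root;
# its "index" column holds the threshold for a numeric variable.
root_var       <- rownames(fit$splits)[1]
root_threshold <- unname(fit$splits[1, "index"])
root_var
root_threshold

# Predicted class for a new, hypothetical game
new_game <- data.frame(PTS = 120, FGA = 85)
pred <- predict(fit, new_game, type = "class")
pred
```

Applied to `treept`, the same two lines would return the variable and cut point at the root of the regular-season tree.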

ggplot(regular)+
  geom_point(aes(PTS,FGA,
                 shape=win,col=win)) +
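  # partition borders read off the tree (PTS = 111, FGA = 81);
  # annotate() draws each segment once instead of once per row
  annotate('segment', x = 111, y = 81,
           xend = 111, yend = 110,
           colour = 'purple') +
  annotate('segment', x = 111, y = 81,
           xend = 141, yend = 81,
           colour = 'purple') +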
  annotate('text', x =130,y=95, label='Win', size=10,color='turquoise4') +
  annotate('text', x =95,y=85, label='Lose', size=10,color='tomato') +
  labs(x='Points',y='Field Goals Attempted')

Figure 4: Partition borders of the CART tree.

Players Variable Effect

To explore each player's effect on game results, I transform the data to wide format, in which every column holds one statistic for one player, such as points, minutes per game, or field goals. Be aware that converting multiple columns into wide format at once is a little tricky; my solution is shown in the code below.
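
The reshaping idea is easier to follow on a tiny table first. The sketch below uses a made-up `toy` data frame (two games, two hypothetical players "A" and "B") and the same `gather`, `unite`, `spread` sequence from tidyr.

```r
library(dplyr)
library(tidyr)

# Made-up box scores: two games, two players, two statistics.
toy <- data.frame(
  DATE   = c("g1", "g1", "g2", "g2"),
  PLAYER = c("A", "B", "A", "B"),
  win    = c("W", "W", "L", "L"),
  PTS    = c(20, 10, 15, 12),
  MIN    = c(35, 28, 33, 30),
  stringsAsFactors = FALSE
)

toy_wide <- toy %>%
  gather(key, value, PTS, MIN) %>%  # one row per game, player and statistic
  unite(col, key, PLAYER) %>%       # build names such as "PTS_A" and "MIN_B"
  spread(col, value)                # one row per game, one column per player-statistic

toy_wide
```

The result has one row per game and the columns `MIN_A`, `MIN_B`, `PTS_A` and `PTS_B` next to `DATE` and `win`. In newer versions of tidyr, `pivot_wider()` with several `values_from` columns is an alternative to this three-step sequence.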

Moreover, I do not want to include unnecessary noise in the dataset: some substitute players contribute very little, so I remove their rows too.

As we know, Leonard is an absolute superstar who maintains high efficiency on a high volume of shots almost every game, so I really hope to see his effect in this analysis. In fact, his 2019 run ranks 3rd on the all-time list of single-postseason scoring totals. His efficiency is amazing too, not to mention his outstanding defense.

The top 12 highest-scoring postseasons, along with the number of field goals attempted:

Rank  Player             Year  Points  FGAs
1     Michael Jordan     1992  759     581
2     LeBron James       2018  748     510
3     Kawhi Leonard      2019  732     496
4     Hakeem Olajuwon    1995  725     576
5     Allen Iverson      2001  723     661
6     Shaquille O’Neal   2000  707     505
7     LeBron James       2012  697     502
8     Kobe Bryant        2009  695     530
9     Michael Jordan     1998  680     526
10    Kobe Bryant        2010  671     511
11    Michael Jordan     1993  666     528
12    Hakeem Olajuwon    1994  664     514

Leonard is the only player on this list who needed fewer than 500 field goal attempts. By comparison, Iverson was really inefficient: he took 661 attempts to score 723 points.
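
To put a number on that comparison, points per field goal attempt can be computed directly from the table. This is only a rough efficiency measure (it ignores free throws), and the column names below are my own.

```r
# Points and field goal attempts copied from the table above
postseasons <- data.frame(
  player = c("Michael Jordan 1992", "LeBron James 2018", "Kawhi Leonard 2019",
             "Hakeem Olajuwon 1995", "Allen Iverson 2001", "Shaquille O'Neal 2000",
             "LeBron James 2012", "Kobe Bryant 2009", "Michael Jordan 1998",
             "Kobe Bryant 2010", "Michael Jordan 1993", "Hakeem Olajuwon 1994"),
  points = c(759, 748, 732, 725, 723, 707, 697, 695, 680, 671, 666, 664),
  fga    = c(581, 510, 496, 576, 661, 505, 502, 530, 526, 511, 528, 514)
)

# Rough efficiency: points scored per field goal attempt
postseasons$pts_per_fga <- round(postseasons$points / postseasons$fga, 2)

# Sort from most to least efficient
postseasons[order(-postseasons$pts_per_fga), c("player", "pts_per_fga")]
```

By this measure Leonard tops the list at about 1.48 points per attempt (732 / 496), while Iverson is last at about 1.09 (723 / 661).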

players <- regular_players %>% 
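  mutate_if(is.character, ~ na_if(., '-')) %>%  # '-' marks a missing value; applied per column so it also works with dplyr >= 1.1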
  filter(PLAYER!='Jodie Meeks',PLAYER!= 'Eric Moreland',
         PLAYER!='Jonas Valanciunas',PLAYER!='Jordan Loyd',
         PLAYER!='Malcolm Miller',PLAYER!='Chris Boucher',
         PLAYER!='Jeremy Lin',PLAYER!='Lorenzo Brown',
         PLAYER!='Malachi Richardson',PLAYER!='Marc Gasol',
         PLAYER!='Patrick McCaw',PLAYER!='Greg Monroe',
         PLAYER!='CJ Miles') %>% 
  rename('TPP'='3P%','FTP'='FT%',
         'FGP'='FG%','win'='W/L') %>% 
  mutate_if(sapply(.,is.character),as.factor) %>% 
  select(-'+/-') %>% 
  gather(key,value,-TEAM,-MATCHUP,-DATE,-PLAYER,-win) %>% #convert wide to long, one column is stat name, another is value
  unite(col,key,PLAYER) %>% #combine player's name and stat name, which will be column name later.
  spread(col,value) %>%  # convert long to wide
  select(-TEAM,-MATCHUP,-DATE) %>%
  replace(is.na(.),0) %>% 
  mutate_if(sapply(.,is.character),as.numeric)
ncol(players)
# [1] 172
head(data.frame(players)[1:5])
#   win X3PA_Danny.Green X3PA_Delon.Wright X3PA_Fred.VanVleet
# 1   W                2                 2                  2
# 2   L                6                 3                  6
# 3   W                3                 1                  8
# 4   W                8                 2                  6
# 5   W                0                 1                  4
# 6   W                3                 3                  5
#   X3PA_Kawhi.Leonard
# 1                  3
# 2                  2
# 3                  3
# 4                  0
# 5                  6
# 6                  5
# tree <- ctree(win~., players,
#               controls = ctree_control(mincriterion = 0.00001, minsplit = 9))
# plot(tree)

treept <- rpart(win~.,players,
                control = rpart.control(cp=0.0001,minsplit=6))

rpart.plot(treept, branch = 0.4)

players %>% 
  rename('TPP_Kyle_Lowry'='TPP_Kyle Lowry',
         'MIN_Pascal_Siakam'='MIN_Pascal Siakam',
         'MIN_Kawhi_Leonard'='MIN_Kawhi Leonard') %>% 
ggplot() +
  geom_point(aes(TPP_Kyle_Lowry,MIN_Pascal_Siakam,shape=win,col=win))+
  labs(x='3 point percent of Kyle Lowry',y='Minutes per game of Siakam')

According to the tree, Kyle Lowry's three-point accuracy is really important for winning. It is surprising that Kawhi Leonard does not appear on the tree, which runs contrary to common sense. That does not mean Leonard is insignificant: as a superstar, he delivers at pivotal moments, and that contribution may not be captured by the model. Besides, Leonard played only 60 games in the 2018-2019 regular season. He may have sat out regular-season games to stay fit for the playoffs, or he may simply not have played as well as he did in the postseason.
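
One technical reason a variable can be missing from the plotted tree while still carrying information is that `rpart` only draws primary splits. A variable that closely tracks the chosen one is kept as a surrogate and shows up in `variable.importance`. The sketch below illustrates this on made-up data (`x1`, `x2` and `toy` are hypothetical); it is not a claim about what happens in the Raptors data.

```r
library(rpart)

# Made-up data: x2 is a slightly scrambled copy of x1, and only x1
# separates the two classes perfectly.
x1 <- 1:20
x2 <- x1
x2[10] <- 11.5
x2[11] <- 9.5
toy <- data.frame(win = factor(ifelse(x1 > 10, "W", "L")), x1 = x1, x2 = x2)

fit <- rpart(win ~ x1 + x2, data = toy,
             control = rpart.control(cp = 0.0001, minsplit = 6))

# Variables that appear as primary splits (what rpart.plot draws)
used_in_tree <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
used_in_tree

# Importance also credits variables that act as surrogate splits
fit$variable.importance
```

Checking `treept$variable.importance` for the `_Kawhi Leonard` columns would show whether something similar happens here.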

Playoffs

playoffs <- raptorsplayoffs %>% 
  arrange(DATE) %>% 
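  mutate_if(is.character, ~ na_if(., '-')) %>%  # '-' marks a missing value; applied per column so it also works with dplyr >= 1.1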
  rename('TPP'='3P%','FTP'='FT%',
         'FGP'='FG%','win'='W/L') %>% 
  mutate_if(sapply(.,is.character),as.factor) 



tree <- ctree(formula,playoffs,
              controls = ctree_control(mincriterion = 0.01,minsplit=3))

plot(tree)

treept <- rpart(formula,playoffs,
                control = rpart.control(cp=0.001,minsplit=3))
rpart.plot(treept, branch = 0.4)

  • Since there is so little playoff data, the resulting trees are unstable. Even the discrepancies between the rpart and ctree results are huge.
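
A quick way to see this kind of instability is to refit the tree on bootstrap resamples and record which variable is chosen at the root. The sketch below uses the built-in `mtcars` data (32 rows) as a stand-in for a small sample of games; the seed, the formula and the helper `root_split` are assumptions made for illustration only.

```r
library(rpart)

set.seed(2019)  # arbitrary seed so the sketch is reproducible

# mtcars ships with R; am (transmission type) plays the role of win/loss
d <- transform(mtcars, am = factor(am))

# Fit a tree and return the variable used at the root split
root_split <- function(data) {
  fit <- rpart(am ~ mpg + hp + wt, data = data,
               control = rpart.control(cp = 0.001, minsplit = 3))
  as.character(fit$frame$var[1])  # "<leaf>" if the tree has no split
}

# Refit on 20 bootstrap resamples of the same small dataset
root_vars <- replicate(20, root_split(d[sample(nrow(d), replace = TRUE), ]))
table(root_vars)
```

If the table lists more than one variable, even the first split depends on which rows happen to be in the sample. The same loop could be run on `playoffs` with `formula`.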

Combine collective data and players’ personal data

  • Since Leonard has not shown up in any tree so far, I decided to combine the players' individual data with the team-level data. Besides, since the team's points are too directly tied to winning, I removed that variable.

players <- regular_players %>% 
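  mutate_if(is.character, ~ na_if(., '-')) %>%  # '-' marks a missing value; applied per column so it also works with dplyr >= 1.1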
  filter(PLAYER!='Jodie Meeks',PLAYER!= 'Eric Moreland',
         PLAYER!='Jonas Valanciunas',PLAYER!='Jordan Loyd',
         PLAYER!='Malcolm Miller',PLAYER!='Chris Boucher',
         PLAYER!='Jeremy Lin',PLAYER!='Lorenzo Brown',
         PLAYER!='Malachi Richardson',PLAYER!='Marc Gasol',
         PLAYER!='Patrick McCaw',PLAYER!='Greg Monroe',
         PLAYER!='CJ Miles') %>% 
  rename('TPP'='3P%','FTP'='FT%',
         'FGP'='FG%','win'='W/L') %>% 
  mutate_if(sapply(.,is.character),as.factor) %>% 
  select(-'+/-') %>% 
  gather(key,value,-TEAM,-MATCHUP,-DATE,-PLAYER,-win) %>% # wide to long
  unite(col,key,PLAYER) %>%
  spread(col,value) %>% # long to wide
  select(-TEAM,-MATCHUP) %>%
  replace(is.na(.),0) %>% 
  mutate_if(sapply(.,is.character),as.numeric)


intedt <- left_join(regular,players, by ='DATE', suffix = c('','.y')) %>% 
  select(-PTS,-win.y,-TEAM,-DATE,-MATCHUP,-MIN,-'+/-')

tree <- rpart(win~.,intedt,
                control = rpart.control(cp=0.0001,minsplit=6))

rpart.plot(tree, branch = 0.4)

PF: personal fouls

Now Leonard appears on the tree. Apart from the team-level features (total points has been removed), Leonard and Siakam are the first players to appear on the tree, which suggests that they are very important.
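
The join that builds `intedt` relies on the `suffix` argument of `left_join` to keep the team table's `win` column under its original name. A small sketch with made-up tables (`team` and `wide` are hypothetical stand-ins for `regular` and `players`) shows the resulting column names.

```r
library(dplyr)

# Made-up team-level and player-level tables that share DATE and win
team <- data.frame(DATE = c("d1", "d2"), win = c("W", "L"), FGA = c(85, 90))
wide <- data.frame(DATE = c("d1", "d2"), win = c("W", "L"), PTS_A = c(20, 15))

# With suffix = c("", ".y") the left table keeps "win" unchanged and the
# duplicate column from the right table becomes "win.y".
joined <- left_join(team, wide, by = "DATE", suffix = c("", ".y"))
names(joined)

# Drop the duplicate outcome and the join key before modelling
model_data <- joined %>% select(-win.y, -DATE)
names(model_data)
```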

Conclusion

It is hard to draw a final conclusion because the datasets are too small and I have not found a good way to handle missing values, which occur when a player does not play in a game. For now I just fill them with 0. party and rpart always give me different results. To keep this post short, I do not display all of them. In the 2019-2020 season, after moving to the L.A. Clippers, Leonard is still often absent from games, especially back-to-backs. If he stays healthy until the end of the regular season, will he play as well as he did last season?
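
On the missing-value question, one option I have not tried on the real data is to leave the gaps as NA instead of 0. The default `na.action` in `rpart` (`na.rpart`) keeps rows that have only some predictors missing and routes them through surrogate splits. A minimal sketch on made-up data, where `MIN_A` is the hypothetical minutes of a player "A" who sits out some games:

```r
library(rpart)

# Made-up data: MIN_A is NA in games the hypothetical player A missed
toy <- data.frame(
  win   = factor(rep(c("L", "W"), each = 10)),
  FGP   = c(40, 42, 41, 43, 44, 39, 45, 42, 41, 43,
            47, 49, 48, 50, 46, 51, 52, 48, 49, 50),
  MIN_A = c(30, NA, 32, 28, NA, 31, 29, NA, 33, 30,
            35, 36, NA, 34, 37, 35, NA, 36, 38, 34)
)

fit <- rpart(win ~ FGP + MIN_A, data = toy,
             control = rpart.control(minsplit = 6))

n_used <- fit$frame$n[1]                    # rows used at the root node
preds  <- predict(fit, toy, type = "class")  # predictions for every row
n_used
preds
```

All 20 rows are used in the fit and every row receives a prediction, including the games with a missing `MIN_A`. Whether this gives better trees than filling with 0 on the Raptors data remains to be tested.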