10. GLS 실습

Applied statistics
Author

김보람

Published

May 25, 2023

해당 자료는 전북대학교 이영미 교수님 2023응용통계학 자료임

WLSE

  • 등분산성을 만족하지 않을 때! 하자
set.seed(2023)
x <- round(runif(20,1,15),2) 
y <- 20+5*x + rnorm(20,0,2)*x
  • \(N(0,2)x \rightarrow N(0,2x^2)\)

  • \(Var(\epsilon)= \sigma^2\)\(x^2\)에 비례

plot(y~x, pch=16)

  • X가 커지면서 분산도 커진다..
m <- lm(y~x)
m1 <- lm(y~x, weights = 1/x^2)
  • \(Var(\epsilon) = \dfrac{\sigma^2}{w_i}=x^2 \sigma^2\)

  • \(w_i = \dfrac{1}{x^2}\)

  • LSE는 \(argmin \sum_{i=1}^n(y_i-\hat y_i)^2\)

  • WLSE는 잔차 가중치의 제곱합을 가장 작게 해주는..

  • WLSE: \(argmin \sum_{i=1}^n w_i (y_i - \hat y_i)^2\)

  • MSE관점에서 더 좋은 추정량

- LSE

summary(m)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.007  -9.123   0.579   4.872  55.973 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  23.7109     7.7116   3.075  0.00653 ** 
x             5.0099     0.9461   5.295 4.92e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.83 on 18 degrees of freedom
Multiple R-squared:  0.609, Adjusted R-squared:  0.5873 
F-statistic: 28.04 on 1 and 18 DF,  p-value: 4.924e-05

- WLSE

summary(m1)

Call:
lm(formula = y ~ x, weights = 1/x^2)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-3.8348 -1.4757  0.0365  1.0085  4.6530 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17.1950     3.0315   5.672 2.22e-05 ***
x             6.0742     0.7657   7.933 2.76e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.99 on 18 degrees of freedom
Multiple R-squared:  0.7776,    Adjusted R-squared:  0.7653 
F-statistic: 62.94 on 1 and 18 DF,  p-value: 2.761e-07
par(mfrow=c(1,1))
plot(x,y,pch=16)  
abline(m, col='red', lwd=2)
abline(m1, col='blue', lwd=2)
legend("topleft",c('OLS', "WLS"), lty=1,
        col=c('red', 'blue'), lwd=2)

par(mfrow=c(1,2))
plot(fitted(m), resid(m),pch=16,xlab=expression(hat(y)), ylab=expression(e[i]),main="OLS") 
abline(h=0)
plot(fitted(m)/x, resid(m)/x,pch=16,xlab=expression(sqrt(w[i])*hat(y)), ylab=expression(sqrt(w [i])*e[i]), main="WLS")
abline(h=0)

  • OLS는 퍼져 있다.

  • WLS는 \(\sqrt{w_i}y_i = \sqrt{w_i}\beta_0 + \sqrt{w_i}\beta_1x_i + \sqrt{w_i} \epsilon_i\)

  • \(\sqrt{w_i} \epsilon_i\) ~ \(N(0,2)\)

  • \(\sqrt{w_i} y_i = \hat y' = \beta_0' + \beta_1 x_i' + \epsilon_i'\)

  • 분산이 어느정도 안정화 됭

- OLS (LSE)

  • y = X \(\beta\) + \(\epsilon\)

  • \(\epsilon\) ~ \(N(0_n, In\sigma^2)\)

  • \(\hat \beta = (X^TX)^{-1}X^Ty\)

- GLS (WLSE)

  • y = X \(\beta\) + \(\epsilon\)

  • \(Var(\epsilon) = V\sigma^2\)

  • \(\hat \beta^{G} = (X^TV^{-1}X)^{-1}X^TV^{-1}y\)

V<-diag(x^2)
X<-model.matrix(m)
V.inv<-solve(V)
beta<-solve(t(X)%*%V.inv%*%X)%*%t(X)%*%V.inv%*%y 
beta  
A matrix: 2 × 1 of type dbl
(Intercept) 17.194962
x 6.074163
summary(m1)$coef
A matrix: 2 × 4 of type dbl
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.194962 3.0314900 5.672116 2.220897e-05
x 6.074163 0.7656541 7.933298 2.760514e-07

\(lm(y\) ~ \(x, wieght=\frac{1}{x^2})\)

GLS구한거랑 WLS로 구한거랑 6.074163 값 동일

e2 <- y-(beta[1]+beta[2]*x) 
t(e2) %*% V.inv %*%e2 
A matrix: 1 × 1 of type dbl
71.26588
anova(m1)
A anova: 2 × 5
Df Sum Sq Mean Sq F value Pr(>F)
<int> <dbl> <dbl> <dbl> <dbl>
x 1 249.18204 249.182044 62.93722 2.760514e-07
Residuals 18 71.26588 3.959216 NA NA

1차 자기상관 회귀모형

  • 독립성을 만족하지 않을 때! 하자
dt <- read.csv("corr.csv")
head(dt)
A data.frame: 6 × 2
y x
<dbl> <dbl>
1 349.7 133.6
2 353.5 135.4
3 359.2 137.6
4 366.4 140.0
5 376.5 143.8
6 385.7 147.1
plot(y~x, dt, pch=16)

  • 오른쪽 4개 데이터가 애매하다.
m <- lm(y~x, dt)
  • \(y=\beta_0+\beta_x + \epsilon\)
summary(m)

Call:
lm(formula = y ~ x, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.959  -8.874   2.035   9.035  16.623 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   89.234     26.721   3.339  0.00365 ** 
x              2.024      0.166  12.196 3.89e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.98 on 18 degrees of freedom
Multiple R-squared:  0.892, Adjusted R-squared:  0.886 
F-statistic: 148.7 on 1 and 18 DF,  p-value: 3.888e-10
par(mfrow=c(1,1))
plot(dt$x, resid(m), pch=16) 
abline(h=0)

library(lmtest)
Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

dwtest(m)

    Durbin-Watson test

data:  m
DW = 0.31237, p-value = 1.238e-08
alternative hypothesis: true autocorrelation is greater than 0
  • DW값이 0에 가까우므로 양의상관관계를 가진다.

  • 2에 가까우면 상관관계가 0에 가까움

  • 4에 가까우면 음의 상관관계에 가까움

plot(y~x, dt, pch=16)
abline(m, col='darkorange', lwd=2)
legend("topleft", "OLS",col="darkorange", lty=1)

plot(fitted(m), resid(m), pch=16,xlab=expression(hat(y)), ylab='residual', 
     main=paste0("residua l plot for OLS : DW statistics = ", round(dwtest(m)$stat,4)))
abline(h=0)

변수변환

bar_e1 <- mean(resid(m)[-1])
bar_e2 <- mean(resid(m)[-20])
# hat_rho <- sum((resid(m)[-1]-bar_e1)*(resid(m)[-20] - bar_e2))/sum((resid(m)[-1]-bar_e1)^2) #일밎게 구하는거
hat_rho <- sum(resid(m)[-1]*resid(m)[-20])/sum((resid(m)[-20])^2)  #대략적으로 구하는거 
# hat_rho <- cor(resid(m)[-1], resid(m)[-20])
hat_rho
0.891002040194419
  • \(Corr(\epsilon_i, \epsilon_j) \neq 0, i \neq j\)

- 1차 자기상관 회귀모형

  • \(y=\beta_0 + \beta_1 x + \epsilon\)

  • \(\epsilon_i = \rho \epsilon_{i-1} + \delta_i\)

  • \(\delta_i\) ~ \(N(0,\sigma^2), iid\)

  • 현시점의 오차가 다음 시점의 오차에 영향을 준다. ->시계열 데이터와 관련된 모형

  • \(y_i=\beta_0+\beta_1x_i + \epsilon_i\)

  • \(\rho y_{i-1} = \beta_0 + \beta_1 \rho x_{i-1} + \rho \epsilon_{i-1}\)

  • \(y_i - \rho y_{i-1}=(1-\rho)\beta_0 + \beta_1 (x_i - \rho x_{i-1}) + \epsilon_i - \rho \epsilon_{i-1}\)

  • 위를

  • \(y' = \beta_0' + \beta_1'x_i' + \delta_i\)

  • \(\hat \rho = \dfrac{\sum_{i=2}^n e_i e_{i-1}}{\sum_{i=2}^n e_i^2}=0.891002040194419\)

y1 <- dt$y[-1]-hat_rho*dt$y[-20] 
x1 <- dt$x[-1]-hat_rho*dt$x[-20]
m1 <- lm(y1~x1)
summary(m1)

Call:
lm(formula = y1 ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9956  -1.2060   0.7668   3.7082   8.1241 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  40.2455    13.1639   3.057  0.00713 **
x1            0.4862     0.6483   0.750  0.46354   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.347 on 17 degrees of freedom
Multiple R-squared:  0.03202,   Adjusted R-squared:  -0.02492 
F-statistic: 0.5624 on 1 and 17 DF,  p-value: 0.4635
dwtest(m1)

    Durbin-Watson test

data:  m1
DW = 1.3704, p-value = 0.04663
alternative hypothesis: true autocorrelation is greater than 0
  • 데이터가 많지 않아서 크게 dw가 차이나지 ㄴ않음..
par(mfrow=c(1,1))
plot(y1~x1,pch=16,  xlab=expression(x^"'"),  ylab=expression(y^"'"), 
     main="scatter plot after v ariable transformation")
abline(m1, col='darkorange', lwd=2)

plot(fitted(m1), resid(m1), pch=16,xlab=expression(hat(y)), ylab='residual', 
     main=paste0("residual plot for OLS : DW statistics = ", round(dwtest(m1)$stat,4))) 
abline(h=0)

GLS

V<-diag(20)
V<-hat_rho^abs(row(V)-col(V))
X<-model.matrix(m)
V.inv<-solve(V)
beta<-solve(t(X)%*%V.inv%*%X)%*%t(X)%*%V.inv%*%dt$y 
beta
A matrix: 2 × 1 of type dbl
(Intercept) 131.745681
x 1.714326
plot(y~x, dt, pch=16)
abline(m, col='darkorange', lwd=2)
abline(beta[1], beta[2], col='steelblue', lwd=2)
legend("topleft", c("OLS", "GLS"),col=c("darkorange", 'steelblue'), lty=1, lwd=2)

- 시간 변수를 더 추가해보자.

dt$t <- 1:nrow(dt)

m2 <- lm(y~t, dt) 
summary(m2)

Call:
lm(formula = y ~ t, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.107  -6.594   0.372   8.359  16.907 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 348.0605     5.4652   63.69  < 2e-16 ***
t             6.2023     0.4562   13.60  6.6e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.77 on 18 degrees of freedom
Multiple R-squared:  0.9113,    Adjusted R-squared:  0.9063 
F-statistic: 184.8 on 1 and 18 DF,  p-value: 6.605e-11
summary(m)

Call:
lm(formula = y ~ x, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.959  -8.874   2.035   9.035  16.623 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   89.234     26.721   3.339  0.00365 ** 
x              2.024      0.166  12.196 3.89e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.98 on 18 degrees of freedom
Multiple R-squared:  0.892, Adjusted R-squared:  0.886 
F-statistic: 148.7 on 1 and 18 DF,  p-value: 3.888e-10