Peking University Summer Course "Regression Analysis" (Linear Regression Analysis), Lecture Notes
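Class 5: ANOVA (Analysis of Variance) and F-tests

I. What is ANOVA?

ANOVA is the short name for the Analysis of Variance. The essence of ANOVA is to decompose the total variance of the dependent variable into two additive components, one for the structural part and the other for the stochastic part of a regression. Today we are going to examine the easiest case.

II. ANOVA: An Introduction

Let the model be $y = X\beta + \varepsilon$. Assuming $x_i$ is a column vector (of length $p$) of independent variable values for the $i$th observation,

$$y_i = \beta' x_i + \varepsilon_i .$$

Then $b'x_i$ is the predicted value.

Sum of squares total:

$$
\begin{aligned}
SST &= \sum (y_i - \bar{Y})^2 \\
    &= \sum (y_i - b'x_i + b'x_i - \bar{Y})^2 \\
    &= \sum (y_i - b'x_i)^2 + \sum (b'x_i - \bar{Y})^2 + 2\sum (y_i - b'x_i)(b'x_i - \bar{Y}) \\
    &= \sum e_i^2 + \sum (b'x_i - \bar{Y})^2 \\
    &= SSE + SSR
\end{aligned}
$$

The cross term drops out because $\sum (y_i - b'x_i)(b'x_i - \bar{Y}) = \sum e_i (b'x_i - \bar{Y}) = 0$. This is always true by OLS.

Important: the total variance of the dependent variable is decomposed into two additive parts: SSE, which is due to errors, and SSR, which is due to regression.

Geometric interpretation: blackboard.

Decomposition of Variance

If we treat X as a random variable, we can decompose total variance into the between-group portion and the within-group portion in any population:

$$V(y_i) = V(\beta' x_i) + V(\varepsilon_i) \qquad (1)$$

Proof:

$$
\begin{aligned}
V(y_i) &= V(\beta' x_i + \varepsilon_i) \\
       &= V(\beta' x_i) + V(\varepsilon_i) + 2\,\mathrm{Cov}(\beta' x_i, \varepsilon_i) \\
       &= V(\beta' x_i) + V(\varepsilon_i)
\end{aligned}
$$

(by the assumption that $\mathrm{Cov}(x_k, \varepsilon) = 0$ for all possible $k$).

The ANOVA table estimates the three quantities of equation (1) from the sample. As the sample size gets larger and larger, the ANOVA table approaches the equation more and more closely. In a sample, the decomposition of estimated variance is not strictly true. We thus need to decompose sums of squares and degrees of freedom separately. Is ANOVA a misnomer?

III. ANOVA in Matrix

I will try to give a simplified representation of ANOVA as follows:

$$
\begin{aligned}
SST &= \sum (y_i - \bar{Y})^2 \\
    &= \sum (y_i^2 + \bar{Y}^2 - 2\bar{Y} y_i) \\
    &= \sum y_i^2 + \sum \bar{Y}^2 - 2\bar{Y} \sum y_i \\
    &= \sum y_i^2 + n\bar{Y}^2 - 2n\bar{Y}^2 \qquad \text{(because } \textstyle\sum y_i = n\bar{Y}\text{)} \\
    &= \sum y_i^2 - n\bar{Y}^2 \\
    &= y'y - n\bar{Y}^2 \\
    &= y'y - \tfrac{1}{n}\, y'Jy \qquad \text{(in your textbook, monster look)}
\end{aligned}
$$

Here $J$ is the $n \times n$ matrix of ones.

$$SSE = e'e$$

$$
\begin{aligned}
SSR &= \sum (b'x_i - \bar{Y})^2 \\
    &= \sum \left[ (b'x_i)^2 + \bar{Y}^2 - 2\bar{Y}\, b'x_i \right] \\
    &= \sum (b'x_i)^2 + n\bar{Y}^2 - 2\bar{Y} \sum b'x_i \\
    &= \sum (b'x_i)^2 + n\bar{Y}^2 - 2\bar{Y} \sum (y_i - e_i) \\
    &= \sum (b'x_i)^2 + n\bar{Y}^2 - 2n\bar{Y}^2 \qquad \text{(because } \textstyle\sum e_i = 0,\ \sum y_i = n\bar{Y}\text{, as always)} \\
    &= \sum (b'x_i)^2 - n\bar{Y}^2 \\
    &= (Xb)'(Xb) - n\bar{Y}^2 \\
    &= b'X'y - \tfrac{1}{n}\, y'Jy \qquad \text{(in your textbook, monster look)}
\end{aligned}
$$

The last step uses the normal equations $X'Xb = X'y$.

IV. ANOVA Table

SOURCE       SS     DF      MS     F          with
Regression   SSR    DF(R)   MSR    MSR/MSE    DF(R), DF(E)
Error        SSE    DF(E)   MSE
Total        SST    DF(T)

Let us use a real example. Assume that we have a regression estimated to be

$$\hat{y} = -1.70 + 0.840\,x$$

ANOVA Table

SOURCE       SS     DF   MS     F                    with
Regression   6.44   1    6.44   6.44/0.19 = 33.89    1, 18
Error        3.40   18   0.19
Total        9.84   19

We know $\sum x_i = 100$, $\sum y_i = 50$, $\sum x_i^2 = 509.12$, $\sum y_i^2 = 134.84$, $\sum x_i y_i = 257.66$. If we know that the DF for SST is 19, what is n? n = 20.

$$\bar{Y} = 50/20 = 2.5$$

$$SST = \sum y_i^2 - n\bar{Y}^2 = 134.84 - 20 \times 2.5 \times 2.5 = 9.84$$

$$
\begin{aligned}
SSR &= \sum (-1.7 + 0.84\,x_i)^2 - 125.0 \\
    &= \sum \left( 1.7 \times 1.7 + 0.84 \times 0.84\,x_i^2 - 2 \times 1.7 \times 0.84\,x_i \right) - 125.0 \\
    &= 20 \times 1.7 \times 1.7 + 0.84 \times 0.84 \times 509.12 - 2 \times 1.7 \times 0.84 \times 100 - 125.0 \\
    &= 6.44
\end{aligned}
$$

$$SSE = SST - SSR = 9.84 - 6.44 = 3.40$$

DF (degrees of freedom): demonstration. Note: one degree of freedom is discounted for the intercept when calculating SST, so DF(T) = n - 1.

MS = SS/DF

p = 0.000 (ask students). What does the p-value say?

V. F-Tests

F-tests are more general than t-tests; t-tests can be seen as a special case of F-tests. If you have difficulty with F-tests, please ask your GSIs to review F-tests in the lab.

F-tests take the form of a ratio of two MSs:

$$F_{df_1, df_2} = MSR/MSE$$

An F statistic has two degrees of freedom associated with it: the degrees of freedom in the numerator and the degrees of freedom in the denominator. An F statistic is usually larger than 1 (under the null hypothesis its expected value is close to 1). An F statistic addresses whether the variance explained under the alternative hypothesis could be due to chance. In other words, the null hypothesis is that the explained variance is due to chance, or that all the coefficients are zero.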
The larger an F-statistic, the stronger the evidence against the null hypothesis. There is a table in the back of your book from which you can find exact probability values. In our example, the F is 34, which is highly significant.

VI. $R^2$

$$R^2 = SSR/SST$$

This is the proportion of variance explained by the model. In our example, $R^2$ = 65.4%.

VII. What happens if we add more independent variables?

1. SST stays the same.
2. SSR increases (strictly, it never decreases).
3. SSE decreases (strictly, it never increases).
4. $R^2$ increases (strictly, it never decreases).
5. MSR usually increases.
6. MSE usually decreases.
7. The F-test statistic usually increases.

Exceptions to 5 and 7: irrelevant variables may not explain the variance but still take up degrees of freedom. We really need to look at the results.

VIII. Important: General Ways of Hypothesis Testing with F-Statistics

All tests in linear regression can be performed with F-test statistics. The trick is to run nested models. Two models are nested if the independent variables in one model are a subset, or linear combinations of a subset, of the independent variables in the other model.

That is to say, if model A has independent variables $(1, x_1, x_2)$ and model B has independent variables $(1, x_1, x_2, x_3)$, A and B are nested. A is called the restricted model; B is called the less restricted or unrestricted model. We call A restricted because A implies that $\beta_3 = 0$. This is a restriction.

Another example: C has independent variables $(1, x_1, x_2 + x_3)$, D has $(1, x_2 + x_3)$.

C and A are not nested.
C and B are nested. One restriction in C: $\beta_2 = \beta_3$.
C and D are nested. One restriction in D: $\beta_1 = 0$.
D and A are not nested.
D and B are nested. Two restrictions in D: $\beta_2 = \beta_3$; $\beta_1 = 0$.

We can always test hypotheses implied in the restricted models. Steps: run two regressions for each hypothesis, one for the restricted model and one for the unrestricted model. The SST should be the same across the two models. What is different is SSE and SSR. That is, what is different is $R^2$. Let

$$df_r = df(SSE_r), \quad df_u = df(SSE_u); \qquad df_r - df_u = (n - p_r) - (n - p_u) = p_u - p_r > 0$$

Use the following formulas:

$$F_{df_r - df_u,\; df_u} = \frac{(SSE_r - SSE_u) / \left[ df(SSE_r) - df(SSE_u) \right]}{SSE_u / df_u}$$

or

$$F_{df_r - df_u,\; df_u} = \frac{(SSR_u - SSR_r) / \left[ df(SSR_u) - df(SSR_r) \right]}{SSE_u / df_u}$$

(proof: use SST = SSE + SSR)

Note that $df(SSE_r) - df(SSE_u) = df(SSR_u) - df(SSR_r) = \Delta df$ is the number of constraints (not the number of parameters) implied by the restricted model.

or

$$F_{df_r - df_u,\; df_u} = \frac{(R_u^2 - R_r^2) / \Delta df}{(1 - R_u^2) / df_u}$$

Note that

$$F_{1, df} = t_{df}^2$$

That is, for 1-df tests, you can do either an F-test or a t-test. They yield the same result. Another way to look at it is that the t-test is a special case of the F test, with the numerator DF being 1.

IX. Assumptions of F-tests

What assumptions do we need to make an ANOVA table work? Not much of an assumption. All we need is the assumption that $(X'X)$ is not singular, so that the least squares estimate $b$ exists.

The assumption of $X'\varepsilon = 0$ is needed if you want the ANOVA table to be an unbiased estimate of the true ANOVA (equation 1) in the population. Reason: we want $b$ to be an unbiased estimator of $\beta$, and the covariance between $b$ and $\varepsilon$ to disappear.

For reasons I discussed earlier, the assumptions of homoscedasticity and non-serial correlation are necessary for the estimation of $V(\varepsilon_i)$.

The normality assumption, that $\varepsilon_i$ follows a normal distribution, is needed for small samples.

X. The Concept of Increment

Every time you put one more independent variable into your model, you get an increase in $R^2$. We sometimes call the increase the incremental $R^2$. What it means is that more variance is explained: SSR is increased and SSE is reduced.

What you should understand is that the incremental $R^2$ attributed to a variable is typically smaller than the $R^2$ that variable yields when the other variables are absent.

XI. Consequences of Omitting Relevant Independent Variables

Say the true model is the following:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$$

But for some reason we only collect or consider data on $y$, $x_1$, and $x_2$. Therefore, we omit $x_3$ from the regression. We briefly discussed this problem before. The short story is that we are likely to have a bias due to the omission of a relevant variable from the model. This is so even though our primary interest is to estimate the effect of $x_1$ or $x_2$ on $y$. Why? We will have a formal presentation of this problem.

XII. Measures of Goodness-of-Fit

There are different ways to assess the goodness-of-fit of a model.

A. $R^2$

$R^2$ is a heuristic measure for the overall goodness-of-fit. It does not have an associated test statistic. $R^2$ measures the proportion of the variance in the dependent variable that is "explained" by the model:

$$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE}$$

B. Model F-test

The model F-test tests the joint hypothesis that all the model coefficients except for the constant term are zero.

Degrees of freedom associated with the model F-test:
Numerator: p - 1
Denominator: n - p

C. t-tests for individual parameters

A t-test for an individual parameter tests the hypothesis that a particular coefficient is equal to a particular number (commonly zero):

$$t_k = (b_k - \beta_{k0}) / SE_k$$

where $SE_k$ is the square root of the $(k, k)$ element of $MSE\,(X'X)^{-1}$, with degrees of freedom = n - p.

D. Incremental $R^2$

Relative to a restricted model, the gain in $R^2$ for the unrestricted model:

$$\Delta R^2 = R_u^2 - R_r^2$$

E. F-tests for Nested Models

This is the most general form of F-tests and t-tests:

$$F_{df_r - df_u,\; df_u} = \frac{(SSE_r - SSE_u) / \left[ df(SSE_r) - df(SSE_u) \right]}{SSE_u / df_u}$$

It is equivalent to a t-test ($F = t^2$) if the unrestricted and restricted models differ by only one parameter. It is equal to the model F-test if we set the restricted model to the constant-only model.

(Ask students) What are SST, SSE, and SSR, and their associated degrees of freedom, for the constant-only model?

Numerical Example

A sociological study is interested in understanding the social determinants of mathematical achievement among high school students. You are now asked to answer a series of questions. The data are real but have been tailored for educational purposes. The total number of observations is 400. The variables are defined as:
y: math score
x1: father's education
x2: mother's education
x3: family's socioeconomic status
x4: number of siblings
x5: class rank
x6: parents' total education (note: x6 = x1 + x2)

For the following regression models, we know:

Table 1

Model                          SST      SSR      SSE      DF    R2
(1) y on (1 x1 x2 x3 x4)       34863    4201
(2) y on (1 x6 x3 x4)          34863                      396   .1065
(3) y on (1 x6 x3 x4 x5)       34863    10426    24437    395   .2991
(4) x5 on (1 x6 x3 x4)                           269753   396   .0210

1. Please fill in the missing cells in Table 1.
2. Test the hypothesis that the effects of father's education (x1) and mother's education (x2) on math score are the same after controlling for x3 and x4.
3. Test the hypothesis that x6, x3, and x4 in Model (2) all have a zero effect on y.
4. Can we add x6 to Model (1)? Briefly explain your answer.
5. Test the hypothesis that the effect of class rank (x5) on math score is zero after controlling for x6, x3, and x4.

Answer:

1.

Model                          SST      SSR      SSE      DF    R2
(1) y on (1 x1 x2 x3 x4)       34863    4201     30662    395   .1205
(2) y on (1 x6 x3 x4)          34863    3713     31150    396   .1065
(3) y on (1 x6 x3 x4 x5)       34863    10426    24437    395   .2991
(4) x5 on (1 x6 x3 x4)         275539   5786     269753   396   .0210

Note that the SST for Model (4) is different from those for Models (1) through (3).

2. The restricted model is

$$y = b_0 + b_1 (x_1 + x_2) + b_3 x_3 + b_4 x_4 + e$$

The unrestricted model is

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + e$$

$$F_{1, 395} = \frac{(31150 - 30662)/1}{30662/395} = \frac{488}{77.63} = 6.29$$

3.

$$F_{3, 396} = \frac{3713/3}{31150/396} = \frac{1237.67}{78.66} = 15.73$$

4. No. x6 is a linear combination of x1 and x2, so $X'X$ is singular.

5.

$$F_{1, 395} = \frac{(31150 - 24437)/1}{24437/395} = \frac{6713}{61.87} = 108.50$$

$$t = \sqrt{F} = \sqrt{108.50} = 10.42$$
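
The hand calculations in these notes can be checked with a few lines of Python. The sketch below is illustrative and not part of the original lecture: the function names `anova_from_sums` and `nested_f` are hypothetical, and the inputs are the sums from the Section IV example and the sums of squares from Table 1. For Question 3 it assumes the restricted model is the constant-only model, whose SSE equals SST with n - 1 = 399 degrees of freedom.

```python
import math


def anova_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """ANOVA quantities for a one-predictor regression, built from raw sums."""
    y_bar = sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n        # corrected sum of squares of x
    s_xy = sum_xy - sum_x * sum_y / n     # corrected cross-product of x and y
    slope = s_xy / s_xx
    intercept = y_bar - slope * sum_x / n
    sst = sum_y2 - n * y_bar ** 2         # SST = sum(y_i^2) - n * Ybar^2
    ssr = slope * s_xy                    # SSR = sum((b'x_i - Ybar)^2)
    sse = sst - ssr
    df_reg, df_err = 1, n - 2
    return {
        "intercept": intercept,
        "slope": slope,
        "SST": sst,
        "SSR": ssr,
        "SSE": sse,
        "df": (df_reg, df_err, n - 1),
        "F": (ssr / df_reg) / (sse / df_err),
        "R2": ssr / sst,
    }


def nested_f(sse_r, df_r, sse_u, df_u):
    """F statistic for a restricted model tested against an unrestricted one.

    Returns (F, numerator df, denominator df); the numerator df is the
    number of restrictions, df_r - df_u.
    """
    num_df = df_r - df_u
    if num_df <= 0:
        raise ValueError("restricted model must have more error df")
    f_stat = ((sse_r - sse_u) / num_df) / (sse_u / df_u)
    return f_stat, num_df, df_u


# Section IV example: n = 20 and the five sums given in the notes.
table = anova_from_sums(20, 100, 50, 509.12, 134.84, 257.66)

# Question 2: Model (2) is the restricted version of Model (1).
f_q2 = nested_f(31150, 396, 30662, 395)[0]
# Question 3: constant-only model (assumed SSE = SST, df = 399) vs Model (2).
f_q3 = nested_f(34863, 399, 31150, 396)[0]
# Question 5: Model (2) is the restricted version of Model (3).
f_q5 = nested_f(31150, 396, 24437, 395)[0]
t_q5 = math.sqrt(f_q5)                    # 1-df test, so t = sqrt(F)

for name, value in [("F (Section IV)", table["F"]), ("F, Q2", f_q2),
                    ("F, Q3", f_q3), ("F, Q5", f_q5), ("t, Q5", t_q5)]:
    print(f"{name}: {value:.2f}")
```

The results agree with the hand calculations up to rounding. Small differences in the Section IV figures arise because the notes round the coefficients to -1.70 and 0.840 and MSE to 0.19 before dividing.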