R Language House Price Regression Prediction Case Report

First, we load the data and the necessary packages:

load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(devtools)
library(statsr)
library(lubridate)
library(tidyr)
library(gridExtra)

1 Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

# type your code for Question 1 here, and Knit
ggplot(data = ames_train, aes(x = year(today()) - Year.Built)) +
  geom_histogram(bins = 30, fill = "blue", colour = "white") +
  labs(title = "House counts by age", x = "House age", y = "House count") +
  geom_vline(xintercept = median(year(today()) - ames_train$Year.Built), colour = "red") +
  annotate("text", x = median(year(today()) - ames_train$Year.Built) - 4, y = c(100, 10),
           label = c("Median", median(year(today()) - ames_train$Year.Built)),
           colour = "red", angle = c(90, 0)) +
  geom_vline(xintercept = mean(year(today()) - ames_train$Year.Built), colour = "#41c42f") +
  annotate("text", x = mean(year(today()) - ames_train$Year.Built) + 4, y = c(100, 10)
[...]

## [...] Central.Air + Sale.Condition, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.29410 -0.06298 -0.00026  0.06037  0.88123
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            3.789e+00  6.579e-01   5.759 1.14e-08 ***
## area                   2.846e-04  1.603e-05  17.752  < 2e-16 ***
## Lot.Area               1.612e-06  4.660e-07   3.460 0.000564 ***
## Overall.Qual           7.868e-02  5.594e-03  14.066  < 2e-16 ***
## Overall.Cond           5.196e-02  5.120e-03  10.149  < 2e-16 ***
## Year.Built             2.203e-03  2.788e-04   7.902 7.48e-15 ***
## Year.Remod.Add         1.061e-03  3.111e-04   3.411 0.000675 ***
## [...]                  2.146e-04  1.463e-05  14.669  < 2e-16 ***
## [...]                  1.732e-04  2.743e-05   6.315 4.13e-10 ***
## [...]                  1.081e-04  1.517e-05   7.122 2.09e-12 ***
## Bedroom.AbvGr         -1.993e-02  7.477e-03  -2.666 0.007813 **
## Fireplaces             2.649e-02  7.959e-03   3.328 0.000907 ***
## Garage.Cars            3.673e-02  8.013e-03   4.584 5.16e-06 ***
## MS.ZoningFV            3.426e-01  5.061e-02   6.770 2.24e-11 ***
## MS.ZoningI (all)       3.180e-01  1.433e-01   2.219 0.026688 *
## MS.ZoningRH            2.814e-01  6.743e-02   4.172 3.29e-05 ***
## MS.ZoningRL            3.138e-01  4.696e-02   6.683 3.95e-11 ***
## MS.ZoningRM            2.364e-01  4.664e-02   5.068 4.83e-07 ***
## Condition.2Feedr      -9.684e-02  1.091e-01  -0.887 0.375158
## Condition.2Norm       -1.817e-03  9.465e-02  -0.019 0.984689
## Condition.2PosA        7.855e-02  1.629e-01   0.482 0.629736
## Condition.2PosN       -1.006e+00  1.356e-01  -7.419 2.59e-13 ***
## Condition.2RRNn       -6.852e-02  1.628e-01  -0.421 0.673885
## Bldg.Type2fmCon        4.260e-02  3.201e-02   1.331 0.183533
## Bldg.TypeDuplex       -7.264e-02  2.531e-02  -2.870 0.004199 **
## Bldg.TypeTwnhs        -1.367e-01  2.387e-02  -5.726 1.38e-08 ***
## Bldg.TypeTwnhsE       -3.960e-02  1.766e-02  -2.243 0.025136 *
## Exter.QualFa          -5.329e-03  5.419e-02  -0.098 0.921685
## Exter.QualGd          -8.948e-02  2.558e-02  -3.498 0.000491 ***
## Exter.QualTA          -1.266e-01  2.992e-02  -4.233 2.52e-05 ***
## Exter.CondFa          -1.209e-01  7.521e-02  -1.608 0.108264
## Exter.CondGd           1.976e-02  6.673e-02   0.296 0.767240
## Exter.CondTA           4.762e-02  6.620e-02   0.719 0.472122
## Central.AirY           8.643e-02  2.297e-02   3.762 0.000179 ***
## Sale.ConditionAdjLand  1.264e-01  9.730e-02   1.299 0.194106
## Sale.ConditionAlloca   1.841e-01  6.837e-02   2.693 0.007208 **
## Sale.ConditionFamily  -3.146e-02  3.620e-02  -0.869 0.384956
## Sale.ConditionNormal   8.407e-02  1.777e-02   4.731 2.56e-06 ***
## Sale.ConditionPartial  1.250e-01  2.434e-02   5.137 3.38e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1296 on 961 degrees of freedom
## Multiple R-squared:  0.9086, Adjusted R-squared:  0.905
## F-statistic: 251.5 on 38 and 961 DF,  p-value: < 2.2e-16

ames_train %>%
  ggplot(aes(x = Neighborhood, y = price / 10^5, fill = Neighborhood)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90))

[Figure: boxplots of price / 10^5 by Neighborhood, with the x-axis labels rotated 90 degrees. Legend "Neighborhood": Blmngtn, Blueste, BrDale, BrkSide, ClearCr, CollgCr, Crawfor, Edwards, Gilbert, Greens, GrnHill, IDOTRR, MeadowV, Mitchel, NAmes, NoRidge, NPkVill, NridgHt, NWAmes, OldTown, Sawyer, SawyerW, Somerst, StoneBr, SWISU, Timber, Veenker.]
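The boxplot above lists the neighborhoods in alphabetical order, which makes the cheapest and most expensive ones harder to spot. As a minimal sketch, assuming a small made-up data frame (`toy` and all of its values are hypothetical, not the Ames data), the factor levels can be sorted by median price with base R's `reorder()` before plotting:

```r
# Hypothetical toy data: three neighborhoods with three sale prices each
toy <- data.frame(
  Neighborhood = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  price = c(100, 120, 110, 300, 280, 350, 200, 190, 210) * 1000
)

# reorder() sorts the factor levels by the median price within each level
toy$Neighborhood <- reorder(toy$Neighborhood, toy$price, FUN = median)
levels(toy$Neighborhood)
# -> "A" "C" "B"

# Same scaling as in the report: price in units of 10^5
toy$price_scaled <- toy$price / 10^5
boxplot(price_scaled ~ Neighborhood, data = toy, las = 2,
        ylab = "price / 10^5")
```

The reordered factor can be passed unchanged to a `ggplot()` call like the one in the report; only the order of the boxes changes, not the statistics.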
# Compute all central-tendency and spread statistics, grouped by neighborhood, and store them in a data frame.
ames_stats <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Min = min(price, na.rm = TRUE),
            Mean = mean(price, na.rm = TRUE),
            Median = median(price, na.rm = TRUE),
            IQR = IQR(price, na.rm = TRUE)) %>%
  arrange(desc(Mean))

# Store the summary statistics in a data frame
ames_summary <- data.frame(
  filter(ames_stats, Mean == max(Mean))$Neighborhood,
  filter(ames_stats, Mean == min(Mean))$Neighborhood,
  filter(ames_stats, IQR == max(IQR))$Neighborhood)

## Format the data frame
colnames(ames_summary) <- c("Most Expensive", "Least Expensive", "Most heterogenous")
rownames(ames_summary) <- c("Neighborhood")

## Print the data frame
ames_summary
##              Most Expensive Least Expensive Most heterogenous
## Neighborhood        StoneBr         MeadowV           StoneBr

The summary statistics above show that the StoneBr neighborhood has the highest mean and median price among all neighborhoods. StoneBr is therefore the most expensive neighborhood. However, based on the IQR, StoneBr also has the most dispersed house prices, which makes it the most heterogeneous neighborhood as well. MeadowV, in contrast, has the lowest mean and median house price and is therefore the least expensive neighborhood.

3 Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

# type your code for Question 3 here, and Knit
# Count the missing values of all variables
missing_val <- ames_train %>% summarise_all(~ sum(is.na(.)))

## Select the column that has the maximum number of missing values
missing_val[, colSums(missing_val) == max(missing_val)]
## # A tibble: 1 x 1
##   Pool.QC
##     <int>
## 1     997

The variable with the highest number of missing values is Pool.QC, which means that most of the houses do not have a swimming pool: only 3 houses in this data set have one. This makes sense, as the cost of owning a swimming pool is generally high: it entails not only the space but also the construction and, more importantly, the running costs, which in essence last a lifetime.

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

## Select the variables with at least one missing value
names(missing_val[, colSums(missing_val) >= 1])
##  [1] "Lot.Frontage"   "Alley"          "Mas.Vnr.Area"   "Bsmt.Qual"
##  [5] "Bsmt.Cond"      "Bsmt.Exposure"  "BsmtFin.Type.1" "BsmtFin.SF.1"
##  [9] [...]            "Total.Bsmt.SF"
## [13] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Fireplace.Qu"   "Garage.Type"
## [17] "Garage.Yr.Blt"  "Garage.Finish"  "Garage.Cars"    "Garage.Area"
## [21] "Garage.Qual"    "Garage.Cond"    "Pool.QC"        "Fence"
## [25] "Misc.Feature"

Above are listed all variables in this dataset with NA values; the data dictionary and the previous investigation explain why they are missing.
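The two queries used so far (the variable with the most missing values, and every variable with at least one) can also be written in base R with `colSums(is.na(...))`. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical and only borrow three column names from the Ames data:

```r
# Hypothetical toy data with missing values in two of three columns
toy <- data.frame(
  Lot.Area = c(8450L, 9600L, 11250L, 9550L),
  Pool.QC  = c(NA, NA, "Gd", NA),
  Alley    = c(NA, "Grvl", NA, "Pave")
)

# Number of NA values per column
na_counts <- colSums(is.na(toy))
na_counts

# Columns with at least one missing value
names(na_counts)[na_counts >= 1]

# Column with the largest number of missing values
names(which.max(na_counts))
```

Working on a named vector of counts avoids indexing a one-row tibble by a logical matrix, which is what the `summarise_all()` version relies on.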
We can now decide how to deal with the missing values.
Lot.Frontage and Mas.Vnr.Area are integers, and their NAs mean that there is no lot frontage or masonry veneer area. We can set these NA values to zero (0).
For the variables Alley, Pool.QC, Fence and Misc.Feature, an NA means that the house has none of them. We can set all of these NAs to the value "None".
21 houses have no basement, therefore the NAs of all variables containing "Bsmt" will be set to the value "None".
47 houses have no garage, hence the "Garage" NAs can be set to the value "None".

Based on the above analysis, we have created a function prepare_data() that will treat all NA values as described above and make the updated dataset data available for our model selection. In the code below, we describe the model selection method used.

# type your code for Question 4 here, and Knit
# Frequentist approach to model selection
# Backward selection, starting with the full model
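Backward selection starts from the full model and repeatedly drops the term whose removal most improves the chosen criterion, stopping when no removal helps. The sketch below illustrates this with `stepAIC()` from MASS and the BIC penalty `k = log(n)`, the same call the report's BIC branch uses. It is not the report's `model_selection()` function: the data are simulated, and `toy`, `noise` and all coefficients are assumptions made up for illustration.

```r
library(MASS)  # provides stepAIC()

set.seed(42)
n <- 200

# Simulated stand-in for the Ames data (hypothetical values)
toy <- data.frame(
  Lot.Area      = runif(n, 2000, 20000),
  Year.Built    = sample(1900:2010, n, replace = TRUE),
  Bedroom.AbvGr = sample(1:5, n, replace = TRUE),
  noise         = rnorm(n)
)

# log(price) depends only on Lot.Area and Year.Built in this simulation
toy$price <- exp(10 + 0.00005 * toy$Lot.Area +
                   0.004 * (toy$Year.Built - 1900) +
                   rnorm(n, sd = 0.1))

# Full model on the log scale, then backward elimination with the BIC penalty
full <- lm(log(price) ~ Lot.Area + Year.Built + Bedroom.AbvGr + noise,
           data = toy)
best <- stepAIC(full, direction = "backward", k = log(n), trace = 0)

formula(best)
c(full = BIC(full), best = BIC(best))
```

Setting `k = 2` instead of `log(n)` would give AIC-based selection, which penalizes extra terms less and tends to keep larger models.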
Preparing the data function

# This function will prepare the data from ames for analysis
prepare_data <- function(data = ames_train) {
  # Identify variables that hold a single repeated value; they are not
  # considered in modeling since they are not significant to the study
  var_to_use <- data %>% summarise_all(~ length(unique(.)))
  var_to_use <- names(var_to_use)[unlist(var_to_use) > 1]
  # Select useful variables only for the model
  data <- data %>% select(all_of(var_to_use))
  # Deal with NA values
  # Separate numerical from categorical variables
  ames_var_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
  # Deal with the NA values of numerical variables
  ames_train_int <- data %>% select(all_of(ames_var_types$integer))
  ames_train_int[is.na(ames_train_int)] <- 0
  ## Deal with the NA values of categorical variables
  ames_train_fac <- data %>% select(all_of(ames_var_types$factor))
  ames_train_fac <- sapply(ames_train_fac, as.character)
  ames_train_fac[is.na(ames_train_fac)] <- c("None")
  ames_train_fac[ames_train_fac == ""] <- c("None")
  ames_train_fac <- data.frame(ames_train_fac)
  ## Merge numerical and categorical variables
  data <- cbind(ames_train_int, ames_train_fac)
  return(na.omit(data))
}

Model selection function

# This function will do the automatic model selection and yield the best model based on the selection criterion
# (data has no default here: a default of data = data would refer to itself)
model_selection <- function(data, response_variable = "price",
                            criteria = c("pvalue", "Adj-Rsquared", "BIC", "AIC"),
                            significance = 0.05) {
  count <- 0
  # Building the linear model based on the data frame provided
  # Looping along the variable names and creating the next variable list for the lm
  full_var_list <- names(data)[names(data) != response_variable]
  while (count == 0) {
    next_var <- c()
    formula_lm <- NULL
    for (j in seq_along(full_var_list)) {
      next_var <- noquote(paste0(next_var, full_var_list[j], "+"))
    }
    # Removing the + at the end of the variable list
    next_var_list <- substr(next_var, 1, nchar(next_var) - 1)
    # Writing the linear model formula to be used with the new variable list
    formula_lm <- as.formula(paste(paste("log(", response_variable, ")~", sep = ""), next_var_list, sep = ""))
    data_lm <- lm(formula = formula_lm, data = data)
    # Starting the model selection based on the method selected
    if (tolower(criteria) %in% "pvalue") {
      # Keep the coefficients whose p-value does not exceed the preset significance level. Default is 0.05
      model_pvalue <- data.frame(summary(data_lm)$coef[summary(data_lm)$coef[, 4] <= significance, ])
      # Filter all rows in the returned data frame which partially match variable names in the original data frame
      next_lm_var <- sapply(names(data), function(x) grep(x, row.names(model_pvalue), value = TRUE))
      ## Remove all variables with no match; here we deal with character(0), which is different from NULL.
      ## We filter it by matching the length == 0L
      next_lm_var <- next_lm_var[!sapply(next_lm_var, function(x) length(x) == 0L)]
      ## Updating the next list of variables to be used for linear modeling
      next_var_list <- names(next_lm_var)[names(next_lm_var) != response_variable]
      ## Checking if the previous and current variable lists are the same
      if (length(next_var_list) != length(full_var_list)) {
        full_var_list <- next_var_list
      } else {
        count <- 1
      }
    } else if (tolower(criteria) %in% "adj-rsquared") {
      ## Select p-values higher than the preset significance level. Default is 0.05
      # Adj-Rsquared <- summary(data_lm)$adj.r.squared
      # Define the sequence of variables
      # var_list <- names(data)
      # var_seq <- seq(length(var_list))
      ## Filter all rows in the returned data frame which partially match variable names in the original data frame
      # next_lm_var <- sapply(var_seq, function(x) combn(var_list, x))
      ## Remove all variables with no match; here we deal with character(0), which is different from NULL.
      ## We filter it by matching the length == 0L
      # next_lm_var <- next_lm_var[!sapply(next_lm_var, function(x) length(x) == 0L)]
      ## Updating the next list of variables to be used for linear modeling
      # next_var_list <- names(next_lm_var)[names(next_lm_var) != response_variable]
      ## Checking if the previous and current variable lists are the same
      print(paste(criteria, "is not yet supported"))
    } else if (tolower(criteria) %in% "bic") {
      ## Computing the BIC model selection option if selected
      n <- nrow(na.omit(data))
      bic_lm <- stepAIC(data_lm, direction = "backward", k = log(n), trace = 0)
      data_lm <- bic_lm
      ## Setting count to 1 to exit the while loop
      count <- 1
    } else if (tolower(criteria) %in% "aic") {
      ## Computing the AIC model selection option if selected
      n <- nrow(na.omit(dat