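Machine Learning Starter Project: California Housing Price Prediction

Downloading the data

    import os
    import tarfile
    import urllib.request

    DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
    HOUSING_PATH = os.path.join("datasets", "housing")
    HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        # Create the directory if it does not exist
        if not os.path.isdir(housing_path):
            os.makedirs(housing_path)
        tgz_path = os.path.join(housing_path, "housing.tgz")
        # Copy the network object denoted by the URL to a local file
        urllib.request.urlretrieve(housing_url, tgz_path)
        # Extract the archive
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=housing_path)
        housing_tgz.close()

    fetch_housing_data()

A quick look at the data structure

    import pandas as pd

    def load_housing_data(housing_path=HOUSING_PATH):
        csv_path = os.path.join(housing_path, "housing.csv")
        return pd.read_csv(csv_path)

    housing = load_housing_data()
    housing.info()

Inspecting the data shows that some districts have missing feature values.

Creating the test set

The split is stratified on median income, an important predictor of the median house value. pd.cut creates 5 income categories: 0 to 1.5 is category 1, 1.5 to 3.0 is category 2, and so on.

    import numpy as np

    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                   labels=[1, 2, 3, 4, 5])

    from sklearn.model_selection import StratifiedShuffleSplit

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    # split() yields the indices of the train and test rows in the original data
    # The loop body runs only once because n_splits=1
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]

    # income_cat was only needed to create the test set; drop it to restore the original columns
    for set_ in (strat_train_set, strat_test_set):
        set_.drop("income_cat", axis=1, inplace=True)

Preparing the data

The quantity to predict is the median house value, so it is removed from the training features and kept separately as the labels against which predictions are checked.

    housing = strat_train_set.drop("median_house_value", axis=1)  # drop labels for training set
    housing_labels = strat_train_set["median_house_value"].copy()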
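To make the purpose of the stratified split more concrete, the following sketch compares income-category proportions in a full dataset and in a stratified test set. It is only an illustration: the data is synthetic (a log-normal stand-in for median_income), and the names toy, overall, strat_test and comparison are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for median_income (assumption: not the real housing data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["income_cat"]))

# Category proportions in the full data and in the stratified test set
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
strat_test = toy.loc[test_idx, "income_cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "stratified_test": strat_test})
print(comparison)
```

The two columns should agree closely, which is the property a purely random split does not guarantee for small categories.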
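Handling missing values

Most machine learning algorithms cannot work with missing features, so the missing values must be handled.

    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy="median")
    # The median can only be computed on numeric attributes, so drop the text attribute
    housing_num = housing.drop("ocean_proximity", axis=1)
    # fit() adapts the imputer to the training data:
    # it computes the median of each attribute and stores the result in statistics_
    imputer.fit(housing_num)
    # transform() is the step that actually replaces missing values with the medians
    # This completes the transformation of the training set, but X is a NumPy array
    X = imputer.transform(housing_num)
    # Rebuild a DataFrame
    housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

Handling text attributes

Numbers are easier for machine learning algorithms to handle than text. The ocean_proximity attribute takes a limited number of possible values rather than arbitrary text, so one-hot encoding is used.

    # Double brackets keep a 2D DataFrame, which OneHotEncoder requires
    housing_cat = housing[["ocean_proximity"]]

    from sklearn.preprocessing import OneHotEncoder

    cat_encoder = OneHotEncoder()
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat)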
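As a small illustration of the two steps above, the sketch below runs SimpleImputer and OneHotEncoder on four made-up rows. The values are invented and only the column names mirror the housing data. It shows where the learned medians and categories end up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy[["total_bedrooms"]])
print(imputer.statistics_)   # learned median of each column
print(filled.ravel())        # the NaN is replaced by that median

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(toy[["ocean_proximity"]])  # SciPy sparse matrix
print(encoder.categories_[0])  # categories in sorted order
print(one_hot.toarray())       # one column per category
```

The encoder returns a sparse matrix by default, so toarray() is only needed when a dense view is wanted for inspection.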
Custom transformer

Combining attributes can give better results, so the combined attributes are added as extra columns. Note the NumPy array indexing used here.

    from sklearn.base import BaseEstimator, TransformerMixin

    # column index
    rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
            self.add_bedrooms_per_room = add_bedrooms_per_room
        def fit(self, X, y=None):
            return self  # nothing else to do
        def transform(self, X):
            rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
            population_per_household = X[:, population_ix] / X[:, households_ix]
            if self.add_bedrooms_per_room:
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_household, population_per_household,
                             bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_household, population_per_household]

    attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
    housing_extra_attribs = attr_adder.transform(housing.values)

Final code

A transformation pipeline combines the previous transformations and adds feature scaling. Note the use of list().

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import ColumnTransformer

    # Each of these steps must provide a fit_transform() method
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("attribs_adder", CombinedAttributesAdder()),
        ("std_scaler", StandardScaler()),
    ])

    # Upgrade to a transformer that can handle all columns
    # list() applied to a DataFrame returns its column names
    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

    housing_prepared = full_pipeline.fit_transform(housing)

Training models

    def display_scores(scores):
        print("Scores:", scores)
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())
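Every model in this section is scored by root mean squared error (RMSE), computed as np.sqrt(mean_squared_error(...)). As a worked example of that arithmetic, the sketch below computes RMSE by hand and through scikit-learn on three made-up label and prediction pairs. The numbers are illustrative, not results from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

# Manual computation: square the errors, average them, take the square root
errors = y_pred - y_true
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity through scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the single 30000 error dominates the result, which is why RMSE is sensitive to large mistakes.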
Using a linear model

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

    housing_predictions = lin_reg.predict(housing_prepared)
    lin_mse = mean_squared_error(housing_labels, housing_predictions)
    lin_rmse = np.sqrt(lin_mse)
    lin_rmse

[Result: output not preserved in the source]

Using a decision tree model

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(housing_prepared, housing_labels)

    housing_predictions = tree_reg.predict(housing_prepared)
    tree_mse = mean_squared_error(housing_labels, housing_predictions)
    tree_rmse = np.sqrt(tree_mse)
    tree_rmse

[Result: output not preserved in the source]

Using a random forest model

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
    forest_reg.fit(housing_prepared, housing_labels)

    housing_predictions = forest_reg.predict(housing_prepared)
    forest_mse = mean_squared_error(housing_labels, housing_predictions)
    forest_rmse = np.sqrt(forest_mse)
    forest_rmse
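The errors above are all measured on the data the models were fit on, which can flatter a flexible model. The following sketch illustrates this on synthetic data (an assumption: X and y here are random stand-ins, not housing_prepared and housing_labels). An unrestricted decision tree reaches essentially zero training error, while a cross-validated estimate is much larger.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noise (illustrative stand-in)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)

# Error on the same data the tree was fit on
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# Error estimated on held-out folds
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores).mean()

print("training RMSE:", train_rmse)
print("cross-validated RMSE:", cv_rmse)
```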
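[Result: output not preserved in the source]

Judged by these training-set errors, the decision tree appears to be the best model. However, the errors were measured on the same data the models were trained on, so a very low value may simply reflect overfitting. For a more reliable evaluation, use cross-validation.

Cross-validating the linear model

    from sklearn.model_selection import cross_val_score

    # scoring returns negated MSE values, hence the minus sign before the square root
    lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                                 scoring="neg_mean_squared_error", cv=10)
    lin_rmse_scores = np.sqrt(-lin_scores)
    display_scores(lin_rmse_scores)

[Evaluation result: output not preserved in the source]

Cross-validating the decision tree

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)
    display_scores(tree_rmse_scores)

[Result: output not preserved in the source]

Cross-validating the random forest

    forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                    scoring="neg_mean_squared_error", cv=10)
    forest_rmse_scores = np.sqrt(-forest_scores)
    display_scores(forest_rmse_scores)

[Result: output not preserved in the source]

Fine-tuning the model

From the results above, the random forest performs comparatively well. The next step is to fine-tune it by trying many hyperparameter combinations.

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # try 12 (3×4) combinations of hyperparameters
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # then try 6 (2×3) combinations with bootstrap set as False
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]

    forest_reg = RandomForestRegressor(random_state=42)
    # train across 5 folds, that's a total of (12+6)*5=90 rounds of training
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error',
                               return_train_score=True)
    grid_search.fit(housing_prepared, housing_labels)

    # Find the best combination
    grid_search.best_estimator_

[Result: output not preserved in the source]

Reviewing the scores

    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

Result: the best model scores an RMSE of 49682, which is better than the 50182 obtained with the default hyperparameters. At this point the best model has been found.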
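The count of 18 candidates and 90 training rounds quoted for the grid search can be checked directly from a fitted GridSearchCV. The sketch below does so on a small synthetic problem: make_regression with 8 features is an assumption chosen so that max_features up to 8 is valid, and it does not use the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem (illustrative stand-in for the prepared housing data)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# One entry in cv_results_["params"] per candidate combination
n_candidates = len(grid_search.cv_results_["params"])
n_fits = n_candidates * grid_search.n_splits_
print(n_candidates, n_fits)
print(grid_search.best_params_)
```

With the default refit=True, GridSearchCV performs one additional fit of the best combination on all the data after these cross-validation rounds.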
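Analyzing the best model and its errors

    # Relative importance of each attribute
    feature_importances = grid_search.best_estimator_.feature_importances_
    # The extra attributes added earlier
    extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
    # Get the one-hot encoder from the pipeline
    cat_encoder = full_pipeline.named_transformers_["cat"]
    # Get the one-hot attributes; the category text values now serve as attribute names
    cat_one_hot_attribs = list(cat_encoder.categories_[0])
    attributes = num_attribs + extra_attribs + cat_one_hot_attribs
    sorted(zip(feature_importances, attributes), reverse=True)

[Result: output not preserved in the source]

Evaluating the system on the test set

After the training above, the best model has been found, and the final model can now be evaluated on the test set. Note that the test set goes through transform(), not fit_transform(), so that nothing is learned from it.

    final_model = grid_search.best_estimator_

    X_test = strat_test_set.drop("median_house_value", axis=1)
    y_test = strat_test_set["median_house_value"].copy()

    X_test_prepared = full_pipeline.transform(X_test)
    final_predictions = final_model.predict(X_test_prepared)

    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)
    final_rmse

[Final result: output not preserved in the source]

To compute a 95% confidence interval for the generalization error:

    from scipy import stats

    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                             loc=squared_errors.mean(),
                             scale=stats.sem(squared_errors)))

Other experiments

Trying support vector machine hyperparameters

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    param_grid = [
        {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
        {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]

    svm_reg = SVR()
    grid_search = GridSearchCV(svm_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error', verbose=2)
    grid_search.fit(housing_prepared, housing_labels)
    grid_search.best_params_

At this point the model's hyperparameters are fixed. The model with these hyperparameters then has to be retrained on all of the training data. With the default refit=True, GridSearchCV does this automatically and exposes the result as best_estimator_.

Replacing grid search with randomized search

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint

    param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

    forest_reg = RandomForestRegressor(random_state=42)
    rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                    n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                    random_state=42)
    rnd_search.fit(housing_prepared, housing_labels)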
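To see what the randomized search actually tries, the sketch below draws ten candidates from the same two distributions with scikit-learn's ParameterSampler, the helper RandomizedSearchCV uses to generate its candidates. Nothing is trained here; the variable name candidates is made up for the example.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

# Draw 10 candidate settings; randint samples integers in [low, high)
candidates = list(ParameterSampler(param_distribs, n_iter=10, random_state=42))
for params in candidates:
    print(params)
```

Because the upper bound is exclusive, n_estimators ranges over 1 to 199 and max_features over 1 to 7.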
Viewing the results

    cvres = rnd_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

Applying randomized search to the support vector machine from the previous experiment:

    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVR
    from scipy.stats import expon, reciprocal

    # see https://docs.scipy.org/doc/scipy/reference/stats.html
    # for `expon()` and `reciprocal()` documentation and more probability distribution functions.

    # Note: gamma is ignored when kernel is "linear"
    param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
    }

    svm_reg = SVR()
    rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                    n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                    verbose=2, random_state=42)
    rnd_search.fit(housing_prepared, housing_labels)

Creating a single pipeline that covers the complete data preparation and the final prediction

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression()),
    ])

    full_pipeline_with_predictor.fit(housing, housing_labels)
    # some_data is not defined in the source; a few training rows are assumed here
    some_data = housing.iloc[:5]
    full_pipeline_with_predictor.predict(some_data)

Adding a transformer that selects only the most important attributes

    from sklearn.base import BaseEstimator, TransformerMixin

    # Return the sorted column indices of the k largest values
    def indices_of_top_k(arr, k):
        return np.sort(np.argpartition(np.array(arr), -k)[-k:])

    class TopFeatureSelector(BaseEstimator, TransformerMixin):
        def __init__(self, feature_importances, k):
            self.feature_importances = feature_importances
            self.k = k
        def fit(self, X, y=None):
            self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
            return self
        def transform(self, X):
            return X[:, self.feature_indices_]

    # k is not defined in the source; 5 is an assumed value for illustration
    k = 5

    preparation_and_feature_selection_pipeline = Pipeline([
        ("preparation", full_pipeline),
        ("feature_selection", TopFeatureSelector(feature_importances, k)),
    ])

    housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)
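The index selection inside TopFeatureSelector is easiest to follow on a small array. The sketch below reuses indices_of_top_k on six made-up importance values (hypothetical numbers, not the ones produced by the random forest) and applies the resulting indices to a toy matrix.

```python
import numpy as np

def indices_of_top_k(arr, k):
    # argpartition moves the indices of the k largest values into the last
    # k positions (in no particular order); np.sort then restores column order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

# Made-up importances for six hypothetical features
importances = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
top3 = indices_of_top_k(importances, 3)
print(top3)  # → [1 3 5]

# Selecting those columns from a toy 2x6 matrix
X = np.arange(12).reshape(2, 6)
print(X[:, top3])  # → [[ 1  3  5] [ 7  9 11]]
```

Sorting the indices keeps the selected columns in their original order, so attribute names can still be matched to columns after selection.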