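Lab Report

Course: Data Mining Theory and Practice
Instructor: 向前
Student ID: 20191106078
Name: 龚永好
Major: Electronic Information Engineering
Class: 电信1902班
Lab location: 信-506
Date: May 19, 2022
Lab topic: Experiment 3: Regression Algorithm Practice

I. Purpose and requirements

Purpose: further master the process of data exploration and data preprocessing; become familiar with the principles of regression algorithms; learn to process data with regression algorithms and master the application of one regression model.

Requirements:
1. Become more familiar with the steps of data mining in a Python environment.
2. Complete an exploratory analysis of the data, including descriptive statistics and correlation analysis.
3. Complete data preprocessing, including key-attribute analysis.
4. Complete model building, including the grey prediction model.

II. Equipment (environment) and requirements

1. Hardware: CPU at 2.0 GHz or above; at least 4 GB of memory, 8 GB recommended.
2. Software: Windows 7 or later, with the Anaconda environment.

III. Experiment content

(I) Data mining steps

1. Define the problem
Before knowledge discovery begins, the first and most important requirement is to understand the data and the business problem. The goal must be defined clearly, that is, you must decide what you actually want to do. For example, when trying to increase the use of an e-mail service, the goal might be to "increase the user usage rate" or to "increase the value of each use". The models built for these two problems are almost completely different, so a decision has to be made.

2. Build the data mining database
Building the data mining database involves the following steps: data collection, data description, selection, data quality assessment and data cleaning, merging and integration, building metadata, loading the data mining database, and maintaining the data mining database.

3. Analyze the data
The purpose of analysis is to find the data fields that most affect the predicted output and to decide whether derived fields need to be defined. If the dataset contains hundreds or thousands of fields, browsing and analyzing the data is very time-consuming and tiring; in that case you need to choose a tool with

3-6 run result: true values (y) and predicted values (y_pred)

    Year        y       y_pred
    1994    64.87    38.073131
    1995    99.75    84.574514
    1996    88.11    95.333014
    1997   106.07   107.001982
    1998   137.32   151.508388
    1999   188.14   188.497021
    2000   219.91   219.800524
    2001   271.91   230.491832
    2002   269.10   219.797214
    2003   300.55   300.550000
    2004   338.45   383.344482
    2005   408.86   462.958862
    2006   476.72   554.493152
    2007   838.99   690.770222
    2008   843.14   842.238801
    2009  1107.67  1086.833572
    2010  1399.16  1377.965940
    2011  1535.14  1535.224116
    2012  1579.68  1737.430078
    2013  2088.14  2083.657220
    2014      NaN  2185.493083
    2015      NaN  2536.162949

IV. Analysis of results and problems encountered

(I) Problems encountered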
1. Problem 1

Running 3-4.py produced a convergence warning followed by an error:

    C:\Py\python-3.6.6\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
      ConvergenceWarning)
    Traceback (most recent call last):
      File "...", line 1, in <module>
        runfile('D:/shujuwajue-shiyan3/3-4.py')
      File "C:\Py\python-3.6.6\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
        execfile(filename, namespace)
      File "C:\Py\python-3.6.6\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
        exec(compile(f.read(), filename, 'exec'), namespace)
      File "D:/shujuwajue-shiyan3/3-4.py", line 23, in <module>
        new_reg_data.to_csv(outputfile)  # save the data
      File "C:\Py\python-3.6.6\lib\site-packages\pandas\core\frame.py", line 1745, in to_csv
        formatter.save()
      File "C:\Py\python-3.6.6\lib\site-packages\pandas\io\formats\csvs.py", line 156, in save
        compression=self.compression)
      File "C:\Py\python-3.6.6\lib\site-packages\pandas\io\common.py", line 400, in _get_handle
        f = open(path_or_buf, mode, encoding=encoding)
    FileNotFoundError: [Errno 2] No such file or directory: './tmp/new_reg_data.csv'

Cause: the output path in the code had not been changed, so it pointed to a folder that does not exist on this machine.

Fix: change the path, after which the code runs normally:

    outputfile = 'D:/shujuwajue-shiyan3/new_reg_data.csv'

Run result after the change:

    In [8]: runfile('D:/shujuwajue-shiyan3/3-4.py')
    Correlation coefficients: [ 1.8000e-04 -0.0000e+00  1.2414e-01 -1.0310e-02  6.5400e-02  1.2000e-04
      3.1741e-01  3.4900e-02 -0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
     -4.0300e-02]
    Number of nonzero coefficients: 8
    Whether each coefficient is zero: [ True False  True  True  True  True  True  True False False
     False False  True]
    Dimensions of the output data: (20, 8)
    C:\Py\python-3.6.6\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
      ConvergenceWarning)

In the boolean list, True marks the 8 coefficients that are nonzero (x1, x3, x4, x5, x6, x7, x8, x13), despite the wording of the printed label.

2. Problem 2
Running 3-5.py failed when writing the Excel file:

    Traceback (most recent call last):
      File "...", line 1, in <module>
        runfile('D:/shujuwajue-shiyan3/3-5.py')
      File "C:\Py\python-3.6.6\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
        execfile(filename, namespace)
      File "C:\Py\python-3.6.6\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
        exec(compile(f.read(), filename, 'exec'), namespace)
      File "D:/shujuwajue-shiyan3/3-5.py", line 33, in <module>
        new_reg_data.to_excel(outputfile)  # output the results
      File "C:\Py\python-3.6.6\lib\site-packages\pandas\core\frame.py", line 1766, in to_excel
        engine=engine)
      File "C:\Py\python-3.6.6\lib\site-packages\pandas\io\formats\excel.py", line 646, in write
        writer = ExcelWriter(_stringify_path(writer), engine=engine)
      File "C:\Py\python-3.6.6\lib\site-packages\pandas\io\excel.py", line 1448, in __init__
        import xlwt
    ModuleNotFoundError: No module named 'xlwt'

Cause: xlwt is already installed on the computer, but not in the Python environment used by the desktop Spyder.exe. The traceback shows Spyder running from C:\Py\python-3.6.6, while pip in the ordinary command prompt reports xlwt under c:\programdata\anaconda3 (the command was run twice with the same result):

    C:\Windows\system32\cmd.exe
    Microsoft Windows [Version 10.0.19042.508]
    (c) 2020 Microsoft Corporation. All rights reserved.

    C:\Users\Administrator>pip3 install -i https://pypi.doubanio.com/simple/ xlwt
    Looking in indexes: https://pypi.doubanio.com/simple/
    Requirement already satisfied: xlwt in c:\programdata\anaconda3\lib\site-packages (1.3.0)

Fix: install xlwt from the Anaconda Prompt with the python36 environment active:

    Anaconda Prompt (python36)
    (python36) C:\Users\Administrator>pip3 install -i https://pypi.doubanio.com/simple/ xlwt
    Looking in indexes: https://pypi.doubanio.com/simple/
    Collecting xlwt
      Downloading https://pypi.doubanio.com/packages/44/48/def306413b25c3d01753603b1a222a011b8621aed27cd7f89cbe27e6b0f4/xlwt-1.3.0-py2.py3-none-any.whl (99 kB)
    Installing collected packages: xlwt
    Successfully installed xlwt-1.3.0

The program was then run again in Jupyter Notebook.
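Both problems above concern the runtime environment rather than the model, so they can be checked before a script is run. The following is a minimal sketch using only the standard library; the helper names `ensure_parent_dir` and `module_available` are hypothetical and are not part of the original scripts. It creates the output folder if it is missing, and reports which interpreter is running and whether a top-level module can be imported from it.

```python
import importlib.util
import os
import sys


def ensure_parent_dir(path):
    """Create the folder that will hold `path` if it does not exist yet."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return parent


def module_available(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # The interpreter path shows which Python environment the IDE is using;
    # a package installed into a different environment is not visible here.
    print("interpreter:", sys.executable)
    print("xlwt importable:", module_available("xlwt"))
```

Calling `ensure_parent_dir(outputfile)` before `to_csv` would avoid the FileNotFoundError of Problem 1, and comparing `sys.executable` with the location pip reports makes an environment mismatch like the one in Problem 2 visible.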
Jupyter Notebook startup log:

    (base) PS C:\Users\Administrator> jupyter notebook
    [15:22:09.357 NotebookApp] The port 8888 is already in use, trying another port.
    [15:22:09.739 NotebookApp] JupyterLab extension loaded from C:\ProgramData\Anaconda3\lib\site-packages\jupyterlab
    [15:22:09.740 NotebookApp] JupyterLab application directory is C:\ProgramData\Anaconda3\share\jupyter\lab
    [15:22:09.741 NotebookApp] Serving notebooks from local directory: C:\Users\Administrator
    [15:22:09.742 NotebookApp] The Jupyter Notebook is running at:
    [15:22:09.742 NotebookApp] http://localhost:8889/?token=72a5d0ee3d096175d6b5a19c85dba29912f2e69ea8b12965
    [15:22:09.742 NotebookApp]  or http://127.0.0.1:8889/?token=72a5d0ee3d096175d6b5a19c85dba29912f2e69ea8b12965
    [15:22:09.742 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [15:22:09.840 NotebookApp]
        To access the notebook, open this file in a browser:
            file:///C:/Users/Administrator/AppData/Roaming/jupyter/runtime/nbserver_8680-open.html
        Or copy and paste one of these URLs:
            http://localhost:8889/?token=72a5d0ee3d096175d6b5a19c85dba29912f2e69ea8b12965
         or http://127.0.0.1:8889/?token=72a5d0ee3d096175d6b5a19c85dba29912f2e69ea8b12965
    [15:22:13.977 NotebookApp] Creating new notebook in
    [15:22:14.608 NotebookApp] Kernel started: af9bc877-c5e8-4b1d-ab24-f58e2f417f63
    [15:24:00.603 NotebookApp] Saving file at /Untitled5.ipynb
    [15:24:14.579 NotebookApp] Saving file at /Untitled5.ipynb

Run result:

    In [1]: import xlwt
    In [2]: import seaborn

    import sys
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    sys.path.append('D:/shujuwajue-shiyan3/code')  # set the path
    from GM11 import GM11  # import the self-written grey prediction function

    inputfile1 = 'D:/shujuwajue-shiyan3/new_reg_data.csv'  # input data file
    inputfile2 = 'D:/shujuwajue-shiyan3/data.csv'  # input data file
    # read_csv keeps the saved index as an 'Unnamed: 0' column (visible in the output)
    new_reg_data = pd.read_csv(inputfile1)  # data after feature selection
    data = pd.read_csv(inputfile2)  # full data
    new_reg_data.index = range(1994, 2014)
    new_reg_data.loc[2014] = None
    new_reg_data.loc[2015] = None
    l = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
    for i in l:
        f = GM11(new_reg_data.loc[range(1994, 2014), i].values)[0]
        new_reg_data.loc[2014, i] = f(len(new_reg_data) - 1)  # prediction for 2014
        new_reg_data.loc[2015, i] = f(len(new_reg_data))  # prediction for 2015
        new_reg_data[i] = new_reg_data[i].round(2)  # keep two decimal places
    outputfile = 'D:/shujuwajue-shiyan3/new_reg_data_GM11.xls'  # where the grey prediction result is saved
    y = list(data['y'].values)  # extract the fiscal revenue column and merge it into the new data frame
    y.extend([np.nan, np.nan])
    new_reg_data['y'] = y
    new_reg_data.to_excel(outputfile)  # output the results
    print('Prediction results:\n', new_reg_data.loc[2014:2015, :])  # show the predictions

    Prediction results:
          Unnamed: 0          x1       x3        x4        x5          x6       x7        x8       x13    y
    2014         NaN  8142148.24  7042.31  43611.84  35046.63  8505522.58  4600.40  18686.28  44506.47  NaN
    2015         NaN  8460489.28  8166.92  47792.22  38384.22  8627139.31  5214.78  21474.47  49945.88  NaN
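The script relies on a self-written `GM11` function whose source is not included in this report. To illustrate what a GM(1,1) grey model computes, here is a minimal sketch in plain Python. The name `gm11_forecast` and its parameters are hypothetical, and this is not the author's `GM11` implementation: it returns forecast values directly instead of a fitted function, and it assumes a non-constant series.

```python
import math


def gm11_forecast(x0, steps=1):
    """Fit a GM(1,1) grey model to the series x0 and forecast `steps` values."""
    n = len(x0)
    if n < 4:
        raise ValueError("GM(1,1) is usually fitted on at least 4 points")

    # Accumulated series x1 (running sum of x0).
    x1 = []
    total = 0.0
    for value in x0:
        total += value
        x1.append(total)

    # Background values: mean of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = list(x0[1:])

    # Least-squares fit of y = -a * z + b (simple linear regression).
    z_mean = sum(z) / len(z)
    y_mean = sum(y) / len(y)
    sxx = sum((zi - z_mean) ** 2 for zi in z)
    sxy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    slope = sxy / sxx
    a = -slope
    b = y_mean - slope * z_mean
    if abs(a) < 1e-12:
        raise ValueError("development coefficient is zero; series looks constant")

    # Restored value at 0-based position k: x1_hat(k) - x1_hat(k - 1),
    # where x1_hat(k) = (x0[0] - b / a) * exp(-a * k) + b / a.
    c = x0[0] - b / a

    def restored(k):
        return c * (math.exp(-a * k) - math.exp(-a * (k - 1)))

    return [restored(k) for k in range(n, n + steps)]
```

With a 20-point yearly series such as 1994 to 2013, `gm11_forecast(series, steps=2)` would give two forecast values, which is the role the 2014 and 2015 rows play in the output above.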
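(II) Experiment summary

This was the third computer lab of the data mining course. The content was Chapter 6 of the textbook, "Analysis and prediction of factors affecting fiscal revenue". The aim was to further master the process of data exploration and data preprocessing, to become familiar with the principles of regression algorithms, and to learn to process data with regression algorithms and apply one regression model. Through this lab I became more familiar with the steps of data mining in a Python environment, completed the exploratory analysis of the data (descriptive statistics and correlation analysis) and the data preprocessing (key-attribute analysis), and completed model building, including the grey prediction model. I learned a great deal.

a good interface and powerful features to help you with these tasks.

4. Prepare the data
This is the last data preparation step before building the model. It can be divided into four parts: selecting variables, selecting records, creating new variables, and transforming variables.

5. Build the model
Building a model is an iterative process. Different models need to be examined carefully to judge which one is most useful for the business problem at hand. First build the model with one part of the data, then test and validate it with the remaining data. Sometimes there is a third dataset, called the validation set: because the test set may be influenced by the characteristics of the model, an independent dataset is needed to verify the model's accuracy. Training and testing a data mining model requires splitting the data into at least two parts, one for training and one for testing.

6. Evaluate the model
Once the model is built, the results must be evaluated and the value of the model explained. The accuracy obtained on the test set is meaningful only for the data used to build the model. In practice, you also need to understand the types of errors and the costs they bring. Experience shows that an effective model is not necessarily a correct model. The direct cause is the various assumptions implicit in model building, so it is important to test the model directly in the real world: apply it on a small scale first, collect test data, and roll it out more widely only once the results are satisfactory.

7. Deploy the model
After the model has been built and validated, there are two main ways to use it. The first is to provide it to analysts as a reference; the other is to apply the model to different datasets.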
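The split described in step 5 can be sketched as follows. This is a minimal illustration in plain Python, not code from the lab; `split_dataset` and its parameters are hypothetical names. It shuffles the records and cuts them into training, validation and test parts.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.0, seed=0):
    """Shuffle rows and return (train, validation, test) lists."""
    if not 0 <= test_fraction + validation_fraction < 1:
        raise ValueError("fractions must leave some data for training")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_test = int(round(len(shuffled) * test_fraction))
    n_validation = int(round(len(shuffled) * validation_fraction))

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test
```

A random shuffle suits records that are independent of each other. For a yearly series such as the fiscal data in this lab, holding out the most recent years instead of random rows is often the more sensible choice.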
(II) Data exploration

3-1 code: descriptive statistics

    # -*- coding: utf-8 -*-
    """
    3-1
    """
    import numpy as np
    import pandas as pd

    inputfile = 'D:/shujuwajue-shiyan3/data.csv'  # input data file
    data = pd.read_csv(inputfile)  # read the data

    # descriptive statistics: one row per variable, one column per statistic
    description = [data.min(), data.max(), data.mean(), data.std()]
    description = pd.DataFrame(description, index=['Min', 'Max', 'Mean', 'STD']).T
    print('Descriptive statistics:\n', np.round(description, 2))

3-1 run result

Variable explorer: data is a DataFrame of size (20, 14) with columns x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, y; description is a DataFrame of size (14, 4) with columns Min, Max, Mean, STD; inputfile is the str 'D:/shujuwajue-shiyan3/data.csv'.

IPython console (Python 3.6.6, 32 bit; IPython 6.5.0):

    In [1]: runfile('D:/shujuwajue-shiyan3/3-1.py')
    Descriptive statistics:
                 Min         Max        Mean         STD
    x1    3831732.00  7599295.00  5579519.95  1262194.72
    x2        181.54     2110.78      765.04      595.70
    x3        448.19     6882.85     2370.83     1919.17
    x4       7571.00    42049.14    19644.69    10203.02
    x5       6212.70    33156.83    15870.95     8199.77
    x6    6370241.00  8323096.00  7350513.60   621341.85
    x7        525.71     4454.55     1712.24     1184.71
    x8        985.31    15420.14     5705.80     4478.40
    x9         60.62      228.46      129.49       50.51
    x10        65.66      852.56      340.22      251.58
    x11        97.50      120.00      103.31        5.51
    x12         1.03        1.91        1.42        0.25
    x13      5321.00    41972.00    17273.80    11109.19
    y          64.87     2088.14      618.08      609.25
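For reference, the four statistics in the table can be reproduced for a single column without pandas. This is a minimal sketch; `describe` is a hypothetical helper name, not part of the lab code. It uses the sample standard deviation (dividing by n - 1), which is also what pandas' `std()` does by default.

```python
import statistics


def describe(values):
    """Return min, max, mean and sample standard deviation of a sequence."""
    return {
        "Min": min(values),
        "Max": max(values),
        "Mean": statistics.mean(values),
        # stdev divides by n - 1, matching the default of pandas' std()
        "STD": statistics.stdev(values),
    }
```

Applying `describe` to each column of the dataset in turn would yield one row of the table per variable.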
3-2 code: Pearson correlation coefficient matrix of the original data

    # -*- coding: utf-8 -*-
    """
    3-2
    """
    # The script as run had no imports of its own and apparently relied on
    # np and data left in the console by 3-1.py; they are added here so that
    # the file also runs on its own.
    import numpy as np
    import pandas as pd

    data = pd.read_csv('D:/shujuwajue-shiyan3/data.csv')  # read the data
    corr = data.corr(method='pearson')  # Pearson correlation matrix
    print('Correlation matrix:\n', np.round(corr, 2))  # keep two decimal places

Spyder's code analysis reported: undefined name 'np'.

3-2 run result

    In [2]: runfile('D:/shujuwajue-shiyan3/3-2.py')
    Correlation matrix:
           x1    x2    x3    x4    x5  ...   x10   x11   x12   x13     y
    x1   1.00  0.95  0.95  0.97  0.97  ...  0.98 -0.29  0.94  0.96  0.94
    x2   0.95  1.00  1.00  0.99  0.99  ...  0.98 -0.13  0.89  1.00  0.98
    x3   0.95  1.00  1.00  0.99  0.99  ...  0.99 -0.15  0.89  1.00  0.99
    x4   0.97  0.99  0.99  1.00  1.00  ...  1.00 -0.19  0.91  1.00  0.99
    x5   0.97  0.99  0.99  1.00  1.00  ...  1.00 -0.18  0.90  0.99  0.99
    x6   0.99  0.92  0.92  0.95  0.95  ...  0.96 -0.34  0.95  0.94  0.91
    x7   0.95  0.99  1.00  0.99  0.99  ...  0.99 -0.15  0.89  1.00  0.99
    x8