Supplementary Material, Current Metabolomics, 2013, Vol. 1, No. 2

Metabolomic Univariate & Multivariate Analysis (muma)

TUTORIAL

TABLE OF CONTENTS

muma overview
Functions list
Download and Installation
Dataset format
Analysis procedure
1| Create the working directory
2| Start the analysis
3| Principal Component Analysis Score and Loading plots
4| Univariate Analysis
5| Merge univariate and multivariate information
6| Partial Least Square Discriminant Analysis (PLS-DA)
7| Orthogonal Projection to Latent Structures - Discriminant Analysis (OPLS-DA)
8| Tools for NMR molecular assignment and data interpretation
   A| Statistical TOtal Correlation SpectroscopY (STOCSY)
   B| STOCSY 1D
   C| Orthogonal Signal Correction (OSC) STOCSY
   D| Ratio Analysis NMR SpectroscopY (RANSY)
References

muma overview
muma is a tool for the multivariate and univariate statistical analysis of metabolomic data, written as an add-on package for the open-source software R. With this statistical protocol we wanted to provide guidelines for the whole process of metabolomic data interpretation, from data pre-processing, to dataset exploration and visualization, to the identification of potentially interesting variables (or metabolites). To do so, we implemented the steps typically used in metabolomic analyses and added some new features that facilitate the user's work. muma is designed for people who are not R experts but want to perform a statistical analysis in a very short time and with reliable results.

Although muma has been designed for the analysis of metabolomic data generated with different analytical platforms (NMR, MS, NIR, ...), it provides specific methods supporting NMR-based metabolomics. In particular, muma is equipped with two tools (STOCSY and RANSY) that aid the identification and assignment of molecules present in NMR spectra, or suggest possible biochemical interactions between different molecules.

In this tutorial we provide a workflow for metabolomic data interpretation using muma, from the installation, to the specific usage of each of muma's functions, to the recovery of all the results generated. Enjoy.
Functions list

work.dir() - Generate a working directory within which all the generated files are stored.
explore.data() - Perform data pre-processing (normalization, scaling) and data exploration through PCA.
Plot.pca() - Plot the PCA Score and Loading plots for specified principal components.
plsda() - Perform PLS-DA.
univariate() - Perform an array of univariate statistical techniques.
Plot.plsda() - Plot the PLS-DA Score and w*c plots, for specified components.
oplsda() - Perform OPLS-DA.
stocsy() - Perform STOCSY analysis.
stocsy.1d() - Perform monodimensional STOCSY analysis.
ostocsy() - Perform STOCSY analysis on the OSC-filtered dataset.
ransy() - Perform RANSY analysis.
Download and Installation

First of all, download R (version 2.15 or higher) from CRAN (www.r-project.org), according to your operating system (Unix, MacOS or Windows). Install R as indicated in the R manual. You can open R with its graphical interface or from the command line: shell (Unix), Terminal (MacOS) or DOS (Windows).

After you have installed and launched R, you can install the package muma, as described in Figure 1. You can install muma by typing the command

install.packages("muma")

and by choosing your CRAN mirror from the browser (Figure 1).

FIGURE 1

The installation may fail with certain R versions. In this case, it should be sufficient to install the following packages prior to the installation of muma:

install.packages("mvtnorm")
install.packages("robustbase")
install.packages("gtools")
install.packages("bitops")
install.packages("caTools")

and then run the command

install.packages("muma")

Once muma is installed you can load the package by typing library(muma).
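If you prefer to script the installation, the steps above can be combined as follows; this is just a convenience sketch, assuming the default CRAN mirror and that the packages listed above are the only dependencies that may be missing:

# Sketch: install muma and the listed dependencies only if they are not already present.
pkgs <- c("mvtnorm", "robustbase", "gtools", "bitops", "caTools", "muma")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
}
library(muma)  # load the package for the current session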
Dataset format

The data table of interest has to be submitted in .csv format and with a specific structure, as indicated in Figure 2.

FIGURE 2

In particular:
- the first column indicates the name of every sample (NOTE: these must be different from each other, even if samples belong to the same class; moreover, for an optimal graphical visualization, short names (4-5 characters) are recommended);
- the second column indicates the "Class" of each sample, as a positive integer starting from 1;
- from the third column to column N the data values of each sample are reported, for each variable;
- the first row is treated as the header; it must provide the variable names, each different from the others.

The dataset in Figure 2 is provided with this tutorial and derives from a metabolomics analysis of B cell cultures, either untreated or after one, two, three and four days of LPS treatment (Garcia-Manteiga et al, 2011). As can be observed from Figure 2, the "Class" column is filled according to the day of treatment.
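To make the expected layout concrete, the first rows of a table following these rules could look like the snippet below. The sample names, metabolite names and values are invented for illustration and are not taken from the tutorial dataset; the header labels of the first two columns may differ in your own file.

Samples,Class,Ala,Lac,Glc,Glu
c1d0,1,0.012,0.340,1.250,0.087
c2d0,1,0.015,0.332,1.190,0.090
t1d1,2,0.021,0.415,0.980,0.110
t2d1,2,0.019,0.402,1.010,0.105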
Analysis procedure

To start the analysis, move to the directory in which you have stored your data table, by selecting the option "Change Working Directory" from the "Misc" menu of the R Console. If you are not using the R Console but decided to launch R from the command line, just navigate to the directory in which your data table is stored, then launch R with the command R.

1| Create the working directory

Before starting the analysis it is recommended to create a new directory that will become the working directory from now on. This is recommended because muma generates several files and directories, which are best kept together in a single place; all the results created by muma's analyses will be stored here. You can use the function

work.dir(dir.name = "WorkDir")

to create a new working directory, as indicated in Figure 3.

FIGURE 3

As can be observed, a directory called "WorkDir" has been created and all the files present in the original directory are copied into the newly generated one. This directory automatically becomes the current working directory.
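As an illustration of the behaviour described above (create the directory, copy the files into it, and make it the current working directory), a rough base-R equivalent could look as follows; this sketch is not the actual implementation of work.dir():

# Rough base-R illustration of what work.dir() is described to do (not the actual muma code).
make_work_dir <- function(dir.name = "WorkDir") {
  dir.create(dir.name, showWarnings = FALSE)      # create the new working directory
  files <- setdiff(list.files("."), dir.name)     # everything except the new directory itself
  file.copy(files, dir.name)                      # copy the files into it
  setwd(dir.name)                                 # make it the current working directory
}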
2| Start the analysis

The first step of a muma analysis can be performed with the function explore.data(), which provides data pre-processing and dataset exploration. Figure 4 shows a typical usage of this function. In particular, it can be called in the following way (see also the example after the list of output directories below):

explore.data(file = "YourFile.csv", scaling = "ScalingType", scal = TRUE, normalize = TRUE, imputation = FALSE, imput = "ImputType")

This function generates three new directories:
- "Groups", in which the samples of each group, as identified by the "Class" column of the data table, are stored;
- "PCA_Data_scalingused", in which the principal component analysis files are stored, such as the matrices of score and loading values, as well as all the PCA-related plots and graphics. Note: this directory is given a different name according to the scaling used;
- "Preprocessing_Data_scalingused", in which all the files used for preprocessing the dataset are stored, such as the normalized and scaled tables. Note: this directory is given a different name according to the scaling used.
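For example, with the tutorial dataset and Pareto scaling the call could be written as below; the choice of scaling here is purely illustrative, and the remaining arguments are assumed to keep their defaults (as in the shorter calls shown later in this section):

# Example call on the tutorial dataset, using Pareto scaling and no imputation.
explore.data(file = "MetaBc.csv", scaling = "pareto", scal = TRUE,
             normalize = TRUE, imputation = FALSE)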
FIGURE 4

A| In particular, this function reads the data table and converts all negative values to 0, because metabolomic measurements resulting in negative values are considered noise or errors and are therefore brought to a null baseline. A table called "NegativeValues.out", reporting the negative values found, is written and saved in the directory "Preprocessing_Data_scalingused".

There is also the possibility to impute a data table with missing values. The field "imputation" is FALSE by default, but turning it to TRUE allows the substitution of missing values according to a specified option. There are four options for imputation, which can be specified in the field "imput":
- mean: missing values are imputed with the average value of the other observations;
- minimum: missing values are imputed with the minimum value among the other observations;
- half.minimum: missing values are imputed with half of the minimum value among the other observations;
- zero: missing values are imputed with a zero value.

Reports on which values have been imputed are printed to screen, and a file called "ImputedMatrix.csv", reporting the matrix with imputed values, is written and saved in the directory "Preprocessing_Data_scalingused".

Moreover, a control on the proportion of missing values of each variable has been implemented: when a variable shows a proportion of missing values higher than 80%, that variable is eliminated, as it is considered not informative. Warnings listing the eliminated variables are reported at the end of the function. A sketch of this imputation and filtering logic is given below.
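The sketch below illustrates, in plain R, the behaviour just described (removal of variables with more than 80% missing values, followed by half-minimum imputation); it is a conceptual illustration, not the code used inside explore.data():

# Conceptual sketch of the missing-value handling described above (not muma's internal code).
# 'mat' is assumed to be a numeric matrix with samples in rows and variables in columns.
impute_and_filter <- function(mat, threshold = 0.8) {
  prop_missing <- colMeans(is.na(mat))                 # proportion of missing values per variable
  dropped <- colnames(mat)[prop_missing > threshold]
  if (length(dropped) > 0) {
    warning("Eliminated variables: ", paste(dropped, collapse = ", "))
  }
  mat <- mat[, prop_missing <= threshold, drop = FALSE]
  for (j in seq_len(ncol(mat))) {                      # "half.minimum" imputation
    nas <- is.na(mat[, j])
    if (any(nas)) {
      mat[nas, j] <- min(mat[, j], na.rm = TRUE) / 2
    }
  }
  mat
}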
B| The function then normalizes each sample on the total spectrum: this is achieved by calculating the sum of all variables within a spectrum and normalizing each spectrum by this value; in this way every single variable is expressed as a fraction of the total spectral area or intensity. A table called "ProcessedTable.csv", reporting the normalized values, is written and saved in the directory "Preprocessing_Data_scalingused".

As this process can influence the outcome of the following analyses, normalization can be skipped by turning the field "normalize" to FALSE. This has been implemented for those data tables that are already normalized or that do not require normalization.
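In other words, each value is divided by the sum of its own spectrum (row); a minimal sketch of this operation, assuming samples in rows and variables in columns, is:

# Minimal sketch of total-spectrum normalization: each sample (row) is divided by its own sum,
# so that every variable becomes a fraction of the total spectral area or intensity.
normalize_total <- function(mat) {
  sweep(mat, 1, rowSums(mat), FUN = "/")
}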
The function then performs automatic centering and scaling of each variable, according to the scaling type specified. There are five scaling options the user can choose from:
- pareto scaling
- auto scaling
- vast scaling
- range scaling
- median scaling

These options are not case sensitive, so you can use, for example, either "Pareto" or "pareto", as well as "P" or "p":

explore.data(file = "MetaBc.csv", scaling = "pareto")

or

explore.data(file = "MetaBc.csv", scaling = "p")

A table called "ProcessedTable.csv", reporting the scaled values, is written and saved in the directory "Preprocessing_Data_scalingused".

As with normalization, the scaling step can influence subsequent analyses, therefore it can be avoided by turning the field "scal" to FALSE. In this case the field "scaling" can of course be skipped.
THEORY: Centering and scaling

The theory sections provided within this tutorial are meant to introduce the user to the theory behind the data treatment techniques proposed. Far from being a thorough description of the statistics applied here, these sections aim to explain the basic concepts muma's tools rely on and to help the user exploit those tools in the most suitable way.

CENTERING: converts the variables from fluctuations around the mean into fluctuations around zero. It flattens out the differences between highly and lowly abundant metabolites. Disadvantages: may not be sufficient with heteroscedastic data.

Scalings

AUTOSCALING: also called Unit Variance scaling, it uses the standard deviation as the scaling factor. After this procedure, each variable has a standard deviation of one and becomes equally important. Disadvantages: inflation of the measurement errors; when applied before PCA, it can make the interpretation of the loading plots difficult, as a large number of metabolites will have high loading values.

PARETO: similar to autoscaling, but the square root of the standard deviation is used as the scaling factor. Highly varying metabolites are decreased more than lowly varying ones, and the data stay closer to the original measurements than with autoscaling. Disadvantages: sensitive to large fold changes.

VAST: an extension of autoscaling, it focuses on the stable variables, i.e. those variables that change less. It uses the standard deviation and the coefficient of variation as scaling factors, so that metabolites with a small relative standard deviation become more important. Disadvantages: not suited for large induced variation without group structure.

RANGE: it uses the value range as the scaling factor, so that metabolites are compared according to the induced biological response. Disadvantages: inflation of the measurement errors and sensitivity to outliers.

MEDIAN: also called central tendency scaling, this operation makes the median of each sample equivalent. This scaling is used when only a few metabolites are expected to change, but non-biological, sample-dependent factors may influence data interpretation. Disadvantages: low reliability with datasets having a high proportion of responding variables.

For a more complete introduction to scalings and other data pretreatment techniques used in metabolomics, please refer to (van den Berg et al, 2006).
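For reference, the variable-wise scalings above can be written down explicitly; the sketch below follows the definitions in van den Berg et al. (2006) for a single variable x with mean m and standard deviation s, and is not the internal code of explore.data(). Median scaling acts on samples (rows) rather than on variables, so it is omitted here.

# Sketch of the variable-wise scaling formulas discussed above (x is one variable/column).
scale_variable <- function(x, method = c("auto", "pareto", "vast", "range")) {
  method <- match.arg(method)
  m <- mean(x)
  s <- sd(x)
  switch(method,
         auto   = (x - m) / s,                    # unit variance scaling
         pareto = (x - m) / sqrt(s),              # square root of the standard deviation
         vast   = ((x - m) / s) * (m / s),        # autoscaling weighted by mean/sd (1/CV)
         range  = (x - m) / (max(x) - min(x)))    # value range as the scaling factor
}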
C| Principal Component Analysis (PCA) is performed on the normalized/scaled table, and the score plots of each pairwise comparison of the first ten principal components are returned (when the number of principal components is at least 10; otherwise all pairwise comparisons of components are plotted) (Figure 5, right panel). Together with these, a screeplot (Figure 5, left panel) is created, in order to provide the user with information about the importance of each principal component. These plots are displayed on screen and automatically saved in the directory "PCA_Data_scalingused", with the names "First_10_Components_scalingused" and "Screeplot_scalingused", respectively.

FIGURE 5 (left: screeplot, proportion of variance explained by the principal components; right: pairwise score plots)
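For orientation only, the two kinds of plot produced in this step can be mimicked with base R's prcomp; the sketch below uses a hypothetical object name ('processed') for the pre-treated data matrix and does not reproduce muma's own plots or file output:

# Illustrative sketch of a screeplot and a score plot with base R (not muma's plotting code).
# 'processed' stands for the already normalized/scaled data matrix, samples in rows.
pca <- prcomp(processed, center = FALSE, scale. = FALSE)   # data are assumed pre-treated
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)        # variance explained per component
barplot(var_explained, names.arg = seq_along(var_explained),
        xlab = "Principal Components", ylab = "Proportion of Variance explained",
        main = "Screeplot")
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2",   # score plot of the first two PCs
     main = "PCA Score plot")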
D| In order to help the user choose the "best" pair of principal components to visualize, a specific tool has been implemented that calculates the statistical significance of the cluster separation (Goodpaster et al, 2011) obtained with each pair of principal components. In other words, groups will be more or less separated from each other depending on the pair of principal components considered, and this cluster separation is tested for its statistical significance. A ranking of the five best-separating pairs of principal components is printed to screen, reporting the numbers of the principal components, the p-value calculated from the F statistics and the proportion of variance explained by each pair of components (Figure 6). The p-value shown is the sum of all p-values from the cluster separation statistics: the lower the p-value, the better the separation ability. Thanks to this ranking, one can choose the best pair of PCs according to both their "separation capacity" and the proportion of variance explained.

Two files deriving from this technique are saved in the directory "PCA_Data_scalingused", one listing the F statistics values for each pair of components and named "PCs_Fstatistic.out", and one ranking all the co