Class assignments for the subjects in these test data sets were predicted using the model parameters of the corresponding training data set. Hence, the prediction was based on the same number of LVs as was used for the training set. The error rate of the test data set, being the percentage of misclassified subjects, was calculated and used as a measure of the generalizability of the model.
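As a minimal sketch of the calculation described above, the test error rate is the percentage of subjects whose predicted class differs from their true class. The function name and the example labels are illustrative assumptions and are not taken from the study.

```python
def error_rate(true_labels, predicted_labels):
    """Return the percentage of subjects whose predicted class is wrong."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * wrong / len(true_labels)


# Hypothetical test set of four subjects, one of which is misclassified.
true_classes = ["lean", "obese", "lean", "obese"]
predicted_classes = ["lean", "lean", "lean", "obese"]
print(error_rate(true_classes, predicted_classes))  # prints 25.0
```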
Ideally, the test error rates are comparable to the ones found by fold cross-validation. As this model is based on all lean and obese subjects, this model is considered to be the reference model. Per data set and per model, the error rate based on the fold cross-validation, the number of used LVs and the evaluation of the permutation test are given.
The mean and standard deviation of the error rate and the mean number of LVs per data set are also presented. The mean cross-validation error rate and the variance of the error rate both increase as the number of subjects in the data set decreases. The jack-knife results confirm the above-described discrepancy between the conclusions based on both sets of data. The 10 largest regression coefficients found in the reference model of data were considered to be the most important variables for the discrimination between the two groups, and therefore only these 10 were used to evaluate the jack-knife results.
The coefficients of the 4th selection of data showed a lot of variation, whereas the coefficients of the 10th selection showed only little variation but were almost all equal to zero. This finding confirms that neither set can be expected to be representative of the total set of 50 lean and 50 obese subjects. The test data sets were used to determine the generalizability of the models.
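The jack-knife idea discussed above can be sketched as follows: each subject is left out once, the coefficients are recomputed, and the mean and standard deviation across the replicates indicate how stable each coefficient is. This is an illustration only. The per-variable difference between class means is used here as a simple stand-in for the PLS-DA regression coefficients, and all names and data are hypothetical.

```python
from statistics import mean, stdev


def class_mean_difference(xs, ys, positive="obese"):
    """Per-variable difference between class means (stand-in for a regression coefficient)."""
    pos = [row for row, label in zip(xs, ys) if label == positive]
    neg = [row for row, label in zip(xs, ys) if label != positive]
    return [mean(p) - mean(n) for p, n in zip(zip(*pos), zip(*neg))]


def jackknife_coefficients(xs, ys):
    """Leave each subject out once; return (mean, standard deviation) per variable."""
    replicates = [
        class_mean_difference(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]
    return [(mean(col), stdev(col)) for col in zip(*replicates)]


# Hypothetical data: variable 1 separates the groups, variable 2 is constant.
xs = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [11.0, 5.0], [12.0, 5.0], [13.0, 5.0]]
ys = ["lean"] * 3 + ["obese"] * 3
for index, (avg, sd) in enumerate(jackknife_coefficients(xs, ys), start=1):
    print(f"variable {index}: mean={avg:.2f}, sd={sd:.2f}")
```

A coefficient with a large standard deviation relative to its mean is unstable, and a coefficient that stays near zero with little variation carries no discriminating information. These two cases correspond to the patterns described for the 4th and 10th selections.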
The number of LVs was based on the number of LVs used for modelling the training data set. Although the mean levels of the error rates are similar, the rates are more variable compared to the original test data sets, due to the smaller size of the extra test data sets.
The results are predominantly driven by the size of the training data set and the selection of the subjects in that data set, which is especially illustrated by the smaller training data sets. The mean cross-validation error rate increases as the number of subjects in the training data set decreases. In itself this is not a spectacular finding: a model based on a larger training data set can be determined more precisely than a model based on a smaller data set. On the other hand, the larger the test data set, the more precisely the mean test error rate can be estimated.
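The relation between test set size and the precision of the estimated error rate can be illustrated with the binomial standard error. This sketch assumes that test subjects are classified independently with a common misclassification probability, which is a simplification, and the error rate of 30% is a hypothetical value chosen for illustration.

```python
from math import sqrt


def error_rate_standard_error(p, n):
    """Binomial standard error of an error rate p (as a fraction) estimated from n subjects."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must lie in [0, 1] and n must be positive")
    return sqrt(p * (1.0 - p) / n)


# Hypothetical error rate of 30%, estimated on test sets of 20 and 100 subjects.
for n in (20, 100):
    print(n, round(error_rate_standard_error(0.30, n), 3))
# prints:
# 20 0.102
# 100 0.046
```

Under these assumptions, a test set of 20 subjects gives a standard error of roughly 10 percentage points, so individual test error rates from small sets can differ considerably even when the underlying model is the same.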
Ideally, test error rates are of the same order as cross-validation error rates. The test set error rates and the cross-validation error rates were quite similar at a mean level, except for data. However, at the level of individual sets, the cross-validation error rate is in most cases not comparable to the test error rate.
This illustrates that the result depends crucially on the specific sample of subjects that was used for modelling. With only a small selection from a total population, it is more likely that the selected subjects are not representative of the studied population, because it is possible that only subjects from the extremes of the population distribution are selected.
This study shows that the selection of subjects is crucial for the conclusions that are drawn about the model. The effect is best seen in the results of data. The 10th selection of data had a much better cross-validation error rate for the training set than the 4th selection. If the 5 lean and 5 obese subjects of the 10th selection were selected as the representatives of the population under study, the conclusion would be that the 2 groups can be separated based on their LC-MS lipidomic profiles, even based on cross-validation results.
If the 10 subjects from set 4 were selected as the representatives of the population under study, the conclusion would be completely opposite. This means that the conclusions about the model depend completely on the selected 10 subjects. Nevertheless, the error rates based on the corresponding test sets were quite similar. As the predictability of both models was poor, neither set can be expected to be representative of the total set of 50 lean and 50 obese subjects.
This illustrates how things can go wrong when using data sets that have considerably fewer subjects than variables. It also shows the risk of drawing overly optimistic conclusions about the distinction between the two classes, even based on cross-validation results. The size of the test data set did not seem to be an issue, as the results of the extra test data sets of 10 obese and 10 lean subjects were similar to the results based on the original test data sets.
Because the two tools serve different purposes, the conclusions about model validity based on cross-validation are not always comparable to the conclusions drawn from the permutation test. The variation in performance of the permutation test was lower than the variation in error rates. The permutation test only assesses the significance of the classification and does not take the predictability into account, which can explain why a model with a high cross-validation error rate can perform well in the permutation test.
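The general logic of a label permutation test can be sketched as follows: the class labels are shuffled many times, the model is refitted each time, and the observed statistic is compared with the resulting null distribution. This is a hedged illustration and not the procedure used in the study. A nearest-centroid classifier stands in for PLS-DA, the leave-one-out error rate is used as the test statistic, and all names and data are hypothetical.

```python
import random
from statistics import mean


def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training centroid is closest (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_error_rate(xs, ys):
    """Leave-one-out cross-validation error rate in percent."""
    wrong = 0
    for i in range(len(xs)):
        train_x, train_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            wrong += 1
    return 100.0 * wrong / len(xs)


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Fraction of label permutations with an error rate at least as low as the observed one."""
    rng = random.Random(seed)
    observed = loo_error_rate(xs, ys)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if loo_error_rate(xs, shuffled) <= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (at_least_as_good + 1) / (n_perm + 1)


# Hypothetical, clearly separated one-variable data for 6 lean and 6 obese subjects.
xs = [[float(v)] for v in (0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15)]
ys = ["lean"] * 6 + ["obese"] * 6
print("observed error rate:", loo_error_rate(xs, ys))
print("permutation p-value:", permutation_p_value(xs, ys))
```

The p-value only indicates whether the observed error rate is better than what random labellings achieve. It says nothing about how low the error rate is in absolute terms, which is consistent with the observation that a model can pass the permutation test while having a high cross-validation error rate.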
All results indicate that cross-validation, jack-knifing and permutation tests are insufficient validation tools for megavariate data sets with only a few samples. The lower the ratio between the number of subjects and the number of variables, the less the validation results can be trusted. Taking only the results of these validation tools into account can be very misleading and may lead to incorrect conclusions.
In order to avoid these problems, the number of samples per group should be large enough. In the present study, the turning point seemed to be between the sets having 10 and 20 subjects per group and based on about variables. Due to practical or budgetary limitations, it is often impossible to include the number of subjects that would be necessary to avoid the problems presented above.
The disadvantage of the third and fifth approaches is that the variables are selected using MVA methods that use the full data, so problems similar to those mentioned above can affect the selection. Using this approach, the bias due to selection should be assessed and corrections should be made (Ambroise and McLachlan). In the case of a small number of subjects compared to the number of variables, contradictory results can be expected. Whether more simple statistical methods, e.
The performance is assessed using this specific megavariate metabolomics data, but it is expected that the conclusions will also carry over to many other research areas. It is possible that the findings would be less dramatic if data that represents larger differences between groups is used. The present study did not take the variable selection into account and only investigated the influence of the number of samples in the data sets. Future research may reveal the impact of the variable selection on the reliability of the standard statistical validation tools for megavariate data.
The lower the number of subjects compared to the number of variables, the less the outcome of validation tools such as cross-validation, jack-knifing and permutation tests can be trusted. The validation tools cannot be used as a warning mechanism for problems due to sample size or representativeness issues.
Published online Jul. Carina M. Rubingh, Sabina Bijlsma, Eduard P. Derks, Ivana Bobeldijk, Elwin R.
Verheij, Sunil Kochhar, and Age K. Received Jan 13; Accepted Mar.

Abstract
Statistical model validation tools such as cross-validation, jack-knifing model parameters and permutation tests are meant to obtain an objective assessment of the performance and stability of a statistical model.
Keywords: metabolomics, megavariate data, PLS-DA, cross-validation, permutation test, predictability, jack-knife.

Introduction
Metabolomics studies are performed to investigate the responses of biological systems to environmental influences due to, for instance, toxicological exposure, nutrition or medical treatment.
Materials and methods
Data
General
Although real-life data may lead to less distinguishing differences between sets, it was preferred over simulated data because it best illustrates the problems researchers have to deal with.

Data subsets
A data set was generated containing the data of 40 lean and 40 obese subjects data.
Figure: Illustration of the procedure that was followed to obtain the data sets.
Permutation test
Cross-validation can be used to assess the class-predictability of a model.

Predictability
Cross-validation, jack-knifing and the permutation test provide information about the validity of the model based on the information in the training data set.

Test sets
The test data sets were used to determine the generalizability of the models.
[Table: Model, Training data set, Size testset]

Discussion
The results are predominantly driven by the size of the training data set and the selection of the subjects in that data set, which is especially illustrated by the smaller training data sets.
Concluding remarks
The lower the number of subjects compared to the number of variables, the less the outcome of validation tools such as cross-validation, jack-knifing and permutation tests can be trusted.

References
Ambroise C. Selection bias in gene extraction on the basis of microarray gene-expression data.
Is cross-validation valid for small-sample microarray classification?
Partial least squares for discrimination.
Large scale human metabolomics studies. A strategy for data pre-processing and validation.
Estimating reaction time constants from a two-step reaction: comparison between two-way and three-way methods.
Fat oxidation before and after a high fat load in the obese insulin-resistant state.
Oxford: Pergamon Press;
An Introduction to the Bootstrap. Chapter 15
Fiehn O. Metabolomics - the link between genotypes and phenotypes. Plant Mol.
Partial least-squares regression: a tutorial.
New York: Springer-Verlag;
Modelling of spectroscopic batch process data using grey models to incorporate external information.
The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics.
Application to the study of heavy metal toxicity.
The dendrogram is found with the method given in the cluster argument using function hclust. The terminal segments hang to within-cluster dissimilarity. If some of the clusters are more heterogeneous than the combined class, the leaf segments are reversed. The histograms are based on dissimilarities, but are otherwise similar to those of Van Sickle and Hughes: a horizontal line is drawn at the level of mean between-cluster dissimilarity, and vertical lines connect within-cluster dissimilarities to this line.
This difference may be one of location (differences in mean) or one of spread (differences in within-group distance). That is, it may find a significant difference between two groups simply because one of those groups has greater dissimilarities among its sampling units. Most mrpp models can be analysed with adonis2, which does not seem to suffer from the same problems as mrpp and is a more robust alternative.
McCune and J. Analysis of Ecological Communities.
Mielke and K. Springer Series in Statistics.
Van Sickle and R. Hughes Classification strengths of ecoregions, catchments, and geographic clusters of aquatic vertebrates in Oregon.
Warton, D.
Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3.