## Usage

```
clustering(
  .data,
  ...,
  by = NULL,
  scale = FALSE,
  selvar = FALSE,
  verbose = TRUE,
  distmethod = "euclidean",
  clustmethod = "average",
  nclust = NA
)
```

## Arguments

- .data
The data to be analyzed. It can be a data frame, possibly with grouped data passed from `dplyr::group_by()`.

- ...
The variables in `.data` used to compute the distances. Defaults to `NULL`, i.e., all the numeric variables in `.data` are used.

- by
One variable (factor) to compute the function by. It is a shortcut to `dplyr::group_by()`. To compute the statistics by more than one grouping variable, use that function.

- scale
Should the data be scaled before computing the distances? Defaults to `FALSE`. If `TRUE`, each observation is divided by the standard deviation of its variable: \(Z_{ij} = X_{ij} / sd_j\).

- selvar
Logical argument, defaults to `FALSE`. If `TRUE`, an algorithm for selecting variables is run. See the section **Details** for additional information.

- verbose
Logical argument. If `TRUE` (default), the results of the variable selection are shown in the console.

- distmethod
The distance measure to be used. This must be one of `'euclidean'`, `'maximum'`, `'manhattan'`, `'canberra'`, `'binary'`, `'minkowski'`, `'pearson'`, `'spearman'`, or `'kendall'`. The last three are correlation-based distances.

- clustmethod
The agglomeration method to be used. This should be one of `'ward.D'`, `'ward.D2'`, `'single'`, `'complete'`, `'average'` (= UPGMA), `'mcquitty'` (= WPGMA), `'median'` (= WPGMC), or `'centroid'` (= UPGMC).

- nclust
The number of clusters to be formed. Defaults to `NA`.
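
As a rough illustration of how `scale`, `distmethod`, `clustmethod`, and `nclust` relate to one another, the sketch below reproduces the same sequence of operations with base R only. It assumes that the option names map onto `stats::dist()` and `stats::hclust()`, which this page does not state explicitly, and it uses the built-in `USArrests` data as a stand-in for `.data`. It is not the implementation used by `clustering()`.

```r
# Stand-in data (assumption): any all-numeric data frame or matrix would do
x <- as.matrix(USArrests)

# scale = TRUE: divide each column by its standard deviation, no centering
# (Z_ij = X_ij / sd_j)
sds <- apply(x, 2, sd)
z <- sweep(x, 2, sds, "/")

# distmethod = "euclidean" (assumed to correspond to stats::dist)
d <- dist(z, method = "euclidean")

# clustmethod = "average" (assumed to correspond to stats::hclust)
hc <- hclust(d, method = "average")

# nclust = 4: cut the tree into four groups
groups <- cutree(hc, k = 4)
table(groups)
```

Because the columns are only divided by their standard deviations, each scaled column has unit standard deviation but keeps a non-zero mean.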

## Value

- **data** The data that was used to compute the distances.
- **cutpoint** The cutpoint of the dendrogram according to Mojena (1977).
- **distance** The matrix with the distances.
- **de** The distances in an object of class `dist`.
- **hc** The hierarchical clustering.
- **Sqt** The total sum of squares.
- **tab** A table with the clusters and similarity.
- **clusters** The sum of squares and the mean of the clusters for each variable.
- **cofgrap** If `selvar = TRUE`, `cofgrap` is a ggplot2-based graphic showing the cophenetic correlation for each model (with a different number of variables). Otherwise, it is a `NULL` object.
- **statistics** If `selvar = TRUE`, `statistics` shows a summary of the models fitted with different numbers of variables, including the cophenetic correlation, Mantel's correlation with the original distances (all variables), and the p-value associated with Mantel's test. Otherwise, it is a `NULL` object.
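
To make the `distance`, `de`, `hc`, and `cutpoint` elements more concrete, here is a minimal base R sketch of how such objects can be obtained. The cutpoint uses the commonly cited form of Mojena's stopping rule, mean fusion height plus `k` standard deviations. The value `k = 1.25` is an assumption made for illustration; this page does not say which constant `clustering()` uses.

```r
# Stand-in data (assumption): built-in USArrests
de <- dist(USArrests)                 # distances as an object of class "dist"
distance <- as.matrix(de)             # the same distances as a square matrix
hc <- hclust(de, method = "average")  # hierarchical clustering

# Mojena-style cutpoint: mean + k * sd of the fusion heights
k <- 1.25                             # assumed constant, not taken from this page
cutpoint <- mean(hc$height) + k * sd(hc$height)

# Number of groups obtained by cutting the dendrogram at that height
n_groups <- length(unique(cutree(hc, h = cutpoint)))
n_groups
```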

## Details

When `selvar = TRUE`, a variable selection algorithm is executed. The
objective is to select the group of variables that contributes most to
explaining the variability of the original data. The selection is based on
an eigenvalue/eigenvector solution and follows these steps:

1. Compute the distance matrix and the cophenetic correlation with the original variables (all numeric variables in the dataset);

2. Compute the eigenvalues and eigenvectors of the correlation matrix between the variables;

3. Delete the variable with the largest weight (the highest element of the eigenvector associated with the lowest eigenvalue);

4. Compute the distance matrix and the cophenetic correlation with the remaining variables;

5. Compute Mantel's correlation between the obtained distance matrix and the original distance matrix;

6. Iterate steps 2 to 5 *p* - 2 times, where *p* is the number of original variables.

At the end of the *p* - 2 iterations, a summary of the models is returned.
The distance is calculated with the variables that generated the model with
the largest cophenetic correlation. I suggest a careful evaluation aimed at
choosing a parsimonious model, i.e., the one with the fewest variables that
still presents an acceptable cophenetic correlation and high similarity with
the original distances.
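
The selection procedure described in this section can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by `clustering()`: the helper names `coph_cor()` and `select_vars_sketch()` are hypothetical, the "largest weight" is read as the largest absolute loading, Euclidean distance and average linkage are fixed, and Mantel's correlation is taken as the Pearson correlation between the two sets of pairwise distances. The permutation p-value of Mantel's test is omitted.

```r
# Hypothetical helper: cophenetic correlation of an average-linkage tree
coph_cor <- function(d) {
  hc <- hclust(d, method = "average")
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

# Hypothetical helper: backward elimination over p - 2 iterations
select_vars_sketch <- function(x) {
  x <- as.data.frame(x)
  p <- ncol(x)
  d_full <- dist(x)                         # step 1: all variables
  res <- data.frame(excluded = "-",
                    cophenetic = coph_cor(d_full),
                    remaining = p,
                    cormantel = 1)
  for (i in seq_len(p - 2)) {
    e <- eigen(cor(x))                      # step 2: eigenvalues sorted decreasingly
    w <- abs(e$vectors[, ncol(x)])          # loadings for the smallest eigenvalue
    out_var <- colnames(x)[which.max(w)]    # step 3: variable to delete
    x <- x[, colnames(x) != out_var, drop = FALSE]
    d <- dist(x)                            # step 4: remaining variables
    res <- rbind(res, data.frame(
      excluded = out_var,
      cophenetic = coph_cor(d),
      remaining = ncol(x),
      cormantel = cor(as.vector(d_full), as.vector(d))  # step 5
    ))
  }
  res
}

# Built-in mtcars stands in for the numeric variables of `.data`
sel <- select_vars_sketch(mtcars)
sel[which.max(sel$cophenetic), ]            # model with the largest cophenetic correlation
```

With *p* original variables, the returned table has *p* - 1 rows: the full model plus one row per iteration, ending with two remaining variables.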

## References

Mojena, R. 1977. Hierarchical grouping methods and stopping rules: an evaluation. Comput. J. 20:359-363. doi:10.1093/comjnl/20.4.359

## Author

Tiago Olivoto tiagoolivoto@gmail.com

## Examples

```
library(metan)

# All rows and all numeric variables from data
d1 <- clustering(data_ge2)

# Based on the mean for each genotype
mean_gen <-
  data_ge2 %>%
  means_by(GEN) %>%
  column_to_rownames("GEN")
d2 <- clustering(mean_gen)

# Select variables to compute the distances
d3 <- clustering(mean_gen, selvar = TRUE)
#> EH excluded in this step |=== | 7%
#> EP excluded in this step |======= | 14%
#> CDED excluded in this step |========== | 21%
#> PH excluded in this step |============== | 29%
#> CL excluded in this step |================= | 36%
#> NR excluded in this step |===================== | 43%
#> PERK excluded in this step |======================= | 50%
#> EL excluded in this step |=========================== | 57%
#> CD excluded in this step |=============================== | 64%
#> ED excluded in this step |================================== | 71%
#> KW excluded in this step |====================================== | 79%
#> CW excluded in this step |========================================= | 86%
#> NKR excluded in this step |============================================ | 93%
#> TKW excluded in this step |===============================================| 100%
#> --------------------------------------------------------------------------
#>
#> Summary of the adjusted models
#> --------------------------------------------------------------------------
#>     Model excluded cophenetic remaining cormantel    pvmantel
#>   Model 1        -  0.8656190        15 1.0000000 0.000999001
#>   Model 2       EH  0.8656191        14 1.0000000 0.000999001
#>   Model 3       EP  0.8656191        13 1.0000000 0.000999001
#>   Model 4     CDED  0.8656191        12 1.0000000 0.000999001
#>   Model 5       PH  0.8656189        11 1.0000000 0.000999001
#>   Model 6       CL  0.8655939        10 0.9999996 0.000999001
#>   Model 7       NR  0.8656719         9 0.9999982 0.000999001
#>   Model 8     PERK  0.8657259         8 0.9999977 0.000999001
#>   Model 9       EL  0.8657904         7 0.9999972 0.000999001
#>  Model 10       CD  0.8658997         6 0.9999964 0.000999001
#>  Model 11       ED  0.8658274         5 0.9999931 0.000999001
#>  Model 12       KW  0.8643556         4 0.9929266 0.000999001
#>  Model 13       CW  0.8640355         3 0.9927593 0.000999001
#>  Model 14      NKR  0.8648384         2 0.9925396 0.000999001
#> --------------------------------------------------------------------------
#> Suggested variables to be used in the analysis
#> --------------------------------------------------------------------------
#> The clustering was calculated with the Model 10
#> The variables included in this model were...
#> ED CW KW NKR TKW NKE
#> --------------------------------------------------------------------------
#>

# Compute the distances with standardized data
# Define 4 clusters
d4 <- clustering(data_ge,
                 by = ENV,
                 scale = TRUE,
                 nclust = 4)
```