--- title: "An Introduction to the PhylogeneticEM Package" author: "Paul Bastide" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{TreeSim} --- ## Introduction The `PhylogeneticEM` package is designed to automatically detect shifts in quantitative traits. Simply load the package with: ```{r, label="load", message=FALSE} library(PhylogeneticEM) ``` ## Simulation of a dataset To show how the functions work, let's first simulate some traits with known parameters. We use function `sim.bd.taxa.age` from package `TreeSim` to simulate a phylogenetic tree. ```{r, label="Simus_tree", eval = FALSE} set.seed(17920902) ntaxa = 80 tree <- TreeSim::sim.bd.taxa.age(n = ntaxa, numbsim = 1, lambda = 0.1, mu = 0, age = 1, mrca = TRUE)[[1]] ``` ```{r label="Simus_tree_int", echo=FALSE} set.seed(17920902) ntaxa = 80 if (!requireNamespace("TreeSim", quietly = TRUE)) { tree <- ape::rphylo(n = ntaxa, birth = 0.1, death = 0, T0 = 1) tree$edge.length <- tree$edge.length / max(ape::vcv(tree)) } else { tree <- TreeSim::sim.bd.taxa.age(n = ntaxa, numbsim = 1, lambda = 0.1, mu = 0, age = 1, mrca = TRUE)[[1]] } ``` We then choose a set of parameters, using function `params_process`: ```{r, label="Parameters"} params <- params_process("OU", ## Process p = 2, ## Dimension variance = diag(0.5, 2, 2) + 0.5, ## Rate matrix selection.strength = 3, ## Selection Strength random = TRUE, ## Root is random stationary.root = TRUE, ## Root is stationary edges = c(16, 81, 124), ## Positions of the shifts values = cbind(c(5, 4), ## Values of the shifts c(-4, -5), c(5, -3))) ``` Note that here, we took the selection strength to a scalar matrix, but we could specify a full matrix. Also, we did not specified the ancestral values of the optimum states: these are took to be $0$ by default. The position of the shifts is specified through to the number of the edge where it occurs. These can be saw using a standard `plot` from package `ape`. Once the parameters are constructed, we can check the position of the shifts by plotting them: ```{r, fig.show='hold', fig.height=3, fig.width=3.4} plot(params, phylo = tree, traits = 1, value_in_box = TRUE, shifts_bg = "white") plot(params, phylo = tree, traits = 2, value_in_box = TRUE, shifts_bg = "white") ``` We can then simulate a process along the tree with these parameters: ```{r, label="Simu_process", message=FALSE} sim <- simul_process(params, tree) ``` We can then extract a matrix of simulated data from this object: ```{r, label="extraction"} data <- extract(sim, ## The simul_process object what = "states", ## We want the actual values where = "tips") ## Only at the tips of the tree ``` The column names of the matrix are set to the tip labels of the tree. By default, the row names are left blank, but, for the sake of the demonstration, let's assume that the two traits are named "A" and "B": ```{r, label="rownames"} rownames(data) <- c("A", "B") ``` Finally, we can plot the simulated process on the tree to see what it looks like: ```{r, fig.show='hold', fig.height=4, fig.width=7} plot(params, phylo = tree, data = data) ``` ## Inference In this section, we take the traits as simulated by function `simul_process`, and use function `PhyloEM` for the inference of the parameters. To make things more interesting, let's assume that we have some missing data: ```{r} nMiss <- floor(ntaxa * 2 * 0.1) ## 10% of missing data miss <- sample(1:(2 * ntaxa), nMiss, replace = FALSE) ## sample missing randomly chars <- (miss - 1) %% 2 + 1 ## Trace back rows and columns tips <- (miss - 1) %/% 2 + 1 for (i in 1:nMiss){ data[chars[i], tips[i]] <- NA ## Forget some values } ``` By default, the inference is done on an automatically computed grid of alpha values. Here, to keep the computations to a minimum, we will only use $2$ values for alpha, including the true value. In a true application however, a grid with $10$ values of $\alpha$ is automatically chosen. ```{r, label="Fit_EM", warning=FALSE, eval = FALSE} ## Grid on alpha alpha_grid <- c(1, 3) ## Run algorithm res <- PhyloEM(phylo = tree, Y_data = data, process = "scOU", ## scalar OU model random.root = TRUE, ## Root is stationary (true model) stationary.root = TRUE, alpha = alpha_grid, ## On a grid of alpha K_max = 10, ## Maximal number of shifts parallel_alpha = TRUE, ## This can be set to TRUE for Ncores = 2) ## parallel computations res ``` ```{r label="Fit_EM_int", echo=FALSE, warning = FALSE} ## Grid on alpha alpha_grid <- c(1, 3) ## Run algorithm res <- PhyloEM(phylo = tree, Y_data = data, process = "scOU", ## scalar OU model random.root = TRUE, ## Root is stationary (true model) stationary.root = TRUE, alpha = alpha_grid, ## On a grid of alpha K_max = 10, ## Maximal number of shifts parallel_alpha = FALSE) ## This can be set to TRUE for res ``` The fitted object contains the inferred parameters for every values of $\alpha$ and $K$, as well as the parameters selected by the various model selection criteria. Here, we can see that the original shifts were recovered by the model selection procedure. We can plot the obtained solution, as shown below. The inferred states a the tips are shown in dotted lines. Notice that they are coherent with the positions of the shifts that we found. ```{r, fig.show='hold', fig.height=4, fig.width=7, warning=FALSE} plot(res) ``` The function above automatically plots the solution selected by the model selection criterion. We can plot this criterion, to see whether the solution selected is a clear minimum. ```{r, fig.show='hold', fig.height=4, fig.width=4, warning=FALSE} plot_criterion(res) ``` Here, the criterion behaves rather well, and the solution for $K=3$ seems to be a good choice. Above, we plotted the solution given by the LINselect method (the default). But we might want to see the solution given by another selection method, such as the slope heuristic. In that case, we just need to run: ```{r, eval=FALSE} plot(res, params = params_process(res, method.selection = "DDSE")) ``` where we extracted the parameters with the method `params_process`. Here, the results are the same. We can also extract the solution for a given number of shifts and/or a given value of alpha. For instance, we can see that in this simple case, the inferred position of the shifts is robust to the value of the selection strength: ```{r, fig.show='hold', fig.height=4, fig.width=7, warning=FALSE} plot(res, params = params_process(res, K = 3, alpha = 1)) ``` The inferred shifts values however are quite different: ```{r} params_process(res, K = 3, alpha = 1)$shifts params_process(res, K = 3, alpha = 3)$shifts ``` ## Equivalent Solutions When there are too many shifts, the solution might not be identifiable. For instance here, if we take the (degenerate) solution with $6$ shifts, we get this warning: ```{r} params_6 <- params_process(res, K = 6) ``` And we can plot all the equivalent shifts allocations: ```{r, fig.show='hold', fig.height=4, fig.width=8, warning=FALSE} plot(equivalent_shifts(tree, params_6)) ``` the default only shows the shifts values for the first trait.