---
title: "An Introduction to the PhylogeneticEM Package"
author: "Paul Bastide"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{TreeSim}
---

## Introduction

The `PhylogeneticEM` package is designed to automatically detect shifts in quantitative traits.

Simply load the package with:
```{r, label="load", message=FALSE}
library(PhylogeneticEM)
```

## Simulation of a dataset

To show how the functions work, let's first simulate some traits with known parameters.

We use function `sim.bd.taxa.age` from package `TreeSim` to simulate a phylogenetic tree.

```{r, label="Simus_tree", eval = FALSE}
set.seed(17920902)
ntaxa = 80
tree <- TreeSim::sim.bd.taxa.age(n = ntaxa, numbsim = 1, lambda = 0.1, mu = 0,
                                 age = 1, mrca = TRUE)[[1]]
```

```{r label="Simus_tree_int", echo=FALSE}
set.seed(17920902)
ntaxa = 80
if (!requireNamespace("TreeSim", quietly = TRUE)) {
  tree <- ape::rphylo(n = ntaxa, birth = 0.1, death = 0, T0 = 1)
  tree$edge.length <- tree$edge.length / max(ape::vcv(tree))
} else {
  tree <- TreeSim::sim.bd.taxa.age(n = ntaxa, numbsim = 1, lambda = 0.1, mu = 0,
                                   age = 1, mrca = TRUE)[[1]]
}
```

We then choose a set of parameters, using function `params_process`:

```{r, label="Parameters"}
params <- params_process("OU",                             ## Process
                         p = 2,                            ## Dimension
                         variance = diag(0.5, 2, 2) + 0.5, ## Rate matrix
                         selection.strength = 3,           ## Selection Strength
                         random = TRUE,                    ## Root is random
                         stationary.root = TRUE,           ## Root is stationary
                         edges = c(16, 81, 124),           ## Positions of the shifts
                         values = cbind(c(5, 4),           ## Values of the shifts
                                        c(-4, -5),
                                        c(5, -3)))
```

Note that here, we took the selection strength to a scalar matrix, but we could specify a full matrix. Also, we did not specified the ancestral values of the optimum states: these are took to be $0$ by default.

The position of the shifts is specified through to the number of the edge where it occurs. These can be saw using a standard `plot` from package `ape`. Once the parameters are constructed, we can check the position of the shifts by plotting them:

```{r, fig.show='hold', fig.height=3, fig.width=3.4}
plot(params, phylo = tree, traits = 1, value_in_box = TRUE, shifts_bg = "white")
plot(params, phylo = tree, traits = 2, value_in_box = TRUE, shifts_bg = "white")
```

We can then simulate a process along the tree with these parameters:

```{r, label="Simu_process", message=FALSE}
sim <- simul_process(params, tree)
```

We can then extract a matrix of simulated data from this object:
```{r, label="extraction"}
data <- extract(sim,             ## The simul_process object
                what = "states", ## We want the actual values
                where = "tips")  ## Only at the tips of the tree
```

The column names of the matrix are set to the tip labels of the tree. By default, the row names are left blank, but, for the sake of the demonstration, let's assume that the two traits are named "A" and "B":

```{r, label="rownames"}
rownames(data) <- c("A", "B")
```

Finally, we can plot the simulated process on the tree to see what it looks like:

```{r, fig.show='hold', fig.height=4, fig.width=7}
plot(params, phylo = tree, data = data)
```

## Inference

In this section, we take the traits as simulated by function `simul_process`, and use function `PhyloEM` for the inference of the parameters.

To make things more interesting, let's assume that we have some missing data:
```{r}
nMiss <- floor(ntaxa * 2 * 0.1)                       ## 10% of missing data
miss <- sample(1:(2 * ntaxa), nMiss, replace = FALSE) ## sample missing randomly
chars <- (miss - 1) %% 2 + 1                          ## Trace back rows and columns
tips <- (miss - 1) %/% 2 + 1
for (i in 1:nMiss){
  data[chars[i], tips[i]] <- NA                       ## Forget some values
}
```

By default, the inference is done on an automatically computed grid of alpha values. Here, to keep the computations to a minimum, we will only use $2$ values for alpha, including the true value. In a true application however, a grid with $10$ values of $\alpha$ is automatically chosen.

```{r, label="Fit_EM", warning=FALSE, eval = FALSE}
## Grid on alpha
alpha_grid <- c(1, 3)

## Run algorithm
res <- PhyloEM(phylo = tree,
               Y_data = data,
               process = "scOU",                   ## scalar OU model
               random.root = TRUE,                 ## Root is stationary (true model)
               stationary.root = TRUE,
               alpha = alpha_grid,                 ## On a grid of alpha
               K_max = 10,                         ## Maximal number of shifts
               parallel_alpha = TRUE,              ## This can be set to TRUE for
               Ncores = 2)                         ## parallel computations
res
```

```{r label="Fit_EM_int", echo=FALSE, warning = FALSE}
## Grid on alpha
alpha_grid <- c(1, 3)
## Run algorithm
res <- PhyloEM(phylo = tree,
               Y_data = data,
               process = "scOU",                   ## scalar OU model
               random.root = TRUE,                 ## Root is stationary (true model)
               stationary.root = TRUE,
               alpha = alpha_grid,                 ## On a grid of alpha
               K_max = 10,                         ## Maximal number of shifts
               parallel_alpha = FALSE)              ## This can be set to TRUE for
res
```

The fitted object contains the inferred parameters for every values of $\alpha$ and $K$, as well as the parameters selected by the various model selection criteria.

Here, we can see that the original shifts were recovered by the model selection procedure. We can plot the obtained solution, as shown below. The inferred states a the tips are shown in dotted lines. Notice that they are coherent with the positions of the shifts that we found.
```{r, fig.show='hold', fig.height=4, fig.width=7, warning=FALSE}
plot(res)
```

The function above automatically plots the solution selected by the model selection criterion.
We can plot this criterion, to see whether the solution selected is a clear minimum.

```{r, fig.show='hold', fig.height=4, fig.width=4, warning=FALSE}
plot_criterion(res)
```

Here, the criterion behaves rather well, and the solution for $K=3$ seems to be a good choice.

Above, we plotted the solution given by the LINselect method (the default). But we might want to see the solution given by another selection method, such as the slope heuristic. In that case, we just need to run:
```{r, eval=FALSE}
plot(res, params = params_process(res, method.selection = "DDSE"))
```

where we extracted the parameters with the method `params_process`. Here, the results are the same.

We can also extract the solution for a given number of shifts and/or a given value of alpha. For instance, we can see that in this simple case, the inferred position of the shifts is robust to the value of the selection strength:
```{r, fig.show='hold', fig.height=4, fig.width=7, warning=FALSE}
plot(res, params = params_process(res, K = 3, alpha = 1))
```

The inferred shifts values however are quite different:
```{r}
params_process(res, K = 3, alpha = 1)$shifts
params_process(res, K = 3, alpha = 3)$shifts
```

## Equivalent Solutions

When there are too many shifts, the solution might not be identifiable. For instance here, if we take the (degenerate) solution with $6$ shifts, we get this warning:
```{r}
params_6 <- params_process(res, K = 6)
```

And we can plot all the equivalent shifts allocations:
```{r, fig.show='hold', fig.height=4, fig.width=8, warning=FALSE}
plot(equivalent_shifts(tree, params_6))
```

the default only shows the shifts values for the first trait.

<!-- Vignettes are long form documentation commonly included in packages. Because they are part of the distribution of the package, they need to be as compact as possible. The `html_vignette` output type provides a custom style sheet (and tweaks some options) to ensure that the resulting html is as small as possible. The `html_vignette` format: -->

<!-- - Never uses retina figures -->
<!-- - Has a smaller default figure size -->
<!-- - Uses a custom CSS stylesheet instead of the default Twitter Bootstrap style -->

<!-- ## Vignette Info -->

<!-- Note the various macros within the `vignette` section of the metadata block above. These are required in order to instruct R how to build the vignette. Note that you should change the `title` field and the `\VignetteIndexEntry` to match the title of your vignette. -->

<!-- ## Styles -->

<!-- The `html_vignette` template includes a basic CSS theme. To override this theme you can specify your own CSS in the document metadata as follows: -->

<!--     output:  -->
<!--       rmarkdown::html_vignette: -->
<!--         css: mystyles.css -->

<!-- ## Figures -->

<!-- The figure sizes have been customised so that you can easily put two images side-by-side.  -->

<!-- ```{r, fig.show='hold'} -->
<!-- plot(1:10) -->
<!-- plot(10:1) -->
<!-- ``` -->

<!-- You can enable figure captions by `fig_caption: yes` in YAML: -->

<!--     output: -->
<!--       rmarkdown::html_vignette: -->
<!--         fig_caption: yes -->

<!-- Then you can use the chunk option `fig.cap = "Your figure caption."` in **knitr**. -->

<!-- ## More Examples -->

<!-- You can write math expressions, e.g. $Y = X\beta + \epsilon$, footnotes^[A footnote here.], and tables, e.g. using `knitr::kable()`. -->

<!-- ```{r, echo=FALSE, results='asis'} -->
<!-- knitr::kable(head(mtcars, 10)) -->
<!-- ``` -->

<!-- Also a quote using `>`: -->

<!-- > "He who gives up [code] safety for [code] speed deserves neither." -->
<!-- ([via](https://twitter.com/hadleywickham/status/504368538874703872)) -->