Title: | Segment Data Minimizing a Cost Function |
Description: | Given a cost function provided by the user, this package applies it to a given matrix dataset in order to find change points in the data that minimize the sum of the costs of all the segments. This package provides a handful of algorithms with different time complexities and assumption compromises, so the user is able to choose the best one for the problem at hand. The segmentation algorithms implemented in this package are based on the paper by Bruno M. de Castro, Florencia Leonardi (2018) <arXiv:1501.01756>. The Berlin weather sample dataset was provided by Deutscher Wetterdienst <https://dwd.de/>. You can find all the references in the Acknowledgments section of this package's repository via the URL below. |
Authors: | Thales Mello [aut, cre, cph], Florencia Leonardi [aut, cph, ths], Bruno M. de Castro [cph], Deutscher Wetterdienst [cph] |
Maintainer: | Thales Mello <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2024-10-13 05:38:14 UTC |
Source: | https://github.com/segmentr-package/segmentr |
Given a dataset, a cost function and penalty parameters controlling how big and small segments are penalized, this function makes an educated guess for an added penalty for the cost function.
auto_penalize(data, likelihood, cost, big_segment_penalty = 10, small_segment_penalty = 10)
data |
dataset to be segmented by the segment() function |
likelihood |
function to be maximized using the segment() function |
cost |
function to be minimized using the segment() function |
big_segment_penalty |
penalty factor for big segments. The bigger it is, the bigger the penalty on big segments. Must be greater than or equal to 1. Penalty on big segments is constant when it's equal to 1. Default: 10 |
small_segment_penalty |
penalty factor for small segments. The bigger it is, the bigger the penalty on small segments. Must be greater than or equal to 1. Penalty on small segments is constant when it's equal to 1. Default: 10 |
This function tries to fit a sum of two exponential functions to values inferred from the dataset and the cost function. The model for the penalty function we try to fit has the form

penalty(x) = s1 * exp(-s2 * x) + b1 * exp(b2 * (x - L))

where x is the length of a segment. In the equation, s1 and s2 are, respectively, a multiplier constant and an exponential scale modifier for small segments, whereas b1 and b2 are the equivalent ones for big segments. L is the number of columns in the data matrix.
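A minimal sketch of this assumed shape in plain R. The coefficient values below are arbitrary illustrations, not values the package estimates:

```r
# Sketch of the assumed penalty shape (illustrative coefficients only):
# the first term penalizes short segments, the second penalizes long ones.
L <- 100  # number of columns in the data matrix
penalty <- function(x, s1 = 1, s2 = 0.5, b1 = 1, b2 = 0.1) {
  s1 * exp(-s2 * x) + b1 * exp(b2 * (x - L))
}

penalty(1)      # short segment: dominated by the small-segment term
penalty(L / 2)  # mid-sized segment: both terms near their minimum
penalty(L)      # full-length segment: dominated by the big-segment term
```

Increasing s2 (or b2) makes the penalty on short (or long) segments ramp up more steeply, which is what the small_segment_penalty and big_segment_penalty tuning parameters control indirectly.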
Assuming the penalty function has this form, the parameters are estimated considering the scale of values yielded by the cost function for small and big segments, also taking into account the big_segment_penalty and small_segment_penalty tuning parameters, which can be used to adjust the effect of the penalty function over big and small segments, respectively.
the cost (or likelihood) function with the guessed penalty function applied
## Not run: 
penalized_cost <- auto_penalize(berlin, multivariate)
## End(Not run)
Contains daily weather data from many Deutscher Wetterdienst weather stations in Berlin for the years 2010 and 2011. The data was obtained using the package rdwd and reformatted into a shape appropriate for analysis in this object.
berlin
A matrix containing daily temperatures, with each column representing a date and each row representing a weather station in Berlin
dates from the years 2010 and 2011
https://www.dwd.de/DE/Home/home_node.html
Find change points with minimal total cost, comparing all possible segment combinations
exactalg(data, cost, likelihood, max_segments = ncol(data), allow_parallel = TRUE)
data |
matrix for which to find the change points |
cost |
a function that receives the segment matrix as its argument and returns a cost for the segment. This function is used to calculate the change points that minimize the total cost. Depending on the algorithm being used, this function is likely to be executed many times, in which case it's also likely to be the bottleneck of the execution, so it's good for this function to have a fast implementation. |
likelihood |
deprecated: use cost instead. A function that receives the segment matrix as its argument and returns a likelihood estimate. This function is used to calculate the change points that maximize the total likelihood. Like cost, it is likely to be executed many times and to be the bottleneck of the execution, so a fast implementation is advised. |
max_segments |
an integer that defines the maximum number of segments to split the data into. |
allow_parallel |
allows parallel execution using the registered cluster. Assumes a parallel backend is registered with the foreach package. |
Function that implements a dynamic programming algorithm, with the intent of finding the independent change points for which the cost function is minimized. It analyzes all possible combinations, returning the change points that are guaranteed to minimize the total cost. Because it analyzes all possible combinations of change points, it has quadratic time complexity, meaning it works in acceptable computation time for small datasets, but takes considerably longer for datasets with many columns. For big datasets, hieralg() might be more adequate.
a list of type segmentr, which has two attributes:
changepoints: a vector with the first index of each identified change point
segments: a list of vectors, in which each vector contains the indices that identify a segment
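As a usage sketch, mirroring the simulated data used in the other examples in this manual (heterogeneity_cost is an illustrative cost function, not part of the package):

```r
library(segmentr)

# Simulated data: three column blocks drawn with different Bernoulli rates.
make_segment <- function(n, p) matrix(rbinom(100 * n, 1, p), nrow = 100)
data <- cbind(make_segment(5, 0.1), make_segment(10, 0.9), make_segment(2, 0.1))

# Illustrative cost: within-segment variability plus a small constant.
heterogeneity_cost <- function(X) sum((X - mean(X))^2) + 1

# Exhaustive search: guaranteed minimal total cost, quadratic in columns.
results <- exactalg(data, cost = heterogeneity_cost)
results$changepoints
```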
By assuming change points follow a hierarchical structure, this algorithm manages to run faster by not searching all possible branches
hieralg(data, cost, likelihood, max_segments = ncol(data), allow_parallel = TRUE)
data |
matrix for which to find the change points |
cost |
a function that receives the segment matrix as its argument and returns a cost for the segment. This function is used to calculate the change points that minimize the total cost. Depending on the algorithm being used, this function is likely to be executed many times, in which case it's also likely to be the bottleneck of the execution, so it's good for this function to have a fast implementation. |
likelihood |
deprecated: use cost instead. A function that receives the segment matrix as its argument and returns a likelihood estimate. This function is used to calculate the change points that maximize the total likelihood. Like cost, it is likely to be executed many times and to be the bottleneck of the execution, so a fast implementation is advised. |
max_segments |
an integer that defines the maximum number of segments to split the data into. |
allow_parallel |
allows parallel execution using the registered cluster. Assumes a parallel backend is registered with the foreach package. |
Fast algorithm that segments data into change points. It reduces the search space by assuming the data splits hierarchically, i.e. a segment found in a first pass is assumed to contain only segments independent of the rest of the data. This algorithm usually runs very fast, but is known to yield less accurate results, possibly missing the exact change points that would minimize total cost.
a list of type segmentr, which has two attributes:
changepoints: a vector with the first index of each identified change point
segments: a list of vectors, in which each vector contains the indices that identify a segment
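A sketch of calling hieralg() directly, with the same simulated data and illustrative cost function as the other examples in this manual:

```r
library(segmentr)

make_segment <- function(n, p) matrix(rbinom(100 * n, 1, p), nrow = 100)
data <- cbind(make_segment(5, 0.1), make_segment(10, 0.9), make_segment(2, 0.1))
heterogeneity_cost <- function(X) sum((X - mean(X))^2) + 1

# Hierarchical search: much faster than exactalg(), but the result is
# not guaranteed to be the global cost minimum.
results <- hieralg(data, cost = heterogeneity_cost)
results$segments
```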
For larger datasets, assume the data is hierarchical, but calculate the exact segments once they're smaller than a threshold
hybridalg(data, cost, likelihood, allow_parallel = TRUE, max_segments = ncol(data), threshold = 50)
data |
matrix for which to find the change points |
cost |
a function that receives the segment matrix as its argument and returns a cost for the segment. This function is used to calculate the change points that minimize the total cost. Depending on the algorithm being used, this function is likely to be executed many times, in which case it's also likely to be the bottleneck of the execution, so it's good for this function to have a fast implementation. |
likelihood |
deprecated: use cost instead. A function that receives the segment matrix as its argument and returns a likelihood estimate. This function is used to calculate the change points that maximize the total likelihood. Like cost, it is likely to be executed many times and to be the bottleneck of the execution, so a fast implementation is advised. |
allow_parallel |
allows parallel execution using the registered cluster. Assumes a parallel backend is registered with the foreach package. |
max_segments |
an integer that defines the maximum number of segments to split the data into. |
threshold |
the threshold for which the exact algorithm will be used, i.e. when the number of columns in the segment is less than or equal to the threshold. |
This algorithm mixes the hierarchical and exact approaches. It uses the hierarchical algorithm while the size of the segment is bigger than the threshold, then switches to the exact algorithm once the size of the segment is less than or equal to the threshold.
a list of type segmentr, which has two attributes:
changepoints: a vector with the first index of each identified change point
segments: a list of vectors, in which each vector contains the indices that identify a segment
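hybridalg() has no example of its own in this manual, so here is a hedged sketch using the same simulated data and illustrative cost as the other examples. With threshold = 10, blocks of ten or fewer columns are refined with the exact algorithm:

```r
library(segmentr)

make_segment <- function(n, p) matrix(rbinom(100 * n, 1, p), nrow = 100)
data <- cbind(make_segment(5, 0.1), make_segment(10, 0.9), make_segment(2, 0.1))
heterogeneity_cost <- function(X) sum((X - mean(X))^2) + 1

# Hierarchical search for large blocks, exact search once a block
# has at most `threshold` columns.
results <- hybridalg(data, cost = heterogeneity_cost, threshold = 10)
results$changepoints
```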
Estimate the likelihood of a given segment using the discrete multivariate estimation, implemented efficiently in C++
multivariate(data, na_action = function(d) d[, colSums(is.na(d)) == 0, drop = FALSE])
data |
Matrix to estimate the multivariate of. Each row is considered to be an observation, and each column is considered to be a different variable. |
na_action |
A function that is applied to the data matrix to handle missing values before estimation. The default removes every column that contains at least one NA. |
Calculates the discrete log likelihood multivariate estimation of a data matrix using an algorithm implemented in C++ for performance. This is intended to be used in conjunction with segment(), as the log likelihood function is executed multiple times, which makes it the bottleneck of the computation. Because the multivariate estimation is so commonly used, this efficient implementation is provided.
the estimate of the discrete maximum likelihood for the matrix provided.
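A minimal sketch of calling multivariate() directly on a small discrete matrix. In practice it is usually passed to segment() or auto_penalize() (as in the auto_penalize example above) rather than called by hand:

```r
library(segmentr)

# Small discrete matrix: rows are observations, columns are variables.
set.seed(1)
m <- matrix(rbinom(200, 1, 0.5), nrow = 20)

# Log likelihood estimate for the matrix treated as a single segment.
multivariate(m)
```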
Prints a short description of the segments found in the segmentr object
## S3 method for class 'segmentr'
print(x, ...)
x |
an object of type segmentr, containing change point information |
... |
further arguments to be passed down to other methods |
A short representation of the segments is printed on the screen, using the start:end range notation.
make_segment <- function(n, p) matrix(rbinom(100 * n, 1, p), nrow = 100)
data <- cbind(make_segment(5, 0.1), make_segment(10, 0.9), make_segment(2, 0.1))
heterogeneity_cost <- function(X) sum((X - mean(X))^2) + 1
x <- segment(data, cost = heterogeneity_cost, algorithm = "hieralg")
print(x)
Estimate the likelihood of a given segment using the discrete multivariate estimation, implemented in plain R and therefore slower than multivariate()
r_multivariate(data, na.omit = TRUE)
data |
Matrix to estimate the multivariate of. Each row is considered to be an observation, and each column is considered to be a different variable. |
na.omit |
If TRUE, omits NAs from the dataset. |
This log likelihood function is implemented in R in order to benchmark against the multivariate() version implemented in C++ for performance.
The estimate of the discrete maximum likelihood for the matrix provided.
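Since r_multivariate() exists to benchmark the C++ implementation, a quick comparison sketch (that both functions return the same estimate on a complete matrix is an assumption based on the description above):

```r
library(segmentr)

set.seed(1)
m <- matrix(rbinom(500, 1, 0.5), nrow = 50)

# Same estimate expected from both implementations;
# only the speed should differ.
r_multivariate(m)
multivariate(m)
```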
Generic function to segment data into separate change points according to the specified algorithm
segment(data, cost, likelihood, max_segments = ncol(data), allow_parallel = TRUE, algorithm = "exact", ...)
data |
matrix for which to find the change points |
cost |
a function that receives the segment matrix as its argument and returns a cost for the segment. This function is used to calculate the change points that minimize the total cost. Depending on the algorithm being used, this function is likely to be executed many times, in which case it's also likely to be the bottleneck of the execution, so it's good for this function to have a fast implementation. |
likelihood |
deprecated: use cost instead. A function that receives the segment matrix as its argument and returns a likelihood estimate. This function is used to calculate the change points that maximize the total likelihood. Like cost, it is likely to be executed many times and to be the bottleneck of the execution, so a fast implementation is advised. |
max_segments |
an integer that defines the maximum number of segments to split the data into. |
allow_parallel |
allows parallel execution using the registered cluster. Assumes a parallel backend is registered with the foreach package. |
algorithm |
the algorithm to use, e.g. "exact" (the default) or "hieralg" |
... |
other parameters to be passed to the underlying function |
This function can be used as a generic function to call any of the algorithms implemented by the package. Depending on the type of data the user wants to segment, one algorithm might be more adequate than the others.
a list of type segmentr, which has two attributes:
changepoints: a vector with the first index of each identified change point
segments: a list of vectors, in which each vector contains the indices that identify a segment
See also exactalg() for the exact algorithm, hieralg() for the hierarchical algorithm implementation, and hybridalg() for the hybrid algorithm implementation.
make_segment <- function(n, p) matrix(rbinom(100 * n, 1, p), nrow = 100)
data <- cbind(make_segment(5, 0.1), make_segment(10, 0.9), make_segment(2, 0.1))
heterogeneity_cost <- function(X) sum((X - mean(X))^2) + 1
segment(data, cost = heterogeneity_cost, algorithm = "hieralg")