26.1 Descriptive Statistics

One principal goal of descriptive statistics is to represent the essence of a large data set concisely. Octave provides the mean, median, and mode functions, each of which summarizes a data set with a single number corresponding to the central tendency of the data.

 
m = mean (x)
m = mean (x, dim)
m = mean (x, vecdim)
m = mean (x, "all")
m = mean (…, nanflag)
m = mean (…, outtype)
m = mean (…, 'Weights', w)

Compute the mean of the elements of x.

The mean is defined as

mean (x) = SUM_i x(i) / N

where N is the number of elements in x.

The weighted mean is defined as

weighted_mean (x) = SUM_i (w(i) * x(i)) / SUM_i (w(i))

where w(i) is the weight applied to the element x(i).

If x is a vector, then mean (x) returns the mean of the elements in x.

If x is a matrix, then mean (x) returns a row vector with each element containing the mean of the corresponding column in x.

If x is an array, then mean (x) computes the mean along the first non-singleton dimension of x.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause mean to operate on all elements of x, and is equivalent to mean (x(:)).

The optional input outtype specifies the data type that is returned. outtype can take the following values:

"default"

Output is of type double, unless the input is single in which case the output is of type single.

"double"

Output is of type double.

"native"

Output is of the same type as the input as reported by (class (x)), unless the input is logical in which case the output is of type double.

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values set the value of nanflag to "omitnan". The output will still contain NaN values if x consists of all NaN values in the operating dimension.

The optional argument pair "Weights", w specifies a weighting scheme w, which is applied on input x, so that mean computes the weighted mean. When operating along a single dimension, w must be a vector of the same length as the operating dimension or it must have the same size as x. When operating over an array slice defined by vecdim, w must have the same size as the operating array slice, i.e., size (w) == size (x)(vecdim), or the same size as x.

See also: median, mode, movmean.
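
As a small illustration of the definitions above, the following sketch computes column means, row means, and the mean over all elements of a matrix, and evaluates a weighted mean directly from its defining formula rather than through the "Weights" option. The data, weights, and variable names are made up for the example.

```octave
x = [1 2; 3 4; 5 6];           # illustrative 3x2 matrix

m_cols = mean (x);             # mean of each column: [3 4]
m_rows = mean (x, 2);          # mean of each row: [1.5; 3.5; 5.5]
m_all  = mean (x(:));          # mean of every element: 3.5

# Weighted mean of a vector, written out from the definition
v  = [1; 3; 5];
w  = [1; 1; 2];                # hypothetical weights
wm = sum (w .* v) / sum (w);   # (1*1 + 1*3 + 2*5) / 4 = 3.5
```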

 
m = median (x)
m = median (x, dim)
m = median (x, vecdim)
m = median (x, "all")
m = median (…, nanflag)
m = median (…, outtype)

Compute the median value of the elements of x.

The median is defined on the sorted data s (s = sort (x)) as

             |  s(ceil (N/2))          N odd
median (x) = |
             | (s(N/2) + s(N/2+1))/2   N even

If x is a vector, then median (x) returns the median of the elements in x.

If x is a matrix, then median (x) returns a row vector with each element containing the median of the corresponding column in x.

If x is an array, then median (x) computes the median along the first non-singleton dimension of x.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause median to operate on all elements of x, and is equivalent to median (x(:)).

median (…, outtype) returns the median with a specified data type, using any of the input arguments in the previous syntaxes. outtype can take the following values:

"default"

Output is of type double, unless the input is single in which case the output is of type single.

"double"

Output is of type double.

"native"

Output is of the same type as the input (class (x)), unless the input is logical in which case the output is of type double.

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values set the value of nanflag to "omitnan". The output will still contain NaN values if x consists of all NaN values in the operating dimension.

See also: mean, mode, movmedian.
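
The piecewise definition of the median can be checked by hand on short vectors. This is only a sketch with made-up data; it applies the odd and even cases of the formula to sorted copies of the input.

```octave
# Odd number of elements: take the middle element of the sorted data
x_odd = [7 1 3];
s = sort (x_odd);
N = numel (s);
m_odd = s(ceil (N/2));               # sorted data is [1 3 7], so 3

# Even number of elements: average the two middle elements
x_even = [7 1 3 9];
s = sort (x_even);
N = numel (s);
m_even = (s(N/2) + s(N/2+1)) / 2;    # sorted data is [1 3 7 9], so 5
```

Both values should agree with median (x_odd) and median (x_even).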

 
m = mode (x)
m = mode (x, dim)
m = mode (x, vecdim)
m = mode (x, "all")
[m, f, c] = mode (…)

Compute the most frequently occurring value in the input data x.

mode determines the frequency of values along the first non-singleton dimension and returns the value with the highest frequency. If two or more values have the same frequency, mode returns the smallest.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored. If all dimensions in vecdim are greater than ndims (x), then mode will return x.

Specifying the dimension as "all" will cause mode to operate on all elements of x, and is equivalent to mode (x(:)).

The return variable f is the number of occurrences of the mode in the dataset.

The cell array c contains all of the elements with the maximum frequency.

See also: mean, median.
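
A short sketch of the three outputs, using made-up data in which two values are tied for the highest frequency:

```octave
x = [1 2 2 3 3 4];       # 2 and 3 both occur twice
[m, f, c] = mode (x);
# m is 2, the smallest of the tied values
# f is 2, the number of occurrences
# c{1} holds both tied values, 2 and 3
```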

Using just one number, such as the mean, to represent an entire data set may not give an accurate picture of the data. One way to characterize the fit is to measure the dispersion of the data. Octave provides several functions for measuring dispersion.

 
[s, l] = bounds (x)
[s, l] = bounds (x, dim)
[s, l] = bounds (x, vecdim)
[s, l] = bounds (x, "all")
[s, l] = bounds (…, nanflag)

Return the smallest and largest values of the input data x.

If x is a vector, then bounds (x) returns the smallest and largest values of the elements in x in s and l, respectively.

If x is a matrix, then bounds (x) returns the smallest and largest values for each column of x as row vectors s and l, respectively.

If x is an array, then bounds (x) computes the smallest and largest values along the first non-singleton dimension of x.

The data in x must be numeric. By default, any NaN values are ignored. The size of s and l is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause bounds to operate on all elements of x, and is equivalent to bounds (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "omitnan" which does not include NaN values in the result. If the argument "includenan" is given, and there is a NaN present, then the result for both smallest (s) and largest (l) elements will be NaN.

Usage Note: The bounds are a quickly computed measure of the dispersion of a data set, but are less accurate than iqr if there are outlying data points.

See also: range, iqr, mad, std.

 
y = range (x)
y = range (x, dim)
y = range (x, vecdim)
y = range (x, "all")
y = range (…, nanflag)

Return the difference between the maximum and the minimum values of the input data x.

If x is a vector, then range (x) returns the difference between the maximum and minimum values of the elements in x.

If x is a matrix, then range (x) returns a row vector y with the difference between the maximum and minimum values for each column of x.

If x is an array, then range (x) computes the difference between the maximum and minimum values along the first non-singleton dimension of x.

The data in x must be numeric. By default, any NaN values are ignored. The size of y is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause range to operate on all elements of x, and is equivalent to range (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "omitnan" which does not include NaN values in the result. If the argument "includenan" is given, and there is a NaN present, then the corresponding result will be NaN.

Usage Note: The range is a quickly computed measure of the dispersion of a data set, but is less accurate than iqr if there are outlying data points.

See also: bounds, iqr, mad, std.
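
The relationship between bounds and range can be seen on a small vector. The data below are invented for illustration, and the second call relies on the default "omitnan" behavior described for bounds.

```octave
x = [4 -1 7 2];
[s, l] = bounds (x);          # s = -1, l = 7
y = range (x);                # l - s = 8

x_nan = [4 NaN 7 2];
[s2, l2] = bounds (x_nan);    # NaN is ignored by default: s2 = 2, l2 = 7
```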

 
r = iqr (x)
r = iqr (x, dim)
r = iqr (x, vecdim)
r = iqr (x, "all")
[r, q] = iqr (…)

Compute the interquartile range of the input data x.

The interquartile range is defined as the difference between the 75th and 25th percentile values of x calculated using

quantile (x, [0.25 , 0.75])

If x is a vector, then iqr (x) computes the interquartile range of the elements in x.

If x is a matrix, then iqr (x) returns a row vector with each element containing the interquartile range of the corresponding column in x.

If x is an array, then iqr (x) computes the interquartile range along the first non-singleton dimension of x.

The data in x must be numeric and any NaN values are ignored. The size of r is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return zeros (size (x)).

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause iqr to operate on all elements of x, and is equivalent to iqr (x(:)).

The optional output q contains the quantiles for the 25th and 75th percentile of the data.

Usage Note: As a measure of dispersion, the interquartile range is less affected by outliers than either range or std. The interquartile range of a scalar is necessarily 0.

See also: bounds, mad, range, std, prctile, quantile.
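
The following sketch compares iqr with the difference of the two quartiles returned by quantile. It assumes an Octave version in which iqr follows the definition given above and quantile uses its default method 5; the data are made up.

```octave
x = 1:8;
q = quantile (x, [0.25, 0.75]);   # quartiles: 2.5 and 6.5 with method 5
r = iqr (x);                      # 6.5 - 2.5 = 4
```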

 
m = mad (x)
m = mad (x, opt)
m = mad (x, opt, dim)
m = mad (x, opt, vecdim)
m = mad (x, opt, "all")

Compute the mean or median absolute deviation (MAD) of the elements of x.

The mean absolute deviation is defined as

mad = mean (abs (x - mean (x)))

The median absolute deviation is defined as

mad = median (abs (x - median (x)))

mad excludes NaN values from the calculation, similar to using the "omitnan" option in mean and median.

If x is a vector, then mad (x) returns the mean absolute deviation of the elements in x.

If x is a matrix, then mad (x) returns a row vector with each element containing the mean absolute deviation of the corresponding column in x.

If x is an array, then mad (x) computes the mean absolute deviation along the first non-singleton dimension of x.

The optional argument opt specifies whether mean or median absolute deviation is calculated. The default is 0 which corresponds to mean absolute deviation; a value of 1 corresponds to median absolute deviation. Passing an empty input [] defaults to mean absolute deviation.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return zeros (size (x)).

Specifying the dimension as vecdim, a vector of non-repeating dimensions, will return the mad over the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause mad to operate on all elements of x, and is equivalent to mad (x(:)).

Usage Note: As a measure of dispersion, mad is less affected by outliers than std.

See also: bounds, range, iqr, std, mean, median.
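
To illustrate how the two deviations react to an outlier, the sketch below evaluates both definitions on a made-up vector whose last element is far from the rest.

```octave
x = [1 2 3 4 100];                       # 100 acts as an outlier

mean_ad   = mad (x);                     # opt = 0 (default): mean absolute deviation
median_ad = mad (x, 1);                  # opt = 1: median absolute deviation

# The same quantities written out from the definitions
mean_ad_def   = mean (abs (x - mean (x)));       # mean is 22, result 31.2
median_ad_def = median (abs (x - median (x)));   # median is 3, result 1
```

The median absolute deviation stays at 1 here, while the mean absolute deviation is pulled up to 31.2 by the single outlier.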

 
y = meansq (x)
y = meansq (x, dim)
y = meansq (x, vecdim)
y = meansq (x, "all")
y = meansq (…, nanflag)

Compute the mean square of the input data x.

The mean square is defined as

meansq (x) = 1/N SUM_i x(i)^2

where N is the length of the x vector.

If x is a vector, then meansq (x) returns the mean square of the elements in x.

If x is a matrix, then meansq (x) returns a row vector with each element containing the mean square of the corresponding column in x.

If x is an array, then meansq (x) computes the mean square along the first non-singleton dimension of x.

The data in x must be numeric. The size of y is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.^2.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause meansq to operate on all elements of x, and is equivalent to meansq (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values set the value of nanflag to "omitnan". The output will still contain NaN values if x consists of all NaN values in the operating dimension.

See also: rms, var, std, moment.

 
y = rms (x)
y = rms (x, dim)
y = rms (x, vecdim)
y = rms (x, "all")
y = rms (…, nanflag)

Compute the root mean square of the input data x.

The root mean square is defined as

rms (x) = sqrt (1/N SUM_i x(i)^2)

where N is the length of the x vector.

If x is a vector, then rms (x) returns the root mean square of the elements in x.

If x is a matrix, then rms (x) returns a row vector with each element containing the root mean square of the corresponding column in x.

If x is an array, then rms (x) computes the root mean square along the first non-singleton dimension of x.

The data in x must be numeric. The size of y is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause rms to operate on all elements of x, and is equivalent to rms (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values set the value of nanflag to "omitnan". The output will still contain NaN values if x consists of all NaN values in the operating dimension.

See also: meansq, var, std, moment.
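
The mean square and root mean square differ only by a square root, which the following sketch shows on a two-element vector. The data and the names ending in _def are illustrative only.

```octave
x = [3 4];
N = numel (x);

ms_def  = sum (x .^ 2) / N;   # (9 + 16) / 2 = 12.5
rms_def = sqrt (ms_def);      # square root of 12.5, about 3.5355

ms = meansq (x);              # should match ms_def
r  = rms (x);                 # should match rms_def
```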

 
s = std (x)
s = std (x, w)
s = std (x, w, dim)
s = std (x, w, vecdim)
s = std (x, w, "all")
s = std (…, nanflag)
[s, m] = std (…)

Compute the standard deviation of the elements of x.

The standard deviation is defined as

std (x) = sqrt ((1 / (N-1)) * SUM_i ((x(i) - mean(x))^2))

where N is the number of elements of x.

If x is a vector, then std (x) returns the standard deviation of the elements in x.

If x is a matrix, then std (x) returns a row vector with each element containing the standard deviation of the corresponding column in x.

If x is an array, then std (x) computes the standard deviation along the first non-singleton dimension of x.

The optional argument w determines the weighting scheme to use. Valid values are:

0 [default]:

Normalize with N-1 (sample standard deviation). This provides the square root of the best unbiased estimator of the variance.

1:

Normalize with N (population standard deviation). This provides the square root of the second moment around the mean.

a vector:

Compute the weighted standard deviation with non-negative weights. The length of w must equal the size of x in the operating dimension. NaN values are permitted in w, will be multiplied with the associated values in x, and can be excluded by the nanflag option.

an array:

Similar to vector weights, but w must be the same size as x. If the operating dimension is supplied as vecdim or "all" and w is not a scalar, w must be an array of the same size as x.

Note: w must always be specified before specifying any of the following dimension options. To use the default value for w you may pass an empty input argument [].

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return zeros (size (x)).

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause std to operate on all elements of x, and is equivalent to std (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values set the value of nanflag to "omitnan". The output will still contain NaN values if x consists of all NaN values in the operating dimension.

The optional second output variable m contains the mean of the elements of x used to calculate the standard deviation. If s is the weighted standard deviation, then m is also the weighted mean.

See also: var, bounds, mad, range, iqr, mean, median.
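
The effect of the two scalar values of w can be seen on a small data set. This is a sketch with made-up data whose mean is 5 and whose squared deviations sum to 32.

```octave
x = [2 4 4 4 5 5 7 9];   # N = 8, mean 5, sum of squared deviations 32

s0 = std (x);            # w = 0: sqrt (32 / 7), about 2.1381
s1 = std (x, 1);         # w = 1: sqrt (32 / 8) = 2
```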

In addition to knowing the size of a dispersion, it is useful to know the shape of the data set. For example, are data points massed to the left or right of the mean? Octave provides several common measures to describe the shape of the data set. Octave can also calculate moments, allowing arbitrary shape measures to be developed.

 
v = var (x)
v = var (x, w)
v = var (x, w, dim)
v = var (x, w, vecdim)
v = var (x, w, "all")
v = var (…, nanflag)
[v, m] = var (…)

Compute the variance of the elements of x.

The variance is defined as

var (x) = (1 / (N-1)) * SUM_i ((x(i) - mean(x))^2)

where N is the number of elements of x.

If x is a vector, then var (x) returns the variance of the elements in x.

If x is a matrix, then var (x) returns a row vector with each element containing the variance of the corresponding column in x.

If x is an array, then var (x) computes the variance along the first non-singleton dimension of x.

The optional argument w determines the weighting scheme to use. Valid values are:

0 [default]:

Normalize with N-1 (sample variance). This provides the best unbiased estimator of the variance.

1:

Normalize with N (population variance). This provides the second moment around the mean.

a vector:

Compute the weighted variance with non-negative weights. The length of w must equal the size of x in the operating dimension. NaN values are permitted in w, will be multiplied with the associated values in x, and can be excluded by the nanflag option.

an array:

Similar to vector weights, but w must be the same size as x. If the operating dimension is supplied as vecdim or "all" and w is not a scalar, then w must match the size of the specified array slice.

Note: w must always be specified before specifying any of the following dimension options. To use the default value for w you may pass an empty input argument [].

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return zeros (size (x)).

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause var to operate on all elements of x, and is equivalent to var (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values set the value of nanflag to "omitnan". The output will still contain NaN values if x consists of all NaN values in the operating dimension.

The optional second output variable m contains the mean of the elements of x used to calculate the variance. If v is the weighted variance, then m is also the weighted mean.

See also: std, mean, cov, skewness, kurtosis, moment.
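
As a brief check of how var relates to std and moment, the sketch below uses a made-up vector with mean 2.5 whose squared deviations sum to 5.

```octave
x = [1 2 3 4];           # N = 4, mean 2.5, sum of squared deviations 5

v0 = var (x);            # w = 0: 5 / 3
v1 = var (x, 1);         # w = 1: 5 / 4 = 1.25

v_from_std = std (x)^2;  # the variance is the square of the standard deviation
m2 = moment (x, 2);      # second central moment, normalized with N like v1
```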

 
y = skewness (x)
y = skewness (x, flag)
y = skewness (x, flag, dim)
y = skewness (x, flag, vecdim)
y = skewness (x, flag, "all")

Compute the sample skewness of the input data x.

The sample skewness is defined as

               mean ((x - mean (x)).^3)
skewness (x) = ------------------------.
                      std (x).^3

The optional argument flag controls which normalization is used. If flag is equal to 1 (default value, used when flag is omitted or empty), return the sample skewness as defined above. If flag is equal to 0, return the adjusted skewness coefficient instead:

                  sqrt (N*(N-1))   mean ((x - mean (x)).^3)
skewness (x, 0) = -------------- * ------------------------.
                      (N - 2)             std (x).^3

where N is the length of the x vector.

The adjusted skewness coefficient is obtained by replacing the sample second and third central moments by their bias-corrected versions.

If x is a vector, then skewness (x) computes the skewness of the data in x.

If x is a matrix, then skewness (x) returns a row vector with each element containing the skewness of the data of the corresponding column in x.

If x is an array, then skewness (x) computes the skewness of the data along the first non-singleton dimension of x.

The data in x must be numeric and any NaN values are ignored. The size of y is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause skewness to operate on all elements of x, and is equivalent to skewness (x(:)).

See also: var, kurtosis, moment.
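
The sign of the skewness indicates on which side of the mean the longer tail lies. The sketch below uses made-up data and checks only the sign, not a specific value.

```octave
x_sym   = [1 2 3 4 5];     # symmetric about its mean
x_right = [1 1 1 1 10];    # one large value creates a long right tail

y_sym   = skewness (x_sym);     # third central moment vanishes, so 0
y_right = skewness (x_right);   # positive for a right-skewed data set
```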

 
y = kurtosis (x)
y = kurtosis (x, flag)
y = kurtosis (x, flag, dim)
y = kurtosis (x, flag, vecdim)
y = kurtosis (x, flag, "all")

Compute the sample kurtosis of the input data x.

The sample kurtosis is defined as

     mean ((x - mean (x)).^4)
k1 = ------------------------
            std (x).^4

The optional argument flag controls which normalization is used. If flag is equal to 1 (default value, used when flag is omitted or empty), return the sample kurtosis as defined above. If flag is equal to 0, return the "bias-corrected" kurtosis coefficient instead:

              N - 1
k0 = 3 + -------------- * ((N + 1) * k1 - 3 * (N - 1))
         (N - 2)(N - 3)

where N is the length of the x vector.

The bias-corrected kurtosis coefficient is obtained by replacing the sample second and fourth central moments by their unbiased versions. It is an unbiased estimate of the population kurtosis for normal populations.

If x is a vector, then kurtosis (x) computes the kurtosis of the data in x.

If x is a matrix, then kurtosis (x) returns a row vector with each element containing the kurtosis of the data of the corresponding column in x.

If x is an array, then kurtosis (x) computes the kurtosis of the data along the first non-singleton dimension of x.

The data in x must be numeric and any NaN values are ignored. The size of y is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause kurtosis to operate on all elements of x, and is equivalent to kurtosis (x(:)).

See also: var, skewness, moment.
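
The bias-corrected coefficient can be reproduced from the sample kurtosis with the formula for k0 given above. This is a sketch on made-up data; the names k1 and k0 simply mirror the symbols in the formulas.

```octave
x = 1:5;
N = numel (x);

k1 = kurtosis (x);       # sample kurtosis (flag = 1, the default)

# Bias-corrected coefficient written out from the formula for k0
k0 = 3 + (N - 1) / ((N - 2) * (N - 3)) * ((N + 1) * k1 - 3 * (N - 1));

k0_builtin = kurtosis (x, 0);   # should agree with k0
```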

 
m = moment (x, p)
m = moment (x, p, dim)
m = moment (x, p, vecdim)
m = moment (x, p, "all")
m = moment (x, p, …, type)

Compute the p-th central moment of the input data x.

The p-th central moment of x is defined as:

1/N SUM_i (x(i) - mean(x))^p

where N is the length of the x vector.

If x is a vector, then moment (x, p) computes the p-th central moment of the data in x.

If x is a matrix, then moment (x, p) returns a row vector with each element containing the p-th central moment of the corresponding column in x.

If x is an array, then moment (x, p) computes the p-th central moment along the first non-singleton dimension of x.

The data in x must be a non-empty numeric array. Any NaN value along the operating dimension causes the corresponding central moment to be NaN. The size of m is equal to the size of x except for the operating dimension, which becomes 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored. If all dimensions in vecdim are greater than ndims (x), then moment will return x.

Specifying the dimension as "all" will cause moment to operate on all elements of x, and is equivalent to moment (x(:), p).

The optional fourth input argument, type, is a string specifying the type of moment to be computed. Valid options are:

"c"

Central Moment (default).

"a"
"ac"

Absolute Central Moment. The moment about the mean ignoring sign defined as

1/N SUM_i (abs (x(i) - mean(x)))^p

"r"

Raw Moment. The moment about zero defined as

1/N SUM_i x(i)^p

"ar"

Absolute Raw Moment. The moment about zero ignoring sign defined as

1/N SUM_i (abs (x(i)))^p

See also: var, skewness, kurtosis.
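
The different values of type can be compared on a short vector. The sketch below uses made-up data with mean 2.5 and passes type as the argument following p.

```octave
x = [1 2 3 4];                   # mean 2.5, deviations -1.5 -0.5 0.5 1.5

m_c  = moment (x, 2);            # central: (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25
m_r  = moment (x, 2, "r");       # raw: (1 + 4 + 9 + 16) / 4 = 7.5
m_c3 = moment (x, 3);            # odd central moment of symmetric data: 0
m_ac = moment (x, 3, "ac");      # absolute central: (3.375 + 0.125 + 0.125 + 3.375) / 4 = 1.75
```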

 
q = quantile (x)
q = quantile (x, p)
q = quantile (x, n)
q = quantile (x, …, dim)
q = quantile (x, …, vecdim)
q = quantile (x, …, "all")
q = quantile (x, p, …, method)
q = quantile (x, n, …, method)

Compute the quantiles of the input data x.

If x is a vector, then quantile (x, p) computes the quantiles specified by p of the data in x.

If x is a matrix, then quantile (x, p) returns a matrix such that the i-th row of q contains the p(i)th quantiles of each column of x.

If x is an array, then quantile (x, p) computes the quantiles specified by p along the first non-singleton dimension of x.

The data in x must be numeric and any NaN values are ignored. The size of q is equal to the size of x except for the operating dimension, which equals the number of quantiles specified by p or n.

p is a numeric vector specifying the quantiles to be computed, given as cumulative probabilities of the data. All elements of p must be in the range from 0 to 1. If p is unspecified, return the quantiles for [0.00 0.25 0.50 0.75 1.00]. Alternatively, the second input argument may be specified as a positive integer value n, in which case quantile returns the quantiles for n evenly spaced cumulative probabilities computed as (1/(n + 1), 2/(n + 1), …, n/(n + 1)) for n > 1.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return N copies of x along the operating dimension, where N is the number of specified quantiles.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored. If all dimensions in vecdim are greater than ndims (x), then quantile will return N copies of x along the smallest dimension in vecdim.

Specifying the dimension as "all" will cause quantile to operate on all elements of x, and is equivalent to quantile (x(:)).

The fourth input argument, method, determines the method used to calculate the quantiles specified by p or n. The methods available to calculate sample quantiles are the nine methods used by R (https://www.r-project.org/) and can be specified by the corresponding integer value. The default value is method = 5.

Discontinuous sample quantile methods 1, 2, and 3

  1. Method 1: Inverse of empirical distribution function.
  2. Method 2: Similar to method 1 but with averaging at discontinuities.
  3. Method 3: SAS definition: nearest even order statistic.

Continuous sample quantile methods 4 through 9, where p(k) is the linear interpolation function respecting each method’s representative cdf.

  1. Method 4: p(k) = k / N. That is, linear interpolation of the empirical cdf, where N is the number of data points in x.
  2. Method 5: p(k) = (k - 0.5) / N. That is, a piecewise linear function where the knots are the values midway through the steps of the empirical cdf.
  3. Method 6: p(k) = k / (N + 1).
  4. Method 7: p(k) = (k - 1) / (N - 1).
  5. Method 8: p(k) = (k - 1/3) / (N + 1/3). The resulting quantile estimates are approximately median-unbiased regardless of the distribution of x.
  6. Method 9: p(k) = (k - 3/8) / (N + 1/4). The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed.
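The continuous methods can be read as linear interpolation of the sorted data at the knots p(k). The sketch below reproduces methods 5, 6, and 7 by hand for a small sorted sample and computes the same quantiles with quantile for comparison. The names pk5, pk6, and pk7 are hypothetical helpers, not part of Octave, and the requested probabilities are chosen to lie inside the range of the knots, since interp1 does not extrapolate beyond them by default.

```octave
x = [1; 2; 3; 4];             # sample data, already sorted
N = numel (x);
k = (1:N)';
p = [0.25; 0.50; 0.75];       # probabilities inside the range of p(k)

## Hypothetical helpers: the p(k) knots for methods 5, 6, and 7
pk5 = (k - 0.5) / N;
pk6 = k / (N + 1);
pk7 = (k - 1) / (N - 1);

## Interpolate the sorted data at the requested probabilities
q5 = interp1 (pk5, x, p);     # 1.50  2.50  3.50
q6 = interp1 (pk6, x, p);     # 1.25  2.50  3.75
q7 = interp1 (pk7, x, p);     # 1.75  2.50  3.25

## The same quantiles from quantile, selecting the method explicitly
r5 = quantile (x, p, 1, 5);
r6 = quantile (x, p, 1, 6);
r7 = quantile (x, p, 1, 7);
```

The three methods agree on the median of this sample but differ at the quartiles, which is why the choice of method matters most for small data sets.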

Hyndman and Fan (1996) recommend method 8. Maxima, S, and R (versions prior to 2.0.0) use 7 as their default. Minitab and SPSS use method 6. MATLAB uses method 5.

References:

  • R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language, Wadsworth & Brooks/Cole, 1988.
  • R. J. Hyndman, and Y. Fan, "Sample quantiles in statistical packages", American Statistician, 50, pp. 361–365, 1996.
  • R: A Language and Environment for Statistical Computing, https://cran.r-project.org/doc/manuals/fullrefman.pdf.

Examples:

x = randi (1000, [10, 1]);  # Create empirical data in range 1-1000
q = quantile (x, [0, 1]);   # Return minimum, maximum of distribution
q = quantile (x, [0.25 0.5 0.75]); # Return quartiles of distribution

See also: prctile.

 
q = prctile (x)
q = prctile (x, p)
q = prctile (x, p, dim)
q = prctile (x, p, vecdim)
q = prctile (x, p, "all")
q = prctile (x, p, …, method)

Compute the percentiles of the input data x.

If x is a vector, then prctile (x) computes the percentiles specified by p of the data in x.

If x is a matrix, then prctile (x) returns a matrix such that the i-th row of q contains the p(i)th percentiles of each column of x.

If x is an array, then prctile (x) computes the percentiles specified by p along the first non-singleton dimension of x.

The data in x must be numeric and any NaN values are ignored. The size of q is equal to the size of x except for the operating dimension, which equals the number of percentiles specified by p.

p is a numeric vector specifying the percentiles to be computed. All elements of p must be in the range from 0 to 100. If p is unspecified, return the percentiles for [0 25 50 75 100].
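The only difference from quantile is the scale of p. A minimal sketch with made-up data, assuming both functions are left at their default method (method 5, as stated for each):

```octave
x = [1; 2; 3; 4];
q_pct = prctile (x, [25; 50; 75]);          # percentiles, 0 to 100 scale
q_qnt = quantile (x, [0.25; 0.50; 0.75]);   # same points, 0 to 1 scale
```

Both calls should return the same column vector, so either function can be used depending on which scale is more convenient.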

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return N copies of x along the operating dimension, where N is the number of specified percentiles.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored. If all dimensions in vecdim are greater than ndims (x), then prctile will return N copies of x along the smallest dimension in vecdim.

Specifying the dimension as "all" will cause prctile to operate on all elements of x, and is equivalent to prctile (x(:), p).

The fourth input argument, method, determines the method used to calculate the percentiles specified by p. The available methods are the nine sample quantile methods used by R (https://www.r-project.org/), each selected by its corresponding integer value. The default value is method = 5.

Discontinuous sample quantile methods 1, 2, and 3

  1. Method 1: Inverse of empirical distribution function.
  2. Method 2: Similar to method 1 but with averaging at discontinuities.
  3. Method 3: SAS definition: nearest even order statistic.

Continuous sample quantile methods 4 through 9, where p(k) is the linear interpolation function respecting each method’s representative cdf.

  1. Method 4: p(k) = k / N. That is, linear interpolation of the empirical cdf, where N is the number of data points in x.
  2. Method 5: p(k) = (k - 0.5) / N. That is, a piecewise linear function where the knots are the values midway through the steps of the empirical cdf.
  3. Method 6: p(k) = k / (N + 1).
  4. Method 7: p(k) = (k - 1) / (N - 1).
  5. Method 8: p(k) = (k - 1/3) / (N + 1/3). The resulting quantile estimates are approximately median-unbiased regardless of the distribution of x.
  6. Method 9: p(k) = (k - 3/8) / (N + 1/4). The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed.

See also: quantile.

A summary view of a data set can be generated quickly with the statistics function.

 
stats = statistics (x)
stats = statistics (x, dim)
stats = statistics (x, vecdim)
stats = statistics (x, "all")
stats = statistics (…, nanflag)

Return a vector of statistical parameters computed over the input data x.

statistics (x) operates along the first non-singleton dimension of x and calculates the following statistical parameters:

  1. minimum
  2. first quartile
  3. median
  4. third quartile
  5. maximum
  6. mean
  7. standard deviation
  8. skewness
  9. kurtosis

If x is a row vector, then statistics (x) returns a row vector with the aforementioned statistical parameters. If x is a column vector, then it returns a column vector.

If x is a matrix, then statistics (x) returns a matrix such that each column contains the statistical parameters calculated over the corresponding column of x.

If x is an array, then statistics (x) computes the statistical parameters along the first non-singleton dimension of x.

The data in x must be numeric. By default, NaN values are ignored when computing all statistical parameters except the mean and the standard deviation. Set the optional argument nanflag to "omitnan" to also exclude NaN values from the calculation of the mean and standard deviation. Setting nanflag to "includenan" has no effect and is equivalent to calling statistics without the nanflag argument.

The size of stats is equal to the size of x except for the operating dimension, which equals 9 (i.e., the number of statistical parameters returned).

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), will return x.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause statistics to operate on all elements of x, and is equivalent to statistics (x(:)).
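The layout of the result can be illustrated with a short sketch. The data and the variable names on the left are invented for the example; the index of each entry follows the order of the numbered list above.

```octave
x = [2; 4; 4; 5; 7; 9; 11];
stats = statistics (x);     # 9-by-1 column vector, since x is a column

lo   = stats(1);            # minimum
q1   = stats(2);            # first quartile
med  = stats(3);            # median
q3   = stats(4);            # third quartile
hi   = stats(5);            # maximum
avg  = stats(6);            # mean
sd   = stats(7);            # standard deviation
skew = stats(8);            # skewness
kurt = stats(9);            # kurtosis
```

For data without NaN values, the entries for the minimum, maximum, mean, and standard deviation should match the results of min, max, mean, and std called on x directly.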

See also: min, max, median, mean, std, skewness, kurtosis.