asaplib.compressor package

Submodules

asaplib.compressor.bck-fps module

asaplib.compressor.cur module

CUR matrix decomposition is a low-rank matrix decomposition algorithm that expresses the data matrix explicitly in terms of a small number of its actual columns and/or rows.

asaplib.compressor.cur.CUR_deterministic(X, n_col, error_estimate=True, costs=1)[source]

Given a rank k, find the best columns and rows such that X ≈ C U R

Parameters
  • X (np.matrix) – input covariance matrix

  • n_col (int) – number of columns to keep

  • error_estimate (bool, optional) – compute the remaining error of the CUR

  • costs (float or np.array, optional) – a list of costs associated with each column (a scalar for uniform costs)

Returns

  • indices (np.array) – indices of columns to choose

  • cur_error (np.array) – the error of the decomposition

asaplib.compressor.cur.CUR_deterministic_step(cov, k, costs=1)[source]

Apply (deterministic) CUR selection of k rows & columns of the given covariance matrix, including an orthogonalization step. Costs can be weighted if desired.

asaplib.compressor.cur.cur_column_select(array, num, rank=None, deterministic=True, mode='sparse', weights=None, calc_error=False)[source]

Select columns from a matrix according to their statistical leverage scores.

Based on: [1] Mahoney MW, Drineas P. CUR matrix decompositions for improved data analysis. PNAS. 2009 Jan 20;106(3):697–702.

Notes

Which mode to use? If the matrix is not square, or possibly not hermitian (or even real symmetric), then sparse is the only option. For hermitian matrices, computing the eigenvectors with eigh may be faster than the sparse SVD method; benchmarks are needed for this, especially for descriptor kernels.

Parameters
  • array (np.array) – array of shape (M, N) whose columns are to be selected

  • num (int) – number of column indices to be returned

  • rank (int, optional) – number of singular vectors to calculate for the statistical leverage scores; default: min(min(N, M) - 1, num/2), but at least one

  • deterministic (bool, optional) – whether to use the deterministic or the probabilistic method. Deterministic (True): the top num columns are returned. Stochastic (False): the leverage scores are used as probabilities for choosing num columns.

  • mode ({"sparse", "dense", "hermitian"}) –

    mode of the singular vector calculation:
    • sparse (default): uses sparse SVD; expected to be robust and to solve the problem

    • hermitian: offers a speedup for hermitian matrices (i.e. real symmetric kernel matrices as well)

    • dense (not recommended): uses dense SVD, calculating all the singular vectors and using a number of them according to the rank parameter

  • weights (np.array, optional) – costs for taking each column, shape=(N,); the statistical leverage scores are scaled by this array

  • calc_error (bool, optional) – calculate the error of the decomposition (default False). There is a significant cost to this calculation, due to a pseudo-inverse operation. The error is the norm of the difference between the original matrix and the approximation obtained from the selected columns; it is not necessarily meaningful in every situation.

Returns

  • indices (np.array) – indices of columns to choose

  • cur_error (np.array) – the error of the decomposition
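The leverage-score selection described above can be sketched in plain NumPy. This is a minimal illustration of the deterministic variant, not the asaplib implementation; the function name `leverage_score_select` is hypothetical:

```python
import numpy as np

def leverage_score_select(array, num, rank):
    """Pick `num` columns of `array` with the largest statistical
    leverage scores, computed from the top `rank` right singular vectors."""
    # right singular vectors of the matrix (rows of vt)
    _, _, vt = np.linalg.svd(array, full_matrices=False)
    v = vt[:rank]                          # shape (rank, N)
    # leverage score of column j: normalized squared weight in the top vectors
    scores = np.sum(v ** 2, axis=0) / rank
    # deterministic variant: take the top-scoring columns
    return np.argsort(scores)[::-1][:num]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
idx = leverage_score_select(X, num=3, rank=2)
```

The stochastic variant would instead draw `num` columns with probabilities proportional to these scores.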

asaplib.compressor.fps module

Farthest Point Sampling methods for sparsification

asaplib.compressor.fps.fast_fps(x, d=0, r=None)[source]
asaplib.compressor.fps.fps(x, d=0, r=None)[source]

Farthest Point Sampling

Parameters
  • x (np.matrix) – [n_samples, n_dim] coordinates of all the samples to be sparsified.

  • d (int) – number of samples to keep

  • r (int) – index of the sample to start from

Returns

sample_index – a list of the selected samples, and the remaining error

Return type

list
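The greedy strategy behind farthest point sampling can be sketched in a few lines. This is a minimal illustration, not the asaplib implementation; `fps_sketch` is a hypothetical name:

```python
import numpy as np

def fps_sketch(x, d, r=0):
    """Greedy farthest point sampling: start from sample `r`, then
    repeatedly add the sample farthest from the current selection."""
    selected = [r]
    # distance of every sample to the nearest selected sample so far
    min_dist = np.linalg.norm(x - x[r], axis=1)
    for _ in range(d - 1):
        nxt = int(np.argmax(min_dist))     # farthest remaining sample
        selected.append(nxt)
        # fold the new sample into the nearest-selected distances
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))
    return selected

x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
print(fps_sketch(x, d=3))  # → [0, 2, 3]
```

Note how the near-duplicate point `[0.1, 0.0]` is never picked: FPS favors samples that are maximally spread out.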

asaplib.compressor.pep-fps module

asaplib.compressor.reweight module

Select samples using a re-weighted distribution

The original distribution (KDE) of the samples is ρ = exp(-F) and we select the samples using a well-tempered distribution ρ_λ = exp(-F/λ)

asaplib.compressor.reweight.reweight(logkde, n_sparse, reweight_lambda)[source]
Parameters
  • logkde (list of float) – the log of the kernel density for each sample

  • n_sparse (int) – number of samples to select

  • reweight_lambda (float) – reweighting factor

Returns

sbs (list of int) – a list of selected samples
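One plausible reading of the re-weighting can be sketched as follows: the samples are already distributed as ρ = exp(logkde), so selecting each with weight proportional to ρ^(1/λ - 1) makes the chosen set follow the well-tempered distribution ρ^(1/λ). This is a hedged sketch, not the asaplib implementation; `reweight_sketch` is a hypothetical name:

```python
import numpy as np

def reweight_sketch(logkde, n_sparse, reweight_lambda, seed=0):
    """Select n_sparse samples so that the chosen set follows the
    well-tempered distribution rho**(1/lambda)."""
    logkde = np.asarray(logkde, dtype=float)
    # importance weight per sample: rho**(1/lambda - 1), in log space
    logw = logkde * (1.0 / reweight_lambda - 1.0)
    logw -= logw.max()                     # stabilize before exponentiating
    p = np.exp(logw)
    p /= p.sum()                           # normalize to probabilities
    rng = np.random.default_rng(seed)
    return rng.choice(len(logkde), size=n_sparse, replace=False, p=p)

logkde = [-1.0, -2.0, -5.0, -0.5]
sbs = reweight_sketch(logkde, n_sparse=2, reweight_lambda=2.0)
```

With λ > 1 the weights flatten the distribution, so low-density samples are picked more often than under the original KDE.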

asaplib.compressor.sparsifier module

sparsifier class

class asaplib.compressor.sparsifier.Sparsifier(sparse_mode)[source]

Bases: object

sparsify(desc_or_ntotal, n_or_ratio, sparse_param=0)[source]

Function handling the sparsification of data

Parameters
  • desc_or_ntotal (np.array or int) – either a design matrix [n_sample, n_desc], or simply the total number of samples

  • n_or_ratio (int or float) – Either the number or the fraction of sparsified points

  • sparse_param (int) – additional parameter that may be needed for the specific sparsifier used

Returns

sbs (list) – a list of the indexes for the sparsified points
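The interface above can be illustrated with a minimal stand-in that accepts either form of each argument. This sketch implements only a hypothetical 'random' mode; the real Sparsifier dispatches to the selection methods in this package:

```python
import random

class SparsifierSketch:
    """Minimal stand-in for the Sparsifier interface ('random' mode only)."""
    def __init__(self, sparse_mode):
        if sparse_mode != "random":
            raise ValueError("this sketch only implements 'random'")
        self.sparse_mode = sparse_mode

    def sparsify(self, desc_or_ntotal, n_or_ratio, sparse_param=0):
        # accept either a design matrix or the total number of samples
        n_total = desc_or_ntotal if isinstance(desc_or_ntotal, int) else len(desc_or_ntotal)
        # accept either an absolute count or a fraction of the samples
        n_sparse = n_or_ratio if isinstance(n_or_ratio, int) else int(n_or_ratio * n_total)
        random.seed(sparse_param)
        return random.sample(range(n_total), n_sparse)

sbs = SparsifierSketch("random").sparsify(100, 0.1)  # 10% of 100 samples
```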

asaplib.compressor.split module

Functions for making splits

class asaplib.compressor.split.KFold(n_splits=3, shuffle=False, random_state=None)[source]

Bases: sklearn.model_selection._split.KFold

get_params()[source]
class asaplib.compressor.split.LCSplit(cv, n_repeats=[10], train_sizes=[10], test_size=None, random_state=None, **cvargs)[source]

Bases: object

get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator.

Parameters
  • X (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.

  • y (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.

  • groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set.

Returns

n_splits – Returns the number of splitting iterations in the cross-validator.

Return type

int

get_params()[source]
split(X, y=None, groups=None)[source]

Generates indices to split data into training and test sets.

Parameters
  • X (array-like) – Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, of length n_samples) – The target variable for supervised learning problems.

  • groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set.

Returns

  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.
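The learning-curve splitting that LCSplit performs can be sketched in pure Python: for each requested train size, draw the corresponding number of independent random splits. A minimal sketch under those assumptions, not the asaplib implementation; `lc_splits_sketch` is a hypothetical name:

```python
import random

def lc_splits_sketch(n_samples, n_repeats, train_sizes, test_size, seed=0):
    """Yield (train, test) index pairs for a learning curve:
    for each train size, draw the matching number of random splits."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    for reps, n_train in zip(n_repeats, train_sizes):
        for _ in range(reps):
            rng.shuffle(idx)
            # slices copy the shuffled list, so yielded pairs stay valid
            yield idx[:n_train], idx[n_train:n_train + test_size]

splits = list(lc_splits_sketch(20, n_repeats=[2, 2], train_sizes=[5, 10], test_size=5))
```

Fitting a model on each train set and scoring it on the paired test set then traces out a learning curve over the requested train sizes.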

class asaplib.compressor.split.ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)[source]

Bases: sklearn.model_selection._split.ShuffleSplit

get_params()[source]
asaplib.compressor.split.exponential_split(xmin, xmax, n=5)[source]

Obtain integers that are equally spaced in log space.

Parameters
  • xmin (float) – lower bound in original space

  • xmax (float) – upper bound in original space

  • n (int) – the number of points to generate (default is 5)

Returns

X (np.array) – list of n evenly spaced points in log space
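The idea is simply linear interpolation between log(xmin) and log(xmax), rounded back to integers. A minimal sketch, not the asaplib implementation; `exponential_split_sketch` is a hypothetical name:

```python
import math

def exponential_split_sketch(xmin, xmax, n=5):
    """Return n integers spaced evenly in log space between xmin and xmax."""
    step = (math.log(xmax) - math.log(xmin)) / (n - 1)
    logs = [math.log(xmin) + i * step for i in range(n)]
    return [int(round(math.exp(v))) for v in logs]

print(exponential_split_sketch(10, 10000, n=4))  # → [10, 100, 1000, 10000]
```

For narrow ranges or large n the rounding can produce duplicate integers, which is worth checking before using the result as, say, a list of train-set sizes.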

asaplib.compressor.split.kernel_random_split(X, y, r=0.05, seed=0)[source]
Parameters
  • X (array-like, shape=[n_samples,n_desc]) – kernel matrix

  • y (array-like, shape=[n_samples]) – labels

  • r (float) – test ratio

Returns

  • X_train, X_test (np.matrix) – train/test kernel matrix

  • y_train, y_test (np.array) – train/test labels

  • train_list, test_list (list) – train/test indexes

asaplib.compressor.split.random_split(n_sample, r, seed=0)[source]

Obtain train/test indexes with a test ratio

Parameters
  • n_sample (int) – the number of samples

  • r (float) – test ratio

Returns

  • train_list (list) – train indexes

  • test_list (list) – test indexes
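A random train/test split like this amounts to shuffling the sample indexes and cutting off the first fraction r as the test set. A minimal sketch, not the asaplib implementation; `random_split_sketch` is a hypothetical name:

```python
import random

def random_split_sketch(n_sample, r, seed=0):
    """Split sample indexes into train/test lists with test ratio r."""
    rng = random.Random(seed)
    idx = list(range(n_sample))
    rng.shuffle(idx)
    n_test = int(r * n_sample)
    # first n_test shuffled indexes become the test set
    return idx[n_test:], idx[:n_test]

train_list, test_list = random_split_sketch(100, r=0.05)
```

kernel_random_split above follows the same logic, additionally slicing the kernel matrix rows/columns and the labels by the resulting index lists.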

Module contents