asaplib.compressor package¶
Submodules¶
asaplib.compressor.bck-fps module¶
asaplib.compressor.cur module¶
CUR matrix decomposition is a low-rank matrix decomposition algorithm in which the decomposition is expressed explicitly in terms of a small number of actual columns and/or actual rows of the data matrix.
asaplib.compressor.cur.CUR_deterministic(X, n_col, error_estimate=True, costs=1)¶
Given rank k, find the best columns and rows such that X = C U R.
- Parameters
X (np.matrix) – the input covariance matrix
n_col (int) – number of columns to keep
error_estimate (bool, optional) – compute the remaining error of the CUR
costs (float, optional) – a list of costs associated with each column
- Returns
indices (np.array) – indices of columns to choose
cur_error (np.array) – the error of the decomposition
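A minimal usage sketch of CUR_deterministic, based on the signature above; the random covariance matrix is purely illustrative:

    import numpy as np
    from asaplib.compressor.cur import CUR_deterministic

    # build a small covariance (kernel-like) matrix from random descriptors
    desc = np.random.rand(100, 10)
    cov = desc @ desc.T                # shape (100, 100)

    # keep 20 columns and also estimate the remaining error of the decomposition
    indices, cur_error = CUR_deterministic(cov, n_col=20, error_estimate=True)
    print(indices, cur_error)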
asaplib.compressor.cur.CUR_deterministic_step(cov, k, costs=1)¶
Apply (deterministic) CUR selection of k rows & columns of the given covariance matrix, including an orthogonalization step. Costs can be weighted if desired.
asaplib.compressor.cur.cur_column_select(array, num, rank=None, deterministic=True, mode='sparse', weights=None, calc_error=False)¶
Select columns from a matrix according to their statistical leverage scores.
Based on: [1] Mahoney MW, Drineas P. CUR matrix decompositions for improved data analysis. PNAS. 2009 Jan 20;106(3):697–702.
Notes
Which mode to use? If the matrix is not square, or is possibly not Hermitian (or even real symmetric), then sparse is the only option. For Hermitian matrices, calculating the eigenvectors with eigh may be faster than using the sparse SVD method; benchmarks are still needed for this, especially for descriptor kernels.
- Parameters
array (np.array) – array of shape (M, N) from which columns are selected
num (int) – number of column indices to be returned
rank (int, optional) – number of singular vectors used to calculate the statistical leverage scores; default: the minimum of (min(N, M) - 1, num/2), but at least one
deterministic (bool, optional) – whether to use the deterministic or the probabilistic method. Deterministic (True): the top num columns are returned. Stochastic (False): the leverage scores are used as probabilities for choosing num columns
mode ({"sparse", "dense", "hermitian"}) – mode of the singular vector calculation:
sparse (default) uses sparse SVD, which is expected to be robust and to solve the problem
hermitian offers a speedup for Hermitian matrices (i.e. real symmetric kernel matrices as well)
dense (not recommended) uses full SVD, computing all the singular vectors and then using a number of them according to the rank parameter
weights (np.array, optional) – costs of taking each column, shape=(N,); the statistical leverage scores are scaled by this array
calc_error (bool, optional) – calculate the error of the decomposition (default False). This adds a significant cost, due to the semi-inverse operation needed. The error here is the norm of the difference between the original matrix and the approximation obtained from the selected columns; it is not necessarily meaningful in every situation
- Returns
indices (np.array) – indices of columns to choose
cur_error (np.array) – the error of the decomposition (only computed when calc_error is True)
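A usage sketch of cur_column_select based on the parameters above; it assumes that with calc_error=False only the column indices are returned:

    import numpy as np
    from asaplib.compressor.cur import cur_column_select

    # a rectangular data matrix with 200 samples and 50 features
    array = np.random.rand(200, 50)

    # deterministically pick the 10 columns with the largest leverage scores,
    # using the sparse-SVD mode (the only option for non-square matrices)
    indices = cur_column_select(array, num=10, deterministic=True, mode='sparse')
    print(indices)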
asaplib.compressor.fps module¶
Farthest Point Sampling methods for sparsification
asaplib.compressor.fps.fps(x, d=0, r=None)¶
Farthest Point Sampling
- Parameters
x (np.matrix) – [n_samples, n_dim] coordinates of all the samples to be sparsified.
d (int) – number of samples to keep
r (int) – index of the sample to start from
- Returns
sample_index – a list of the indices of the selected samples, together with the remaining error
- Return type
list
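A usage sketch of fps based on the signature above; note that, per the description, the return value may bundle the selected indices with the remaining error:

    import numpy as np
    from asaplib.compressor.fps import fps

    # 500 samples in a 3-dimensional descriptor space
    x = np.random.rand(500, 3)

    # keep 50 samples by farthest point sampling, starting from sample index 0
    sample_index = fps(x, d=50, r=0)
    print(sample_index)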
asaplib.compressor.pep-fps module¶
asaplib.compressor.reweight module¶
Select samples using a re-weighted distribution
The original distribution (KDE) of the samples is :math:`\rho = \exp(-F)` and we select the samples using a well-tempered distribution :math:`\rho(\lambda) = \exp(-F/\lambda)`
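The following is a generic sketch of the well-tempered reweighting idea in plain NumPy, not the asaplib.compressor.reweight API (which is not documented in this section):

    import numpy as np

    def well_tempered_select(F, n_select, lam=2.0, seed=0):
        # pick n_select indices with probabilities proportional to exp(-F/lam)
        rng = np.random.default_rng(seed)
        p = np.exp(-F / lam)           # well-tempered distribution rho(lambda)
        p /= p.sum()                   # normalise to probabilities
        return rng.choice(len(F), size=n_select, replace=False, p=p)

    F = np.random.rand(1000)           # illustrative free-energy values from a KDE
    print(well_tempered_select(F, 10, lam=2.0))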
asaplib.compressor.sparsifier module¶
sparsifier class
class asaplib.compressor.sparsifier.Sparsifier(sparse_mode)¶
Bases: object
sparsify(desc_or_ntotal, n_or_ratio, sparse_param=0)¶
Function handling the sparsification of data.
- Parameters
desc_or_ntotal – either a design matrix [n_sample, n_desc], or simply the total number of samples
n_or_ratio (int or float) – either the number or the fraction of sparsified points
sparse_param (int) – additional parameter that may be needed for the specific sparsifier used
- Returns
sbs (list) – a list of the indices of the sparsified points
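A usage sketch of the Sparsifier class; the 'fps' value for sparse_mode is an assumption, since the accepted mode strings are not listed in this section:

    import numpy as np
    from asaplib.compressor.sparsifier import Sparsifier

    desc = np.random.rand(300, 20)              # design matrix [n_sample, n_desc]
    sparsifier = Sparsifier(sparse_mode='fps')  # 'fps' assumed to be an available mode
    sbs = sparsifier.sparsify(desc, n_or_ratio=30)
    print(sbs)                                  # indices of the 30 selected samples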
asaplib.compressor.split module¶
Functions for making splits
class asaplib.compressor.split.KFold(n_splits=3, shuffle=False, random_state=None)¶
Bases: sklearn.model_selection._split.KFold
class asaplib.compressor.split.LCSplit(cv, n_repeats=[10], train_sizes=[10], test_size=None, random_state=None, **cvargs)¶
Bases: object
get_n_splits(X=None, y=None, groups=None)¶
Returns the number of splitting iterations in the cross-validator.
- Parameters
X (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.
y (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.
groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set.
- Returns
n_splits – Returns the number of splitting iterations in the cross-validator.
- Return type
int
split(X, y=None, groups=None)¶
Generates indices to split data into training and test sets.
- Parameters
X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like, of length n_samples) – The target variable for supervised learning problems.
groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set.
- Returns
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
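A sketch of combining LCSplit with ShuffleSplit to generate learning-curve splits; passing the splitter class as the cv argument is an assumption, since this section does not state whether cv expects a class or an instance:

    import numpy as np
    from asaplib.compressor.split import LCSplit, ShuffleSplit

    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    # 4 repeats at each of the training-set sizes 20 and 40, with 20 test samples
    lc = LCSplit(ShuffleSplit, n_repeats=[4, 4], train_sizes=[20, 40],
                 test_size=20, random_state=0)
    print(lc.get_n_splits(X))          # total number of splitting iterations
    for train, test in lc.split(X, y):
        pass                           # train/test are index arrays for each split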
class asaplib.compressor.split.ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)¶
Bases: sklearn.model_selection._split.ShuffleSplit
asaplib.compressor.split.exponential_split(xmin, xmax, n=5)¶
Obtain integers that are equally spaced in log space.
- Parameters
xmin (float) – lower bound in original space
xmax (float) – upper bound in original space
n (int) – number of points to generate (default is 5)
- Returns
X (np.array) – list of n evenly spaced points in log space
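For example, log-spaced training-set sizes for a learning curve (the exact rounding of the returned integers is an implementation detail):

    from asaplib.compressor.split import exponential_split

    sizes = exponential_split(10, 1000, n=5)
    print(sizes)   # 5 integers between 10 and 1000, roughly evenly spaced in log space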
asaplib.compressor.split.kernel_random_split(X, y, r=0.05, seed=0)¶
- Parameters
X (array-like, shape=[n_samples, n_desc]) – kernel matrix
y (array-like, shape=[n_samples]) – labels
r (float) – test ratio
seed (int) – random seed
- Returns
X_train, X_test (np.matrix) – train/test kernel matrix
y_train, y_test (np.array) – train/test labels
train_list, test_list (list) – train/test indexes
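A usage sketch of kernel_random_split; the unpacking order follows the Returns list above:

    import numpy as np
    from asaplib.compressor.split import kernel_random_split

    K = np.random.rand(100, 100)
    K = 0.5 * (K + K.T)                # an illustrative symmetric kernel matrix
    y = np.random.rand(100)

    # hold out 5% of the samples for testing
    X_train, X_test, y_train, y_test, train_list, test_list = kernel_random_split(
        K, y, r=0.05, seed=0)
    print(X_train.shape, X_test.shape)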