tabmat package

tabmat.from_pandas(df, dtype=<class 'numpy.float64'>, sparse_threshold=0.1, cat_threshold=4, object_as_cat=False, cat_position='expand', drop_first=False)

Transform a pandas.DataFrame into an efficient SplitMatrix. For most users, this will be the primary way to construct tabmat objects from their data.

Parameters
  • df (pd.DataFrame) – pandas DataFrame to be converted.

  • dtype (np.dtype, default np.float64) – dtype of all sub-matrices of the resulting SplitMatrix.

  • sparse_threshold (float, default 0.1) – Density threshold below which numerical columns will be stored in a sparse format.

  • cat_threshold (int, default 4) – Number of levels of a categorical column under which the column will be stored as sparse one-hot-encoded columns instead of a CategoricalMatrix.

  • object_as_cat (bool, default False) – If True, DataFrame columns stored as python objects will be treated as categorical columns.

  • cat_position (str {'end'|'expand'}, default 'expand') – Position of the categorical variables in the index. If 'end', all the categoricals (including the ones that did not satisfy cat_threshold) will be placed at the end of the index list. If 'expand', all the variables will remain in the same order.

  • drop_first (bool, default False) – If True, categorical variables will have their first category dropped. This allows multiple categorical variables to be included in an unregularized model. If False, all categories are included.

Return type

SplitMatrix
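
To make the sparse_threshold rule concrete, here is a minimal NumPy sketch. It assumes that "density" means the share of nonzero entries in a column. The helper column_densities is hypothetical and is not part of tabmat; the snippet only illustrates the criterion, not how from_pandas is implemented.

```python
import numpy as np

def column_densities(X):
    """Fraction of nonzero entries in each column (hypothetical helper)."""
    X = np.asarray(X)
    return np.count_nonzero(X, axis=0) / X.shape[0]

X = np.zeros((20, 2))
X[:, 0] = np.arange(1, 21)   # every entry nonzero
X[0, 1] = 5.0                # 1 nonzero entry out of 20

densities = column_densities(X)
sparse_threshold = 0.1       # the documented default

# Columns whose density falls below the threshold are candidates
# for sparse storage; the others would stay dense.
stored_sparse = densities < sparse_threshold
```

Here the first column has density 1.0 and the second has density 0.05, so only the second falls below the default threshold.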

tabmat.from_csc(mat, threshold=0.1)

Convert a CSC-format sparse matrix into a SplitMatrix.

The threshold parameter specifies the density below which a column is treated as sparse.

Parameters

mat (csc_matrix) –
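
For a CSC matrix, per-column density can be read off the column pointer array without touching the values. The sketch below builds the three CSC arrays by hand with NumPy; the threshold value 0.5 is chosen for illustration only, and the snippet is an assumption about how the criterion can be evaluated, not tabmat's code.

```python
import numpy as np

# CSC layout of a 4 x 3 matrix: column j owns data[indptr[j]:indptr[j + 1]].
n_rows = 4
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([0, 1, 2, 3, 2])   # row index of each stored value
indptr = np.array([0, 4, 4, 5])       # column boundaries

nnz_per_col = np.diff(indptr)         # number of stored entries per column
density = nnz_per_col / n_rows

threshold = 0.5                       # illustrative value, not the default
treated_as_sparse = density < threshold
```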

class tabmat.MatrixBase

Bases: ABC

Base class for all matrix classes. MatrixBase cannot be instantiated.

property A: ndarray

Convert self into an np.ndarray. Synonym for toarray().

abstract matvec(other, cols=None, out=None)

Perform: self[:, cols] @ other, so result[i] = sum_j self[i, j] other[j].

The ‘cols’ parameter allows restricting to a subset of the matrix without making a copy. If provided:

result[i] = sum_{j in cols} self[i, j] other[j].

If ‘out’ is provided, we modify ‘out’ in place by adding the output of this operation to it.

Parameters
  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –
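
The contract above can be written as a dense NumPy reference. This is a sketch of the semantics only; matvec_reference is a hypothetical name, and concrete subclasses are free to compute the same result differently.

```python
import numpy as np

def matvec_reference(X, other, cols=None, out=None):
    """Dense reference for the matvec contract (not tabmat's code)."""
    X = np.asarray(X, dtype=float)
    other = np.asarray(other, dtype=float)
    if cols is None:
        cols = np.arange(X.shape[1])
    result = X[:, cols] @ other[cols]   # sum over j in cols only
    if out is None:
        return result
    out += result                       # accumulate into out in place
    return out

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
v = np.array([1.0, 10.0, 100.0])

full = matvec_reference(X, v)                                # [321., 654.]
restricted = matvec_reference(X, v, cols=np.array([0, 2]))   # [301., 604.]

acc = np.ones(2)
matvec_reference(X, v, out=acc)                              # acc is now [322., 655.]
```

Note that other keeps its full length even when cols is given; only the entries at cols contribute.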

abstract sandwich(d, rows=None, cols=None)

Perform a sandwich product: (self[rows, cols].T * d[rows]) @ self[rows, cols].

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

Parameters
  • d (ndarray) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

Return type

ndarray
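
As a reference point, the sandwich product can be reproduced with plain NumPy. The sketch assumes that self[rows, cols] denotes the sub-matrix formed by all combinations of rows and cols (np.ix_ in NumPy terms); sandwich_reference is a hypothetical name.

```python
import numpy as np

def sandwich_reference(X, d, rows=None, cols=None):
    """Dense reference for (X[rows, cols].T * d[rows]) @ X[rows, cols]."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d, dtype=float)
    if rows is None:
        rows = np.arange(X.shape[0])
    if cols is None:
        cols = np.arange(X.shape[1])
    sub = X[np.ix_(rows, cols)]        # outer indexing: every (row, col) pair
    return (sub.T * d[rows]) @ sub     # scales row k of sub by d[rows[k]]

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
d = np.array([1.0, 0.5, 2.0])

full = sandwich_reference(X, d)
restricted = sandwich_reference(X, d, rows=np.array([0, 2]), cols=np.array([1]))
```

Multiplying sub.T by d[rows] broadcasts d across the columns of sub.T, which is the same as sub.T @ diag(d[rows]) without forming the diagonal matrix.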

standardize(weights, center_predictors, scale_predictors)

Return a StandardizedMatrix along with the column means and column standard deviations.

It is often useful to modify a dataset so that each column has mean zero and standard deviation one. This function does this “standardization” without modifying the underlying dataset by storing shifting and scaling factors that are then used whenever an operation is performed with the new StandardizedMatrix.

Note: If center_predictors is False, col_means will be zeros.

Note: If scale_predictors is False, col_stds will be None.

Parameters
  • weights (ndarray) –

  • center_predictors (bool) –

  • scale_predictors (bool) –

Return type

Tuple[Any, ndarray, Optional[ndarray]]
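
The two returned arrays can be illustrated with a small NumPy sketch. It assumes that the weights are normalized to sum to one and that standard deviations are taken around the weighted column means; standardization_factors is a hypothetical helper and may differ in detail from what tabmat computes.

```python
import numpy as np

def standardization_factors(X, weights, center_predictors, scale_predictors):
    """Weighted column means and standard deviations (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # assumption: normalize the weights
    means = w @ X                        # weighted mean of each column
    if scale_predictors:
        col_stds = np.sqrt(w @ (X - means) ** 2)
    else:
        col_stds = None                  # matches the note on scale_predictors
    col_means = means if center_predictors else np.zeros(X.shape[1])
    return col_means, col_stds

X = np.array([[1.0, 2.0],
              [3.0, 6.0]])
col_means, col_stds = standardization_factors(X, [1.0, 1.0], True, True)
no_means, no_stds = standardization_factors(X, [1.0, 1.0], False, False)
```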

abstract toarray()

Convert self into an np.ndarray.

Return type

ndarray

abstract transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec, so result[i] = sum_j self[j, i] vec[j].

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

If ‘rows’ and ‘cols’ are provided:

result[i] = sum_{j in rows} self[j, cols[i]] vec[j].

Note that the length of the output is len(cols).

If out is provided:

out[cols[i]] += sum_{j in rows} self[j, cols[i]] vec[j]
Parameters
  • vec (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray
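
The difference between the returned vector (length len(cols)) and the in-place update of out (indexed by cols[i]) is easy to miss. The following dense NumPy sketch spells it out; transpose_matvec_reference is a hypothetical name, and the in-place branch assumes cols contains no repeated indices.

```python
import numpy as np

def transpose_matvec_reference(X, vec, rows=None, cols=None, out=None):
    """Dense reference for the transpose_matvec contract (not tabmat's code)."""
    X = np.asarray(X, dtype=float)
    vec = np.asarray(vec, dtype=float)
    if rows is None:
        rows = np.arange(X.shape[0])
    if cols is None:
        cols = np.arange(X.shape[1])
    result = X[np.ix_(rows, cols)].T @ vec[rows]   # length len(cols)
    if out is None:
        return result
    out[cols] += result    # out spans all columns; only entries at cols change
    return out

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
vec = np.array([1.0, 10.0])

full = transpose_matvec_reference(X, vec)                     # [41., 52., 63.]
sub = transpose_matvec_reference(X, vec, rows=np.array([1]),
                                 cols=np.array([0, 2]))       # [40., 60.]

acc = np.zeros(3)
transpose_matvec_reference(X, vec, cols=np.array([0, 2]), out=acc)  # [41., 0., 63.]
```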

class tabmat.DenseMatrix(input_array)

Bases: ndarray, MatrixBase

A numpy.ndarray subclass with several additional functions that allow it to share the MatrixBase API with SparseMatrix and CategoricalMatrix.

In particular, we have added:

  • The sandwich product

  • getcol to support the same interface as SparseMatrix for retrieving a single column

  • toarray

  • matvec

getcol(i)

Return matrix column at specified index.

matvec(vec, cols=None, out=None)

Perform self[:, cols] @ vec[cols].

Parameters
  • vec (Union[ndarray, List]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray

multiply(other)

Element-wise multiplication.

This assumes that other is a vector of size self.shape[0].

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters
  • d (ndarray) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

Return type

ndarray

toarray()

Return array representation of matrix.

transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec.

Parameters
  • vec (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray

class tabmat.SparseMatrix(arg1, shape=None, dtype=None, copy=False)

Bases: csc_matrix, MatrixBase

A scipy.sparse csc matrix subclass that allows such objects to conform to the MatrixBase interface.

SparseMatrix is instantiated in the same way as scipy.sparse.csc_matrix.

Parameters
  • shape (Tuple[int, int]) –

  • dtype (dtype) –

astype(dtype, order='K', casting='unsafe', copy=True)

Return SparseMatrix cast to new type.

matvec(vec, cols=None, out=None)

Perform self[:, cols] @ vec[cols].

Parameters
  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

multiply(other)

Element-wise multiplication.

See scipy.sparse.csc_matrix.multiply. The method is taken almost directly from the parent class except that other is assumed to be a vector of size self.shape[0].

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters
  • d (ndarray) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

Return type

ndarray

sandwich_dense(B, d, rows, L_cols, R_cols)

Perform a sandwich product: self.T @ diag(d) @ B.

Parameters
  • B (ndarray) –

  • d (ndarray) –

  • rows (ndarray) –

  • L_cols (ndarray) –

  • R_cols (ndarray) –

Return type

ndarray

transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec.

Parameters
  • vec (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray

property x_csr

Cache the CSR representation of the matrix.

class tabmat.CategoricalMatrix(cat_vec, drop_first=False, dtype=<class 'numpy.float64'>)

Bases: MatrixBase

A faster, more memory-efficient sparse matrix adapted to the specific setting of a one-hot encoded categorical variable.

Parameters
  • cat_vec (Union[List, ndarray, Categorical]) – array-like vector of categorical data.

  • drop_first (bool) – drop the first level of the dummy encoding. This allows a CategoricalMatrix to be used in an unregularized setting.

  • dtype (numpy.dtype) – data type
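
To see what drop_first changes, consider the dense one-hot encoding that a CategoricalMatrix represents. The sketch below uses NumPy only and assumes the levels are ordered as np.unique sorts them; it illustrates the encoding, not the internal storage.

```python
import numpy as np

cat_vec = np.array(["b", "a", "c", "a"])

# levels are sorted ("a", "b", "c"); codes give each row's level index.
levels, codes = np.unique(cat_vec, return_inverse=True)

one_hot = np.eye(len(levels))[codes]   # full dummy encoding, one 1 per row
one_hot_dropped = one_hot[:, 1:]       # analogue of drop_first=True
```

With all levels kept, the columns sum to a vector of ones, which is collinear with an intercept column. Dropping the first level removes that redundancy, which is why it matters for unregularized models.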

astype(dtype, order='K', casting='unsafe', copy=True)

Return CategoricalMatrix cast to new type.

getcol(i)

Return matrix column at specified index.

Parameters

i (int) –

Return type

csc_matrix

matvec(other, cols=None, out=None)

Multiply self with vector ‘other’, and add vector ‘out’ if it is present.

out[i] += sum_j mat[i, j] other[j] = other[mat.indices[i]]

The cols parameter allows restricting to a subset of the matrix without making a copy.

If out is None, then a new array will be returned.

Test: test_matrices::test_matvec

Parameters
  • other (Union[List, ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray
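
Because each row of a one-hot matrix has a single 1, the product reduces to a lookup. The sketch below checks that identity with NumPy; codes plays the role of indices in the formula above, and the snippet is an illustration rather than tabmat's implementation.

```python
import numpy as np

codes = np.array([1, 0, 2, 0])            # level index of each row
other = np.array([10.0, 20.0, 30.0])

fast = other[codes]                       # lookup: [20., 10., 30., 10.]
dense = np.eye(3)[codes] @ other          # same result via an explicit one-hot matrix

out = np.ones(4)
out += other[codes]                       # the 'out' variant accumulates in place
```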

multiply(other)

Element-wise multiplication.

This assumes that other is a vector of size self.shape[0].

Return type

SparseMatrix

recover_orig()

Return 1d numpy array with same data as what was initially fed to __init__.

Test: matrix/test_categorical_matrix::test_recover_orig

Return type

ndarray

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

sandwich(self, d)[i, j] = (self.T @ diag(d) @ self)[i, j]
    = sum_k (self[k, i] (diag(d) @ self)[k, j])
    = sum_k self[k, i] sum_m diag(d)[k, m] self[m, j]
    = sum_k self[k, i] d[k] self[k, j]
    = 0 if i != j
sandwich(self, d)[i, i] = sum_k self[k, i] ** 2 * d[k]

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

Parameters
  • d (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

Return type

dia_matrix
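
The derivation says the result is diagonal, with entry i equal to the sum of d over the rows that belong to level i. A weighted bincount computes exactly that. The NumPy sketch below checks it against the dense formula; it is a reference illustration, with codes standing in for the per-row level indices.

```python
import numpy as np

codes = np.array([1, 0, 2, 0])            # level index of each row
d = np.array([1.0, 2.0, 3.0, 4.0])
n_levels = 3

# Entry i is the sum of d[k] over rows k with codes[k] == i.
diagonal = np.bincount(codes, weights=d, minlength=n_levels)   # [6., 1., 3.]

# Dense check: X.T @ diag(d) @ X for the explicit one-hot matrix X.
X = np.eye(n_levels)[codes]
dense = X.T @ np.diag(d) @ X
```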

toarray()

Return array representation of matrix.

Return type

ndarray

tocsr()

Return scipy csr representation of matrix.

Return type

csr_matrix

transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec.

for i in cols: out[i] += sum_{j in rows} self[j, i] vec[j]
self[j, i] = 1(indices[j] == i)

for j in rows:
    for i in cols:
        out[i] += (indices[j] == i) * vec[j]

If cols == range(self.shape[1]), then for every row j, there will be exactly one relevant column, so you can do

for j in rows:
    out[indices[j]] += vec[j]

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

If out is None, then a new array will be returned.

Test: tests/test_matrices::test_transpose_matvec

Parameters
  • vec (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray
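
The single-loop form above is a scatter-add, which NumPy expresses with np.add.at. The sketch below covers the case where cols spans all columns; codes stands in for indices, and the snippet is an illustration rather than tabmat's implementation.

```python
import numpy as np

codes = np.array([1, 0, 2, 0])                 # level index of each row
vec = np.array([1.0, 10.0, 100.0, 1000.0])
n_levels = 3

# All rows: out[codes[j]] += vec[j] for every j.
out = np.zeros(n_levels)
np.add.at(out, codes, vec)                     # [1010., 1., 100.]

# Restricting to a subset of rows only changes which j are visited.
rows = np.array([0, 1])
out_rows = np.zeros(n_levels)
np.add.at(out_rows, codes[rows], vec[rows])    # [10., 1., 0.]

# Dense check against the explicit one-hot matrix.
X = np.eye(n_levels)[codes]
```

np.add.at is used instead of out[codes] += vec because the latter applies only one update per repeated index.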

class tabmat.SplitMatrix(matrices, indices=None)

Bases: MatrixBase

A class for matrices with sparse, dense and categorical parts.

For real-world tabular data, it’s common for the same dataset to contain a mix of columns that are naturally dense, naturally sparse and naturally categorical. Representing each of these sets of columns in the format that is most natural allows for a significant speedup in matrix multiplications compared to representations that are entirely dense or entirely sparse.

Initialize a SplitMatrix directly with a list of matrices and a list of column indices for each matrix. Most of the time, it will be best to use tabmat.from_pandas() or tabmat.from_csc() to initialize a SplitMatrix.

Parameters
  • matrices (List[Union[DenseMatrix, SparseMatrix, CategoricalMatrix]]) – The sub-matrices composing the columns of this SplitMatrix.

  • indices (Optional[List[ndarray]]) – If indices is not None, then for each matrix passed in matrices, indices must contain the set of columns which that matrix covers.
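
The relationship between matrices and indices can be sketched with plain NumPy arrays standing in for the sub-matrices. The block contents below are made up for illustration; the point is that a product with the full matrix decomposes into one product per block, each using the entries of the vector that belong to that block's columns.

```python
import numpy as np

# Two stand-in sub-matrices and the columns of the full matrix they cover.
block_a = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
block_b = np.array([[0.0],
                    [5.0]])
matrices = [block_a, block_b]
indices = [np.array([0, 2]), np.array([1])]

v = np.array([1.0, 10.0, 100.0])

# matvec: sum of per-block products over the matching entries of v.
result = sum(mat @ v[idx] for mat, idx in zip(matrices, indices))

# Reassemble the full 2 x 3 matrix to check the decomposition.
full = np.empty((2, 3))
for mat, idx in zip(matrices, indices):
    full[:, idx] = mat
```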

astype(dtype, order='K', casting='unsafe', copy=True)

Return SplitMatrix cast to new type.

getcol(i)

Return matrix column at specified index.

Parameters

i (int) –

Return type

Union[ndarray, csr_matrix]

matvec(v, cols=None, out=None)

Perform self[:, cols] @ v[cols].

Parameters
  • v (ndarray) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray

multiply(other)

Element-wise multiplication.

This assumes that other is a vector of size self.shape[0].

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters
  • d (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

Return type

ndarray

toarray()

Return array representation of matrix.

Return type

ndarray

transpose_matvec(v, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ v.

self.transpose_matvec(v, rows, cols) = self[rows, cols].T @ v[rows]
self.transpose_matvec(v, rows, cols)[i]
    = sum_{j in rows} self[j, cols[i]] v[j]
    = sum_{j in rows} sum_{mat in self.matrices} 1(cols[i] in mat)
                                                self[j, cols[i]] v[j]
Parameters
  • v (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray

class tabmat.StandardizedMatrix(mat, shift, mult=None)

Bases: object

A StandardizedMatrix represents a matrix whose columns are standardized to mean zero and standard deviation one, without modifying the underlying (possibly sparse) matrix.

To be precise, for a StandardizedMatrix:

self[i, j] = (self.mult[j] * self.mat[i, j]) + self.shift[j]

This class is returned from MatrixBase.standardize.

Parameters
  • mat (MatrixBase) –

  • shift (Union[ndarray, List]) –

  • mult (Union[ndarray, List]) –
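
The formula above means products can be computed without ever forming the standardized matrix. The NumPy sketch below verifies the identity for a matrix-vector product; the values of mat, mult and shift are arbitrary and chosen for illustration only.

```python
import numpy as np

mat = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
mult = np.array([0.5, 2.0])
shift = np.array([-1.0, 3.0])
v = np.array([2.0, -1.0])

# Explicit: build self[i, j] = mult[j] * mat[i, j] + shift[j], then multiply.
explicit = (mat * mult + shift) @ v

# Lazy: fold mult into v and add the constant contribution of shift.
lazy = mat @ (mult * v) + shift @ v
```

The lazy form touches mat only through an ordinary matvec, so a sparse mat stays sparse.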

property A: ndarray

Return array representation of self.

astype(dtype, order='K', casting='unsafe', copy=True)

Return StandardizedMatrix cast to new type.

getcol(i)

Return matrix column at specified index.

Returns a StandardizedMatrix.

>>> from scipy import sparse as sps
>>> x = StandardizedMatrix(SparseMatrix(sps.eye(3).tocsc()), shift=[0, 1, -2])
>>> col_1 = x.getcol(1)
>>> isinstance(col_1, StandardizedMatrix)
True
>>> col_1.A
array([[1.],
       [2.],
       [1.]])
Parameters

i (int) –

matvec(other_mat, cols=None, out=None)

Perform self[:, cols] @ other_mat[cols].

This function returns a dense output, so it is best geared for the matrix-vector case.

Parameters
  • other_mat (Union[ndarray, List]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray

multiply(other)

Element-wise multiplication.

Note that the output of this function is always a DenseMatrix and might require a lot more memory. This assumes that other is a vector of size self.shape[0].

Return type

DenseMatrix

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters
  • d (ndarray) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

Return type

ndarray

toarray()

Return array representation of matrix.

Return type

ndarray

transpose_matvec(other, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ other.

Let self.shape = (N, K) and other.shape = (N, M). Let shift_mat = outer(ones(N), shift), so that, ignoring the mult scaling, X = X.mat + shift_mat.

(X.T @ other)[k, i] = (X.mat.T @ other)[k, i] + (shift_mat.T @ other)[k, i]

(shift_mat.T @ other)[k, i]
    = (outer(shift, ones(N)) @ other)[k, i]
    = sum_j outer(shift, ones(N))[k, j] other[j, i]
    = sum_j shift[k] other[j, i]
    = shift[k] other.sum(0)[i]
    = outer(shift, other.sum(0))[k, i]

With row and col restrictions:

self.transpose_matvec(other, rows, cols)[i, j]
    = self.mat.transpose_matvec(other, rows, cols)[i, j]
      + (outer(self.shift, ones(N))[cols, rows] @ other[rows])[i, j]
    = self.mat.transpose_matvec(other, rows, cols)[i, j]
      + shift[cols[i]] other[rows].sum(0)[j]

Parameters
  • other (Union[ndarray, List]) –

  • rows (Optional[ndarray]) –

  • cols (Optional[ndarray]) –

  • out (Optional[ndarray]) –

Return type

ndarray
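
The shift identity in the derivation can be checked numerically. The NumPy sketch below does so for the unscaled case (mult equal to one), with and without row and column restrictions; all array values are arbitrary and serve only as an illustration.

```python
import numpy as np

mat = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])              # shape (N, K) = (3, 2)
shift = np.array([-1.0, 3.0])
other = np.array([[1.0, 0.0],
                  [2.0, 1.0],
                  [0.0, 4.0]])            # shape (N, M) = (3, 2)

# Explicit shifted matrix, built only for the check.
X = mat + np.outer(np.ones(3), shift)

# Unrestricted: X.T @ other = mat.T @ other + outer(shift, other.sum(0)).
explicit = X.T @ other
lazy = mat.T @ other + np.outer(shift, other.sum(0))

# Restricted to a subset of rows and columns.
rows = np.array([0, 2])
cols = np.array([1])
explicit_sub = X[np.ix_(rows, cols)].T @ other[rows]
lazy_sub = (mat[np.ix_(rows, cols)].T @ other[rows]
            + np.outer(shift[cols], other[rows].sum(0)))
```

The shift term needs only the column sums of other, so it adds O(K * M) work on top of the product with the underlying matrix.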

unstandardize()

Get unstandardized (base) matrix.

Return type

MatrixBase