tabmat package

tabmat.from_pandas(df, dtype=<class 'numpy.float64'>, sparse_threshold=0.1, cat_threshold=4, object_as_cat=False, cat_position='expand', drop_first=False, categorical_format='{name}[{category}]', cat_missing_method='fail', cat_missing_name='(MISSING)')

Transform a pandas.DataFrame into an efficient SplitMatrix. For most users, this will be the primary way to construct tabmat objects from their data.

Parameters:

df (pd.DataFrame) – pandas DataFrame to be converted.
dtype (np.dtype, default np.float64) – dtype of all sub-matrices of the resulting SplitMatrix.
sparse_threshold (float, default 0.1) – Density threshold below which numerical columns will be stored in a sparse format.
cat_threshold (int, default 4) – Number of levels of a categorical column below which the column will be stored as sparse one-hot-encoded columns instead of a CategoricalMatrix.
object_as_cat (bool, default False) – If True, DataFrame columns stored as python objects will be treated as categorical columns.
cat_position (str {'end'|'expand'}, default 'expand') – Position of the categorical variables in the index. If “end”, all the categoricals (including the ones that did not satisfy cat_threshold) will be placed at the end of the index list. If “expand”, all the variables will remain in the same order.
drop_first (bool, default False) – If True, categorical variables will have their first category dropped. This allows multiple categorical variables to be included in an unregularized model. If False, all categories are included.
cat_missing_method (str {'fail'|'zero'|'convert'}, default 'fail') – How to handle missing values in categorical columns:
- if ‘fail’, raise an error if there are missing values.
- if ‘zero’, missing values will represent all-zero indicator columns.
- if ‘convert’, missing values will be converted to the cat_missing_name category.
cat_missing_name (str, default '(MISSING)') – Name of the category to which missing values will be converted if cat_missing_method='convert'.
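categorical_format (str, default '{name}[{category}]') – Format string used to name the columns generated from categorical variables, using the placeholders {name} and {category}.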

Return type:

SplitMatrix
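
To make the interplay of categorical_format and drop_first concrete, here is a minimal, dependency-free sketch. The helper one_hot_column_names is hypothetical and not part of tabmat; it only mimics how a format string with the {name} and {category} placeholders would yield one column name per retained level.

```python
def one_hot_column_names(name, categories,
                         categorical_format="{name}[{category}]",
                         drop_first=False):
    """Hypothetical helper: build one column name per retained category."""
    # With drop_first=True the first level is omitted, as described above.
    kept = categories[1:] if drop_first else categories
    return [categorical_format.format(name=name, category=c) for c in kept]


levels = ["blue", "green", "red"]
all_names = one_hot_column_names("color", levels)
reduced_names = one_hot_column_names("color", levels, drop_first=True)
custom_names = one_hot_column_names(
    "color", levels, categorical_format="{name}__{category}"
)
print(all_names)      # ['color[blue]', 'color[green]', 'color[red]']
print(reduced_names)  # ['color[green]', 'color[red]']
```

The actual names produced by from_pandas should be checked with get_names on the returned matrix.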

tabmat.from_csc(mat, threshold=0.1, column_names=None, term_names=None)

Convert a CSC-format sparse matrix into a SplitMatrix.

The threshold parameter specifies the density below which a column is treated as sparse.

Parameters: mat (csc_matrix)
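
The density rule can be illustrated with a small plain-Python sketch. The helper split_by_density is hypothetical and works on lists rather than on a csc_matrix. It treats a column as sparse when its share of nonzero entries is strictly below the threshold; how a column lying exactly on the threshold is handled is an assumption here.

```python
def split_by_density(columns, threshold=0.1):
    """Hypothetical helper: return (dense_idx, sparse_idx) for a list of columns."""
    dense_idx, sparse_idx = [], []
    for j, col in enumerate(columns):
        # Density = share of nonzero entries in the column.
        density = sum(1 for x in col if x != 0) / len(col)
        # Assumption: strictly below the threshold means "sparse".
        (sparse_idx if density < threshold else dense_idx).append(j)
    return dense_idx, sparse_idx


columns = [
    [1.0] * 20,            # density 1.0
    [0.0] * 19 + [5.0],    # density 0.05
    [0.0, 2.0] * 10,       # density 0.5
]
dense_idx, sparse_idx = split_by_density(columns, threshold=0.1)
print(dense_idx, sparse_idx)  # [0, 2] [1]
```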

class tabmat.MatrixBase

Bases: ABC

Base class for all matrix classes. MatrixBase cannot be instantiated.

property A: ndarray: Convert self into an np.ndarray. Synonym for toarray().

property column_names: Column names of the matrix.

abstract get_names(type='column', missing_prefix=None, indices=None)

Get column names.

For columns that do not have a name, a default name is created using the following pattern: "{missing_prefix}{start_index + i}" where i is the index of the column.

Parameters:

type (str {'column'|'term'}) – Whether to get column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).
missing_prefix (Optional[str], default None) – Prefix to use for columns that do not have a name. If None, then no default name is created.
indices (list[int] | None) – The indices used for columns that do not have a name. If None, then the indices are list(range(self.shape[1])).

Returns:

Column names.

Return type:

list[Optional[str]]
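
The default-name rule can be sketched in plain Python as follows. The function fill_missing_names is a hypothetical stand-in, not the library's implementation; it assumes that the index used for an unnamed column is the corresponding entry of indices.

```python
def fill_missing_names(names, missing_prefix=None, indices=None):
    """Hypothetical reference for the default-name rule described above."""
    if indices is None:
        indices = list(range(len(names)))
    if missing_prefix is None:
        # No prefix: unnamed columns stay None.
        return list(names)
    return [
        name if name is not None else f"{missing_prefix}{idx}"
        for name, idx in zip(names, indices)
    ]


names = ["age", None, "income", None]
unchanged = fill_missing_names(names)
prefixed = fill_missing_names(names, missing_prefix="_col_")
shifted = fill_missing_names(names, missing_prefix="_col_", indices=[10, 11, 12, 13])
print(prefixed)  # ['age', '_col_1', 'income', '_col_3']
```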

abstract matvec(other, cols=None, out=None)

Perform: self[:, cols] @ other[cols], so result[i] = sum_j self[i, j] other[j].

The ‘cols’ parameter allows restricting to a subset of the matrix without making a copy. If provided:

result[i] = sum_{j in cols} self[i, j] other[j].

If ‘out’ is provided, we modify ‘out’ in place by adding the output of this operation to it.

Parameters:

cols (ndarray)
out (ndarray)
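
As a reading aid, the index formula can be written out as a plain-Python reference. The name matvec_reference and the list-of-lists representation are illustrative assumptions; concrete subclasses use their own optimized routines.

```python
def matvec_reference(mat, other, cols=None, out=None):
    """Reference for result[i] = sum_{j in cols} mat[i][j] * other[j]."""
    n_rows = len(mat)
    n_cols = len(mat[0]) if mat else 0
    if cols is None:
        cols = range(n_cols)
    if out is None:
        out = [0.0] * n_rows
    for i in range(n_rows):
        # Contributions are added to out, so a provided out is modified in place.
        out[i] += sum(mat[i][j] * other[j] for j in cols)
    return out


mat = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
other = [1.0, 10.0, 100.0]
full = matvec_reference(mat, other)                      # [321.0, 654.0]
restricted = matvec_reference(mat, other, cols=[0, 2])   # [301.0, 604.0]
accumulated = matvec_reference(mat, other, cols=[1], out=[1.0, 1.0])  # [21.0, 51.0]
```

Note that other keeps its full length even when cols is given; only the selected entries are read.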

abstract sandwich(d, rows=None, cols=None)

Perform a sandwich product: (self[rows, cols].T * d[rows]) @ self[rows, cols].

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

Parameters:

d (ndarray)
rows (ndarray)
cols (ndarray)

Return type:

ndarray
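
The sandwich product with row and column restrictions can be spelled out as a plain-Python reference. The name sandwich_reference and the list-based inputs are assumptions made for illustration only.

```python
def sandwich_reference(mat, d, rows=None, cols=None):
    """Reference for (mat[rows, cols].T * d[rows]) @ mat[rows, cols]."""
    n_rows, n_cols = len(mat), len(mat[0])
    rows = range(n_rows) if rows is None else list(rows)
    cols = list(range(n_cols)) if cols is None else list(cols)
    # Entry (a, b) of the result is sum_k mat[k][a] * d[k] * mat[k][b].
    return [
        [sum(mat[k][a] * d[k] * mat[k][b] for k in rows) for b in cols]
        for a in cols
    ]


mat = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d = [1.0, 0.5, 2.0]
full = sandwich_reference(mat, d)                              # [[55.5, 68.0], [68.0, 84.0]]
restricted = sandwich_reference(mat, d, rows=[0, 2], cols=[1])  # [[76.0]]
```

The result is always square with one row and column per selected column, and it is symmetric.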

set_names(names, type='column')

Set column names.

Parameters:

names (list[Optional[str]]) – Names to set.
type (str {'column'|'term'}) – Whether to set column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).

standardize(weights, center_predictors, scale_predictors)

Return a StandardizedMatrix along with the column means and column standard deviations.

It is often useful to modify a dataset so that each column has mean zero and standard deviation one. This function does this “standardization” without modifying the underlying dataset by storing shifting and scaling factors that are then used whenever an operation is performed with the new StandardizedMatrix.

Note: If center_predictors is False, col_means will be zeros.

Note: If scale_predictors is False, col_stds will be None.

Parameters:

weights (ndarray)
center_predictors (bool)
scale_predictors (bool)

Return type:

tuple[Any, ndarray, ndarray | None]
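
The idea of standardizing without touching the data can be sketched in plain Python. This is not tabmat's implementation: the helper names are hypothetical, the weights are assumed to sum to one, and a weighted population standard deviation is assumed.

```python
import math


def weighted_col_stats(mat, weights):
    """Weighted column means and standard deviations (weights sum to one)."""
    n_cols = len(mat[0])
    means = [sum(w * row[j] for w, row in zip(weights, mat)) for j in range(n_cols)]
    stds = [
        math.sqrt(sum(w * (row[j] - means[j]) ** 2 for w, row in zip(weights, mat)))
        for j in range(n_cols)
    ]
    return means, stds


mat = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]  # never modified
weights = [0.25] * 4
col_means, col_stds = weighted_col_stats(mat, weights)

# Store only per-column factors: standardized[i][j] = mult[j] * mat[i][j] + shift[j].
mult = [1.0 / s for s in col_stds]
shift = [-m * k for m, k in zip(col_means, mult)]


def standardized_entry(i, j):
    """Compute a standardized entry on demand from the untouched data."""
    return mult[j] * mat[i][j] + shift[j]
```

Each standardized column then has weighted mean zero and weighted variance one, while mat itself is unchanged.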

property term_names

Term names of the matrix.

For differences between column names and term names, see get_names.

abstract toarray()

Convert self into an np.ndarray.

Return type: ndarray

abstract transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec[rows], so result[i] = sum_j self[j, i] vec[j].

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

If ‘rows’ and ‘cols’ are provided:

result[i] = sum_{j in rows} self[j, cols[i]] vec[j].

Note that the length of the output is len(cols).

If out is provided:

out[cols[i]] += sum_{j in rows} self[j, cols[i]] vec[j]

Parameters:

vec (ndarray | list)
rows (ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray
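
The difference between returning a fresh result and accumulating into out can be seen in a plain-Python reference. The name transpose_matvec_reference and the list-based inputs are illustrative assumptions.

```python
def transpose_matvec_reference(mat, vec, rows=None, cols=None, out=None):
    """Reference for mat[rows, cols].T @ vec[rows]."""
    n_rows, n_cols = len(mat), len(mat[0])
    rows = range(n_rows) if rows is None else list(rows)
    cols = list(range(n_cols)) if cols is None else list(cols)
    sums = [sum(mat[j][c] * vec[j] for j in rows) for c in cols]
    if out is None:
        # Fresh result: one entry per selected column, so its length is len(cols).
        return sums
    # In-place accumulation: out is indexed by the original column number.
    for c, s in zip(cols, sums):
        out[c] += s
    return out


mat = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
vec = [1.0, 10.0]
full = transpose_matvec_reference(mat, vec)                             # [41.0, 52.0, 63.0]
restricted = transpose_matvec_reference(mat, vec, rows=[1], cols=[0, 2])  # [40.0, 60.0]
accumulated = transpose_matvec_reference(mat, vec, cols=[2], out=[1.0, 1.0, 1.0])
```

In the last call only position 2 of out changes, giving [1.0, 1.0, 64.0].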

class tabmat.DenseMatrix(input_array, column_names=None, term_names=None)

Bases: MatrixBase

A numpy.ndarray subclass with several additional functions that allow it to share the MatrixBase API with SparseMatrix and CategoricalMatrix.

In particular, we have added:

The sandwich product
getcol to support the same interface as SparseMatrix for retrieving a single column
toarray
matvec

property T: Returns a view of the array with axes transposed.

astype(dtype, order='K', casting='unsafe', copy=True): Copy of the array, cast to a specified type.

property dtype: Data-type of the array’s elements.

get_names(type='column', missing_prefix=None, indices=None)

Get column names.

For columns that do not have a name, a default name is created using the following pattern: "{missing_prefix}{start_index + i}" where i is the index of the column.

Parameters:

type (str {'column'|'term'}) – Whether to get column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).
missing_prefix (Optional[str], default None) – Prefix to use for columns that do not have a name. If None, then no default name is created.
indices (list[int] | None) – The indices used for columns that do not have a name. If None, then the indices are list(range(self.shape[1])).

Returns:

Column names.

Return type:

list[Optional[str]]

getcol(i): Return matrix column at specified index.

matvec(vec, cols=None, out=None)

Perform self[:, cols] @ vec[cols].

Parameters:

vec (ndarray | list)
cols (ndarray)
out (ndarray)

Return type:

ndarray

multiply(other)

Element-wise multiplication.

This assumes that other is a vector of size self.shape[0].

property ndim: Number of array dimensions.

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters:

d (ndarray)
rows (ndarray)
cols (ndarray)

Return type:

ndarray

set_names(names, type='column')

Set column names.

Parameters:

names (list[Optional[str]]) – Names to set.
type (str {'column'|'term'}) – Whether to set column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).

property shape: Tuple of array dimensions.

toarray(): Return array representation of matrix.

transpose(): Returns a view of the array with axes transposed.

transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec[rows].

Parameters:

vec (ndarray | list)
rows (ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray

unpack(): Return the underlying numpy.ndarray.

class tabmat.SparseMatrix(input_array, shape=None, dtype=None, copy=False, column_names=None, term_names=None)

Bases: MatrixBase

A scipy.sparse csc matrix subclass that allows such objects to conform to the MatrixBase interface.

SparseMatrix is instantiated in the same way as scipy.sparse.csc_matrix.

Parameters:

shape (tuple[int, int])
dtype (dtype)

property T: Returns a view of the array with axes transposed.

property array_csc: Return the CSC representation of the matrix.

property array_csr: Return the CSR representation of the matrix, computing it on first access and caching it.

astype(dtype, order='K', casting='unsafe', copy=True): Return SparseMatrix cast to new type.

property data: Data of the matrix.

dot(other): Return the dot product as a scipy sparse matrix.

property dtype: Data-type of the array’s elements.

get_names(type='column', missing_prefix=None, indices=None)

Get column names.

For columns that do not have a name, a default name is created using the following pattern: "{missing_prefix}{start_index + i}" where i is the index of the column.

Parameters:

type (str {'column'|'term'}) – Whether to get column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).
missing_prefix (Optional[str], default None) – Prefix to use for columns that do not have a name. If None, then no default name is created.
indices (list[int] | None) – The indices used for columns that do not have a name. If None, then the indices are list(range(self.shape[1])).

Returns:

Column names.

Return type:

list[Optional[str]]

getcol(i): Return matrix column at specified index.

property indices: Indices of the matrix.

property indptr: Indptr of the matrix.

matvec(vec, cols=None, out=None)

Perform self[:, cols] @ vec[cols].

Parameters:

cols (ndarray)
out (ndarray)

multiply(other)

Element-wise multiplication.

See scipy.sparse.csc_matrix.multiply. The method is taken almost directly from the parent class except that other is assumed to be a vector of size self.shape[0].

property ndim: Number of array dimensions.

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters:

d (ndarray)
rows (ndarray)
cols (ndarray)

Return type:

ndarray

sandwich_dense(B, d, rows, L_cols, R_cols)

Perform a sandwich product: self.T @ diag(d) @ B.

Parameters:

B (ndarray)
d (ndarray)
rows (ndarray)
L_cols (ndarray)
R_cols (ndarray)

Return type:

ndarray

set_names(names, type='column')

Set column names.

Parameters:

names (list[Optional[str]]) – Names to set.
type (str {'column'|'term'}) – Whether to set column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).

property shape: Tuple of array dimensions.

toarray(): Return a dense ndarray representation of the matrix.

tocsc(copy=False): Return the matrix in CSC format.

transpose(): Returns a view of the array with axes transposed.

transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec[rows].

Parameters:

vec (ndarray | list)
rows (ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray

unpack(): Return the underlying scipy.sparse.csc_matrix.

class tabmat.CategoricalMatrix(cat_vec, drop_first=False, dtype=<class 'numpy.float64'>, column_name=None, term_name=None, column_name_format='{name}[{category}]', cat_missing_method='fail', cat_missing_name='(MISSING)')

Bases: MatrixBase

A faster, more memory efficient sparse matrix adapted to the specific settings of a one-hot encoded categorical variable.

Parameters:

cat_vec (list | ndarray | Categorical) – array-like vector of categorical data.
drop_first (bool) – drop the first level of the dummy encoding. This allows a CategoricalMatrix to be used in an unregularized setting.
cat_missing_method (str {'fail'|'zero'|'convert'}, default 'fail') –
- if ‘fail’, raise an error if there are missing values.
- if ‘zero’, missing values will represent all-zero indicator columns.
- if ‘convert’, missing values will be converted to the cat_missing_name category.
cat_missing_name (str, default '(MISSING)') – Name of the category to which missing values will be converted if cat_missing_method='convert'. If this category already exists, an error will be raised.
dtype (numpy.dtype) – data type
column_name (str | None)
term_name (str | None)
column_name_format (str)

astype(dtype, order='K', casting='unsafe', copy=True): Return CategoricalMatrix cast to new type.

get_names(type='column', missing_prefix=None, indices=None)

Get column names.

For columns that do not have a name, a default name is created using the following pattern: "{missing_prefix}{start_index + i}" where i is the index of the column.

Parameters:

type (str {'column'|'term'}) – Whether to get column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).
missing_prefix (Optional[str], default None) – Prefix to use for columns that do not have a name. If None, then no default name is created.
indices (list[int] | None) – The indices used for columns that do not have a name. If None, then the indices are list(range(self.shape[1])).

Returns:

Column names.

Return type:

list[Optional[str]]

getcol(i)

Return matrix column at specified index.

Parameters: i (int)
Return type: SparseMatrix

matvec(other, cols=None, out=None)

Multiply self with vector ‘other’, and add vector ‘out’ if it is present.

out[i] += sum_j mat[i, j] other[j] = other[mat.indices[i]]

The cols parameter allows restricting to a subset of the matrix without making a copy.

If out is None, then a new array will be returned.

Test: test_matrices::test_matvec

Parameters:

other (list | ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray
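
The lookup identity out[i] += other[mat.indices[i]] can be checked with a small plain-Python sketch. Here codes is a hypothetical stand-in for mat.indices (the category index of each row), and categorical_matvec is an illustrative function rather than the library routine.

```python
def categorical_matvec(codes, other, cols=None, out=None):
    """codes[i] is the category index of row i; one-hot entry (i, codes[i]) is 1."""
    allowed = None if cols is None else set(cols)
    if out is None:
        out = [0.0] * len(codes)
    for i, code in enumerate(codes):
        if allowed is None or code in allowed:
            out[i] += other[code]  # a single lookup replaces a full dot product
    return out


codes = [2, 0, 1, 2]
other = [10.0, 20.0, 30.0]
looked_up = categorical_matvec(codes, other)                  # [30.0, 10.0, 20.0, 30.0]
restricted = categorical_matvec(codes, other, cols=[0, 1])    # [0.0, 10.0, 20.0, 0.0]

# The same product computed from an explicit one-hot matrix, for comparison.
one_hot = [[1.0 if code == j else 0.0 for j in range(3)] for code in codes]
dense = [sum(row[j] * other[j] for j in range(3)) for row in one_hot]
```

Because every row has exactly one nonzero entry, the work is proportional to the number of rows, independent of the number of categories.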

multiply(other)

Element-wise multiplication.

This assumes that other is a vector of size self.shape[0].

Return type: SparseMatrix

recover_orig()

Return 1d numpy array with same data as what was initially fed to __init__.

Test: matrix/test_categorical_matrix::test_recover_orig

Return type: ndarray

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

sandwich(self, d)[i, j] = (self.T @ diag(d) @ self)[i, j]
    = sum_k (self[k, i] (diag(d) @ self)[k, j])
    = sum_k self[k, i] sum_m diag(d)[k, m] self[m, j]
    = sum_k self[k, i] d[k] self[k, j]
    = 0 if i != j
sandwich(self, d)[i, i] = sum_k self[k, i] ** 2 * d[k]

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

Parameters:

d (ndarray | list)
rows (ndarray)
cols (ndarray)

Return type:

dia_matrix
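
Since the entries of a one-hot matrix are 0 or 1, the diagonal reduces to a per-category sum of d. A plain-Python sketch (with codes as a hypothetical stand-in for the per-row category indices, and without drop_first or row and column restrictions):

```python
def categorical_sandwich_diagonal(codes, d, n_categories):
    """Diagonal of X.T @ diag(d) @ X for a one-hot matrix X given by codes."""
    diag = [0.0] * n_categories
    for code, weight in zip(codes, d):
        diag[code] += weight  # X[k, i] ** 2 == X[k, i] for 0/1 entries
    return diag


codes = [2, 0, 1, 2]
d = [0.5, 1.0, 2.0, 4.0]
diag = categorical_sandwich_diagonal(codes, d, 3)  # [1.0, 2.0, 4.5]

# Explicit computation from the one-hot matrix, for comparison.
one_hot = [[1.0 if code == j else 0.0 for j in range(3)] for code in codes]
explicit = [
    [sum(one_hot[k][i] * d[k] * one_hot[k][j] for k in range(4)) for j in range(3)]
    for i in range(3)
]
```

All off-diagonal entries of explicit are zero, which is why a diagonal matrix suffices as the return value.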

set_names(names, type='column')

Set column names.

Parameters:

names (list[Optional[str]]) – Names to set.
type (str {'column'|'term'}) – Whether to set column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).

to_sparse_matrix(): Return a tabmat.SparseMatrix representation.

toarray()

Return array representation of matrix.

Return type: ndarray

tocsr()

Return scipy csr representation of matrix.

Return type: csr_matrix

transpose_matvec(vec, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ vec[rows].

for i in cols: out[i] += sum_{j in rows} self[j, i] vec[j]
self[j, i] = 1(indices[j] == i)


for j in rows:
    for i in cols:
        out[i] += (indices[j] == i) * vec[j]

If cols == range(self.shape[1]), then for every row j there will be exactly one relevant column, so you can do

for j in rows:
    out[indices[j]] += vec[j]

The rows and cols parameters allow restricting to a subset of the matrix without making a copy.

If out is None, then a new array will be returned.

Test: tests/test_matrices::test_transpose_matvec

Parameters:

vec (ndarray | list)
rows (ndarray | None)
cols (ndarray | None)
out (ndarray | None)

Return type:

ndarray

unpack(): Return the underlying pandas.Categorical.

class tabmat.SplitMatrix(matrices, indices=None)

Bases: MatrixBase

A class for matrices with sparse, dense and categorical parts.

For real-world tabular data, it’s common for the same dataset to contain a mix of columns that are naturally dense, naturally sparse and naturally categorical. Representing each of these sets of columns in the format that is most natural allows for a significant speedup in matrix multiplications compared to representations that are entirely dense or entirely sparse.

Initialize a SplitMatrix directly with a list of matrices and a list of column indices for each matrix. Most of the time, it will be best to use tabmat.from_pandas() or tabmat.from_csc() to initialize a SplitMatrix.

Parameters:

matrices (Sequence[MatrixBase]) – The sub-matrices composing the columns of this SplitMatrix.
indices (list[ndarray] | None) – If indices is not None, then for each matrix passed in matrices, indices must contain the set of columns which that matrix covers.
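
How indices ties each sub-matrix to its columns in the full matrix can be illustrated with a plain-Python sketch of a block-wise matrix-vector product. The function split_matvec and the list-based blocks are assumptions for illustration; real sub-matrices would be MatrixBase instances.

```python
def split_matvec(blocks, indices, v):
    """blocks[b] holds the columns listed in indices[b] of the full matrix."""
    n_rows = len(blocks[0])
    out = [0.0] * n_rows
    for block, idx in zip(blocks, indices):
        for i in range(n_rows):
            # Local column k of this block is global column idx[k].
            out[i] += sum(block[i][k] * v[j] for k, j in enumerate(idx))
    return out


full = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
indices = [[0, 2], [1]]  # block 0 covers columns 0 and 2, block 1 covers column 1
blocks = [[[row[j] for j in idx] for row in full] for idx in indices]
v = [1.0, 10.0, 100.0]

split_result = split_matvec(blocks, indices, v)
full_result = [sum(a * b for a, b in zip(row, v)) for row in full]
print(split_result)  # [201.0, 43.0]
```

Each block only needs the entries of v at its own column indices, so every block can use the storage format and routine best suited to it.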

astype(dtype, order='K', casting='unsafe', copy=True): Return SplitMatrix cast to new type.

get_names(type='column', missing_prefix=None, indices=None)

Get column names.

For columns that do not have a name, a default name is created using the following pattern: "{missing_prefix}{start_index + i}" where i is the index of the column.

Parameters:

type (str {'column'|'term'}) – Whether to get column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).
missing_prefix (Optional[str], default None) – Prefix to use for columns that do not have a name. If None, then no default name is created.
indices (list[int] | None) – The indices used for columns that do not have a name. If None, then the indices are list(range(self.shape[1])).

Returns:

Column names.

Return type:

list[Optional[str]]

getcol(i)

Return matrix column at specified index.

Parameters: i (int)
Return type: ndarray | csr_matrix

matvec(v, cols=None, out=None)

Perform self[:, cols] @ v[cols].

Parameters:

v (ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray

multiply(other)

Element-wise multiplication.

This assumes that other is a vector of size self.shape[0].

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters:

d (ndarray | list)
rows (ndarray)
cols (ndarray)

Return type:

ndarray

set_names(names, type='column')

Set column names.

Parameters:

names (list[Optional[str]]) – Names to set.
type (str {'column'|'term'}) – Whether to set column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).

toarray()

Return array representation of matrix.

Return type: ndarray

transpose_matvec(v, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ v[rows].

self.transpose_matvec(v, rows, cols) = self[rows, cols].T @ v[rows]
self.transpose_matvec(v, rows, cols)[i]
    = sum_{j in rows} self[j, cols[i]] v[j]
    = sum_{j in rows} sum_{mat in self.matrices} 1(cols[i] in mat)
                                                self[j, cols[i]] v[j]

Parameters:

v (ndarray | list)
rows (ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray

class tabmat.StandardizedMatrix(mat, shift, mult=None)

Bases: object

StandardizedMatrix allows for storing a matrix standardized to have columns that have mean zero and standard deviation one without modifying underlying sparse matrices.

To be precise, for a StandardizedMatrix:

self[i, j] = (self.mult[j] * self.mat[i, j]) + self.shift[j]

This class is returned from MatrixBase.standardize.

Parameters:

mat (MatrixBase)
shift (ndarray | list)
mult (ndarray | list)

property A: ndarray: Return array representation of self.

astype(dtype, order='K', casting='unsafe', copy=True): Return StandardizedMatrix cast to new type.

property column_names: Column names of the matrix.

get_names(type='column', missing_prefix=None, indices=None)

Get column names.

For columns that do not have a name, a default name is created using the following pattern: "{missing_prefix}{start_index + i}" where i is the index of the column.

Parameters:

type (str {'column'|'term'}) – Whether to get column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).
missing_prefix (Optional[str], default None) – Prefix to use for columns that do not have a name. If None, then no default name is created.
indices (list[int] | None) – The indices used for columns that do not have a name. If None, then the indices are list(range(self.shape[1])).

Returns:

Column names.

Return type:

list[Optional[str]]

getcol(i)

Return matrix column at specified index.

Returns a StandardizedMatrix.

>>> from scipy import sparse as sps
>>> from tabmat import SparseMatrix, StandardizedMatrix
>>> x = StandardizedMatrix(SparseMatrix(sps.eye(3).tocsc()), shift=[0, 1, -2])
>>> col_1 = x.getcol(1)
>>> isinstance(col_1, StandardizedMatrix)
True
>>> col_1.A
array([[1.],
       [2.],
       [1.]])

Parameters: i (int)

matvec(other_mat, cols=None, out=None)

Perform self[:, cols] @ other_mat[cols].

This function returns a dense output, so it is best geared for the matrix-vector case.

Parameters:

other_mat (ndarray | list)
cols (ndarray)
out (ndarray)

Return type:

ndarray

multiply(other)

Element-wise multiplication.

Note that the output of this function is always a DenseMatrix and might require a lot more memory. This assumes that other is a vector of size self.shape[0].

Return type: DenseMatrix

sandwich(d, rows=None, cols=None)

Perform a sandwich product: X.T @ diag(d) @ X.

Parameters:

d (ndarray)
rows (ndarray)
cols (ndarray)

Return type:

ndarray

set_names(names, type='column')

Set column names.

Parameters:

names (list[Optional[str]]) – Names to set.
type (str {'column'|'term'}) – Whether to set column names or term names. The main difference is that a categorical submatrix counts as one term, but can count as multiple columns. Furthermore, matrices created from formulas distinguish between columns and terms (c.f. formulaic docs).

property term_names

Term names of the matrix.

For differences between column names and term names, see get_names.

toarray()

Return array representation of matrix.

Return type: ndarray

transpose_matvec(other, rows=None, cols=None, out=None)

Perform: self[rows, cols].T @ other[rows].

Let self.shape = (N, K) and other.shape = (N, M). Let shift_mat = outer(ones(N), shift), so that X = X.mat + shift_mat (the column scaling mult is omitted here for brevity).

(X.T @ other)[k, i] = (X.mat.T @ other)[k, i] + (shift_mat.T @ other)[k, i]
(shift_mat.T @ other)[k, i] = (outer(shift, ones(N)) @ other)[k, i]
    = sum_j outer(shift, ones(N))[k, j] other[j, i]
    = sum_j shift[k] other[j, i]
    = shift[k] other.sum(0)[i]
    = outer(shift, other.sum(0))[k, i]

With row and col restrictions:

self.transpose_matvec(other, rows, cols)[i, j]
    = self.mat.transpose_matvec(other, rows, cols)[i, j]
      + (outer(self.shift, ones(N))[cols, rows] @ other[rows])[i, j]
    = self.mat.transpose_matvec(other, rows, cols)[i, j]
      + shift[cols[i]] other[rows].sum(0)[j]

Parameters:

other (ndarray | list)
rows (ndarray)
cols (ndarray)
out (ndarray)

Return type:

ndarray
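
For the vector case, the role of the shift term can be verified with a plain-Python sketch. It assumes no column scaling (mult omitted) and no row or column restrictions; standardized_transpose_matvec is a hypothetical name, not the library routine.

```python
def standardized_transpose_matvec(mat, shift, vec):
    """X.T @ vec for X[i][j] = mat[i][j] + shift[j], without forming X."""
    n_cols = len(mat[0])
    # Product with the unshifted base matrix.
    base = [sum(row[k] * v for row, v in zip(mat, vec)) for k in range(n_cols)]
    # The shift contributes shift[k] * sum(vec) to entry k.
    vec_sum = sum(vec)
    return [b + s * vec_sum for b, s in zip(base, shift)]


mat = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # mostly zeros, stays untouched
shift = [-0.5, 1.0]
vec = [1.0, 2.0, 3.0]
lazy = standardized_transpose_matvec(mat, shift, vec)  # [-2.0, 10.0]

# Explicitly shifted matrix, for comparison only (this destroys sparsity).
explicit_x = [[m + s for m, s in zip(row, shift)] for row in mat]
explicit = [sum(explicit_x[i][k] * vec[i] for i in range(3)) for k in range(2)]
```

The lazy version needs only one pass over the base matrix plus one sum over vec, which is what keeps operations on a standardized sparse matrix cheap.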

unstandardize()

Get unstandardized (base) matrix.

Return type: MatrixBase