Sparse matrix recommender package

Introduction

This post announces and briefly describes the Python package SparseMatrixRecommender, which provides functions for computing recommendations based on a (user) profile or history using Sparse Linear Algebra (SLA). The package mirrors the Mathematica implementation [AAp1]. (There is also a corresponding implementation in R; see [AAp2].)

The package is based on a certain “standard” Information retrieval paradigm — it utilizes Latent Semantic Indexing (LSI) functions like IDF, TF-IDF, etc. Hence, the package also has document-term matrix creation functions and LSI application functions. I included them in the package since I wanted to minimize the external package dependencies.
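To make the LSI weighting idea concrete, here is a minimal, standard-library-only sketch of the "IDF, None, Cosine" combination on a tiny item-by-tag matrix. The function names and the dictionary-of-dictionaries layout are illustrative assumptions, not the package's API. The IDF formula used, log(number of rows / number of rows containing the tag), is one common variant and may differ from the one the package implements.

```python
import math

def idf_weights(matrix):
    """Global weights: log(n_rows / n_rows_containing_tag) for each column (tag)."""
    n_rows = len(matrix)
    doc_freq = {}
    for row in matrix.values():
        for tag, value in row.items():
            if value != 0:
                doc_freq[tag] = doc_freq.get(tag, 0) + 1
    return {tag: math.log(n_rows / df) for tag, df in doc_freq.items()}

def apply_idf_and_cosine(matrix):
    """Scale each entry by its column's IDF weight, then give each row unit Euclidean length."""
    weights = idf_weights(matrix)
    result = {}
    for item, row in matrix.items():
        weighted = {tag: value * weights.get(tag, 0.0) for tag, value in row.items()}
        norm = math.sqrt(sum(v * v for v in weighted.values()))
        result[item] = {tag: (v / norm if norm > 0 else 0.0) for tag, v in weighted.items()}
    return result

# Hypothetical item-by-tag contingency matrix in row-dictionary form
toy = {
    "id1": {"passengerSex:male": 1, "passengerClass:1st": 1},
    "id2": {"passengerSex:male": 1, "passengerClass:3rd": 1},
    "id3": {"passengerSex:female": 1, "passengerClass:1st": 1},
    "id4": {"passengerSex:male": 1, "passengerClass:2nd": 1},
}
weighted = apply_idf_and_cosine(toy)
```

In this toy matrix the rare tag "passengerClass:3rd" ends up with a larger weight than the common tag "passengerSex:male", which is the intended effect of IDF.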

The package includes two datasets, dfTitanic and dfMushroom, to make it easier to write introductory examples and unit tests.

For a more theoretical description see the article “Mapping Sparse Matrix Recommender to Streams Blending Recommender”, [AA1].

For detailed examples see the files “SMR-experiments-large-data.py” and “SMR-creation-from-long-form.py”.

The list of features and their implementation status is given in the org-mode file “SparseMatrixRecommender-work-plan.org”.

Remark: “SMR” stands for “Sparse Matrix Recommender”. Most of the operations of this Python package mirror the operations of the software monads “SMRMon-WL”, “SMRMon-R”, [AAp1, AAp2].


Workflows

Here is a diagram that encompasses the workflows this package supports (or will support):

[Diagram: SMRworkflows]

Here is a narration of a certain workflow scenario:

  1. Get a dataset.
  2. Create contingency matrices for a given identifier column and a set of “tag type” columns.
  3. Examine recommender matrix statistics.
  4. If the assumptions about the data hold, apply LSI functions.
    • For example, the “usual trio” IDF, Frequency, Cosine.
  5. Do (verify) example profile recommendations.
  6. If satisfactory results are obtained, use the recommender as a nearest neighbors classifier.
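
The scenario above can be sketched with plain Python dictionaries. This is only a conceptual illustration of steps 2, 5, and 6: the function names echo the package's method names, but the signatures, the toy records, and the voting rule in the classifier are assumptions made for this sketch, not the package's implementation.

```python
from collections import defaultdict

def create_from_wide_form(records, item_column_name, columns, separator=":"):
    """Step 2: build an item-by-tag 0/1 matrix, stored as row dictionaries,
    with column names of the form 'tagType:value'."""
    return {
        rec[item_column_name]: {f"{col}{separator}{rec[col]}": 1 for col in columns}
        for rec in records
    }

def recommend_by_profile(matrix, profile, nrecs=12):
    """Step 5: score each item by the dot product of its row with a 0/1 profile."""
    scores = {
        item: sum(row.get(tag, 0) for tag in profile)
        for item, row in matrix.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, score) for item, score in ranked if score > 0][:nrecs]

def classify_by_profile(matrix, labels, profile, nrecs=3):
    """Step 6: score-weighted vote over the labels of the top recommendations."""
    votes = defaultdict(float)
    for item, score in recommend_by_profile(matrix, profile, nrecs):
        votes[labels[item]] += score
    return max(sorted(votes), key=votes.get) if votes else None

# Hypothetical toy records (not the actual dfTitanic data)
records = [
    {"id": "id1", "passengerSex": "male", "passengerClass": "1st", "passengerSurvival": "survived"},
    {"id": "id2", "passengerSex": "male", "passengerClass": "3rd", "passengerSurvival": "died"},
    {"id": "id3", "passengerSex": "female", "passengerClass": "1st", "passengerSurvival": "survived"},
    {"id": "id4", "passengerSex": "male", "passengerClass": "1st", "passengerSurvival": "died"},
    {"id": "id5", "passengerSex": "female", "passengerClass": "3rd", "passengerSurvival": "survived"},
]
matrix = create_from_wide_form(records, "id", ["passengerSex", "passengerClass"])
labels = {rec["id"]: rec["passengerSurvival"] for rec in records}
profile = ["passengerSex:male", "passengerClass:1st"]
recs = recommend_by_profile(matrix, profile, nrecs=3)
label = classify_by_profile(matrix, labels, profile, nrecs=3)
```

The label column is kept out of the matrix here so that the classifier votes only on the remaining tag types.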

Monadic design

Here is a diagram of typical pipeline building using a SparseMatrixRecommender object:

[Diagram: SMRMonpipelinePython]

Remark: The monadic design allows “pipelining” of the SMR operations — see the usage example section.
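
The pipelining idea itself can be shown with a small, self-contained sketch: every step stores its result in the object and returns the object, so calls can be chained. The class TinyPipeline and all of its methods are hypothetical and only illustrate the pattern; they are not part of SparseMatrixRecommender.

```python
class TinyPipeline:
    """Hypothetical pipeline object: each step returns self so calls can be chained."""

    def __init__(self):
        self._data = None    # state carried along the pipeline
        self._value = None   # result of the most recent step

    def set_data(self, data):
        self._data = list(data)
        self._value = self._data
        return self

    def keep_if(self, predicate):
        self._value = [x for x in self._value if predicate(x)]
        return self

    def take(self, n):
        self._value = self._value[:n]
        return self

    def echo_value(self):
        # Side effect only: print the current value and pass the object along
        print(self._value)
        return self

    def take_value(self):
        # Leave the pipeline: return the current value instead of the object
        return self._value

result = (TinyPipeline()
          .set_data(range(10))
          .keep_if(lambda x: x % 2 == 0)
          .take(3)
          .echo_value()
          .take_value())
```

Separating the carried state from the "current value" is what lets an echo step sit in the middle of a chain without changing the result.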


Installation

To install from GitHub use the shell command:

python -m pip install git+https://github.com/antononcube/Python-packages.git#egg=SparseMatrixRecommender\&subdirectory=SparseMatrixRecommender

To install from PyPI:

python -m pip install SparseMatrixRecommender

Related Python packages

This package is based on the Python package SSparseMatrix, [AAp5].

The package LatentSemanticAnalyzer, [AAp6], uses the cross tabulation and LSI functions of this package.


Usage example

Here is an example of an SMR pipeline for creation of a recommender over Titanic data and recommendations for the profile “passengerSex:male” and “passengerClass:1st”:

from SparseMatrixRecommender.SparseMatrixRecommender import *
from SparseMatrixRecommender.DataLoaders import *

dfTitanic = load_titanic_data_frame()

smrObj = (SparseMatrixRecommender()
          .create_from_wide_form(data = dfTitanic, 
                                 item_column_name="id", 
                                 columns=None, 
                                 add_tag_types_to_column_names=True, 
                                 tag_value_separator=":")
          .apply_term_weight_functions(global_weight_func = "IDF", 
                                       local_weight_func = "None", 
                                       normalizer_func = "Cosine")
          .recommend_by_profile(profile=["passengerSex:male", "passengerClass:1st"], 
                                nrecs=12)
          .join_across(data=dfTitanic, on="id")
          .echo_value())

Remark: More examples can be found in the directory “./examples”.


Related Mathematica packages

The software monad Mathematica package “MonadicSparseMatrixRecommender.m”, [AAp1], provides recommendation pipelines similar to the pipelines created with this package.

Here is a Mathematica monadic pipeline that corresponds to the Python pipeline above:

smrObj =
  SMRMonUnit[]⟹
   SMRMonCreate[dfTitanic, "id", 
                "AddTagTypesToColumnNames" -> True, 
                "TagValueSeparator" -> ":"]⟹
   SMRMonApplyTermWeightFunctions["IDF", "None", "Cosine"]⟹
   SMRMonRecommendByProfile[{"passengerSex:male", "passengerClass:1st"}, 12]⟹
   SMRMonJoinAcross[dfTitanic, "id"]⟹
   SMRMonEchoValue[];   

(Compare the pipeline diagram above with the corresponding diagram using Mathematica notation.)


Related R packages

The package SMRMon-R, [AAp2], implements a software monad for SMR workflows. Most SMRMon-R functions delegate to SparseMatrixRecommender.

The package SparseMatrixRecommenderInterfaces, [AAp3], provides functions for interactive Shiny interfaces for the recommenders made with SparseMatrixRecommender and/or SMRMon-R.

The package LSAMon-R, [AAp4], can be used to make matrices for SparseMatrixRecommender and/or SMRMon-R.

Here is the SMRMon-R pipeline that corresponds to the Python pipeline above:

smrObj <-
  SMRMonCreate( data = dfTitanic, 
                itemColumnName = "id", 
                addTagTypesToColumnNamesQ = TRUE, 
                sep = ":") %>%
  SMRMonApplyTermWeightFunctions(globalWeightFunction = "IDF", 
                                 localWeightFunction = "None", 
                                 normalizerFunction = "Cosine") %>%
  SMRMonRecommendByProfile( profile = c("passengerSex:male", "passengerClass:1st"), 
                            nrecs = 12) %>%
  SMRMonJoinAcross( data = dfTitanic, by = "id") %>%
  SMRMonEchoValue

Recommender comparison project

The project repository “Scalable Recommender Framework”, [AAr1], has documents, diagrams, tests, and benchmarks of a recommender system implemented in multiple programming languages.

This Python recommender package is a decisive winner in the comparison — see the first 10 min of the video recording [AAv1] or the benchmarks at [AAr1].


Code generation with natural language commands

Using grammar-based interpreters

The project “Raku for Prediction”, [AAr2, AAv2], with the Raku package DSL::English::RecommenderWorkflows, [AAp6], has a Domain Specific Language (DSL) grammar and interpreters that allow the generation of SMR code for the corresponding Mathematica, Python, R, and Raku packages.

Here is a Command Line Interface (CLI) invocation example that generates code for this package:

> ToRecommenderWorkflowCode Python 'create with dfTitanic; apply the LSI functions IDF, None, Cosine;recommend by profile 1st and male' 

obj = SparseMatrixRecommender().create_from_wide_form(data = dfTitanic).apply_term_weight_functions(global_weight_func = "IDF", local_weight_func = "None", normalizer_func = "Cosine").recommend_by_profile( profile = ["1st", "male"])

NLP Template Engine

Here is an example using the NLP Template Engine, [AAr2, AAv3]:

Concretize["create with dfTitanic; apply the LSI functions IDF, None, Cosine;recommend by profile 1st and male", 
 "TargetLanguage" -> "Python"]

(*
"smrObj = (SparseMatrixRecommender()
 .create_from_wide_form(data = None, item_column_name=\"id\", columns=None, add_tag_types_to_column_names=True, tag_value_separator=\":\")
 .apply_term_weight_functions(\"IDF\", \"None\", \"Cosine\")
 .recommend_by_profile(profile=[\"1st\", \"male\"], nrecs=profile)
 .join_across(data=None, on=\"id\")
 .echo_value())"
*)

References

Articles

[AA1] Anton Antonov, “Mapping Sparse Matrix Recommender to Streams Blending Recommender” (2017), MathematicaForPrediction at GitHub.

Mathematica/WL and R packages

[AAp1] Anton Antonov, Monadic Sparse Matrix Recommender Mathematica package, (2018), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Sparse Matrix Recommender Monad in R (2019), R-packages at GitHub/antononcube.

[AAp3] Anton Antonov, Sparse Matrix Recommender framework interface functions (2019), R-packages at GitHub/antononcube.

[AAp4] Anton Antonov, Latent Semantic Analysis Monad in R (2019), R-packages at GitHub/antononcube.

Python packages

[AAp5] Anton Antonov, SSparseMatrix package in Python (2021), Python-packages at GitHub/antononcube.

[AAp6] Anton Antonov, LatentSemanticAnalyzer package in Python (2021), Python-packages at GitHub/antononcube.

Raku packages

[AAp6] Anton Antonov, DSL::English::RecommenderWorkflows Raku package, (2018-2022), GitHub/antononcube. (At raku.land).

Repositories

[AAr1] Anton Antonov, Scalable Recommender Framework project, (2022) GitHub/antononcube.

[AAr2] Anton Antonov, “Raku for Prediction” book project, (2021-2022), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “TRC 2022 Implementation of ML algorithms in Raku”, (2022), Anton A. Antonov’s channel at YouTube.

[AAv2] Anton Antonov, “Raku for Prediction”, (2021), The Raku Conference (TRC) at YouTube.

[AAv3] Anton Antonov, “NLP Template Engine, Part 1”, (2021), Anton A. Antonov’s channel at YouTube.

Sparse matrices with named rows and columns

Introduction

This blog post introduces and describes the Python package “SSparseMatrix” that provides the class SSparseMatrix, the objects of which are sparse matrices with named rows and columns.

The package attempts to cover as many as possible of the sparse matrix functionalities provided by R’s library Matrix. (R is an implementation of S. S introduced named data structures for statistical computations, [RB1], hence the name SSparseMatrix.)

The package builds on top of the scipy sparse matrices. (The added functionalities though are general — other sparse matrix implementations could be used.)

Here is a list of functionalities provided for SSparseMatrix:

  • Sub-matrix extraction by row and column names:
    • Single element access
    • Subsets of row names and column names
  • Slices (with integers)
  • Row and column names propagation for dot products with:
    • Lists
    • Dense vectors (numpy.array)
    • scipy sparse matrices
    • SSparseMatrix objects
  • Row and column sums
    • Vector form
    • Dictionary form
  • Transposing
  • Representation:
    • Tabular, matrix form (“pretty printing”)
    • String and repr forms
  • Row and column binding of SSparseMatrix objects
  • “Export” functions
    • Triplets
    • Row-dictionaries
    • Column-dictionaries
    • Wolfram Language full form representation

The full list of features and development status can be found in the org-mode file SSparseMatrix-work-plan.org.

This package more or less follows the design of the Mathematica package SSparseMatrix.m.

The usage examples below can also be run through the file “examples.py”.

Usage in other packages

The class SSparseMatrix is foundational in the packages SparseMatrixRecommender and LatentSemanticAnalyzer. (The implementation of those packages was one of the primary motivations to develop SSparseMatrix.)

The package RandomSparseMatrix can be used to generate random sparse matrices (SSparseMatrix objects).


Installation

Install from GitHub

pip install -e git+https://github.com/antononcube/Python-packages.git#egg=SSparseMatrix-antononcube\&subdirectory=SSparseMatrix

Install from PyPI

pip install SSparseMatrix


Setup

Import the package:

from SSparseMatrix import *

The import command above is equivalent to the import commands:

from SSparseMatrix.SSparseMatrix import SSparseMatrix
from SSparseMatrix.SSparseMatrix import make_s_sparse_matrix
from SSparseMatrix.SSparseMatrix import is_s_sparse_matrix
from SSparseMatrix.SSparseMatrix import column_bind

Creation

Create a sparse matrix with named rows and columns (an SSparseMatrix object):

mat = [[1, 0, 0, 3], [4, 0, 0, 5], [0, 3, 0, 5], [0, 0, 1, 0], [0, 0, 0, 5]]
smat = SSparseMatrix(mat)
smat.set_row_names(["A", "B", "C", "D", "E"])
smat.set_column_names(["a", "b", "c", "d"])
<5x4 SSparseMatrix (sparse matrix with named rows and columns) of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format, and fill-in 0.4>

Print the created sparse matrix:

smat.print_matrix()
===================================
  |       a       b       c       d
-----------------------------------
A |       1       .       .       3
B |       4       .       .       5
C |       .       3       .       5
D |       .       .       1       .
E |       .       .       .       5
===================================

Another way to create an SSparseMatrix object is with the function make_s_sparse_matrix:

ssmat=make_s_sparse_matrix(mat)
ssmat
<5x4 SSparseMatrix (sparse matrix with named rows and columns) of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format, and fill-in 0.4>

Structure

The SSparseMatrix objects have a simple structure. Here are the attributes:

  • _sparseMatrix
  • _rowNames
  • _colNames
  • _dimNames

Here are the methods to “query” SSparseMatrix objects:

  • sparse_matrix()
  • row_names() and row_names_dict()
  • column_names() and column_names_dict()
  • shape()
  • dimension_names()

SSparseMatrix overrides the methods of scipy.sparse.csr_matrix that might require the handling of row names and column names.

Most of the rest of the scipy.sparse.csr_matrix methods are delegated to the _sparseMatrix attribute.

For example, for a given SSparseMatrix object smat the dense version of smat's sparse matrix attribute can be obtained by accessing that attribute first and then using the method todense:

print(smat.sparse_matrix().todense())
[[1 0 0 3]
 [4 0 0 5]
 [0 3 0 5]
 [0 0 1 0]
 [0 0 0 5]]

Alternatively, we can use the “delegated” form and directly invoke todense on smat:

print(smat.todense())
[[1 0 0 3]
 [4 0 0 5]
 [0 3 0 5]
 [0 0 1 0]
 [0 0 0 5]]

Here is another example showing a direct application of the element-wise operation sin through the scipy.sparse.csr_matrix method sin:

smat.sin().print_matrix(n_digits=20)
===================================================================================
  |                   a                   b                   c                   d
-----------------------------------------------------------------------------------
A |  0.8414709848078965                   .                   .  0.1411200080598672
B | -0.7568024953079282                   .                   . -0.9589242746631385
C |                   .  0.1411200080598672                   . -0.9589242746631385
D |                   .                   .  0.8414709848078965                   .
E |                   .                   .                   . -0.9589242746631385
===================================================================================
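
The "override some methods, delegate the rest" structure described in this section can be sketched in a few lines. The class NamedWrapper is hypothetical, and using __getattr__ is just one common way to get this behaviour in Python; whether SSparseMatrix is implemented this way is an assumption of the sketch.

```python
class NamedWrapper:
    """Hypothetical wrapper: adds row names to an inner object and forwards everything else."""

    def __init__(self, inner, row_names):
        self._inner = inner
        self._row_names = list(row_names)

    def row_names(self):
        # Defined on the wrapper, so it is never forwarded
        return self._row_names

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so methods defined
        # on the wrapper take precedence over those of the inner object.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._inner, name)

# A plain list stands in for the wrapped sparse matrix
wrapped = NamedWrapper([3, 1, 2], ["A", "B", "C"])
names = wrapped.row_names()      # handled by the wrapper
position = wrapped.index(2)      # forwarded to the inner list
```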

Representation

Here the function print uses the string representation of the SSparseMatrix object:

print(smat)
  ('A', 'a')	1
  ('A', 'd')	3
  ('B', 'a')	4
  ('B', 'd')	5
  ('C', 'b')	3
  ('C', 'd')	5
  ('D', 'c')	1
  ('E', 'd')	5

Here we print the representation obtained with repr:

print(repr(smat))
<5x4 SSparseMatrix (sparse matrix with named rows and columns) of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format, and fill-in 0.4>

Here is the matrix form (“pretty printing”):

smat.print_matrix()
===================================
  |       a       b       c       d
-----------------------------------
A |       1       .       .       3
B |       4       .       .       5
C |       .       3       .       5
D |       .       .       1       .
E |       .       .       .       5
===================================

The method triplets can be used to obtain a list of (row, column, value) triplets:

smat.triplets()
[('A', 'a', 1),
 ('A', 'd', 3),
 ('B', 'a', 4),
 ('B', 'd', 5),
 ('C', 'b', 3),
 ('C', 'd', 5),
 ('D', 'c', 1),
 ('E', 'd', 5)]

The method row_dictionaries gives a dictionary with keys that are row-names and values that are column-name-to-matrix-value dictionaries:

smat.row_dictionaries()
{'A': {'a': 1, 'd': 3},
 'B': {'a': 4, 'd': 5},
 'C': {'b': 3, 'd': 5},
 'D': {'c': 1},
 'E': {'d': 5}}

Similarly, the method column_dictionaries gives a dictionary with keys that are column-names and values that are row-name-to-matrix-value dictionaries:

smat.column_dictionaries()
{'a': {'A': 1, 'B': 4},
 'b': {'C': 3},
 'c': {'D': 1},
 'd': {'A': 3, 'B': 5, 'C': 5, 'E': 5}}
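
The three export forms are closely related: both dictionary forms can be derived from the triplets. Here is a small sketch of that relationship using plain Python functions. The functions are written for this illustration (they are not the package's methods), and unlike a matrix object they cannot report rows or columns that have no stored elements.

```python
def row_dictionaries(triplets):
    """Group (row, column, value) triplets into {row: {column: value}}."""
    result = {}
    for row, col, value in triplets:
        result.setdefault(row, {})[col] = value
    return result

def column_dictionaries(triplets):
    """Group (row, column, value) triplets into {column: {row: value}}."""
    result = {}
    for row, col, value in triplets:
        result.setdefault(col, {})[row] = value
    return result

# The triplets of smat shown above
triplets = [("A", "a", 1), ("A", "d", 3), ("B", "a", 4), ("B", "d", 5),
            ("C", "b", 3), ("C", "d", 5), ("D", "c", 1), ("E", "d", 5)]
by_row = row_dictionaries(triplets)
by_column = column_dictionaries(triplets)
```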

Multiplication

Multiply with the transpose and print:

ssmat2 = ssmat.dot(smat.transpose())
ssmat2.print_matrix()
===========================================
  |       A       B       C       D       E
-------------------------------------------
0 |      10      19      15       .      15
1 |      19      41      25       .      25
2 |      15      25      34       .      25
3 |       .       .       .       1       .
4 |      15      25      25       .      25
===========================================

Multiply with a list-vector:

smat3 = smat.dot([1, 2, 1, 0])
smat3.print_matrix()
===========
  |       0
-----------
A |       1
B |       4
C |       6
D |       1
E |       .
===========

Remark: The type of the .dot argument can be:

  • SSparseMatrix
  • list
  • numpy.array
  • scipy.sparse.csr_matrix

Slices

Single element access:

print(smat["A", "d"])
print(smat[0, 3])
3
3

Get sub-matrix of rows using row names:

smat[["A", "D", "B"], :].print_matrix()
===================================
  |       a       b       c       d
-----------------------------------
A |       1       .       .       3
D |       .       .       1       .
B |       4       .       .       5
===================================

Get sub-matrix using row indices:

smat[[0, 3, 1], :].print_matrix()
===================================
  |       a       b       c       d
-----------------------------------
A |       1       .       .       3
D |       .       .       1       .
B |       4       .       .       5
===================================

Get sub-matrix with column names:

smat[:, ['a', 'c']].print_matrix()
===================
  |       a       c
-------------------
A |       1       .
B |       4       .
C |       .       .
D |       .       1
E |       .       .
===================

Get sub-matrix with column indices:

smat[:, [0, 2]].print_matrix()
===================
  |       a       c
-------------------
A |       1       .
B |       4       .
C |       .       .
D |       .       1
E |       .       .
===================

Remark: The current implementation of scipy (1.7.1) does not allow retrieval of sub-matrices by specifying both row and column ranges or slices.

Remark: “Standard” slices with integers also work.
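
Selecting by names or by integers gives the same result because names can be translated to positions first. Here is a minimal sketch of that translation for row selection; the helper functions are hypothetical and operate on dense lists of rows for brevity.

```python
def resolve(spec, names):
    """Turn a list of names and/or integers into a list of integer positions."""
    position = {name: i for i, name in enumerate(names)}
    return [position[s] if isinstance(s, str) else s for s in spec]

def sub_rows(values, row_names, spec):
    """Return the selected rows, in the requested order, with their names."""
    idx = resolve(spec, row_names)
    return [values[i] for i in idx], [row_names[i] for i in idx]

mat = [[1, 0, 0, 3], [4, 0, 0, 5], [0, 3, 0, 5], [0, 0, 1, 0], [0, 0, 0, 5]]
row_names = ["A", "B", "C", "D", "E"]

by_name = sub_rows(mat, row_names, ["A", "D", "B"])
by_index = sub_rows(mat, row_names, [0, 3, 1])
```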


Row and column sums

Row sums and dictionary of row sums:

print(smat.row_sums())
print(smat.row_sums_dict())
[4, 9, 8, 1, 5]
{'A': 4, 'B': 9, 'C': 8, 'D': 1, 'E': 5}

Column sums and dictionary of column sums:

print(smat.column_sums())
print(smat.column_sums_dict())
[5, 3, 1, 18]
{'a': 5, 'b': 3, 'c': 1, 'd': 18}
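
As a cross-check of the numbers above, the dictionary forms of the sums can be recomputed from the triplets of smat. The function sums_dict is written for this illustration and is not part of the package.

```python
def sums_dict(triplets, by="row"):
    """Sum the values of (row, column, value) triplets per row name or per column name."""
    index = 0 if by == "row" else 1
    totals = {}
    for triplet in triplets:
        key = triplet[index]
        totals[key] = totals.get(key, 0) + triplet[2]
    return totals

triplets = [("A", "a", 1), ("A", "d", 3), ("B", "a", 4), ("B", "d", 5),
            ("C", "b", 3), ("C", "d", 5), ("D", "c", 1), ("E", "d", 5)]
row_totals = sums_dict(triplets, by="row")
column_totals = sums_dict(triplets, by="column")
```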

Column and row binding

Column binding

Here we create another SSparseMatrix object:

mat2=smat.sparse_matrix().transpose()
smat2 = SSparseMatrix(mat2, row_names=list("ABCD"), column_names="c")
smat2.print_matrix()
===========================================
  |      c0      c1      c2      c3      c4
-------------------------------------------
A |       1       4       .       .       .
B |       .       .       3       .       .
C |       .       .       .       1       .
D |       3       5       5       .       5
===========================================

Here we column-bind two SSparseMatrix objects:

smat[list("ABCD"), :].column_bind(smat2).print_matrix()
===========================================================================
  |       a       b       c       d      c0      c1      c2      c3      c4
---------------------------------------------------------------------------
A |       1       .       .       3       1       4       .       .       .
B |       4       .       .       5       .       .       3       .       .
C |       .       3       .       5       .       .       .       1       .
D |       .       .       1       .       3       5       5       .       5
===========================================================================

Remark: If some column names are duplicated during column-binding, suffixes are added to the column names of both matrices to indicate which matrix each column belongs to.

Row binding

Here we rename the columns of smat to match those of smat2:

smat3 = smat.copy()
smat3.set_column_names(smat2.column_names()[0:4])
smat3 = smat3.impose_column_names(smat2.column_names())
smat3.print_matrix()
===========================================
  |      c0      c1      c2      c3      c4
-------------------------------------------
A |       1       .       .       3       .
B |       4       .       .       5       .
C |       .       3       .       5       .
D |       .       .       1       .       .
E |       .       .       .       5       .
===========================================
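
The effect of imposing column names can be sketched on row dictionaries: the column list becomes the imposed one, and names that had no entries simply become empty columns. The functions below are hypothetical. Dropping entries whose column is not in the imposed list is an assumption of this sketch; the example above only shows the case of adding a missing column.

```python
def impose_column_names(row_dicts, new_names):
    """Return row dictionaries restricted to `new_names`, plus the new column list."""
    keep = set(new_names)
    rows = {
        row: {col: value for col, value in cols.items() if col in keep}
        for row, cols in row_dicts.items()
    }
    return rows, list(new_names)

def dense_row(row, column_names):
    """Expand one row dictionary into a dense list; missing columns are zeros."""
    return [row.get(col, 0) for col in column_names]

# Rows A and D of smat after renaming its columns to c0..c3
rows = {"A": {"c0": 1, "c3": 3}, "D": {"c2": 1}}
new_rows, names = impose_column_names(rows, ["c0", "c1", "c2", "c3", "c4"])
row_a = dense_row(new_rows["A"], names)
```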

Here we row-bind smat2 and smat3:

smat2.row_bind(smat3).print_matrix()

=============================================
    |      c0      c1      c2      c3      c4
---------------------------------------------
A.1 |       1       4       .       .       .
B.1 |       .       .       3       .       .
C.1 |       .       .       .       1       .
D.1 |       3       5       5       .       5
A.2 |       1       .       .       3       .
B.2 |       4       .       .       5       .
C.2 |       .       3       .       5       .
D.2 |       .       .       1       .       .
E.2 |       .       .       .       5       .
=============================================

Remark: If some row names are duplicated during row-binding, suffixes are added to the row names of both matrices to indicate which matrix each row belongs to.
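
The naming rule in the two binding remarks can be written down as a small function. This is a sketch inferred from the outputs shown above (the suffixes ".1" and ".2" appear in the row-binding example); the function bind_names is hypothetical and is not the package's code.

```python
def bind_names(names1, names2, suffixes=(".1", ".2")):
    """Concatenate two name lists; if any name occurs in both, suffix all names."""
    if set(names1) & set(names2):
        names1 = [f"{name}{suffixes[0]}" for name in names1]
        names2 = [f"{name}{suffixes[1]}" for name in names2]
    return list(names1) + list(names2)

# Row-binding case: the row names overlap, so every name gets a suffix
bound_rows = bind_names(list("ABCD"), list("ABCDE"))

# Column-binding case: no overlap, so the names are only concatenated
bound_columns = bind_names(["a", "b", "c", "d"], ["c0", "c1", "c2", "c3", "c4"])
```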


In place computations

  • The methods for setting row and column names are “in place” methods: no new SSparseMatrix objects are created.
  • The dot product, arithmetic, and transposing methods take an optional argument that controls whether the computation is done in place.
    • The optional argument is copy, which corresponds to the argument with the same name and function in scipy.sparse.
    • By default, the computations are not in place: new objects are created.
    • That is, copy=True is the default.
  • The class SSparseMatrix has the method copy() that produces deep copies when invoked.
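
Here is a minimal sketch of the difference between the two modes. The class NamedValues and its methods are hypothetical; only the idea of a copy argument that defaults to True is taken from the description above.

```python
from copy import deepcopy

class NamedValues:
    """Hypothetical container used to contrast in-place and copying operations."""

    def __init__(self, values, row_names):
        self.values = values
        self.row_names = list(row_names)

    def set_row_names(self, names):
        # Always in place: mutates this object and returns it
        self.row_names = list(names)
        return self

    def scaled(self, factor, copy=True):
        # copy=True (default) works on a deep copy; copy=False mutates self
        target = deepcopy(self) if copy else self
        target.values = [[v * factor for v in row] for row in target.values]
        return target

a = NamedValues([[1, 2]], ["A"])
b = a.scaled(2)                  # new object, a is unchanged
values_after_copy = a.values
c = a.scaled(3, copy=False)      # same object, a is modified
```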

Unit tests

The unit tests (so far) are organized by functionality; see the folder ./tests. Similar unit tests are given in [AAp2].


References

Articles

[AA1] Anton Antonov, “RSparseMatrix for sparse matrices with named rows and columns”, (2015), MathematicaForPrediction at WordPress.

[RB1] Richard Becker, “A Brief History of S”, (2004).

Packages

[AAp1] Anton Antonov, SSparseMatrix.m, (2018), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, SSparseMatrix Mathematica unit tests, (2018), MathematicaForPrediction at GitHub.