I got some great tips from Brian Dolan, Founder and VP product of Discovix Inc, a cool California area startup specializing in building machine learning and mathematical products.
Brian asked me a few days ago if we thought about interfacing GraphLab to some commonly used databases, since usually there is where the client data is found. Since I have no green clue in databses, Brian volunteered to teach me (and you my blog reader!) how to do it.
And here is the tutorial as I got it from Brian. First we create table with fake data
to be used in the example:
DROP SCHEMA IF EXISTS glab;
CREATE SCHEMA glab;
-- Create a table to hold some bogus data
DROP TABLE IF EXISTS glab.bs_data;
CREATE TABLE glab.bs_data
(
rw int DEFAULT NULL
, cl int DEFAULT NULL
, vl numeric DEFAULT NULL
) DISTRIBUTED BY (rw, cl)
;
-- Now create some bogus data. I use a few postgres tricks.
-- Ultimately, this just creates a relatively sparse "matrix" in tall format
-- I'm trying to use best practices in formatting and stuff
INSERT INTO glab.bs_data (rw, cl, vl)
SELECT
x
, y
, random() AS vl
FROM
(
SELECT
CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x
, CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y
FROM
generate_series(1, 1000) AS x
, generate_series(1, 1000) AS y
) AS A
WHERE 1=1
AND a.x IS NOT NULL
AND a.y IS NOT NULL
;
After the table is ready we can push it into a file: -- So we have some data. Let's create an external table
-- to push it OUT of the database
-- You need to run gpfdist to write to an external table.
-- In this example, from the directory /Users/gpadmin/glab/tables the command was
-- > gpfdist
DROP EXTERNAL TABLE IF EXISTS glab.ext_out;
CREATE WRITABLE EXTERNAL TABLE glab.ext_out (
i bigint
, j bigint
, v numeric
)
LOCATION ('gpfdist://macbuddha.local/glab_out.txt')
FORMAT 'TEXT' (DELIMITER ',' NULL ' ')
DISTRIBUTED BY (i, j)
;
-- Right, now let's put the data in it. This is two queries unioned so
-- we can create the first row. Best to do them separately.
INSERT INTO glab.ext_out
SELECT
max(rw) AS i
, max(cl) AS j
, count(*) AS v
FROM glab.bs_data;
INSERT INTO glab.ext_out
SELECT
rw AS i
, cl AS j
, vl AS v
FROM glab.bs_data
ORDER BY 1,2
;
And here is the example output glab_out.txt: 1000,1000,10002 1,69,0.71019060537219 1,577,0.747299919836223 1,627,0.252593372948468 1,753,0.120338548440486 1,768,0.34520703041926 1,826,0.756854422856122 1,838,0.827316934708506 1,936,0.342057122848928 2,323,0.090937509201467 ...Now it is very easy to read it in GraphLab. In Matlab/Octave do:
load glab_out.txt; % load data to memory
A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix
mmwrite('A', A); % save matrix to file. You may need to donload the script mmwrite.m from http://graphlab.org/mmwrite.m
Since this is potentially useful, I am going to have support for this data format soon in GraphLab, without the matlab conversion.
Next, Brian promised to teach me how to export data from GraphLab back to SQL.

