
Sunday, September 2, 2012

Installing GraphLab 2.1 on Mountain Lion (Mac OS X 10.8)

Here is a quick explanation of how to install GraphLab 2.1 on Mountain Lion.

1) Install Xcode 4.4.
2) After Xcode is installed, go to Xcode -> Preferences -> Downloads
and install the "Command Line Tools".

3) Install Mercurial.
4) Go to the GraphLab download page and follow the link to the Mercurial code page.
    Open a new terminal and check out the code:
    hg clone https://...
5) cd graphlabapi
    ./configure

6) To compile the collaborative filtering library:
    cd release/toolkits/collaborative_filtering/
    make -j4

Friday, April 20, 2012

KDD CUP 2012 Track 1 using GraphLab

As some of you may recall, last year we had some great excitement with KDD CUP 2011, where we took 5th place in track 1. Additionally, the group that won the contest used GraphLab as well. This year I really wanted to take another look at this interesting and highly visible contest, but until now I did not find the time for it. I got some interesting questions from students in the Georgia Tech ML course taught by Le Song, where the course task is to compete in KDD CUP 2012.

In this blog post I will share some of my approaches to dealing with KDD CUP 2012 track 1. I must warn my readers that these are very preliminary results, but I will be glad to get any input along the way.
Acknowledgement: my close collaborator JustinYan, from the Chinese National Academy of Science,
deserves credit for most of the results below. By the way, he is looking for a summer internship in the US. Contact him if you are looking for an intern. He is based in China, so he will need a US intern visa.

First try - matrix factorization using alternating least squares. 

Goal: build a linear model for the user/item matrix and predict the missing values (the test data).


Preparing the training data:

Some explanations and motivation. Track 1 does not have a separate validation dataset, as there was, for example, in Netflix and last year's KDD CUP. It is very useful to have a validation dataset, or else one may easily overfit the training dataset. Following JustinYan's advice, I have separated the data based on its time properties. According to the unix timestamps of the data,
ratings are distributed from day 283 to day 314 (timestamps lie in [131834878, 1321027199], where 131834878 is 2011.10.12 00:00:00 (unix time +8) and 1321027199 is 2011.11.11 (unix time +8)). We take the last 5 days (310-314) as the validation set and the first days (283-309) as the training set.
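To make the split concrete, here is a small Python sketch of routing each rating by its timestamp. The day-of-year convention in UTC+8 and the cutoff at day 310 are my assumptions based on the description above; the post's exact day numbering may differ slightly.

```python
from datetime import datetime, timedelta, timezone

TZ8 = timezone(timedelta(hours=8))  # timestamps above are interpreted in UTC+8

def day_of_year(ts):
    """Day-of-year of a unix timestamp, evaluated in UTC+8."""
    return datetime.fromtimestamp(ts, TZ8).timetuple().tm_yday

def route(ts, first_validation_day=310):
    """Route one rating to 'validation' (last days) or 'train' (earlier days)."""
    return "validation" if day_of_year(ts) >= first_validation_day else "train"
```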

And here is what I did to convert the data to the GraphLab v1 matrix factorization format:
1) Download and unzip the file track1.zip.
2) I am using the GraphLab matrix factorization version, which assumes a single rating between each user and item. In rec_log_train.txt many items are rated several times. (Later we can add support for multiple ratings of the same item.) Currently I am also ignoring the temporal aspect;
we can add that later. Sort the ratings by user and then by item, so that duplicates become adjacent:
sort -n -k 1,1 -k 2,2 rec_log_train.txt > rec_log_train.sorted.txt
3) Use the GraphLab v2 KDD train parser to filter out the duplicate ratings:
<468|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kdd_train_data_parser rec_log_train.sorted.txt
WARNING:  kdd_train_data_parser.cpp(main:224): Eigen detected. (This is actually good news!)
INFO:     kdd_train_data_parser.cpp(main:226): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens 
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(gzip_in_file:800): Opening input file: rec_log_train.sorted.txt
INFO:     io.hpp(gzip_out_file:831): Opening output file rec_log_train.sorted.txt.out
INFO:     kdd_train_data_parser.cpp(operator():181): Finished parsing total of 73209277 lines in file rec_log_train.sorted.txt
INFO:     kdd_train_data_parser.cpp(operator():197): Finished exporting a total of 39752418 ratings 
Finished in 103.088


Explanation: whenever there are multiple instances of the same user/item pair with the same rating, they are pruned and only one is kept. Whenever there are two or more instances of the same user/item pair with conflicting ratings (both -1 and 1), they are removed from the training data.
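The pruning rule can be expressed in a few lines of Python (a sketch of the logic, not the parser's actual code):

```python
from collections import defaultdict

def prune_duplicates(ratings):
    """ratings: iterable of (user, item, rating) tuples with rating in {-1, 1}.

    Keeps one copy of each user/item pair rated consistently, and drops
    pairs that received both -1 and 1."""
    seen = defaultdict(set)
    for user, item, rating in ratings:
        seen[(user, item)].add(rating)
    # a pair survives only if all its ratings agreed (one distinct value)
    return [(u, i, rs.pop()) for (u, i), rs in seen.items() if len(rs) == 1]
```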

Preparing the test data:

The test data has the following format:
[user] [item] [empty rating] [time]
It has 34,910,937 unique user/item pairs. The test data is explained here:
The rec_log_test.txt file contains data for both the public leaderboard set and final evaluation set. The file is sorted temporally. Any data with a timestamp < 1321891200 is used for the public leaderboard set, and >= 1321891200 is used for the final evaluation set. The goal is to predict which items that were recommended to the user were followed by the user, within each set.
In other words, the test data can be divided into two sets, and within each set we should recommend the highest ranked items. testA has 19,349,608 ratings and testB has 15,561,329.

Here are some more detailed instructions:
4) Download the test data file rec_log_test.txt.zip.
5) Unzip the file:
# unzip rec_log_test.txt.zip
6) Use the GraphLab v2 kdd_test_splitter to prepare the test data:
./kdd_test_splitter rec_log_test.txt --ncpus=1
WARNING:  kdd_test_splitter.cpp(main:159): Eigen detected. (This is actually good news!)
INFO:     kdd_test_splitter.cpp(main:161): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt
Rows:      2421057
Cols:      2421057
Nonzeros:  34910937
Constructing all vertices.
Adding edges.
Graph size:    34910937
Start finalize..
INFO:     kdd_test_splitter.cpp(operator():132): Finished going over 19349608 rating  for 1196411 unique users 
INFO:     kdd_test_splitter.cpp(operator():133): Successfully created output file: testA.sorted
INFO:     io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt
Rows:      2421057
Cols:      2421057
Nonzeros:  34910937
Constructing all vertices.
Adding edges.
Graph size:    34910937
Start finalize..
INFO:     kdd_test_splitter.cpp(operator():132): Finished going over 15561329 rating  for 1196411 unique users 
INFO:     kdd_test_splitter.cpp(operator():133): Successfully created output file: testB.sorted
Finished in 164.793
INFO:     kdd_test_splitter.cpp(main:184): Finished exporting test data!


The output of this procedure is two split test files: testA.sorted holding the earlier ratings and testB.sorted holding the later ones. Ratings are sorted first by user and then by item number.

Understanding the cost function

Track 1 uses AP@3 (average precision at 3) as its error measure. Here is pseudo code I got from JustinYan which computes it:

private static double ProcessList(List<Sample> labelList1)
{
    // labelList holds the top 3 predictions, ordered by descending score
    List<Sample> labelList = labelList1.OrderByDescending(x => x.Prediction).Take(3).ToList();
    // realList holds the samples the user actually clicked (rating equal to one)
    List<Sample> realList = labelList1.Where(x => x.RealValue > 0).ToList();
    // realClickCount is the number of items the user clicked on
    double realClickCount = realList.Count();
    double ap = 0;
    int count = 0;
    // if the user clicked on nothing, AP is zero
    if (realClickCount == 0)
        return 0;
    // for each of the top 3 recommendations
    for (int i = 0; i < labelList.Count; i++)
    {
        // if a recommended item was actually clicked, add its precision
        if (labelList[i].RealValue > 0)
        {
            ap += ++count * 1.0 / (i + 1);
        }
    }
    return ap / realClickCount;
}


GraphLab now supports AP@3 cost function calculation using the command line switch --calc_ap=true
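For readers who prefer Python, here is the same AP@3 computation as a small sketch (my own rendering of the pseudo code above, not GraphLab's internal implementation):

```python
def ap_at_3(samples):
    """samples: list of (prediction, real_value) pairs for one user,
    with real_value in {-1, 1}. Returns AP@3."""
    # top 3 recommendations, ordered by descending predicted score
    top3 = sorted(samples, key=lambda s: s[0], reverse=True)[:3]
    # number of items the user actually clicked on
    clicks = sum(1 for _, real in samples if real > 0)
    if clicks == 0:          # user clicked on nothing: AP is zero
        return 0.0
    ap, hits = 0.0, 0
    for i, (_, real) in enumerate(top3):
        if real > 0:         # a recommended item was actually clicked
            hits += 1
            ap += hits / (i + 1)
    return ap / clicks
```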

Running GraphLab SGD

7) Symlink the test file so that GraphLab recognizes it:

ln -s testA.sorted rec_log_train.sorted.txt.outt


8) Run bias-SGD using the command:



 ./pmf rec_log_train.sorted.txt.out 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --sgd_step_dec=0.98 --sgd_gamma=0.005 --sgd_lambda=0.005 --exporttest=true --calc_ap=true --reduce_mem_consumption=true --exportlinearmodel=false
INFO:     pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO:     pmf.cpp(start:291): Bias-SGD starting


loading data file rec_log_train.sorted.txt.out
Loading Matrix Market file rec_log_train.sorted.txt.out TRAINING
Loading rec_log_train.sorted.txt.out TRAINING
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556
loading data file rec_log_train.sorted.txt.oute
Loading Matrix Market file rec_log_train.sorted.txt.oute VALIDATION
Loading rec_log_train.sorted.txt.oute VALIDATION
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863
loading data file rec_log_train.sorted.txt.outt
Loading Matrix Market file rec_log_train.sorted.txt.outt TEST
Loading rec_log_train.sorted.txt.outt TEST
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 19349608
INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.


Bias-SGD for matrix (2421057, 2421057, 1):33589556.  D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO:     pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO:     asynchronous_engine.hpp(run:111): Worker 0 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 1 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 2 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 3 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 4 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 5 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 6 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 7 started.


Entering last iter with 1
21.4589) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.328273 AP@3 for validation: 0.292205
Entering last iter with 2
68.1817) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332562 AP@3 for validation: 0.308924
Entering last iter with 3
114.985) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332799 AP@3 for validation: 0.312879
Entering last iter with 4
162.032) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.333644 AP@3 for validation: 0.314124
Entering last iter with 5
219.16) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336898 AP@3 for validation: 0.319216
Entering last iter with 6
289.758) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340856 AP@3 for validation: 0.321952
Entering last iter with 7
360.691) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344672 AP@3 for validation: 0.322379
Entering last iter with 8
431.754) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.352534 AP@3 for validation: 0.325194
INFO:     asynchronous_engine.hpp(run:119): Worker 5 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 2 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 1 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 6 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 7 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 4 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 3 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 0 finished.


Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 466.943851 seconds
INFO:     io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.740713
INFO:     read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out.test.predictions vector size: 19349608
INFO:     read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m
Performance counters are: 0) EDGE_TRAVERSAL, 1421.59
Performance counters are: 2) CALC_RMSE_Q, 0.296477
Performance counters are: 6) CALC_OBJ, 1.17573





As you can see, each iteration of bias-SGD takes around 60 seconds on an 8-core machine, so a single good multicore machine is enough. The output is the file rec_log_train.sorted.txt.out.test.predictions.
Let's look at the output:

head rec_log_train.sorted.txt.out.test.predictions
%%MatrixMarket matrix array real general
%%output predictions for test data
19349608 1
        -1
-0.7164881229401
0.07807982712984
-0.9745061397552
-0.7039368152618
-0.6436884999275
        -1
-0.6042827963829

For each user/item pair in the testA.sorted data, we get a prediction. 
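The predictions file is a plain Matrix Market array vector, so it is easy to read without Matlab. A minimal Python reader (the helper name is mine, and this is a sketch for this file layout only):

```python
def read_mm_vector(path):
    """Read a Matrix Market 'array' vector like the predictions file above."""
    with open(path) as f:
        # drop blank lines and %-prefixed header/comment lines
        lines = [ln.strip() for ln in f
                 if ln.strip() and not ln.lstrip().startswith("%")]
    rows, cols = map(int, lines[0].split())   # size line, e.g. "19349608 1"
    values = [float(x) for x in lines[1:]]
    assert len(values) == rows * cols
    return values
```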

Now we store the results for testA:
mv rec_log_train.sorted.txt.out.test.predictions testA.predictions


And we run GraphLab again, this time with testB:

./pmf rec_log_train.sorted.txt.out2 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --lambda=0.01 --sgd_gamma=0.005 --sgd_lambda=0.005 --sgd_step_dec=0.98 --exporttest=true --calc_ap=true
INFO:     pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO:     pmf.cpp(start:291): Bias-SGD starting


loading data file rec_log_train.sorted.txt.out2
Loading Matrix Market file rec_log_train.sorted.txt.out2 TRAINING
Loading rec_log_train.sorted.txt.out2 TRAINING
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556
loading data file rec_log_train.sorted.txt.out2e
Loading Matrix Market file rec_log_train.sorted.txt.out2e VALIDATION
Loading rec_log_train.sorted.txt.out2e VALIDATION
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863
loading data file rec_log_train.sorted.txt.out2t
Loading Matrix Market file rec_log_train.sorted.txt.out2t TEST
Loading rec_log_train.sorted.txt.out2t TEST
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 15561329
INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.


INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.


Bias-SGD for matrix (2421057, 2421057, 1):33589556.  D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO:     pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO:     asynchronous_engine.hpp(run:111): Worker 1 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 0 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 2 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 3 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 4 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 5 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 6 started.


INFO:     asynchronous_engine.hpp(run:111): Worker 7 started.


Entering last iter with 1
42.9776) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.327499 AP@3 for validation: 0.291468
Entering last iter with 2
113.61) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.331413 AP@3 for validation: 0.308851
Entering last iter with 3
184.123) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332523 AP@3 for validation: 0.312983
Entering last iter with 4
251.239) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.334125 AP@3 for validation: 0.314801
Entering last iter with 5
307.7) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336783 AP@3 for validation: 0.318807
Entering last iter with 6
377.682) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340606 AP@3 for validation: 0.320896
Entering last iter with 7
447.13) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344897 AP@3 for validation: 0.322647
Entering last iter with 8
517.358) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.351736 AP@3 for validation: 0.324784
INFO:     asynchronous_engine.hpp(run:119): Worker 2 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 0 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 5 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 6 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 1 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 4 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 7 finished.


INFO:     asynchronous_engine.hpp(run:119): Worker 3 finished.


Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 552.339237 seconds
INFO:     io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.733542
INFO:     read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out2.test.predictions vector size: 15561329
INFO:     read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m
Performance counters are: 0) EDGE_TRAVERSAL, 2109.31
Performance counters are: 2) CALC_RMSE_Q, 0.264258
Performance counters are: 6) CALC_OBJ, 1.1756


Final parsing and compilation of submission

Use GraphLab v2 to create the results in KDD cup format. This code goes over the recommendations and builds a list of the top 3 recommended test items for each user.

The submission file contains exactly 1,340,127 users (713,611 from testA and 626,516 from testB).

There are four inputs to the output builder: the sorted testA and testB files we created previously with kdd_test_splitter, and the two corresponding prediction files output by GraphLab.


<309|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kddcup_output_builder testA.sorted testA.predictions testB.sorted testB.predictions --ncpus=1
WARNING:  kddcup_output_builder.cpp(main:165): Eigen detected. (This is actually good news!)
INFO:     kddcup_output_builder.cpp(main:167): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(load_matrixmarket_graph:325): Reading matrix market file: testB.sorted
Rows:      2421057
Cols:      2421057
Nonzeros:  15561329
Constructing all vertices.
Adding edges.
Graph size:    15561329
INFO:     io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testB.predictions
INFO:     kddcup_output_builder.cpp(operator():131): Finished going over 15561329 rating  for 626516 unique users 
INFO:     kddcup_output_builder.cpp(operator():132): Successfully created output file: testB.sorted.out
INFO:     io.hpp(load_matrixmarket_graph:325): Reading matrix market file: testA.sorted
Rows:      2421057
Cols:      2421057
Nonzeros:  19349608
Constructing all vertices.
Adding edges.
Graph size:    19349608
INFO:     io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testA.predictions
INFO:     kddcup_output_builder.cpp(operator():131): Finished going over 19349608 rating  for 713611 unique users 
INFO:     kddcup_output_builder.cpp(operator():132): Successfully created output file: testA.sorted.out
Finished in 82.1905
INFO:     kddcup_output_builder.cpp(main:189): Merging the two files together
updating: testA.sorted.out (deflated 72%)
INFO:     kddcup_output_builder.cpp(main:208): Successfully created zip file: submission.zip
INFO:     kddcup_output_builder.cpp(main:214): removed temporary files
INFO:     kddcup_output_builder.cpp(main:216): Created successfully zip file: submission.zip

Preliminary results

Here is a very preliminary report on the results I was able to obtain so far:
ALS - 0.254 on the leaderboard
SVD++ - 0.29 on the leaderboard (thanks to my collaborators at the Hungarian National Academy of Science, specifically Tamas Kiss)
bias-SGD - 0.330 on the leaderboard (after some optimization JustinYan got 0.34 with it)
logistic regression - 0.336

Disclaimer

As the code is evolving, there may be some bugs/inaccuracies in it. However, I did not want to wait until the last minute to release a perfect solution that no one would be able to use. Please ping me if you find any bugs or have any suggestions.

Wednesday, April 18, 2012

KDD CUP 2011 reflection

I just learned from my collaborator JustinYan from the Chinese National Academy of Science about the paper "A Linear Ensemble of Individual and Blended Models for Music Rating Prediction", published by the team from the National Taiwan University which won both track 1 and track 2 of KDD CUP 2011.

What is especially interesting about this paper, IMHO, is that the GraphLab matrix factorization library was used for computing two of the models in their ensemble.

That raises to two the number of teams in this contest who utilized GraphLab and won high places.

Thursday, April 12, 2012

LSI vs. SVD

Here is a question about LSI I got from Ashish Dhamelia from capitalnovus.com:

I am new to LSI.
I have read a few papers about LSI and they all suggest using SVD.
Our data is mostly emails and a few office documents.
We get lots of data and we index it using Lucene/Solr.

I started with SemanticVectors. It basically performs SVD on the Lucene index (we can view the Lucene index as a term x document matrix whose elements are TF-IDF scores). So our input matrix is huge and sparse (lots of zeros). I do not have an exact count of the nonzero elements, but I will try to get more details. So, given the TF-IDF matrix (the Lucene index), SemanticVectors performs SVD and creates two output matrices: the left singular matrix is the term vector matrix, while the right singular matrix is the document vector matrix.

So from the term vector matrix, we can query a term and find semantically similar terms based on cosine similarity. The issue here is scalability: SemanticVectors does everything in memory, so it is not scalable.

Right now we are evaluating mahout and graphlab to address this issue.

I forwarded this question to our LSI expert, Andrew Olney from the University of Memphis, and this is what I got:

For either LSI/LSA:
  • Basically you construct a term/doc matrix (each cell is the frequency count of a word (row) in a document (col))
  • Normalize if you want, e.g. log-entropy or tf-idf
  • Take the SVD
LSA is primarily interested in word/word and doc/doc similarity and uses the U matrix (left singular vectors). LSI is primarily interested in IR and also uses the V matrix (right singular vectors). For LSA the edited book "Handbook of LSA" is a good reference. For LSI I suggest this paper. What is meant in the email below by 'concept search' is not clear to me, but it seems similar to a collaborative filtering approach.

Stochastic SVDs (e.g. Gorrell) have been used for that though they are approximate. See also perhaps Lingo.

And finally, maybe an online topic model would be better if the user is selecting 'concepts' or concepts are otherwise reflected in the UI. Topics seem more comprehensible to people than SVD dimensions. See Blei’s paper.

Since we have now a very efficient SVD solver in GraphLab v2, I would love to help anyone who is interested in implementing LSA/LSI on top of it. Let me know if you want to try it out!
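As a minimal numpy illustration of the recipe Andrew describes (term/doc counts, optional normalization, truncated SVD): function names are mine, and this runs in memory, which is exactly the part a scalable solver like GraphLab's would replace.

```python
import numpy as np

def lsa(term_doc, k):
    """Truncated rank-k SVD of a term x document matrix.

    Returns (U_k, s_k, Vt_k): rows of U_k are term vectors (LSA);
    columns of Vt_k are document vectors (LSI)."""
    U, s, Vt = np.linalg.svd(np.asarray(term_doc, dtype=float),
                             full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def term_similarity(U, s, i, j):
    """Cosine similarity of two terms in the latent space."""
    a, b = U[i] * s, U[j] * s
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```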

Saturday, April 7, 2012

The first GraphLab workshop is coming up!

I am absolutely excited to report that we are organizing the first GraphLab workshop. The workshop has two goals:

  • Expose distributed/multicore GraphLab v2  to a wider ML community using demos and tutorials. Meet other GraphLab users and hear what they are working on!
  • A meeting place for technology companies working on large scale machine learning. Companies are invited to present their related research and discuss future challenges.

Universities / research labs participating in the workshop:

  • Carnegie Mellon University
  • Johns Hopkins University
  • University of California, Berkeley
  • University of California, Santa Cruz
  • University of Pennsylvania
  • University of San Francisco
  • Lawrence Berkeley National Lab
  • Georgia Tech
  • Stanford University
Date
The workshop will be held on July 9th in San Francisco.

Registration
Online registration is now open.

  • $50 for students
  • $100 for early bird
  • $150 for regular registration


If you are interested in getting additional information about the workshop, please email me.

Friday, March 30, 2012

Interesting twitter dataset and other big datasets

I got this from Aapo Kyrola, who got it from Guy Blelloch:

An interesting paper which explores the twitter social network is:

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.



The twitter graph is available for download here. The format is very simple:
user follower\n

The graph has 41M nodes and 1.4 billion edges. What is nice about it is that you can view the profile of each node id using the twitter web API. For example, for user 12 you can do:
http://api.twitter.com/1/users/show.xml?user_id=12
Some statistics about the graph are found here.

If you would like to use it in GraphLab v2, you need to do the following:
1) assuming the graph file name is user_follower.txt, sort the graph using:
   sort -u -n -k 1,1 -k 2,2 -T . user_follower.txt > user_follower.sorted
2) Add the following matrix market format header to the file:
   %%MatrixMarket matrix coordinate real general
   61578414 61578414 1468365182
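Step 2 can also be done programmatically; here is a small sketch (the function name is mine) that prepends the banner and size line to an in-memory edge list:

```python
def add_mm_header(edge_lines, rows, cols, nnz):
    """Prepend the Matrix Market banner and size line to an edge list."""
    header = ["%%MatrixMarket matrix coordinate real general",
              "%d %d %d" % (rows, cols, nnz)]
    return header + list(edge_lines)
```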

I am using the k-cores algorithm to reveal this graph's structure. I will add some results soon.

And here is a library of webgraphs and other big graphs I got from Kanat Tangwongsan.

Friday, March 23, 2012

Large scale machine learning benchmark

About a week ago Joey and I had an interesting talk with Nicholas Kolegraff. Nicholas suggested a very useful initiative: to compile an EC2 Linux distribution with several machine learning frameworks installed, so one can very easily assess their performance on a set of benchmark problems. Nicholas has a preliminary demo, where one can log into the configured Linux system via his website and run multiple benchmarks using precompiled scripts. Using his system, it is easier for people to evaluate different ML frameworks without the inevitable learning and setup curve of using each system independently. Since we really liked his idea, we invited Nicholas to give a demo at our coming GraphLab Bay Area workshop.

Currently, Nicholas includes the following systems in his distribution: GraphLab, Mahout, MadLIB and Vowpal Wabbit. He further asked us to help him compile some interesting datasets to test the different systems on.

Not long ago, during our Intel Labs-GraphLab collaboration, the Intel guys also asked for help in creating some kind of baseline benchmark suite for comparing different machine learning methods and implementations. Additionally, Edin Muharemagic, an architect at HPCC Systems, has asked for help in assembling several large datasets for comparing their new SVD solver, part of their new LexisNexis ML library.

Now that it seems many people find the same basic idea useful, we should probably make an effort in this direction. For a start, I am looking into providing some benchmarks for large scale sparse SVD problems. So why not crowdsource this effort with the help of my readers?

As a first try, I contacted Joshua Vogelstein from Johns Hopkins, who kindly offered to donate neural data. Now I am looking for additional researchers who are willing to donate large scale data for real SVD / spectral clustering tasks in different fields. Matrices should have between 100M and 10 billion non-zeros. Please email me if you are interested in helping create a standard benchmarking suite!

[An update]
I just got a note from Andrew Olney, from the University of Memphis, that he is contributing a wikipedia term occurrence matrix to the benchmark collection. I also got some wikipedia term occurrence matrices from Jamie Callan, Brian Murphy, and Partha Talukdar at CMU. Thanks everyone!

The datasets are added to our GraphLab datasets page here.

GraphLab in Budapest


This week I visited the Hungarian National Academy of Sciences. I was invited as part of our newly formed collaboration with the Lawa FP7 EU research project. Lawa stands for "Longitudinal Analytics of Web Archive Data". As part of this collaboration, GraphLab will be used for running algorithms on web scale data collected and analyzed by the Lawa project. My host was András Benczúr, who hosted us superbly. Scott Kirkpatrick from the Hebrew University of Jerusalem is heading this collaboration and created the link between the projects.

I gave a talk about GraphLab to a crowd of about 30 researchers, with a lot of interesting questions and comments. One pleasant surprise was that several people from Gravity showed up at my talk. Gravity was one of the leading teams in the Netflix prize. Specifically, I had a nice discussion with Levente Török from the Gravity team. Levente installed GraphLab not long ago and has started to play with it.

An interesting piece of research performed as part of the Lawa project, by Peter Triantafillou from the University of Patras, Greece, deals with join range queries. Join range queries are very useful when querying large datasets. A cool Bay Area startup called quantiFind has a really impressive demo of the kinds of data you can gather using join range queries. Here is a video I got from Ari Tuchman, their co-founder:


While staying in Budapest, and "stealing" wireless internet from some coffee shops, I was still busy arranging our first GraphLab workshop. Piero P. Bonissone, chief scientist in General Electric Global Research, has emailed me and kindly agreed to participate in our program committee. Gilad Shainer from Mellanox, chair of the High Performance Computing Advisory Council (a non-profit organization promoting HPC computing, with over 300 worldwide companies and universities participating), has kindly agreed to help us organize the event.

I got the following encouraging note from John Marc Agnosta, a researcher at Toyota InfoTechnology Center USA:

I just came across your blog post on the first Graphlab workshop planned for this summer, and I'd like to join the list of participating individuals & companies.

Let me introduce myself. I have been a member of Intel Research, where I got to know some of the other folks mentioned in the program committee. After leaving Intel Research last year, I've just recently joined a small Bay Area lab that is a Toyota joint venture. I am their machine learning lead. We are planning some efforts this coming year where GraphLab could be employed.

I've heard about GraphLab at UAI and in discussions with other research folks, and I'm excited to see it progress.
I want to deeply thank Piero, Gilad and John, and all the others who are actively helping to promote and organize our workshop - without your great help it would simply not have been possible to make it happen.

Sunday, March 18, 2012

How to prepare your problem to be used in GraphLab / GraphLab parsers

In many cases your data is a collection of strings, and you would like to convert it to numerical form so it can be used in any machine learning software.

In GraphLab v2, I have added a parsers library that will hopefully make this task easier.
Let's start with an example. Suppose I have a collection of documents, and each document
contains a bag of words. The input to the GraphLab parser is:

1::the boy with big hat was here
2::no one would have believe in the last years of the nineteenth century

where 1 and 2 are the numeric document ids, '::' is used as a separator, and the rest of the line contains the keywords that appear in that document.

Assuming you have this format, it is very easy to convert it to be used in GraphLab. You simply use
the texttokenparser application.
Preliminaries: you will need to install GraphLab v2 (see the installation section here).

And here is an example run of the parser:

./texttokenparser --dir=./ --outdir=./ --data=document_corpus.txt --gzip=false --debug=true
WARNING:  texttokenparser.cpp(main:209): Eigen detected. (This is actually good news!)
INFO:     texttokenparser.cpp(main:211): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens 
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(gzip_in_file:698): Opening input file: ./document_corpus.txt
INFO:     io.hpp(gzip_out_file:729): Opening output file ./document_corpus.txt.out
Read line: 1 From: 1 To: 1 string: the
Read line: 1 From: 1 To: 2 string: boy
Read line: 4 From: 1 To: 3 string: with

INFO:     texttokenparser.cpp(operator():159): Parsed line: 50000 map size is: 30219
INFO:     texttokenparser.cpp(operator():159): Parsed line: 100000 map size is: 39510
INFO:     texttokenparser.cpp(operator():159): Parsed line: 150000 map size is: 45200
INFO:     texttokenparser.cpp(operator():159): Parsed line: 200000 map size is: 50310
INFO:     texttokenparser.cpp(operator():164): Finished parsing total of 230114 lines in file document_corpus.txt
total map size: 52655
Finished in 17.0022
Total number of edges: 0
INFO:     io.hpp(save_map_to_file:813): Save map to file: ./.map map size: 52655
INFO:     io.hpp(save_map_to_file:813): Save map to file: ./.reverse.map map size: 52655

The output of the parser :
1) A text file containing consecutive integers in sparse matrix market format. In other words, each string is assigned an id, and a sparse matrix is formed where the rows are the document numbers and the non-zero columns are the strings.
NOTE: currently you will need to manually create the two header lines as explained here. The header lines specify the number of rows, columns and non-zero entries in the matrix. In the future I will automate this process.
2) A mapping from each text keyword to its matching integer
3) A mapping from each integer to its matching string.
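Until the header creation mentioned in the NOTE above is automated, a small script can add the two header lines for you. Here is a hedged Python sketch (the function and file names are illustrative, not part of GraphLab) that counts the entries and prepends the matrix market header:

```python
# Prepend matrix market header lines to a file of "row col value" triplets.
# Assumes the body contains one non-zero entry per line, with 1-based indices.
def add_mm_header(body_file, out_file):
    rows = cols = nnz = 0
    entries = []
    with open(body_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            r, c = int(parts[0]), int(parts[1])
            rows = max(rows, r)   # matrix size inferred from largest indices
            cols = max(cols, c)
            nnz += 1
            entries.append(line.rstrip("\n"))
    with open(out_file, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("%d %d %d\n" % (rows, cols, nnz))
        f.write("\n".join(entries) + "\n")
```

Note that inferring the size from the largest observed indices only works if the last row and column actually contain a non-zero; otherwise pass the true dimensions explicitly.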

Advanced options:
1) It is possible to parse in parallel (on a multicore machine) multiple files and still have the ids assigned correctly. Use the --filter= command line argument to select all files starting with a certain prefix. Do not use the --data= command line argument in that case.
2) Support for gzip input format. Using --gzip=true command line option.
3) Save the mapping into readable text file using the --save_in_text=true command line argument.
4) Incrementally add more documents to an existing map by using the --load=true command line flag.
5) Limit the number of parsed lines using --lines=XX command line flag (useful for debugging!)
6) Enable verbose mode using --debug=true command line flag.
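For readers without GraphLab at hand, the core idea of the parser - assigning consecutive integer ids to tokens and emitting sparse-matrix entries - can be sketched in a few lines of Python. This is illustrative only, not the parser's actual implementation; the '::' separator follows the input format shown above:

```python
# Sketch of the parser's core idea: map each token to a consecutive
# integer id and emit (document, token_id) pairs for a sparse matrix.
def parse_corpus(lines):
    token_ids = {}     # token -> consecutive integer id (1-based)
    entries = []       # (doc_id, token_id) non-zero entries
    for line in lines:
        doc, _, text = line.partition("::")
        for token in text.split():
            if token not in token_ids:
                token_ids[token] = len(token_ids) + 1
            entries.append((int(doc), token_ids[token]))
    return token_ids, entries
```

The `token_ids` dictionary corresponds to the `.map` file the parser writes out, and inverting it gives the `.reverse.map`.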

Monday, February 27, 2012

Matrix Market Format

Matrix Market is a very simple format devised by NIST to store different types of matrices.

For GraphLab matrix libraries: linear solvers, matrix factorization and clustering we recommend using this format for the input file. Once this format is used for the input, the same format will be used for the output files.


Sparse matrices
Matrix Market files have a very simple structure: two header lines followed by the data. Here is an
example 3x4 matrix:

A =

0.8147         0    0.2785         0
0.9058    0.6324    0.5469    0.1576
0.1270    0.0975    0.9575         0

And here is the matching matrix market file:

%%MatrixMarket matrix coordinate real general
% Generated 27-Feb-2012
3 4 9
1 1  0.8147236863931789
2 1  0.9057919370756192
3 1  0.1269868162935061
2 2  0.6323592462254095
3 2  0.09754040499940952
1 3  0.2784982188670484
2 3  0.5468815192049838
3 3  0.9575068354342976
2 4  0.1576130816775483

The first line is the header; coordinate means this is a sparse matrix.
The third line gives the matrix size: 3 4 9 means 3 rows, 4 columns and 9 non-zero entries.
The rest of the lines list the entries. For example, the 4th line,
1 1 0.8147236863931789, means the entry at the first row and first column has that value.

TIP: Sparse matrices should NOT include zero entries!
For example, the line 1 1 0 is not allowed!
TIP: The first two numbers in each non-zero entry line should be integers, not in floating point notation. For example, 1e+2 is not a valid row/column number; it should be 100 instead.
TIP: Row/column numbers always start from one (and not from zero as in C!)
TIP: It is not legal to include the same non-zero entry twice. In GraphLab this results in a duplicate edge error. Note that edge numbering in GraphLab starts at zero, so you have to add one to the source and target values to locate the edge in the matrix market file.

Dense matrices:
Here is an example of how to save the same matrix as a dense matrix:

A =

0.8147         0    0.2785         0
0.9058    0.6324    0.5469    0.1576
0.1270    0.0975    0.9575         0


%%MatrixMarket matrix array real general
3 4
0.8147
0.9058
0.1270
0
0.6324
0.0975
0.2785
0.5469
0.9575
0
0.1576
0
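Note that in the dense (array) format the values are listed column by column, in column-major order. Here is a hedged Python sketch (illustrative only) of writing a dense matrix this way:

```python
# Write a dense matrix (given as a list of rows) in matrix market
# array format. Values are emitted column by column (column-major).
def dense_to_mm(matrix):
    rows, cols = len(matrix), len(matrix[0])
    lines = ["%%MatrixMarket matrix array real general",
             "%d %d" % (rows, cols)]
    for c in range(cols):
        for r in range(rows):
            lines.append(str(matrix[r][c]))
    return "\n".join(lines)
```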

Symmetric sparse matrices:
Here is an example of a sparse symmetric matrix:
B =

    1.5652         0    1.4488
         0    2.0551    2.1969
    1.4488    2.1969    2.7814

And here is the matching matrix market file:
%%MatrixMarket matrix coordinate real symmetric
% Generated 27-Feb-2012
3 3 5
1 1  1.5652
3 1  1.4488
2 2  2.0551
3 2  2.1969
3 3  2.7814

Note that each non-diagonal entry is written only once.
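Since each off-diagonal entry appears only once, a reader of a symmetric file may need to mirror the entries to recover the full matrix. A hedged Python sketch (illustrative, not GraphLab code):

```python
# Expand entries of a symmetric matrix market file into the full set
# of (row, col, value) triplets by mirroring off-diagonal entries.
def expand_symmetric(entries):
    full = []
    for r, c, v in entries:
        full.append((r, c, v))
        if r != c:                 # diagonal entries are not mirrored
            full.append((c, r, v))
    return full
```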

Sparse vectors:
Here is an example of a sparse vector:
v =

     1     0     0     1

%%MatrixMarket matrix coordinate real general
% Generated 27-Feb-2012
1 4 2
1 1 1
1 4 1

Working with matlab:
Download the files http://graphlab.org/mmwrite.m and http://graphlab.org/mmread.m
In Matlab you can save a dense matrix using:
>> mmwrite('filename', full(matrixname));
And save a sparse matrix using:
>> mmwrite('filename', sparse(matrixname));
For reading a sparse or dense matrix you can:
>> A = mmread('filename');

Writing a conversion function in Java
This section explains how to convert Mahout 0.4 sequence vectors into matrix market format.

Create a file named Vec2mm.java with the following content:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

/**
 * Code for converting Hadoop seq vectors to matrix market format
 * @author Danny Bickson, CMU
 *
 */

public class Vec2mm{
 
 
 public static int Cardinality;
 
 /**
  * 
  * @param args[0] - input seq vector file
  * @param args[1] - output matrix market file
  * @param args[2] - number of rows
  * @param args[3] - number of columns
  * @param args[4] - number of non zeros
  * @param args[5] - transpose
  */
 public static void main(String[] args){
 
   try {
     if (args.length != 6){
        System.err.println("Usage: java Vec2mm <input seq vec file> <output matrix market file> <number of rows> <number of cols> <number of non zeros> <transpose output>");
        System.exit(1);
     }

        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        BufferedWriter br = new BufferedWriter(new FileWriter(args[1]));
        int rows = Integer.parseInt(args[2]);
        int cols = Integer.parseInt(args[3]);
        int nnz = Integer.parseInt(args[4]);
        boolean transpose = Boolean.parseBoolean(args[5]);
        IntWritable key = new IntWritable();
        VectorWritable vec = new VectorWritable();
        br.write("%%MatrixMarket matrix coordinate real general\n");    
        if (!transpose)
          br.write(rows + " " +cols + " " + nnz + "\n");
        else br.write(cols + " " + rows + " " +  nnz + "\n");
        while (reader.next(key, vec)) {
          //System.out.println("key " + key);
          SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get();
          //System.out.println("key " + key + " value: " + vect);
          Iterator<Vector.Element> iter = vect.iterateNonZero();

          while(iter.hasNext()){
            Vector.Element element = iter.next();
           //note: matrix market offsets starts from one and not from zero
           if (!transpose)
             br.write((element.index()+1) + " " + (key.get()+1)+ " " + vect.getQuick(element.index())+"\n");
           else 
             br.write((key.get()+1) + " " + (element.index()+1) + " " + vect.getQuick(element.index())+"\n");
           }
       }
  
       reader.close();
       br.close();
   } catch(Exception ex){
     ex.printStackTrace();
   }
 }
}
Compile this code using the command:
javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java

Now run it using the command:
java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/google-collections-1.0-rc2.jar Vec2mm A.seq A.mm 25 100 2500 false

Where A.seq is the input file in SequentialAccessSparseVector key/value store. A.mm
will be the resulting output file in matrix market format. 25 is the number of rows, 100 columns, and 2500 non zeros. false - not to transpose the resulting matrix.

Depending on the saved vector type, you may want to change my code from SequentialAccessSparseVector to the specific type you need to convert.

Sunday, February 26, 2012

SVD strikes again!

As you may know, SVD is a topic which constantly comes up in my blog. This is a chicken and egg problem: the more I work on SVD, the more requests I get for help with large scale SVD solutions, and then I work on SVD even more.

Here are some updates from the last couple of weeks.
Vincent Bor, an independent consultant from the Dallas/Fort Worth area, sent me the following. The problem he is working on is a document-term matrix of 207,710 terms by 153,328 documents with 22,518,138 non-zero elements. It seems he has an older SVD implementation he is trying to compare to GraphLab.

I do bidiagonalization so I use original matrix A not ATA or AAT, maybe because of this Lanczos better processes ill-conditioned matrices (?), I have restarting but I haven't seen yet that restarting was ever used (I do not see all logs, only from time to time). The algorithm is based on the paper "Lanczos bidiagonalization with partial reorthogonalization" by Rasmus Munk Larsen, but I did full reorthogonalization. I use The Compressed Row Storage (CRS) format to keep matrix in memory and for vector matrix multiplication. During the iterations I keep V, U vectors on the disk, only matrix is in memory. I use parallel calculations but only in two places: vector,matrix multiplication and creating final singular vectors.

It seems that our first version of the SVD solver has numerical issues with his matrix. In a few days I am going to release a fix that includes restarts, as part of GraphLab v2.

Brian Murphy, a postdoc at CMU, sent me the following:
By the way, the SVD task I have is not enormous - it's a 50m (web documents) x 40k (words of vocabulary) matrix of real values, and very sparse (3.8bn entries or sparsity of 0.2%). Until now we've been using the sparse SVD libraries in python's scipy (which uses ARPACK). But with that the largest matrix I can comfortably manage is about 10-20m.

Even if I could refactor the code to fit 50m, we would still like to be able to run this with much larger collections of web documents.

Another solution I have tried is the GenSim package, which uses a couple of approximate SVD solutions, but so far the results using the decompositions it produces are not so encouraging. This paper describes the methods used in Gensim: Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms

Now, this sounds like a challenging magnitude - 3.8 billion non-zeros. I am definitely going to help Brian use GraphLab v2 to crack his problem!

Andrew McGregor Olney, an assistant professor at the Dept. of Psychology at the University of Memphis, has sent me the following:
You might be interested in this paper that describes my partial SVD with no reorthogonalization. It's not highly original, but attempts to synthesize some of the previous work into a simple, practical, and accurate low-memory solution for SVD of term-doc matrices.

I've got some open-source C# code to share (it's not really pretty, but at least it's OO) and some of the original octave scripts I cobbled together in 2005 (using B instead of AtA) if you are interested.

Andrew further sent me a sample matrix of 4,364,328 rows, 3,333,798 columns and 513,883,084 non-zeros to try out. He only cares about the left vectors and singular values.

Edin Muharemagic is working on LexisNexis' new ML implementation, and he is currently implementing SVD on top of it. You can read some more about it in my blog here.

The best large scale SVD implementation I am aware of is Slepc. Now, if this problem is solved, you may ask, why is everyone emailing me? Unfortunately, using Slepc requires an understanding of MPI, C programming ability and a non-negligible learning curve. To understand why a proper SVD implementation is non-trivial, I asked Joel Tropp, who gave the following answer:

In short, the Lanczos method is very unstable because it tries to "cheat" when it computes an orthogonal basis for the Krylov subspace. This procedure only works well in exact arithmetic; it breaks very quickly in the presence of floating-point errors. Drastic remedies are required to fix this problem. See the book by Golub and Van Loan for a discussion and references. A more introductory presentation appears in Trefethen and Bau.

Another possible approach to computing a large-scale SVD is to use column-sampling approaches, perhaps followed by a randomized SVD as advocated by Kwok and Li. See Chiu and Demanet (arXiv) or Gittens (arXiv) for some related methods.
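The randomized approach mentioned above can be sketched in a few lines of numpy. This is a hedged illustration of the general idea (a random range finder followed by an SVD of the small projected matrix), not the specific method of any of the cited papers:

```python
import numpy as np

# Randomized SVD sketch: project A onto a random subspace, orthonormalize
# the result, and take the SVD of the much smaller projected matrix.
def randomized_svd(A, k, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)       # orthonormal basis for the range of A
    B = Q.T @ A                          # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

Because only a tall-skinny QR and a small SVD are needed, this kind of scheme scales to matrices far beyond what classic Lanczos codes handle comfortably, at the cost of approximation error when the spectrum decays slowly.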

Anyway, stay tuned, since in the coming week or so I am going to release an improved SVD solver in GraphLab version 2. I hope it will be able to address all the above requests for help. And if you are working on large scale SVD or need such an implementation, please let me know!

Thursday, February 9, 2012

New Java GraphLab tutorial

I am glad to report that we have a new addition to our GraphLab team: Jim Lim, a CMU sophomore in information systems (computer science).

Jim will be working on the GraphLab Java interface: making it more usable, adapting it to GraphLab v2 and adding new features. As a first step, Jim wrote a tutorial of the Java GraphLab interface.

Monday, February 6, 2012

Installing GraphLab on Ubuntu - 32 bit

Warning: this tutorial is deprecated. We no longer support GraphLab for Ubuntu 32 bit. You should use Ubuntu 64 bit with the instructions here: http://bickson.blogspot.co.il/2011/10/installing-graphlabboost-libboost-on.html

1) Computer setup
Log into your Ubuntu 10.10 machine, or log into the Amazon AWS console and select a 32 bit instance to run from here. I chose ami-71dc0b18.

2) Update your system
sudo apt-get update
sudo apt-get install build-essential

3) Install Boost
sudo apt-get install libboost-dev libboost-iostreams1.40-dev libboost-program-options1.40.0 libboost-filesystem1.40.0 libboost-system1.40-dev
Fix links:
sudo ln -s /usr/lib/libboost_program_options.so.1.40.0 /usr/lib/libboost_program_options.so
sudo ln -s /usr/lib/libboost_filesystem.so.1.40.0 /usr/lib/libboost_filesystem.so
sudo apt-get install libitpp-dev

4) Install Mercurial
sudo apt-get install mercurial

5) Install CMake
sudo apt-get install cmake

6a) Checkout graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.
copy the command string: "hg clone..." and execute it in your ubuntu shell.

or 6b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

7) configure and compile
cd graphlabapi
export BOOST_ROOT=/usr/
./configure --bootstrap --yes
cd release/
make

8) Test your build
cd tests
./runtests.sh

Possible errors:
libs/iostreams/src/zlib.cpp:20:76: error: zlib.h: No such file or directory
libs/iostreams/src/zlib.cpp:31: error: ‘Z_NO_COMPRESSION’ was not declared in this scope
libs/iostreams/src/zlib.cpp:32: error: ‘Z_BEST_SPEED’ was not declared in this scope
libs/iostreams/src/zlib.cpp:33: error: ‘Z_BEST_COMPRESSION’ was not declared in this scope
libs/iostreams/src/zlib.cpp:34: error: ‘Z_DEFAULT_COMPRESSION’ was not declared in this scope
libs/iostreams/src/zlib.cpp:38: error: ‘Z_DEFLATED’ was not declared in this scope
libs/iostreams/src/zlib.cpp:42: error: ‘Z_DEFAULT_STRATEGY’ was not declared in this scope
libs/iostreams/src/zlib.cpp:43: error: ‘Z_FILTERED’ was not declared in this scope
libs/iostreams/src/zlib.cpp:44: error: ‘Z_HUFFMAN_ONLY’ was not declared in this scope
libs/iostreams/src/zlib.cpp:48: error: ‘Z_OK’ was not declared in this scope
libs/iostreams/src/zlib.cpp:49: error: ‘Z_STREAM_END’ was not declared in this scope
libs/iostreams/src/zlib.cpp:50: error: ‘Z_STREAM_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:51: error: ‘Z_VERSION_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:52: error: ‘Z_DATA_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:53: error: ‘Z_MEM_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:54: error: ‘Z_BUF_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp:58: error: ‘Z_FINISH’ was not declared in this scope
libs/iostreams/src/zlib.cpp:59: error: ‘Z_NO_FLUSH’ was not declared in this scope
libs/iostreams/src/zlib.cpp:60: error: ‘Z_SYNC_FLUSH’ was not declared in this scope
libs/iostreams/src/zlib.cpp: In static member function ‘static void boost::iostreams::zlib_error::check(int)’:
libs/iostreams/src/zlib.cpp:77: error: ‘Z_OK’ was not declared in this scope
libs/iostreams/src/zlib.cpp:78: error: ‘Z_STREAM_END’ was not declared in this scope
libs/iostreams/src/zlib.cpp:81: error: ‘Z_MEM_ERROR’ was not declared in this scope
libs/iostreams/src/zlib.cpp: In constructor ‘boost::iostreams::detail::zlib_base::zlib_base()’:
libs/iostreams/src/zlib.cpp:94: error: expected type-specifier before ‘z_stream’
libs/iostreams/src/zlib.cpp:94: error: expected ‘)’ before ‘z_stream’
libs/iostreams/src/zlib.cpp: In destructor ‘boost::iostreams::detail::zlib_base::~zlib_base()’:
libs/iostreams/src/zlib.cpp:97: error: expected type-specifier before ‘z_stream’
libs/iostreams/src/zlib.cpp:97: error: expected ‘>’ before ‘z_stream’
libs/iostreams/src/zlib.cpp:97: error: expected ‘(’ before ‘z_stream’
libs/iostreams/src/zlib.cpp:97: error: ‘z_stream’ was not declared in this scope
libs/iostreams/src/zlib.cpp:97: error: expected primary-expression before ‘>’ token
libs/iostreams/src/zlib.cpp:97: error: expected ‘)’ before ‘;’ token
libs/iostreams/src/zlib.cpp: In member function ‘void boost::iostreams::detail::zlib_base::before(const char*&, const char*, char*&, char*)’:
libs/iostreams/src/zlib.cpp:102: error: ‘z_stream’ was not declared in this scope

Solution:
Either install zlib, or use the above commands to install libboost1.40 from apt-get (recommended).

Error:
In file included from /home/ubuntu/graphlabapi/demoapps/gabp/gamp.cpp:26:
/home/ubuntu/graphlabapi/demoapps/gabp/cas_array.h: In function ‘void mul(double*, double)’:
/home/ubuntu/graphlabapi/demoapps/gabp/cas_array.h:175: warning: dereferencing type-punned pointer will break strict-aliasing rules
/home/ubuntu/graphlabapi/demoapps/gabp/cas_array.h:175: warning: dereferencing type-punned pointer will break strict-aliasing rules
c++: Internal error: Killed (program cc1plus)
Please submit a full bug report.
See  for instructions.
Solution: It is a known compiler error on Ubuntu 32 bit. I simply tried again and it worked.
Alternatively, you may need to upgrade your compiler.


Error:
Probing for boost...
Probing in /usr/
/usr/bin/ld: cannot find -lboost_system
collect2: ld returned 1 exit status
Probing in /home/ubuntu/graphlabapi/deps
/usr/bin/ld: cannot find -lboost_system
Solution:
If the package libboost_system is not installed, install it. You can view all available
package by using the command:
apt-cache pkgnames | grep boost | grep 1.40
You can view all installed packages by
dpkg -l | grep boost

If the package is installed, create a symbolic link as explained in section 3 (subsection fix links).

Error:
Mon Feb  6 13:25:48 UTC 2012
~/graphlabapi ~/graphlabapi
cmake detected in /usr/bin/cmake. Skipping cmake installation.
~/graphlabapi
~/graphlabapi ~/graphlabapi
Probing for boost...
BOOST_ROOT not defined. Probing in usual directories...
/usr/bin/ld: cannot find -lboost_iostreams
collect2: ld returned 1 exit status
Probing in /home/ubuntu/graphlabapi/deps
/usr/bin/ld: cannot find -lboost_iostreams
collect2: ld returned 1 exit status

Solution:
export BOOST_ROOT as described above.