Recently I’ve been thinking about storing records on disk quickly. For my general use case, an RDBMS isn’t quite fast enough. My first thought, probably like many a NoSQL person before me, is how fast can I go if I give up ACID?
Even if I intend to do the final implementation in C++, I’ll often experiment with Perl first; it’s usually possible to use the same C libraries anyway.
First up, how about serialising records with Storable into an on-disk hash table like Berkeley DB?
(Aside: I’m probably going to appear obsessed with benchmarking now, but really I’m just sticking a finger in the air to get an idea about how various approaches perform. I can estimate 90cm given a metre stick. I don’t need a more precise way to do a rough estimate.)
use Storable;
use Benchmark qw(:hireswallclock);
use BerkeleyDB;
I need to make a random record to store in the DB.
sub make_record
{
    my ($order_id, $fields, $key_len, $val_len) = @_;
    my %record;

    $record{'order_id'} = $order_id;

    # order_id plus ($fields - 1) random fields gives $fields fields in total
    for my $field_no (1..$fields-1) {
        my $key = 'key';
        my $val = 'val';

        # Pad key and value with random upper-case letters to the requested lengths
        $key .= chr(65 + rand(26)) for (1..$key_len - length($key));
        $val .= chr(65 + rand(26)) for (1..$val_len - length($val));

        $record{$key} = $val;
        print "$key -> $val\n";    # debug output; the record is only built once
    }
    return \%record;
}
And a wrapper handles the general case I’ll be testing: keys and values of equal length, with order_id starting at 1.
sub rec
{
    my ($fields, $len) = @_;
    return make_record(1, $fields, $len, $len);
}
I’ll compare serialisation alone against actually storing the data to disk, to see the upper limit I could achieve if, for example, I were using an SSD.
sub freeze_only
{
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;

    # Same signature and key construction as store_record, but the database
    # is never touched - only the serialisation cost is measured.
    my $key = "/order/$ref_record->{'order_id'}";
    Storable::freeze($ref_record);
}
And I’m curious to know how much overhead syncing to disk adds.
my $ORDER_ID = 0;

sub store_record
{
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;

    # Give each record a fresh key so we don't simply overwrite one entry
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";

    # Serialise with Storable, write to Berkeley DB, and optionally flush to disk
    $db->db_put($key, Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}
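Reading a record back isn’t part of the benchmark, but for completeness the reverse path is just db_get followed by Storable::thaw. A minimal sketch, assuming the same "/order/<id>" key scheme as store_record (fetch_record is a hypothetical helper, not used below):

sub fetch_record
{
    my ($db, $order_id) = @_;

    # Look the record up by the same key store_record uses; db_get fills in
    # $frozen and returns 0 on success.
    my $frozen;
    return undef unless $db->db_get("/order/$order_id", $frozen) == 0;

    # Deserialise back into a hash reference
    return Storable::thaw($frozen);
}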
The Test Program
A record with 50 fields, each of size 50, seems reasonable.
my $filename = "$ENV{HOME}/test.db";
unlink $filename;

my $db = new BerkeleyDB::Hash
    -Filename => $filename,
    -Flags    => DB_CREATE
    or die "Cannot open file $filename: $! $BerkeleyDB::Error\n";

my $rec_50_50 = rec(50, 50);

Benchmark::cmpthese(-1, {
    'freeze-only-50/50'    => sub { freeze_only($db, $rec_50_50) },
    'freeze-sync-50/50'    => sub { store_record($db, $rec_50_50) },
    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
});
The Results
                         Rate  freeze-sync-50/50  freeze-no-sync-50/50  freeze-only-50/50
freeze-sync-50/50      1543/s                 --                  -80%               -93%
freeze-no-sync-50/50   7696/s               399%                    --               -63%
freeze-only-50/50     21081/s              1267%                  174%                 --
Conclusion
Unsurprisingly, syncing is expensive – it adds roughly 400% overhead. Even with the sync, though, 1,543 records a second works out to about 5.5 million records an hour. Is that fast enough for me? (I do need some level of reliability.) It might well be.
Berkeley DB is fast: writing to the database (without syncing) adds only around 170% overhead on top of the serialisation itself. I’m impressed.
In case anyone is interested, I ran a more comprehensive set of benchmarks.
freeze-sync-50/50 1846/s
freeze-sync-50/05 2262/s
freeze-sync-05/50 2546/s
freeze-sync-05/05 2799/s
freeze-no-sync-50/50 7313/s
freeze-no-sync-50/05 9514/s
freeze-no-sync-05/50 11395/s
freeze-no-sync-05/05 12589/s
freeze-only-50/50 20031/s
freeze-only-50/05 21920/s
freeze-only-05/05 26547/s
freeze-only-05/50 26547/s
fcall-only 2975364/s
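(fcall-only is presumably the baseline cost of calling a subroutine that does no work, i.e. the overhead of the benchmark harness itself. That benchmark isn’t shown above; a minimal sketch of what it might look like, where the empty sub is my assumption:)

# Sketch of an "fcall-only" baseline: time calls to an (assumed) empty
# subroutine, so the figure reflects only function-call overhead.
sub do_nothing { }

Benchmark::cmpthese(-1, {
    'fcall-only' => sub { do_nothing() },
});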