Even Flow
Posted on 8th December 2013
The following is part of an occasional series highlighting CPAN modules/distributions and why I use them. This article looks at Data::FlexSerializer.
Many years ago the most popular module for persistent data storage in Perl was Storable. While still used, its limitations have often caused problems. Its most significant problem was that each version was incompatible with the next: upgrading had to be done carefully, the data store was often unportable, and making backups was problematic. In more recent years JSON has grown to be more widely accepted as a data storage format. It benefits from being both compact and human readable, and was specifically a reaction to XML, which requires lots of boilerplate and data tags to represent simple data elements. It's one reason why most modern websites use JSON rather than XML for AJAX calls.
Booking.com wanted to move away from Storable and initially looked at moving to JSON. Since then they have designed their own data format, Sereal, but more on that later. First they needed some code to read their old Storable data and translate it into JSON. The next stage was to compress the JSON. Although JSON is already a compact data format, it is still plain text. Compressing a single data structure can reduce the storage required by as much as half, which when you're dealing with millions of data items can be considerable. In Booking.com's case they needed to do this with zero downtime, running the conversion on live data as it was being used. The resulting code later became the basis for Data::FlexSerializer.
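As a minimal sketch of that kind of migration (not Booking.com's actual code; $old_storable_blob is just a placeholder for a value fetched from storage), Data::FlexSerializer can be configured to accept legacy Storable blobs on input while always emitting compressed JSON:

use Data::FlexSerializer;

# Accept legacy Storable blobs (compressed or not) on input,
# and always write compressed JSON on output.
my $migrator = Data::FlexSerializer->new(
    detect_compression => 1,      # input may or may not be compressed
    detect_storable    => 1,      # accept Storable as well as JSON input
    output_format      => 'json', # always emit JSON
    compress_output    => 1,      # compress everything we write
);

my $perl_data       = $migrator->deserialize($old_storable_blob);
my $compressed_json = $migrator->serialize($perl_data);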
However, Booking.com found JSON unsuitable for their needs, as they were unable to store Perl data structures the way they wanted to. As such they created a new storage format, which they called Sereal. You can read more about the thinking behind the creation of Sereal on the Booking.com blog. That blog post also looks at the performance and sizes of the different formats, and if you're looking for a suitable serialisation format, Sereal is very definitely worth investigating.
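Data::FlexSerializer itself can speak Sereal; a minimal sketch (the data here is just an example) would look like this:

use Data::FlexSerializer;

# Requires Sereal::Encoder and Sereal::Decoder to be installed.
my $sereal = Data::FlexSerializer->new(
    detect_sereal => 1,
    output_format => 'sereal',
);

my $blob = $sereal->serialize({ dist => 'Data-FlexSerializer' });
my $data = $sereal->deserialize($blob);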
Moving back to my needs, I had become interested in the work Booking.com had done, as within the world of CPAN Testers we store the reports in JSON format. With over 32 million reports at the time (now over 37 million), the database table had grown to over 500GB. The old server was fast running out of disk space, and before exploring options for increasing storage capacity, I wanted to see whether I could reduce the size of the JSON data structures themselves. Data::FlexSerializer was an obvious choice: it could read uncompressed JSON and return compressed JSON in milliseconds.
So how easy was it to convert all 32 million reports? Below is essentially the code that did the work:
my $serializer = Data::FlexSerializer->new( detect_compression => 1 );

# (this loop sits inside a subroutine, hence the return rather than last)
for my $next ( $options{from} .. $options{to} ) {
    my @rows = $dbx->GetQuery('hash','GetReport',$next);
    return  unless(@rows);

    my ($data,$json);
    eval {
        # reads the report whether compressed or not ...
        $json = $serializer->deserialize($rows[0]->{report});
        # ... and writes it back as compressed JSON (compression is on by default)
        $data = $serializer->serialize($json);
    };
    next    if($@ || !$data);

    $dbx->DoQuery('UpdateReport',$data,$rows[0]->{id});
}
Simple, straightforward, and it got the job done very efficiently. The only downside was the database calls. As the old server was maxed out on I/O, I could only run the conversion script during quiet periods, as otherwise the CPAN Testers server would become unresponsive. This wasn't a fault of Data::FlexSerializer, but very much a problem with our old server.
Before the conversion script completed, the next step was to add functionality to permanently store new reports in a compressed format. This only required three extra lines in CPAN::Testers::Data::Generator:
use Data::FlexSerializer;
$self->{serializer} = Data::FlexSerializer->new( detect_compression => 1 );
my $data = $self->{serializer}->serialize($json);
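The read side needs no extra work: because the serializer was built with detect_compression turned on, deserializing handles both old uncompressed rows and new compressed ones transparently. A sketch (the $row variable here is illustrative):

# detect_compression => 1 copes with both plain and compressed blobs
my $json = $self->{serializer}->deserialize($row->{report});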
The difference has been well worth the move. The compressed version of the table has reclaimed around 250GB. However, because MySQL doesn't automatically free the space back to the system, you need to run an OPTIMIZE TABLE command on the table. Unfortunately, for CPAN Testers this wouldn't be practical, as it would mean locking the database for far too long. Also, with the rapid growth of CPAN Testers (we now receive over 1 million reports a month), it is likely we'll be back up to 500GB in a couple of years anyway. Now that we've moved to a new server, our backend hard disk is 3TB, so we have plenty of storage capacity for several years to come.
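For reference, reclaiming the space is a one-off statement (the database handle and table name here are stand-ins, not CPAN Testers' actual schema):

$dbh->do('OPTIMIZE TABLE reports');   # rebuilds the table, locking it while it runs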
But I've only scratched the surface of why I think Data::FlexSerializer is so good. Aside from its ability to compress and uncompress, and to encode and decode, at speed, its ability to switch between formats is what makes it such a versatile tool to have around. Aside from Storable, JSON and Sereal, you can also create your own serialisation interface using the add_format method. Below is an example, from the module's own documentation, which implements Data::Dumper as a serialisation format:
Data::FlexSerializer->add_format(
    data_dumper => {
        serialize   => sub { shift; goto \&Data::Dumper::Dumper },
        deserialize => sub { shift; my $VAR1; eval "$_[0]" },
        detect      => sub { $_[1] =~ /\$[\w]+\s*=/ },
    }
);

my $flex_to_dd = Data::FlexSerializer->new(
    detect_data_dumper => 1,
    output_format      => 'data_dumper',
);
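Once registered, the custom format behaves like any built-in one; a quick round trip (the data is just an example) looks like this:

my $blob = $flex_to_dd->serialize({ counts => [ 1, 2, 3 ] });
my $data = $flex_to_dd->deserialize($blob);   # back to the original hashref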
It's unlikely CPAN Testers will move from JSON to Sereal (or any other format), but if we did, Data::FlexSerializer would be the only tool I would need to look to. My thanks to Booking.com for releasing the code, and thanks to the authors: Steffen Mueller, Ævar Arnfjörð Bjarmason, Burak Gürsoy, Elizabeth Mattijsen, Caio Romão Costa Nascimento and Jonas Galhordas Duarte Alves, for creating the code behind the module in the first place.
Comments
Why wouldn't you switch to Sereal?
Just curious why you wouldn't switch from JSON to Sereal. You would see a very large saving in space if you did, and you would see a performance boost too. One nice thing about Sereal is that if you store multiple hashes with the same keys in a structure then Sereal automatically dedupes the keys, and if you use the right options you can also dedupe *values*. This, combined with built-in high-speed compression support, would deliver considerable benefits. So I am curious why you would not do the switch. Is it because you value human-readable blobs?

Posted by demerphq on Saturday, 28th December 2013
Not wouldn't, just unlikely
Hi Yves, at the time I first looked into compressing JSON, I wasn't too familiar with Sereal, so my main aim was to find an efficient way to compress just JSON. The benefits of using Data::FlexSerializer so far have been perfect, and the change in hardware has meant I haven't had to investigate further efficiencies as yet. That's not to say we won't use Sereal in the future, just that further compression isn't needed currently.

Posted by Barbie on Sunday, 23rd February 2014