Abstract
Relational Database Management Systems have failed to deliver the availability and performance demanded by many internet giants. This is primarily due to the exponential growth of the internet combined with the exponential growth of Relational Database Management System response times as database transactions and overall database size increase [5]. Since it is impossible for a distributed computer or database system to simultaneously provide Consistency, Availability, and Partition Tolerance [14], alternative database solutions have recently emerged that relax Consistency requirements (consistency means that all database nodes see the same data at the same time) in exchange for substantial gains in data operations performance. These cutting-edge data management solutions use highly specialized forms of object serialization to produce these performance gains [1]. The Cassandra database is an example of such a database that has been used successfully by Google, Facebook, and Twitter in different variations and capacities [1][2]. C# developers can likewise use various forms of object serialization to enhance application data performance. This does not mean that Relational Database Management Systems must be removed from C# applications, or that NoSQL databases such as Cassandra must be implemented in order to enhance application data performance. It does mean, however, that there are many opportunities to use object serialization in ways similar to the Cassandra key-value data store in order to achieve measurable application data performance gains.
Introduction
The continual and steady increase in application data requirements has forced many of today’s internet giants to pursue, and in some cases create, higher-performance application data management solutions. Google, Facebook, and Twitter are three profound examples of internet giants that no longer use traditional Relational Database Management Systems to meet all of their data management needs [1][2]. Giving up the “all or nothing” consistency of ACID-compliant Relational Database Management Systems, these internet trendsetters exchange total ACID compliance for faster data reads and writes, settling instead for an “eventually-consistent key-value store” data management solution [3]. Eventually consistent databases are also commonly referred to in the technology marketplace as NoSQL databases [3]. The Apache Cassandra database is one such system. “Originally developed by Facebook, Apache Cassandra built upon Google’s Big Table model using Amazon’s Dynamo infrastructure to produce a key-value store that supports tunable consistency” [1]. The Cassandra source code was published on Google Code around 2008 [1] and was donated to Apache as an open source incubator project during 2009 [1]. Twitter was in the process of converting parts of its MySQL Relational Database Management System to Cassandra as early as February 2010, citing scalability and size constraints as primary reasons for the conversion [2].
Cassandra clustering is a key benefit for high-volume internet applications run by companies like Twitter and Facebook. Each node in a Cassandra cluster plays the same role; in simpler terms, this means there is no single point of failure for the entire cluster. Furthermore, Cassandra dynamically manages server load balancing by randomly distributing its data across the entire clustered network [1]. With other, more traditional Relational Database Management Systems, this type of clustering functionality comes at a steep price. Even MySQL, commonly thought of as completely open source, charges approximately $10,000 per server for cluster support on machines with up to only 4 sockets [4]. Although clustering and decentralized data distribution allow Cassandra’s users to scale almost infinitely using only commodity hardware [1], Cassandra’s high availability, speed, and throughput come from a multidimensional data map that is not much more than what most programmers would recognize as a binary serialized object [1]. “A table in Cassandra is a distributed multidimensional map indexed by a key. The value is an object that is highly structured” [1]. In a typical Relational Database Management System, as transactions and database size increase, response time increases exponentially [5]. With Cassandra, by contrast, “Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications” [1]. The broad and successful application of Cassandra and similar NoSQL data management solutions by many of today’s internet giants provides ample proof that object serialization of internet application data can deliver substantial benefits when used carefully and correctly within the context of an internet information system.
Although the Cassandra database has been used as a positive example of large-scale object serialization in action, the purpose of this document is to focus on the benefits of object serialization outside the Cassandra paradigm. Many forms of object serialization can be used to enhance C# application performance without a NoSQL database solution, and object serialization can also be used to enhance application performance alongside a Relational Database Management System. The remainder of this document explores various forms of object serialization and specific strategies for using serialized objects to speed up data access and data updates within data-intensive C# applications. The analysis begins with an overview of different object serialization methods, discussing differences in size and performance and providing programmatic syntax examples for Binary serialization. Finally, a brief case study demonstrates the power and effectiveness of a well-implemented object serialization strategy.
Object Serialization Methods
There are many forms of object serialization available to the C# programmer. Each implementation has advantages, disadvantages, and appropriate use cases depending on the application’s specific requirements. Four primary object serialization types that are well suited for data manipulation are Binary, XML, JSON, and Protocol Buffers. Although there may well be other object serialization types used in the industry for various forms of data manipulation, the following sections provide a brief overview of each type mentioned above, including each type’s specific implementation and niche within the C# programming language.
Binary
Binary serialization is implemented in C# by marking a class with the [Serializable] attribute. A class may also implement the ISerializable interface when custom control over the serialization process is required. When a C# object is serialized using this method, the resulting data is written to a file in binary format; when the object is deserialized, the binary data is read back from the file to restore the object to its state at the time of serialization. The C# code required to serialize objects is fairly similar across most formats, with minor changes for each specific format. Once a class has been marked as serializable, each variable within the object is typically marked in some way for serialization. With a custom ISerializable implementation, SerializationInfo.AddValue and SerializationInfo.GetValue are used to store and retrieve specific object variables during serialization and deserialization.
Binary Serializable C# classes can be written to and from files using a Binary stream.
Figure 1 – Binary Serializable Object Example [12]
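The pattern described above can be sketched as follows, using the BinaryFormatter from the .NET Framework. The TermScores class, its field names, and the file name are illustrative assumptions, not taken from the cited example:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// A simple serializable class; the [Serializable] attribute is all that
// is required when no custom ISerializable behavior is needed.
[Serializable]
public class TermScores
{
    public string Term;
    public int[] CategoryScores;
}

public static class BinaryExample
{
    public static void Main()
    {
        var item = new TermScores { Term = "example", CategoryScores = new[] { 1, 2, 3 } };

        // Serialize the object to a file in binary format.
        using (var stream = File.Create("scores.bin"))
        {
            new BinaryFormatter().Serialize(stream, item);
        }

        // Deserialize the binary data to restore the object's former state.
        using (var stream = File.OpenRead("scores.bin"))
        {
            var restored = (TermScores)new BinaryFormatter().Deserialize(stream);
            Console.WriteLine(restored.Term);
        }
    }
}
```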
XML and JSON
Both XML and JSON produce human-readable, text-based data encodings that can be communicated across multiple information systems without concern for the various programming languages used to implement each system [8]. However, language independence and a human-friendly format come at the cost of additional processing time and storage space. Binary forms of XML have been suggested as a compromise between the pure Binary and XML serialization formats [8]. So far, none of the competing Binary XML formats have been adopted by any standards organization [13].
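To illustrate the trade-off, the same kind of object can be serialized to human-readable XML using the .NET Framework’s built-in XmlSerializer. The TermScores class and file name are illustrative assumptions:

```csharp
using System.IO;
using System.Xml.Serialization;

// XmlSerializer requires a public class with a parameterless constructor;
// public fields and properties are written out as readable XML elements.
public class TermScores
{
    public string Term;
    public int[] CategoryScores;
}

public static class XmlExample
{
    public static void Main()
    {
        var item = new TermScores { Term = "example", CategoryScores = new[] { 1, 2, 3 } };

        // The resulting file is portable and human readable, but considerably
        // larger and slower to process than the equivalent binary form.
        var serializer = new XmlSerializer(typeof(TermScores));
        using (var writer = new StreamWriter("scores.xml"))
        {
            serializer.Serialize(writer, item);
        }
    }
}
```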
Protocol Buffers
“Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats” [6]. Created by Google, Protocol Buffers was originally designed to be faster than XML [7]. Although serialization in general “breaks the opacity of an abstract data type by potentially exposing private implementation details” [8], Protocol Buffers “are serialized into a binary wire format which is compact, forwards-compatible, backwards-compatible, but not self-describing” [7]. The specific implementation of Protocol Buffers for .NET, protobuf-net, was written by Mark Gravell, a developer for the popular programming community website Stack Overflow, and can be used under the Apache 2.0 open source license [9]. It is also the author’s first choice among serialization methods. Although Protocol Buffers does not use any form of data compression by default, it does use variable-length encoding for integer values, which reduces the space required by small integers [10].
When compared to other forms of serialization, protobuf-net is by far the most efficient approach. In one personal instance, a serialized Dictionary<string, int[]> object was used to load a list of approximately 256,000 strings, each paired with an array of 43 integers (about 11 million integers in total), from a tab-delimited text file for data processing. The original size of the text file was approximately 48,000 KB. Binary serialization was first used to save the Dictionary object to a Binary file; the resulting file size was about 49,000 KB, slightly larger than the original file, as expected once the object structure was added to the original data. Next, using the same tab-delimited text file, an identical Dictionary object was created and serialized using protobuf-net. The resulting protobuf-net Binary file size was about 22,000 KB, less than half the original file size. In addition, deserialization of the protobuf-net file took about half the time of standard Binary deserialization. These results are consistent with existing benchmarks:
Figure 2 – Protocol Buffer Performance [11]
Compared with each of the other serialization methods discussed, Protocol Buffers takes up less space, loads faster, and unloads faster during the serialization and deserialization processes. Using protobuf-net in a C# project is as simple as referencing the protobuf-net assembly in the project and adding “using ProtoBuf;” at the top of a class module. The syntax for creating serialized protobuf-net objects is actually simpler than the example shown above for Binary serialization.
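A sketch of that syntax is shown below. The class, member names, and file name are illustrative assumptions; what matters is that protobuf-net identifies fields by the numeric tags in [ProtoMember], not by name:

```csharp
using System.IO;
using ProtoBuf;

// [ProtoContract] replaces [Serializable]; numeric tags identify
// each field on the compact binary wire format.
[ProtoContract]
public class TermScores
{
    [ProtoMember(1)]
    public string Term;

    [ProtoMember(2)]
    public int[] CategoryScores;
}

public static class ProtobufExample
{
    public static void Main()
    {
        var item = new TermScores { Term = "example", CategoryScores = new[] { 1, 2, 3 } };

        // Serialize to a compact binary wire format.
        using (var stream = File.Create("scores.pb"))
        {
            Serializer.Serialize(stream, item);
        }

        // Deserialize back into a strongly typed object.
        using (var stream = File.OpenRead("scores.pb"))
        {
            TermScores restored = Serializer.Deserialize<TermScores>(stream);
        }
    }
}
```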
A Brief Case Study in Data Performance Optimization Using Object Serialization
In the web analytics application behind the personal example mentioned above, one table was used to hold a list of approximately 256,000 unique strings. The table contained 43 additional columns of integer data representing frequency scores for 43 different categories of information. The table data was used intensively each time the web analytics application executed, amounting to one unique string lookup for each individual term passed to the application. In some instances, the application received several hundred thousand terms at a time, which resulted in several hundred thousand lookups against the table (one lookup per term). In addition, each lookup always resulted in calculations that used the values from all 43 category scores attached to each specific term key record.
As stated above, the table design accommodated only 43 category scores per term. Aside from the obvious performance impact of several hundred thousand nearly simultaneous keyed lookups, the data model was not well suited to adding new categories or deleting old categories that were no longer needed. Alternative Relational Database Management System designs that could accommodate any number of categories would always duplicate either the term key strings themselves or term key integers pointing to those strings in a term key lookup table. To complicate matters further, these higher levels of table normalization would further degrade table response times, which were already a primary concern.
Although a de-normalized table was not well suited for this task, using a serializable in-memory Dictionary<string, int[]> object improved data performance on multiple levels.
- The object could easily be loaded into memory each time the server was started.
- The object’s integer array easily accommodated any number of categories with no duplication of term keys.
- Since Dictionary objects are implemented as hash tables, each term key lookup was orders of magnitude faster than an individual record lookup against a database table.
- Protobuf-net’s serialization format reduced the application’s data footprint by almost 50% while also reducing serialization and deserialization processing times.
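The approach described in the list above can be sketched roughly as follows. The file path, class, and method names are assumptions; the Dictionary<string, int[]> shape matches the case study:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

public static class TermLookup
{
    static Dictionary<string, int[]> termScores;

    // Deserialize the dictionary once, at server start-up.
    public static void Load(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            termScores = Serializer.Deserialize<Dictionary<string, int[]>>(stream);
        }
    }

    // Each incoming term becomes a single in-memory hash-table lookup
    // instead of a keyed database query; the returned array holds the
    // frequency scores for every category attached to the term.
    public static int[] ScoresFor(string term)
    {
        int[] scores;
        return termScores.TryGetValue(term, out scores) ? scores : null;
    }
}
```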
Conclusions
It is doubtful that the need for Relational Database Management Systems will ever disappear. However, recent NoSQL systems such as Cassandra have proven that there is great opportunity to achieve data processing performance gains through the use of key-value stores serialized within highly structured objects. Programmers can also take advantage of hybrid data management approaches which utilize both Relational Database Management Systems and serialized objects to optimize application data management performance.
References
- [1] Sae1962, “Apache Cassandra”, Wikipedia.org, 03/07/2011, http://en.wikipedia.org/wiki/Cassandra_(database)
- [2] Alex Popescu, “Cassandra @ Twitter: An Interview with Ryan King”, nosql.mypopescu.com, 02/23/2010, retrieved 10/10/2011, http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king
- [3] Davidhorman, “NoSQL”, Wikipedia.org, 07/30/2011, retrieved 10/10/2011, http://en.wikipedia.org/wiki/NoSQL_(concept)#Key-value_store
- [4] Author Unknown, “MySQL Editions”, MySQL.com, retrieved 10/10/2011, http://www.mysql.com/products
- [5] Lorenzo Alberton, “NoSQL Databases: Why, what and when”, pg. 9, 02/25/2011, retrieved 10/10/2011, http://nosql.mypopescu.com/post/6412803549/nosql-databases-what-why-and-when
- [6] Author Unknown, “Protobuf”, code.google.com, 2011, retrieved 10/10/2011, http://code.google.com/p/protobuf
- [7] Alexkon, “Protocol Buffers”, Wikipedia.org, 08/08/2008, retrieved 10/10/2011, http://en.wikipedia.org/wiki/Protobuf
- [8] GoingBatty, “Serialization”, Wikipedia.org, 11/19/2010, retrieved 10/10/2011, http://en.wikipedia.org/wiki/Serialization
- [9] Author Unknown, “Protobuf-net”, code.google.com, 2011, retrieved 10/10/2011, http://code.google.com/p/protobuf-net/
- [10] Mark Gravell, “Does Protobuf-net has build-in compression for serialization”, stackoverflow.com, 08/24/2011, retrieved 10/10/2011, http://stackoverflow.com/questions/7174635/does-protobuf-net-has-build-in-compression-for-serialization
- [11] Author Unknown, “Performance”, code.google.com, 2011, retrieved 10/10/2011, http://code.google.com/p/protobuf-net/wiki/Performance
- [12] Omkamal, “Object Serialization using C#”, codeproject.com, 01/31/2002, retrieved 10/13/2011, http://www.codeproject.com/KB/cs/objserial.aspx
- [13] Jzhang2007, “Binary XML”, Wikipedia.org, 10/21/2007, retrieved 10/10/2011, http://en.wikipedia.org/wiki/Binary_XML
- [14] Davnor, “CAP Theorem”, Wikipedia.org, 03/03/2010, retrieved 10/10/2011, http://en.wikipedia.org/wiki/CAP_theorem