This piece was originally posted by Hyeong-jun Kim, CTO and Chief Architect at Gruter, on his Korean language blog.
SQL-on-Hadoop technologies have been drawing considerable attention in the big data analytics area of late. These solutions enable developers—that is, users and data analysts—to process big data stored on Hadoop via a familiar SQL interface rather than through complex programming languages. As they are based on the well-established SQL standard, these platforms can be easily integrated with existing data analysis solutions used in the BI area; thus, it’s little surprise they are generating so much interest.
Hive, developed at Facebook four years ago and now an Apache project, has been the most popular SQL-on-Hadoop solution to date. Over the last couple of years, however, alternative solutions have laid claim to being faster than Hive by several times or even several hundred times. In this post, I would like to examine a few of the key factors which affect the performance of SQL-on-Hadoop, and explain how to distinguish the claims being made about this new generation of products from the marketing bluster.
The main features of SQL-on-Hadoop
In order to assess the performance claims of the various new SQL-on-Hadoop technologies, we first need to understand the general features of these solutions.
In most cases, data files are stored on the Hadoop Distributed File System (HDFS). As a general rule, these are in flat file format without indexes. For read performance, data files can be stored in structured format—for instance, RC, ORC or Parquet.
Another important component of such solutions is the use of SQL. Used in data processing, SQL is executed in a distributed environment, with the results then stored on HDFS or other file systems.
There are many different SQL-on-Hadoop solutions on the market. The most prominent, Hive, converts SQL queries to MapReduce jobs, which means that it inherits the strengths as well as weaknesses of the latter model, such as M/R’s well-known stability and shuffling overheads.
Stinger seeks to correct some of those deficiencies by shifting Hive’s platform from MapReduce to the distributed execution framework Tez; as such, it does not function as a standalone solution, but is rather an addendum to Hive which fixes some of Hive’s prior problems.
Apache Tajo, in contrast, runs queries using its own distributed processing engine atop HDFS. Cloudera Impala also runs queries on its own distributed computing platform, though it has a different challenge to overcome: By keeping all data in memory at the shuffling stage, Impala can avoid related shuffling overheads, but as a consequence it cannot handle data which exceeds memory size.
Meanwhile, Apache Drill is in its early stages of development and has not yet implemented its core distributed processing module. Other options such as IBM’s Big SQL and Pivotal HAWQ, are also available, though these are proprietary rather than open-source solutions.
Key SQL-on-Hadoop performance factors
To gain a better grasp of the costs and benefits of each of these new solutions, it is important to understand the key elements which underlie the performance of SQL-on-Hadoop.
Since data files are usually stored on HDFS and full-scanned in most cases, scanning speed affects processing speed more than any other performance factor. Nevertheless, intermediate data transfer is also a crucial cog in the wheel. Given that operations such as GROUP BY and ORDER BY are executed in multiple processing steps, the efficiency (or inefficiency as the case may be) of data transfer from each step to the next affects performance significantly.
Furthermore, the efficiency of the execution plan of such query operations also plays a major role in overall performance. For instance, because Hive lacks native support for multiple reducers, it is unable to exploit the power of the distributed processing platform when running ORDER BY operations, thus severely limiting its effectiveness in such instances.
Query execution initialization time—the period required for a query plan to be built and a job initialized in a distributed environment—is yet another important part of the SQL-on-Hadoop machinery. In cases of short query execution times, initialization accounts for a significant portion of the overall processing duration. However, in cases of relatively longer query execution times (say, 100 sec plus), initialization constitutes only a small fraction of the processing duration.
Another important consideration affecting SQL-on-Hadoop performance is file format. Recently, a range of newer columnar file formats—such as RCFile, ORCFile and Parquet—have been developed, joining traditionally-used file formats such as textfile and seqfile. As these file types have quite different properties, it is important to choose the appropriate file format for the specific data and query types under consideration.
Notably, in instances where queries generate large output data, the performance of data file format and file writing can have a significant impact on overall processing. For instance, the ORC file format has strong read performance, but comparatively weak write performance. The judicious use of compression as part of overall resource management is therefore another important aspect of SQL-on-Hadoop systems.
The absolute performance limits of SQL-on-Hadoop
Beyond the factors outlined above, it is also important to keep in mind that SQL-on-Hadoop systems cannot presently exceed the following hard performance constraint: Performance will be less than or equal to the total size of disk bandwidth made available to the HDFS.
For example, ten servers having eight SATA 7200rpm disks each would have a total performance capacity as follows:
|10 servers * 8 disks * 100 MB/sec = 8 GB/sec|
In practice, performance comparisons between Hive, Tajo and Impala on over 1.5TB data do not demonstrate much difference on queries such as the following:
|select count(*) from T_TEST|
This is because there are no shuffling overheads in this query, with the main performance factors being file scanning speed, query execution initialization time, and file format. Given the same file format and a sufficiently large volume of data, file scanning speed becomes the differentiating factor, and therefore the performance of each solution is unlikely to vary much.
Consequently, across a range of natural field queries, if Hive were to be rated a “1.0”, under present constraints new generation solutions such as Stinger, Impala and Tajo are likely to be 1.5–3 times that at most. As such, claims that a product is or aims to be “100 (or even 200!) times” faster than Hive are futuristic and are not rooted in current practicable technology and methods. The design goal is of course always to beat yesterday by 100-fold, but the relevant information here concerns the known technical capabilities of the architecture and methods which comprise the SQL-on-Hadoop solution being assessed.
Frequently-used performance comparison queries
Due to the high query initialization costs of Hive—a problem not shared by Stinger as it does not use MapReduce—its query execution time can be dozens or even hundreds of times that of other solutions. Given this, if hypothetical Solution A takes 0.1 sec to process a short-duration query, and Hive takes 20 sec to perform the same task, it might be tempting to conclude that Solution A outperforms Hive by 200x. However, this speed advantage does not apply uniformly across all query types and query sets. Thus, whereas Hive may take 200 seconds to resolve a query on a very large data set, it is extremely unlikely that Solution A can also process the same query in just one second (i.e., 200x faster). In other words, the advantages that certain solutions have when dealing with short-duration queries often preclude them from holding such an advantage when dealing with relatively long-duration queries. This asymmetry is the cause of much confusion in the market.
One way of starkly illustrating this asymmetry is to run comparative tests which contain an ORDER BY operation. On Hive, ORDER BY is executed on only one server, regardless of the cluster size. This means that in a test environment of 100 servers, each running 10 concurrent tasks, a solution that supports distributed ORDER BY can already theoretically achieve results in the magnitude of 1000x times faster than Hive. The fact that Hive does not yet support the use of multiple reducers means that it is not entirely false to make a “1000x faster claim” right now; nonetheless, it would be quite misleading for such a claim to be framed as a meaningful general insight into the performance of SQL-on-Hadoop solutions relative to Hive. And we could demonstrate much the same again if we looked at JOIN, a query function heavily dependent upon the query execution plan capabilities of each particular solution.
As is the natural human wont, advocates of particular SQL-on-Hadoop solutions tend to emphasize favorable performance results derived from advantageous test settings, failing to explain the whole picture, including results across the full set of ordinary queries. According to our lab experiments, the average performance advantage over Hive tends to be in the order of 3-5 times (with that multiple varying according to the weights placed upon different query types).
As such, it is crucial that prospective users choose the platform which best aligns with their specific data and query specs, and that they investigate each SQL-on-Hadoop solution accordingly. Taking test claims at face value—even in a field inhabited by sharp minds, reputable vendors, passionate bloggers and an enthusiastic media—can be unwise. After all, even the world of open source software is not immune to a little exaggeration.