Gruter Research Director and Apache Tajo PMC Chair, Dr. Hyunsik Choi, was a presenter at Hadoop Summit 2014, held in San Jose on June 3-5. A highlight each year on the Hadoop calendar, the summit brings together big data experts from across the world to discuss new developments in Hadoop-related technologies.
Hadoop Summit 2014 Keynote Address
Choi’s challenging session looked at sophisticated methods being trialed by the Apache Tajo team which promise to boost SQL-on-Hadoop performance. With Apache Tajo having been granted Top-Level Project Status by the ASF earlier this year, the session generated much interest from advanced SQL-oriented developers.
Choi, who says he takes “a pragmatic and empirical approach to software development”, began his presentation with an examination of two techniques which can be used to boost SQL-on-Hadoop performance: Cost-based Join Optimization (CBO), and Progressive Query Optimization.
Because query plans are usually generated from statistical estimates, Choi contended, there is significant room for performance improvement during the planning phase of a query. According to Choi, approaches such as CBO and Progressive Query Optimization are on their own enough to improve SQL-on-Hadoop performance across a wide range of queries, saving minutes or even hours in certain worst-case scenarios.
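For readers less familiar with cost-based join optimization, the idea can be sketched in a few lines of Python. This is an illustration only, not Tajo's optimizer: the table names, row counts and selectivities are invented, the cost model (the sum of estimated intermediate result sizes) is a deliberately simple assumption, and only left-deep join orders are considered.

```python
from itertools import permutations

# Hypothetical statistics: estimated row counts per table.
ROWS = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

# Hypothetical join selectivities: the fraction of the cross product
# that survives each join predicate.
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "nations"}): 1 / 25,
}


def pair_selectivity(joined, new_table):
    """Combined selectivity of every predicate linking new_table to the joined tables."""
    sel = 1.0
    for table in joined:
        # Tables without a predicate between them form a cross product (1.0).
        sel *= SELECTIVITY.get(frozenset({table, new_table}), 1.0)
    return sel


def plan_cost(order):
    """Assumed cost of a left-deep plan: sum of estimated intermediate result sizes."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROWS[table] * pair_selectivity(joined, table)
        cost += rows
        joined.append(table)
    return cost


def best_join_order(tables):
    """Exhaustively enumerate left-deep orders and keep the cheapest estimate."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    order = best_join_order(list(ROWS))
    print(order, plan_cost(order))
```

Under these made-up statistics, the cheapest plans join the two small tables first and bring in the large `orders` table last. Because every figure fed into `plan_cost` is an estimate, a poor estimate can produce a poor plan, which is the gap that reoptimizing during execution aims to close.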
Discussing Apache Tajo’s implementation of Progressive Optimization, Choi explained that Tajo currently uses the technique to refine the partitioning of running queries, reducing query processing times. With the team planning to extend reoptimization to the range and join order of queries, as well as to their distributed join strategy and transmission method, Choi told the audience he expected further performance payoffs from this approach.
Segueing to a related challenge facing SQL-on-Hadoop solutions, Choi turned his focus to “CPU bottlenecking”, a growing problem with the arrival of new, faster storage technologies such as solid-state drives. The bottleneck is caused by inefficient query execution logic (for the initiated, namely “iterative function calls in common query execution logics such as the ‘tuple-at-a-time’ approach”), and Choi presented data from his team’s lab work implicating CPU processing time as a major drag on many common big data queries.
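The “iterative function calls” behind the tuple-at-a-time approach can be made concrete with a toy example. The Python sketch below is not taken from Tajo; the `Scan` and `Filter` operators and the call counters are hypothetical, introduced only to show how many operator calls each style needs to answer the same filtered-sum query.

```python
class Scan:
    """Leaf operator over an in-memory list of integers."""

    def __init__(self, values):
        self.values, self.pos, self.calls = values, 0, 0

    def next(self):
        """Tuple-at-a-time: one call per value, None when exhausted."""
        self.calls += 1
        if self.pos == len(self.values):
            return None
        value = self.values[self.pos]
        self.pos += 1
        return value

    def next_batch(self, size):
        """Batch-at-a-time: one call per `size` values, None when exhausted."""
        self.calls += 1
        if self.pos == len(self.values):
            return None
        batch = self.values[self.pos:self.pos + size]
        self.pos += len(batch)
        return batch


class Filter:
    """Keeps only the values for which `predicate` is true."""

    def __init__(self, child, predicate):
        self.child, self.predicate, self.calls = child, predicate, 0

    def next(self):
        self.calls += 1
        while True:
            value = self.child.next()
            if value is None or self.predicate(value):
                return value

    def next_batch(self, size):
        self.calls += 1
        batch = self.child.next_batch(size)
        if batch is None:
            return None
        # The filtered batch may be empty; only None signals end of input.
        return [v for v in batch if self.predicate(v)]


def sum_tuple_at_a_time(plan):
    total = 0
    while True:
        value = plan.next()
        if value is None:
            return total
        total += value


def sum_batch_at_a_time(plan, size=1024):
    total = 0
    while True:
        batch = plan.next_batch(size)
        if batch is None:
            return total
        total += sum(batch)


def is_even(value):
    return value % 2 == 0


if __name__ == "__main__":
    data = list(range(10_000))
    row_plan = Filter(Scan(data), is_even)
    batch_plan = Filter(Scan(data), is_even)
    print(sum_tuple_at_a_time(row_plan), row_plan.calls + row_plan.child.calls)
    print(sum_batch_at_a_time(batch_plan), batch_plan.calls + batch_plan.child.calls)
```

Both plans return the same sum, but on 10,000 input values the tuple-at-a-time plan makes 15,002 operator calls while the batch plan, with 1,024 values per batch, makes 22. The sketch counts only calls across operator boundaries; a production engine gains further by running tight loops over arrays of primitive values inside each operator.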
Choi then spent considerable time on the potential of “vectorization” as a means of overcoming such query execution inefficiencies. Speaking particularly to the advanced database experts in the room, he outlined Tajo’s proposed JIT-based Vectorized Processing Model, an approach developed in Java which includes both “unsafe-based vector in-memory structures” and “unsafe-based cuckoo hashing”.
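Cuckoo hashing, one of the building blocks named in the talk, is a general hashing scheme in which every key has exactly two candidate slots, so a lookup never probes more than two locations. The Python sketch below illustrates the general technique only. It makes no attempt to reproduce the “unsafe-based” in-memory layout Choi described, and the class name, the eviction limit and the use of Python's built-in `hash` are all assumptions made for brevity.

```python
class CuckooHashMap:
    """Two-table cuckoo hash map: each key lives in one of two candidate slots."""

    MAX_KICKS = 32  # assumed limit on evictions before the tables are grown

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Two hash functions derived from Python's built-in hash (an assumption;
        # a real engine would use stronger, independent hash functions).
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A lookup inspects at most two slots.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            idx = self._slot(which, key)
            entry = self.tables[which][idx]
            if entry is not None and entry[0] == key:
                self.tables[which][idx] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.MAX_KICKS):
            idx = self._slot(which, entry[0])
            # Place the entry and pick up whatever occupied the slot before.
            entry, self.tables[which][idx] = self.tables[which][idx], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry tries its slot in the other table
        # Too many evictions: grow both tables, then place the homeless entry.
        self._grow()
        self.put(entry[0], entry[1])

    def _grow(self):
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for key, value in old:
            self.put(key, value)


if __name__ == "__main__":
    squares = CuckooHashMap(capacity=4)
    for i in range(100):
        squares.put(i, i * i)
    print(squares.get(12), squares.get(1000))
```

The sketch trades insertion cost for lookup predictability: an insert may trigger a chain of evictions or a resize, while a read stays bounded at two probes. It does not guard against distinct keys whose hashes collide under both functions, a case a production implementation would need to handle.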
According to Choi, such advanced query modeling is needed to improve the implementation of SQL within the Hadoop environment, particularly given the Hadoop ecosystem is young and lags its traditional DB counterpart in experience and technical prowess.
In fact, it was this broader challenge which led Choi and his colleagues at the Korea University Data Lab to launch Tajo back in 2010, with the project’s overtly experimental and eclectic approach geared toward solving the big problems facing SQL-on-Hadoop. This frontier-oriented philosophy is one Choi hopes will enthuse Hadoop ecosystem developers and researchers, and motivate them to get involved in the work being done by the Apache Tajo community.
Still on the road in California with core members of the Tajo team, Choi is speaking at Big Data Camp LA on June 14, and is scheduled to meet IT departments at several major California-based tech companies interested in the Apache Tajo SQL-on-Hadoop “Big DW” solution.
Gruter’s Dr. Hyunsik Choi chats with audience members after his presentation
The Apache Tajo team catches up with respected Apache Pig PMC Chair and friend, Cheolsoo Park (far right), at the Summit.