PRESS RELEASE
PALO ALTO, CA—22 Oct 2014: The Apache Software Foundation announced the release of Apache Tajo v0.9 on Tuesday morning, US EDT time. The release heralds significant enhancements to the enterprise “SQL-on-Hadoop” big data warehouse solution, including extended native SQL support and processing improvements across a spectrum of workloads, from common office analytics tasks to massive data queries.
Apache Tajo PMC Chair, Gruter’s own Dr. Hyunsik Choi, thanked the Tajo community for their support with the release, saying the team had enjoyed bringing the latest version to fruition, particularly as it built on Tajo’s core strengths, including leading-edge native SQL support and lightning fast processing speed.
In use at a variety of organizations including Gruter, Korea University and SK Telecom, as well as at the NASA JPL Radio Astronomy and Airborne Snow Observatory, Apache Tajo continues to push towards its goal of “bringing traditional SQL performance to massive data workloads”. With the solution being submitted to rigorous “telco-scale” testing at SK Telecom, South Korea’s largest wireless carrier, an aggressive roadmap which has resulted in three major releases over the past year, and early planning for vectorization and SSD support underway, Tajo is proving itself a must-consider solution for enterprises looking to scale up their ability to run complex queries on massive data stores at high speeds.
Renowned for adhering to design parameters which include fault tolerance, a combination of working memory utilization and external storage writes, and data source neutrality—all onerous but essential enterprise-grade features—Apache Tajo uses advanced query techniques and processing algorithms to drive its blazingly fast query processing speeds. With the release of Apache Tajo v0.9, those credentials have been further bolstered.
Having spent a lot of time enhancing Tajo’s enterprise-grade stability and feature set earlier in the year, Choi said the team was able to focus on “the really fun stuff” with Tajo v0.9, adding more mature SQL features such as TIMESTAMP, DATE, TIME, and INTERVAL type support, as well as WINDOW functions, OVER clause support and multiple distinct aggregation.
While these new additions will warm the hearts of SQL users and data analysts alike, Choi was quick to point out that Apache Tajo is also in the speed game, with new features such as an off-heap sort algorithm for ORDER BY, hash shuffle I/O improvements, and runtime code generation for evaluating expressions expected to push industry speed limits.
Moreover, Choi was clearly delighted to report that the team’s “tweaks” to Tajo’s hash shuffle I/O have made the big data warehouse two to three times faster on demanding, complex queries.
The release also adds integration features, including support for Hadoop 2.2.0 through Hadoop 2.5.1, expanded Hive Metastore access, and broader file format and Catalog Store support. As a result, Tajo v0.9 not only offers leading native SQL support on a faster platform but is also more accessible.
The release of Apache Tajo v0.9 means Choi and his team have now taken the SQL-on-Hadoop solution through three distinct development phases over the past year: Phase 1, which saw the broadening of Tajo’s native SQL support, file format coverage, and feature set; Phase 2, which involved a focus on stabilization and robustness, and enterprise feature support; and Phase 3, capped by the release of v0.9, which saw the team return to core SQL support and processing speed work. Comments Choi:
“This has been a fantastic year for our team, with three version releases and of course the honor of being granted Top-Level Project Status by the ASF in March. The release of Apache Tajo 0.9 is very satisfying for us as a community of open-source developers because it demonstrates the longevity of our original goals and design choices, and their applicability to current enterprise and industry demands.”
Not one for lingering on past successes, Choi was quick to look forward: “We’re currently carrying out cutting-edge research and benchmarking work, and we’re always hopeful of bringing new features to Tajo ahead of schedule. With the enthusiasm and energy of the Apache Tajo community behind us, watch this space!”
Apache Tajo v0.9 Release Features (Major Enhancements Highlighted)
More comprehensive and powerful SQL capabilities:
- ALTER TABLE statement support
- TIMESTAMP, DATE, TIME, INTERVAL type support
- WINDOW functions and OVER clause support
- ORDER BY and GROUP BY clauses allow column references as well as expressions
- Multiple distinct aggregation
- ORDER BY NULL FIRST support
- CREATE TABLE LIKE support
- concat() and concat_ws() function support
- to_char() for date and time formatting
- COALESCE() for BOOLEAN, DATE, TIME, and TIMESTAMP
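For readers unfamiliar with two of the features listed above, window functions with an OVER clause and multiple distinct aggregation, the following is a minimal sketch of what such queries look like. It is not run against Apache Tajo: it uses Python's built-in sqlite3 module as a stand-in engine (assuming the bundled SQLite is 3.25 or newer, which is required for window functions), and the `orders` table, its columns, and its rows are hypothetical. Tajo's own syntax and results may differ in detail.

```python
import sqlite3

# Stand-in engine: in-memory SQLite, NOT Apache Tajo.
# The table, columns, and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (region TEXT, customer TEXT, product TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("east", "ann", "disk", 10),
        ("east", "bob", "disk", 30),
        ("east", "ann", "tape", 20),
        ("west", "cat", "disk", 50),
    ],
)

# Window function with an OVER clause: rank orders within each region
# by amount, without collapsing rows the way GROUP BY would.
ranked = conn.execute(
    """
    SELECT region, customer, amount,
           rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY region, rnk
    """
).fetchall()

# Multiple distinct aggregation: two DISTINCT aggregates over different
# columns in a single query.
distinct_counts = conn.execute(
    """
    SELECT region, count(DISTINCT customer), count(DISTINCT product)
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in ranked:
    print(row)
for row in distinct_counts:
    print(row)
```

The first query keeps every input row and attaches a per-region rank to it; the second counts distinct customers and distinct products per region in one pass over the table.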
Performance improvements:
- Off-heap sort algorithm for ORDER BY
- Hash shuffle I/O improvement (2-3 times faster in heavy queries)
- Repartitioner now chooses a more appropriate degree of parallelism and shuffle number
- Runtime code generation for evaluating expressions
- Fetch performance improvements
- Various query optimization improvements
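As background for the hash shuffle item above, the general idea of a hash shuffle is that rows are routed to partitions by a hash of their key, so that all rows sharing a key end up in the same place for the next stage. The sketch below illustrates only that general idea in plain Python. It is not Tajo's implementation, and the function name `hash_partition`, its parameters, and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key_index, num_partitions):
    """Group rows into partitions by a stable hash of their key column.

    Hypothetical helper for illustration; not part of Apache Tajo.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = str(row[key_index]).encode("utf-8")
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return dict(partitions)


# Hypothetical sample data: (region, amount) pairs.
rows = [("east", 10), ("west", 50), ("east", 30), ("north", 5)]
parts = hash_partition(rows, key_index=0, num_partitions=3)

for partition_id, partition_rows in sorted(parts.items()):
    print(partition_id, partition_rows)
```

Because the partition is a pure function of the key, both "east" rows always land in the same partition, which is what lets a downstream aggregation or join process each key in one place.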
Enhanced Hadoop Integration:
- Hadoop 2.2.0 or higher (to 2.5.1) support
- Hive Metastore access for 0.13.0 and 0.13.1
- Avro File support (flat schema)
- Parquet updated to 1.5.0
Other important improvements:
- TajoMaster HA (removed the last single point of failure)
- MariaDB Catalog Store support
- HDFS utility in tajo shell (tsql)
- Improved Catalog backup and restore feature
Apache Tajo Further Information
- Apache Tajo™ ASF page (web)
- Apache Tajo™ recent solution overview (slideshare)