Integrating Tajo with Hive

This piece is based on a post by Jae-hwa Jung, an Apache Tajo committer and senior developer at Gruter, originally published on his Korean-language blog.

Apache Tajo, a big data warehouse system on Hadoop, uses its own SQL processing engine for query execution, unlike Hive, which relies on MapReduce. It supports both short-running queries, which can range from seconds to minutes, and long-running queries, which can take hours. Thus, it can be used for ad-hoc interactive analysis as well as complex ETL jobs. (For further information on Tajo’s basic architecture and features, see this presentation.)

In this post I would like to explain the Hive integration features that have been added to the latest release of Tajo, Tajo 0.8. This is an important step in Tajo’s development, as users of the widely deployed Hive can now transition easily to take advantage of Tajo’s many advanced features.

Tajo provides two methods of Hive integration: the first uses HiveQL, the Hive query language, and the second uses the HiveMetaStore, the service that stores Hive’s table metadata.

Running HiveQL on Tajo

Tajo uses its own set of algebraic expressions to process queries. These expressions are represented in JSON format and have been designed to capture various attributes such as predicates, arithmetic expressions and relational operators. When a user runs an SQL query from the Tajo client, Tajo converts the SQL syntax into an algebraic expression, then uses this expression to build and execute a query execution plan.

For example, when Tajo encounters an SQL expression such as the following:

select id, name, age, gender from people

A corresponding algebraic expression is generated:

{
  "all": false,
  "distinct": false,
  "targets": [
    {
      "expr": {
        "name": "id",
        "opType": "Column"
      },
      "opType": "Target"
    },
…
  ],
  "child": {
    "relations": [
      {
        "tableName": "people",
        "opType": "Relation"
      }
    ],
    "opType": "RelationList"
  },
  "opType": "Projection"
}

In order to generate this output, Tajo uses a parser class called SQLAnalyzer that interprets ANSI SQL syntax and converts it into Tajo’s corresponding algebraic structure. Through this process, Tajo can accept most SQL dialects, as well as other query languages, provided they can be converted into algebraic expressions. This is exactly how Tajo understands HiveQL: a dedicated HiveQL parser class, HiveConverter, translates HiveQL into the same algebra, enabling Tajo to execute most, though not all, HiveQL syntax.
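
For instance, HiveConverter allows Tajo to parse Hive-specific constructs such as backtick-quoted identifiers. A HiveQL query like the following (reusing the illustrative people table from above) is translated into the same algebraic form as its ANSI SQL equivalent:

select `id`, `name`, `age`, `gender` from people;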

By default, Tajo is set to run ANSI SQL. To run HiveQL, execute the following command in the Tajo tsql shell:

set tajo.hive.query.mode=true;

This switches on Tajo’s HiveQL mode. To return to ANSI SQL mode, simply reverse this with the “set tajo.hive.query.mode=false” command.
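
Putting this together, a tsql session that switches into HiveQL mode, runs a query, and switches back might look like the following (a minimal sketch, reusing the illustrative people table from above):

tajo> set tajo.hive.query.mode=true;
tajo> select `name`, count(*) from people group by `name`;
tajo> set tajo.hive.query.mode=false;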

At present, Tajo supports ANSI SQL and HiveQL; however, Tajo’s highly flexible algebra-based architecture enables it to accept other SQL dialects and query languages, such as Pig Latin, Pig’s procedural scripting language. We look forward to seeing support for more languages added to Tajo in the near future.

Figure 1. Tajo algebra parsing process

Sharing the Hive MetaStore

Tajo stores all of its metadata in the CatalogServer. The CatalogServer reads and writes this metadata by calling functions such as getTable and addTable through the CatalogStore interface, so it can work with almost any type of database for which that interface is implemented. In addition to the CatalogStore implementations for DerbyDB and MySQL made available in earlier Tajo versions, Tajo 0.8 adds support for the HiveMetaStore.
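
To illustrate the idea, here is a simplified sketch of such a pluggable store abstraction in Java. This is not Tajo’s actual code: the real interface, org.apache.tajo.catalog.store.CatalogStore, has more methods and different signatures, and TableDescription below is a stand-in for Tajo’s real table metadata type.

import java.io.Closeable;
import java.io.IOException;

// Each backing store (DerbyDB, MySQL, HiveMetaStore, ...) provides its own
// implementation; the CatalogServer only ever calls the interface.
interface CatalogStore extends Closeable {
  void addTable(TableDescription table) throws IOException;       // register new table metadata
  TableDescription getTable(String tableName) throws IOException; // look up table metadata
  boolean existTable(String tableName) throws IOException;        // check whether a table exists
  void deleteTable(String tableName) throws IOException;          // drop table metadata
}

// Minimal stand-in for the table metadata a real store would persist.
class TableDescription {
  final String tableName;
  final String storagePath;

  TableDescription(String tableName, String storagePath) {
    this.tableName = tableName;
    this.storagePath = storagePath;
  }
}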

Figure 2. CatalogStore interface structure

Tajo can use the HiveMetaStore as its catalog server, with HCatalog acting as the metadata layer that enables data sharing among Hadoop-based data processing systems such as Hive, Pig and MapReduce. As a result, Tajo is able to query tables registered in the HiveMetaStore. And since Hive users may not want to modify or delete their existing Hive tables, this integration lets them run queries directly on those tables, as well as INSERT queries from Tajo.
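
Once HCatalogStore is configured as described below, existing Hive tables appear in Tajo like ordinary tables. A minimal tsql sketch, assuming a Hive table named people already exists in the metastore:

tajo> \d
people
tajo> select count(*) from people;
tajo> insert overwrite into people select * from people where age > 20;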

To use HCatalogStore, compile the latest source code to generate the binary archive as follows:


$ git clone https://git-wip-us.apache.org/repos/asf/incubator-tajo.git tajo
$ mvn clean package -DskipTests -Pdist -Dtar -Phcatalog-0.1x.0
$ ls tajo-dist/target/tajo-0.8.0-SNAPSHOT.tar.gz

Note that Tajo supports HCatalog for Hive 0.11.0 and 0.12.0. Set “-Phcatalog-0.11.0” or “-Phcatalog-0.12.0” in place of the -Phcatalog-0.1x.0 placeholder above, depending on the version of Hive you are using.

In order to use the HiveMetaStore, two files need to be modified. In the first of these, the “tajo-env.sh” file, set two environment variables, HIVE_HOME and HIVE_JDBC_DRIVER_DIR, as follows:


export HIVE_HOME=/path/to/your/hive/directory
export HIVE_JDBC_DRIVER_DIR=/path/to/your/mysql_jdbc_driver/mysql-connector-java-x.x.x-bin.jar

The second file to be modified, “catalog-site.xml”, needs to be configured so that the CatalogServer uses HCatalogStore:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tajo.catalog.store.class</name>
    <value>org.apache.tajo.catalog.store.HCatalogStore</value>
  </property>
</configuration>

Note that there are no Tajo-side settings for the HiveMetaStore server address or its JDBC connection; instead, Tajo obtains this configuration from the “hive-site.xml” file in the HIVE_HOME/conf directory.
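
For example, if the HiveMetaStore runs as a remote Thrift service, “hive-site.xml” would typically contain an entry like the following (the host name here is illustrative; 9083 is the metastore’s default port):

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>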

Figure 3. Connecting HCatalogStore with HiveMetaStore

Conclusion

Hive is the most widely used SQL-on-Hadoop solution deployed today. Yes, it is slower than newer solutions such as Impala and Presto due to its dependency on MapReduce. But, unlike these newer solutions, it retains the advantage of being able to process very large data sets with fault tolerance. Also, as most users have large sets of existing queries written for Hive, these alternative SQL-on-Hadoop solutions are usually regarded as complements to, not substitutes for, Hive.

Tajo not only has the fault-tolerance strengths of the original Hive solution, enabling it to handle very large data sets, but it can also process data sets without the memory limits that constrain some SQL-on-Hadoop solutions. Moreover, with the new Hive integration features built into Tajo 0.8, Hive users can now access their data on Hadoop through Tajo without migration headaches, running HiveQL queries in Tajo directly on existing Hive tables stored in the HiveMetaStore, at speeds that average 3-5 times faster than Hive.

With the release of Tajo 0.8, Hive users can now use Hive, Tajo, or both on the same system. Given the ease of Hive integration, why not enjoy the speed advantages of Tajo 0.8?