Modern Big Data Analysis with SQL


An introduction to distributed SQL query engines like Hive and Impala.

I recently completed the Modern Big Data Analysis with SQL specialization from Cloudera on Coursera.

Its three courses introduced:

- the foundations of big data systems and how they differ from a traditional RDBMS,
- analyzing big data with distributed SQL engines like Hive and Impala, and
- managing big data in clusters and cloud storage.

Among the three courses, the most valuable for me was the Foundations course. It laid out the differences between a traditional RDBMS and a big data system built on distributed SQL engines like Hive and Impala, in particular how data and metadata are “loosely coupled” in a distributed SQL engine.

I was most familiar with a traditional RDBMS, where the database system encapsulates the data storage. In other words, you can only interact with the data by writing SQL statements. Data must be structured (or at least semi-structured, such as JSON). The “schema on write” property of an RDBMS means that incoming data must adhere to the table’s data dictionary or be rejected.
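
A minimal sketch of “schema on write”, assuming a hypothetical orders table in standard SQL: the database checks every row against the declared column types at insert time.

CREATE TABLE orders (
  order_id INT,
  amount   DECIMAL(10,2)
);

INSERT INTO orders VALUES (1, 19.99);   -- accepted: matches the declared types
INSERT INTO orders VALUES (2, 'oops');  -- rejected: 'oops' is not a valid DECIMAL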

For a big data system though, the storage of the data and metadata are separate. A table’s metadata lives in the (Hive) metastore, and the data are typically just files in HDFS or cloud storage.

When a distributed SQL engine like Hive receives a query, it first retrieves the table structure and data location from the metastore, then reads the files in that directory, and only then processes them to return results.

This “schema on read” property allows the big data system to store unstructured data, and only apply (but not necessarily guarantee) the metadata’s structure when reading the data.
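
A minimal sketch of what this looks like in HiveQL, assuming hypothetical files already sitting under an HDFS directory: the table definition only overlays a schema, and fields that do not fit the declared types surface as NULLs at query time rather than being rejected up front.

-- The files already exist; the table just describes how to interpret them.
CREATE EXTERNAL TABLE events (
  event_id INT,
  label    STRING
)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
  LOCATION '/user/hive/data/events_raw/';  -- hypothetical directory

-- The schema is applied only now, at read time.
SELECT * FROM events LIMIT 10;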

This fundamental difference has many interesting consequences. For one, pre-joined denormalized tables may actually be preferable in many analytic big data warehouses due to the computational cost of join operations on very large tables. Having spent a lot of time normalizing tables in the PostgreSQL for Everybody and Databases for Data Scientists specializations, this was an interesting perspective.
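
For instance, a pre-joined table can be materialized once so analysts never pay the join cost at query time; a rough sketch in HiveQL, assuming hypothetical orders and customers tables:

-- Materialize the join once, stored in a columnar format.
CREATE TABLE orders_denorm
  STORED AS PARQUET
AS
SELECT o.order_id,
       o.order_date,
       o.amount,
       c.customer_name,
       c.region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;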

The third course on managing big data in clusters and cloud storage was also helpful. I was able to practice browsing S3 files from the command line with commands like:

hdfs dfs -cat s3a://bucket-name/path-to-file/file.csv | head

I then wrote CREATE TABLE statements pointing to S3 buckets and specifying options like field delimiters and null values:

CREATE TABLE IF NOT EXISTS tbl_name (
  col1 STRING,
  col2 SMALLINT
)
  -- the files are plain comma-delimited text
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
  -- point the table at the existing files in the S3 bucket
  LOCATION 's3a://bucket-name/path-to-file/'
  -- interpret \N in the files as NULL
  TBLPROPERTIES ('serialization.null.format' = '\N');
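
With the table defined, the files in the bucket can be queried immediately; a quick sanity check (reusing the table and columns above) might look like:

SELECT col1, col2
FROM tbl_name
LIMIT 10;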

The course finished by introducing the idea of views or partitioned tables as ways to optimize query times depending on the use case.
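
A rough sketch of both ideas in HiveQL, assuming a hypothetical orders table: partitioning lets queries that filter on the partition column skip all other data, and a view packages a common filter for reuse.

-- Partition by date so a query filtering on order_date only scans
-- the matching subdirectories, not the whole table.
CREATE TABLE orders_by_day (
  order_id INT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
STORED AS PARQUET;

-- A view that hides a frequently used filter.
CREATE VIEW recent_orders AS
SELECT order_id, amount
FROM orders_by_day
WHERE order_date >= '2021-01-01';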
