Wednesday, March 15, 2017

Hue SQL tricks

I had to do the following to get a create table to run. Be aware that it will complain about everything without much of a clue as to the error.

1) I had to be in the area the source tables were in

2) Removed hive from all table references

create table ~~hive.~~ltobias.xyz

3) Removed the formatting from the to_date:

> to_date('2017-01-01'~~, 'yyyy-MM-dd'~~)

> to_date('2017-01-01')

4) Put backticks around items as alias (apparently, it's a keyword)

join db1.items `items`

5) Remember to hit refresh to see the table.

Thursday, March 2, 2017

HUE: Create Database

Click the refresh button above the drop down list if your database doesn't automatically display.

Wednesday, March 1, 2017

MapR 5.0 Documentation / Drill Tutorial - Lesson 1: Learn About the Data Set

Lesson 1:

MapR Drill - Lesson 1

drill:> show databases;
------------------------------------------------------------------

Enable the DECIMAL Data Type

This tutorial uses the DECIMAL data type in some examples. The DECIMAL data type is disabled by default in this release, so enable the DECIMAL data type before proceeding:

alter session set `planner.enable_decimal_data_type`=true;

------------------------------------------------------------------

describe hive.nextgen_rjr.transaction_line_items

------------------------------------------------------------------

Lesson 2:

MapR Drill - Lesson 2

select cast(row_key as int)
, regexp_replace(cast(t.address.state as varchar(10)),'"','')

from customers t limit 1;

------------------------------------------------------------------

A mutable (or writable) workspace is a workspace that is enabled for “write” operations. This attribute is part of the storage plugin configuration. You can create Drill views and tables in mutable workspaces.
-----------------------------------------------------------------

create or replace view custview as select cast(row_key as int) as cust_id,

cast(t.personal.name as varchar(20)) as name,

cast(t.personal.gender as varchar(10)) as gender,

cast(t.personal.age as varchar(10)) as age,

cast(t.address.state as varchar(4)) as state,

cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,

cast(t.loyalty.membership as varchar(20)) as membership

from maprdb.customers t;

A view is simply a special file with a specific extension (.drill). You can store views even in your local file system or point to a specific workspace
------------------------------------------------------------------

!set maxwidth 10000 -- can't get to work

------------------------------------------------------------------

Repeated Count

https://drill.apache.org/docs/repeated-count/

This function counts the values in an array.

Syntax ¶

REPEATED_COUNT (array)

array is the name of an array.

Usage Notes ¶

The COUNT function requires a single argument, which must be an array. Note that this function is not a standard SQL aggregate function and does not require the count to be grouped by other columns in the select list (such as name in this example).

Example ¶

The following example returns the counts for the categories array in the yelp_academic_dataset_business.json file. The counts are restricted to rows that contain the string pizza.

SELECT name, REPEATED_COUNT(categories) 
FROM   dfs.yelp.`yelp_academic_dataset_business.json` 
WHERE  name LIKE '%pizza%';

+---------------+------------+
|    name       |   EXPR$1   |
+---------------+------------+
| Villapizza    | 2          |
| zpizza        | 4          |
| zpizza        | 4          |
| Luckys pizza  | 2          |
| Zpizza        | 2          |
| S2pizzabar    | 4          |
| Dominos pizza | 5          |
+---------------+------------+

7 rows selected (2.03 seconds)

Monday, January 23, 2017

Apache Drill

Drill is the industry's 1st schema-less SQL engine.

Feel the need for speed? It can process petabytes of data in seconds.

It can combine data from multiple types of files on the fly in a single query.

Uses ANSI SQL 2003 support

Rethinking SQL for Big Data with Apache Drill (MapR)

Wednesday, January 18, 2017

JSON Sample

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ],
  "gender": {
    "type": "male"
  }
}

https://en.wikipedia.org/wiki/JSON#JSON_sample

Lynda Courses

Beginner

q Agile vs. waterfall

q Data Analysis on Hadoop

q Hadoop Fundamentals

q GIS on the Web

q Learn the Linux Command Line: The Basics

q Linux: Bash Shell and Scripts

q Manage Your Organization's Big Data Program

q Real-World GIS

q Techniques and Concepts of Big Data

q Transitioning from Data Warehousing to Big Data

q Transitioning from Waterfall to Agile Project Management

q Understanding Data Science

Intermediate

q Java Essential Training

q NoSQL for SQL Professionals

q Overview of IDEs for Java

q Up and Running with Java

q XML Essential Training

q Learn Java Concepts By Example

q Java Essential Training for Students

q Code Clinic: Java

q Foundations of Programming: Object-Oriented Design

q Up and Running with Git and GitHub

MapR EcoSystem

https://www.mapr.com/products/product-overview/overview%20

Execution Engines	Batch	Tez
		Spark	A fast and general engine for large-scale data processing
		Cascading
		Pig	an ETL library for Hadoop. It generates MapReduce jobs. You use it when you have processes that are ETL-like.
		Map Reduce v1/v2
	ML, Graph	Graphx
		MLLIB
		Mahout	machine learning or predictive analytics. A library.
	SQL	Drill	A schema-free SQL query engine for Hadoop, NoSQL, and Cloud Storage. Doesn't use MapReduce.
		Shark
		Impala
		Hive	SQL like query used with Hbase. It uses H-sql. Ad-hoc querying.
	NoSql & Search	Accumulo
		Soir
		HBase
	Streaming	Storm	A free and open source distributed real-time computation system.
	Streaming	Spark Streaming
	Yarn		“Yet Another Resource Negotiator”. sometimes called MapReduce 2.0. Apache YARN decouples resource management and data processing in Hadoop.
Data Governance & Operations	Data Integration & Access	Hue
		HttpFS
		Flume	a log collector because Hadoop jobs produce a large amount of log information about job process because the jobs are running batch, so they take time to run
		Sqoop	Transfers bulk data between Hadop and Oracle’s DBMS.
	Security	Knox
	Security	Sentry
	Workflow & Data Governance	Falcon
	Workflow & Data Governance	Oozie	a Workflow scheduler library for Hadoop jobs
	Provisioning & Coordination	Savannah
		Juju
		Zookeeper	A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Hadoop

Hadoop is a file system with a processing library.

The file system is HDFS (Hadoop File System).

The api is MapReduce.

HBase

A nosql db commonly used with Hadoop systems. It is a wide-columnstore (See Hive)

This is called Schema on Read, which means no schema is imposed as data is written to the file, but a create table will be required when it is read.

ID	Data
1	Name=”Jan”, Location=”Pittsburgh”
2	Name=”Heather”, Car=”Volvo”
3	Location=”South Hills”, Car=”Ford”, Color=”Red”

An open-source, distributed, versioned, non-relational database modeled after Google's Bigtable.

Tuesday, January 17, 2017

Bytes

1000 kB kilobyte 1,000

1000² MB megabyte 1,000,000

1000³ GB gigabyte 1,000,000,000

1000⁴ TB terabyte 1,000,000,000,000

1000⁵ PB petabyte 1,000,000,000,000,000

1000⁶ EB exabyte 1,000,000,000,000,000,000

1000⁷ ZB zettabyte 1,000,000,000,000,000,000,000

1000⁸ YB yottabyte 1,000,000,000,000,000,000,000,000

3 V's & Data Types

Volume

Variety

unstructured data

actually very little structure

semi-structured data

(log files)
tab-delimited

structured data

highly stuctured (rdbms)

multi-structured = unstructured / semi-structured / structured

Velocity

data with value for an amount of time

Learning Big Data