Monday, January 23, 2017

Apache Drill

Drill is the industry's first schema-free SQL engine.

Feel the need for speed? Drill is designed to process petabytes of data in seconds.

It can combine data from multiple types of files on the fly in a single query.

It supports ANSI SQL:2003 syntax.
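
To give a feel for what querying without a predeclared schema means, here is a minimal Python sketch. It is not Drill and uses none of Drill's APIs; it only imitates the idea of joining a JSON source and a CSV source whose fields are discovered at read time. The sample data and the names `people_json`, `orders_csv`, and `join_on` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two files of different types.
people_json = '[{"id": 1, "name": "Jan"}, {"id": 2, "name": "Heather", "car": "Volvo"}]'
orders_csv = "id,item\n1,laptop\n2,phone\n2,tablet\n"


def join_on(key, left_rows, right_rows):
    """Inner-join two lists of dicts on `key`, comparing values as strings."""
    index = {}
    for row in left_rows:
        index.setdefault(str(row[key]), []).append(row)
    joined = []
    for row in right_rows:
        for match in index.get(str(row[key]), []):
            # Merge the fields of both rows; the left-hand row wins on name clashes.
            joined.append({**row, **match})
    return joined


# Field names come from the data itself; no schema is declared up front.
people = json.loads(people_json)
orders = list(csv.DictReader(io.StringIO(orders_csv)))

result = join_on("id", people, orders)
for row in result:
    print(row["name"], row["item"])
```

In Drill the same idea would be expressed as one SQL query over the files; the sketch only shows why no table definition is needed first.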


Rethinking SQL for Big Data with Apache Drill (MapR)

Wednesday, January 18, 2017

JSON Sample

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ],
  "gender": {
    "type": "male"
  }
}
https://en.wikipedia.org/wiki/JSON#JSON_sample
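
A sample like this can be read with any JSON parser. The sketch below uses Python's standard `json` module on a copy of the sample; the variable names are only for illustration.

```python
import json

# The sample document from above, embedded as a string.
sample = """
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ],
  "gender": {"type": "male"}
}
"""

person = json.loads(sample)

# Objects become dicts and arrays become lists, so nesting is
# navigated by key and by index.
full_name = f'{person["firstName"]} {person["lastName"]}'
city = person["address"]["city"]
numbers_by_type = {p["type"]: p["number"] for p in person["phoneNumber"]}

print(full_name)
print(city)
print(numbers_by_type["home"])
```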

Lynda Courses

Beginner

  • Agile vs. waterfall
  • Data Analysis on Hadoop
  • Hadoop Fundamentals
  • GIS on the Web
  • Learn the Linux Command Line: The Basics
  • Linux: Bash Shell and Scripts
  • Manage Your Organization's Big Data Program
  • Real-World GIS
  • Techniques and Concepts of Big Data
  • Transitioning from Data Warehousing to Big Data
  • Transitioning from Waterfall to Agile Project Management
  • Understanding Data Science

Intermediate

  • Java Essential Training
  • NoSQL for SQL Professionals
  • Overview of IDEs for Java
  • Up and Running with Java
  • XML Essential Training
  • Learn Java Concepts By Example
  • Java Essential Training for Students
  • Code Clinic: Java
  • Foundations of Programming: Object-Oriented Design
  • Up and Running with Git and GitHub

MapR EcoSystem

https://www.mapr.com/products/product-overview/overview

Execution Engines

  • Batch
    • Tez
    • Spark: a fast and general engine for large-scale data processing.
    • Cascading
    • Pig: an ETL library for Hadoop. It generates MapReduce jobs. Use it when your processes are ETL-like.
    • MapReduce v1/v2
  • ML, Graph
    • GraphX
    • MLlib
    • Mahout: a library for machine learning and predictive analytics.
  • SQL
    • Drill: a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. It doesn't use MapReduce.
    • Shark
    • Impala
    • Hive: SQL-like querying over data in Hadoop, including HBase tables. It uses HiveQL. Suited to ad hoc querying.
  • NoSQL & Search
    • Accumulo
    • Solr
    • HBase
  • Streaming
    • Storm: a free and open source distributed real-time computation system.
    • Spark Streaming

YARN

“Yet Another Resource Negotiator”, sometimes called MapReduce 2.0. Apache YARN decouples resource management from data processing in Hadoop.

Data Governance & Operations

  • Data Integration & Access
    • Hue
    • HttpFS
    • Flume: a log collector. Hadoop jobs run in batch and take time to complete, so they produce a large amount of log information about job progress.
    • Sqoop: transfers bulk data between Hadoop and relational databases such as Oracle.
  • Security
    • Knox
    • Sentry
  • Workflow & Data Governance
    • Falcon
    • Oozie: a workflow scheduler for Hadoop jobs.
  • Provisioning & Coordination
    • Savannah
    • Juju
    • ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Hadoop

Hadoop is a file system with a processing library.

The file system is HDFS (Hadoop Distributed File System).

The API is MapReduce.
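
The MapReduce model itself can be sketched in a few lines of plain Python. This is not the Hadoop API, and nothing here is distributed; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative. It shows the three steps on a word count: map emits key/value pairs, shuffle groups them by key, and reduce combines each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Made-up input lines standing in for a file stored in HDFS.
lines = ["big data big ideas", "data in motion"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())

print(counts)
```

In a real cluster the map and reduce calls would run on many machines in parallel, each close to the data block it reads.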

HBase

A NoSQL database commonly used with Hadoop systems. It is a wide-column store (see Hive).

Its storage model is called schema on read: no schema is imposed as data is written to the file, but a CREATE TABLE is required when the data is read.

ID   Data
1    Name="Jan", Location="Pittsburgh"
2    Name="Heather", Car="Volvo"
3    Location="South Hills", Car="Ford", Color="Red"

An open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. 
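
The ID/Data table above can be mimicked with a dictionary of rows in which each row carries only the columns it has. This is a plain-Python sketch, not the HBase client API; `rows`, `read_with_schema`, and the chosen column list are assumptions for illustration. The schema is applied only when the data is read.

```python
# Each row stores only the columns it actually has; nothing is declared at write time.
rows = {
    1: {"Name": "Jan", "Location": "Pittsburgh"},
    2: {"Name": "Heather", "Car": "Volvo"},
    3: {"Location": "South Hills", "Car": "Ford", "Color": "Red"},
}


def read_with_schema(rows, columns):
    """Project every row onto `columns`, filling missing cells with None."""
    return [
        {"ID": row_id, **{col: data.get(col) for col in columns}}
        for row_id, data in sorted(rows.items())
    ]


# The reader decides which columns matter, so the schema is imposed on read.
table = read_with_schema(rows, ["Name", "Location", "Car"])
for record in table:
    print(record)
```

A different reader could ask for `["Car", "Color"]` over the same stored rows without any change to how they were written.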

Tuesday, January 17, 2017

Bytes

1000    kB   kilobyte                                          1,000
1000^2  MB   megabyte                                      1,000,000
1000^3  GB   gigabyte                                  1,000,000,000
1000^4  TB   terabyte                              1,000,000,000,000
1000^5  PB   petabyte                          1,000,000,000,000,000
1000^6  EB   exabyte                       1,000,000,000,000,000,000
1000^7  ZB   zettabyte                 1,000,000,000,000,000,000,000
1000^8  YB   yottabyte             1,000,000,000,000,000,000,000,000
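
The table follows a single rule: each unit is the next power of 1000. A short Python sketch of that rule (the names `UNITS` and `to_bytes` are illustrative):

```python
# Decimal (SI) prefixes in order; the unit at index i equals 1000 ** (i + 1) bytes.
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(amount, unit):
    """Convert an amount in the given decimal unit to a number of bytes."""
    power = UNITS.index(unit) + 1
    return amount * 1000 ** power


# Reprint the table: unit, then its size in bytes with thousands separators.
for unit in UNITS:
    print(f"{unit}  {to_bytes(1, unit):,}")
```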

3 V's & Data Types


  • Volume 
  • Variety
    • unstructured data
      • actually very little structure
    • semi-structured data
      • (log files) 
      • tab-delimited
    • structured data
      • highly structured (RDBMS)
    • multi-structured = unstructured / semi-structured / structured
  • Velocity 
    • data that has value only for a limited amount of time