Drill is billed as the industry's first schema-free SQL engine.
The pitch is speed: it is claimed to process petabytes of data in seconds.
It can combine data from multiple types of files on the fly in a single query.
It supports ANSI SQL:2003.
Rethinking SQL for Big Data with Apache Drill (MapR)
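A minimal sketch of what "multiple types of files in a single query" could look like, written as a Python script that only builds a request for Drill's REST endpoint and does not send it. The file paths, the field names (customerId, firstName, lastName), and the local host and port are all hypothetical; the sketch also assumes the default "dfs" storage plugin and a header-less CSV file, whose fields Drill exposes through a `columns` array.

```python
import json
import urllib.request

# Hypothetical files: one JSON, one CSV, joined in a single query.
# "dfs" is assumed to be the file-system storage plugin name.
QUERY = """
SELECT c.firstName, c.lastName, o.columns[1] AS order_total
FROM dfs.`/data/customers.json` AS c
JOIN dfs.`/data/orders.csv` AS o
  ON c.customerId = CAST(o.columns[0] AS INT)
"""


def build_drill_request(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_request(QUERY)
payload = json.loads(request.data)
print(request.full_url)
print(payload["queryType"])
```

Sending the request with `urllib.request.urlopen(request)` would need a running Drill instance, so that step is left out here.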
Monday, January 23, 2017
Wednesday, January 18, 2017
JSON Sample
{ "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021" }, "phoneNumber": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ], "gender": { "type": "male" } }
https://en.wikipedia.org/wiki/JSON#JSON_sample
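To make the nesting in the sample concrete, here is a small Python sketch that parses it with the standard `json` module and reads a nested object and an array. The variable names are only for illustration.

```python
import json

# The JSON sample from above, reformatted across lines for readability.
SAMPLE = """
{
  "firstName": "John", "lastName": "Smith", "age": 25,
  "address": {
    "streetAddress": "21 2nd Street", "city": "New York",
    "state": "NY", "postalCode": "10021"
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "fax", "number": "646 555-4567" }
  ],
  "gender": { "type": "male" }
}
"""

person = json.loads(SAMPLE)

# A nested object becomes a dict inside a dict.
city = person["address"]["city"]

# An array of objects becomes a list of dicts; index it by phone type.
phones = {entry["type"]: entry["number"] for entry in person["phoneNumber"]}

print(city)
print(phones["fax"])
```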
Lynda Courses
Beginner
- Agile vs. waterfall
- Data Analysis on Hadoop
- Hadoop Fundamentals
- GIS on the Web
- Learn the Linux Command Line: The Basics
- Linux: Bash Shell and Scripts
- Manage Your Organization's Big Data Program
- Real-World GIS
- Techniques and Concepts of Big Data
- Transitioning from Data Warehousing to Big Data
- Transitioning from Waterfall to Agile Project Management
- Understanding Data Science
Intermediate
- Java Essential Training
- NoSQL for SQL Professionals
- Overview of IDEs for Java
- Up and Running with Java
- XML Essential Training
- Learn Java Concepts By Example
- Java Essential Training for Students
- Code Clinic: Java
- Foundations of Programming: Object-Oriented Design
- Up and Running with Git and GitHub
MapR EcoSystem
https://www.mapr.com/products/product-overview/overview%20
Execution Engines
- Batch
  - Tez
  - Spark: a fast and general engine for large-scale data processing.
  - Cascading
  - Pig: an ETL library for Hadoop. It generates MapReduce jobs. Use it for processes that are ETL-like.
  - MapReduce v1/v2
- ML, Graph
  - GraphX
  - MLlib
  - Mahout: a library for machine learning and predictive analytics.
- SQL
  - Drill: a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. It doesn't use MapReduce.
  - Shark
  - Impala
  - Hive: SQL-like querying over data stored in Hadoop; it can also query HBase tables. It uses HiveQL. Suited to ad-hoc querying.
- NoSQL & Search
  - Accumulo
  - Solr
  - HBase
- Streaming
  - Storm: a free and open source distributed real-time computation system.
  - Spark Streaming
- YARN: "Yet Another Resource Negotiator", sometimes called MapReduce 2.0. Apache YARN decouples resource management from data processing in Hadoop.

Data Governance & Operations
- Data Integration & Access
  - Hue
  - HttpFS
  - Flume: a log collector. Hadoop jobs run as batches and take time to finish, so they produce a large amount of log information about job progress.
  - Sqoop: transfers bulk data between Hadoop and relational databases such as Oracle.
- Security
  - Knox
  - Sentry
- Workflow & Data Governance
  - Falcon
  - Oozie: a workflow scheduler for Hadoop jobs.
- Provisioning & Coordination
  - Savannah
  - Juju
  - Zookeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Hadoop
Hadoop is a distributed file system paired with a processing library.
The file system is HDFS (Hadoop Distributed File System).
The API is MapReduce.
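As a rough illustration of the MapReduce idea (not the Hadoop API itself), here is a word count in plain Python with the three phases spelled out: map, shuffle, and reduce. The function names and the input lines are made up for the sketch, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data lake"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In Hadoop the map and reduce steps run in parallel on the nodes that hold the data, and the framework handles the shuffle between them.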
HBase
A NoSQL database commonly used with Hadoop systems. It is a wide-column store (see Hive).
It follows schema on read: no schema is imposed as data is written to the file, but a CREATE TABLE statement is required when the data is read.
ID | Data
1 | Name="Jan", Location="Pittsburgh"
2 | Name="Heather", Car="Volvo"
3 | Location="South Hills", Car="Ford", Color="Red"
An open-source, distributed, versioned,
non-relational database modeled after Google's Bigtable.
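A small sketch of schema on read, using the three rows from the table above as plain Python dicts. This is not the HBase client API, only an illustration: each row is stored with whatever columns it happens to have, and a column list (the hypothetical `read_with_schema` helper) is applied only when the data is read.

```python
# Rows are written without any shared schema; each has its own columns.
rows = {
    1: {"Name": "Jan", "Location": "Pittsburgh"},
    2: {"Name": "Heather", "Car": "Volvo"},
    3: {"Location": "South Hills", "Car": "Ford", "Color": "Red"},
}


def read_with_schema(rows, columns):
    """Apply a schema at read time; columns a row lacks come back as None."""
    return [
        (row_id,) + tuple(data.get(column) for column in columns)
        for row_id, data in sorted(rows.items())
    ]


table = read_with_schema(rows, ["Name", "Car"])
for record in table:
    print(record)
```

Reading the same rows with a different column list, such as `["Location", "Color"]`, gives a different table without rewriting any stored data.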
Tuesday, January 17, 2017
Bytes
Multiple | Symbol | Name | Value in bytes
1000 | kB | kilobyte | 1,000
1000^2 | MB | megabyte | 1,000,000
1000^3 | GB | gigabyte | 1,000,000,000
1000^4 | TB | terabyte | 1,000,000,000,000
1000^5 | PB | petabyte | 1,000,000,000,000,000
1000^6 | EB | exabyte | 1,000,000,000,000,000,000
1000^7 | ZB | zettabyte | 1,000,000,000,000,000,000,000
1000^8 | YB | yottabyte | 1,000,000,000,000,000,000,000,000
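The table can be turned into a small helper that picks the largest unit for a byte count. This is a sketch using the decimal (power of 1000) units listed above; the function name `format_bytes` is just an illustration.

```python
# Decimal units from the table, smallest to largest.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(count):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(count)
    for unit in UNITS:
        # Stop at the first unit that fits, or at YB if the number is huge.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(format_bytes(999))
print(format_bytes(1_500_000))
print(format_bytes(10**15))
```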
3 V's & Data Types
- Volume
- Variety
  - unstructured data
    - actually very little structure
  - semi-structured data
    - log files
    - tab-delimited
  - structured data
    - highly structured (RDBMS)
  - multi-structured = unstructured / semi-structured / structured
- Velocity
  - data has value only for a certain amount of time