Drill is described as the industry's first schema-free SQL engine.
It is built for speed: the claim is that it can process petabytes of data in seconds.
It can combine data from multiple types of files on the fly in a single query.
It supports ANSI SQL:2003.
Rethinking SQL for Big Data with Apache Drill (MapR)
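To make the idea of combining several file types in one query concrete, here is a minimal Python sketch. It is not Drill and does not use Drill's API; it only joins a small JSON document with a small CSV document in memory. The sample data and the function name join_customers_orders are hypothetical, made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two files of different types.
CUSTOMERS_JSON = '[{"id": 1, "name": "Jan"}, {"id": 2, "name": "Heather"}]'
ORDERS_CSV = "customer_id,item\n1,book\n2,lamp\n1,pen\n"


def join_customers_orders(customers_json, orders_csv):
    """Join JSON customers to CSV orders on the customer id."""
    # Build a lookup table from the JSON side: id -> name.
    customers = {c["id"]: c["name"] for c in json.loads(customers_json)}
    rows = []
    for order in csv.DictReader(io.StringIO(orders_csv)):
        # CSV values arrive as strings, so cast the key before matching.
        name = customers.get(int(order["customer_id"]))
        rows.append((name, order["item"]))
    return rows


if __name__ == "__main__":
    for name, item in join_customers_orders(CUSTOMERS_JSON, ORDERS_CSV):
        print(name, item)
```

A query engine does this kind of join declaratively and at scale; the sketch only shows that neither input needs a schema declared up front.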
Monday, January 23, 2017
Wednesday, January 18, 2017
JSON Sample
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "fax", "number": "646 555-4567" }
  ],
  "gender": { "type": "male" }
}
https://en.wikipedia.org/wiki/JSON#JSON_sample
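As a small illustration of how the nested sample above is read in code, the following Python sketch parses it with the standard json module and walks into the nested object and the array. The helper name phone_numbers_by_type is hypothetical.

```python
import json

# The JSON sample from above, embedded as a string.
SAMPLE = """
{ "firstName": "John", "lastName": "Smith", "age": 25,
  "address": { "streetAddress": "21 2nd Street", "city": "New York",
               "state": "NY", "postalCode": "10021" },
  "phoneNumber": [ { "type": "home", "number": "212 555-1234" },
                   { "type": "fax", "number": "646 555-4567" } ],
  "gender": { "type": "male" } }
"""

person = json.loads(SAMPLE)


def phone_numbers_by_type(record):
    """Turn the phoneNumber array into a dict keyed by phone type."""
    return {entry["type"]: entry["number"] for entry in record["phoneNumber"]}


if __name__ == "__main__":
    # Nested objects become dicts; arrays become lists.
    print(person["address"]["city"])
    print(phone_numbers_by_type(person))
```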
Lynda Courses
Beginner
- Agile vs. waterfall
- Data Analysis on Hadoop
- Hadoop Fundamentals
- GIS on the Web
- Learn the Linux Command Line: The Basics
- Linux: Bash Shell and Scripts
- Manage Your Organization's Big Data Program
- Real-World GIS
- Techniques and Concepts of Big Data
- Transitioning from Data Warehousing to Big Data
- Transitioning from Waterfall to Agile Project Management
- Understanding Data Science
Intermediate
- Java Essential Training
- NoSQL for SQL Professionals
- Overview of IDEs for Java
- Up and Running with Java
- XML Essential Training
- Learn Java Concepts By Example
- Java Essential Training for Students
- Code Clinic: Java
- Foundations of Programming: Object-Oriented Design
- Up and Running with Git and GitHub
MapR EcoSystem
https://www.mapr.com/products/product-overview/overview%20
 
Layer                        | Group                      | Component       | Description
Execution Engines            | Batch                      | Tez             |
                             |                            | Spark           | A fast and general engine for large-scale data processing.
                             |                            | Cascading       |
                             |                            | Pig             | An ETL library for Hadoop. It generates MapReduce jobs. You use it when you have processes that are ETL-like.
                             |                            | MapReduce v1/v2 |
                             | ML, Graph                  | GraphX          |
                             |                            | MLlib           |
                             |                            | Mahout          | A library for machine learning or predictive analytics.
                             | SQL                        | Drill           | A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Doesn't use MapReduce.
                             |                            | Shark           |
                             |                            | Impala          |
                             |                            | Hive            | SQL-like querying for data in Hadoop; can be used with HBase. It uses HiveQL. Suited to ad hoc querying.
                             | NoSQL & Search             | Accumulo        |
                             |                            | Solr            |
                             |                            | HBase           |
                             | Streaming                  | Storm           | A free and open source distributed real-time computation system.
                             |                            | Spark Streaming |
                             | YARN                       |                 | "Yet Another Resource Negotiator". Sometimes called MapReduce 2.0. Apache YARN decouples resource management and data processing in Hadoop.
Data Governance & Operations | Data Integration & Access  | Hue             |
                             |                            | HttpFS          |
                             |                            | Flume           | A log collector. Hadoop jobs run as batch jobs and take time to complete, so they produce a large amount of log information about job progress.
                             |                            | Sqoop           | Transfers bulk data between Hadoop and relational databases such as Oracle.
                             | Security                   | Knox            |
                             |                            | Sentry          |
                             | Workflow & Data Governance | Falcon          |
                             |                            | Oozie           | A workflow scheduler library for Hadoop jobs.
                             | Provisioning & Coordination| Savannah        |
                             |                            | Juju            |
                             |                            | Zookeeper       | A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Hadoop
Hadoop is a file system with a processing library.
The file system is HDFS (Hadoop Distributed File System).
The processing API is MapReduce.
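The MapReduce model mentioned above can be sketched in a few lines of plain Python. This is not the Hadoop API; it is a single-process word count that mimics the three stages (map, shuffle, reduce). The function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In Hadoop the map and reduce steps run on many machines against blocks stored in HDFS; the sketch keeps everything in memory to show only the data flow.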
HBase
A NoSQL database commonly used with Hadoop systems. It is a wide-column store (see Hive).
This is called schema on read: no schema is imposed as data is written to the file, but a schema (for example, a CREATE TABLE statement) is required when the data is read.
ID | Data
1  | Name="Jan", Location="Pittsburgh"
2  | Name="Heather", Car="Volvo"
3  | Location="South Hills", Car="Ford", Color="Red"
An open-source, distributed, versioned,
non-relational database modeled after Google's Bigtable. 
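The schema-on-read idea can be illustrated with the three sample rows from the table above. The sketch below is plain Python, not the HBase client API: rows are stored with whatever columns they happen to have, and a column list is applied only at read time. The function name read_with_schema is hypothetical.

```python
# Rows are written without a fixed schema: each row has its own columns.
ROWS = {
    1: {"Name": "Jan", "Location": "Pittsburgh"},
    2: {"Name": "Heather", "Car": "Volvo"},
    3: {"Location": "South Hills", "Car": "Ford", "Color": "Red"},
}


def read_with_schema(rows, columns):
    """Apply a column list at read time; missing columns come back as None."""
    return [
        (row_id, *(data.get(column) for column in columns))
        for row_id, data in sorted(rows.items())
    ]


if __name__ == "__main__":
    # The "schema" (Name, Car) exists only in this read, not in the stored data.
    for record in read_with_schema(ROWS, ["Name", "Car"]):
        print(record)
```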
Tuesday, January 17, 2017
Bytes
Value    Symbol  Name                                           Bytes
1000     kB      kilobyte                                       1,000
1000^2   MB      megabyte                                   1,000,000
1000^3   GB      gigabyte                               1,000,000,000
1000^4   TB      terabyte                           1,000,000,000,000
1000^5   PB      petabyte                       1,000,000,000,000,000
1000^6   EB      exabyte                    1,000,000,000,000,000,000
1000^7   ZB      zettabyte              1,000,000,000,000,000,000,000
1000^8   YB      yottabyte          1,000,000,000,000,000,000,000,000
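The table follows a simple rule: each unit is the next power of 1000. A short Python sketch of that arithmetic, with hypothetical helper names unit_size and human_readable:

```python
# Decimal (SI) unit symbols in increasing order, as in the table above.
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def unit_size(symbol):
    """Return the number of bytes in one unit, e.g. 1000**2 for MB."""
    return 1000 ** (UNITS.index(symbol) + 1)


def human_readable(num_bytes):
    """Format a byte count using the largest unit that keeps the value below 1000."""
    if num_bytes < 1000:
        return f"{num_bytes} B"
    value = float(num_bytes)
    symbol = "B"
    for symbol in UNITS:
        value /= 1000
        if value < 1000:
            break
    return f"{value:.1f} {symbol}"


if __name__ == "__main__":
    print(unit_size("PB"))
    print(human_readable(2 * 10**15))
```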
3 V's & Data Types
- Volume

- Variety
 - unstructured data
  - actually has very little structure
 - semi-structured data
  - log files
  - tab-delimited files
 - structured data
  - highly structured (RDBMS)
 - multi-structured = unstructured / semi-structured / structured

- Velocity
 - data that has value only for a limited amount of time

