Examining large amounts of data of varied types to uncover hidden patterns, unknown correlations, and other useful information.
Big data
Annotations:
General term used to describe unstructured and semi-structured data.
Data volumes at this scale are usually specified in petabytes and exabytes.
A petabyte (PB) is a measure of memory or storage capacity equal to 2 to the 50th power bytes; in decimal terms, approximately a thousand terabytes.
An exabyte (EB) is a large unit of computer data storage, 2 to the sixtieth power bytes.
Approximately one quintillion bytes.
In decimal terms an exabyte is a billion gigabytes.
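The size figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the constant names are illustrative and not part of any standard library.

```python
# Binary (power-of-two) sizes, in bytes.
PETABYTE_BINARY = 2 ** 50
EXABYTE_BINARY = 2 ** 60

# Decimal (power-of-ten) sizes, in bytes.
GIGABYTE_DECIMAL = 10 ** 9
TERABYTE_DECIMAL = 10 ** 12
PETABYTE_DECIMAL = 10 ** 15
EXABYTE_DECIMAL = 10 ** 18  # one quintillion

print(PETABYTE_BINARY)                       # 1125899906842624
print(EXABYTE_BINARY)                        # 1152921504606846976

# In decimal terms: a petabyte is a thousand terabytes,
# and an exabyte is a billion gigabytes.
print(PETABYTE_DECIMAL // TERABYTE_DECIMAL)  # 1000
print(EXABYTE_DECIMAL // GIGABYTE_DECIMAL)   # 1000000000

# The binary unit is somewhat larger than its decimal approximation.
print(round(PETABYTE_BINARY / PETABYTE_DECIMAL, 4))  # 1.1259
print(round(EXABYTE_BINARY / EXABYTE_DECIMAL, 4))    # 1.1529
```

The last two lines show why "approximately" matters: the power-of-two value is about 13% larger than the decimal one for a petabyte and about 15% larger for an exabyte.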
Unstructured data
Annotations:
It is a general label for any corporate information that does not reside in a database.
Two types - Textual and Non-textual.
Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software and instant messages.
Non-textual unstructured data is generated in media like JPEG images, MP3 audio files and Flash video files.
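As a rough illustration of the textual / non-textual split, files can be sorted by extension. This is only a sketch: the extension lists and the function name `classify_unstructured` are assumptions made for the example, not an established taxonomy.

```python
from pathlib import Path

# Hypothetical extension lists, chosen to match the media named in the notes.
TEXTUAL_EXTENSIONS = {".eml", ".msg", ".ppt", ".pptx", ".doc", ".docx", ".txt"}
NON_TEXTUAL_EXTENSIONS = {".jpg", ".jpeg", ".mp3", ".flv", ".swf"}


def classify_unstructured(filename: str) -> str:
    """Return 'textual', 'non-textual', or 'unknown' based on the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXTUAL_EXTENSIONS:
        return "textual"
    if suffix in NON_TEXTUAL_EXTENSIONS:
        return "non-textual"
    return "unknown"


for name in ["Q3_review.PPTX", "holiday.jpeg", "podcast.mp3", "archive.zip"]:
    print(name, "->", classify_unstructured(name))
```

In practice an extension is only a hint; a real pipeline would likely inspect the file content as well.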
Primary goal
Annotations:
Is to discover repeatable business patterns.
Primary goal
Annotations:
Is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data, as well as other data sources that may be left untapped by conventional business intelligence (BI) programs.
A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.
A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding.
They have the ability to explain the significance of data in a way that can be easily understood by others.
Technologies
NoSQL
Annotations:
NoSQL, also called Not Only SQL, is an approach to data management and database design that is useful for very large sets of distributed data.
NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.
One of the most popular NoSQL databases is Apache Cassandra. Cassandra, which was once Facebook's proprietary database, was released as open source in 2008. Other NoSQL implementations include SimpleDB, Google BigTable, MemcacheDB, and Voldemort. Apache Hadoop and MapReduce are often listed alongside them, although they are data-processing frameworks rather than databases. Companies that use NoSQL include Netflix, LinkedIn and Twitter.
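The idea of schema-less records spread across several servers can be sketched with a toy key-value store. Everything here is an assumption for illustration: the class `TinyKeyValueStore`, its methods, and the hash-modulo placement rule are hypothetical and do not reflect how Cassandra or any other real NoSQL product is implemented.

```python
import hashlib


class TinyKeyValueStore:
    """Toy in-memory store that partitions documents across simulated nodes."""

    def __init__(self, node_count: int = 3):
        # Each "node" is just a dict standing in for a separate server.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> int:
        # A stable hash of the key decides which node owns the record.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key: str, document: dict) -> None:
        self.nodes[self._node_for(key)][key] = document

    def get(self, key: str):
        return self.nodes[self._node_for(key)].get(key)


store = TinyKeyValueStore(node_count=3)
# Records need not share a schema: each document carries its own fields.
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
store.put("user:2", {"name": "Lin", "email": "lin@example.com"})

print(store.get("user:1"))
print(store.get("user:2"))
print(store.get("user:3"))  # None, no such key
```

Real systems add replication, consistency controls and rebalancing when nodes join or leave, none of which this sketch attempts.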
Hadoop
Annotations:
Hadoop was created by Doug Cutting and Mike Cafarella.
It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
It is part of the Apache project sponsored by the Apache Software Foundation.
MapReduce
Annotations:
MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004.
This framework is divided into two parts:
1. Map, a function that parcels out work to different nodes in the distributed cluster.
2. Reduce, a function that collates the work and resolves the results into a single value.
The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and re-assigns the work to other nodes.
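The map-then-reduce pattern can be sketched with a word count, the usual introductory example. This is a single-process illustration, not the Google or Hadoop implementation: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for the example, and nothing here is distributed or fault-tolerant.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = []
    for doc in documents:  # on a cluster, each document could go to a different node
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big patterns", "data patterns data"])
print(counts)  # {'big': 2, 'data': 3, 'patterns': 2}
```

Because each call to `map_phase` depends only on its own document, and each call to `reduce_phase` depends only on one key, both phases can be run in parallel on separate machines, which is what makes the pattern suit large clusters.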