Big Data
A Not-only SQL (NoSQL) database is a non-relational database that can be used to store semi-structured and unstructured data
is an open-source framework for large-scale data storage and data processing that runs on commodity hardware
are capable of providing highly scalable, on-demand IT resources that can be leased via pay-as-you-go models
Is a field dedicated to the analysis, processing and storage of large collections of data that frequently originate from disparate sources
Big Data Solutions
queries can take several minutes or even longer, depending on the complexity of the query and the number of records queried
is a measure for gauging success within a particular context
Examples can include EDI, e-mails, spreadsheets, RSS feeds and sensor data
are typically required when traditional data analysis, processing and storage technologies and techniques are insufficient
Big Data Addresses
Arrives at such fast speeds that enormous datasets can accumulate within very short periods of time
does not conform to a data model or data schema
Data acquired from controlled sources, such as online customer registrations, usually contains less noise
distinct requirements, such as the combining of multiple unrelated datasets, processing of large amounts of unstructured data and harvesting of hidden information, in a time-sensitive manner
Using Big Data Solutions
are closely linked with an enterprise's strategic objectives
further use databases that store historical data in multidimensional arrays and can answer complex queries based on multiple dimensions of the data
multiple formats and types of data that need to be supported by Big Data Solutions
complex analysis tasks can be carried out to arrive at deeply meaningful and insightful analysis results for the benefit of the business
Some streams are public. Other streams go to vendors and businesses directly
Analytics and Data Science
are relevant to big data in that they can serve as both a data source as well as a data sink that is capable of receiving data
can process massive quantities of data that arrive at varying speeds, may be of many different varieties and have numerous incompatibilities
Data within Big Data
is the process of gaining insights into the workings of an enterprise to improve decision-making by analyzing external data and data generated by its business processes
can have multiple data marts
is a process of loading data from a source system into a target system. The source system can be a database, a flat file or an application; similarly, the target system can be a database or some other information system
accumulates from being amassed within the enterprise (via applications) or from external sources, and is then stored by the big data solution
Data processed by Big Data
does generally require special or customized logic when it comes to pre-processing and storage
Data acquired from uncontrolled sources, such as blog postings, usually contains more noise
store historical data that is aggregated and denormalized to support fast reporting capability
can be used by enterprise applications directly, or fed into a data warehouse to enrich existing data. This data is typically analyzed and subjected to analytics
Processed data and analysis results
represents the main operation through which data warehouses are fed data
does often have special pre-processing and storage requirements, especially if the underlying format is not text-based
are commonly used for meaningful and complex reporting and assessment tasks and can also be fed back into applications to enhance their behavior (such as when product recommendations are displayed online)
actionable intelligence
operational optimization
can be human-generated or machine-generated, although it is ultimately the responsibility of machines to generate the processing results
Human-generated data
is a subset of the data stored in a data warehouse, that typically belongs to a department, division or specific line of business
each technology is uniquely relevant to modern-day Big Data Solutions and ecosystems
used to identify problem areas in order to take corrective actions
is the result of human interaction with systems, such as online services and digital devices (Ex. Social media, micro blogging, e-mails, photo sharing and messaging)
Machine-generated data
With periodic data imports from across the enterprise, the amount of data contained will continue to increase. Query response times for data analysis tasks performed as part of BI can suffer as a result
defined as the usefulness of data for an enterprise
is the result of the automated, event-driven generation of data by software programs or hardware devices (Ex. Web logs, sensor data, telemetry data, smart meter data and appliance usage data)
BDS processing results
scientific and research data (Large Hadron Collider, Atacama Large Millimeter/submillimeter Array telescope)
is crucial to big data processing, storage and analysis
identification of new markets
accurate predictions
is directly related to the veracity characteristic
The required data is first obtained from the sources, after which the extracts are modified by applying rules
fault and fraud detection
more detailed records
related to collecting and processing large quantities of diverse data has become increasingly affordable
simple insert, delete and update operations with sub-second response times
improved decision-making
scientific discoveries
Datasets
representing a common source of structured analytics input
The anticipated volume of data that is processed by Big Data solutions is substantial and usually ever-growing
Collections or groups of related data (Ex. Tweets stored in a flat file, collection of image files, extract of rows stored in a table, historical weather observations that are stored as XML Files)
Datum
Shares the same set of attributes as others in the same dataset
Are the data analysis results being accurately communicated to the appropriate decision-makers?
is based on a quantifiable indicator that is identified and agreed upon beforehand
Data analysis
either exists in textual or binary form
is the process of examining data to find facts, relationships, patterns, insights and/or trends. The eventual goal is to support decision-making
helps establish patterns and relationships among the data being analyzed
Analytics
semi-structured data
Can exist as a separate DBMS, as in the case of an OLAP database
is the discipline of gaining an understanding of data by analyzing it via a multitude of scientific techniques and automated tools, with a focus on locating hidden patterns and correlations
is usually applied using highly scalable distributed technologies and frameworks for analyzing large volumes of data from different sources
generally involves sifting through large amounts of raw, unstructured data to extract meaningful information that can serve as an input for identifying patterns, enriching existing enterprise data, or performing large-scale searches
may not always be high. For example, MRI scan images are usually not generated as frequently as log entries from a high-traffic Web server
attributes providing the file size and resolution of a digital photograph
in business-oriented environments, analytics results can lower operational costs and facilitate strategic decision-making
scientific domain
is also dependent on how long data processing takes; value and time are inversely proportional to each other
is a data analysis technique that focuses on quantifying the patterns and correlations found in the data
analytics can help identify the cause of a phenomenon to improve the accuracy of predictions
services-based environments
analytics can help strengthen the focus on delivering high quality services by driving down cost
generally makes up 80% of the data within an enterprise, and has a faster growth rate than structured data
enables data-driven decision-making with scientific backing, so that decisions can be based on factual data and not on past experience or intuition alone
Business Intelligence
can be used as an ETL engine, or as an analytics engine for processing large amounts of structured, semi-structured and unstructured data
applies analytics to large amounts of data across the enterprise
can further utilize the consolidated data contained in data warehouses to run analytical queries
KPI
is mostly machine-generated and automatically appended to the data
ticket reservation systems and banking and POS transactions
used to achieve regulatory compliance
big data solutions particularly rely on it when processing semi-structured and unstructured data
act as quick reference points for measuring the overall performance of the business
primary business and technology drivers
the relational data is stored as denormalized data in the form of cubes; this allows the data to be queried during any data analysis tasks that are performed later
XML tags providing the author and creation date of a document
Digitization
Affordable Technology & Commodity Hardware
Social Media
Hyper-Connected Communities & Devices
Cloud Computing
Analytics & Data Science
The maturity of these fields of practice inspired and enabled much of the core functionality expected from contemporary Big Data solutions and tools
Digitized data
How well has the data been stored?
is always fed with data from multiple OLTP systems using regular batch processing jobs
The longer it takes for data to be turned into meaningful information, the less potential it may have for the business
Leads to an opportunity to collect further "secondary" data, such as when individuals carry out searches or complete surveys
Collecting secondary data
Extract Transform Load (ETL)
data bearing value leading to meaningful information
can be important to businesses. Mining this data may allow for customized marketing, automated recommendations and the development of optimized product features
Affordable Technology
Typical Big Data solutions
is typically stored in relational databases and frequently generated by custom enterprise applications, ERP systems and CRM systems
are based on open-source software that requires little more than commodity hardware
commodity hardware
makes the adoption of big data solutions accessible to businesses without large capital investments
provide feedback in near-realtime via open and public mediums
businesses are storing increasing amounts of data on customer interactions and from social media avenues in an attempt to harvest this data to increase sales, enable targeted marketing and create new products and services
businesses are also increasingly interested in incorporating publicly available datasets from social media and other external data sources
The broadening coverage of the Internet and the proliferation of cellular and Wi-Fi networks have enabled more people to be continuously active in virtual communities
This happens either directly through online interaction or indirectly through the usage of connected devices, which has resulted in massive data streams
can also be fed back into OLTPs
have led to the creation of remote environments
Businesses have the opportunity to leverage the infrastructure, storage and processing capabilities provided by these environments in order to build large-scale Big Data Solutions
Can be leveraged for its scaling capabilities to perform Big Data processing tasks
have a greater noise-to-signal ratio
can be leased, which dramatically reduces the required up-front investment of big data projects
Technologies Related to Big Data
It also periodically pulls data from other sources for consolidation into a dataset (such as from OLTP, ERP, CRM, and SCM systems).
Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP)
Data Warehouses
Hadoop
OLTP
store operational data that is fully normalized
is a software system that processes transaction-oriented data
Online Transaction
the completion of an activity in realtime and not batch-processed
require automated data cleansing and data verification when carrying out ETL processes
Big Data Analysis Results
Queries Supported by OLTP
mostly exist in textual form such as XML or JSON files.
Examples of OLTP
structured data
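To make the "simple insert, delete and update operations with sub-second response times" concrete, here is a minimal sketch of a transaction-oriented operation; the accounts table and values are hypothetical, and Python's built-in sqlite3 module merely stands in for a real OLTP DBMS:

```python
import sqlite3

# Hypothetical operational table; a real OLTP system would be a
# full transactional DBMS, not an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 500.0)")

# A transaction-oriented operation: debit one account in realtime.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 75.0 WHERE id = 1")

print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())
# -> (425.0,)
```

The `with conn:` block commits on success and rolls back on error, which is the atomic completion an online transaction relies on.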
OLAP
is a system used for processing data analysis queries
form an integral part of business intelligence, data mining and machine learning processes
are used in diagnostic, predictive and prescriptive analysis
Sensor Data (RFID, Smart meters, GPS sensors)
have a lower noise-to-signal ratio
Are the right types of question being asked during data analysis?
ETL
online transactions (point-of-sale, banking)
A big data solution encompasses this tool's feature-set for converting data of different types
analytics results can lower operational costs and facilitate strategic decision-making
The data is inserted into a target system
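The extract-transform-load sequence described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical CSV source and an SQLite target; a real source or target could equally be a database, flat file or application:

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source system.
source = io.StringIO("id,amount,region\n1,100,EU\n2,-5,US\n3,250,eu\n")

# Extract: read raw records from the source system.
records = list(csv.DictReader(source))

# Transform: apply rules (drop invalid rows, normalize region codes).
clean = [
    {"id": int(r["id"]), "amount": float(r["amount"]), "region": r["region"].upper()}
    for r in records
    if float(r["amount"]) >= 0
]

# Load: insert the transformed records into the target system.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
target.executemany("INSERT INTO sales VALUES (:id, :amount, :region)", clean)
target.commit()
print(target.execute("SELECT * FROM sales").fetchall())
# -> [(1, 100.0, 'EU'), (3, 250.0, 'EU')]
```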
Data Warehouse
impose distinct data storage and processing demands, as well as management and access processes
is a central, enterprise-wide repository, consisting of historical and current data
are heavily used by BI to run various analytical queries
usually interface with an OLAP system to support analytical queries
conforms to a data model or schema
Data pertaining to multiple business entities from different operational systems is periodically extracted, validated, transformed and consolidated into a single database
Usually contain optimized databases called analytical databases to handle reporting and data analysis tasks
Analytical Database
Brings challenges for enterprises in terms of data integration, transformation, processing and storage
Data Mart
single version of "truth" is based on cleansed data, which is a prerequisite for accurate and error-free reports
has established itself as a de facto industry platform for contemporary Big Data Solutions
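Hadoop's processing model is typically introduced via the MapReduce word count. The sketch below simulates the map, shuffle and reduce phases in a single process with hypothetical input documents; on an actual Hadoop cluster these phases run distributed across commodity hardware:

```python
from collections import defaultdict

documents = ["big data big insights", "data at rest data in motion"]

# Map phase: emit a (key, value) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key across all mapper outputs.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values per key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 3, 'insights': 1, ...}
```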
Data Characteristics
Volume, Velocity, Variety, Veracity & Value
Volume
Social Media (Facebook, Twitter)
Velocity
translates into the amount of time it takes for the data to be processed once it enters the enterprise perimeter
Coping with the fast inflow of data requires the enterprise to design highly elastic and available processing solutions and corresponding data storage capabilities
Variety
Veracity
refers to the quality or fidelity of data
Noise
has a defined level of structure and consistency, but is not relational in nature
data carrying no value
Signal
controlled source
uncontrolled source
Degree of noise
Depends on the type of data present
Value
Value Considerations
Has the data been stripped of any valuable attributes?
Data Types
unstructured data
is stored in a tabular form
can be relational
does not generally have any special pre-processing or storage requirements. Examples include banking transactions, OLTP system records and customer records
qualitative analysis
is generally inconsistent and non-relational
cannot be inherently processed or queried using SQL or traditional programming features and is usually an awkward fit with relational databases
metadata
provides information about a dataset's characteristics and structure
quantitative analysis
semi-structured data and unstructured data
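A short illustration of the distinction: the hypothetical JSON record below is semi-structured (self-describing tags, but no fixed relational schema), and its metadata block carries attributes like file size and resolution, as mentioned above:

```python
import json

# Hypothetical semi-structured record: self-describing tags, but fields
# can vary from record to record (no fixed relational schema).
record = json.loads("""
{
  "type": "photo",
  "file": "beach.jpg",
  "metadata": {"size_kb": 2048, "resolution": "4032x3024", "author": "jdoe"}
}
""")

# Metadata describes the data's characteristics rather than its content.
print(record["metadata"]["resolution"])  # -> 4032x3024
```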
Types of data analysis
data mining
this technique involves analyzing a large number of observations from a dataset
since the sample size is large, the results can be applied in a generalized manner to the entire dataset
provide more value than any other type of analytics and correspondingly require the most advanced skillset, as well as specialized software and tools
are absolute in nature and can therefore be used for numerical comparisons
is a data analysis technique that focuses on describing various data qualities using words
involves analyzing a smaller sample in greater depth compared to quantitative data analysis
the information is generated at periodic intervals in realtime or near realtime
these analysis results cannot be generalized to an entire dataset due to the small sample size
they also cannot be measured numerically or used for numerical comparisons
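As a minimal sketch of quantitative analysis, the snippet below quantifies the correlation between two hypothetical sets of observations; the resulting Pearson coefficient is an absolute number that can be used for numerical comparison:

```python
from math import sqrt

# Hypothetical observations: daily ad spend vs. units sold.
ad_spend = [10, 20, 30, 40, 50]
units = [12, 25, 29, 43, 52]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(units) / n

# Pearson correlation coefficient: quantifies the strength of the
# linear relationship between the two variables.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, units))
sd_x = sqrt(sum((x - mean_x) ** 2 for x in ad_spend))
sd_y = sqrt(sum((y - mean_y) ** 2 for y in units))
print(round(cov / (sd_x * sd_y), 3))  # close to 1.0 -> strong pattern
```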
policies for data privacy and data anonymization
aim to determine the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event
also known as data discovery, is a specialized form of data analysis that targets large datasets
refers to automated, software-based techniques that sift through massive datasets to identify patterns and trends
involves extracting hidden or unknown patterns in the data with the intention of identifying previously unknown insights
forms the basis for predictive analytics and business intelligence (BI)
Analysis & Analytics
based on the input data, the algorithm develops an understanding of which data belongs to which category
These techniques may not provide accurate findings in a timely manner because of the data's volume, velocity and/or variety
Analytics tools
enables multiple outcomes to be visualized by enabling related factors to be dynamically changed
are often carried out via ad-hoc reporting or dashboards
some realtime data analysis solutions that do exist are proprietary
can automate data analyses through the use of highly scalable computational technologies that apply automated statistical quantitative analysis, data mining and machine learning techniques
Types of Analytics
the adoption of a big data environment may necessitate that some or all of that environment be hosted within a cloud
descriptive analytics
diagnostic analytics
predictive analytics
prescriptive analytics
policies for data cleansing and filtering
Value and complexity increase as we move from descriptive to prescriptive analytics
This involves identifying patterns in the training data and classifying new or unseen data based on known patterns
is carried out to answer questions about events that have already occurred
Around 80% of analytics are ________ in nature
refers to the information about the source of the data that helps determine its authenticity and quality. It is also used for auditing purposes
provides the least value and requires a relatively basic skillset
The reports are generally static in nature and display historical data that is presented in the form of data grids or charts
Queries are executed on the OLTP systems or data obtained from various other information systems, such as CRMs and ERPs
are considered to provide more value than descriptive analysis, requiring a more advanced skillset
a substantial budget may still be required to obtain external data
usually require collecting data from multiple sources and storing it in a structure that lends itself to performing drill-downs and roll-ups
analytics results are viewed via interactive visualization tools that enable users to identify trends and patterns
can join structured and unstructured data that is kept in memory for fast data access
will be required to control how data flows in and out of big data solutions and how feedback loops can be established to enable the processed data to undergo repeated refinements
the executed queries are more complex compared to descriptive analytics, and are performed on multi-dimensional data held in OLAP systems
are carried out to attempt to determine the outcome of an event that might occur in the future
try to predict the event outcome and predictions are made based on patterns, trends and exceptions found in historical and current data
as big data initiatives are inherently business-driven, there needs to be a clear business case for adopting a big data solution to ensure that it is justified and that expectations are met
Graphically representing data can make it easier to understand reports, view trends and identify patterns
This can lead to the identification of risks and opportunities
involve the use of large datasets (comprised of both internal and external data), statistical techniques, quantitative analysis, machine learning and data mining techniques
may employ machine learning algorithms, such as unsupervised learning to extract previously unknown attributes
is considered to provide more value and require a more advanced skillset than both descriptive and diagnostic analytics
tools generally abstract the underlying statistical intricacies by providing user-friendly front-end interfaces
enables a detailed view of the data of interest by focusing in on a data subset from the summarized view
is the process of teaching computers to learn from existing data and apply the acquired knowledge to formulate predictions about unknown data
incorporate predictive and prescriptive data analytics and data transformation features
build upon the results of predictive analytics by prescribing actions that should be taken. The focus is on which prescribed option to follow, and why and when it should be followed, to gain an advantage or mitigate a risk
rely on BI and data warehouses as core components of big data environments and ecosystems
risks associated with collecting accurate and relevant data, and with integrating the big data environment itself, need to be identified and quantified
various outcomes are calculated, and the best course of action for each outcome is suggested
The approach shifts from explanatory to advisory and can include the simulation of various scenarios
incorporate internal data (current and historical sales data, customer information, product data, business rules) and external data (social media data, weather data, demographic data)
involve the use of business rules and large amounts of internal and/or external data to simulate outcomes and prescribe the best course of action
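A minimal sketch of the predictive idea: fit a trend line to hypothetical historical sales data and project the next period. Real predictive analytics would use far larger datasets and machine learning or statistical techniques, but the principle is the same:

```python
# Hypothetical historical data: monthly sales for the last six months.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 125, 130, 146, 158]

# Fit a least-squares trend line: patterns in historical data are used
# to estimate an outcome that might occur in the future.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
    / sum((x - mean_x) ** 2 for x in months)
)
intercept = mean_y - slope * mean_x

print(round(slope * 7 + intercept, 1))  # predicted sales for month 7
```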
machine learning
coupling a traditional data warehouse with these new technologies results in a hybrid data warehouse
machine learning types
even analyzing separate datasets that contain seemingly benign data can reveal private information when the datasets are analyzed jointly
supervised learning
unsupervised learning
algorithm is first fed sample data where the data categories are already known
having developed an understanding, the algorithm can then apply the learned behavior to categorize unknown data
data categories are unknown and no sample data is fed
Instead, the algorithm attempts to categorize data by grouping data with similar attributes together
unearths hidden patterns and relationships based on previously unknown attributes of data
is not "intelligent" as such because it only provides answers to correctly formulated questions
makes predictions by categorizing data based on known patterns
can use the output from data mining (identified patterns) for further data classification through supervised learning
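The two learning types can be sketched side by side. This is a deliberately simplified illustration with hypothetical data: a nearest-neighbour rule stands in for supervised learning, and a naive two-centroid split stands in for unsupervised clustering:

```python
# Supervised learning: sample data with known categories is fed first.
training = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

def classify(value):
    # Nearest-neighbour rule: apply the learned behaviour (known
    # examples) to categorize unknown data.
    nearest = min(training, key=lambda pair: abs(pair[0] - value))
    return nearest[1]

print(classify(7.5))  # -> "high"

# Unsupervised learning: no categories are given; group data with
# similar attributes together (naive two-centroid split).
points = [1.1, 1.9, 8.2, 9.0, 2.3]
c1, c2 = min(points), max(points)  # naive initial centroids
groups = {c1: [], c2: []}
for p in points:
    groups[c1 if abs(p - c1) < abs(p - c2) else c2].append(p)
print(groups)  # two clusters discovered from attribute similarity alone
```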
provide a holistic view of key business areas
Due to the volumes of data that some big data solutions are required to process, performance can sometimes become a concern
this is accomplished by categorizing data which leads to the identification of patterns
has advanced BI and data warehouse technologies and practices to a point where a new generation of these platforms has emerged
Traditional BI
queries and statistical formulae can then be applied as part of various data analysis tasks for viewing data in a user-friendly format, such as on a dashboard
utilizes descriptive and diagnostic analysis to provide information on historical and current events
correctly formulating questions requires an understanding of business problems and issues, and of the data itself
BI reports on KPIs
ad-hoc reports
dashboards
ad-hoc reporting
is a process that involves manually processing data to produce custom-made reports
the focus is usually on a specific area of the business, such as its marketing or supply chain management.
the generated custom reports are detailed and often tabular in nature
OLAP and OLTP data sources
each iteration can then help fine-tune processing steps, algorithms and data models to improve the accuracy of the result and deliver greater value to the business
Big data solutions require tools that can seamlessly connect to structured, semi-structured and unstructured data sources and are further capable of handling millions of data records
can be used by BI tools for both ad-hoc reporting and dashboards
in-house hardware resources are inadequate
are not turn-key solutions
performing analytics on datasets can reveal confidential information about organizations or individuals
the presentation of data is graphical in nature, such as column charts, pie charts and gauges
OLAP and OLTP
datasets that need to be processed reside in a cloud
are used by BI tools to display the information on dashboards
data warehouse and data marts
contain consolidated and validated information about enterprise-wide business entities
policies that regulate the kind of external data that can be acquired
cannot function effectively without data marts because they contain the optimized and segregated data required for reporting purposes
without data marts, data needs to be extracted from the data warehouse via an ETL process on an ad-hoc basis whenever a query needs to be run
Near-realtime data processing can be achieved by processing transactional data as it arrives and combining it with already summarized batch-processed data
uses data warehouses and data marts for reporting and data analysis, because they allow complex data analysis queries with multiple joins and aggregations to be issued
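A minimal sketch of the near-realtime pattern just described, with hypothetical batch totals and incoming transactions: summarized batch results are merged with transactional data as it arrives:

```python
# Batch layer: totals already summarized by the nightly batch job.
batch_totals = {"EU": 10500.0, "US": 8200.0}

# Speed layer: process transactional data as it arrives and merge it
# with the batch summary for a near-realtime view.
incoming = [("EU", 120.0), ("US", 75.0), ("EU", 60.0)]

live_totals = dict(batch_totals)
for region, amount in incoming:
    live_totals[region] = live_totals.get(region, 0.0) + amount

print(live_totals)  # {'EU': 10680.0, 'US': 8275.0}
```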
Big Data BI
each feedback cycle may reveal the need for existing steps to be modified, or new steps, such as pre-processing for data cleansing, to be added
policies for archiving data sources and analysis results
builds upon BI by acting on the cleansed, consolidated enterprise-wide data in the data warehouse and combining it with semi-structured and unstructured data sources
comprises both predictive and prescriptive analysis to facilitate the development of an enterprise-wide understanding of the way a business works
sound processes and sufficient skillsets for those who will be responsible for implementing, customizing, populating and using big data solutions are also necessary
analyses focus on multiple business processes simultaneously
analyses generally focus on individual business processes
it is important to accept that big data solutions are not necessary for all businesses
This helps reveal patterns and anomalies across a broader scope within the enterprise
It also leads to data discovery by identifying insights and information that may have been previously absent or unknown
requires the analysis of unstructured, semi-structured and structured data residing in the enterprise data warehouse
requires a "next-generation" data warehouse that use new features and technologies to store cleansed data originating from a variety of sources in a single uniform data format
this type of data warehouse acts as a uniform and central repository of structured, semi-structured and unstructured data that can provide tools with all of the data they require
this eliminates the need for tools to have to connect to multiple data sources to retrieve or access data
A next-generation data warehouse establishes a standardized data access layer across a range of data sources
Data Visualization
is a technique whereby analytical results are graphically communicated using elements like charts, maps, data grids, infographics and alerts
Traditional Data Visualization
the nature of the business may make external data very valuable. The greater the volume and variety of data, the higher the chances of finding hidden insights from patterns
provided mostly static charts and graphs in reports and dashboards
query data from relational databases, OLAP systems, data warehouses and spreadsheets to present both descriptive and diagnostic analytics results
contemporary data visualization
are interactive and can provide both summarized and detailed views of data
they are designed to help people who lack statistical and/or mathematical skills to better understand analytical results, without having to resort to spreadsheets
generally use in-memory analytical technologies that reduce the latency normally attributed to traditional, disk-based tools
Data Visualization Features
Aggregation
Drill-Down
Filtering
Roll-Up
What-if Analysis
provides a holistic and summarized view of data across multiple contexts
Big data solutions access data and generate data, all of which become assets of the business
helps focus on a particular set of data by filtering away the data that is not of immediate interest
groups data across multiple categories to show subtotals and totals
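The roll-up, filtering and drill-down features can be illustrated with a few lines over hypothetical detail rows:

```python
from collections import defaultdict

# Hypothetical detail rows: (region, product, revenue).
rows = [("EU", "A", 100), ("EU", "B", 50), ("US", "A", 80), ("US", "B", 40)]

# Roll-up: group across categories to show subtotals and a grand total.
subtotals = defaultdict(int)
for region, product, revenue in rows:
    subtotals[region] += revenue
print(dict(subtotals), "total:", sum(subtotals.values()))

# Filtering: keep only the data of immediate interest.
eu_only = [r for r in rows if r[0] == "EU"]

# Drill-down: from the summarized view back to the detail subset.
print(eu_only)  # -> [('EU', 'A', 100), ('EU', 'B', 50)]
```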
addressing concerns can require the annotation of data with source information and other metadata, when it is generated or as it arrives
also, the quality of the data targeted for processing by big data solutions needs to be assessed
advanced visualization tools
these tools eliminate the need for data pre-processing methods (such as ETL) and provide the ability to directly connect to structured, semi-structured and unstructured data sources
business justification
clear goals regarding the measurable business value of an enterprise's big data solution need to be set
anticipated benefits need to be weighed against risk and investments
big data frameworks
organizational prerequisites
in order for data analysis and analytics to be successful and offer value, enterprises need to have data management and big data governance frameworks
outdated, invalid or poorly identified data will result in low-quality input which, regardless of how good the big data solution is, will continue to produce low-quality output
the longevity of the big data environment also needs to be planned for
a roadmap needs to be defined to ensure that any necessary expansion or augmentation of the environment is planned out to stay in sync with the requirements of the enterprise
data procurement
the acquisition of big data solutions themselves can be economical, due to open-source platform availability and opportunities to leverage commodity hardware
external data sources include data markets and the government. Government-provided data, like geo-spatial data, may be free
most commercially relevant data will need to be purchased. Such an investment may be on-going in order to obtain updated versions of the datasets
privacy
this can lead to intentional or inadvertent breaches of privacy
addressing these privacy concerns requires an understanding of the nature of the data being accumulated and relevant data privacy regulations, as well as special techniques for data tagging and anonymization
big data security further involves establishing data access levels for different categories of users
some of the components of big data solutions lack the robustness of traditional enterprise solution environments when it comes to access control and data security
securing big data involves ensuring that data networks provide access to repositories that are sufficiently secured, via custom authentication and authorization mechanisms
provenance
maintaining provenance as large volumes of data are acquired, combined and put through multiple processing stages can be a complex task
data may also need to be annotated with the source dataset attributes and processing steps details as it passes through the data transformation steps
Limited Realtime Support
Dashboards and other applications that require streaming data and alerts often demand realtime or near-realtime data transmissions
Many contemporary open-source big data solutions and tools are batch-oriented, meaning support for streaming data analysis may either be limited or non-existent
Distinct performance challenges
Distinct governance requirements
governance framework
is required to ensure that the data and the solution environment itself are regulated, standardized and evolved in a controlled manner
what a big data governance framework would encompass
standardizing how data is tagged and the metadata used for tagging
Distinct methodology
upfront capital investment is not available
introduces remote environments that can host IT infrastructure for, among other things, large-scale storage and processing
the project is to be isolated from the rest of the business so that existing business processes are not impacted
the limits of available computing and storage resources used by an in-house Big Data solution are being reached
the big data initiative is a proof of concept