Thursday, February 19, 2015

Big Unstructured Data v/s Structured Relational Data



Big data has opened doors never before considered by many businesses. The idea of utilizing unstructured data for analysis has in the past been far too expensive for most companies to consider. Thanks to technologies such as Hadoop, unstructured data analysis is becoming more common in the business world.

Business owners may be wondering whether their current use of data warehousing could give them insights as versatile as those from big data. To understand the current scenario and future possibilities, let's start with the difference between structured and unstructured data.

Structured Data


Data that resides in a fixed field within a record or file is called structured data. This includes data contained in relational databases and spreadsheets. Although data in XML files are not fixed in location like traditional database records, they are nevertheless structured, because the data are tagged and can be accurately identified. Structured data first depends on creating a data model – a model of the types of business data that will be recorded and how they will be stored, processed and accessed. This includes defining what fields of data will be stored and how that data will be stored: data type (numeric, currency, alphabetic, name, date, address) and any restrictions on the data input. Structured data has the advantage of being easily entered, stored, queried and analyzed. 
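To make the idea of a data model concrete, here is a minimal sketch in Python. The `SaleRecord` type, its field names and its validation rules are hypothetical, chosen only to illustrate fixed fields, data types and input restrictions; they do not come from any particular system.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class SaleRecord:
    """One row of a hypothetical sales table: every field has a fixed name and type."""
    customer_name: str   # alphabetic / name
    sale_date: date      # date
    amount: Decimal      # currency
    quantity: int        # numeric

    def __post_init__(self):
        # Restrictions on data input, checked when the record is created.
        if not self.customer_name.strip():
            raise ValueError("customer_name must not be empty")
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")
        if self.amount < 0:
            raise ValueError("amount must not be negative")


records = [
    SaleRecord("Acme Ltd", date(2015, 2, 1), Decimal("199.90"), 2),
    SaleRecord("Bolt Inc", date(2015, 2, 3), Decimal("49.50"), 1),
]

# Because the fields are fixed, a query is a simple filter on a named field.
total_for_acme = sum(r.amount for r in records if r.customer_name == "Acme Ltd")
```

Since every record follows the same model, entering, querying and analyzing the data needs no guesswork about where a value lives or what type it has.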



Unstructured Data

Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in fielded form in databases or annotated in documents. Some examples of unstructured data are photos and graphic images, videos, streaming instrument data, webpages, pdf files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.

Present state of data
Today, multinational companies and large organizations have operations scattered around the world. Each location may generate large amounts of both structured and unstructured data. These organizations need rapid access to insights and cannot afford to wait, or they lose their competitive edge. For IT organizations, this means delivering relevant, timely insights faster than ever before. Thus, data creation, storage, retrieval and analysis vary in terms of volume, variety and velocity.
Volume:
Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, organizations store any and all data that may seem relevant at the moment. For example, insurance companies may have data from thousands of local and external branches, and large retail chains have data from hundreds or thousands of stores. Corporate decision makers require access to information from all such sources. But this is not simple, because such a huge volume of data is hard to understand and use.
Variety:
Today data isn't just numbers, dates, and strings. It is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure. As applications have evolved to serve large volumes of users, and as application development practices have become agile, the traditional use of the relational database has become a liability for many companies rather than an enabling factor in their business.
Velocity:
Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

Data Warehouse


A data warehouse is defined as a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. In this definition the data is:
• Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than major application areas (such as customer invoicing, stock control, and product sales). A data warehouse is designed to hold data that supports decision making rather than application-oriented data.
• Integrated, because the source data comes together from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data must be made consistent to present a unified view of the data to the users.
• Time-variant, because data in the warehouse is only accurate and valid at some point in time or over some time interval.
• Non-volatile, as the data is not updated in real time but is refreshed on a regular basis from different data sources. New data is always added as a supplement to the database, rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
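
The time-variant and non-volatile properties can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `customer_balance` table, its columns and the `load_snapshot` helper are hypothetical names made up for this example; a real warehouse would use its own schema and loading tools.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_balance ("
    " customer_id INTEGER, balance REAL, snapshot_date TEXT)"
)


def load_snapshot(conn, snapshot_date, rows):
    """Append one periodic refresh; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO customer_balance VALUES (?, ?, ?)",
        [(customer_id, balance, snapshot_date) for customer_id, balance in rows],
    )
    conn.commit()


# Two regular refreshes: the second supplements the first instead of replacing it.
load_snapshot(conn, "2015-01-31", [(1, 100.0), (2, 250.0)])
load_snapshot(conn, "2015-02-28", [(1, 80.0), (2, 300.0)])

# Each value is valid for its snapshot date, so the history stays queryable.
history = conn.execute(
    "SELECT snapshot_date, balance FROM customer_balance"
    " WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()
```

Here `history` holds one row per refresh for customer 1, which is what lets a decision maker compare one period against another.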

Interesting things to note from the definition are:



Limitations of Data Warehouse from data perspective

While a data warehouse works well with structured data, it is far from able to handle unstructured data such as images, videos, emails and webpages. Some of the data comes in the form of Excel spreadsheets or PowerPoint presentations. There is no easy way to access such data, and gathering it and creating reports requires intensive manual processing. Also, with the current excitement about big data, when organizations are leaving no stone unturned to gain even a small competitive edge, data warehousing is at a disadvantage.

Furthermore, data is hosted on various systems, which creates silos of information. Populating the warehouse requires extracting, transforming and loading the data (ETL), processes which are quite time consuming. Thus, a data warehouse is not well suited to processing real-time data.
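
As a rough illustration of the extract, transform and load steps, the sketch below pulls rows from two made-up source systems that record the same facts in different formats, converts them to one consistent shape, and loads them into a single table. The source names (`store_a`, `store_b`), their fields and the `sales` table are all assumptions for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import datetime

# Extract: two hypothetical source systems with inconsistent formats.
store_a = [{"sold_on": "19/02/2015", "amount": "120.50"}]   # day/month/year, text amount
store_b = [{"date": "2015-02-19", "total_cents": 9900}]     # ISO date, integer cents


def transform_a(row):
    """Convert a store_a row to (ISO date, amount as float)."""
    day = datetime.strptime(row["sold_on"], "%d/%m/%Y").date().isoformat()
    return (day, float(row["amount"]))


def transform_b(row):
    """Convert a store_b row to (ISO date, amount as float)."""
    return (row["date"], row["total_cents"] / 100)


# Transform: bring both sources into one consistent shape.
unified = [transform_a(r) for r in store_a] + [transform_b(r) for r in store_b]

# Load: write the unified rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", unified)
warehouse.commit()

daily_total = warehouse.execute(
    "SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date"
).fetchall()
```

Even in this toy form, each new source needs its own transform function before it can be loaded, which hints at why the process is slow to extend.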

Other Limitations

One of the problems with data warehouses is their cost. Like all advanced technology, when data warehouses were first introduced, only the truly wealthy companies could afford them. Even today, most data warehouses are outside the price range of many companies. While vendors in recent years have begun tailoring their products towards small and medium-sized businesses, many of these companies may not see the need for a system that is overly complex.
Another problem is that, in the past, it wasn't uncommon for a data warehouse project to take many months to implement. Most firms today want results, and they want them fast. They don't want to wait months for a system, and it takes still more time before a company begins to see a return on its investment. Many firms simply don't have the patience to wait for these returns.

Future of Data warehouse

Automation

Data warehouses are facing strong competition from the rising “data lake” architecture based on Hadoop. Data lakes provide cost savings on software and storage, and newer organizations are adopting this strategy for economic reasons. However, data lakes specifically, and Hadoop in general, have the downside of a long “time to implementation”. Data warehouses will face huge changes from the world of data warehouse automation. Just as we no longer hand-code ETL scripts, we can expect data modeling and database administration to be productized in the future, to shorten time to implementation, increase efficiency and optimize the use of resources.

Data warehouse with real time dashboards

Today’s data warehouses are not moving at the speed of the business. It takes a long time to integrate a new data source into your data warehouse. You have to figure out what reports you’re going to want so you can pre-define data dimensions for aggregation. You have to figure out a schema that can accommodate all the data you’re going to include. You have to set up ETL to translate your operational data into that analytic schema, and you have to maintain separate technology stacks at the operational, analytic, and archive tiers. This kind of traditional data warehouse is resistant to change. The trend is towards operationalizing the data in the data warehouse. This means building data services that can combine data from multiple sources and deliver that data securely and quickly to an operational process, so that the process can complete in real time. Fraud detection, eligibility for benefits, and customer onboarding are all examples of use cases that used to be performed offline but now need to be performed online in real time.

