Big Data

The Potentials and Difficulties of Big Data

Data analytics

As described in the originating post, the term big data refers to data sets that exceed the ability of traditional tools to capture, manipulate, or process them—typically, those in the high terabyte range and beyond.

Big data needs to be considered in terms of how the data will be manipulated. The size of the data set will impact data capture, movement, storage, processing, presentation, analytics, reporting, and latency.

Traditional tools can quickly become overwhelmed by the large volume of big data. Latency—the time it takes to access the data—is as important a consideration as volume.

Suppose you need to run an ad hoc query or a predefined report against the large data set. A large data storage system is not a data warehouse, however, and it may not respond to queries in a few seconds. It is, rather, the organization-wide repository that stores all of the organization's data and the system that feeds the data warehouses used for management reporting.

One solution to the problems presented by very large data sets might be to discard parts of the data so as to reduce data volume, but this isn’t always practical.

Regulations might require that data be stored for a number of years, or competitive pressure could force you to save everything. Also, who knows what future benefits might be gleaned from historical business data? If parts of the data are discarded, the detail is lost and so too is any potential future competitive advantage.

Instead, a parallel processing approach can do the trick—think divide and conquer. In this ideal solution, the data is divided into smaller sets and is processed in a parallel fashion.
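As a minimal sketch of this divide-and-conquer idea (the chunk sizes, worker count, and per-chunk work here are illustrative assumptions, not from the source), Python's standard multiprocessing module can split a data set into smaller pieces and process them in parallel:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for real per-chunk work (filtering, aggregation, etc.)."""
    return sum(chunk)

def split(data, n_chunks):
    """Divide the data set into roughly equal chunks."""
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))      # toy stand-in for a large data set
    chunks = split(data, n_chunks=8)   # divide...
    with Pool(processes=8) as pool:
        partials = pool.map(process_chunk, chunks)  # ...and conquer in parallel
    total = sum(partials)              # combine the partial results
    print(total)
```

The same divide, process-in-parallel, combine pattern is what frameworks like Hadoop's MapReduce apply across thousands of servers rather than across local processes.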

What would you need to implement such an environment? For a start, you need a robust storage platform that’s able to scale to a very large degree (and at reasonable cost) as the data grows and one that will allow for system failure.

Processing all this data may take thousands of servers, so these systems must be affordable enough to keep the cost per unit of storage reasonable.

In licensing terms, the software must also be affordable because it will need to be installed on thousands of servers. Further, the system must offer redundancy in terms of both data storage and hardware used.

It must also operate on commodity hardware, such as generic, low-cost servers, which helps to keep costs down. It must additionally be able to scale to a very high degree because the data set will start large and will continue to grow.

Finally, a system like this should take the processing to the data, rather than expect the data to come to the processing. If the latter were to be the case, networks would quickly run out of bandwidth.
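A rough back-of-the-envelope calculation shows why (the figures here are illustrative assumptions, not from the source): shipping a 100 TB data set across a single 10 Gb/s network link takes the better part of a day, whereas shipping a few megabytes of program code to the servers that already hold the data is near-instant.

```python
# Illustrative assumptions: 100 TB data set, one 10 Gb/s network link.
data_bytes = 100 * 10**12          # 100 TB
link_bits_per_s = 10 * 10**9       # 10 Gb/s
seconds = data_bytes * 8 / link_bits_per_s
print(f"{seconds / 3600:.1f} hours to move the data")  # ~22.2 hours
```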

Requirements for a Big Data System

This idea of a big data system requires a tool set that is rich in functionality. For example, it needs a unique kind of distributed storage platform that is able to move very large data volumes into the system without losing data.

The tools must include some kind of configuration system to keep all of the system servers coordinated, as well as ways of finding data and streaming it into the system in some type of ETL-based stream.

ETL—extract, transform, load—is a data warehouse processing sequence.
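A minimal sketch of that three-step sequence (the record fields, cleaning rules, and in-memory "warehouse" here are invented for illustration):

```python
import csv
import io

def extract(source):
    """Extract: read raw records from a CSV source."""
    return list(csv.DictReader(source))

def transform(records):
    """Transform: clean and reshape records for the target schema."""
    return [
        {"name": r["name"].strip().title(), "sales": int(r["sales"])}
        for r in records
    ]

def load(records, target):
    """Load: append the transformed records to the target store."""
    target.extend(records)

warehouse = []  # stand-in for a warehouse table
raw = io.StringIO("name,sales\n alice ,120\n BOB ,85\n")
load(transform(extract(raw)), warehouse)
print(warehouse)
# [{'name': 'Alice', 'sales': 120}, {'name': 'Bob', 'sales': 85}]
```

In a big data system the same pattern applies, but each stage runs at much larger scale and streams data in rather than reading a single file.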

Software also needs to monitor the system and to provide downstream destination systems with data feeds so that management can view trends and issue reports based on the data.

While this big data system may take hours to move an individual record, process it, and store it on a server, it also needs to monitor trends in real time.

In summary, to manipulate big data, a system requires the following:

  • A method of collecting and categorizing data
  • A method of moving data into the system safely and without data loss
  • A storage system that
    • Is distributed across many servers
    • Is scalable to thousands of servers
    • Will offer data redundancy and backup
    • Will offer redundancy in case of hardware failure
    • Will be cost-effective
  • A rich tool set and community support
  • A method of distributed system configuration
  • Parallel data processing
  • System-monitoring tools
  • Reporting tools
  • ETL-like tools (ideally with a graphical interface) that can be used to build tasks that process the data and monitor their progress
  • Scheduling tools to determine when tasks will run and show task status
  • The ability to monitor data trends in real time
  • Local processing where the data is stored, to reduce network bandwidth usage

Adapted from Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset by Michael Frampton.