Big Data and Its Impact on Statistical Analysis

August 1, 2012. The riddle-joke popular in the English-speaking world, "Why did the chicken cross the road?", is relevant to business analytics too (the correct answer is "To get to the other side", an example of enigmatic English humor). Finding answers to this question is not hard when there is only one chicken. But what if there are millions of chickens? And each chicken has a mobile device? And it "tweets" its every action, opinion, wish, photo and description of its breakfast? And what if the road is fitted with millions of sensors that track every step of every chicken?
Even with today's analytical tools, understanding why the birds decided to cross the road is quite hard, to say nothing of predicting when, where and why a chicken will go next.

IBM expert Jing Shyr argues that over the next few years the business analytics and statistics industry will face real challenges: a lack of skills, simplified analytical solutions, mobility and big data.

What exactly she means is explained below (the material was originally published in English):

Lack of Skills
Having been around the analytics industry for many years, I find it refreshing to see that businesses are taking statistics and data mining results and injecting them directly into the business (and into the business process itself). The Catch-22 is that while more and more organizations are realizing the benefits of analytics, professionals who understand how to not only capture but also analyze the tsunami of data created daily are hard to find: the work still requires training and a unique skill set.
A recent McKinsey Global Institute report indicates that over the next seven years the need for highly skilled business intelligence workers in the U.S. alone will dramatically exceed the available workforce – by as much as 60 percent.
It's nice to see that many universities around the world are expanding and strengthening analytics curricula (many with IBM's help) to meet the growing demand for skilled analytics professionals. IBM works with Northwestern University, Yale School of Management and DePaul University, among others.
Consumable Analytics
I often imagine a business analyst presenting results to an executive the same way I present to my students. When teaching a lesson on modeling, I often ask, "Do you see what I see?" Everyone stares with blank looks on their faces and says, "No! What do you see?"
Herein lies part of the problem. To help counteract the skills shortage, we have to make the software easier to use and make it consumable rather than strictly scientific. Communicating results is just as important as the results themselves. I strongly believe that statistical software needs to go through a revolution of its own and become as intuitive as a smartphone.
And speaking of smartphones...
Most statistical software produces an incredible number of very large tables and charts, which are extremely difficult to read in a mobile environment. I torture my eyes every time I try to read a report on my BlackBerry.
Consumability means anywhere, anytime and through any device. It's time we hold statistical software to a higher standard.
Big Data
Let me get back to the chickens for a moment.
The volume, velocity and variety of data today are seemingly overwhelming traditional statistical software. Not to be cliché, but Big Data is giving the statistics industry big problems.
Previously, if we wanted to analyze any data, we would follow the same logical flow: decide what we want to predict or classify, then build a model by bringing in all the predictors (independent variables). The number of predictors was often well below 100.
Today, however, we are dealing with thousands of different variables, which makes traditional statistical analysis a serious hurdle. Machine capacity is no longer sufficient, and many algorithms have been outpaced by the volume of data.
The challenge calls for a new process of data reduction before modeling. It also requires new computation algorithms that can handle millions of records and fields quickly in a distributed environment without passing the data back and forth multiple times.
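To make the idea of a single pass over distributed data concrete, here is a minimal Python sketch. It is an illustration only, not a method described in the article: it assumes each machine summarizes its own partition with Welford's running update, and that the partial summaries are then combined with the standard pairwise merge formula, so the raw records never travel twice. The names `Moments`, `summarize_partitions` and `informative_fields`, as well as the variance-threshold rule used as a stand-in for "data reduction", are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Iterable, List


@dataclass
class Moments:
    """Running count, mean and sum of squared deviations (m2) for one field."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> None:
        # Welford's update: each record is seen once, no second pass needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Moments") -> "Moments":
        # Combine two partial summaries without touching the raw data again.
        n = self.n + other.n
        if n == 0:
            return Moments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 when no records have been seen.
        return self.m2 / self.n if self.n else 0.0


def summarize(partition: Iterable[float]) -> Moments:
    # Work done locally on one machine: a single pass over its records.
    m = Moments()
    for x in partition:
        m.add(x)
    return m


def summarize_partitions(partitions: Iterable[Iterable[float]]) -> Moments:
    # Only the small summaries are combined, never the records themselves.
    return reduce(Moments.merge, (summarize(p) for p in partitions), Moments())


def informative_fields(stats: Dict[str, Moments], min_variance: float) -> List[str]:
    # Hypothetical reduction rule: keep fields whose variance exceeds a threshold.
    return [name for name, m in stats.items() if m.variance > min_variance]
```

With per-field summaries in hand, a reduction step such as `informative_fields` could drop near-constant fields before any model is built. A real pipeline would likely use richer criteria than variance alone.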
Most importantly, we don't need to be chicken when it comes to Big Data.
Creating new statistical techniques for Big Data will get us all to the other side of the road, and you'll never have to ask why.

Source: ibm.com