Карта сайта
Версия для печати

«Double counting»: Устранение ошибок при анализе данных

12 ноября 2013 Типичная ошибка, избежать которую иногда не могут даже профессионалы, - это использование дублирующихся данных (“double counting”). Читайте статью Тины Грувз, менеджера проектов по работе с Большими Данными для бизнес-аналитики компании IBM и узнайте, как вовремя заметить и предотвратить появление повторяющейся информации. (Материал опубликован на английском языке)
A common mistake when analyzing operational data is double counting. By operational data, I mean data which contains redundancies and dependencies which haven’t been normalized -- in other words, data commonly found in Excel extracts or application tables. Even accountants and those with the training to navigate this problem can be challenged by double counting, particularly when the data being analyzed joins data from different sources or when the business context requires interpretation.

How serious is double counting? In a world being measured more and more, contracts which reference metrics as part of a guarantee or implied promise are causing corporations to assess their liability and risk. For example, a ruling of the Federal Energy Regulatory Commission, Feb 2012, upheld PJM’s assertion that EnerNOC double-counted its customers’ participation in certain demand response programs.  This dispute lasted months, battering EnerNOC’s stock price and resulting in resignations by EnerNOC’s COO and CFO.

What is Double Counting?
Double counting describes a situation in which results are overstated due to referencing a transaction more than once. Double counting does not imply that duplicate records exist.

In my experience, double counting typically occurs when:
  1. Data values are repeated in a column  e.g., the same customer number is found in multiple rows.
  2. Multiple identifiers for the same “noun” or entity exist -- e.g., one customer has multiple customers numbers, possibly due to different applications, a system migration or different locations being assigned different numbers.
  3. A contextual element or business rule is overlooked or not represented in the data.
  4. A complex relationship in the data exists, making relating or joining data in calculations challenging.
  5. Most of these problems are related to the technical aspects of data management and, as a result, requires a technical approach to solve them.

Because the unique identifier for each row is the invoice number on the far left, we can state that this data is about Invoices.  As a result, answering questions about invoices and only invoices is straight-forward.  Here, the scope of the question matches the scope of the data.

Where the trouble begins is when one asks questions about invoices in the context of another entity such as Customer, Product or Location.  

The Double Counting Challenge
Using these six rows, I’ll illustrate common double counting challenges in trying to answer “how many customers?” in the next series of posts. Providing accurate counts, in context of the business question, is necessary to calculate business metrics such as average deal size, commission payouts or market growth.

Lesson 1:  Repeated Values
If we are interested in understanding Customers rather than Invoices, how does changing which entity referenced affect how the answer is calculated?

Let’s ask “How many customers placed orders in July and August?”  Commonly used expression:

COUNT (Customer Number) = 6


COUNT ALL (Customer Number) = 6 
In general, whether it’s SQL or Excel, the Count() expression returns the number of ALL non-null values.  Using the data in Table 1, all 6 instances of Customer Number contain a value; therefore, the result is 6. 

This expression is really answering the question “How many Invoices in July and August have customers?” Because the data is uniquely identified by invoice number, not customer number, the implied context is Invoices.

But, that’s not the question we asked!  Our question is “How many Customers placed invoices in July and August?”

When examining the data by customer, we see that one customer, 353535, has two invoices, one in July and another in August.  The answer of ‘6’ is an overstated result, a clear example of double counting.  

To shift the context from Invoices to Customers, a qualifier is required:

COUNT DISTINCT (Customer Number) =5
COUNT DISTINCT (or sometimes COUNT UNIQUE) returns the number of unique, non-null values.  COUNT DISTINCT may impact performance.  See Tips for more info.

When the entity referenced in the question (e.g., Customers) differs from what uniquely identifies the rows (e.g., Invoices), always use the DISTINCT or UNIQUE qualifier with summary or aggregate functions such as COUNT().

  • If DISTINCT or UNIQUE qualifiers are not available, then alternatives are COUNTIF() or AVERAGEIF()expressions, a HAVING/OVER clause, or a conditional construct such as IF() or CASE().
  • For large or complex data sets, queries which require uniqueness may impact performance.  Often, adjustments in the data structure specification are required to optimize processing.  
Lesson 2:  Multiple Identifiers for the Same Entity
Consider Lesson 1.  ‘5’ is definitely a better answer than ‘6’ to the question “how many customers?”, but is it the right answer?  Let’s look at the same invoice data supplemented with customer name and sales region:

Table   Invoice Table with Customer Name and Sales Region:

Invoice   Invoice DateCustomer NumberCustomer Name*Billing AddressSales RegionProduct CdProduct UnitsUnit Price (USD)
A1002013/07/15121212 Acme USANABLK-40-51508.50
A1012013/07/1862241BPartners PlusKoreaAsia3A3,00018.00
B1012013/08/12353535Acme AGGermanyEuropeBLK-40-5859.50
B1302013/08/16151515Partners PlusUSANA3B87022.00
C1012013/09/10353535Acme AGGermanyEuropeBLK-40-545

* Note: Acme AG has its headquarters in Germany with offices in the USA and UK.  The different spellings reflect local naming conventions for corporations.

With this new information, let’s revisit our question “How many customers placed orders in July and August?”  The results now differ:
  • Counting distinct customer numbers = 5
  • Counting unique customer names = 3
Technically, both answers are accurate.  That is to say, the results can be mathematically validated.  Achieving accurate results is the first step.  Providing contextual results is the next one.

Align the data to the answer before counting or aggregating.   In our example, most business users would think of Customers by Customer Name, not Customer Number.  

Source:  ibm.com