What's your granularity?

21st October 2012

Jeans image by Muffet on FlickrJean fibre image by Aeyoun on Flickr

When working with data, one thing gets me over and over again - granularity. Settling on the right level of granularity means striking a balance between detailed reporting and faster queries all while trying to maintain the elusive quality of simplicity and it's not an easy task.

So, what do I mean by granularity? In any data set, data points are collected at some level that is meaningful for their purpose. For example, in a bank we'd expect to keep the following information:

  • For each customer, we have one or more accounts
    • In each bank account, the customer can conduct multiple transactions
      • Each of these transactions can be with another customer or an external party.

In this example we have three levels of granularity: Customer, Account and Transaction.

The challenge comes when we want to prepare this data for analysis. Do we want to understand the movement of individual transactions, account balances, or net customer assets? If we construct an analysis environment built upon transactions we will have the most detailed level of information possible in the model. However, we will also have to maintain the largest data set and if we perform analysis over account balances we will have to repeatedly calculate this.

Ah-ha! I hear you cry, if we're interested in account balances, we simply add an aggregated field 'Account Balance' to the transaction field, or the account field. In some cases this will be the right answer, yet be aware that we now we have a value field that is at a different granularity from the rest of the transaction table and so queries designed to sum over fields will have to be designed such that they do not sum over these fields. In addition we're presented with new problems, like how to calculate the Account Balance, should it be the average/min/max balance over the last day/week/month?

So what's to do?

Granularity is not an insurmountable problem, it's just something that should be considered thoroughly when preparing data for analysis.

Some of the data analysis projects I've worked on require working on three key granularities which align to clusters of records and their matches against a set of search criteria. Essentially a data set of many records is scanned to identify certain characteristics that are referred to as matches. These records are then clustered into groups where they potentially relate to one another and analysis is performed on the clusters.

As the project progresses, the emphasis on reporting shifts analysis through the levels of granularity.

  1. Project start, ETL, management are interested in the number of records extracted and the properties of those records
  2. Feature extraction commences, management are interested in what types of features are most prevalent and the number of records that are have those features.
  3. Clustering begins, management are interested in how those matched records are clustering together and the analysis that has been performed on those clusters.

To assist with reporting and analysis and address this granularity shift, we developed key reporting tables that bridge these granularities (acting similarly to that 'Account Balance' field mentioned earlier).

Record Match Status

A row for each record, showing the number of each type of match. All match information is rolled up into the record but this is less of an issue as if detailed match level information is required, it would make more sense to query the match tables directly anyway.

Record Identifier # Match Type A # Match Type B # Match Type C # Match Total
12345 0 2 1 3
45678 1 0 0 1

Group Status

Again a row for each record, but this time the records carry the same values as all of the other records in the same group. While this is redundant, for small collections of flags and aggregate values as below it can make querying simpler and easier to communicate to team members as information about the flags and members of a group are stored together. Similarly to above, if more detailed match information was required, it would make sense to query the record or match tables instead.

Group Identifier Record Identifier # Primary Match # Group Category
ABCD 12345 A 2
ABCD 45678 A 2

Roundup

While not carrying a large amount of data, these tables were the most used tables for analysis and reporting across the project. With them it was quick to aggregate upwards or delve downwards and so they provide an ideal starting point for most reports.

Figuring out the granularity for your project will always be difficult and will be almost guaranteed to change over time so it's important to think about how you design your reporting solution and maintain flexibility.

If you have any insights into reporting to share, please let me know; it'd be great to share tips.

By Niall Napier

Back to the blog

comments powered by Disqus

This Article's Tags

View the whole tag cloud