When working with data, one thing gets me over and over again: granularity. Settling on the right level of granularity means striking a balance between detailed reporting and fast queries, all while trying to maintain the elusive quality of simplicity, and it's not an easy task.
So, what do I mean by granularity? In any data set, data points are collected at some level that is meaningful for their purpose. For example, in a bank we'd expect to keep the following information:

- Customer details
- The accounts each customer holds
- The transactions on each account

In this example we have three levels of granularity: Customer, Account and Transaction.
The challenge comes when we want to prepare this data for analysis. Do we want to understand the movement of individual transactions, account balances, or net customer assets? If we build an analysis environment on top of transactions, we will have the most detailed level of information possible in the model. However, we will also have to maintain the largest data set, and if we perform analysis over account balances we will have to calculate them repeatedly.
"Ah-ha!" I hear you cry, "if we're interested in account balances, we simply add an aggregated field 'Account Balance' to the transaction table, or the account table." In some cases this will be the right answer, yet be aware that we now have a value field at a different granularity from the rest of the transaction table, so queries designed to sum over fields must be designed not to sum over these ones. In addition, we're presented with new problems, like how to calculate the Account Balance: should it be the average/min/max balance over the last day/week/month?
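To make the mixed-granularity trap concrete, here is a minimal sketch in pandas (the table and column names are hypothetical): if an account-level 'Account Balance' is denormalised onto every transaction row, a naive sum over that column counts each balance once per transaction rather than once per account.

```python
import pandas as pd

# Hypothetical transaction-level table with a denormalised,
# account-level 'account_balance' column repeated on every row.
transactions = pd.DataFrame({
    "account_id":      ["A1", "A1", "A1", "A2"],
    "amount":          [100, -40, 25, 500],
    "account_balance": [85, 85, 85, 500],  # same value per account
})

# Summing the transaction-grain column is fine:
total_moved = transactions["amount"].sum()           # 585

# Naively summing the account-grain column over-counts,
# once per transaction in the account:
wrong_total = transactions["account_balance"].sum()  # 755

# A correct aggregation must first collapse to account grain:
right_total = (transactions
               .drop_duplicates("account_id")["account_balance"]
               .sum())                               # 585
```

The same trap exists in SQL: any `SUM` over a denormalised column needs a `DISTINCT` or a subquery at the coarser grain first.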
Granularity is not an insurmountable problem, it's just something that should be considered thoroughly when preparing data for analysis.
Some of the data analysis projects I've worked on require working at three key granularities, which align to clusters of records and their matches against a set of search criteria. Essentially, a data set of many records is scanned to identify certain characteristics that are referred to as matches. These records are then clustered into groups where they potentially relate to one another, and analysis is performed on the clusters.
As the project progresses, the emphasis of reporting and analysis shifts through the levels of granularity.
To assist with reporting and analysis and address this granularity shift, we developed key reporting tables that bridge these granularities (acting similarly to that 'Account Balance' field mentioned earlier).
A row for each record, showing the number of each type of match. All match information is rolled up into the record, but this is less of an issue because, if detailed match-level information is required, it makes more sense to query the match tables directly anyway.
| Record Identifier | # Match Type A | # Match Type B | # Match Type C | # Match Total |
|---|---|---|---|---|
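A sketch of how such a record-level rollup might be built in pandas (the match table and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical match-level table: one row per match found in a record.
matches = pd.DataFrame({
    "record_id":  ["R1", "R1", "R1", "R2"],
    "match_type": ["A", "A", "B", "C"],
})

# Pivot to one row per record, one count column per match type,
# then add a row total.
record_rollup = (matches
                 .pivot_table(index="record_id", columns="match_type",
                              aggfunc="size", fill_value=0)
                 .add_prefix("n_match_"))
record_rollup["n_match_total"] = record_rollup.sum(axis=1)
print(record_rollup)
```

Because the rollup is one row per record, it can be summed or joined freely without the mixed-granularity caveat above.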
Again a row for each record, but this time each record carries the same values as all the other records in the same group. While this is redundant, for a small collection of flags and aggregate values like those below it can make querying simpler and easier to communicate to team members, as information about the flags and the members of a group is stored together. As above, if more detailed match information is required, it makes sense to query the record or match tables instead.
| Group Identifier | Record Identifier | # Primary Match | # Group Category |
|---|---|---|---|
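Broadcasting a group-level value down onto every record in the group is a one-liner with groupby/transform in pandas (again, all names here are hypothetical):

```python
import pandas as pd

# Hypothetical record-level table with group membership and a match count.
records = pd.DataFrame({
    "group_id":  ["G1", "G1", "G2"],
    "record_id": ["R1", "R2", "R3"],
    "n_matches": [3, 1, 0],
})

# Copy the group-level aggregate onto every record in the group, so
# group flags and group members live in the same table.
records["n_primary_match"] = (records.groupby("group_id")["n_matches"]
                                     .transform("sum"))
print(records)
```

Note that `n_primary_match` is now a group-grain column sitting in a record-grain table, exactly the redundancy described above, so it should be read per group, not summed across records.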
While not carrying a large amount of data, these tables were the most used tables for analysis and reporting across the project. With them it was quick to aggregate upwards or delve downwards, so they provided an ideal starting point for most reports.
Figuring out the right granularity for your project will always be difficult, and it is almost guaranteed to change over time, so it's important to think about how you design your reporting solution and to maintain flexibility.
If you have any insights into reporting to share, please let me know; it'd be great to share tips.
By Niall Napier