Data Warehousing whens and whys
(mildly obfuscating for confidentiality) I maintain a database of retail location information. Stuff like address, phone number, etc. for said retail locations. I like to think of data warehousing possibilities since that is a field I may be immersed in soon (and because I find it to be interesting), even though this particular database is not in a data warehouse. So here's the thing I'm contemplating at the moment, though I'm having trouble getting it all into words:
First, lets assume that my table looks like this, and I'll load some data in it:
(Note above that I've learned a couple of ways to capture time periods: one is to put NULLs in the enddate column, the other is to put a far-flung date (preferably the largest the system can handle). I do the latter, because with the former you end up CASEing the nulls so you can do comparisons anyway. I've seen arguments for both, but the arguments eventually just say, pick one and use it consistently.)
Say a location changes their address. It seems to me that there are two facts I need to capture with three circumstances. First of all, we need to change the address. All right, in this system we just add a new row. We'll move location 20 to a new suite within their mall, and they did it on May 1st of 2004.
Peachy! Assuming you query this table with a where clause that has the current date (i.e. CURRENT_TIMESTAMP), the next query after that row is inserted will show the new address. But here's the thing I'm wondering: do you need to capture when the change was made to the database in addition to when the new address was valid, and if so what's the best way to do it? For instance, what if this data wasn't entered until October 1st? That means that the address was being incorrectly reported all that time. But if you just look at the data in November, it looks like it's been fine all that time.
It seems that capturing when the change was made would be valuable at minimum to answer the question, "why have all the shipments been going to the wrong address?" It seems intuitive to me that there are other questions that could be answered, though I can't think of any at this second.
This is an easy example for illustration. I'm thinking that this would be a much bigger deal with something like sales. So you report to your investors that your retail location sold four thousand dollars worth of goods in a month, say from a table that has an entry of location sales by day. But the fact is that your cash registers were screwed up and your sales didn't get transmitted for a full week. It seems like you need to be able to say that "hey, this was what we said but this is what it really is," or "we're going to report those lost sales in the next month" or something like that. It seems like you have to set everything up so that you can both query what was "real" at a given point in time and also query what the database "said" was real at a given point in time.
Is that so? If so, what's the best way to do it? I'm going to find that out.
First, lets assume that my table looks like this, and I'll load some data in it:
| LeaseId | Address | AddressStartDate | AddressEndDate |
| 10 | 1234 Main Street | 2000-01-01 | 9999-12-31 |
| 20 | 2345 Your Mall Blvd Suite 100 | 2000-01-01 | 9999-12-31 |
(Note above that I've learned a couple of ways to capture time periods: one is to put NULLs in the enddate column, the other is to put a far-flung date (preferably the largest the system can handle). I do the latter, because with the former you end up CASEing the nulls so you can do comparisons anyway. I've seen arguments for both, but the arguments eventually just say, pick one and use it consistently.)
Say a location changes their address. It seems to me that there are two facts I need to capture with three circumstances. First of all, we need to change the address. All right, in this system we just add a new row. We'll move location 20 to a new suite within their mall, and they did it on May 1st of 2004.
| LeaseId | Address | AddressStartDate | AddressEndDate |
| 10 | 1234 Main Street | 2000-01-01 | 9999-12-31 |
| 20 | 2345 Your Mall Blvd Suite 100 | 2000-01-01 | 2004-04-30 |
| 20 | 2345 Your Mall Blvd Suite 205 | 2004-05-01 | 9999-12-31 |
Peachy! Assuming you query this table with a where clause that has the current date (i.e. CURRENT_TIMESTAMP), the next query after that row is inserted will show the new address. But here's the thing I'm wondering: do you need to capture when the change was made to the database in addition to when the new address was valid, and if so what's the best way to do it? For instance, what if this data wasn't entered until October 1st? That means that the address was being incorrectly reported all that time. But if you just look at the data in November, it looks like it's been fine all that time.
It seems that capturing when the change was made would be valuable at minimum to answer the question, "why have all the shipments been going to the wrong address?" It seems intuitive to me that there are other questions that could be answered, though I can't think of any at this second.
This is an easy example for illustration. I'm thinking that this would be a much bigger deal with something like sales. So you report to your investors that your retail location sold four thousand dollars worth of goods in a month, say from a table that has an entry of location sales by day. But the fact is that your cash registers were screwed up and your sales didn't get transmitted for a full week. It seems like you need to be able to say that "hey, this was what we said but this is what it really is," or "we're going to report those lost sales in the next month" or something like that. It seems like you have to set everything up so that you can both query what was "real" at a given point in time and also query what the database "said" was real at a given point in time.
Is that so? If so, what's the best way to do it? I'm going to find that out.
