I get to do random weird stuff in my job. Sometimes I’m working from really high-level customer satisfaction data and making recommendations about how to improve the customer experience on Microsoft.com. Other times I have to roll up my sleeves and get into the rawest of raw data before I can get to the recommendations. I just got done wading through some pretty low-level log data with SQL and thought I’d share how I was able to munge it into usable information. If you’re not a SQL geek, feel free to skip.
I have a raw data set that consists of information about clusters and tests. One or more tests run periodically against each cluster in a set of clusters. Each time a test runs, a set of information is recorded: the server ID and test ID, the date and time, and the status returned by the test (essentially a pass/fail), among other parameters.
Here’s a sample:
Server ID | Test ID | Date and Time | Status |
1 | 1 | 9/15/2003 00:01:23.456 | Pass |
1 | 1 | 9/15/2003 00:01:24.540 | Fail |
1 | 1 | 9/15/2003 00:01:25.006 | Pass |
1 | 1 | 9/15/2003 00:01:28.456 | Pass |
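(In case you want to follow along, here’s roughly the shape of the table and the sample rows above as SQL. The table and column names match the query further down; the data types are my best guess.)
-- Approximate shape of the raw log table (types are a guess)
CREATE TABLE EventHeap (
    EventHeapID  int IDENTITY(1,1) PRIMARY KEY,  -- increasing row ID, so later rows have bigger IDs
    server_id    int          NOT NULL,
    TestID       int          NOT NULL,
    DateAndTime  datetime     NOT NULL,
    Status       varchar(10)  NOT NULL            -- 'Pass' or 'Fail'
)

-- The sample rows from the table above
INSERT INTO EventHeap (server_id, TestID, DateAndTime, Status) VALUES (1, 1, '2003-09-15 00:01:23.456', 'Pass')
INSERT INTO EventHeap (server_id, TestID, DateAndTime, Status) VALUES (1, 1, '2003-09-15 00:01:24.540', 'Fail')
INSERT INTO EventHeap (server_id, TestID, DateAndTime, Status) VALUES (1, 1, '2003-09-15 00:01:25.006', 'Pass')
INSERT INTO EventHeap (server_id, TestID, DateAndTime, Status) VALUES (1, 1, '2003-09-15 00:01:28.456', 'Pass')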
This data is pretty useful, but what I really want to know is: when a test fails, how long does it take before it starts passing again? And how many times does that happen per day? Per week?
So I started trying to aggregate the data into something that would look like this:
Server ID | Test ID | Event Start | Event End | Duration (sec) |
1 | 1 | 9/15/2003 00:01:24.540 | 9/15/2003 00:01:25.006 | 0.466 |
It turned out to be trickier than I thought. What I ended up doing was joining the table to itself to get the beginning and end times (and therefore the duration), then using a NOT EXISTS clause to screen out the many cases where one failure is followed by multiple passing rows. Without that filter, the table above would show two events, both starting at the same time, but one ending at 00:01:25.006 and the other at 00:01:28.456.
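To show why that filter matters, here’s a stripped-down sketch of just the self-join, without the subquery. Run against the sample rows above, it would return both of those events for the single failure:
-- Naive version: every later 'Pass' pairs with the 'Fail', so one outage shows up multiple times
select eh1.server_id,
       eh1.TestID,
       eh1.DateAndTime as DateStart,
       eh2.DateAndTime as DateEnd
from EventHeap eh1
     INNER JOIN EventHeap eh2
        ON ( eh1.server_id = eh2.server_id AND eh1.TestID = eh2.TestID )
WHERE eh1.EventHeapID < eh2.EventHeapID
  AND eh1.Status = 'Fail'
  AND eh2.Status = 'Pass'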
Here’s the query I used to make it all work:
select eh1.server_id,
       eh1.TestID,
       eh1.DateAndTime as DateStart1,
       eh2.DateAndTime as DateEnd2,
       -- milliseconds divided by 1000.0 so sub-second durations survive (ss would round to whole seconds)
       DateDiff(ms, eh1.DateAndTime, eh2.DateAndTime) / 1000.0 as Duration
from EventHeap eh1
     -- eh1 is a failing row; eh2 is a later passing row for the same server and test
     INNER JOIN EventHeap eh2
        ON ( eh1.server_id = eh2.server_id AND eh1.TestID = eh2.TestID )
     -- eh3 adds no filtering of its own; the GROUP BY below collapses the duplicate rows it creates
     INNER JOIN EventHeap eh3
        ON ( eh1.server_id = eh3.server_id AND eh1.TestID = eh3.TestID )
WHERE eh1.EventHeapID < eh2.EventHeapID
  AND eh1.Status = 'Fail'
  AND eh2.Status = 'Pass'
  -- keep only the first 'Pass' after the failure: no other 'Pass' may sit between eh1 and eh2
  AND NOT EXISTS
      ( select * from EventHeap eh4
        WHERE eh4.server_id = eh1.server_id
          AND eh4.TestID = eh1.TestID
          AND eh4.EventHeapID >= eh1.EventHeapID
          AND eh4.EventHeapID < eh2.EventHeapID
          AND eh4.Status = 'Pass' )
GROUP BY eh1.server_id, eh1.TestID, eh1.DateAndTime, eh2.DateAndTime
ORDER BY eh1.server_id, eh1.TestID, DateStart1
The biggest problem I have now is performance; running the aggregation on 57,000 rows takes a while. But the end result is much more usable data.
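An index covering the columns the self-join and the NOT EXISTS subquery filter on might help; something along these lines (an untested guess, and the index name is made up):
-- Untested guess at an index to speed up the self-join and the NOT EXISTS probe
CREATE INDEX IX_EventHeap_ServerTest
    ON EventHeap (server_id, TestID, EventHeapID, Status)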
Mandatory disclaimer: This posting is provided “AS IS” with no warranties, and confers no rights. Use of the included script samples is subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.