Mining Google Voice, Part II: The Data

March 27, 2012

A few months ago, I got an interesting idea: what if I took all of the texts I’d sent in the last year and tried to analyze that data? Now that I’m on break again, I decided to pick this project back up. I’ve written a pretty basic parser (in Python!) for Google Voice’s XML format, which lets me pull some initial findings out of the massive (well, if 10 MB can be called massive) dataset.

But before I get into what I’ve learned about myself and my friends, I thought I’d present some issues I encountered in dealing with the data.

History Limitations

I had always thought that Google Voice was like Gmail, with messages living forever in your inbox. It turns out they don’t: the inbox is limited to 100 pages, so when I mined the inbox in the last post, I only got about 1000 conversations, or about 15 months’ worth.

(Edit: After a little snooping, turns out GV has a history tab. Time to go mine that guy, all 5300 conversations.)

Conversations

GV has a notion of conversations, which is pretty cool. The thing is, unlike the iPhone, it chooses to cut the conversations according to some algorithm. So you might have a conversation that stops for a day or two and then picks back up. I’m not sure if there’s an arbitrary cut-off, or whether it scales with the number of texts or their frequency. It might be interesting to reverse-engineer the algorithm!

Timing

The conversations thing wouldn’t be so annoying if not for the way that GV handles reporting the time. Specifically, each text is tagged with a time of day but no date, while each conversation is tagged with the full datetime of its last message. Thus, without more knowledge of the conversation algorithm, we can’t determine exactly when a specific message was sent. Moreover, because we only have the date for the last message, we can’t simply assign datestamps to individual messages in a single pass; a more complicated routine is needed. Until this is solved, I can’t get granularity beyond a single month, which isn’t all that problematic, since there isn’t much data on a daily or weekly level anyway, and what little I do have is mostly noise.
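One possible routine, sketched here rather than taken from my actual parser, is to walk backwards from the last message and step the date back a day every time the clock time jumps up (meaning we crossed midnight going backwards). The function name `assign_datetimes` and its inputs are made up for illustration, and it rests on a big assumption: messages are in chronological order and no two consecutive messages are more than 24 hours apart. Since conversations can pause for a day or two, that assumption will sometimes be wrong, so treat the output as a best guess.

```python
from datetime import datetime, time, timedelta


def assign_datetimes(times, last_datetime):
    """Guess a full datetime for each message in one conversation.

    times         -- list of datetime.time values, oldest message first
    last_datetime -- the conversation's datetime (that of its last message)

    ASSUMPTION: consecutive messages are less than 24 hours apart.
    """
    result = [None] * len(times)
    current_date = last_datetime.date()
    prev_time = last_datetime.time()
    # Walk from the newest message to the oldest.
    for i in range(len(times) - 1, -1, -1):
        t = times[i]
        # A later clock time on an earlier message means we crossed midnight.
        if t > prev_time:
            current_date -= timedelta(days=1)
        result[i] = datetime.combine(current_date, t)
        prev_time = t
    return result


if __name__ == "__main__":
    msgs = [time(23, 50), time(23, 58), time(0, 5), time(9, 30)]
    for stamp in assign_datetimes(msgs, datetime(2012, 3, 27, 9, 30)):
        print(stamp.isoformat())
```

In this example the first two messages land on March 26 and the last two on March 27. A gap longer than a day would silently be compressed, which is exactly the failure mode described above.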

Unusual characters

Finally, there are no limitations on the characters you can include in texts, including the ‘<’ and ‘>’ characters (which are popular both in talking about HTML or other markup languages and as part of the <3 heart). This makes parsing the XML non-trivial.
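One way to cope, sketched under assumptions, is to escape the message bodies before handing the file to a real XML parser. The `<text>` element name below is hypothetical (the actual Google Voice markup may well differ), and the sketch assumes bodies are not already escaped and never contain a literal `</text>`.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# HYPOTHETICAL element name; swap in whatever tag wraps the message body.
BODY_RE = re.compile(r"(<text>)(.*?)(</text>)", re.DOTALL)


def escape_message_bodies(raw_xml):
    """Escape &, < and > inside each message body so the XML parses."""
    def _fix(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + escape(body) + close_tag
    return BODY_RE.sub(_fix, raw_xml)


if __name__ == "__main__":
    raw = ("<conversation><text>I <3 you</text>"
           "<text>use <b> for bold</text></conversation>")
    root = ET.fromstring(escape_message_bodies(raw))
    for node in root.iter("text"):
        print(node.text)
```

Parsing `raw` directly would raise a `ParseError`; after escaping, the parser hands back the original message text with the angle brackets intact.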
