
Nerd Alert: Mining Google Voice

January 8, 2012

I’ve been home on break for the past month. Most of my time has been spent in complete relaxation. I’ve read way too many books, watched all 140 episodes of How I Met Your Mother (true story), and slept a lot. This relaxation must have been good for me, however, since I’ve come up with a few interesting project ideas. One I decided to try was to visualize my texts over the past few years.

I use an amazing service called Google Voice for the majority of my texts. I figured it wouldn’t be too hard to pull all of my texts out of their website, parse them in Python, and create some pretty graphs:

  • Number of texts by person over time
  • Of my friends, who sends the most verbose texts?
  • With whom do I have the largest text deficit/surplus? (Who texts me more than I text them, or vice versa?)
  • Do I text differently in response to different people?

And so on. (Suggestions definitely appreciated!)
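To make the deficit/surplus question concrete, here is a minimal Python sketch. It assumes the texts have already been parsed into (contact, direction) pairs; that record shape, the function name text_balance, and the sample names are all hypothetical and made up for illustration, not taken from the real data.

```python
from collections import Counter


def text_balance(messages):
    """Return {contact: received - sent} for an iterable of (contact, direction).

    direction is assumed to be the string "sent" or "received".
    A positive balance means that person texts me more than I text them.
    """
    sent = Counter()
    received = Counter()
    for contact, direction in messages:
        if direction == "sent":
            sent[contact] += 1
        elif direction == "received":
            received[contact] += 1
        else:
            raise ValueError(f"unknown direction: {direction!r}")
    # Include every contact that appears in either direction.
    return {c: received[c] - sent[c] for c in sent.keys() | received.keys()}


# Made-up sample records, purely for illustration.
sample = [
    ("Alice", "received"),
    ("Alice", "received"),
    ("Alice", "sent"),
    ("Bob", "sent"),
]
balance = text_balance(sample)
```

Sorting the resulting dictionary by value would put the biggest surplus at one end and the biggest deficit at the other.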

But first, there’s a little problem of grabbing the data out of the internet.

First attempt: Search for an API

It exists, but doesn’t look too friendly. Someone on the internet suggests just forwarding all texts to your email and grabbing them from there. Sure, that’s a much better forward-looking solution, but I am much more interested in historical data.

Second attempt: Basic cURL

cURL is a great tool for grabbing pages off the internet. Surely I can just grab some nicely-numbered pages and I’ll be all done with this. Right?

Not so fast. Looks like Google isn’t too eager to get mined. Just hitting them up with a
curl -A Mozilla https://www.google.com/voice/
gives a pretty bland “Moved Temporarily” page. Not helpful.

Third attempt: Advanced cURL

After some more Googling, I came across a link that suggested the following: go to the login page, let it leave a cookie. Then submit a fake form using all of the hidden and non-hidden fields (and the cookie) to the form’s action URL and pray. I ended up with this monstrosity of a command:
curl -b cjar -c cjar --cookie cjar --cookie-jar cjar --data 'continue=https://www.google.com/voice/' --data 'followup=https://www.google.com/voice/' --data 'service=grandcentral' --data 'dsh=...' --data 'ltmpl=open' --data 'GALX=...-npE' --data 'pstMsg=0' --data 'Email=...@gmail.com' --data 'Passwd=...' --data 'signIn=Sign in' --location --output ~/Documents/page.html -A Mozilla https://accounts.google.com/ServiceLoginAuth
And, what do you know, it didn’t work. Apparently my browser’s cookie functionality is turned off. (Who would’ve guessed.) That was enough to get me to take a nice, long break.

Fourth attempt: Basic Apple Automator

Automator is an often-forgotten but fairly powerful piece of macro software that ships with Mac OS X. It’s designed to allow non-programmers to automate their repetitive tasks, either by recording macros or by dragging boxes to build workflows in which data flows from block to block. I figured that using Google Voice’s keyboard shortcuts would be the best way to go here, so I created a new “Watch Me Do” macro by hitting the record button in Automator. I switched to Chrome, grabbed the source of the page using alt-command-U, selected it all with command-A, copied it with command-C, switched to TextEdit, opened a new file with command-N, pasted with command-V, then switched back to Chrome and hit my right arrow to bring up the next page of my inbox. Needless to say, it didn’t work, instead giving me an obscure and indecipherable error message.

Fifth attempt: Java Robot

I love the Java Robot class (java.awt.Robot). Since what I was trying to do in Automator was just pounding some keys on the keyboard, I figured I would fire up IntelliJ and use a Robot. On a PC, no problem. On a Mac, however, this gets interesting. How the heck do you get Java to hit the command/apple/squiggle key? Googling this issue just led me to some insensitive forum posters / PC users explaining how you can hit the command keys just like any others. Of course they were referring to the F1-F12 keys, which aren’t a problem. Checking out the Wikipedia article, I saw that Macs recognize the Windows key as the command key, but a quick test showed that not to work.

Sixth attempt: Brute force & Epiphany

Dejected, I decided to download a bit of the data and then start mining it. If the results were interesting, I could invest some more time in gathering data or automating that process down the line.

After copying off the source for about ten pages, I realized I might as well check out what I had. Specifically, I was wondering whether Google Voice’s “14 more messages” links were polling the server or just revealing some hidden divs. Well, it turns out that it was… nothing. In fact, there was no content at all in the source I had been downloading (except for the names and phone numbers of each one of my contacts). Dejected, I hopped into the ‘Network’ tab of Chrome’s Inspect interface. Goldmine.

It turned out that every time I hit the next or previous buttons, it was making an AJAX request, and the server was returning a beautiful XML document with everything I could ever want. Moreover, it was a GET request, which meant that I could just ping the server at that URL.
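In principle, a plain GET like this can be scripted. The following is a rough Python sketch, not something tested against Google Voice: it assumes that sending the browser’s session cookie in a Cookie header is enough to authenticate, which may well not be true. The helper names (page_url, build_request, fetch_page) are hypothetical, and the v value and cookie string have to be copied from the browser’s Network tab.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL pattern as seen in the browser's Network tab.
BASE_URL = "https://www.google.com/voice/inbox/recent/"


def page_url(page, v):
    """Build the URL for inbox page number `page` (1 -> p1, 2 -> p2, ...)."""
    return BASE_URL + "?" + urlencode({"page": f"p{page}", "v": v})


def build_request(page, v, cookie):
    """Prepare a GET request carrying the browser's cookie string.

    Assumption: the session cookie alone is sufficient for authentication.
    """
    headers = {"Cookie": cookie, "User-Agent": "Mozilla"}
    return Request(page_url(page, v), headers=headers)


def fetch_page(page, v, cookie):
    """Perform the request and return the raw response bytes (needs network)."""
    with urlopen(build_request(page, v, cookie)) as response:
        return response.read()
```

Calling fetch_page(3, v, cookie) would then return the raw XML for page p3, provided the cookie assumption holds.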

Seventh attempt: Ain’t pretty, but it works

Here’s the final solution that worked for me:

From the Network tab in Chrome’s Inspect interface, I found a URL: https://www.google.com/voice/inbox/recent/?page=p3&v=something (not sure what the v field does or whether I should keep it confidential, but it’s important that it be there and it’s important that it be right). Using a for loop, I generated an HTML file that contained links to all of those pages from p1 to p165. I then opened this page in Safari, and used Automator to 1. Get the open URL in Safari; 2. Get Link URLs from that page; and 3. Display webpages. Running that workflow opened up 165 tabs (in Chrome, interestingly enough), each of which then downloaded the corresponding XML file. All that was left was for me to manually approve each download. I ended up with about 90 files, a lot fewer than the 165 I expected, but that’s understandable since Chrome can’t necessarily download 165 files at once (and I might have hit Discard a few times in the chaos). Certainly I could go back in smaller slices and get all of them if the analysis turns out to be interesting.
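The link-generating loop is only a few lines. Here is a sketch of what it could look like in Python; the function names and the links.html filename are made up for illustration, and "something" stands in for the real v value, which is not reproduced here.

```python
from html import escape
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/voice/inbox/recent/"


def build_link_page(first, last, v):
    """Return an HTML document with one link per inbox page, first..last inclusive."""
    links = []
    for page in range(first, last + 1):
        url = BASE_URL + "?" + urlencode({"page": f"p{page}", "v": v})
        # escape() turns the "&" in the query string into "&amp;" for valid HTML.
        links.append(f'<a href="{escape(url)}">p{page}</a><br>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


def write_link_page(path, first, last, v):
    """Write the generated link page to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_link_page(first, last, v))
```

Calling write_link_page("links.html", 1, 165, "something") produces a file that can be opened in Safari for the Automator workflow. Passing a narrower range, say 1 to 20, gives the smaller slices mentioned above.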

Next up, parsing, analysis, and visualization!

Part II of this series is now live: data issues.

Edit: I just found the History tab in GV, which means I can grab all of the data, not just the last 1000 conversations.

Edit II: I just found Google Takeout, which allows you to download all of your Google data (including Voice). Well that pretty much negates all of my work here, but I still think that my data is in a better format for parsing.

One Comment
  1. Guy Davidson
    January 10, 2012 11:54 pm

    It sounds like once you’ve figured out it’s an HTTP / AJAX request, you could write a short Python script – see (http://www.devx.com/opensource/Article/41509/1763/page/4) for example. I mean, you already have the data, but that would not require manual intervention and would make it far easier to throw it all in one big file (SQLite database, for example?).
