Toying around with Google BigQuery Python CLI

BigQuery is a web service from Google for querying big data using an SQL-like language. For beginners I can recommend this video presentation on BigQuery. It also covers Apache Drill, an open-source version of the technology (Dremel) underlying BigQuery. There is also a Google I/O 2012 video.

Install the Python CLI (here I’m using pip):

pip install bigquery
# this installs a program called 'bq', the Python CLI to BigQuery

Authorize the bq client (follow the instructions in the README.txt: “Authorizing bq to access your BigQuery data”):

bq init # go to the url, paste the code you get at the end

This creates a token in your home directory, e.g. ~/.bigquery.v2.token.

Do a query (on one of the public tables):

bq query 'select count(*) as count from publicdata:samples.shakespeare'

This returns the following result:

+--------+
| count  |
+--------+
| 164656 |
+--------+

See the first 5 rows of data from the shakespeare table (equivalent to “select * from R limit 5” for a table R):

bq head -n 5 publicdata:samples.shakespeare

This returns the following result:

+--------------+------------+--------------+-------------+
|     word     | word_count |    corpus    | corpus_date |
+--------------+------------+--------------+-------------+
| brave        |          6 | 1kinghenryiv |        1597 |
| profession   |          1 | 1kinghenryiv |        1597 |
| treason      |          2 | 1kinghenryiv |        1597 |
| Ned          |          9 | 1kinghenryiv |        1597 |
| answered     |          1 | 1kinghenryiv |        1597 |
+--------------+------------+--------------+-------------+

Let’s find the 10 most used words in the works of Shakespeare:

bq query "select word, sum(word_count) as count from publicdata:samples.shakespeare group by word order by count desc limit 10"

Which returns (no surprises there):

+------+-------+
| word | count |
+------+-------+
| the  | 25568 |
| I    | 21028 |
| and  | 19649 |
| to   | 17361 |
| of   | 16438 |
| a    | 13409 |
| you  | 12527 |
| my   | 11291 |
| in   | 10589 |
| is   |  8735 |
+------+-------+
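To see what that GROUP BY/ORDER BY query is doing, here is a plain-Python sketch of the same aggregation over a small hypothetical set of (word, word_count) rows (made-up data, not the real table):

```python
from collections import defaultdict

# Hypothetical (word, word_count) rows, as a word might appear
# with separate counts in several corpora.
rows = [
    ("the", 100), ("I", 80), ("the", 50), ("to", 70), ("I", 10),
]

# GROUP BY word: sum word_count per word.
totals = defaultdict(int)
for word, count in rows:
    totals[word] += count

# ORDER BY count DESC LIMIT 3
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # [('the', 150), ('I', 90), ('to', 70)]
```

Dremel does exactly this, just in parallel over billions of rows.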

Pretty easy huh?

The data in this table looks like the output from a MapReduce job. Indeed, Dremel is not a replacement for MapReduce but a complementary technology: produce data with MapReduce (Hadoop, Amazon EMR); drill down on it with Dremel (BigQuery, Apache Drill).
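For the “produce data with MapReduce” half, here is a toy word-count map/reduce sketch in Python (hypothetical input, not real Hadoop code) showing how a table like shakespeare could be generated:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Mapper: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Reducer: group by key and sum the values.
    # groupby needs its input sorted by key.
    pairs = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["to be or not to be"]
pairs = [p for line in lines for p in map_phase(line)]
print(dict(reduce_phase(pairs)))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The (word, count) pairs coming out of the reducer have the same shape as the rows we queried above.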

Joins

BigQuery only supports “small” joins, where one of the two tables is fairly small. When a huge table is joined with a small table, the huge table is sharded across the workers and the small table is broadcast (over the network) to every worker, which then performs its part of the join locally. This obviously wouldn’t work if a huge table had to be shipped to every worker.
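The broadcast strategy can be sketched in a few lines of Python (hypothetical shards and tables, not BigQuery internals):

```python
# The small table is broadcast to each worker as a lookup dict.
small_table = {1: "red", 2: "blue"}      # id -> color

# The huge table is sharded across workers as (id, value) rows.
huge_shards = [
    [(1, "a"), (2, "b")],                # worker 1's shard
    [(2, "c"), (1, "d")],                # worker 2's shard
]

def worker_join(shard, broadcast):
    # Each worker joins only its own shard against the broadcast copy.
    return [(id_, val, broadcast[id_]) for id_, val in shard if id_ in broadcast]

result = [row for shard in huge_shards for row in worker_join(shard, small_table)]
print(result)  # [(1, 'a', 'red'), (2, 'b', 'blue'), (2, 'c', 'blue'), (1, 'd', 'red')]
```

No worker ever needs to see another worker’s shard of the huge table, which is why the small side of the join has to fit in memory on every worker.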

Spatial data?

One question I had was: does BigQuery support spatial queries? The answer is no.
