antontarasenko / smq
- вторник, 12 апреля 2016 г. в 03:11:54
366 stars today
A collection of SQL queries to social media datasets.
A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the list of queries and how to use them below
Queries are written for Google BigQuery free public datasets (requires a Google account) and stored in .sql
files, organized by social media outlet (folder hackernews
and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content.
Each of the queries processes 0.5-10GB of data. Processing up to 1TB per month is free, and you have up to 2,000 queries to experiment with.
Some examples (see hackernews
directory for the full list):
imdb.com/title/tt[0-9]+/
itunes.apple.com/book/id[0-9]+
play.google.com/store/apps/details?id=.+
For simple queries, use Hacker News' Algolia search:
All Hacker News queries can be applied to Reddit after minor edits. Examples:
Reddit comments on BigQuery are split into multiple tables. If you want to select from comments, use TABLE_QUERY
:
FROM (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'"))
.
Beware, this can quickly exhaust the free 1TB limit.
You can export up to 16,000 rows or 10MB as a csv
file via web interface. Larger output can be exported through console. Add LIMIT <n>
to queries to control the number of rows in output.
See web UI quickstart by Google.
bq
bq query `cat <path>`
, where <path>
leads to the .sql
fileimport gcp.bigquery as bq
bq.Query()
functionSee Felipe Hoffa's Hacker News notebook for example.
See BigQuery API Quickstart for examples in Java, Python, C#, PHP, Ruby. You'll need a credentials file to run it locally.
Pull requests are welcomed. Suggestions:
.sql
files related to Hacker News for Reddit and Wikipedia databasesThe reference for BigQuery's SQL dialect.