
Social Media Queries

A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the list of queries and how to use them below.

Queries

Queries are written for Google BigQuery's free public datasets (a Google account is required) and stored in .sql files, organized by social media outlet (the hackernews folder and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content.

Each query processes 0.5-10GB of data. Processing up to 1TB per month is free, so the quota covers roughly 100-2,000 queries to experiment with (1TB divided by 10GB or 0.5GB per query).

Hacker News

Some examples (see hackernews directory for the full list):
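
For instance, a minimal top-stories query might look like the following sketch; the [fh-bigquery:hackernews.stories] table and its title and score columns are assumptions based on the public snapshot:

    -- Sketch: top 10 stories by score in the public HN snapshot (assumed schema)
    SELECT title, score
    FROM [fh-bigquery:hackernews.stories]
    ORDER BY score DESC
    LIMIT 10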

For simple queries, use Hacker News' Algolia search instead.

Reddit

All Hacker News queries can be applied to Reddit after minor edits. Examples:
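
For example, a top-comments query for a single month might look like this sketch, assuming the 2015_05 comments table and its body and score columns:

    -- Sketch: top 10 highest-scoring comments in May 2015 (assumed schema)
    SELECT body, score
    FROM [fh-bigquery:reddit_comments.2015_05]
    ORDER BY score DESC
    LIMIT 10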

Reddit comments on BigQuery are split across multiple tables (yearly through 2014, monthly afterwards). If you want to select from all comments at once, use TABLE_QUERY:

    FROM (TABLE_QUERY([fh-bigquery:reddit_comments],
      "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'"))

Beware: scanning all of these tables at once can quickly exhaust the free 1TB quota.
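
In a complete query the clause might be used as follows; the subreddit column is an assumption about the comment tables' schema:

    -- Sketch: comment counts per subreddit across all yearly and monthly tables
    SELECT subreddit, COUNT(*) AS comments
    FROM (TABLE_QUERY([fh-bigquery:reddit_comments],
      "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_'"))
    GROUP BY subreddit
    ORDER BY comments DESC
    LIMIT 20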

Wikipedia

  • Edits made from an IP address - Wikipedia records the IP addresses of anonymous editors. Handled with respect for privacy, this data supports uses such as the following (see the sketch after this list):
    • Edits by organization. Many organizations reserve static IPs; one famous example is edits from the US Congress. A query for a particular organization is unlikely to return many edits, though, because the sample table contains only 300M edits - too diluted to yield a representative subset.
    • Edits by region. The sample is sufficient for statistics by region and other broad characteristics.
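
A minimal edits-per-IP sketch; the [publicdata:samples.wikipedia] table and its contributor_ip column are assumptions based on BigQuery's sample datasets:

    -- Sketch: most active anonymous (IP-recorded) editors in the sample
    SELECT contributor_ip, COUNT(*) AS edits
    FROM [publicdata:samples.wikipedia]
    WHERE contributor_ip IS NOT NULL
    GROUP BY contributor_ip
    ORDER BY edits DESC
    LIMIT 10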

Usage

Web Interface

You can export up to 16,000 rows or 10MB as a CSV file via the web interface. Larger outputs can be exported through the console. Add LIMIT <n> to a query to control the number of rows in the output.

  1. Locate a query in the repo's folders
  2. Log in at https://bigquery.cloud.google.com/welcome
  3. Press "Compose query" in the top left corner
  4. Copy-paste the query and run it

See Google's web UI quickstart.

Command line: bq

  1. Install Google Cloud SDK
  2. Initialize your account for command line tools
  3. Run bq query "$(cat <path>)", where <path> leads to the .sql file (quoting keeps the shell from splitting the query into separate arguments)

Python in the cloud: Jupyter/IPython notebooks

  1. Get a Google Cloud account (free trial)
  2. Create a Jupyter notebook in Datalab
  3. Do import gcp.bigquery as bq
  4. Run queries with the bq.Query() function (see the sketch below)
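
A minimal sketch of steps 3-4, assuming the pre-2016 Datalab gcp.bigquery module and the Hacker News stories table used above:

    import gcp.bigquery as bq

    # Build a query object; nothing runs until results are requested.
    # The table name and columns are assumptions (see the Hacker News section).
    top_stories = bq.Query(
        'SELECT title, score FROM [fh-bigquery:hackernews.stories] '
        'ORDER BY score DESC LIMIT 10')

    # Execute the query and fetch the rows.
    top_stories.results()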

See Felipe Hoffa's Hacker News notebook for an example.

BigQuery API

See the BigQuery API Quickstart for examples in Java, Python, C#, PHP, and Ruby. You'll need a credentials file to run them locally.

Contributing

Pull requests are welcome. Suggestions:

  • Adding new data mining queries
  • Porting the Hacker News .sql files to the Reddit and Wikipedia datasets

See the reference for BigQuery's SQL dialect when writing queries.

Acknowledgements

  • Discussions on Hacker News and Reddit
  • Felipe Hoffa for publishing the datasets