igorbrigadir / awesome-twitter-algo
The release of the Twitter algorithm, annotated for recsys
Curated by Igor Brigadir and Vicki Boykis.
An annotated look through the release of the Twitter algorithm, in the context of engineering and recsys, with notes from the repo creators on the significance of specific parts of the code. Since it can be hard to parse through so much code and derive meaning and context, we do it for you!
This code focuses on the services used to build the Home timeline For You feed, the algorithmic tab that is now served first on both web and mobile next to the Following feed.
We're happy to accept changes that add to and contextualize Twitter's recommendations algorithm as it's been released over the past week. To contribute, please submit a PR with good formatting and grammar and lots of links to references where relevant. We're especially happy for feedback from tweeps or former tweeps who can tell us where we got it wrong.
One thing that's immediately obvious is that this is not the entire codebase, or even a working majority of it; several key components are missing, as described below.
An important high-level concept discussed in the Spaces accompanying this code release was in-network versus out-of-network. In-network tweets are those from people you follow; out-of-network tweets are from everyone else. A roughly 50/50 blend of the two is offered among the ~1,500 candidate tweets run through the rankers.
What was released? The majority of the code and algorithms, but not the data, parameters, configurations, or build tools of the recommender systems behind the "For You" timeline recommendations. The candidate retrieval code was also not released, nor were the Trust and Safety or Ads components; those remain closed off. No user data or credentials were inside the repositories, and code comments were sanitized (or at least, none were obviously there on first look).
Twitter Algo Repo || Twitter ML Algo Repo || Blog Post
There is a very, very old post from 2013 on High Scalability which gives some context to how these systems were initially constructed.
As context, Twitter initially ran all workloads on-prem but has been moving to Google Cloud. In 2019, Twitter began by migrating to BigQuery and DataFlow from a data and analytics perspective. Before the move to BigQuery, much of the data was stored in HDFS using Thrift. It currently lives in BigQuery and is processed for many of the pipelines described below using DataFlow, GCP's Spark/Scalding-processing equivalent platform.
The released code comes in a variety of languages. The most common languages used at Twitter are:
The typical recommender system pipeline has four steps: candidate generation, ranking, filtering, and serving. Twitter has many pipelines for performing various parts of this across the overall released codebase.
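As a rough sketch (not code from the repo; all names and types here are invented for illustration), the four stages and the roughly 50/50 in-network/out-of-network candidate blend can be pictured as a single pass:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Tweet:
    id: int
    author_id: int
    in_network: bool
    score: float = 0.0


def recommend_for_you(
    candidates: Iterable[Tweet],
    rank: Callable[[Tweet], float],
    keep: Callable[[Tweet], bool],
    n_candidates: int = 1500,
) -> list[Tweet]:
    """Hypothetical skeleton: candidate generation -> ranking -> filtering -> serving."""
    # Candidate generation: ~1500 tweets, roughly a 50/50 in-network / out-of-network blend.
    pool = list(candidates)[:n_candidates]
    # Ranking: score every candidate (light ranker then heavy ranker in the real system).
    for tweet in pool:
        tweet.score = rank(tweet)
    # Filtering: drop tweets that fail visibility/safety checks.
    visible = [t for t in pool if keep(t)]
    # Serving: order by score; heuristics and boosts are applied before the final feed.
    return sorted(visible, key=lambda t: t.score, reverse=True)
```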
The input data for the For You feed in the Twitter timeline comes from:
In migrating to GCP, the current data ingest looks something like this:
That data is then made available to the candidate generation phase. There is not much about the actual data, even what a schema might look like, in the repo.
(also called "features" in the chart)
The reasons for specifically creating an in-memory DB are described in the GraphJet paper:
In terms of recommendation algorithms, we have found that random walks, particularly over bipartite graphs, work well for generating high-engagement recommendations. Although conceptually simple, random-walk algorithms define a large design space that supports customization for a wide range of application scenarios, for recommendations in different contexts (web, mobile, email digests, etc.) as well as entirely unrelated applications (e.g., social search). The output of our random-walk algorithms can serve as input to machine-learned models that further increase the quality of recommendations, but in many cases, the output is sufficiently relevant for direct user consumption.
In terms of production infrastructure for generating graph recommendations, the deployed systems at Twitter have always gone "against the grain" of conventional wisdom. When many in the community were focused on building distributed graph stores, we built a solution (circa 2010) based on retaining the entire graph in memory on a single machine (i.e., no partitioning). This unorthodox design decision enabled Twitter to rapidly develop and deploy a missing feature in the service (see Section 2.1). Later, when there was much activity in the space of graph processing frameworks rushing to replace MapReduce, we abandoned the in-memory system and reimplemented our algorithms in Hadoop MapReduce (circa 2012). Once again, this might seem like another strange design decision (see Section 2.2). Most recently, we have supplemented Hadoop-based recommendations with custom infrastructure, first with a system called MagicRecs (see Section 2.3) and culminating in GraphJet, the focus of this paper.
The precursor to GraphJet was WTF (Who to Follow), which focused only on recommending users to other users, using Cassovary, an in-memory graph processing engine built specifically for WTF, also built on the JVM.
GraphJet implements two random walk algorithms:
A large portion of the traffic to GraphJet comes from clients who request content recommendations for a particular user.
GraphJet includes CLICK, FAVORITE, RETWEET, REPLY, and TWEET as input node types and keeps track of left (input) and right (output) nodes.
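GraphJet's actual algorithms (SALSA-style walks, circular edge buffers, etc.) are far more involved, but a toy random walk over a bipartite user-to-tweet engagement graph conveys the core idea. Everything below is invented for illustration and is not from the repo:

```python
import random
from collections import Counter, defaultdict

# Toy bipartite engagement graph: left nodes are users, right nodes are tweets.
# An edge means the user engaged with the tweet (click, favorite, retweet, reply, tweet).
user_to_tweets = defaultdict(list, {
    "alice": ["t1", "t2"],
    "bob":   ["t2", "t3"],
    "carol": ["t3", "t4"],
})
tweet_to_users = defaultdict(list)
for user, tweets in user_to_tweets.items():
    for t in tweets:
        tweet_to_users[t].append(user)


def random_walk_recs(seed_user: str, walks: int = 10_000) -> list[tuple[str, int]]:
    """Count how often each tweet is visited on short user -> tweet -> user walks from the seed."""
    visits = Counter()
    for _ in range(walks):
        user = seed_user
        for _ in range(2):  # two hops is enough to reach "tweets your neighbours engaged with"
            tweets = user_to_tweets[user]
            if not tweets:
                break
            tweet = random.choice(tweets)
            visits[tweet] += 1
            user = random.choice(tweet_to_users[tweet])
    return visits.most_common()


print(random_walk_recs("alice"))  # tweets reachable through shared engagements rank higher
```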
Based on the blog, a total of 1500 candidates are retrieved. However, only some of them will be served to your Twitter feed.
Twitter wants to show the tweets that you are most likely to engage with positively. Therefore it predicts the probability that you will engage with each tweet in various ways, and uses these probabilities to score the tweets.
To reduce computation cost, tweets are first ranked with a light ranker (which is just a logistic regression) and then a heavy ranker (a neural network model).
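A minimal illustration of the two-stage idea follows. These are not the actual models (the real light ranker is an Earlybird logistic regression and the heavy ranker is a neural network); the feature dimensions, candidate cutoff, and weights are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake feature vectors for ~1500 candidate tweets (8 made-up features each).
features = rng.normal(size=(1500, 8))

# Stage 1: light ranker, a cheap linear model standing in for the logistic regression.
light_weights = rng.normal(size=8)
light_scores = 1 / (1 + np.exp(-(features @ light_weights)))
top_k = np.argsort(light_scores)[::-1][:200]   # keep only the most promising candidates

# Stage 2: heavy ranker, a toy neural network applied only to the survivors.
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 1))
hidden = np.maximum(features[top_k] @ W1, 0)   # ReLU layer
heavy_scores = (1 / (1 + np.exp(-(hidden @ W2)))).ravel()

final_order = top_k[np.argsort(heavy_scores)[::-1]]   # candidate indices, best first
```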
This is their documentation
Twitter has separate models for ranking in-network and out-of-network tweets, with different features.
The Light Ranker TensorFlow model is trained using TWML, which is said to be deprecated, but the code is in the deepbird project.
The Earlybird Light Ranker has some feature weights, but as suggested in the code they are read in as runtime parameters, and the values used in practice are most likely different.
All of these are combined and weighted into a score. Hyperparameters for the model and weighting are here.
For more details on the model, see the Architecture overview.
After the model predicts the probability of each action, a weight is applied to each probability. The tweet with the highest resulting score is likely to appear at the top of your feed.
These are the actions predicted, and their corresponding weights:

| feature | weight |
|---|---|
| probability the user will favorite the Tweet | 0.5 |
| probability the user will click into the conversation of this Tweet and reply or like a Tweet | 11* |
| probability the user will click into the conversation of this Tweet and stay there for at least 2 minutes | 11* |
| probability the user will react negatively (requesting "show less often" on the Tweet or author, block or mute the Tweet author) | -74 |
| probability the user opens the Tweet author profile and likes or replies to a Tweet | 12 |
| probability the user replies to the Tweet | 27 |
| probability the user replies to the Tweet and this reply is engaged by the Tweet author | 75 |
| probability the user will click Report Tweet | -369 |
| probability the user will ReTweet the Tweet | 1 |
| probability (for a video Tweet) that the user will watch at least half of the video | 0.005 |
The score of the tweet is equal to
P(favorite) * 0.5 + max( P(click and reply), P(click and stay two minutes) ) * 11 + P(hide or block or mute) * -74 + ... etc
The tweet with the highest score is likely to appear at the top of your feed. (There is still a boosting step where multipliers are applied to the score.) However, filtering is applied afterwards, and this could change which tweets you actually see.
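A minimal sketch of this weighted sum, using the weights from the table above. The probability names and dictionary shape are invented; the max over the two conversation-click probabilities follows the formula as written, and the exact combination logic in production may differ:

```python
# Weights from the heavy ranker scoring table above.
WEIGHTS = {
    "favorite": 0.5,
    "click_conversation": 11.0,      # max of "reply or like" and "stay 2 minutes" probabilities
    "negative_feedback": -74.0,      # show less often / block / mute
    "profile_engage": 12.0,
    "reply": 27.0,
    "reply_engaged_by_author": 75.0,
    "report": -369.0,
    "retweet": 1.0,
    "video_watch_half": 0.005,
}


def score_tweet(p: dict) -> float:
    """Combine predicted engagement probabilities into a single ranking score."""
    click = max(p.get("click_and_reply", 0.0), p.get("click_and_stay_2min", 0.0))
    score = WEIGHTS["click_conversation"] * click
    for name in ("favorite", "negative_feedback", "profile_engage", "reply",
                 "reply_engaged_by_author", "report", "retweet", "video_watch_half"):
        score += WEIGHTS[name] * p.get(name, 0.0)
    return score


probs = {"favorite": 0.3, "click_and_reply": 0.05, "click_and_stay_2min": 0.1,
         "reply": 0.02, "report": 0.001}
print(score_tweet(probs))  # 0.5*0.3 + 11*0.1 + 27*0.02 - 369*0.001 ≈ 1.421
```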
There are some interpretations we can make from the scoring plan:
The release does not describe how the weights are chosen. We expect the weights to be tuned with A/B testing. We are also curious about what Twitter measures and optimizes when they tune the weights.
Usually, filtering happens before ranking to avoid the need to rank candidates that will be filtered later. However, on Twitter, the blog implies that filtering happens after ranking.
Remove out-of-network competitor site URLs from potential offered candidate Tweets
There are some reasons why we might not want to order the tweets strictly by the scoring plan. The scoring plan scores each tweet independently of other tweets. However, we might want to consider other tweets when presenting them on the feed, for example to avoid showing tweets from the same author consecutively or to maintain some other form of diversity.
These are the heuristics mentioned in the blog
Author Diversity: Avoid too many consecutive Tweets from a single author. The score decays with position in the run of consecutive tweets (see the sketch after this list):
score * ((1 - 0.25) * Math.pow(0.5, position) + 0.25)
Content Balance: Ensure we are delivering a fair balance of In-Network and Out-of-Network Tweets.
Feedback-based Fatigue: Lower the score of certain Tweets if the viewer has provided negative feedback around it.
Social Proof: Exclude Out-of-Network Tweets without a second degree connection to the Tweet as a quality safeguard. In other words, ensure someone you follow engaged with the Tweet or follows the Tweet’s author.
ScaleFactor = 0.75
is applied to out-of-network tweets (exactly second-degree connections?); in-network retweets of out-of-network tweets should not have this multiplier applied.
Twitter Blue boost: (Not listed in blog)
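A toy sketch of two of the adjustments above: the author-diversity decay formula and the out-of-network scale factor. The function names, how "position" is counted, and the application order are all assumptions:

```python
def author_diversity_decay(score: float, position: int,
                           floor: float = 0.25, decay: float = 0.5) -> float:
    """score * ((1 - 0.25) * 0.5**position + 0.25): later consecutive tweets from the
    same author keep a smaller share of their score, never dropping below 25% of it."""
    return score * ((1 - floor) * decay ** position + floor)


def content_balance(score: float, in_network: bool, scale_factor: float = 0.75) -> float:
    """Apply the 0.75 multiplier to out-of-network tweets only (assumed interpretation)."""
    return score if in_network else score * scale_factor


# Three consecutive tweets from the same author, each with a raw score of 100:
for position in range(3):
    print(position, author_diversity_decay(100.0, position))
# 0 100.0
# 1 62.5
# 2 43.75
```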
These are Twitter-specific terms and names that keep coming up across different code bases and blog posts.
1288834974657L, which is a timestamp for 2010-11-04T01:42:54Z, the date that Twitter introduced the Snowflake ID system; it is used as Twitter's own "Unix Epoch".

Cases of potential bias, manipulation, favouritism, hacks, etc.

The focus of this repository is on the concrete, technical aspects of the code, not on speculating about anything Twitter may or may not have done. That exercise is left to the reader. However, there are some technical aspects of these popular accusations that should still be described, and this is a section for those. Unfortunately, much of the configuration that would contain specific instances of interventions is not in the code.
It was long speculated that YouTube links get massively deboosted, and that Spaces links massively boost Tweets in recommendations. There are no specific references to this in the code. However, there are filters that could be configured for this, referencing OutOfNetworkCompetitorURLFilter for example.
The Elon Musk / Democrat / Republican Code: Now removed. One of the first widely shared cases, falsely assumed to be something that directly affects recommendations when it was actually for internal A/B testing, to monitor for effects (DDG is Duck Duck Goose, the A/B testing platform). It was also mentioned in the Space and denied there. However, a former Twitter employee also offered an alternative explanation (A/B testing measures behavior, so one way or another Twitter is tuning your TL, indirectly).
There are two mentions related to Ukraine in the Twitter Algo repo. One of them is a flag for Ukraine-related misinformation used for moderation or warning labels; the other is a safety label for Twitter Spaces called UkraineCrisisTopic. Here are some facts about these labels and their function:
SafetyLabel results in a tweet interstitial or notice; these are publicly documented here previously and specifically for Armed Conflicts here.

@karlhigley's blog and thread of threads are very accessible resources about Recommender Systems in practice.
A good Recommender Systems entry point is the Google Machine Learning for Recommender Systems course; it also has a good glossary of terms.
The biggest academic recsys community is ACM RecSys, and state-of-the-art recommender systems research is usually openly available in the proceedings. A lot of the presentations are on YouTube.
Admittedly out of date, but still useful, is the RecSys Wiki.
The latest edition of the Recommender Systems Handbook is also a good book that covers the field well.
I find it very helpful to break down recommendation systems into four stages - retrieval, filtering, scoring, and ordering.