Building a Social Media Recommendation Algorithm from Scratch

Nov 1, 2025 · 25 min

TL;DR

Built a TikTok-style recommendation engine from scratch to finally demystify how social media algorithms understand content. The entire system is engineered around Google Gemini Embeddings to semantically map videos and FAISS vector search for fast, similarity-based recommendations. I even threw in a sliding window for user preferences (because tastes change!) and a mandatory 70/30 exploitation-exploration mix to keep your feed interesting.

Here is the live demo if you wanna have a look: Live Demo. It's hosted on Render's free tier, so yeah, the instance might take a while to spin up.

You can check out the chaotic code here: GitHub

Tech Stack: Python, Flask, FAISS, NumPy, Pandas, Google Gemini Embeddings

Introduction

I never really understood how social media algorithms worked, and.. why do they know me so well? What makes anything go viral? And why do I spend hours on them?

On the internet, the usual answer is that it's all about how we interact and how much we interact: likes, shares, how many times or for how many seconds we watch the content. Which.. makes sense, because it tells the algorithm that the user is interested in this type of content. But that is not exactly what I'm curious about. I'm curious how the algorithm KNOWS what the type of content is. I mean, I could just say it's Machine Learning or "AI" like everyone else, but eh.. what even is AI or ML? And do they even use machine learning?

To put it simply, say my "for you" page is filled with cute cats and recipes. Why? Because that is the type of content I interact with the most. Occasionally the algorithm will show me some political content, but I just skip past it, or maybe watch it once if it catches my attention; it keeps popping up occasionally to test whether I'm still interested in "political content" or not. The interaction tells the algorithm whether something interests me, but how does the algorithm know if a video is a cat video or a political one? Does it even know?

And to answer this, I need to build an algorithm first..

Building a Basic Algorithm

The Dumb Algorithm

Let’s take “interactions” as a starting point,

For easier understanding, let's say only three types of content exist: cat videos, dog videos and space videos, and the user can either like or dislike a video.. for now..

Each video is assigned a score: if the user likes it, the score is 1, if not, -1. That score is then what we use to suggest content to the user.

Let's start off by generating some dummy data; in the real world this would be generated by users on the app or website. For now we will also assume that we, or the system, already know "what" the content is about: cat, dog or space.

In this rudimentary program/app of ours, content is shown randomly, similar to when you create a new account on any social media: it will show the most popular content curated by mods, or throw random content at you, and then learn based on how you interact. This is called the Cold Start Problem.

To start, we are going to use the random-content strategy to solve this.
The user will be told what the video is about and will either input y to like it or skip to dislike it.

After interacting with about 10 videos, we have a log or history of what the user is doing on our app.

The user interacted with 3 cat videos and liked two of them, disliked both of the space videos they were shown, and liked 4 of the 5 dog videos. If we take the mean of the scores per topic, the user is most interested in dog videos, then cats, and least interested in space.
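To make that concrete, here's a minimal sketch of the scoring logic in Python (the history is the dummy data described above; the variable names are mine):

from collections import defaultdict

# dummy history matching the example above: (topic, liked?)
watch_history = [
    ("cat", True), ("cat", True), ("cat", False),
    ("space", False), ("space", False),
    ("dog", True), ("dog", True), ("dog", True), ("dog", True), ("dog", False),
]

scores = defaultdict(list)
for topic, liked in watch_history:
    scores[topic].append(1 if liked else -1)

# mean score per topic = our preference_score
preference_score = {topic: sum(s) / len(s) for topic, s in scores.items()}
print(preference_score)  # {'cat': 0.33..., 'space': -1.0, 'dog': 0.6}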

But this system is a bit too dumb and simple. Yeah, I mean, this is pretty much how the algorithm-and-interaction loop works, but it is not smart, and there are a few things I do not like about it:

  • It's really simple: the score can only be plus one or minus one, which is alright for ranking topics. We can make it more comprehensive and closer to how modern social media works by adding more topics and interactions, to get a better scoring system, i.e., a better idea of what the user is interested in.
  • It’s not smart, where is the ML? Where is the AI everyone keeps on talking about?
  • And most importantly, it's incomplete: it still suggests content randomly. We haven't implemented anything that utilises the "interests" of the user, i.e., it's not personalised. Let's tackle this point first.

Adding Personalization

Treating the mean as the preference_score of each category, our next step is to rank the videos.
We have about 100 videos in our content pool, and obviously we won't rank all 100 of them. I mean, thousands of videos and posts are uploaded to social media; they can't rank each and every one per user, right?.. I guess. So here's what we're gonna do: we now know what the user prefers, so we'll just pick videos that match those topics.
In our case the user likes dogs and, somewhat, cats, and does not like space, so there is no point in showing them space videos, because the goal is to maximise time on app - that's how they make money.

One thing I would like to mention is the concept of “Exploration and Exploitation Trade-off”.
Exploitation is when the algorithm shows what we know the user likes and Exploration is to try to recommend new content to learn more about the user. Real algorithms balance both. I will currently focus on the exploitation part, i.e., make the algorithm show what the user likes.

So the first thing to do is reduce the number of videos to rank. Out of the 100 videos in our "database", 42 are cat videos, 35 are dog videos, and the remaining 23 are space videos. Since the user is only interested in dogs and cats, we don't need the space videos.

Now the number of videos to rank is down to 77, which I think is still a lot to rank in one go. Just imagine: in real life, even after filtering and reducing, the number of dog videos would still be in the hundreds of thousands.

We will implement something called batch processing and lazy loading:
basically, take a few of the 77 videos, show them to the user, and load more in the background when the user is about to finish engaging with the content served before.

We will take a batch size of 20, i.e., we will show 20 videos to the user, splitting the batch according to the preference_score of each category. For now we won't load more, mostly because lazy loading is not strictly relevant to the topic at hand.

\[\text{current category}_i = \text{batch size} \cdot \frac{\text{preference score}_i}{\sum_j \text{preference score}_j}\]

With the help of this simple equation, we need around 13 videos of dogs and 7 videos of cats in batch 1 to keep users hooked to the app.
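Here's a quick sketch of that split, using the preference scores from the history above (dog ≈ 0.6, cat ≈ 0.33; space is dropped because its score is negative):

batch_size = 20
# mean scores from the user's history; space is excluded (negative score)
preference_score = {"dog": 0.6, "cat": 0.33}

total = sum(preference_score.values())
batch_allocation = {
    topic: round(batch_size * score / total)
    for topic, score in preference_score.items()
}
print(batch_allocation)  # {'dog': 13, 'cat': 7}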

Basically, we divided the 77 videos into separate tables based on category,
iterated over the tables to pull the required number of videos per category, and lastly interleaved them so the user won't get consecutive videos on the same topic.

And voila! Now our content to suggest is personalised.

Here we could now think of adding a bit of exploration.. say, make 10% of the next batch random topics. But I am going to skip it for now, because if you haven't already noticed, this algorithm and recommendation system has a huge issue… and in order to see it, we need to scale this algorithm up.

Scaling Up: More Comprehensive System

So after coding for a while, here is a more complex algorithm and a proper interface which mimics TikTok/Reels/YT Shorts etc., built with Flask.

The user can now like, unlike, share, comment and save, and we also track how much time the user spent on the video/content and how many times they rewatched it.

And based on that we now have different scoring for different interaction:

interactions = {
    "view_time": {"skip": -1,"short": -0.5, "long": +1.5,}, #where skip = <2 seconds, short = 7 seconds or less, long is for more than 7 seconds
    "like": +1,
    "comment": +2,
    "share": +3,  # we can further classify it as{"private": +2, "public": +3} where private = posting to story, public = sharing it in the DMs, but for simplicity we are keeping one score.
    "save": +3,
    "rewatch_count" : 0

    #some other examples of interaction can be:
    # "follow": +5,
    # "dislike": -1,
    # "comment_like": +0.5,
    # "comment_reply": +1
    # "comment_sentiment"
}
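
As a rough illustration (not the exact code from the repo), here's a hypothetical helper that turns one video's interaction log into a single score using the table above; how rewatches get weighted is my own assumption:

def score_video(log):
    """Hypothetical helper: sum the interaction weights for one video."""
    score = interactions["view_time"][log["view_time"]]  # "skip", "short" or "long"
    for action in ("like", "comment", "share", "save"):
        if log.get(action):
            score += interactions[action]
    # assumption: each rewatch counts like another "long" view
    score += log.get("rewatch_count", 0) * interactions["view_time"]["long"]
    return score

# e.g. watched fully, liked, shared and rewatched once -> 1.5 + 1 + 3 + 1.5 = 7.0
print(score_video({"view_time": "long", "like": True, "share": True, "rewatch_count": 1}))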

Also, we have more than cats, dogs and space.

content_categories_dict = {
    "Animals & Nature": [
        "cats", "dogs", "birds", "wildlife", "aquatic life", "nature", "landscapes"
    ],
    "Entertainment": [
        "movies", "tv shows", "gaming", "anime", "sports", "memes", "music", "dance"
    ],
    "Lifestyle": [
        "travel", "food", "fashion", "fitness", "photography", "beauty", "home decor"
    ],
    "Education & Knowledge": [ 
        "science", "space", "history", "psychology", "technology", "coding", "finance"
    ],
    "Inspiration & Personal Growth": [
        "motivation", "self-care", "mindfulness", "productivity", "philosophy"
    ],
    "Trends & Modern Culture": [
        "AI", "crypto", "gadgets", "cars", "current events", "startups"
    ],
    "Art & Creativity": [
        "painting", "design", "writing", "theater", "crafts", "literature"
    ],
    "Misc & Social": [
        "relationships", "family", "humor", "news", "politics"
    ]
}

Note: We are assuming each video is 15 seconds long.

Based on these new ways to interact and the tracking parameters (view duration and number of rewatches), we get a much better score per video/topic… and videos can now also have more than one category and sub-category.. making this a much closer-to-real-life algorithm than the previous one.

We start again with the random-content strategy, mimicking a fresh account, and after the user interacts with the content for a while, we end up with a history of their interactions and scores.

We can now use this data to personalise the algorithm and this… is where the issue begins…

Making the Algorithm Learn

Why Machine Learning?

With just 3 categories it was easy to keep track of preference_scores and then use them while ranking. Imagine having thousands of categories: keeping track of preference scores per category, filtering, ranking, sorting and serving videos would be a humongous task.. even with a more efficient version of everything I did.. And even if we somehow pulled it off, how do we keep track of trends? New topics? New categories?

There has to be a better way, and this is where Machine Learning comes into the picture… sort of… Let's make this algorithm smarter.

We don't have to do much per se; it's all about how we use the user's history, i.e. the data, to train the machine learning model… and mind you, in real life platforms collect much, much more data: what content the user shares and to whom, who shares what content with the user, source data, what ads they interacted with, and one obvious thing, what the user comments and what the sentiment behind the comment is.

Now it's a matter of choosing an ML model and making a couple of additions to our flow. But first, what do we even mean by training a model?

We have been doing these two steps so far:

  1. Retrieving the videos from “millions” of videos in our database, which the user is/might be interested in.
  2. Ranking the videos to keep users hooked to the app.

But what even is an ML model, and how do we train it?

Think of it like teaching a child through examples, telling the child what action or thing is good, what action is bad, what are do’s and dont’s.

For example:

  • Do: say “thank you” when someone gives you candy.
  • Don’t: throw the candy on the floor.

Or:

  • Do: pet the dog gently.
  • Don’t: pull the dog’s tail.

The child doesn’t need to understand why these rules exist, they just follow them because they keep hearing the same feedback from parents and teachers or rather, the parents and teachers keep “labeling” the actions. Over time, the repetition trains them: some actions get praise, others get corrected, and they eventually learn and recognise the do’s and dont’s.

Similarly, if I show you 100 videos and tell you which ones I liked and which ones I skipped, over time you'd start noticing patterns like "oh, he usually likes videos with dogs and upbeat music", which can be translated as DO recommend him dog videos, DO NOT recommend him cat videos. And that's pretty much what machine learning does.

Instead of me, it's a model looking at huge amounts of past data - what the user watched, skipped, replayed, commented on - and it finds patterns. Once it learns those patterns, it can make predictions: "The user will probably like this next video."

Training the model basically means: feed it a bunch of history (data), let it find patterns (math), and then test if it’s predicting correctly. If it is, great. If not, we tweak it and keep going.

At the core, ML is just pattern recognition at scale using math and statistics to spot relationships that humans might miss.

Retrieving Videos with ML

Machine learning models do not understand text, i.e. we cannot just directly tell the model "this is a cat video" or "this is a dog video". We need a way to turn that into numbers.

One way to do it could be assigning each category and subcategory a number, but that would lose the linguistic meaning.

Introducing Embeddings.

Understanding Embeddings

Think of embeddings like giving every word (or item, or video) its own address in a multi-dimensional world, a giant coordinate system where similar things live close to each other.

For example in a XYZ coordinate system or 3D coordinate system:

  • “dog” might be at (0.7, 0.1, 0.3)
  • “cat” might be at (0.6, 0.2, 0.4)
  • “car” might be way off somewhere like (−0.2, 0.9, 0.8)

See the idea? “Dog” and “cat” are neighbours, but “car” lives far away in this space, which tells the model that dog and cat are related, and car is unrelated.

To make it more intuitive, let's dive deeper into the example.

Our 3D coordinate system isn’t just random X, Y, and Z, each axis means something. Say:

  • The X-axis represents how “alive” something is (from lifeless objects to living beings).
  • The Y-axis represents how “cute or friendly” something feels.
  • The Z-axis represents how “mechanical or natural” it is.

Now, if we plot a few words:

  • Dog → (0.7, 0.9, 0.2) => it’s very alive, very friendly, and not mechanical.
  • Cat → (0.8, 0.7, 0.3) => also alive and somewhat friendly, still natural.
  • Car → (0.1, 0.2, 0.9) => not alive, not particularly friendly, and very mechanical.

Suddenly the numbers mean something.
You can see why “dog” and “cat” would land near each other, both alive, both natural, while “car” would drift far away in this 3D space.
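You can check this intuition with a few lines of NumPy, using the toy coordinates above:

import numpy as np

embeddings = {
    "dog": np.array([0.7, 0.9, 0.2]),
    "cat": np.array([0.8, 0.7, 0.3]),
    "car": np.array([0.1, 0.2, 0.9]),
}

# Euclidean distance between each pair: smaller = more similar
for a in embeddings:
    for b in embeddings:
        if a < b:
            print(a, b, round(float(np.linalg.norm(embeddings[a] - embeddings[b])), 2))
# dog and cat come out much closer to each other than either is to car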

In our case, we can do this for videos too. Each video can be converted into an embedding: a set of numbers that captures what the video is about. The sound, the visuals, the description, the tags - all of it gets packed into one neat numerical vector.

Then, when the model wants to find “similar” videos, it just looks for the ones whose embeddings are closest to each other in this multi-dimensional world.

How Embeddings Are Created

So how does one even figure out those numbers or what axes represent? Who decided “dog” should be at (0.7, 0.1, 0.3) and not somewhere else?

The answer is… you guessed it.. Machine Learning Models! Neural Networks to be specific. There are other methods, but I believe Neural Network is the leading one.

The model learns them by being shown billions of sentences; it slowly starts realising that these words belong together and those don't, and it begins placing them accordingly in its coordinate space. It looks at all these examples, adjusts itself a little each time it's wrong, and gradually learns what kinds of words tend to show up together. That's how embeddings form.

We put our objects (e.g. text, video, audio) into a suitable trained embedding model, and the model outputs the embeddings.

Luckily, we didn't have to build our embedding model from scratch. We used Google Gemini Embeddings (which have 3072 dimensions!).
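For reference, getting an embedding out of Gemini looks roughly like this - a sketch assuming the google-genai Python SDK and the gemini-embedding-001 model, so check the current SDK docs for the exact names:

# sketch: embedding a video description with the Gemini API
# assumes the google-genai SDK and a GEMINI_API_KEY in the environment
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment
result = client.models.embed_content(
    model="gemini-embedding-001",
    contents="A golden retriever puppy learning to catch a frisbee",
)
embedding = result.embeddings[0].values  # a list of 3072 floats
print(len(embedding))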

So, does the algorithm know what the video is about?

So to answer the question we began with, does it even know if it is cat video or political video?

If it is not clear yet, the answer is.. well- yes and no. The model knows what the video is about, but in some multidimensional vector form, it does not explicitly know that the video is about cats or politics.

In our case, I simply used text and pre-assigned categories and sub-categories, then converted them to embeddings. On an actual social media platform, the images or videos we post are converted into embeddings themselves. So if I were using actual images or videos instead of text, I would have converted them into embeddings with a suitable embedding model.

(A purely text embedding model, cannot be used to convert images into embeddings)

For the new, smarter algorithm, I made a new "database" of 500 videos. Each video now has categories, sub-categories, a description (to mimic captions) and embeddings.

User Embeddings

After the cold start, say the user interacted with a couple of memes and funny videos. For the next batch we need to retrieve videos based on what the user liked, i.e. videos similar to "memes" and "funny videos".

The way we find similar content is with a similarity search method (we will talk about those later): basically, we compare two vectors (i.e. embeddings) by doing some math to check how similar they are.

So one of the two vectors is obviously a video's embedding, and the other is what we will call the "user_embedding".

To create the user_embedding from the user's history, we take a weighted average of the embeddings of the videos the user interacted with, using the scores as weights. In other words, embeddings of videos with higher scores contribute more to the user_embedding, and we end up with one vector (embedding) that encapsulates the user's preferences based on what they interacted with.

\[\text{User Embedding} = \frac{(\text{Embedding Of Video 1} \cdot \text{Score Of Video 1}) + (\text{Embedding Of Video 2} \cdot \text{Score Of Video 2}) + \dots + (\text{Embedding Of Video N} \cdot \text{Score Of Video N})}{\text{Sum Of All Scores}}\]

Once we get the user embedding, we search the database for videos which are “near” or “similar” to the user_embeddings.

The Search Problem

The easiest thing we can do here is iterate through the whole database and perform a "similarity search"… but that would be incredibly inefficient. In real-world scenarios, social media platforms have millions and millions of videos and posts; they cannot brute-force the search, as it would take forever to fetch similar content every time.

Introducing Indexed Database & FAISS

Indexed Databases & FAISS

An index is like a super-organized map of all embeddings that allows the system to quickly locate the closest vectors instead of checking every single one. Think of it like a library. Instead of reading every book to find the one about cats, you use the catalog to go straight to the shelf where cat books are.

FAISS is a library made by Facebook AI Research specifically for this. It stands for Facebook AI Similarity Search. FAISS helps us build a fast, searchable index of embeddings so we can find videos similar to a user’s preferences in milliseconds, even if we have millions of them.

Building the FAISS Index

Here’s a simple example of building an index using FAISS.

import faiss
import numpy as np

# step 1: index the data

dimension = 3072  # Gemini embeddings have 3072 dimensions

# build index using L2 distance (Euclidean)
index = faiss.IndexFlatL2(dimension)

for row in videos_db.itertuples():
    # embeddings are stored as 1D lists; FAISS expects a 2D float32 array
    np_arr = np.array(row.embeddings, dtype="float32")
    np_arr_2d = np_arr.reshape(1, -1)
    index.add(np_arr_2d)

print(index.ntotal)  # check how many vectors are in the index

Now the index is ready. Let’s try searching for similar videos:

# pick a video from the DB as the query
x_emb = np.array(videos_db.iloc[2].embeddings, dtype="float32")
x_emb_2d = x_emb.reshape(1, -1)
k = 5  # number of similar videos to retrieve

# sanity check
dist, similar_item_index = index.search(x_emb_2d, k)
print(f"Distances = {dist}\nIndices of Them in Index/DB = {similar_item_index}")
Distances = [[0. 0. 0. 0. 0.]]
Indices of Them in Index/DB = [[  2 237 281 389 414]]

Here you can see all five distances are 0: the query is itself a video from the database, and apparently our dummy database contains several videos with identical embeddings. In general, the results returned are simply the closest videos in the embedding space.

Using the Index for Recommendations

Once the index is built, we can use it in our video recommendation pipeline. Every time a user interacts with videos, we update their user embedding based on their history:

def update_user_embedding_simple():
    # i am pretty sure this feels illegal, this could all be done in a better way with a proper database.
    global user_embedding

    if len(history) == 0:
        return

    history_df = pd.DataFrame(history).T
    embedding_matrix = []
    weights = []
    for row in history_df.itertuples():
        embedding_matrix.append(row.embeddings)
        weights.append(int(row.score) + 1)

    print(weights)

    # normalise the weights so they sum to 1
    weights = np.array(weights) / np.sum(weights)

    # weighted average of the history embeddings, reshaped to 2D for FAISS
    user_embedding_1d = np.average(np.array(embedding_matrix), axis=0, weights=weights)
    user_embedding = user_embedding_1d.reshape(1, -1).astype("float32")
    print("user_embedding updated")

Then, when fetching new videos:

  1. Take the current user embedding.
  2. Query the FAISS index for the top k nearest video embeddings.
  3. Return videos corresponding to these indices.
  4. Update user embedding periodically, e.g., every 5 videos watched.

Current Issues

A. Videos often appear consecutively from the same topic, which can bore the user.
B. After showing about 20 videos, the algorithm runs out of similar videos and simply refuses to load more.

Understanding Search Methods

IndexFlatL2 & K-Nearest Neighbors

This is the simplest type of FAISS index. It works on the principle of Euclidean distance, which is basically the straight-line distance between two points in multi-dimensional space.

For two vectors $p$ and $q$ of dimension $d$, the Euclidean distance is:

\[d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_d - q_d)^2}\]

IndexFlatL2 uses k-nearest neighbors (kNN) to find the closest vectors. In simple terms, it calculates the distance from the query vector to all vectors in the database and picks the top k closest ones.

kNN is not smart, it’s brute force. It works well for small datasets but becomes slow with millions of items.
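That brute-force search is short enough to write by hand, which is a decent mental model for what IndexFlatL2 does internally (a sketch, assuming `video_embeddings` is a NumPy array of shape (num_videos, 3072)):

import numpy as np

def knn_bruteforce(query, video_embeddings, k=5):
    """Conceptually what IndexFlatL2 does: compare the query against every stored vector."""
    dists = np.sum((video_embeddings - query) ** 2, axis=1)  # squared L2 distance to each video
    nearest = np.argsort(dists)[:k]                          # indices of the k closest videos
    return nearest, dists[nearest]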

A smarter approach is to cluster embeddings first and search inside relevant clusters, which is where K-Means and IndexIVF come in.

IndexIVF & K-Means Clustering

FAISS provides IndexIVFFlat to combine K-Means clustering with a flat index. K-Means creates clusters of similar videos by finding k centroids. Every embedding is assigned to the nearest centroid. Later, when searching for similar videos, you only check embeddings in the nearest cluster(s), saving time while still giving good approximations.

It requires more parameters than IndexFlatL2, because we need to specify:

  • The number of clusters (num_clusters)
  • A quantizer (to store centroids)
  • The dimensionality of embeddings

dimensions = 3072  
num_clusters = 10 # arbitrary, 500 videos => ~50 per cluster

# quantizer to store centroids  
quantizer = faiss.IndexFlatL2(dimensions)

# create IndexIVF  
index_v2 = faiss.IndexIVFFlat(quantizer, dimensions, num_clusters)

# collect embeddings  
embeddings = []  
for row in videos_db.itertuples():  
    embeddings.append(row.embeddings)

np_emb = np.array(embeddings).astype('float32')

# train the IVF index  
index_v2.train(np_emb)

# add embeddings  
index_v2.add(np_emb)

Testing the New Index

Now we can test retrieving similar videos:

x_emb = np.array(videos_db.iloc[4].embeddings, dtype="float32")
x_emb_2d = x_emb.reshape(1, -1)

k = 5

dist, i = index_v2.search(x_emb_2d, k)
print(f"Distances = {dist}\nIndices of Them in Index/DB = {i}")
Distances = [[0.         0.12388569 0.12388569 0.12753543 0.15515578]]
Indices of Them in Index/DB = [[  4  22 369  19 289]]

Trying a higher k:

k = 10

dist, i = index_v2.search(x_emb_2d, k)
print(f"Distances = {dist}nIndices of Them in Index/DB = {i}")
Distances = [[0.         0.12388569 0.12388569 0.12753543 0.15515578 0.15524048
  0.15922126 0.22454357 0.2858926  0.31604165]]
Indices of Them in Index/DB = [[  4  22 369  19 289 353 241  37 232  58]]

If we increase k beyond the estimated number of items per cluster:

k = 100

dist, i = index_v2.search(x_emb_2d, k)
print(f"Indices of Them in Index/DB = {i}")
Indices of Them in Index/DB = [[  4  22 369  19 289 353 241  37 232  58  87 381 473 340 377 119 304 322
  224 100 488  74 429  51 335 410 132 307 407  16  10 174 475 401 270 433
  273 378  66 463 469 114 141 342  38 447 436 424 115  97 344  68 101   9
  124  59 127 110 476  13  34 294 288  12 370 484 240  61  63 327 376  77
  430 103  88 159 209 395 262 419  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1]]

Notice the -1 at the end, which means the index ran out of videos in the searched cluster. By default, nprobe = 1, meaning only one cluster is considered.

Tuning with nprobe

To get more varied results, we can increase nprobe, the number of clusters to search:

k = 100
index_v2.nprobe = 3

dist, i = index_v2.search(x_emb_2d, k)
print(f"Indices of Them in Index/DB = {i}")
Indices of Them in Index/DB = [[  4  22 369  19 289 353 241  37 232  58  87 381 473 340 377 119 248 304
  322 255 286 338 399 224 100 260 236 104 207 446 206 488 264  74 429  51
  335  99 410 362  28  30 391 190 229 132 307 407  16  10 174 475 401 155
  270 433  80  49 273 378  73  96  66   1 463 357 297 406 329 252 142 469
  412 114 141 342 478 140 107 315 200 150 411 205 380  38 456 447  92  43
  44 212 247 277 420 436 424 225 115  97]]

Now we have a more varied output.

Refining the Algorithm

After switching to IndexIVFFlat, the system worked better: faster searches, more variety. But we still had those nagging issues from earlier:

  • Videos often appeared consecutively from the same topic, which bored users.
  • After about 20 videos, the feed would just… stop. Run out of recommendations.

Time to fix both.

The Sliding Window Approach

The first thing I tackled was making recommendations feel more dynamic, more responsive to what the user was currently interested in.

Think about it. If you binged cat videos for the last 10 reels but are currently showing a preference for cooking content, why should the cat preference still dominate your feed? It shouldn't.

So I implemented a sliding window. Instead of using the user's entire watch history to compute their embedding, I only used the last 5 videos they genuinely engaged with. Not just any videos - only ones that scored 2 or higher, meaning they watched it all the way through, maybe even rewatched it or interacted with it in some way.

Here’s what that looks like:

def update_user_embedding_simple_v2():
    global user_embedding
    
    if len(history) == 0:
        return
    
    history_df = pd.DataFrame(history).T
    embedding_matrix = []
    weights = []
    
    # Keep only the last 5 videos with score >= 2
    filtered_df = history_df[history_df["score"] >= 2].tail(5)

    # nothing engaging watched yet? keep the previous user embedding
    if len(filtered_df) == 0:
        return

    for row in filtered_df.itertuples():
        embedding_matrix.append(row.embeddings)
        weights.append(int(row.score) + 1)
    
    print(weights)
    
    weights = np.array(weights) / np.sum(weights)
    user_embedding_1d = np.average(np.array(embedding_matrix), axis=0, weights=weights)
    user_embedding = user_embedding_1d.reshape(1, -1).astype("float32")
    
    print("user_embedding updated")

This change made the recommendations feel more flowy. The feed adapted faster to shifting interests. It mimicked something real social media platforms do called time decay, where recent behavior matters more than old behavior.

The proper way to implement time decay would be to add timestamps to each interaction and give exponentially decreasing weight to older videos. But since I wasn’t tracking timestamps, the sliding window gave me a simpler, good-enough solution.

Solving the “Running Out” Problem

Here's what was happening. The algorithm would find 10 similar videos, and if a video had already been shown to the user, it would skip it. But after a few batches, all the similar videos had already been shown and no videos were returned at all.

So instead of skipping already-shown videos, I replaced them with random ones:

if any(rand_vid_id == video[0] for video in feed):
    print(f"{rand_vid_id} already shown, showing random video instead")
    rand_vid_id = video_keys[random.randint(0, len(video_keys) - 1)]

This served two purposes:

1. It prevented the feed from dying. We’d always return 10 videos per batch, no matter what.

2. It added even more variation naturally. If the algorithm kept finding the same similar videos, it would automatically inject random ones instead, keeping things fresh. This also allowed the user to "explore".

The 70-30 Mix: Exploitation vs Exploration

Now, onto problem A. When I first implemented the IndexIVFFlat search, I’d request 10 similar videos at once. Great for relevance, terrible for variety.

What happened was this: all 10 videos would be super similar. If you watched one meme video, boom, here’s 10 more memes in a row. Then 10 more. Then 10 more. It got repetitive fast.

The solution? Remember the exploitation-exploration trade-off from the earlier section? A 70-30 split.

I changed the algorithm to fetch only 7 similar videos based on the user embedding, then threw in 3 completely random videos for variation. This kept the feed relevant while breaking up monotony.

if user_embedding is not None:
    faiss_index.nprobe = 5
    
    # Fetch 7 similar videos instead of 10
    dist, similar_item_index = faiss_index.search(user_embedding, 7)
    
    print(f"Distances = {dist}\nIndices of Them in Index/DB = {similar_item_index}")
    video_keys = list(all_videos.keys())
    
    for i in similar_item_index[0]:
        if i >= len(video_keys):
            continue
        
        rand_vid_id = video_keys[i]
        
        # Skip if video already in feed and show random video instead
        if any(rand_vid_id == video[0] for video in feed):
            print(f"{rand_vid_id} already shown, showing random video instead")
            rand_vid_id = video_keys[random.randint(0, len(video_keys) - 1)]
        
        feed_data = [rand_vid_id]
        for val in list(all_videos[rand_vid_id].values()):
            feed_data.append(val)
        
        feed.append(feed_data)
        new_videos.append({
            "video_id": feed_data[0],
            "description": feed_data[3] if len(feed_data) > 3 else "",
        })
    
    print(f"Returning {len(new_videos)} new similar videos")
    
    # Add 3 random videos for variation
    for i in range(3):
        vid_for_variation_id = video_keys[random.randint(0, len(video_keys) - 1)]
        temp = [vid_for_variation_id]
        for val in list(all_videos[vid_for_variation_id].values()):
            temp.append(val)
        
        feed.append(temp)
        new_videos.append({
            "video_id": temp[0],
            "description": temp[3] if len(temp) > 3 else "",
        })
    
    print(f"Returning {len(new_videos)} new videos total")

After implementing this, the feed felt much more natural. You'd get a cluster of related content, then something unexpected, then back to relevant stuff. It kept users engaged longer. This also let the algorithm discover whether the user has any other interests.

Shuffling

One last touch. Even with the 70-30 split, the videos were still coming in a predictable order: 7 similar, then 3 random. Users could potentially notice the pattern.

So I shuffled the final list:

random.shuffle(new_videos)
return jsonify(new_videos)

Now the similar and random videos were interspersed randomly. To the user, it all felt organic. They couldn’t tell which videos were algorithmically chosen and which were random serendipity. And that’s exactly how it should feel.

The Complete Pipeline

Putting it all together, here’s what the final video fetching logic looked like:

@app.route("/fetch_more", methods=["GET"])
def fetch_more_videos_v2():
    try:
        global all_videos, feed
        new_videos = []
        
        if user_embedding is not None:
            faiss_index.nprobe = 5
            dist, similar_item_index = faiss_index.search(user_embedding, 7)
            
            print(f"Distances = {dist}\nIndices of Them in Index/DB = {similar_item_index}")
            video_keys = list(all_videos.keys())
            
            for i in similar_item_index[0]:
                if i >= len(video_keys):
                    continue
                
                rand_vid_id = video_keys[i]
                
                if any(rand_vid_id == video[0] for video in feed):
                    print(f"{rand_vid_id} already shown, showing random video instead")
                    rand_vid_id = video_keys[random.randint(0, len(video_keys) - 1)]
                
                feed_data = [rand_vid_id]
                for val in list(all_videos[rand_vid_id].values()):
                    feed_data.append(val)
                
                feed.append(feed_data)
                new_videos.append({
                    "video_id": feed_data[0],
                    "description": feed_data[3] if len(feed_data) > 3 else "",
                })
            
            print(f"Returning {len(new_videos)} similar videos")
            
            for i in range(3):
                vid_for_variation_id = video_keys[random.randint(0, len(video_keys) - 1)]
                temp = [vid_for_variation_id]
                for val in list(all_videos[vid_for_variation_id].values()):
                    temp.append(val)
                
                feed.append(temp)
                new_videos.append({
                    "video_id": temp[0],
                    "description": temp[3] if len(temp) > 3 else "",
                })
            
            print(f"Returning {len(new_videos)} total videos")
        
        else:
            print("Not enough history yet, returning random videos")
            new_videos = get_random_videos()
        
        if len(new_videos) == 0:
            new_videos = get_random_videos()
            print("All similar videos shown, returning random videos")
        
        random.shuffle(new_videos)
        return jsonify(new_videos)
    
    except Exception as e:
        print("Error in /fetch_more:", e)
        import traceback
        traceback.print_exc()
        return jsonify([]), 500

The algorithm now had:

  • Sliding window for adaptive, recent-preference-focused recommendations
  • 70-30 mix of similar and random videos for variety
  • Smart replacement of duplicate videos to prevent feed death
  • Shuffling for organic feel
  • IndexIVFFlat with nprobe=5 for fast, varied similarity search

Why We Didn’t Rank (And Why We Didn’t Need To)

Throughout this project, you might’ve noticed something. We spent a lot of time on retrieval, getting the right videos from the database using embeddings and FAISS, but we never really talked about ranking them.

In the earlier, simpler algorithms, ranking was everything. We’d calculate preference scores per category, sort videos by those scores, and serve them in order. But once we moved to the ML-based approach with embeddings, ranking became… less necessary. And here’s why:

The retrieval itself is doing the ranking.

When we query FAISS with the user embedding, it returns videos sorted by similarity distance. The closest videos come first, the furthest come last. That is the ranking. The videos most similar to what the user likes are already at the top.

In real social media algorithms, ranking is a separate step because they need to factor in way more than just similarity:

  • Recency - newer content gets a boost
  • Engagement rate - videos with higher like/share rates rank higher
  • Advertiser priorities - promoted content needs to be injected
  • Diversity - prevent too many similar videos in a row
  • User state - time of day, device type, session length
  • Social signals - content from accounts you follow or interact with

They use separate ML models (often gradient boosted trees or neural networks) trained specifically for ranking. These models take the retrieved videos and re-order them based on predicted engagement probability.

But for our algorithm? We kept it simple. FAISS retrieval + a 70-30 mix of similar/random videos + shuffling was enough to create a decent feed. The similarity distance from FAISS was a good enough proxy for “how much the user will like this.”

Could we add a ranking layer? Absolutely. We could train a model to predict the user’s score for each video based on features like:

  • Similarity distance from FAISS
  • Video’s overall engagement rate in the database
  • How long ago the user watched a similar video
  • The user’s average watch time

But honestly? For a 500-video database and a proof-of-concept project, it felt like overkill. The retrieval was doing its job well enough.

That said, if I were to scale this to millions of videos and thousands of users, ranking would become necessary. You can’t just rely on FAISS distance alone when you need to balance trends, monetization, and user retention. But for now, we’re good.

Conclusion

So.. does the algorithm know me? Well, sort of. It doesn’t “know” me in the way a friend does, but it knows my patterns, my preferences encoded in 3072-dimensional space. It knows that when I watch a video till the end, rewatch it, or share it, I’m telling it something. And over time, those signals add up.

What started as a simple question, “how does the algorithm know what content is?” turned into building one from scratch. And honestly? It’s not magic. It’s not some black box that’s impossible to understand. It’s just math, patterns, and a lot of clever optimization.

The algorithm I built is nowhere near as sophisticated as what TikTok or Instagram uses. They have teams of engineers, way more data, way more compute power, and years of refinement. But the core ideas? They’re the same. Track interactions. Convert content to embeddings. Find similar stuff. Serve it fast. Keep the user hooked.

And that’s kinda the scary part. These algorithms are really good at what they do. They learn quickly, adapt constantly, and they’re designed with one goal: maximize engagement, which translates to maximize time on app, which translates to.. well, money.

But now I get it. I understand why my feed knows me so well. Why I can lose hours scrolling. Why that one random video I watched once keeps showing me similar content days later.

It’s not magic. It’s just really, really good pattern recognition.

Future Improvements

This algorithm works, but there’s still a lot of room to make it better, more realistic, and more efficient. Here are some things I’d like to implement:

1. Using ML Models for User Embeddings

Right now, I’m creating user embeddings by taking a weighted average of video embeddings from the user’s history. It’s simple, it works, but it’s not exactly smart.

A better approach would be to train a neural network that learns how to combine video embeddings based on the user’s interaction patterns. Instead of just averaging, the model would learn things like:

  • Which videos should have more influence on the user embedding
  • How different interaction types (like, share, rewatch) should be weighted
  • How to decay the influence of older videos naturally
  • How to detect when a user’s interests are shifting vs. just exploring

The model would take in the user's history and output an optimized user embedding that better captures their current preferences. This is what real recommendation systems do: they don't just average embeddings, they learn how to represent users.

I could use something like a simple feedforward network or even an LSTM if I wanted to capture temporal patterns in viewing behavior. The training data would be: given a user’s history up to time T, predict which videos they’ll engage with at time T+1.

2. Better Indexing: HNSW

Right now I’m using IndexIVFFlat, which works by clustering embeddings and searching within clusters. But there’s an even better method: HNSW (Hierarchical Navigable Small World).

HNSW builds a graph-based index where embeddings are connected in layers, kind of like a highway system. When searching, it starts at the top layer (the highways) for fast, broad navigation, then zooms into lower layers (local roads) for precise results. It’s faster than IVF for high-dimensional data and gives better accuracy.

But here’s the kicker: HNSW is also way better at handling trends and fresh content.

With IndexIVFFlat, the cluster centroids are fixed once the index is trained; you can technically keep adding vectors, but as the content drifts the clustering goes stale and you eventually have to retrain and rebuild, which is expensive and slow. HNSW, on the other hand, handles dynamic updates gracefully. You can keep adding new videos to the graph without rebuilding everything, which means trending content can surface faster.

Right now, my algorithm doesn’t handle trends at all. A video that’s blowing up globally would be treated the same as any other video unless I manually boosted it. With HNSW + dynamic updates + some recency weighting, the algorithm could actually detect and surface trending content in real-time.

FAISS has IndexHNSWFlat built in, and switching to it would probably make searches faster, more varied, and way more responsive to what’s happening right now on the platform.
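Switching would be a small change, roughly like this (a sketch; 32 is a common choice for the number of graph neighbours per node, and `new_video_embeddings` is a hypothetical batch of freshly uploaded videos):

# sketch: swapping the IVF index for HNSW
dimensions = 3072
index_hnsw = faiss.IndexHNSWFlat(dimensions, 32)  # 32 = neighbours per node in the graph

index_hnsw.add(np_emb)                 # no .train() step needed
index_hnsw.add(new_video_embeddings)   # new videos can be appended later without a rebuild

dist, idx = index_hnsw.search(user_embedding, 10)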

3. Time Decay

Right now my sliding window approach is a hack. I only look at the last 5 videos with high scores, but I don’t actually track when the user watched them.

A proper implementation would add timestamps to every interaction and apply exponential time decay. Videos watched 10 minutes ago should have way more influence on the user embedding than videos watched 3 days ago. The math would look something like:

\[\text{weight}_i = \text{score}_i \cdot e^{-\lambda \cdot \text{time elapsed}_i}\]

where λ controls how fast older interactions lose relevance. This would make the algorithm way more responsive to shifting interests.
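In code, the weighting could look something like this (a sketch; the λ value and timestamps are made up):

import time
import numpy as np

decay_rate = 0.001  # lambda: how fast old interactions fade (per second, here)
now = time.time()

# hypothetical history: (interaction score, unix timestamp of the interaction)
watched = [(3.0, now - 600), (5.5, now - 3600), (2.0, now - 3 * 86400)]

weights = np.array([score * np.exp(-decay_rate * (now - ts)) for score, ts in watched])
print(weights)  # the 10-minute-old video dominates; the 3-day-old one barely counts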

4. Diversity Penalty

Even with the 70-30 mix, the feed can still get repetitive. Real algorithms use something called a diversity penalty or MMR (Maximal Marginal Relevance).

The idea is: when picking the next video to show, don’t just pick the most similar one. Instead, pick videos that are similar to the user embedding but different from videos already in the current batch. This forces variety without sacrificing relevance.
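Here's a small sketch of MMR-style selection on top of the FAISS candidates (cosine similarity; `lambda_` trades off relevance against diversity - both names are mine):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(user_emb, candidates, n=10, lambda_=0.7):
    """Greedy MMR: prefer videos relevant to the user but unlike those already picked."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < n:
        def mmr_score(i):
            relevance = cosine(user_emb, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into `candidates`, in the order they'd go into the feed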

5. Using Actual Videos and Images

Right now I’m just using text descriptions and pre-assigned categories, then converting them to embeddings. But in real social media, the algorithm analyzes the actual video content, the visuals, the audio, the captions, everything.

I’d love to implement this by:

  • Using a vision model like CLIP to extract embeddings from video frames
  • Using an audio model to analyze sound/music
  • Combining text, visual, and audio embeddings into one unified representation

This would make the algorithm actually “see” what’s in the video instead of relying on descriptions.

6. Collaborative Filtering

Right now the algorithm only looks at what I like. But real algorithms also use collaborative filtering, which looks at what people similar to me like.

If 100 users who watch the same cat videos as me also watch a specific cooking video, the algorithm should suggest that cooking video to me too, even if I haven’t shown interest in cooking yet. This is how you discover content outside your usual bubble.

I’d need to track user-to-user similarity and blend it with content-based recommendations, which would require way more data and a different architecture, but it’d be a huge improvement.
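A very rough sketch of the idea, assuming we kept one embedding per user (the `user_embeddings` dict and `liked_videos` lookup are hypothetical):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similar_users(me, user_embeddings, top_n=5):
    """Users whose taste vectors point in roughly the same direction as mine."""
    others = [(uid, cosine(user_embeddings[me], emb))
              for uid, emb in user_embeddings.items() if uid != me]
    return [uid for uid, _ in sorted(others, key=lambda x: x[1], reverse=True)[:top_n]]

def collaborative_candidates(me, user_embeddings, liked_videos):
    """Videos liked by similar users that I haven't engaged with yet."""
    candidates = set()
    for uid in similar_users(me, user_embeddings):
        candidates |= liked_videos[uid]   # liked_videos: user_id -> set of video_ids
    return candidates - liked_videos[me]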

7. A/B Testing and Feedback Loops

Right now I have no way to measure if the algorithm is actually getting better. Real companies run A/B tests constantly, trying different ranking strategies, different weights for interactions, different exploration rates and they measure everything. Engagement rate, session length, retention, etc.

I’d love to build a simulation where I can test different algorithm tweaks against simulated user behavior and see which version performs best. That’s how you iteratively improve.

8. Handling Fresh Content (Exploration)

My algorithm is decent at exploitation (showing what the user likes) but pretty basic at exploration. The 3 random videos per batch is better than nothing, but it’s not smart.

I would like to explore more approaches to make the model better at exploration.

This would also help with the cold start problem for new videos. Right now, if a video has no interactions yet, the algorithm has no idea if it’s good or bad.


Building this algorithm was one of the most fun projects I’ve done. It’s not perfect, but it works, and more importantly, it taught me how these systems actually function under the hood.

If you made it this far, thanks for reading. Now go touch some grass. The algorithm can wait.