More than 80 per cent of the TV shows people watch on Netflix are discovered through the platform’s recommendation system. That means the majority of what you decide to watch on Netflix is the result of decisions made by a mysterious, black box of an algorithm. Intrigued? Here’s how it works.

Netflix uses machine learning and algorithms to help break viewers’ preconceived notions and find shows that they might not have initially chosen. To do this, it looks at nuanced threads within the content, rather than relying on broad genres to make its predictions. This explains how, for example, one in eight people who watch one of Netflix’s Marvel shows are completely new to comic book-based stuff on Netflix.

The infrastructure and relationships that run the network are far from simple. Let us take a look at the network architecture. We will break it down into 6 parts in order to explain it more clearly.

User Management:

Netflix runs on Amazon Web Services (AWS) and Open Connect, its in-house content delivery network ([1]). Both systems must work together seamlessly to deliver high-quality video streaming globally. From a software architecture point of view, Netflix comprises three main parts: Client, Backend and Content Delivery Network (CDN).

Client is any supported browser on a laptop or desktop, or a Netflix app on a smartphone or smart TV. Netflix develops its own iOS and Android apps to provide the best viewing experience on every client and device. By controlling its apps and other devices through its SDK, Netflix can adapt its streaming services transparently under certain circumstances, such as slow networks or overloaded servers.

Backend includes the services, databases and storage that run entirely on the AWS cloud. It handles everything that does not involve streaming video. Some of the Backend components, with their corresponding AWS services, are listed below:

  • Scalable computing instances (AWS EC2)
  • Scalable storage (AWS S3)
  • Business logic microservices (purpose-built frameworks by Netflix)
  • Scalable distributed databases (AWS DynamoDB, Cassandra)
  • Big data processing and analytics jobs (AWS EMR, Hadoop, Spark, Flink, Kafka and other purpose-built tools by Netflix)
  • Video processing and transcoding (purpose-built tools by Netflix)

Open Connect CDN is a network of servers called Open Connect Appliances (OCAs), optimized for storing and streaming large videos. These OCA servers are placed inside the networks of internet service providers (ISPs) and at internet exchange points (IXPs) around the world. OCAs are responsible for streaming videos directly to clients.

Isolation And Privacy:

In the late 2000s, Netflix ran a competition to develop a better film recommendation algorithm. To drive the competition, they released an “anonymized” viewing dataset that had been stripped of identifying information. Unfortunately, this de-identification turned out to be insufficient. In a well-known piece of work, Narayanan and Shmatikov showed that such datasets could be used to re-identify specific users — and even predict their political affiliation! — if you simply knew a little bit of additional information about a given user.

So we now understand that replacing real values with arbitrary integer identifiers does not solve the problem of preserving privacy during data analysis. This is where differential privacy comes into play: it aims to ensure that, even if related data is available elsewhere, an adversary cannot use this dataset to learn anything new about an individual.
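To make the linkage idea concrete, here is a minimal sketch in Python. It is a toy illustration, not the method Narayanan and Shmatikov actually used: the records, the names and the `link_users` helper are all hypothetical, and the matching is exact, whereas a real attack would have to tolerate noisy, approximate matches.

```python
from collections import Counter, defaultdict

# Hypothetical "anonymized" release: names replaced by integer IDs.
anonymized_ratings = [
    {"user_id": 101, "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"user_id": 101, "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"user_id": 101, "movie": "Movie C", "rating": 4, "date": "2005-04-10"},
    {"user_id": 102, "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"user_id": 102, "movie": "Movie D", "rating": 1, "date": "2005-05-20"},
]

# Hypothetical auxiliary information the adversary already has,
# for example reviews posted publicly under a real name.
public_reviews = [
    {"name": "alice", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"name": "alice", "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
]


def link_users(anonymized, auxiliary, min_matches=2):
    """Map each known name to an anonymized ID when exactly one ID
    matches at least `min_matches` of that person's public records."""
    # Index the anonymized rows by their quasi-identifying fields.
    index = defaultdict(set)
    for row in anonymized:
        index[(row["movie"], row["rating"], row["date"])].add(row["user_id"])

    # Count, per name, how many public records each anonymized ID matches.
    votes = defaultdict(Counter)
    for row in auxiliary:
        key = (row["movie"], row["rating"], row["date"])
        for user_id in index.get(key, ()):
            votes[row["name"]][user_id] += 1

    linked = {}
    for name, counter in votes.items():
        candidates = [uid for uid, n in counter.items() if n >= min_matches]
        if len(candidates) == 1:  # unambiguous re-identification
            linked[name] = candidates[0]
    return linked


linked = link_users(anonymized_ratings, public_reviews)
```

With this toy data, two public reviews are enough to single out one integer ID, and every other rating stored under that ID (here, "Movie C") is then tied to a named person.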

Privacy-Preserving Data Analysis:

Now we know that data anonymization does not solve the problem of preserving privacy while doing data analysis. But that is not the only concern here. Let’s look at a few more.

  • Data can’t be anonymized and remain useful. The problem with anonymization is that it leaves scope for linkage attacks, as we saw above in the case of the Netflix dataset. Differential privacy is robust to this type of attack because it is a property of the data access mechanism, and is unrelated to the presence or absence of auxiliary information available to the adversary.

In other words, it does not matter whether you are part of the database or not, because your presence or absence does not significantly change the answers to queries made against it.

  • Queries Over Large Sets Are Not Protective. Another major concern is that correctly answering repeated queries erodes privacy rapidly. You can’t expect to answer accurately every time and still claim privacy. Consider the case of Mr. X.

Suppose it is known that Mr. X is in a certain medical database. Taken together, the answers to the two large queries “How many people in the database have the sickle cell trait?” and “How many people, not named X, in the database have the sickle cell trait?” yield the sickle cell status of Mr. X.
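The Mr. X example can be sketched in a few lines of Python. The database contents and the `count_with_trait` helper below are made up for illustration; the point is only that two exact answers to large, harmless-looking queries differ by precisely one person's value.

```python
# Hypothetical medical database; names and values are invented.
records = [
    {"name": "X", "sickle_cell_trait": True},
    {"name": "Y", "sickle_cell_trait": False},
    {"name": "Z", "sickle_cell_trait": True},
    {"name": "W", "sickle_cell_trait": False},
]


def count_with_trait(db, exclude_name=None):
    """Exact count of people with the trait, optionally skipping one name."""
    return sum(
        1
        for row in db
        if row["sickle_cell_trait"] and row["name"] != exclude_name
    )


# Query 1: how many people in the database have the trait?
total = count_with_trait(records)
# Query 2: how many people, not named X, have the trait?
without_x = count_with_trait(records, exclude_name="X")

# The difference between the two exact answers is Mr. X's status.
x_has_trait = (total - without_x) == 1
```

Neither query names a small group, yet subtracting one answer from the other isolates a single individual.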

  • Query Auditing Is Problematic. An obvious thought is this: if I could identify every sequence of queries that compromises the privacy of my database, I could refuse to answer those queries and make the database more secure.

But there are two problems with this approach. First, the task can be computationally infeasible. Second, refusing to answer a query is itself disclosive: a refusal tells the adversary that they are on the right path to compromising privacy.

Differential privacy proposes a methodology that lets us perform analytical studies on data without leaking private information about the individuals in it.
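One standard way to achieve this for counting queries is the Laplace mechanism: add random noise, scaled to the query's sensitivity divided by a privacy parameter epsilon, to the true answer. The sketch below is a minimal illustration using only the Python standard library. The `private_count` and `laplace_noise` helpers, the sample data and the choice of epsilon are assumptions for the example, not a production implementation (real systems also track a privacy budget across queries and guard against floating-point issues).

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(db, predicate, epsilon, rng=None):
    """Noisy count of rows satisfying `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for row in db if predicate(row))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: 100 people, 30 of whom have some sensitive trait.
people = [{"id": i, "has_trait": i < 30} for i in range(100)]

rng = random.Random(0)  # seeded only to make the example repeatable
noisy = private_count(people, lambda row: row["has_trait"], epsilon=1.0, rng=rng)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means more accurate answers and weaker privacy. Because each answer is noisy, the two Mr. X queries above would no longer differ by exactly his value, although repeating queries many times still consumes privacy and has to be limited.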