Why Zoom’s Success Is Not a Coincidence — On Distributed Video Conferencing Architecture

Intel Chen
7 min read · Nov 21, 2021


TL;DR:

  1. Zoom uses a distributed architecture and client-side processing to optimize for scaling
  2. Thus, Zoom could provide a better experience at a lower price → pricing advantage
  3. And a higher capacity → perfect for the pandemic, when demand skyrocketed

What is Video Chat?

Recall the last time you FaceTimed your best friend. What was the experience like? You would

  1. find their contact
  2. click the magical FaceTime button
  3. (once they pick up)
  4. Voila! You are able to see and hear each other through the onboard camera and interact with almost no delay.

This sounds like a straightforward technical implementation. We have two clients, and all they need to do is continuously stream their video/audio signals to each other.

Video Conferencing → Video chat with three or more people.

How about three people? Same idea: the video streams are broadcast among the 3 participants. Each client sends its stream to the 2 other clients while receiving two streams, forming a triangle.

How about 100 people?

We can… hold up; you are not gonna send 100 streams to the peer clients, are you?

(This is a complete graph.)

Your video feed is the same for everyone (is it really? We will discuss that later). So not only is the graph much, much messier here, but each client is also uploading 99 copies of the same content, one per receiver. This is inefficient and infeasible: an average 720p feed takes up about 1 Mbps of bandwidth, so 99 copies would mean roughly 100 Mbps, which is not available everywhere in the world. Even a five-person meeting would be out of reach for 80% of internet users.

(Yes, I know upload speed would be an even more significant bottleneck)
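To make the mesh arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes the roughly 1 Mbps per 720p stream quoted above; the function name `mesh_cost` is made up for illustration.

```python
def mesh_cost(participants: int, stream_mbps: float = 1.0) -> dict:
    """Estimate the cost of a full-mesh call where everyone streams to everyone."""
    peers = participants - 1  # every other client in the call
    return {
        # Each client uploads one copy of its feed per peer...
        "upload_mbps_per_client": peers * stream_mbps,
        # ...and downloads one feed from each peer.
        "download_mbps_per_client": peers * stream_mbps,
        # Directed edges in a complete graph on n nodes: n * (n - 1).
        "total_streams": participants * peers,
    }


for n in (2, 3, 5, 100):
    print(n, mesh_cost(n))
```

The per-client upload grows linearly with the call size, while the total number of streams in flight grows quadratically.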

Is there a better way?

How to Scale Up

Let’s first try to eliminate the issue of upload duplication. How about having each client upload its stream once to a server, and having the server (with enterprise-level network bandwidth) handle the fan-out to all the clients that need to receive the video?

Great, that fixes the upload issue, but now that everything is connected to a single server, every participant in an n-client call needs to pull the other (n-1) video streams from the server. That requires a lot of download bandwidth for each client and astronomical upload bandwidth for the server: n(n-1) streams, which grows as n².

How about compression then? Since a user would never view the full-resolution feeds of all n participants simultaneously, the server can combine the feeds into a single compressed stream and send that to each user. This is the approach taken by the MCU (Multipoint Control Unit).
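The same kind of rough estimate works for the single-relay design. This is a sketch that assumes about 1 Mbps per stream and a server that forwards every feed untouched; `relay_cost` is an illustrative name, not anything from Zoom.

```python
def relay_cost(participants: int, stream_mbps: float = 1.0) -> dict:
    """Estimate bandwidth when one server forwards every feed to every other client."""
    peers = participants - 1
    return {
        # Each client now uploads its feed exactly once.
        "client_upload_mbps": stream_mbps,
        # But it still has to download one feed per peer.
        "client_download_mbps": peers * stream_mbps,
        # The server receives one feed per client...
        "server_ingress_mbps": participants * stream_mbps,
        # ...and sends n * (n - 1) feeds back out.
        "server_egress_mbps": participants * peers * stream_mbps,
    }


print(relay_cost(100))
```

For a 100-person call the client upload drops to a single stream, but the server has to push out 9,900 streams' worth of data.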

Even with a well-optimized MCU, however, you quickly hit a wall on how many participants the server can handle. An MCU requires the server to process multiple feeds into a single feed (not to mention that layout switching also burdens the server), and it has to do so fast enough to maintain the real-time nature of video conferencing. A single server hosting the call will therefore struggle as the participant count grows: the more participants join, the faster the server’s available resources deplete, and the usual result is that call quality suffers and server latency increases.
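To get a feel for the mixing work, here is a small sketch of a server compositing every feed into one grid. The 1280x720 output canvas, the near-square grid layout, and the function names are all assumptions made for illustration, not a description of any real MCU.

```python
import math


def tile_size(participants: int, width: int = 1280, height: int = 720) -> tuple:
    """Size of each tile when all feeds share one near-square grid."""
    cols = math.ceil(math.sqrt(participants))
    rows = math.ceil(participants / cols)
    return width // cols, height // rows


def mcu_load(participants: int, layouts: int = 1) -> dict:
    """Rough count of the work a mixing server does per call."""
    return {
        "client_download_streams": 1,   # every client receives one mixed feed
        "server_decodes": participants,  # every incoming feed must be decoded
        "server_encodes": layouts,       # one re-encode per distinct layout
    }


print(tile_size(4), tile_size(100))
print(mcu_load(100, layouts=3))
```

The client side becomes cheap (one stream down), but the decode count on the server grows with every participant, and every extra layout adds another real-time encode.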

If we hit a wall on scaling up (adding hardware to a single server), we are left with scaling out (adding more servers) and adopting a distributed architecture. And this is where Zoom comes in.

How to Scale Way… Way Up.

Zoom recognized this constraint in the mainstream video conferencing solutions of the time when it was founded in 2011, and decided to design a fully distributed video conferencing architecture. Not only does this mean

  1. More participants per meeting
  2. More meetings
  3. More robust connections

It also lets Zoom lean on commodity cloud infrastructure (Oracle Cloud, AWS) to relieve server constraints as it grows rapidly. So how do they do it?

Beyond MCU

Zoom chose its own technology, called the Multimedia Router. Essentially, it is a distributed version of an SFU (Selective Forwarding Unit): Zoom takes one uplink from each client and sends multiple downlinks to each client. However, instead of putting all the bandwidth pressure on a single machine, Zoom uses its fleet of servers to optimize the transmission of streams between any two clients, potentially using some shortest-path algorithm.
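Since the routing details are not public, the following is only a guess at what "some shortest-path algorithm" could look like: plain Dijkstra over a made-up graph of relay servers, weighted by link latency in milliseconds. The relay names and latencies are hypothetical.

```python
import heapq


def shortest_path(graph: dict, source: str, target: str):
    """Dijkstra over a dict-of-dicts graph; returns (path, total_latency_ms)."""
    dist = {source: 0.0}
    prev = {}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, latency in graph.get(node, {}).items():
            candidate = d + latency
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical topology: two clients and three relay sites.
RELAYS = {
    "alice": {"sfo": 10},
    "sfo": {"alice": 10, "nyc": 70, "fra": 170},
    "nyc": {"sfo": 70, "fra": 80},
    "fra": {"sfo": 170, "nyc": 80, "bob": 12},
    "bob": {"fra": 12},
}

print(shortest_path(RELAYS, "alice", "bob"))
```

In this toy graph the route through `nyc` beats the direct `sfo` to `fra` link, which is the kind of decision a fleet of cooperating servers can make and a single server cannot.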

Beyond Compression

But what about the download bandwidth problem for clients, you ask? If Zoom compresses the feeds, its servers have to take on additional processing; if not, the client cannot download all the feeds in a large meeting.

As a compromise, Zoom offloads the bulk of encoding and decoding to the clients rather than its servers. While this means that your machine does run (very) hot during a Zoom meeting, it frees Zoom’s servers from that duty so that they can focus on routing streams.

Here is another innovation Zoom introduced → MBE (multi-bitrate encoding): a single feed whose quality, and therefore the bandwidth it requires, adjusts on demand to what the viewer needs. MBE addresses both the efficient use of bandwidth across different layouts and users with poor internet connections:

  1. A stream in a gallery view will automatically take up less bandwidth than a spotlighted stream.
  2. Similarly, the quality of the stream scales down automatically when a poor connection is detected.
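The two rules above can be sketched as a simple layer picker. The bitrate ladder, the view names, and the caps below are invented for illustration; Zoom's actual MBE parameters are not given in the sources.

```python
# Hypothetical quality ladder, lowest to highest: (label, bitrate in kbps).
LADDER = [("180p", 150), ("360p", 500), ("720p", 1200)]

# Highest ladder index worth sending for each layout (assumed values).
VIEW_CAP = {"gallery": 0, "spotlight": 2}


def pick_layer(view: str, available_kbps: int) -> str:
    """Pick the best layer the layout needs and the connection can carry."""
    cap = VIEW_CAP[view]
    # Try the layers allowed for this view from best to worst.
    for label, kbps in reversed(LADDER[: cap + 1]):
        if kbps <= available_kbps:
            return label
    # Nothing fits: fall back to the smallest layer.
    return LADDER[0][0]


print(pick_layer("gallery", 5000))    # small tile, so the lowest layer is enough
print(pick_layer("spotlight", 5000))  # big tile on a good connection
print(pick_layer("spotlight", 700))   # big tile on a weak connection
```

The layout sets the ceiling and the connection decides how close to that ceiling the stream gets.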

Additionally, Zoom has a dedicated mechanism that monitors its users’ connections and adjusts the quality of streams on demand. Even if the connection is truly hopeless, Zoom prioritizes audio latency over video quality, so that real-time communication can continue. (Check out the 150ms rule: https://www.protocol.com/zoom-videoconferencing-history-profit)

Of course, there are also bells and whistles like optimized compression and other quality-of-service optimizations, which round out Zoom’s superior video experience.
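One simple way to picture "audio first" is a bandwidth budget where audio is paid before video gets anything. The 64 kbps audio figure and the 150 kbps minimum for usable video are assumptions for this sketch, not numbers from Zoom.

```python
def allocate(available_kbps: int, audio_kbps: int = 64, min_video_kbps: int = 150) -> dict:
    """Split a bandwidth budget so that audio is always served first."""
    audio = min(audio_kbps, available_kbps)
    leftover = available_kbps - audio
    # Video only gets bandwidth if the remainder can carry a usable picture.
    video = leftover if leftover >= min_video_kbps else 0
    return {"audio_kbps": audio, "video_kbps": video}


for budget in (1000, 100, 40):
    print(budget, allocate(budget))
```

On a healthy connection video takes almost everything, but as the budget shrinks the video is dropped entirely while the audio keeps flowing.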

Then the Pandemic Comes

As the graph shows, even before COVID had reached most of the world, Zoom was approaching a leadership position in the video conferencing segment.

Who are Zoom’s competitors? They are legacy incumbents like Skype and Cisco Webex, which paid little attention to user experience and operated on hardware infrastructure designed for the last decade. Instead of jumping onto another technology S-curve, they had spent years improving their top line by selling physical video conferencing setups to corporate clients.

While some realized the challenge from Zoom and went on to provide more customer-centric features, such as free group video calls (Skype, 2014) and one-click group calls (Skype, 2020), Zoom’s architectural advantage truly showed its power once COVID hit.

No longer are video conferences reserved for meetings in Fortune 500 boardrooms; no longer are video calls just casual chats between a handful of friends. This is the new world where all activities have moved online.

Both the scale (number of participants) and the volume (number of meetings) of video calls skyrocketed. As a student, I witnessed firsthand how BlueJeans’ call quality deteriorated as more students joined the classroom (MCU), while in other classes where Zoom was used, the call held up even as the class size grew beyond 100 people.

Most Zoom competitors simply weren’t designed for this new reality. Before they realized it, all that was left in the competitors’ conference rooms was, “Hey, do you want to switch to Zoom?”

And the rest is history.

Wrapping Up

Distributed architecture is king for large-scale applications. Zoom made the right bet on the scale of video conferencing today.

Because of Zoom’s distributed nature and the acceptance of generic hardware, it was able to quickly expand to cloud vendors to offload the pressure on its own clusters and provide stable service to users all over the world.

Additionally, Zoom’s flexible, low requirements for connection quality brought large conference calls out of dedicated video conferencing rooms, making video calling more accessible and more widely used than ever.

Even today, Zoom still holds an edge over its competitors with its offering of up to 1000 interactive participants in a meeting. The closest competitor, MS Teams, offers 300 interactive participants, a stark difference of more than 3x.

References

https://testrtc.com/different-multiparty-video-conferencing/ **

https://blog.zoom.us/zoom-can-provide-increase-industry-leading-video-capacity/

https://trueconf.com/blog/wiki/multipoint-control-unit

https://explore.zoom.us/docs/doc/Zoom%20Connection%20Process%20Whitepaper.pdf

https://trueconf.com/blog/wiki/sfu

http://highscalability.com/blog/2020/5/14/a-short-on-how-zoom-works.html

https://support.zoom.us/hc/en-us/articles/201363113-Meeting-connector-core-concepts

https://www.lavivienpost.com/how-zoom-works/

https://www.nextplatform.com/2021/08/02/thought-experiment-how-did-zooms-infrastructure-keep-us-connected/ **

https://explore.zoom.us/docs/doc/Zoom_Global_Infrastructure.pdf ***
