How to Integrate Audio and Video Calls into Mobile Apps

build video call application

WebRTC allows real-time, peer-to-peer, media exchange between two devices. A connection is established through a discovery and negotiation process called signaling. This tutorial will guide you through building a two-way video call.

WebRTC is a fully peer-to-peer technology for the real-time exchange of audio, video, and data, with one central caveat. A form of discovery and media format negotiation must take place, as discussed elsewhere, in order for two devices on different networks to locate one another.

This process is called signaling and involves both devices connecting to a third, mutually agreed-upon server. Through this third server, the two devices can locate one another, and exchange negotiation messages.

Things you need to know

  1. Signaling
  2. SDP – Session Description Protocol
  3. ICE Candidates
  4. STUN & TURN
  5. Peer Connection

1. Signaling using signal server

Establishing a WebRTC connection between two devices requires the use of a signaling server to resolve how to connect them over the internet. A signaling server’s job is to serve as an intermediary to let two peers find and establish a connection while minimizing exposure of potentially private information as much as possible.

How do we create this server and how does the signaling process actually work?

First we need the signaling server itself. WebRTC doesn’t specify a transport mechanism for the signaling information. 

You can use anything you like, from WebSocket to XMLHttpRequest to carrier pigeons to exchange the signaling information between the two peers.

It’s important to note that the server doesn’t need to understand or interpret the signaling data content. Although it’s SDP, even this doesn’t matter so much: the content of the message going through the signaling server is, in effect, a black box. 

What does matter is when the ICE subsystem instructs you to send signaling data to the other peer, you do so, and the other peer knows how to receive this information and deliver it to its own ICE subsystem. All you have to do is channel the information back and forth. The contents don’t matter at all to the signaling server.

2. Session Description Protocol

The signaling mechanism (methods, protocols, etc.) is not specified by WebRTC. We need to build it ourselves. (Although this seems to be a complicated task, believe us — it is not. In this series, we will use Socket.IO for signaling, but there are many alternatives).

WebRTC only requires the exchange of the media metadata mentioned above between peers as offers and answers. Offers and answers are communicated in Session Description Protocol (SDP)

format which looks like the following:-

v=0
o=- 7614219274584779017 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE audio video
a=msid-semantic: WMS
m=audio 1 RTP/SAVPF 111 103 104 0 8 107 106 105 13 126
c=IN IP4 0.0.0.0
...

If you are wondering what each line means in the above format, don’t worry. WebRTC creates this automatically according to the audio/video device present on your laptop/PC.

3. ICE Candidates

As well as exchanging information about the media (discussed above in Offer/Answer and SDP), peers must exchange information about the network connection. This is known as an ICE candidate and details the available methods the peer is able to communicate (directly or through a TURN server). 

Typically, each peer will propose its best candidates first, making their way down the line toward their worse candidates. 

Ideally, candidates are UDP (since it’s faster, and media streams are able to recover from interruptions relatively easily), but the ICE standard does allow TCP candidates as well.

4. STUN & TURN

STUN(Session Traversal Utilities for NAT(Network Address Translator)) server: returns the IP address, port, and connectivity status of a networked device behind a NAT.

The STUN server  simply provides a meeting space for computers to exchange contact information. Once the information is exchanged, a connection is established between the peer computers and then the STUN server is left out of the rest of the conversation.

Connections established via STUN servers are the most ideal and cost-effective type of WebRTC communication. There’s hardly any running cost incurred by the users. 

Unfortunately, the connection may fail to establish for some users due to the type of NAT device each peer is using.

In such a situation, the ICE protocol requires you to provide a fallback, which is a different type of signaling server known as a TURN server.

TURN(Traversal Using Relays around NAT) server: a protocol that enables devices to receive and send data from behind a NAT or firewall.

A TURN server is an extension of the STUN server. Where it differs from its predecessor is that it handles the entire communication session.

Basically, after establishing a connection between the peers, it receives streams from one peer and relays it to the other, and vice versa. 

This type of communication is more expensive and the host has to pay for the processing and bandwidth load required to operate a TURN server.

5. Peer Connection

The RTCPeerConnection interface represents a WebRTC connection between the local computer and a remote peer. 

It provides methods to connect to a remote peer, maintain and monitor the connection, and close the connection once it’s no longer needed.

EventTarget  ← RTCPeerConnection

Integration Workflow

Step 1: Exchanging session descriptions

When starting the signaling process, an offer is created by the user initiating the call. This offer includes a session description, in SDP format, and needs to be delivered to the receiving user, which we’ll call the callee. The caller responds to the offer with an answer message, also containing an SDP description. 

Our signaling server will use WebSocket to transmit offer messages with the type “video-offer”, and answer messages with the type “video-answer”.

singalling session

Step 2: Exchanging ICE candidates

Two peers need to exchange ICE candidates to negotiate the actual connection between them. Every ICE candidate describes a method that the sending peer is able to use to communicate. Each peer sends candidates in the order they’re discovered, and keeps sending candidates until it runs out of suggestions, even if media has already started streaming.

An icecandidate event is sent to the RTCPeerConnection to complete the process of adding a local description using pc.setLocalDescription(offer).

Once the two peers agree upon a mutually-compatible candidate, that candidate’s SDP is used by each peer to construct and open a connection, through which media then begins to flow. If they later agree on a better (usually higher-performance) candidate, the stream may change formats as needed.

Though not currently supported, a candidate received after media is already flowing could theoretically also be used to downgrade to a lower-bandwidth connection if needed.

Each ICE candidate is sent to the other peer by sending a JSON message of type “new-ice-candidate” over the signaling server to the remote peer.

stun server

Making a call

The invite() function can be invoked as the calle clicks on another user whom they are trying to connect for the list of users they have.

var mediaConstraints = {
  audio: true, // We want an audio track
  video: true // ...and we want a video track
};

function invite(evt) {
  if (myPeerConnection) {
    alert("You are already in another call.");
  } else {
    var clickedUsername = evt.target.textContent;

    targetUsername = clickedUsername;
    createPeerConnection();

    navigator.mediaDevices.getUserMedia(mediaConstraints)
    .then(function(localStream) {
      document.getElementById("local_video").srcObject = localStream;
      localStream.getTracks().forEach(track => myPeerConnection.addTrack(track, localStream));
    })
    .catch(handleGetUserMediaError);
  }
}

Creating the peer connection

The createPeerConnection() function is used by both the caller and the callee to construct their RTCPeerConnection objects, their respective ends of the WebRTC connection. It’s invoked by invite() when the caller tries to start a call, and by handleVideoOfferMsg() when the callee receives an offer message from the caller.

function createPeerConnection() {
  myPeerConnection = new RTCPeerConnection({
      iceServers: [     // Information about ICE servers - Use your own!
        {
          urls: "stun:stun.stunprotocol.org"
        }
      ]
  });

  myPeerConnection.onicecandidate = handleICECandidateEvent;
  myPeerConnection.ontrack = handleTrackEvent;
  myPeerConnection.onnegotiationneeded = handleNegotiationNeededEvent;
  myPeerConnection.onremovetrack = handleRemoveTrackEvent;
  myPeerConnection.oniceconnectionstatechange = handleICEConnectionStateChangeEvent;
  myPeerConnection.onicegatheringstatechange = handleICEGatheringStateChangeEvent;
  myPeerConnection.onsignalingstatechange = handleSignalingStateChangeEvent;
}

When using the RTCPeerConnection() constructor, we will specify an object providing configuration parameters for the connection. We use only one of these in this example: iceServers. This is an array of objects describing STUN and/or TURN servers for the ICE layer to use when attempting to establish a route between the caller and the callee. These servers are used to determine the best route and protocols to use when communicating between the peers, even if they’re behind a firewall or using NAT.

Starting negotiation

Once the caller has created its RTCPeerConnection, created a media stream, and added its tracks to the connection as shown in Starting a call, the browser will deliver a negotiationneeded event to the RTCPeerConnection to indicate that it’s ready to begin negotiation with the other peer.

Here’s our code for handling the negotiation needed event:

function handleNegotiationNeededEvent() {
  myPeerConnection.createOffer().then(function(offer) {
    return myPeerConnection.setLocalDescription(offer);
  })
  .then(function() {
    sendToServer({
      name: myUsername,
      target: targetUsername,
      type: "video-offer",
      sdp: myPeerConnection.localDescription
    });
  })
  .catch(reportError);
}

To start the negotiation process, we need to create and send an SDP offer to the peer we want to connect to. This offer includes a list of supported configurations for the connection, including information about the media stream we’ve added to the connection locally (that is, the video we want to send to the other end of the call), and any ICE candidates gathered by the ICE layer already.

Session negotiation

Now that we’ve started negotiation with the other peer and have transmitted an offer, let’s look at what happens on the callee’s side of the connection for a while. The callee receives the offer and calls handleVideoOfferMsg() function to process it. Let’s see how the callee handles the “video-offer” message.

Receiving Incoming call

When the offer arrives, the callee’s handleVideoOfferMsg() function is called with the “video-offer” message that was received. This function needs to do two things. First, it needs to create its own RTCPeerConnection and add the tracks containing the audio and video from its microphone and webcam to that. Second, it needs to process the received offer, constructing and sending its answer.

function handleVideoOfferMsg(msg) {
  var localStream = null;

  targetUsername = msg.name;
  createPeerConnection();

  var desc = new RTCSessionDescription(msg.sdp);

  myPeerConnection.setRemoteDescription(desc).then(function () {
    return navigator.mediaDevices.getUserMedia(mediaConstraints);
  })
  .then(function(stream) {
    localStream = stream;
    document.getElementById("local_video").srcObject = localStream;

    localStream.getTracks().forEach(track => myPeerConnection.addTrack(track, localStream));
  })
  .then(function() {
    return myPeerConnection.createAnswer();
  })
  .then(function(answer) {
    return myPeerConnection.setLocalDescription(answer);
  })
  .then(function() {
    var msg = {
      name: myUsername,
      target: targetUsername,
      type: "video-answer",
      sdp: myPeerConnection.localDescription
    };

    sendToServer(msg);
  })
  .catch(handleGetUserMediaError);
}

This code is very similar to what we did in the invite() function back in Starting a call. It starts by creating and configuring an RTCPeerConnection using our createPeerConnection() function. Then it takes the SDP offer from the received “video-offer” message and uses it to create a new RTCSessionDescription object representing the caller’s session description.

ICE candidates exchange

Sending ICE Candidates

The ICE negotiation process involves each peer sending candidates to the other, repeatedly, until it runs out of potential ways it can support the RTCPeerConnection’s media transport needs. Since ICE doesn’t know about your signaling server, your code handles transmission of each candidate in your handler for the icecandidate event.

Your onicecandidate handler receives an event whose candidate property is the SDP describing the candidate (or is null to indicate that the ICE layer has run out of potential configurations to suggest). The contents of candidate are what you need to transmit using your signaling server.

Here’s our example’s implementation:

function handleICECandidateEvent(event) {
  if (event.candidate) {
    sendToServer({
      type: "new-ice-candidate",
      target: targetUsername,
      candidate: event.candidate
    });
  }
}

This builds an object containing the candidate, then sends it to the other peer using the sendToServer() function previously described in Sending messages to the signaling server.

Receiving ICE candidates

The signaling server delivers each ICE candidate to the destination peer using whatever method it chooses; in our example this is as JSON objects, with a type property containing the string “new-ice-candidate”. Our handleNewICECandidateMsg() function is called by our main WebSocket incoming message code to handle these messages:

function handleNewICECandidateMsg(msg) {
  var candidate = new RTCIceCandidate(msg.candidate);

  myPeerConnection.addIceCandidate(candidate)
    .catch(reportError);
}

This function constructs an RTCIceCandidate object by passing the received SDP into its constructor, then delivers the candidate to the ICE layer by passing it into myPeerConnection.addIceCandidate(). This hands the fresh ICE candidate to the local ICE layer, and finally, our role in the process of handling this candidate is complete.

Ramesh Raja
Ramesh Raja is the Head of Engineering at MirrorFly.He is a trend-spotter in programming and technology, with profound interest in Android, iOS, React, Node, Java Springboot, Docker and Kubernetes.