Voice and Video


Calls is a voice feature designed for developers and available via API (opens in a new tab) only.

You can use Calls to create any application to handle inbound and outbound calls. Your application implements the voice scenario logic and uses the Calls API to control and execute actions in calls and conferences, and does so via whichever voice channel (PSTN, webRTC, SIP) you choose.

Unlike Number Masking, Click-to-Call, Voice Message, or Interactive Voice Response, your application is not bound to any particular use case scenario: the Calls API is the building block here to help you build just about anything.

To use any of the following features with the Calls API, you need to have these activated for your account:

  • Recording
  • Automated machine detection (AMD)
  • Conferencing
  • Media streaming
  • SIP trunking

Contact your dedicated Account Manager to have these features activated.


Calls has four main concepts:

Application configuration

In this section, we guide you through the process of creating a Calls Configuration for applications that use our Calls API. This configuration is a vital part of integrating the diverse API methods and events that our platform offers.

Creating a Calls Configuration

A Calls Configuration is a required declaration for applications using the Calls API methods. It includes a unique identifier, the callsConfigurationId. This ID can be provided by the developer at the time of creation or, if not specified, will be generated by the system. Developers also have the option to assign a descriptive name to their Calls Configuration, aiding in easier management. To create a Calls Configuration, start by declaring your application (opens in a new tab) for the Calls API. If you have a specific ID in mind, specify it; otherwise, the system will assign one automatically. Adding a descriptive name is also recommended but optional.

Event Subscription Association

Each new Calls Configuration needs an associated event subscription. This subscription outlines which events from the Calls API your application will receive. The choice of events is crucial as it determines how your application interacts with and responds to the API. This section guides you through the steps to create an event subscription and link it to your Calls Configuration.

  • Create a Subscription: the first step is to create at least one event subscription:
    1. Specify the Channel Type: When setting up your event subscription, specify the channel type as VOICE_VIDEO. This defines the nature of the events your application will handle.
    2. Define the Profile: The profile section of your subscription should include details of your application's webhook. This includes the URL and security specifications of the webhook, ensuring secure and directed communication.
    3. List the Desired Events: In the events array of your subscription, list all the Calls API events you want to be sent to your webhook. These should be provided as a comma-separated list, encompassing all the events relevant to your application's functionalities. You can see the list of all possible events in the Calls API events documentation (opens in a new tab).
    4. Criteria Object Configuration: The final step is to specify the callsConfigurationId in the criteria object of your subscription. This links the event subscription directly to your Calls Configuration, ensuring that the right events are routed to your application.

Note that it is possible to create multiple subscriptions with the same callsConfigurationId criteria. This setup allows for more segmented and organized handling of events, enabling developers to direct specific event types to different webhooks. When configuring multiple subscriptions for the same callsConfigurationId, make sure there is no overlap in the event types listed across these subscriptions: each subscription should have a unique set of events.
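The four subscription steps above can be sketched as a small payload builder. Field names follow the terminology used in this section (channel, profile, events, criteria); the exact schema should be checked against the CPaaS X Subscriptions API reference before use.

```python
# Sketch of an event-subscription request body for a Calls Configuration.
# Field names mirror the steps described above; treat this as an
# illustration, not the authoritative API contract.

def build_subscription(calls_configuration_id, webhook_url, events):
    """Assemble a subscription payload linking Calls API events to a webhook."""
    return {
        "channel": "VOICE_VIDEO",                       # step 1: channel type
        "profile": {"url": webhook_url},                # step 2: webhook profile
        "events": list(events),                         # step 3: desired events
        "criteria": {"callsConfigurationId": calls_configuration_id},  # step 4
    }

# Multiple subscriptions may share the same callsConfigurationId,
# but their event sets must not overlap.
sub_received = build_subscription("my-config", "https://example.com/received",
                                  ["CALL_RECEIVED"])
sub_other = build_subscription("my-config", "https://example.com/events",
                               ["CALL_ESTABLISHED", "CALL_FINISHED", "CALL_FAILED"])
assert not set(sub_received["events"]) & set(sub_other["events"])
```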

For more information about subscription management, see the CPaaS X documentation.


Voice numbers

Inbound voice scenarios, where your application must answer or route incoming calls, require your account to own at least one Infobip voice number. This number must be associated with your application so Infobip knows where to route events related to these inbound calls.

You can link an Infobip voice number to one application only, but you can link several phone numbers to the same application.

Outbound voice scenarios, where your application initiates calls to PSTN, webRTC, or SIP destinations, may require your account to own at least one Infobip phone number.

You can specify this number in your application as a caller ID for new outbound calls. Note that this number does not need to be linked to your application, meaning that the same number could be displayed as callerID for outbound calls generated by different applications.

You can search for and lease available voice numbers both via API (opens in a new tab) and in the web interface (opens in a new tab).

To link an Infobip voice number to the application use API (opens in a new tab).

Calls API

Any action that you need your application to perform on and within calls and conferences is done via the REST API. The API will respond synchronously with an HTTP status code and payload to confirm the reception of the requested action.

There are several methods (opens in a new tab) available to:

  • Create calls and retrieve call details
  • Create conferences and retrieve conference details
  • Perform actions in calls and conferences
  • Manage audio files and recordings


Calls API events

Most actions performed in calls and conferences using the Calls API will trigger an event in your application to confirm the completion of the action's execution or to raise an error. New incoming calls to your application will also result in an event being sent to your application, including all call information (TO, FROM, and so on).

In the diagram below, the client application or platform is exposing two different webhooks, one to receive only CALL_RECEIVED events and the other to receive all other event types.

Voice and Video - Event in calls
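The two-webhook split shown above can be sketched as a simple dispatcher: CALL_RECEIVED events go to one handler and every other event type to another. The handler names here are illustrative, not part of the Calls API.

```python
# Minimal sketch of routing incoming Calls API events to the right webhook
# handler, matching the diagram above. Only the event "type" field is used;
# real payloads carry more data (callId, to, from, and so on).

def route_event(event):
    """Return which webhook (handler) an incoming event belongs to."""
    if event.get("type") == "CALL_RECEIVED":
        return "received_webhook"
    return "general_webhook"

assert route_event({"type": "CALL_RECEIVED", "callId": "abc"}) == "received_webhook"
assert route_event({"type": "CALL_FINISHED", "callId": "abc"}) == "general_webhook"
```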

Understanding Calls API

Calls, conferences, and dialogs

Every inbound connection to (TO) or outbound connection from (FROM) the Infobip platform is designated as a call leg.

In the remainder of this documentation as well as in our API documentation, we shall refer to a call leg as being a call.

Any voice or video application will always handle a minimum of one call. For instance, a voice messaging application that calls a user to deliver a voice message will create an outbound call, whereas an interactive voice response application responds to an inbound call.

Applications can connect multiple calls together, regardless if these are inbound or outbound, and regardless of the endpoint type (PHONE, WEBRTC, SIP, and so on) for each of these calls.

Calls can be connected in multiple ways:

  1. Conference: You can create a conference and add or remove participants using these conference methods (opens in a new tab) explicitly. Using conferences results in a potentially large number of events related to the status of the conference as well as the status of each participant. Conferences are limited to a maximum of 15 participants, regardless of the involved endpoint types. Conferences should best be used when you expect to connect more than two call legs or participants together, or plan to add and remove participants on the fly.
  2. Connect: You can use the connect methods (opens in a new tab) to quickly join two call legs in a conference without having to explicitly manipulate conference objects. Using the connect methods results in the implicit creation of conferences but simplifies the overall implementation. As an implicit conference is created when connecting two calls, it means that you can manipulate this conference to add and remove additional participants at any point in time.
  3. Dialog: A dialog allows you to connect two calls together but does not result in the implicit creation of a conference. As such, two calls connected over a dialog cannot ever be joined by additional participants and the overall connection flow will result in far fewer events than if you use connect methods to create a conference. Using Dialogs is the recommended method for scenarios where only two call legs (or participants) will ever need to be joined.

The other main differences between a dialog and connect/conference are:

  • Early media propagation: When connecting a new PHONE call to an existing call, the destination network might provide in-band tones or announcements that inform the caller of the call progress. When using dialog to connect calls, such early media, if any, will be propagated between the participants.
  • Media bypass: When connecting two calls over PHONE/PSTN and if both calls use the same codec, the Infobip platform only handles the signaling part of calls while media (RTP) flows directly between the connected endpoints, which means that media is going over the shortest path between endpoints with minimum latency. Note that the RTP flow will be re-captured by the Infobip platform if a recording of the dialog is requested or if actions such as DTMF collection are requested. Similarly, if a dialog is being recorded and that recording is stopped, then the Infobip platform will release the RTP media and switch back to media bypass.
  • Answering the inbound call: If your dialog scenario bridges an inbound call to a new outbound call, your application does not need to explicitly answer the inbound call, as it would with connect and conferences. When using dialog, if the new outbound call is answered, the Infobip platform will automatically answer the inbound call and bridge the two calls together.
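The decision between the three connection options above can be summarized in a small helper. The thresholds mirror this documentation (dialogs for exactly two fixed legs, conferences for dynamic or larger groups, 15 participants maximum); the helper itself is illustrative, not an API call.

```python
# Illustrative decision helper for choosing between dialog and
# connect/conference, based on the guidance in this section.

def choose_connection(expected_legs, participants_may_change):
    """Suggest a connection method for a given scenario."""
    if expected_legs == 2 and not participants_may_change:
        return "DIALOG"               # fewest events, early media, media bypass
    if expected_legs <= 15:
        return "CONNECT/CONFERENCE"   # implicit or explicit conference
    raise ValueError("conferences are limited to 15 participants")

assert choose_connection(2, False) == "DIALOG"
assert choose_connection(2, True) == "CONNECT/CONFERENCE"
assert choose_connection(10, False) == "CONNECT/CONFERENCE"
```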

Call states

Outbound call states

The following diagram shows various states for an outbound call and the events that represent these state transitions.

Voice and Video - Outbound call states diagram

Early dialog states such as PRE_ESTABLISHED and RINGING, including their respective events, are optional and may occur in any order, depending on telco operators' implementations. Calls to WebRTC and SIP never go through the PRE_ESTABLISHED state.

CALLING: The call creation request has been accepted and queued for processing.
PRE_ESTABLISHED: The call is in the early media state.
RINGING: The called destination is ringing.
ESTABLISHED: The call is connected and established; the connection is active.

Outbound calls will always end up in one of the following final call states:

FINISHED: The call which was previously active is completed and hung up.
BUSY: The call could not be completed as we received a busy signal from the destination.
NO ANSWER: The destination did not answer before the connectTimeOut parameter value was reached.
CANCELLED: The outbound call was canceled prior to being answered or prior to reaching the connectTimeOut parameter value.
FAILED: The call could not be established to the destination.

For these final call states, the preceding CALL_FINISHED or CALL_FAILED events will include the reason for the hangup or call failure.
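The outbound state diagram described above can be modeled as a transition table. Because PRE_ESTABLISHED and RINGING are optional and operator-dependent, several paths lead to ESTABLISHED; this mapping is an interpretation of the diagram, not an official specification.

```python
# Sketch of the outbound call state machine from the section above.
# Any non-final state can fall into a final state at any point.

FINAL_STATES = {"FINISHED", "BUSY", "NO ANSWER", "CANCELLED", "FAILED"}

TRANSITIONS = {
    "CALLING": {"PRE_ESTABLISHED", "RINGING", "ESTABLISHED"} | FINAL_STATES,
    "PRE_ESTABLISHED": {"RINGING", "ESTABLISHED"} | FINAL_STATES,
    "RINGING": {"ESTABLISHED"} | FINAL_STATES,
    "ESTABLISHED": {"FINISHED", "FAILED"},
}

def is_valid(state, next_state):
    """True if the state machine above allows this transition."""
    return next_state in TRANSITIONS.get(state, set())

assert is_valid("CALLING", "RINGING")
assert is_valid("RINGING", "ESTABLISHED")
assert not is_valid("ESTABLISHED", "RINGING")   # no going back once live
```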

Inbound call states

The following diagram shows various states for an inbound call and the events that represent these state transitions.

Voice and Video - Inbound call states diagram
RINGING: A new inbound call is received by the platform and presented to your application.
PRE_ESTABLISHED: Your application has requested to pre-answer the call to handle early media.
ESTABLISHED: The call is connected and established; the connection is active.

Inbound calls will always end up in one of the following final call states:

FINISHED: The call which was previously active is completed and hung up.
BUSY: The call could not be completed as we received a busy signal from the destination.
NO ANSWER: The destination did not answer before the connectTimeOut parameter value was reached.
CANCELED: The inbound call was canceled prior to being answered or prior to reaching the connectTimeOut parameter value.
FAILED: The call could not be established to the destination.

For these final call states, the preceding CALL_FINISHED or CALL_FAILED events will include the reason for the hangup or call failure.

State of participants

When dealing with multiparty calls (whether one-on-one or true conferencing), your application can subscribe to multiple events to properly follow the state of participants:

  • PARTICIPANT_JOINING, PARTICIPANT_JOINED, PARTICIPANT_JOINED_FAILED, and PARTICIPANT_REMOVED - As their names suggest, these events will let your application know about the joining state of participants in a conference.
  • PARTICIPANT_MEDIA_CHANGE will tell your application when the media session of a participant has changed, such as camera or screen share being turned on/off (webRTC endpoint), microphone being turned on/off (webRTC endpoint) or participant being explicitly (un)muted.
  • PARTICIPANT_STARTED_TALKING and PARTICIPANT_STOPPED_TALKING - These events will let your application know when any participant, identified by the conferenceId and callId, starts or stops talking.


Endpoints

When creating calls or conferences, you need to designate the endpoint or a list of endpoints to connect to.

The Infobip platform supports the following types of endpoints:

  1. Phone: the PHONE endpoint type is always associated with a phoneNumber in E.164 (opens in a new tab) format. Note that numbers in E.164 format do not contain any leading "+" or "00".
  2. WebRTC: the WEBRTC endpoint type requires specifying at the very least the identity, a unique identifier designating the end user who will be called.
  3. SIP: applications can call users connected to your office PBX (on-premises or in the cloud) using the Session Initiation Protocol (SIP). Calling a SIP endpoint requires at least the username, host, and port. Before creating calls towards SIP endpoints, you need to define a SIP trunk between Infobip and your office PBX using the Calls API SIP trunk methods.
  4. VIBER: inbound calls coming from Viber users will be of type VIBER and include the MSISDN of the calling Viber user. See Viber Business Calls for more information about receiving calls from Viber users.

When placing calls toward PHONE and SIP endpoints, you can specify the callerID that will be displayed to the called party. The callerID should ideally be the voice number that you are leasing from Infobip.

When placing calls towards WEBRTC endpoints, you can specify a fromDisplayName which can be set to any alphanumeric string.
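The endpoint descriptions above can be sketched as small payload builders. Key names follow this section's terminology (phoneNumber, identity, username, host, port) but should be verified against the Calls API reference before use.

```python
# Hedged sketches of the endpoint objects described above; these build
# plain dicts, they do not call the API.

def phone_endpoint(phone_number):
    """PHONE endpoint: E.164 number without leading '+' or '00'."""
    assert not phone_number.startswith(("+", "00"))
    return {"type": "PHONE", "phoneNumber": phone_number}

def webrtc_endpoint(identity):
    """WEBRTC endpoint: identity uniquely designates the end user."""
    return {"type": "WEBRTC", "identity": identity}

def sip_endpoint(username, host, port=5060):
    """SIP endpoint: requires at least username, host, and port."""
    return {"type": "SIP", "username": username, "host": host, "port": port}

assert phone_endpoint("41793026727")["type"] == "PHONE"
assert sip_endpoint("alice", "pbx.example.com")["port"] == 5060
```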

Call flows

Inbound call flow

Before your application can receive events about inbound calls and answer these, you need to link it to your incoming number on the Infobip platform.

To link an Infobip DID number to an application, set up a voice action on your DID number:

  • Via API:
  • Via the web interface:
    • Go to the Numbers application (opens in a new tab) and select your number
    • Select the Voice tab
    • Create an inbound configuration where the Forward action is Forward to subscription and specify your callsConfigurationId

When a new inbound call arrives, the application will receive a CALL_RECEIVED event including the identification of the call (callId) as well as the to and from phone numbers. The caller will hear no ringing tone unless you decide to explicitly send it using the send-ringing method. Next, based on its own logic, your application can decide to reject the call, pre-answer it, or answer it. If it decides to accept the call, it will use the accept method.

After receiving a CALL_ESTABLISHED event, your application has the confirmation from the Infobip platform that the call is live and your application can go on with its next steps.

In case the application receives a CALL_FAILED or CALL_FINISHED event, it will inspect the payload of that event to retrieve more details about the call status and the cause of call termination or failure.

In the diagram below, the application exposes two different webhooks to receive Calls API events.

Voice and Video - Inbound call flow diagram
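The inbound decision described above (reject, pre-answer, or answer on CALL_RECEIVED) can be sketched as a handler. The allow-list and returned action names are illustrative; the real actions are the corresponding Calls API methods.

```python
# Sketch of reacting to a CALL_RECEIVED event: check the caller against an
# application-defined allow-list and decide what to do with the call.

def on_call_received(event, allowed_callers):
    """Return the action the application should take for this inbound call."""
    caller = event.get("from")
    if caller not in allowed_callers:
        return "reject"
    return "answer"   # could also be "pre-answer" to handle early media first

event = {"type": "CALL_RECEIVED", "callId": "c1",
         "from": "41793026727", "to": "41800112233"}
assert on_call_received(event, {"41793026727"}) == "answer"
assert on_call_received(event, set()) == "reject"
```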

Outbound call flow

When your application requests the Infobip platform to create a new outbound call, it will specify the type of endpoint to be called (PSTN, webRTC, SIP). The Infobip platform will return the identifier of this new call (callId), then send events with the status of that callId to your application's event webhook.

After receiving the CALL_ESTABLISHED event, your application has confirmation from the Infobip platform that the call is live and the application can proceed with the next steps. In case your application receives a CALL_FAILED or CALL_FINISHED event, it will inspect the payload of that event to retrieve more details about the call status and the cause of call termination or failure.

Voice and Video - Outbound call flow diagram

Connecting two calls with connect/conference

To connect multiple calls together so that end users can talk to each other, use the connect method or conference-related methods. The connect methods allow you to connect two existing calls together or connect an existing call to a new call. These methods implicitly use conferencing capabilities but remove the need for a developer to explicitly manipulate a conference object.

The call flow below depicts an application that starts by creating two calls, where each call will have its unique callId identifier. After receiving both events confirming the calls are live (CALL_ESTABLISHED events), the application connects the calls using the connect method, specifying the unique callId of both calls.

After receiving that request, the Infobip platform will:

  • Create a conference room and confirm it with a CONFERENCE_CREATED event
  • Add each of the specified calls as participants, confirmed by PARTICIPANT_JOINING and PARTICIPANT_JOINED events

Events always include the reference to the callId and conferenceId they are related to.

A developer can choose to have their application listen to all CONFERENCE_CREATED, PARTICIPANT_JOINING, and PARTICIPANT_JOINED events, or only to wait for the PARTICIPANT_JOINED event to confirm call bridging.

Voice and Video - Connecting two calls diagram
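The "wait for both legs, then connect" logic above can be sketched as follows. The connect request body shown here (a list of callIds) is an assumption based on this section's description; check the connect method reference for the exact schema.

```python
# Sketch of the connect flow above: only once both call legs have reported
# CALL_ESTABLISHED do we build the body for a single connect request.

def connect_when_ready(events, call_ids):
    """Return a connect request body once all legs are live, else None."""
    established = {e["callId"] for e in events if e["type"] == "CALL_ESTABLISHED"}
    if set(call_ids) <= established:
        return {"callIds": list(call_ids)}   # illustrative connect body
    return None

events = [
    {"type": "CALL_ESTABLISHED", "callId": "leg-a"},
    {"type": "CALL_ESTABLISHED", "callId": "leg-b"},
]
assert connect_when_ready(events, ["leg-a", "leg-b"]) == {"callIds": ["leg-a", "leg-b"]}
assert connect_when_ready(events[:1], ["leg-a", "leg-b"]) is None
```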

Connecting two calls with dialog

If you plan to connect only two calls together and not have the possibility to manipulate participant states (add/remove participants), you might prefer to use our dialog method, as explained earlier.

The call flow below shows an application that starts by creating one outbound call to End user A. After receiving the event confirming the call to End user A is live (CALL_ESTABLISHED event), the application connects End user A to End user B using the dialog method, specifying the callId of End user A's call and the endpoint data to connect to End user B. As we can see in this example, the scenario results in a simplified event flow between the Infobip platform and the customer's application.

Voice - Connecting two calls with dialog

Conference flow

With Calls API conferencing, your application can add a maximum of 15 participants in the same conference room. Conferences support multiple endpoints simultaneously, meaning that participants from the same conference can join via phone (PSTN), webRTC (with or without video), and SIP.

There are multiple ways to add participants to a conference:

  • Existing (live) calls can be moved into a conference
  • New outbound calls can be started and immediately moved to an existing conference using a single API method

In the conference flow below, we are adding existing calls to a conference and assume that the calls to end user A and end user B are already live. The conference first needs to be created, and it is confirmed by a CONFERENCE_CREATED event including a unique conferenceId. Both conferenceId and unique callId from end user A and end user B's calls are needed to bring in the participants.

Voice and Video - Conference call flow diagram

When the last participant leaves the conference, this automatically ends the conference (CONFERENCE_FINISHED event). A closed conference cannot be reopened. You can create a new conference with the same name, however, it will contain a new unique conferenceId.

Transfer calls between applications

As we have explained previously, a call always belongs to an application so the Calls API platform knows to which webhook it needs to send events related to that call's status or to actions executed on that call. Calls API makes it possible to change the application ownership of a call using the application transfer methods.

For example, let's assume you have one application that implements an IVR scenario, and another application that represents your home-grown call center platform. An inbound call to your IVR application will be owned by it, but following the end-user's choices in the IVR scenario that call has to be transferred from your IVR to your call center. In this scenario, your IVR application will request an application transfer of that call toward your call center application.

Your call center application will receive this transfer request as an incoming event (APPLICATION_TRANSFER_REQUESTED) and either accept or reject that transfer using the corresponding API method. The requesting application (IVR) will receive events confirming that final status (APPLICATION_TRANSFER_FAILED or APPLICATION_TRANSFER_FINISHED).


Text-to-speech

Your application can use the say method to perform text-to-speech actions in any call managed by that application. Infobip supports more than 100 languages and accents.

Refer to this text-to-speech table when defining your say request payload, where:

  • language-code is the two-letter abbreviation of your chosen language
  • voiceGender is MALE or FEMALE
  • voiceName is the name of the voice
      "text": "text that should be spoken",
      "language": "en",
      "speechRate": 1.0,
      "loopCount": 1,
      "preferences" : {
        "voiceGender": "FEMALE",
        "voiceName": "Joanna"
      "stopOn": {
        "type": "DTMF",
        "terminator": "#"

The Infobip platform will send a SAY_FINISHED event to your application's event webhook:

  • When the complete text has been transformed to the chosen voice and played in the call, or
  • When the payload of the say method includes the stopOn clause and the end user presses a key (DTMF) on their phone while the speech synthesis is playing. In this case, the SAY_FINISHED event will include the DTMF input in its payload.

Note that capturing DTMF during the say method is limited to one DTMF input only. If the terminator is set to "any", any DTMF that the end user presses on their phone will be shown in the SAY_FINISHED event. If the terminator is set to # and the end user presses 1# on their phone, only the # will be shown in the SAY_FINISHED event.

If you need your application to capture a longer DTMF input, use the captureDTMF method.
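Reading the single captured digit back out of the SAY_FINISHED event can be sketched as below. The property name capturedDtmf mirrors the PLAY_FINISHED example later on this page and is assumed to apply here as well.

```python
# Sketch of extracting the DTMF digit from a SAY_FINISHED event when the
# say request included a stopOn clause. Returns None for other event types
# or when no digit was pressed.

def dtmf_from_say_finished(event):
    """Return the captured DTMF digit from a SAY_FINISHED event, if any."""
    if event.get("type") != "SAY_FINISHED":
        return None
    return event.get("properties", {}).get("capturedDtmf")

evt = {"type": "SAY_FINISHED", "properties": {"capturedDtmf": "#"}}
assert dtmf_from_say_finished(evt) == "#"
assert dtmf_from_say_finished({"type": "CALL_FINISHED"}) is None
```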


Speech to text

Speech-to-text technology is available under the Calls API platform with two different approaches:

  • Capture speech: intended for short duration interactions, such as when building a voice-based IVR or chatbot
  • Transcription: intended for long duration interactions or typically for transcribing complete calls

Whether you choose capture or transcription, these operations can only be executed on single call legs. In case you wish to get the transcription of a conference call or dialog with multiple call legs, transcription would need to be started separately on every call leg participating in this conference or dialog.

Speech capture

The speech capture action on call legs transforms spoken words to text in real time and is aimed at short interaction types, typically a few seconds in length, such as user interactions in IVR or voice bot scenarios. You must always specify the language in which words are being spoken. See this table as a reference for all currently supported languages.

With the combination of the timeout and maxSilence parameters, you can control how long a speech capture action waits for user input, as well as the total amount of silence (in seconds) after which the interaction is considered closed.
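A capture-speech request using these parameters might be sketched as below. The units and exact field names (timeout, maxSilence, keyPhrases) are assumptions drawn from this section's wording; confirm them against the API reference.

```python
# Illustrative capture-speech request body built from the parameters
# described above. This builds a dict only; it does not call the API.

def build_speech_capture(language, timeout, max_silence, key_phrases=()):
    """Assemble a speech capture payload for a single call leg."""
    return {
        "language": language,        # always required
        "timeout": timeout,          # how long to wait for user input overall
        "maxSilence": max_silence,   # silence (seconds) that closes the interaction
        "keyPhrases": list(key_phrases),
    }

req = build_speech_capture("en", timeout=15, max_silence=3, key_phrases=["support"])
assert req["maxSilence"] == 3
assert req["keyPhrases"] == ["support"]
```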

The outcome of speech capture is reported in the SPEECH_CAPTURED event and includes:

  • The full text of the captured speech
  • The key-phrase that was matched during speech capture, if any such key phrase was defined
  • The confidence level of the speech recognition results
  • The reason why speech capture was terminated.

Speech transcription

Call transcriptions are not limited in time, except by the call duration itself (that is, when a call ends or when a call is moved to a conference). Transcription is started and stopped via API methods. Note that your application must have a subscription that includes the event type TRANSCRIPTION.

When starting the transcription of a call, you have the choice to receive both INTERIM and COMPLETE transcripts, or only the COMPLETE ones.



INTERIM: These transcriptions are produced swiftly using a blend of syllables, individual words, and short phrases to interpret spoken language. They are presented in real time, appearing as the words are spoken, providing immediate but less precise results compared to the COMPLETE transcription.
COMPLETE: Refers to the more accurate and complete output generated by the speech recognition engine after it has processed the entire phrase or sentence. Unlike interim results, final results are produced after considering the full context of the spoken content, thereby offering higher accuracy. This makes them suitable for applications where precision is paramount, though they are less useful for real-time feedback due to the processing delay.

Play audio files

Your application can play audio files at any point during individual calls or conferences. When a file is played during a conference, all participants will hear it.

Audio files can be retrieved from a URL at playtime or can be uploaded first to Infobip servers. To play an audio file from Infobip servers, you must first upload that file (.wav or .mp3) using the POST /calls/1/file/upload method. The upload action will return a fileId that you will need to specify in your play action.

Mind that when playing an audio file from a URL, the first playback of that file might start with a slight delay due to the time it may take to download your file to Infobip servers. Subsequent playbacks will not have this delay as the file will have already been cached.

While you can define a loopCount (number of times the file will be played) for playing files both in calls and conferences, playing files in calls offers additional controls such as:

  • timeout: The duration, in milliseconds, of the file to be played. If the timeout is not defined, the file will be played until its end.
  • offset: The starting point, in milliseconds, from which the file will be played. If the offset is not defined, the file will be played from the beginning.

Both timeout and offset apply to the first time an audio file is played. If you specify any value for these two parameters while specifying a loopCount higher than 1, subsequent loops of your file will play from the beginning until the end of that file.
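A play request combining these controls might look as follows. The field names (fileId, loopCount, timeout, offset) are taken from this section's descriptions but should be checked against the play method reference.

```python
# Sketch of a play request for an uploaded file, combining loopCount with
# the call-only timeout and offset controls (both in milliseconds, applying
# to the first loop only). This builds a dict; it does not call the API.

def build_play(file_id, loop_count=1, timeout_ms=None, offset_ms=None):
    """Assemble a play payload for an audio file stored on Infobip servers."""
    body = {"fileId": file_id, "loopCount": loop_count}
    if timeout_ms is not None:
        body["timeout"] = timeout_ms   # play at most this long (first loop)
    if offset_ms is not None:
        body["offset"] = offset_ms     # start point within the file (first loop)
    return body

req = build_play("file-123", loop_count=2, offset_ms=5000)
assert req["offset"] == 5000 and "timeout" not in req
```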

A PLAY_FINISHED event is always generated:

  • When the audio file has finished playing entirely (including loopCount, offset and timeout effects).
  • When your application explicitly requests for the audio file playback to stop playing.

The playback of audio files in individual calls can be interrupted at any time by the end user pressing any DTMF key, provided you set the optional stopOn parameter in the POST /calls/1/call/:id/play API call.

In this case, the PLAY_FINISHED event will include, in its property attributes, the indication that the file was not played in full (playedCompletely:false) as well as the DTMF sent by the end user (capturedDtmf:1).

Note that capturing DTMF during the play method is limited to one DTMF input only. If the terminator is set to "any", any DTMF that the end user presses on their phone will be shown in the PLAY_FINISHED event.

If the terminator is set to # and the end user presses 1# on their phone, only the # will be shown in the PLAY_FINISHED event. If you need your application to capture a longer DTMF input, use the captureDTMF method.

    {"conferenceId": null,"callId": "945261b4-0bae-4ff3-9b1d-10485d2dbee8","timestamp": "2022-04-15T15:34:23.884Z","applicationId": "62273b76cc295c00d67f99c3","properties": {	"duration": 14336,	"playedCompletely": false,	"capturedDtmf": "12#"},"type": "PLAY_FINISHED"

Audio file playback is stopped when a call is moved into a conference.

Capture and send DTMF

To interact with users or remote systems via DTMF (Dual-Tone Multi-Frequency), use the related capture (opens in a new tab) and send (opens in a new tab) methods.

There are multiple ways to collect DTMF input from a user:

  1. Explicitly, during playback of text-to-speech or audio files. Read about the usage of the stopOn parameter in the above sections about using text to speech or playing audio files. In this case, the maximum length of the DTMF collection is 1 digit. In this scenario, the collected DTMF will be returned in the corresponding PLAY_FINISHED or SAY_FINISHED events.
  2. Explicitly, using the capture DTMF (opens in a new tab) method. You can collect DTMF input of any size, and optionally define a terminating character. If you only define the maxLength parameter, the platform will wait until the user has entered an input of that size or the timeout is reached. When setting the terminator parameter, the platform might return a user input shorter than the defined maxLength, if that terminator character was entered by the end user. In this scenario, the collected DTMF input will be returned in a DTMF_COLLECTED event.
  3. Unsolicited: the end user is entering DTMF inputs while you have no pending capture DTMF (opens in a new tab) nor ongoing Say or Play action with stopOn defined. In this scenario, the platform will send a DTMF_COLLECTED event for each individual DTMF sent by the user.

End users can send DTMF inputs which include the following characters only: 0-9, w, W.
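The maxLength/terminator rules from option 2 above can be modeled as a toy function: capture ends when maxLength digits have arrived or the terminator is pressed, whichever comes first. This mimics the documented behavior only; the platform does the real collection.

```python
# Toy model of the capture-DTMF termination rules described above.
# The terminator character itself is not included in the result.

def collect_dtmf(digits, max_length, terminator=None):
    """Simulate DTMF collection over a sequence of pressed digits."""
    collected = ""
    for d in digits:
        if d == terminator:
            return collected            # terminator ends the capture
        collected += d
        if len(collected) == max_length:
            return collected            # maxLength reached
    return collected                    # timeout case: whatever was entered

assert collect_dtmf("1234", max_length=3) == "123"
assert collect_dtmf("12#9", max_length=5, terminator="#") == "12"
```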

Automated machine detection

When creating a new outbound call to a PSTN destination, you can request automated machine detection on this call by setting the machineDetection parameter to true in the call request method. When the call is answered, Infobip sends a MACHINE_DETECTION_FINISHED event to your application's eventUrl webhook to show whether the call was answered by a machine or a human, or a MACHINE_DETECTION_FAILED event including the failure cause.

When enabling automated machine detection, set the messageDetectionTimeout parameter so that the system detects the end of the message announcement. This results in the sending of additional events: MACHINE_MESSAGE_DETECTION_FINISHED or MACHINE_MESSAGE_DETECTION_FAILED in case of failure.

Unlike our other voice APIs, such as voice messages or click-to-call, Infobip takes no specific action if a call is detected as being answered by a machine. After receiving the MACHINE_DETECTION_FINISHED and/or MACHINE_MESSAGE_DETECTION_FINISHED events, your application logic needs to determine how to proceed further with that call.
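Since Infobip takes no action on a machine detection result, the application decides what to do. A sketch of such a decision is below; the detection result field name and values are assumptions, and the returned action names are purely illustrative.

```python
# Sketch of application logic reacting to machine detection events.
# The "detectionResult" property name and MACHINE/HUMAN values are assumed.

def handle_amd_event(event):
    """Decide what to do with a call after machine detection completes."""
    if event["type"] == "MACHINE_DETECTION_FINISHED":
        result = event.get("properties", {}).get("detectionResult")
        if result == "MACHINE":
            return "play_voicemail_message"
        return "connect_agent"
    if event["type"] == "MACHINE_DETECTION_FAILED":
        return "hangup"
    return "ignore"

assert handle_amd_event({"type": "MACHINE_DETECTION_FINISHED",
                         "properties": {"detectionResult": "HUMAN"}}) == "connect_agent"
```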


Recording

Recordings are available for calls, conferences, and dialogs; these recording types are mutually exclusive.

Recording calls

You can record calls:

  • When a new call is created. Set the optional recording options in the call creation API method.
  • When a call is answered. Set the optional recording options in the call answer API method.
  • At any point during the call. Use the call recording API.

The recording of a call will end in one of these ways:

  1. When the call ends
  2. When the call joins a conference call
  3. When you use the stop recording API method, at any point after a recording has started

Always remember that conferences are used as soon as two calls are connected together.

Record conferences and dialogs

You can record conferences and dialogs using the relevant start recording API methods.

When you start a new recording, besides choosing whether only audio or both audio and video must be recorded, you can also choose whether the recording of all participants must be composed:

  • If you choose composition, all participants will be merged into a single audio or video file.
  • If you do not choose composition, all participants will have their own individual audio or video file.

The recording will end:

  • When the conference or dialog is terminated (hangup).
  • When you use the stop recording API method for conferences or dialogs.

On-demand recording composition

When you record conferences or dialogs without explicitly asking to compose the recording (i.e., record only one single file where all participant tracks are mixed), your recording will result in multiple audio or video files (one per participant for each Start/Stop recording action while the conference or dialog is active).

You can compose these individual files (opens in a new tab) at any point in time after they are recorded, as long as they are still available on Infobip storage. Infobip cannot compose individual files after they have been transferred to your SFTP server.

View and download recordings

To find and download a specific audio or video file via API:

  1. Retrieve the fileId using any of the GET recording methods (i.e., relative to calls, conferences or dialogs). You can search for recordings by callId, conferenceId, dialogId or retrieve all known recordings for all calls, conferences and dialogs.
  2. Use the GET /calls/1/recording/file/:file-id method to download a bytestream representation of your file. Audio files are always rendered in .wav format, and video files in .mp4 format.
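The two steps above can be sketched as a small helper that maps a fileId and format to the download endpoint and a local filename. The base URL is an assumption here; use your account's dedicated base URL:

```python
BASE_URL = "https://api.infobip.com"  # assumption: your dedicated base URL may differ

def recording_download_target(file_id, file_format):
    """Build the download URL and a local filename for a recording file.

    Audio files are rendered as .wav and video files as .mp4, matching
    the GET /calls/1/recording/file/:file-id method described above.
    """
    if file_format not in ("wav", "mp4"):
        raise ValueError("recordings are rendered as wav or mp4 only")
    url = f"{BASE_URL}/calls/1/recording/file/{file_id}"
    return url, f"{file_id}.{file_format}"
```

You would then issue an authenticated GET request against the returned URL and write the bytestream to the local filename.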

To find and download your recordings from the Infobip Portal:

  1. Go to the recording tab under the Voice channel application (opens in a new tab).
  2. Select Calls, Conferences, or Dialogs to see the list of your recordings.
  3. Expand a particular recording entry to find the list of related files, whether composed or non-composed. Files stored on our cloud storage can be downloaded, together with their related metadata json file.

Setting custom metadata on your recordings

When starting to record a call, conference, or dialog, you can optionally set a custom data json object in which you define any key-value pair(s) that help you save relevant contextual data for that recording, based on your use case. Because recording can be started and stopped multiple times during the existence of a call, conference, or dialog, and each recording action can have its own custom metadata, this custom data is reflected at file level when you retrieve your recordings. This custom data cannot be used as a query element when retrieving the list of call, conference, or dialog recordings.

Recording filename convention

The filename for recordings, whether for calls, conferences, or dialogs, and whether composed or non-composed, is always fileId.ext, where ext is wav or mp4 depending on whether you record audio only or video.

Transfer recordings via SFTP

If you prefer to have your recordings pushed to your server via SFTP, you can do so by defining your SFTP configuration from the Infobip UI (opens in a new tab). Files that are successfully transferred to your SFTP server are deleted from Infobip storage but remain referenced when retrieving the list of all your Calls API recordings.

By default, the naming convention for Calls API recordings that will be pushed to your SFTP server is: fileId.zip. The zip file includes both the media file (wav or mp4) and a corresponding json file with the metadata related to that recording. The files in the zip archive are named by the fileId.

When using Calls API recording methods, you can influence the name of the resulting zip file that will be pushed to your server by specifying the optional filePrefix parameter on the relevant start recording method. If you specify this parameter but do not use SFTP, it has no effect. If you have an active SFTP configuration and set filePrefix to "myCustomName", the zip file name will always be myCustomName.zip. Ensure you use unique prefixes when using this feature to avoid zip archives being overwritten when pushed to your server.
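The naming rules above can be captured in a small helper, useful for predicting which archive name to expect on your server. This is a sketch of the documented behavior, not an Infobip API call:

```python
def sftp_zip_name(file_id, file_prefix=None, sftp_active=True):
    """Predict the name of the zip archive pushed to your SFTP server.

    By default the archive is fileId.zip; with an active SFTP
    configuration and filePrefix set, it becomes <prefix>.zip.
    Returns None when SFTP is not configured (nothing is pushed).
    """
    if not sftp_active:
        return None  # filePrefix has no effect without SFTP
    if file_prefix:
        return f"{file_prefix}.zip"
    return f"{file_id}.zip"
```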

Media streaming

The Calls API allows you to stream outbound call media from your application to an arbitrary host using a websocket. Currently, only audio streaming is supported.
Typically, you would use this feature when:

  • Streaming audio to a transcription or sentiment analysis service (streaming without media replacement).
  • Streaming audio to an external audio filtering service and reinjecting the transformed audio into the conference to which the call belongs (streaming with media replacement).

Audio streaming is configured on a per-call leg basis. Before you initiate the stream, you need to create at least one new media stream configuration. Then use the configuration ID within a call to start/stop streaming media. Media is streamed as a series of raw bytes.

Streaming without media replacement

Consider two participants speaking with each other over a conference bridge, where the audio of participant A must be routed to an external transcription service while participant B must still hear participant A as is. Streaming without media replacement simply streams (forks) the outbound media to another listener.

Calls API - Streaming media without media replacement

Streaming with media replacement

Now, let's use the same example as above, where the external host's role is to offer services such as audio filtering (voice changer, profanity filter, and so on). In this case, the modified audio is injected into the conference and this is the audio that all participants of this conference will hear.

Calls API - Streaming media with media replacement

Set up audio streaming

First, create a media stream configuration object. Within this object, specify the URL of the websocket host, as well as the authorization required to access it (if any):

      {
        "url": "wss://example-host.com:3002",
        "securityConfig": {
          "type": "BASIC",
          "username": "example-username",
          "password": "example-password"
        }
      }

Both ws and wss are supported. The response contains the newly created MediaStreamConfig object, including its ID:

      {
        "id": "63493678a2863268520c0038",
        "url": "wss://example-host.com:3002",
        "securityConfig": {
          "type": "BASIC",
          "username": "example-username",
          "password": "example-password"
        }
      }

To start streaming media during a call, create a start-media-stream request. Within the request, provide the ID of the previously created configuration and specify whether the host will replace the media:

    {
      "mediaStream": {
        "audioProperties": {
          "mediaStreamConfigId": "63493678a2863268520c0038",
          "replaceMedia": true
        }
      }
    }

If everything was successful, the first message that your host will receive is:

    {
      "callId": "callIdPlaceHolder",
      "sampleRate": 48000,
      "packetizationTime": 20,
      "customData": {
        "message": "customDataPlaceHolder"
      }
    }

This message contains the following fields:

  • callId - The corresponding callId. Useful when your host might be dealing with multiple calls.
  • sampleRate - The sampling rate of the audio being streamed. Expressed in hertz (Hz). Default is 48000 (48 kHz).
  • packetizationTime - The time elapsed between two successive packages being sent to your host. Expressed in units of milliseconds (ms). Default is 20ms.
  • customData - In development (not yet fully supported).

Parsing incoming audio streams

After sending the initial message upon establishing a connection, the Infobip platform continues to send audio packets to your host. Packets are sent every packetizationTime milliseconds (the value returned in that field). The packets contain only pure audio.
For 48kHz sampled audio, 20ms of audio contains:

  • number_of_samples = 48kHz x 20ms = 960 samples

Audio is streamed raw, meaning each audio sample is encoded as a 16-bit signed integer, which is represented as a 2-byte value. This means every incoming message should ideally contain 1920 bytes (960x2). However, if there are any network issues, it can happen that more than one packet is sent within a message. These packets are guaranteed to be multiples of 1920 bytes (3840, 5760, 7680, and so on).
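A minimal sketch of parsing these messages follows. It splits a (possibly clustered) message into 1920-byte frames and decodes each frame into 960 signed 16-bit samples; little-endian byte order is an assumption here, as the document does not specify it:

```python
import struct

FRAME_BYTES = 1920  # 48 kHz * 20 ms * 2 bytes per 16-bit sample

def split_frames(message: bytes):
    """Split an incoming websocket message into 20 ms PCM frames.

    Messages are guaranteed to be a multiple of 1920 bytes; a clustered
    message simply carries several consecutive frames.
    """
    if len(message) % FRAME_BYTES != 0:
        raise ValueError("unexpected message size")
    return [message[i:i + FRAME_BYTES]
            for i in range(0, len(message), FRAME_BYTES)]

def decode_samples(frame: bytes):
    """Decode one frame into 960 signed 16-bit samples (little-endian assumed)."""
    return struct.unpack(f"<{len(frame) // 2}h", frame)
```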

Media replacement

If your media stream request is configured to replace media, the Infobip platform expects you to send back packets of 1920 bytes. Note that even when network errors occur and multiple packets arrive as a single cluster, your host should always send back packets of 1920 bytes. These packets are injected into the call and distributed to the other participants. Therefore, when media replacement is active, you only need to send back a single stream of media; the Infobip platform delivers it to the other participants in the call.

If media replacement is not set, the Infobip platform ignores any incoming messages from the host.
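As a sketch of the framing rules above, the following toy "filter" halves the volume of the incoming audio and re-emits it frame by frame, so the host always sends back 1920-byte packets regardless of the input cluster size. Little-endian 16-bit samples are assumed:

```python
import struct

FRAME_BYTES = 1920  # one 20 ms frame at 48 kHz, 16-bit samples

def replacement_frames(message: bytes, gain=0.5):
    """Transform incoming audio and return it as 1920-byte frames.

    Toy example: attenuates the signal by `gain`. A real service would
    apply its own processing, but must keep the same 1920-byte framing.
    """
    n = len(message) // 2
    samples = struct.unpack(f"<{n}h", message)      # little-endian assumed
    processed = struct.pack(f"<{n}h", *(int(s * gain) for s in samples))
    return [processed[i:i + FRAME_BYTES]
            for i in range(0, len(processed), FRAME_BYTES)]
```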

Testing with our media streaming showcase

Infobip has created a simple media streaming server project you can download from Github (opens in a new tab) and run on your own computer, to help you test media streaming easily.

This project supports media streaming with media replacement only. The server implements a 300-3400 Hz bandpass (Butterworth) filter. The filter significantly attenuates frequencies outside the passband, so you can expect to hear lower quality audio than what you normally get from Infobip, and a nostalgic reminder of old telephony calls.

This showcase is available for both Java and Python and requires that you have:

  • An Infobip account with voice features enabled, including media streaming.
  • A defined Calls API application that already processes voice calls - a Calls API showcase (opens in a new tab) is also available on Github to help you get started.
  • A computer with Java or Python that can expose a websocket port to the public internet (use of ngrok (opens in a new tab) is advised).
  • A media streaming configuration created, pointing to the address exposed by ngrok on your computer.

Bulk calls

The bulk API methods allow you to create multiple calls with a single request and manage the scheduled bulks. Calls generated with the bulk methods support the same options as singular calls, such as automated machine detection, recording, and support of multiple endpoint types (phone, webRTC, SIP, or Viber).

Bulk calls support additional parameters such as:

  • Scheduling: When to start the call generation, and what the calling time windows for these calls are.
  • Validity period: For how long should the Infobip platform try to generate these calls. Use this parameter when defining calling time windows.
  • Call rate: The number of calls the platform should start per specified time unit (such as 15 calls per minute or 60 calls per hour).

You can bundle multiple bulks, each targeting multiple destinations with their own schedule and validity, in a single request.

You can pause, resume, cancel, or reschedule bulks. Each new call within a bulk will result in the same stream of event status updates as individually created calls (call_started, call_pre_established, call_ringing, and so on), giving your application full visibility and control over how each individual call needs to be handled.
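The parameters described above can be sketched as a request payload. The field names here (items, schedule, validityPeriod, callRate) are illustrative assumptions about the bulk request shape, not the exact schema; consult the API reference for the real one:

```python
def build_bulk_request(from_number, destinations, start_time, call_rate_per_minute):
    """Assemble an illustrative bulk call request payload.

    All field names are assumptions mirroring the parameters described
    above: scheduling, validity period, and call rate.
    """
    return {
        "items": [
            {
                "from": from_number,
                "destinations": [{"to": d} for d in destinations],
                "schedule": {"startTime": start_time},
                "validityPeriod": 720,  # how long (minutes) to keep trying
                "callRate": {"maxCalls": call_rate_per_minute,
                             "timeUnit": "MINUTES"},
            }
        ]
    }
```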

Using WebRTC with Calls

Although you can use Infobip webRTC SDKs independently from the Calls API, combining webRTC with Calls API comes with the following advantages:

  • Comprehensive and granular control of webRTC calls from your backend application
  • Easily join and remove webRTC calls to and from conferences
  • Mix webRTC calls into the same conferences with any other supported endpoints (phone, SIP)
  • Implement routing logic for inbound webRTC calls in your backend applications
  • Play text-to-speech, collect and send DTMF, or play audio files in webRTC calls

To use webRTC in Calls, you need to:


WebRTC's Dynamic Destination Resolving is deprecated, superseded by the Calls platform.

When you use webRTC together with the Calls API platform and need to make routing decisions for inbound webRTC calls, we strongly recommend that you do not implement the webRTC Dynamic Destination Resolving feature but instead leverage the Calls API platform and its associated event traffic.


Check errors

Our HTTP endpoints return standard HTTP status codes.

The CALL_FINISHED and CALL_FAILED events, sent to your eventUrl webhook, will include the hangup cause or failure reason in the errorCode object within the event's properties element.

The ERROR event, sent to your eventUrl webhook, includes error details specific to the failed action it relates to in the errorCode object within the event's properties element.

    {
      "conferenceId": null,
      "callId": "4828889f-b53e-48ae-821d-e72c9279db97",
      "timestamp": "2022-04-25T14:24:48.366Z",
      "applicationId": "62273b76f7295c00d67f84c3",
      "properties": {
        "errorCode": "FILE_CAN_NOT_BE_PLAYED",
        "errorDetails": "Playing file failed"
      },
      "type": "ERROR"
    }
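A webhook handler would typically route such events by errorCode. This sketch retries a failed playback and logs anything else; the returned tuples are placeholders for your application's real actions:

```python
def on_error_event(event):
    """Route an ERROR event to use-case-specific handling.

    Sketch only: the ("retry-play", ...) and ("log", ...) actions stand
    in for real follow-up logic such as re-issuing a play request.
    """
    code = event.get("properties", {}).get("errorCode")
    if code == "FILE_CAN_NOT_BE_PLAYED":
        return ("retry-play", event["callId"])
    return ("log", event.get("callId"))
```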

Check live calls and conferences

You can query the list of active (live) calls or information about a particular live call via API. Use the GET /calls/1/calls or GET /calls/1/calls/:id methods, respectively.

Similarly, you can query the list of active (live) conferences or information about a particular live conference via API. Use the GET /calls/1/conferences or GET /calls/1/conferences/:id methods, respectively.

Check historical logs and reports

From the web interface

You can retrieve the historical list of your calls by checking logs (opens in a new tab) and reports (opens in a new tab) in the Infobip web interface, under the Analyze module.


From the API

You can query the list of historical calls or information about a particular past call via API. Use the GET /calls/1/calls/history or GET /calls/1/calls/:id/history methods, respectively.

Data is available for a rolling period of two months. You can filter GET requests using query parameters, such as querying all past INBOUND calls with a state equal to FAILED between two specific dates.
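A filtered history query can be sketched by building the URL with standard query encoding. The parameter names (direction, state, startTime, endTime) and the base URL are assumptions for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.infobip.com"  # assumption: your dedicated base URL may differ

def history_url(direction=None, state=None, start=None, end=None):
    """Build a filtered call-history query URL.

    Only parameters that are set are included in the query string;
    the parameter names themselves are illustrative assumptions.
    """
    params = {"direction": direction, "state": state,
              "startTime": start, "endTime": end}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/calls/1/calls/history" + (f"?{query}" if query else "")
```

You would then issue an authenticated GET request against the returned URL, for example `history_url(direction="INBOUND", state="FAILED")` for all failed inbound calls.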
