What are the chances of success of audio social media?

As audio social media interfaces become more popular, we should be able to assess their strengths and weaknesses, and what we can do to make them successful.

A question was raised in one of the design groups: When using audiobooks, how do you locate a specific paragraph or page that you want to come back to?

When we talk about a physical book or even a digital book, this question hardly arises: we can use our visual memory to quickly scan and locate the text or image we are looking for.

The rise of audio apps is already here: voice assistants, audiobooks, audio articles, podcasts. More recently, social media platforms have introduced audio as well.

Audio as a technology of communication is nothing new. We already have telephones, radios, music players, and audiobooks. Audio as a medium is well established. Audiobooks, for example, were first recorded in the 1930s on vinyl records.

Photo by Nathana Rebouças on Unsplash

Audio social media is relatively new. Apps like Clubhouse and Stereo differ from previous audio technology in two main ways. Firstly, unlike audiobooks and podcasts, which are pre-recorded, conversations happen live, in real time. Secondly, and even more importantly, social media is interactive. We don’t just listen passively, but actually have the option to talk to other people and contribute to the conversation.

We know that social media is a huge commercial and social success. It has completely changed the way we communicate and consume information. But what about audio social media? What are the chances of audio social media becoming a viable commercial product?

Let’s examine the differences between visual conversation and audio conversation. We will do that by exploring different aspects of visual vs. audio media, and then checking which one is the winner in each aspect.

Psychological Aspects

Photo by Priscilla Du Preez on Unsplash

Human Connectivity

One of the main disadvantages of text communication is the lack of emotional nuances and simultaneous feedback. According to research, text-only communication is inherently dehumanizing. When communicating via text, it is quite easy to forget that there is a real person on the receiving end. Miscommunications are likely to develop simply due to misreadings or a wrong choice of words.
Audio, on the other hand, can vastly improve communication. Voice conveys a wide range of emotional expression: happiness, excitement, sarcasm, empathy, sadness, anger, laughter, disgust, concern, seriousness. Being aware of the emotional state of the person on the other side can reduce hostility and create empathy. In addition, research suggests that talking to others makes people feel better than they expect it will, while social media use has been linked to people feeling worse.
The winner in this category is: Audio

Reduce Cognitive Load

Reading requires focus and concentration. We have to both see and decode the text before our brains can process it. Listening requires focus as well, but less mental effort than reading. Speaking is also an easier and freer form of expression, and requires less cognitive effort than writing to formulate and express our thoughts. When we write, we choose our words more carefully and have the additional task of presenting those words in typed form.
The winner in this category is: Audio

Remember things later

There is a Chinese proverb: “I hear, and I forget; I see, and I remember.” Visuals have a better chance of being imprinted on our brains and, as a result, are more easily retrieved later.
The situation is different for audio. Researchers have found that when it comes to memory, we don’t remember things we hear nearly as well as things we see or touch. Some audio apps address this by letting listeners mark a point in time and come back to it later.
The winner in this category is: Visual
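
The time-marking feature mentioned above could look roughly like the sketch below. It is only an illustration, not how any particular app implements it: the names `Bookmark`, `BookmarkList`, and `as_timecode` are hypothetical, and the idea is simply that a listener attaches a short note to a position in the recording so it can be found again by keyword instead of by memory.

```python
from bisect import insort
from dataclasses import dataclass, field


@dataclass(order=True)
class Bookmark:
    seconds: float                     # position in the recording
    note: str = field(compare=False)   # the listener's own label


class BookmarkList:
    """Hypothetical store for a listener's bookmarks, kept sorted by time."""

    def __init__(self):
        self._marks = []

    def add(self, seconds, note):
        # insort keeps the list ordered by position in the recording
        insort(self._marks, Bookmark(seconds, note))

    def find(self, keyword):
        """Return bookmarks whose note contains the keyword (case-insensitive)."""
        key = keyword.lower()
        return [m for m in self._marks if key in m.note.lower()]


def as_timecode(seconds):
    """Format a position in seconds as H:MM:SS for display."""
    total = int(seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


marks = BookmarkList()
marks.add(754, "Hook Model explained")
marks.add(95, "Host introduces the guests")
for mark in marks.find("hook"):
    print(as_timecode(mark.seconds), mark.note)
```

A written note turns an audio moment into something searchable, which is exactly the kind of visual aid that audio memory lacks.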

Keep your Anonymity

In a text-based interface, one can stay completely anonymous, without exposing any intimate element of one’s personality. Our voices reveal many aspects of who we are, such as our accent, age, country of origin, and gender. When speaking on an audio app, those details are all exposed. This of course can have good or bad implications, depending on a wide variety of factors and preferences.
The winner in this category is: Tie


Photo by Sangga Rima Roman Selia on Unsplash

The hierarchy of information

Visual design contains a very wide and sophisticated range of information hierarchies: typography, titles, subtitles, paragraphs, illustrations, images, graphics, and more. Audio has a very primitive hierarchy: it is either voice, sound effects, or music.
The winner in this category is: Visual

Search and Scanning

Text is easy to scan and search. The fact that text is visually defined allows us to find the exact piece of information we are looking for. If we lose concentration, we can scan the titles quickly and jump forward or backward to find the last point we were reading. In audio, which is entirely linear, we need to listen to everything in order to find a specific segment, or remember the exact timecode. A good solution would be an audio search or a visual indication of certain parts or chapters.
The winner in this category is: Visual
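
As a rough sketch of what audio search and chapter indications might involve, the example below assumes the recording already comes with a timestamped transcript and a list of chapter start times. Both data structures, and the function names `search_transcript` and `chapter_at`, are assumptions made for illustration; real apps may work quite differently.

```python
from bisect import bisect_right


def search_transcript(segments, query):
    """Return the start times of transcript segments that mention the query.

    segments: list of (start_seconds, text) tuples (assumed format).
    """
    q = query.lower()
    return [start for start, text in segments if q in text.lower()]


def chapter_at(chapters, seconds):
    """Return the title of the chapter that contains the given position.

    chapters: list of (start_seconds, title) tuples, sorted by start time.
    """
    starts = [start for start, _ in chapters]
    index = bisect_right(starts, seconds) - 1
    return chapters[index][1] if index >= 0 else None


segments = [
    (0.0, "Welcome to the show"),
    (42.5, "Let's talk about the Hook Model"),
    (130.0, "The hook model has four steps"),
]
chapters = [(0, "Intro"), (40, "Engagement"), (300, "Q&A")]

for start in search_transcript(segments, "hook model"):
    print(start, chapter_at(chapters, start))
```

With something like this in place, a listener could jump straight to the segments that mention a topic and see which chapter each one belongs to, instead of scrubbing through the whole recording.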

Saving the data for later

Visuals are naturally built to be saved and documented, unless they are purposely deleted. Audio can be recorded, but as mentioned above, the recording itself is not very easy to sort and search later. There are services like Otter.ai that provide transcripts of meetings, and they may become more widely used in the future.
The winner in this category is: Visual

Users’ Involvement

Photo by Austin Distel on Unsplash

Conversation Time Frame

Visual interfaces can be asynchronous. We can join the conversation at any point, leave, and come back later without losing the text, which persists whether or not we were watching as it was written.
I recently received a notification that someone had replied to a comment I made four years ago. I replied, and we continued the conversation as though four years had not passed. Audio conversation, by contrast, must happen in real time, meaning that listening and replying must be done at the same time.
Synchronicity is an advantage when everyone needs to get the information together at the same time, as well as to respond right away or make an immediate decision. It can also create a sense of urgency and exclusivity, but eventually, when it comes to social media, it restricts the discussion to a limited timeframe.
The winner in this category is: Visual


The Hook Model is a method for engaging users so that they buy products or use services habitually. This is done by implementing four steps known as trigger, action, variable reward, and investment.
Visual interfaces are highly engaging. Users can compose posts, upload images and videos, comment, and react. All these “investments” create a sense of belonging and ownership, which builds an emotional connection to the app and encourages user retention.
Audio interfaces are innately limited in the types of investments their users can make, and therefore provide a weaker hook for user engagement and retention.
The winner in this category is: Visual

Simultaneous participation

In a chat, many people can write and respond at the same time, even if their comments are presented in chronological order. In a voice chat, only one person can speak at a time if everyone is to listen and understand.
The winner in this category is: Visual

Technical Aspects

Photo by Francesco Ungaro on Unsplash

Use the app in noisy surroundings

Reading and writing can be done anywhere, even in noisy settings, since they can be done without disturbing others or being disturbed (the same goes for listening, provided one has decent headphones). Speaking, however, requires the speaker’s immediate surroundings to be quiet enough for the listener to hear them above the noise. For this reason, it is often difficult to speak in a public place, at home with screaming kids in the background, or in wind and rain.
The winner in this category is: Visual / Listening

Multitasking while using the app

It is possible to listen or talk while performing routine tasks such as driving, playing sports, and doing household chores. Those habitual actions require only procedural memory, meaning that we can focus our cognitive attention on other things. It is not possible, however, to read or write messages of any complexity while performing such tasks. A possible, yet partial, solution would be text-to-speech apps.
The winner in this category is: Audio

Using the app while our presence is required

We cannot listen or talk on online platforms while our presence or participation is required in our current setting, such as a meeting or a class. We can, however, quietly read and respond to text messages.
The winner in this category is: Visual

Creating content

Creating text of all kinds, as well as producing relatively simple images and videos, is easily done with most media. Recording audio, however, requires more investment in professional equipment. This is still the case, even though more and more tools for making high-quality audio are becoming accessible.
The winner in this category is: Tie

Can one medium contain the other one?

Visual interfaces can include audio player controls, but audio interfaces do not include visual UI. One minor exception is that we can ask voice assistants to, for example, “show the file on the screen”.
The winner in this category is: Visual
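
For readers who want to see the score at a glance, the verdicts above can be tallied with a few lines of Python. The category names are taken from the section headings; "User engagement (Hook Model)" is a label chosen here for the Hook Model comparison, which has no heading of its own.

```python
from collections import Counter

# Verdict for each category, as stated in the sections above.
winners = {
    "Human Connectivity": "Audio",
    "Reduce Cognitive Load": "Audio",
    "Remember things later": "Visual",
    "Keep your Anonymity": "Tie",
    "The hierarchy of information": "Visual",
    "Search and Scanning": "Visual",
    "Saving the data for later": "Visual",
    "Conversation Time Frame": "Visual",
    "User engagement (Hook Model)": "Visual",  # label assumed, see note above
    "Simultaneous participation": "Visual",
    "Use the app in noisy surroundings": "Visual / Listening",
    "Multitasking while using the app": "Audio",
    "Using the app while our presence is required": "Visual",
    "Creating content": "Tie",
    "Can one medium contain the other one?": "Visual",
}

tally = Counter(winners.values())
for outcome, count in tally.most_common():
    print(f"{outcome}: {count}")
```

A simple count treats every category as equally important, which is a simplification; it says nothing about how much each aspect matters to a given product.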


Now let’s summarize the benefits of each format, visual vs. audio:

Audio’s greatest strength is in creating strong human connections and enabling a wide range of emotional expression. Listening and speaking require less cognitive effort than reading and writing. If anonymity is important to you, bear in mind that voice reveals more about the speaker than text does, which can be good or bad depending on your preferences. Audio is also the best format for multitasking, since we can take part while performing routine tasks. And the very fact that the time frame is limited creates a sense of urgency and exclusivity.

However, as a social media platform, audio faces various challenges.

Just like sound waves, audio communications float invisibly in the air and then disappear without leaving a trace. One of the basic interactive features of social media is the ability to upload different types of content and to share and interact with them.
Visual platforms provide many ways to create a hierarchy of complex information such as author name, date, picture, group name, titles, text, paragraphs, images, and videos. Participants may read all the text, or just parts of it, search for the writer’s name or other metadata, skip to the comments, or locate a single paragraph from memory.
Audio itself does not provide all of that. It contains only a single linear channel of chronological information, so there is no way to scan a segment quickly to decide if it’s worth listening to.
Also, if you don’t remember the exact timecode, it is difficult to locate a specific passage. Ultimately, visual memory is much stronger than auditory memory, which makes written text much more memorable. All of this makes it harder to create a strong connection with users, because the engagement features available to audio are quite limited.

A visual interface seems like a much more natural candidate for a social media app. It presents a much richer and more sophisticated information hierarchy, offers plenty of options for searching and scanning content, and makes it easier to remember and retrieve the information later.
This is especially so when all of the content is saved and documented. Since it is asynchronous, it does not require everyone to be present in the conversation at the same time, and participants can join the conversation at different times.
Also, everyone can write their comments all at the same time, since they don’t need to wait for other participants to finish their turn in the communication. Most of all, visual platforms can make great use of the “hook model” to boost engagement and customer retention. This is commonly done by encouraging users to act and make investments such as uploading content and getting variable rewards in return, like reactions, comments, and new content.

Audio apps have the potential to be very useful and effective, as long as all their weaknesses are taken into consideration and their strengths are used correctly to serve the app’s usability and business goals.

Lead product designer @ Verizon XR labs