How Thomas Mol used Whisper, Diarization, and GPT-4 to build the ultimate interview SaaS

An interview with the creator of Audiogest, an accurate interview transcription tool


Welcome to the AIModels.fyi interview series, where we bring you conversations with innovators in the field of AI. Today, we have the pleasure of speaking with Thomas Mol, the creator of Audiogest. Thomas shares his journey of developing a powerful transcription and summarization tool that has gained popularity among students and professionals alike.

In this interview, we'll explore the inspiration behind Audiogest, its unique features, and the technology powering it. Let’s begin!

Thomas, what inspired you to create Audiogest and what is it all about?

I am a student, and a few months back I was conducting interviews for my thesis project. Of course, I had to provide transcripts as reference material in my thesis report. This was right around the time ChatGPT became popular, so I stumbled on another AI model by OpenAI called Whisper. It worked great for generating transcripts of my interviews, but there was no API for it back then. So I built a quick web app prototype and published it so other students could use it as well. After some tweaking and adding diarization, I actually gained a few paying users, something I did not expect at all!

How does Audiogest work? Can you explain the process of uploading an audio file and obtaining a transcription and summary?

I tried to make it as easy as possible. You can take any audio or video file and drag and drop it onto the website. Then you type a context prompt (this helps the AI recognize things like acronyms and names) and fill in the number of speakers. Then just click “Process” and the transcript should be ready in about 5 to 10 minutes. Once transcription is finished, you simply hit “Generate Summary” to create a summary!

What technologies and AI models are utilized in Audiogest? Can you tell us about the tech stack behind it?

I built Audiogest using SvelteKit, a fairly new full-stack JavaScript framework, which has been great to work with. Other than that, I use Prisma, Auth.js for authentication, Tailwind CSS for styling, and a Postgres database, all hosted on Railway. For the AI stack, I am currently hosting my own custom transcription pipeline on Replicate. This has been great for me as a startup, since I only pay per second that the model runs. Currently, the pipeline uses a variant of Whisper called faster-whisper, which I chain with a diarization model from pyannote. For the summary generation, I simply send the transcript to the GPT-4 API with a special prompt.
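Thomas didn't share his actual pipeline code, but based on his description, a minimal sketch of that chain might look something like the following. The model names, the context prompt, and the naive midpoint speaker-matching step are illustrative assumptions, not Audiogest's real implementation:

```python
# Rough sketch: faster-whisper transcription + pyannote diarization + GPT-4 summary.
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
from openai import OpenAI

AUDIO = "interview.wav"  # hypothetical input file

# 1. Transcribe with faster-whisper; initial_prompt plays the role of the
#    user-supplied context prompt (names, acronyms).
whisper = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, _info = whisper.transcribe(
    AUDIO, beam_size=5, initial_prompt="Thesis interview about Audiogest"
)
segments = list(segments)

# 2. Diarize with pyannote, constrained to the user-supplied speaker count.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="HF_TOKEN"
)
diarization = diarizer(AUDIO, num_speakers=2)

def speaker_at(t: float) -> str:
    """Return the speaker label whose turn contains time t (naive lookup)."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# 3. Merge: label each transcript segment with the speaker at its midpoint.
lines = [f"{speaker_at((s.start + s.end) / 2)}: {s.text.strip()}" for s in segments]
transcript = "\n".join(lines)

# 4. Summarize the labeled transcript with GPT-4.
client = OpenAI()
summary = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Summarize this interview transcript:\n\n" + transcript}],
).choices[0].message.content
print(summary)
```

In production this would presumably be packaged as a Replicate model (billed per second of runtime) rather than run locally, but the chaining idea is the same.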

What sets Audiogest apart from other transcription and summarization tools in terms of uniqueness?

Audiogest is one of the most accurate automatic transcription-with-diarization services out there, if not the most accurate. Users have reached out to tell me how happy they were with the accuracy, especially for languages other than English. The simplicity of the app also makes it a breeze to use.

How does Audiogest ensure the privacy and security of user data and processed information?

Privacy is important to many of my users, and I take that seriously. Uploaded files are stored temporarily (24 hours) in a storage bucket and are not used to train or fine-tune AI models. Also, OpenAI recently changed its API policy: it will not use data sent through the API for training or fine-tuning, and it retains that data for at most 30 days for abuse monitoring.

Can you explain the pricing structure and affordability of Audiogest? What options are available?

Instead of the commonly used subscription model, I opted for a prepaid pricing structure. Users purchase credits for generating transcripts, where 1 credit = 1 minute of audio transcription. There are no additional costs for generating summaries. I am still optimizing the exact prices, as it can be difficult to estimate the costs of running the AI models. Today, users can purchase 150 credits for $19, 480 credits for $49, and 1,200 credits for $99.
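To put that in perspective, a one-hour recording uses 60 credits, which at those prices works out to roughly $7.60 on the smallest pack and about $5 on the largest.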

How accurate are the transcriptions and summaries generated by Audiogest, and what steps have been taken to ensure high accuracy?

It’s difficult to measure exact accuracy because it varies between languages and accents, but I’d say it is upwards of 90% accurate. Summaries are still in the early stages, but I am planning to optimize them to increase their usefulness. To keep transcript accuracy high, I follow developments in AI research closely and try to use the latest models when I can; anything that improves accuracy or speed, I implement immediately. In the future I hope to fine-tune and optimize the models myself so I can deliver state-of-the-art AI transcription.

How quickly can users expect to receive their transcripts and summaries when using Audiogest?

Turnaround time for a transcript can be as fast as 5 minutes. There is a start-up time of a few minutes, and then processing takes roughly a quarter of the length of your audio. I have even seen 90-minute recordings transcribed in just 15 minutes total!

What languages, platforms, and file formats are supported by Audiogest?

Audiogest supports 99+ languages and many recording file formats, including .aac, .ogg, .opus, .m4a, .mp3, .mp4, .mpeg, .mpga, .wav, and .webm. These cover the file types most commonly produced by Zoom, Microsoft Teams, Google Meet, iPhone Voice Memos, WhatsApp, Signal, and more.

Besides Audiogest, what other projects are you currently working on as an app builder?

Once I had a few users on Audiogest, I wanted to get some insights and went looking for good analytics tools. However, I haven't yet found one where I could just provide my database connection URL and get instant, useful statistics and visualizations. So I have been prototyping a tool that does exactly that, but it is still very much in the early stages.

Where do you see the future of audio transcription and summarization heading? Are there any upcoming features or improvements planned for Audiogest?

I think there is still some room for improvement in the speed and accuracy of the transcripts. Diarization is also not perfect yet and needs some fine-tuning. What I am more excited about, though, is what is possible with the generated transcripts. Right now Audiogest offers simple summarization, but I am planning to add much more interesting and useful digest features. For example, I want the app to be able to detect common themes or topics across transcripts, which would be really useful for customer researchers. I’ve also been thinking of adding a Zapier integration to automate transcription for users even further. Lots of exciting things to come and try out!

How can interested users get started with Audiogest? Can you provide guidance on signing up and using the tool?

Users can go to https://audiogest.app/ and sign up there. You only need an email address! You also get 40 minutes of free audio transcription when you sign up, so you can try it out before deciding if you want to purchase credits.

Conclusion

That concludes our interview with Thomas, the visionary behind Audiogest. We've learned about the inspiration behind the tool and how it simplifies the process of audio transcription and summarization. If you're in need of accurate and efficient transcription services, be sure to check the project out! Thanks for reading, and stay tuned for more insightful interviews with trailblazers in the world of technology and AI!

Subscribe or follow me on Twitter for more content like this!