Speech AI

AI Transcription & Meeting Intelligence

Local speech-to-text with speaker diarization. Transcribe meetings, interviews, and calls with automatic speaker labelling — all processed on your hardware, nothing leaves your machine.

2 min for 30 minutes of audio
Auto speaker identification
10+ formats supported
0 data uploaded

How It Works

Drop in a recording — meeting, interview, call, video — and the AI extracts audio, transcribes it with OpenAI Whisper running locally, identifies who said what, and delivers a formatted, speaker-labelled transcript with optional AI-generated summary.

Diagram: transcription pipeline showing audio input, extraction, AI transcription with Whisper, speaker diarization, formatted transcript, and meeting summary
1

Audio Extraction

Feed in any audio or video format. The system extracts clean audio from MP4, WebM, M4A, or processes raw audio files directly.
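In practice this step is a thin wrapper around FFmpeg. A minimal sketch, assuming FFmpeg is installed and on the PATH (the file names are illustrative):

```python
import subprocess

def build_extract_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that drops the video stream and
    resamples to 16 kHz mono PCM, the input Whisper works best with."""
    return [
        "ffmpeg", "-y",          # overwrite output without prompting
        "-i", src,               # input file (audio or video)
        "-vn",                   # no video: keep only the audio track
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16 kHz
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        dst,
    ]

def extract_audio(src: str, dst: str) -> None:
    """Run FFmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(build_extract_cmd(src, dst), check=True)
```

For raw audio files (MP3, WAV, FLAC) the same command simply transcodes; for video containers (MP4, WebM) the `-vn` flag strips the video stream.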

2

Transcribe & Identify

OpenAI Whisper runs locally on GPU/Apple Silicon to transcribe speech. Speaker diarization identifies and labels each speaker.
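Combining the two outputs works like this: Whisper produces timed text segments, PyAnnote produces timed speaker turns, and each segment takes the label of the turn that contains its midpoint. A sketch of that merge step (the midpoint heuristic and the "UNKNOWN" fallback are assumptions for illustration, not part of either library):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn
    contains the segment's midpoint; fall back to 'UNKNOWN'.

    segments: [{"start": s, "end": e, "text": t}, ...]  (Whisper-style)
    turns:    [(start, end, "SPEAKER_00"), ...]         (diarization-style)
    """
    labelled = []
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = next(
            (who for start, end, who in turns if start <= mid < end),
            "UNKNOWN",
        )
        labelled.append({**seg, "speaker": speaker})
    return labelled
```

Midpoint matching tolerates the small boundary disagreements that are common between the two models, since a segment only needs its centre, not its edges, to fall inside a speaker turn.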

3

Summarise & Deliver

Get a timestamped, speaker-labelled transcript plus an optional AI summary with key decisions, action items, and follow-ups.
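The delivered transcript can be sketched as a small renderer over the speaker-labelled segments (the exact layout here is an assumption; any timestamped, speaker-labelled format works):

```python
def fmt_ts(seconds: float) -> str:
    """Render seconds as HH:MM:SS for transcript timestamps."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

def render_transcript(labelled_segments) -> str:
    """Produce a timestamped, speaker-labelled transcript, one
    utterance per line, ready for reading or summarisation."""
    return "\n".join(
        f"[{fmt_ts(seg['start'])}] {seg['speaker']}: {seg['text'].strip()}"
        for seg in labelled_segments
    )
```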

Technical Details

Stack: OpenAI Whisper · MLX (Apple Silicon) · PyAnnote · FFmpeg · Python · Local LLM Summary
Model & Processing Details

Speech Recognition: OpenAI Whisper large-v3 model running locally via MLX (Apple Silicon) or CUDA (NVIDIA GPU).

Speaker Diarization: PyAnnote audio pipeline for speaker segmentation and clustering. Automatically determines the number of speakers.

Audio Processing: FFmpeg for format conversion, noise reduction, and audio track extraction from video files.

Summary Generation: Local LLM (Llama 3 / Qwen) processes transcripts for meeting summaries and action item extraction.
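For the summary stage, the finished transcript is handed to the local LLM with an instruction prompt. A sketch of what that prompt might look like (the wording is illustrative, not the product's actual template, and works with any local runtime serving Llama 3 or Qwen):

```python
SUMMARY_PROMPT = """You are a meeting assistant. From the transcript below, produce:
1. A short summary (3-5 sentences).
2. Key decisions made.
3. Action items, each with an owner if one is named.

Transcript:
{transcript}
"""

def build_summary_prompt(transcript: str) -> str:
    """Fill the instruction template sent to the local LLM."""
    return SUMMARY_PROMPT.format(transcript=transcript)
```

Because the transcript already carries speaker labels and timestamps, the model can attribute decisions and action items to named participants without any extra context.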

Hardware Requirements

Apple Silicon: M1 Pro or higher recommended; M1/M2 Max offer the best performance, with unified memory for large models.

NVIDIA GPU: RTX 3060 (12GB VRAM) minimum. RTX 3090/4090 for batch processing and larger models.

RAM: 16GB minimum, 32GB+ recommended for concurrent transcription and summarisation.

Who This Is For

Legal Professionals

Client consultations, depositions, mediation recordings. Legally privileged content stays on your machine.

Healthcare

Patient consultations, clinical notes, specialist referral recordings. Privacy-compliant local processing.

Corporate Teams

Board meetings, strategy sessions, client calls. Searchable records with action item tracking.

Researchers

Interview transcription, focus groups, field recordings. Speaker-labelled output for qualitative analysis.

Frequently Asked Questions

How fast is the AI transcription?
30 minutes of audio is transcribed in approximately 2 minutes on Apple Silicon hardware, roughly 15× faster than real time. Processing speed scales with hardware; GPU-accelerated systems are faster still.
Can it identify different speakers in a meeting?
Yes. Speaker diarization automatically identifies and labels different speakers throughout the recording. The output shows who said what, with timestamps, making it easy to follow multi-person conversations, interviews, and panel discussions.
What audio and video formats are supported?
MP4, MP3, WAV, M4A, WebM, FLAC, OGG, and most common audio/video formats. For video files, the system automatically extracts the audio track. Zoom, Teams, and Google Meet recordings are fully supported.
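Routing a file to the right first step is a simple extension check: video containers go through audio extraction, raw audio goes straight to transcription. A sketch, with illustrative (not exhaustive) extension sets:

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
VIDEO_EXTS = {".mp4", ".webm"}

def needs_audio_extraction(path: str) -> bool:
    """True if the file is a video container whose audio track must
    be extracted first; False for raw audio files."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        return True
    if ext in AUDIO_EXTS:
        return False
    # Unknown containers could instead be handed to FFmpeg to probe.
    raise ValueError(f"unsupported format: {ext}")
```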
Does the audio data leave my computer?
No. The entire pipeline runs locally using OpenAI Whisper on your hardware. Audio files are processed on-device and never uploaded to any cloud service. This makes it suitable for legally privileged recordings, medical consultations, and confidential meetings.
Can it generate meeting summaries and action items?
Yes. After transcription, a local LLM can process the transcript to generate meeting summaries, extract action items, identify key decisions, and highlight follow-up tasks. All processing remains local.

Never Lose a Meeting Detail Again

AI transcription that runs on your hardware. Every word captured, every speaker identified, every action item tracked.