Copilot Studio and Speech-to-Text Model Integration
Introduction
Integrating a Speech-to-Text (STT) model into Copilot Studio allows chatbots to process voice inputs, transcribe speech into text, and respond accordingly. This is useful for:
✅ Voice-enabled chatbots in customer support.
✅ Hands-free interactions for accessibility.
✅ Transcribing audio messages for data processing.
Since Copilot Studio does not natively support Speech-to-Text, we integrate it with Azure Speech Services API and Power Automate to enable real-time speech processing.
Step 1: Understanding Speech-to-Text in Copilot Studio
1.1 What Can Speech-to-Text Do?
With Azure Speech Services API, Copilot Studio can:
✔ Convert spoken words into text.
✔ Support multiple languages and dialects.
✔ Handle different accents and add punctuation automatically.
✔ Process audio files or live voice inputs.
✔ Enable real-time voice interactions in chatbots.
1.2 Use Cases for Speech-to-Text in Copilot Studio
📌 Customer Support → Voice-based chatbots for call centers.
📌 Healthcare → Doctors can dictate patient notes.
📌 Retail & E-commerce → Customers can place orders using voice commands.
📌 Finance & Banking → Speech-to-text for fraud detection and documentation.
Step 2: Setting Up Speech-to-Text Integration
Since Copilot Studio does not natively process speech, we use Azure Speech Services and Power Automate to handle speech input.
2.1 Prerequisites
✅ Microsoft Azure Account
✅ Azure Speech Services API Key
✅ Copilot Studio Access (Power Virtual Agents)
✅ Power Automate License (for API integration)
2.2 Create an Azure Speech Services API
1️⃣ Go to Azure Portal.
2️⃣ Search for “Speech” in the Azure Marketplace.
3️⃣ Click Create and enter:
- Resource Group → Select/Create a group.
- Region → Choose a data center near you.
- Pricing Tier → Free or paid.
4️⃣ Click Review + Create → Wait for deployment.
5️⃣ Go to Resource → Copy the API Key and Endpoint URL.
✅ Azure Speech API is now ready!
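Before wiring anything into Power Automate, you can sanity-check the key and region with a short script. This is a minimal sketch, assuming placeholder values for `REGION` and `API_KEY`; it exchanges the subscription key for a short-lived access token at the standard `issueToken` endpoint:

```python
import urllib.request

# Placeholder values -- substitute your own resource's region and key.
REGION = "westus"
API_KEY = "<your-api-key>"

def token_endpoint(region: str) -> str:
    """Build the issueToken URL for a given Azure region."""
    return f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

def fetch_token(region: str, key: str) -> str:
    """POST the subscription key and return a short-lived access token."""
    req = urllib.request.Request(
        token_endpoint(region),
        data=b"",  # POST with an empty body
        headers={"Ocp-Apim-Subscription-Key": key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

If `fetch_token` returns a token instead of a 401 error, the key and region from step 5️⃣ are correct.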
2.3 Enable Voice Input in Copilot Studio
1️⃣ Log in to Copilot Studio (Power Virtual Agents).
2️⃣ Open your chatbot and navigate to Settings → Click Enable Voice Input.
3️⃣ Set the Accepted Audio Formats:
.wav, .mp3, .ogg, .flac.
4️⃣ Save the changes.
✅ Users can now send voice recordings via chatbot!
Step 3: Connecting Copilot Studio with Speech-to-Text API
We use Power Automate to send voice recordings to Azure Speech Services API and return transcribed text.
3.1 Create a Power Automate Flow
1️⃣ Open Power Automate → Click Create Flow.
2️⃣ Select Automated Flow → Name it “Convert Speech to Text”.
3️⃣ Choose Copilot Studio Trigger:
- Select “When a user uploads an audio file”.
3.2 Add an HTTP Request to Azure Speech API
1️⃣ Click + New Step → Choose “HTTP”.
2️⃣ Set the Method to POST.
3️⃣ Enter Request URL (the short-audio REST endpoint for your region):
https://<your-region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US
4️⃣ Click Headers → Add:
- Ocp-Apim-Subscription-Key → Paste your API Key.
- Content-Type → audio/wav; codecs=audio/pcm; samplerate=16000.
5️⃣ In Body, pass the audio content itself — the API expects raw binary audio, not a JSON wrapper. Select the uploaded file's content from Dynamic Content; if your trigger only provides a file URL (e.g. @{triggerOutputs()?['body/url']}), add a step that downloads the file content first.
✅ This sends the audio file to Azure Speech API for transcription!
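Outside Power Automate, the same HTTP call can be made directly, which is handy for debugging the API before building the flow. A minimal sketch, assuming a local sample.wav (16 kHz mono PCM) and placeholder region/key values:

```python
import urllib.request

REGION = "westus"           # placeholder -- your resource's region
API_KEY = "<your-api-key>"  # placeholder -- your resource's key

def recognize_url(region: str, language: str = "en-US") -> str:
    """Short-audio REST endpoint for the given region and language."""
    return (f"https://{region}.stt.speech.microsoft.com"
            f"/speech/recognition/conversation/cognitiveservices/v1"
            f"?language={language}")

def build_headers(key: str) -> dict:
    """Headers the API expects: key, audio content type, JSON response."""
    return {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }

def transcribe(path: str) -> str:
    """POST the raw audio bytes and return the response body as text."""
    with open(path, "rb") as f:
        audio = f.read()  # raw bytes in the request body, not JSON
    req = urllib.request.Request(
        recognize_url(REGION), data=audio,
        headers=build_headers(API_KEY), method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Note the request body is the audio bytes themselves; the JSON only appears in the response.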
3.3 Process the API Response
1️⃣ Click + New Step → Select “Parse JSON”.
2️⃣ Use Dynamic Content → Select the API Response Body.
3️⃣ Define JSON schema:
{
"type": "object",
"properties": {
"RecognitionStatus": { "type": "string" },
"DisplayText": { "type": "string" }
}
}
✅ This extracts the transcribed text from the API response!
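The Parse JSON step above corresponds to just a few lines of code. A sketch using a sample payload shaped like the schema, with a guard on RecognitionStatus so failed recognitions don't produce empty bot replies:

```python
import json

# Sample payload shaped like the Parse JSON schema above.
sample_response = """
{
  "RecognitionStatus": "Success",
  "DisplayText": "Hello, how are you?"
}
"""

def extract_text(raw: str) -> str:
    """Return DisplayText on success, empty string otherwise."""
    data = json.loads(raw)
    if data.get("RecognitionStatus") == "Success":
        return data.get("DisplayText", "")
    return ""
```

A RecognitionStatus of "Success" means the API recognized speech; other values (such as "NoMatch") mean there is no usable DisplayText.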
3.4 Send Results Back to Copilot Studio
1️⃣ Click + New Step → Select “Respond to Power Virtual Agents”.
2️⃣ Enter Dynamic Response Message:
"Transcription: @{body('Parse_JSON')?['DisplayText']}"
3️⃣ Save and Publish the Flow.
✅ Now, the chatbot can transcribe voice messages into text!
Step 4: Testing the Speech-to-Text Bot
1️⃣ Open Copilot Studio → Click Test Bot.
2️⃣ Upload an audio file (e.g., “Hello, how are you?”).
3️⃣ The chatbot should respond:
“Transcription: Hello, how are you?”
Step 5: Expanding Speech-to-Text Capabilities
5.1 Real-Time Speech Processing
To enable real-time transcription, modify the API request:
🔹 Use WebSocket Streaming API instead of batch processing.
🔹 Implement Azure Speech SDK for live voice input.
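As a rough illustration of the switch, the streaming protocol uses a wss:// endpoint on the same host instead of a one-shot POST. The sketch below only builds the connection URL — the host path and the per-connection id parameter reflect the public Speech WebSocket protocol, but verify against current docs; in practice the Azure Speech SDK handles the full protocol (audio chunking, turn messages) for you:

```python
import uuid
from urllib.parse import urlencode

def streaming_url(region: str, language: str = "en-US") -> str:
    """WebSocket endpoint for continuous recognition in a region."""
    query = urlencode({
        "language": language,
        # Unique id identifying this connection for tracing/diagnostics.
        "X-ConnectionId": uuid.uuid4().hex,
    })
    return (f"wss://{region}.stt.speech.microsoft.com"
            f"/speech/recognition/conversation/cognitiveservices/v1?{query}")
```

With the Speech SDK you never construct this URL yourself; you supply the key and region, and call continuous-recognition methods on a recognizer object.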
5.2 Multilingual Speech Recognition
Enable multiple languages by modifying the request:
🔹 Add the “language” query parameter to the request URL.
🔹 Example: language=es-ES for Spanish.
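Concretely, switching languages is a one-parameter change to the request URL. A sketch, assuming the current short-audio REST endpoint and BCP-47 language tags:

```python
from urllib.parse import urlencode

BASE = ("https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1")

def localized_url(region: str, language: str) -> str:
    """Append the BCP-47 language tag as a query parameter."""
    return BASE.format(region=region) + "?" + urlencode({"language": language})

# Spanish (Spain), German (Germany), Japanese:
for tag in ("es-ES", "de-DE", "ja-JP"):
    print(localized_url("westeurope", tag))
```

In the Power Automate flow from Step 3, this corresponds to editing the query string of the HTTP action's Request URL.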
5.3 Speaker Identification
To recognize who is speaking, integrate Azure Speaker Recognition API:
🔹 Detect and verify specific speakers in a conversation.
🔹 Enable personalized chatbot responses based on user identity.
Final Thoughts
🚀 Key Takeaways:
✔ Step 1: Set up Azure Speech Services API.
✔ Step 2: Enable voice input in Copilot Studio.
✔ Step 3: Use Power Automate to send audio to Azure Speech API.
✔ Step 4: Process API responses and return results in chatbot messages.
✔ Step 5: Expand features with real-time transcription, speaker recognition, and multilingual support.
