To set up your development environment for using the OpenAI API, please refer to
Lesson 1: Introduction to Multimodal AI. This lesson covers installing necessary libraries and configuring your environment.
You also need to install additional libraries for this project. Add the following code to your notebook:
# Install additional dependencies for this lesson
!pip install librosa
The librosa library is for handling audio files.
Like previous lessons, you need to authenticate your API requests, and the code for that is already included in the Jupyter notebook for this lesson:
# Load the OpenAI library
from openai import OpenAI
# Set up relevant environment variables
# Make sure OPENAI_API_KEY=... exists in .env
from dotenv import load_dotenv
load_dotenv()
# Create the OpenAI connection object
client = OpenAI()
OpenAI’s Whisper model is a powerful tool for speech recognition. First, you need to prepare the audio files. You can either record audio directly using your computer’s microphone or download free sample audio files from Pixabay.
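If you’d rather record your own clip, the following optional sketch captures a few seconds from the default microphone. It assumes the third-party sounddevice and soundfile packages, which aren’t part of this lesson’s setup, and the output path is just a placeholder. This lesson sticks with a downloaded sample, though.
# Optional: record your own audio instead of downloading a sample
# Assumes `pip install sounddevice soundfile` (not covered in this lesson)
import sounddevice as sd
import soundfile as sf

duration = 5          # seconds to record
sample_rate = 44100   # samples per second

# Record from the default microphone (mono) and wait until it finishes
recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()

# Save the clip so it can be sent to the Whisper API later (placeholder path)
sf.write("audio/my-recording.wav", recording, sample_rate)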
Add the following code to download and load an audio file using the librosa library:
# Download and load an audio file using librosa
# Import libraries
import requests
import io
import librosa
from IPython.display import Audio, display
# URL of the sample audio file
speech_download_link = "https://cdn.pixabay.com/download/audio/2022/03/10/audio_a8e603753c.mp3?filename=self-destruct-sequence-31505.mp3"
# Local path where the audio file will be saved
save_path = "audio/self-destruct-sequence.mp3"
# Download the audio file
response = requests.get(speech_download_link)
if response.status_code == 200:
    audio_data = io.BytesIO(response.content)
    # Save the audio file locally
    with open(save_path, 'wb') as file:
        file.write(response.content)
    # Load the audio file using librosa
    y, sr = librosa.load(audio_data)
    # Display the audio file so it can be played
    audio = Audio(data=y, rate=sr, autoplay=True)
    display(audio)
Finally, you create an audio player using the loaded audio data and display it, allowing you to play the audio directly in Jupyter Lab.
Next, extract the logic to play the audio file into a separate function, because you’ll use it multiple times:
# Function to play the audio file
def play_speech(file_path):
    # Load the audio file using librosa
    y, sr = librosa.load(file_path)
    # Create an Audio object for playback
    audio = Audio(data=y, rate=sr, autoplay=True)
    # Display the audio player
    display(audio)
Now, it’s time to transcribe the audio file using the Whisper model. Add the following code to your Jupyter Lab:
# Transcribe the audio file using the Whisper model
with open(save_path, "rb") as audio_file:
    # Transcribe the audio file using the Whisper model
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="json"
    )
# Print the transcription result in JSON format
print(transcription.json())
# Print only the transcribed text
print(transcription.text)
You can also get a more detailed transcription with timestamps for each word:
# Retrieve the detailed information with timestamps
with open(save_path, "rb") as audio_file:
    # Transcribe the audio file with word-level timestamps
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )
Then, you can look at the verbose JSON result.
# Print the detailed information for each word timestamp
import json
json_result = transcription.json()
print(json_result)
json_object = json.loads(json_result)
print(json_object["text"])
# Print the detailed information for words
# Print the detailed information for each word
print(transcription.words)
# Print the detailed information for the first two words
print(transcription.words[0])
print(transcription.words[1])
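Each entry in transcription.words pairs a word with its start and end time in seconds, so you can loop over the whole list. This short sketch assumes each item exposes word, start, and end attributes, matching the entries printed above.
# Print every word together with its start/end timestamps (in seconds)
for word in transcription.words:
    print(f"{word.word}: {word.start:.2f}s - {word.end:.2f}s")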
# Retrieve the detailed information with segment-level timestamps
with open(save_path, "rb") as audio_file:
    # Transcribe the audio file with segment-level timestamps
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"]
    )
To check the detailed information for the first two segments, use the following code:
# Print the detailed information for the first two segments
print(transcription.segments[0])
print(transcription.segments[1])
Now, load and play another audio file:
# Load & play kodeco-speech.mp3 audio file
# Path to another audio file
ai_programming_audio_path = "audio/kodeco-speech.mp3"
# Play the audio file
play_speech(ai_programming_audio_path)
# Transcribe the audio file with `text` response format
with open(ai_programming_audio_path, "rb") as audio_file:
    # Transcribe the audio file to text
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"
    )
# Print the transcribed text
print(transcription)
Notice that the transcription is not perfect. Kodeco and RayWenderlich are misspelled. You can guide the transcription process with the prompt parameter to improve accuracy.
# Transcribe the audio file with a prompt to improve accuracy
with open(ai_programming_audio_path, "rb") as audio_file:
    # Transcribe the audio file with a prompt to improve accuracy
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
        prompt="Kodeco,RayWenderlich"
    )
# Print the transcribed text
print(transcription)
Now, the transcription should be more accurate. The prompt parameter helps guide the transcription, making it particularly useful for correcting specific words or continuing a previous segment. In this case, the prompt ensures that names like Kodeco and RayWenderlich are transcribed correctly.
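To illustrate the continuation use case, here’s a hedged sketch that transcribes a recording split into two chunks and feeds the first transcript in as the prompt for the second. The part-1.mp3 and part-2.mp3 file names are hypothetical placeholders, not part of the course materials.
# Hypothetical example: keep context across a recording split into two chunks
with open("audio/part-1.mp3", "rb") as audio_file:
    first_part = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"
    )

with open("audio/part-2.mp3", "rb") as audio_file:
    # Pass the earlier transcript as the prompt so spelling and context carry over
    second_part = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
        prompt=first_part
    )

# Combine both transcripts
print(first_part + second_part)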
# Load & play japanese-speech.mp3 audio file
# The speech in Japanese: いらっしゃいませ。ラーメン屋へようこそ。何をご注文なさいますか?
# Path to the Japanese audio file
japanese_audio_path = "audio/japanese-speech.mp3"
# Play the Japanese audio file
play_speech(japanese_audio_path)
# Translate the Japanese audio to English text
with open(japanese_audio_path, "rb") as audio_file:
    # Translate the Japanese audio to English text
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"
    )
# Print the translated text
print(translation)
The translated text should be: “Welcome. Welcome to the ramen shop. What would you like to order?” The Whisper model can translate audio in any supported language into English text, making it a versatile tool for multilingual apps.
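If you want the original Japanese text rather than an English translation, you can still call the transcriptions endpoint on the same file. This is just a sketch contrasting the two endpoints; the optional language parameter is only a hint that the audio is Japanese.
# Transcribe (not translate) the same audio to get the Japanese text
with open(japanese_audio_path, "rb") as audio_file:
    japanese_transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
        language="ja"  # optional hint about the spoken language
    )

# Print the Japanese transcription
print(japanese_transcription)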
To create synthesized speech, you can use the client.audio.speech.with_streaming_response.create method with the relevant parameters, as shown below:
# Generate speech from text using OpenAI's TTS model
# Path to save the synthesized speech
speech_file_path = "audio/learn-ai.mp3"
# Generate speech from text using OpenAI's TTS model
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Would you like to learn AI programming? We have many AI programming courses that you can choose."
) as response:
    # Save the synthesized speech to the specified path
    response.stream_to_file(speech_file_path)
The model parameter is set to tts-1, specifying the text-to-speech model to be used. This model is optimized for speed. You can use another model, tts-1-hd, if you care more about the quality. The voice parameter is set to alloy, which determines the voice characteristics, such as tone and accent. You have other choices, like echo, fable, onyx, nova, and shimmer. Finally, the input parameter contains the text that you want to convert to speech: “Would you like to learn AI programming? We have many AI programming courses that you can choose.”
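If you’d like to hear how the other voices compare, an optional sketch like the one below generates the same sentence once per voice using the same streaming call; the output file names are just illustrative.
# Optional: generate one sample file per available voice for comparison
for voice_name in ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]:
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice=voice_name,
        input="Would you like to learn AI programming?"
    ) as response:
        # Illustrative file name, one per voice
        response.stream_to_file(f"audio/voice-sample-{voice_name}.mp3")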
Now, play the synthesized speech:
# Play the synthesized speech
play_speech(speech_file_path)
Nice! You’ve created synthesized speech.
If you don’t want to use the context manager, you can use the client.audio.speech.create method to create synthesized speech. Generate speech again. This time, you experiment with another voice and speed:
# Generate speech with a different voice and slower speed
response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    speed=0.6,
    input="Would you like to learn AI programming? We have many AI programming courses that you can choose."
)
# Save the synthesized speech to the specified path
response.stream_to_file(speech_file_path)
# Play the synthesized speech
play_speech(speech_file_path)
Notice that the voice is now echo, which has a different tone than alloy. Also, the speed is set to 0.6, making the speech slower. If you want to make the speech faster, you can set the speed to a value greater than 1.
However, if you use the client.audio.speech.create method, you’ll get this warning:
DeprecationWarning: Due to a bug, this method doesn't actually stream the response content, `.with_streaming_response.method()` should be used instead
  response.stream_to_file(speech_file_path)