In this demo, you’ll create a multimodal language tutor app using Gradio. The app will simulate conversational scenarios, allowing users to practice their English skills interactively. The app will display images, play audio prompts, and let users respond via recorded speech. It will then update the conversation, generate new images, and provide audio feedback based on the user’s input.
Start by defining the seed prompt for the initial situational context. Generate the initial situational description and corresponding image using the generate_situational_prompt function from the previous demo. Remember that you can run the Jupyter Lab code here in the same Jupyter Lab file you worked on in the last demo.
# Build the multimodal language tutor app using Gradio
# Initial seed prompt for generating the initial situational context
seed_prompt = "cafe near beach" # or "comics exhibition",
"meeting parents-in-law for the first time", etc
# Generate an initial situational description based on the seed prompt
initial_situation = generate_situational_prompt(seed_prompt)
# Generate an initial image based on the initial situational description
img = generate_situation_image(initial_situation)
# Flags to manage the state of the app
first_time = True
combined_history = ""
# Function to extract the first and last segments of the conversation
# history
# This is to ensure that the prompt for DALL-E does not exceed the
# maximum character limit of 4000 characters
def extract_first_last(text):
    elements = [elem.strip() for elem in text.split('====')
                if elem.strip()]
    if len(elements) >= 2:
        return elements[0] + elements[-1]
    elif len(elements) == 1:
        return elements[0]
    else:
        return ""
Define the main function conversation_generation to handle the conversation logic. This function will transcribe the user's speech, update the conversation history, generate a new conversation response, and update the visual and audio outputs. Add the function to the code cell:
# Main function to handle the conversation generation logic
def conversation_generation(audio_path):
    global combined_history
    global first_time

    # Transcribe the user's speech from the provided audio file path
    transcripted_text = transcript_speech(audio_path)

    # Create conversation history based on whether it is the first
    # interaction or not
    if first_time:
        history = creating_conversation_history(initial_situation,
                                                transcripted_text)
        first_time = False
    else:
        history = creating_conversation_history(combined_history,
                                                transcripted_text)

    # Generate a new conversation based on the updated history
    conversation = generate_conversation_from_history(history)

    # Update the combined history with the new conversation
    combined_history = history + "\n====\n" + conversation

    # Extract a suitable prompt for DALL-E by combining the first
    # and last parts of the conversation history
    dalle_prompt = extract_first_last(combined_history)

    # Generate a new image based on the extracted DALL-E prompt
    img = generate_situation_image(dalle_prompt)

    # Generate speech for the new conversation and save it to an
    # audio file
    output_audio_file = "speak_speech.mp3"
    speak_prompt(conversation, False, output_audio_file)

    # Return the updated image, conversation text, and audio file path
    return img, conversation, output_audio_file
This function, conversation_generation, handles the conversation logic for the app. It starts by transcribing the user's speech from the provided audio file path. Based on whether it's the first interaction, it creates the conversation history accordingly. It then generates a new conversation response using the updated history and updates the combined history. The function extracts a suitable prompt for generating a new image based on the conversation history, generates the image, and produces speech for the new conversation, saving it to an audio file. Finally, it returns the updated image, conversation text, and audio file path.
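If you'd like to see how this function plugs into the user interface from the previous demo, here's a minimal sketch of the Gradio wiring. The component choices and labels are assumptions based on standard Gradio usage, not the exact code from this course:

import gradio as gr

# A minimal, assumed wiring of conversation_generation into Gradio.
# The microphone input records speech and passes its file path to the
# function; the three outputs match the function's return values.
demo = gr.Interface(
    fn=conversation_generation,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[
        gr.Image(label="Situation"),
        gr.Textbox(label="Conversation"),
        gr.Audio(label="Tutor response"),
    ],
)

demo.launch()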
Now, it's time to proceed to this lesson's conclusion.