This task illustrates the effectiveness of speaker timbre analysis, using the following prompts:
Qwen2.5-Omni-7B: Listen to the provided audio and produce a caption describing the speaker's timbre.
MiDashengLM-7B: Write an audio caption describing the speaker's timbre.

| Example | Qwen2.5-Omni-7B | MiDashengLM-7B |
|---|---|---|
| 1 | A female voice speaking. | A female voice with a slightly high pitch and moderate volume delivers an enthusiastic monologue. |
| 2 | A male in his twenties speaks in Spanish. | A male speaker with a neutral tone delivers information in Spanish. |
| 3 | A female is speaking in a neutral mood. | A female speaker with a neutral tone delivers a Portuguese monologue. |
| 4 | A female with a Spanish accent is speaking. | A male speaker with a neutral tone delivers a Spanish-language monologue. |
| 5 | A man is speaking and music is playing. | A male voice with a neutral tone delivers English pronunciation examples, including 'ring', 'wing', and 'running'. |
| 6 | A woman is speaking. | A female voice is heard, delivering a neutral-toned speech in English. |
| 7 | A man is speaking. | A male voice with a deep, resonant tone delivers a monologue in English. |
| 8 | A male voice is speaking. | A male voice with a deep, resonant tone delivers a solemn monologue. |
| 9 | A male voice speaking in a neutral tone. | A male speaker with a neutral tone delivers a Portuguese message about toy donations. |
| 10 | A male voice speaking in English with a neutral mood. | A male speaker with a neutral tone and clear enunciation discusses real-time data visualization. |
| 11 | The speaker's voice is characterized by a mature, experienced tone, suggesting a mature adult male. | A female speaker with a clear, articulate Telugu voice delivers a monologue. |
| 12 | A female voice is speaking in English, with a neutral mood. | A female voice with a neutral tone and clear enunciation. |

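This task illustrates the effectiveness of speaker language analysis, using the following prompt (given to both models in Chinese):
请描述说话人的语言特性, 包括说话人的语种, 口音等. ("Please describe the speaker's linguistic characteristics, including the language spoken, accent, etc.")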
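
| Example | Qwen2.5-Omni-7B | MiDashengLM-7B |
|---|---|---|
| 1 | 说话人使用的是英语,带有斯里兰卡的口音。 (The speaker uses English, with a Sri Lankan accent.) | 说话人使用的是英语,带有印度口音。 (The speaker uses English, with an Indian accent.) |
| 2 | 英语,美国口音 (English, American accent) | 说话人使用的是英语,并带有美国口音。 (The speaker uses English, with an American accent.) |
| 3 | Benison is his estranged father's name and Fortune is his mother's maiden name. | 说话人使用的是英语,具有美国口音。 (The speaker uses English, with an American accent.) |
| 4 | 普通话-南京口音 (Mandarin, Nanjing accent) | 说话人使用的是普通话,没有明显的口音。 (The speaker uses Mandarin, with no noticeable accent.) |
| 5 | 普通话-女性口音 (Mandarin, female accent) | 说话人使用的是普通话,没有明显的口音。 (The speaker uses Mandarin, with no noticeable accent.) |
| 6 | 普通话-北京口音 (Mandarin, Beijing accent) | 说话人使用的是普通话,没有明显的口音。 (The speaker uses Mandarin, with no noticeable accent.) |
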
This task illustrates the effectiveness of environmental sound recognition, using the following prompts:
Qwen2.5-Omni-7B: Classify the given multi-label audio in English.
MiDashengLM-7B: Which labels describe the sound?

| Example | Qwen2.5-Omni-7B | MiDashengLM-7B | Label |
|---|---|---|---|
| 1 | {'Rattle'} | {'Coin (dropping)', 'Domestic sounds, home sounds'} | {'Coin (dropping)', 'Domestic sounds and home sounds'} |
| 2 | {'Music'} | {'Vehicle', 'Alarm', 'Vehicle horn, car horn, honking', 'Car', 'Motorcycle'} | {'Vehicle horn and car horn and honking', 'Vehicle', 'Alarm', 'Car', 'Motorcycle'} |
| 3 | {'Splash - splatter'} | {'Water', 'Liquid', 'Splash, splatter'} | {'Water', 'Liquid', 'Splash and splatter'} |
| 4 | {'Radio'} | {'Speech', 'Male speech, man speaking', 'Human voice'} | {'Speech', 'Male speech and man speaking', 'Human voice'} |
| 5 | {'Squeak'} | {'Slam', 'Domestic sounds, home sounds', 'Door', 'Squeak'} | {'Domestic sounds and home sounds', 'Slam', 'Door', 'Squeak'} |
| 6 | {'Gunshot - gunfire'} | {'Gunshot, gunfire', 'Explosion'} | {'Explosion', 'Gunshot and gunfire'} |
| 7 | {'Dog', 'bark'} | {'Dog', 'Domestic animals, pets', 'Bark', 'Animal'} | {'Dog', 'Bark', 'Domestic animals and pets', 'Animal'} |
| 8 | {'Cat', 'meow'} | {'Domestic animals, pets', 'Meow', 'Cat', 'Animal'} | {'Meow', 'Cat', 'Domestic animals and pets', 'Animal'} |
| 9 | {'Bird'} | {'Bird', 'Chirp, tweet', 'Bird vocalization, bird call, bird song', 'Wild animals', 'Animal'} | {'Bird', 'Bird vocalization and bird call and bird song', 'Wild animals', 'Animal', 'Chirp and tweet'} |

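
The same class name appears with different separators in the examples above: the labels use "and" (`'Splash and splatter'`), MiDashengLM-7B uses commas (`'Splash, splatter'`), and Qwen2.5-Omni-7B uses a dash (`'Splash - splatter'`). A minimal sketch of how predictions could be normalized before being compared against the labels is shown below. The normalization rule and the helper names `normalize` and `set_f1` are assumptions made for illustration; they are not the evaluation protocol used for MiDashengLM.

```python
import re

def normalize(label: str) -> str:
    """Lower-case a label and unify ', ', ' and ', ' - ' separators (assumed rule)."""
    parts = re.split(r"\s*,\s*|\s+and\s+|\s+-\s+", label.strip().lower())
    return " / ".join(p for p in parts if p)

def set_f1(pred: set[str], gold: set[str]) -> float:
    """F1 between two label sets after normalization: 2*TP / (|pred| + |gold|)."""
    p = {normalize(x) for x in pred}
    g = {normalize(x) for x in gold}
    if not p or not g:
        return 0.0
    return 2 * len(p & g) / (len(p) + len(g))

# Example 3 from the table above.
gold = {"Water", "Liquid", "Splash and splatter"}
qwen = {"Splash - splatter"}
midasheng = {"Water", "Liquid", "Splash, splatter"}

print(set_f1(qwen, gold))       # 0.5 (one of three labels recovered)
print(set_f1(midasheng, gold))  # 1.0
```

Note that this rule also splits class names that genuinely contain "and"; because the same rule is applied to predictions and labels alike, such names still compare equal.
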
This task illustrates the effectiveness of musical instrument recognition, using the following prompts:
Qwen2.5-Omni-7B: Recognize the music instrument with keywords in English.
MiDashengLM-7B: What's the music instrument?

| Example | Qwen2.5-Omni-7B | MiDashengLM-7B | Label |
|---|---|---|---|
| 1 | violin | string | string |
| 2 | keyboard | guitar | guitar |
| 3 | organ | vocal | vocal |
| 4 | reed | flute | flute |
| 5 | guitar | keyboard | keyboard |
| 6 | keyboard | mallet | mallet |
| 7 | bass | reed | reed |
| 8 | guitar | bass | bass |
| 9 | bass | organ | organ |

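
For a single-label task like this one, the comparison reduces to exact-match accuracy. The sketch below scores the nine examples listed above; the `accuracy` helper is a hypothetical name, and the numbers describe only these hand-picked examples, not a benchmark result.

```python
# (Qwen2.5-Omni-7B, MiDashengLM-7B, Label) for Examples 1-9 above.
examples = [
    ("violin", "string", "string"),
    ("keyboard", "guitar", "guitar"),
    ("organ", "vocal", "vocal"),
    ("reed", "flute", "flute"),
    ("guitar", "keyboard", "keyboard"),
    ("keyboard", "mallet", "mallet"),
    ("bass", "reed", "reed"),
    ("guitar", "bass", "bass"),
    ("bass", "organ", "organ"),
]

def accuracy(preds: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match the label exactly (case-insensitive)."""
    hits = sum(p.strip().lower() == l.strip().lower() for p, l in zip(preds, labels))
    return hits / len(labels)

labels = [label for _, _, label in examples]
qwen_acc = accuracy([q for q, _, _ in examples], labels)
midasheng_acc = accuracy([m for _, m, _ in examples], labels)

print(f"Qwen2.5-Omni-7B: {qwen_acc:.2f}")   # 0.00 on these nine examples
print(f"MiDashengLM-7B:  {midasheng_acc:.2f}")  # 1.00 on these nine examples
```

Exact match is strict: in Example 1, "violin" is a string instrument but still counts as a miss against the label "string".
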
MiDashengLM is released under the Apache License 2.0, and we encourage its use in both research and commercial applications.
If you find MiDashengLM useful in your research, please consider citing our work:
@misc{midashenglm7b,
title = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
author = {Xiaomi MiLM Plus Horizon Team},
year = {2025},
}