Forwarded from Github LLMs
LLMs can see and hear without any training

30 Jan 2025 · Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar ·

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually producing a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state of the art on emergent zero-shot image, video, and audio captioning. MILS also applies seamlessly to media generation, discovering prompt rewrites that improve text-to-image generation and even editing prompts for style transfer. Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
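The generate, score, and feed-back loop described above can be illustrated with a minimal Python sketch. This is not the code from the linked repository: `mils_style_search`, `toy_score`, and `make_toy_generator` are hypothetical names invented here. The generator is a random word mutator standing in for an LLM prompt, and the scorer is a word-overlap count standing in for a multimodal similarity model.

```python
import random
from typing import Callable, List


def mils_style_search(
    generate: Callable[[List[str]], List[str]],
    score: Callable[[str], float],
    init: List[str],
    steps: int = 10,
    keep: int = 3,
) -> str:
    """Gradient-free loop: propose candidates, score them, feed the best back."""
    best = sorted(init, key=score, reverse=True)[:keep]
    for _ in range(steps):
        candidates = generate(best)  # stand-in for prompting an LLM with the top texts
        pool = list(dict.fromkeys(best + candidates))  # dedupe while keeping order
        best = sorted(pool, key=score, reverse=True)[:keep]
    return best[0]


# Toy stand-ins (assumptions for illustration only).
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa"]
TARGET = {"dog", "runs", "on", "grass"}  # plays the role of the image or audio content


def toy_score(text: str) -> float:
    """Reward words that match the hidden target, penalize the rest."""
    words = set(text.split())
    return len(words & TARGET) - 0.5 * len(words - TARGET)


def make_toy_generator(seed: int = 0) -> Callable[[List[str]], List[str]]:
    """Return a generator that randomly edits each of the current best texts."""
    rng = random.Random(seed)

    def generate(best: List[str]) -> List[str]:
        out = []
        for text in best:
            words = text.split()
            for _ in range(4):
                new = words.copy()
                if new and rng.random() < 0.5:
                    new[rng.randrange(len(new))] = rng.choice(VOCAB)  # swap a word
                else:
                    new.append(rng.choice(VOCAB))  # extend the text
                out.append(" ".join(new))
        return out

    return generate


init = ["a cat sleeps"]
result = mils_style_search(make_toy_generator(seed=0), toy_score, init, steps=20)
print(result, toy_score(result))
```

Because the current best candidates stay in the pool at every step, the top score never decreases. No gradients are needed, so in a real setting the scorer could be any black-box model that compares text with an image, video, or audio clip.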

Paper: https://arxiv.org/pdf/2501.18096v1.pdf

Code: https://github.com/facebookresearch/mils

https://www.tg-me.com/deep_learning_proj

BY Machine learning books and papers





