For example, a multimodal file search might let a user search documents not only by textual metadata, such as file names or tags, but also by the content of the files themselves: text within a document, graphical elements in a PDF file, or even dialogue from an audio or video recording.
This can make it much easier to find specific files in large data sets, especially those that contain many different file types.
In this contest, we will focus on the query-based multimodal retrieval task, using a combination of image and text models.
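To make the idea of combining an image model and a text model concrete, here is a minimal sketch of query-based retrieval. It is an illustration under stated assumptions, not the method used in the contest: the embeddings are hand-written toy vectors standing in for the outputs of real encoders, the query is assumed to be already embedded in both the text and the image space, and the names `cosine`, `retrieve`, `INDEX`, and `text_weight` are hypothetical. Items are ranked by a weighted sum of their text and image cosine similarities to the query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_text_vec, query_image_vec, index, text_weight=0.5, top_k=2):
    """Rank indexed items by a weighted sum of text and image similarity."""
    scored = []
    for name, (text_vec, image_vec) in index.items():
        score = (text_weight * cosine(query_text_vec, text_vec)
                 + (1.0 - text_weight) * cosine(query_image_vec, image_vec))
        scored.append((score, name))
    scored.sort(reverse=True)  # highest combined score first
    return [name for _, name in scored[:top_k]]


# Toy 3-dimensional embeddings (assumed values); a real system would
# obtain them from trained text and image encoders.
INDEX = {
    "cat_photo.jpg": ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    "invoice.pdf": ([0.0, 0.9, 0.2], [0.1, 0.9, 0.1]),
    "dog_video.mp4": ([0.7, 0.0, 0.4], [0.6, 0.1, 0.5]),
}

# The query is assumed to be already embedded in both spaces.
results = retrieve([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], INDEX)
print(results)
```

In this sketch, `text_weight` controls how much the text similarity counts relative to the image similarity; setting it to 1.0 reduces the search to text-only retrieval, and 0.0 to image-only retrieval.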