Department wise Listing | NUML Online Research Repository
EXPLOITING SEMANTIC KNOWLEDGE FOR IMAGE CAPTIONING USING DEEP LEARNING

Image captioning is the task of generating textual descriptions for images. It has attracted considerable attention recently because of its applicability across a variety of fields. One open challenge is the limited use of semantic knowledge when generating captions: semantic knowledge can aid object detection by exploiting relationships among objects, and it can improve the language semantics of the output. In this study, image captioning is addressed by combining two effective models, the Vision Transformer (ViT) and the Generative Pre-trained Transformer 2 (GPT-2). ViT applies self-attention to image patches to capture both the visual elements and the overall context of an image. GPT-2 complements ViT with strong language-generation abilities, producing text that is cohesive and contextually relevant. An encoder-decoder deep learning model is proposed in which ViT acts as the encoder, extracting meaningful visual representations from images, while GPT-2 acts as the decoder, producing descriptive captions from the extracted visual features. This design seamlessly combines visual and textual information, yielding captions that faithfully reflect the content of the input images. Empirical analyses demonstrate the potential of this combination, highlighting the advantages of exploiting both the visual and the language components of the image-captioning process. This research strengthens multimodal AI systems by bridging the gap between visual and language comprehension. Experiments were performed on the MS COCO and Flickr30k datasets, and the model was validated using several evaluation metrics.
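The encoder side described above rests on ViT's key idea: an image is split into fixed-size patches, each flattened into a vector and treated as a token for self-attention, just as words are tokens for GPT-2. The following is a minimal sketch of that patch-tokenization step only (not the thesis implementation, which would use learned embeddings on real pixel tensors); the image is represented here as a plain 2D list of grayscale values for illustration.

```python
def image_to_patches(image, patch_size):
    """Split an H x W pixel grid into non-overlapping patch_size x patch_size
    patches, each flattened (row-major) into a 1D token vector."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            # Flatten one patch into a single token vector.
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# A 4x4 "image" with 2x2 patches yields a sequence of 4 tokens of length 4,
# which a ViT-style encoder would then embed and feed through self-attention.
img = [[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]]
tokens = image_to_patches(img, 2)
```

In the actual ViT, each such flattened patch is linearly projected to an embedding and augmented with a positional encoding before entering the transformer layers.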
On the MS COCO dataset, results show improvements in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE, and METEOR of 10.58, 20.45, 21.07, 34.19, 0.3, and 11.16, respectively.
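The BLEU family of metrics cited above is built on clipped n-gram precision: each candidate n-gram is credited at most as many times as it appears in the reference. Below is a hedged sketch of the unigram (BLEU-1) case only; the full metric also combines higher-order n-grams geometrically and applies a brevity penalty, and the thesis presumably relies on standard evaluation tooling rather than code like this.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision (the core of BLEU-1): each candidate word
    counts at most as often as it appears in the reference caption."""
    cand, ref = Counter(candidate), Counter(reference)
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(1, sum(cand.values()))

# The repeated "the" in the candidate is clipped to the single "the"
# in the reference, so 5 of 7 candidate words are credited.
cand = "a dog runs on the the grass".split()
ref  = "a dog runs across the grass".split()
p1 = unigram_precision(cand, ref)  # 5/7
```

Clipping is what stops a degenerate caption such as "the the the the" from scoring perfect precision against any reference containing "the".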