Content Details
Title
EXPLOITING SEMANTIC KNOWLEDGE FOR IMAGE CAPTIONING USING DEEP LEARNING
Author(s)
Ali Raza
Abstract
Image captioning is the task of generating textual descriptions for images. It has attracted considerable attention recently because of its applicability across a variety of fields. One persistent challenge is the limited incorporation of semantic knowledge when generating captions: semantic knowledge can aid object detection by exploiting relationships among objects and can also inform language semantics. In this study, the image captioning problem is investigated by combining two efficient models, the Vision Transformer (ViT) and the Generative Pre-trained Transformer 2 (GPT-2). ViT applies self-attention over image patches to capture visual elements and the overall context of an image. GPT-2 complements ViT with strong language generation abilities, producing text that is cohesive and contextually relevant. An encoder-decoder deep learning model is proposed in which ViT acts as the encoder, extracting meaningful visual representations from images, while GPT-2 acts as the decoder, producing descriptive captions from the extracted visual features. This design seamlessly combines textual and visual information, yielding captions that faithfully reflect the content of the input images. Empirical analyses demonstrate the potential of this combination, highlighting the advantages of exploiting both language and visual components in the image captioning process. This research strengthens multimodal AI systems by bridging the gap between visual and language comprehension. Experiments were performed on the MS COCO and Flickr30k datasets, and the model was validated using several evaluation metrics. On the MS COCO dataset, BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE, and METEOR improved by 10.58, 20.45, 21.07, 34.19, 0.3, and 11.16, respectively.
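The following is a minimal sketch, not the author's code, of the ViT-encoder / GPT-2-decoder architecture the abstract describes, built with the Hugging Face Transformers VisionEncoderDecoderModel wrapper. The checkpoint names, beam size, and caption length are illustrative assumptions; the thesis' actual training configuration and fine-tuned weights are not given here, and the model below would need fine-tuning on a captioning dataset such as MS COCO or Flickr30k before producing meaningful captions.

```python
# Sketch only: ties a pretrained ViT encoder to a pretrained GPT-2 decoder
# with freshly initialised cross-attention, as in the encoder-decoder
# captioning setup described in the abstract. Checkpoints are assumptions.
from PIL import Image
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    GPT2TokenizerFast,
)

# Encoder: ViT over 16x16 image patches; decoder: GPT-2 language model.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "gpt2",
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 has no pad token, so reuse EOS and tell the wrapper how decoding starts.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id

def caption(image_path: str, max_length: int = 32) -> str:
    """Generate a caption for one image (assumes the model has been fine-tuned)."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In this arrangement the ViT patch embeddings play the role of the extracted visual features, and the GPT-2 decoder attends to them through cross-attention while generating the caption token by token, which mirrors the encoder-decoder division of labour described above.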
Type
MS Thesis/Dissertation
Faculty
Engineering and Computer Science
Department
Engineering
Language
English
Publication Date
2024-03-25
Subject
Publisher
Contributor(s)
Format
Identifier
Source
Relation
Coverage
Rights
Category
Description
Attachment
Name
ed36d382cf.pdf
Timestamp
2024-04-04 13:38:36