Phrase-based image caption generator with hierarchical LSTM network
Document Type
Article
Publication Date
1-1-2019
Abstract
Automatic generation of captions that describe the content of an image has attracted considerable research interest recently, and most existing works treat the image caption as purely sequential data. Natural language, however, possesses a temporal hierarchical structure, with complex dependencies between subsequences. In this paper, we propose a phrase-based image captioning model that uses a hierarchical Long Short-Term Memory (phi-LSTM) architecture to generate image descriptions. In contrast to conventional solutions that generate a caption in a purely sequential manner, phi-LSTM decodes the image caption from phrase to sentence. It consists of a phrase decoder, which decodes noun phrases of variable length, and an abbreviated sentence decoder, which decodes the abbreviated form of the image description. A complete image caption is formed by combining the generated phrases with the abbreviated sentence during the inference stage. Empirically, our proposed model achieves better or competitive results on the Flickr8k, Flickr30k and MS-COCO datasets in comparison to state-of-the-art models. We also show that, on all three datasets, our proposed model generates more novel captions (not seen in the training data) that are richer in word content. © 2018 Elsevier B.V.
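The abstract does not spell out how phrases and the abbreviated sentence are combined, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes the abbreviated sentence marks each noun-phrase position with a placeholder token (here the hypothetical `<np>`) and that the generated phrases are filled in from left to right. The helper `novel_fraction` illustrates one plausible reading of "novel captions": the share of generated captions that do not appear verbatim in the training captions. All names are illustrative.

```python
from typing import List, Sequence

# Hypothetical placeholder token; the abstract does not name one.
PHRASE_SLOT = "<np>"


def merge_caption(abbreviated: Sequence[str],
                  phrases: Sequence[Sequence[str]]) -> List[str]:
    """Replace each slot token in the abbreviated sentence with the next phrase, in order."""
    remaining = iter(phrases)
    caption: List[str] = []
    for token in abbreviated:
        if token == PHRASE_SLOT:
            try:
                caption.extend(next(remaining))
            except StopIteration:
                raise ValueError("more phrase slots than phrases") from None
        else:
            caption.append(token)
    return caption


def novel_fraction(generated: Sequence[str], training: Sequence[str]) -> float:
    """Share of generated captions that do not appear verbatim in the training captions."""
    if not generated:
        return 0.0
    seen = set(training)
    return sum(c not in seen for c in generated) / len(generated)
```

For example, merging the abbreviated sentence `["<np>", "sits", "on", "<np>"]` with the phrases `[["a", "brown", "dog"], ["the", "grass"]]` yields the tokens of "a brown dog sits on the grass".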
Keywords
Image captioning, Natural language processing, Long short-term memory, Deep learning
Divisions
fsktm
Funders
Postgraduate Research Grant (PPP) PG003-2016A, Frontier Research Grant FG002-17AFR, from University of Malaya
Publication Title
Neurocomputing
Volume
333
Publisher
Elsevier