Phrase-based image caption generator with hierarchical LSTM network

Document Type

Article

Publication Date

1-1-2019

Abstract

Automatic generation of captions that describe the content of an image has attracted considerable research interest recently, and most existing works treat the image caption as purely sequential data. Natural language, however, possesses a temporal hierarchical structure, with complex dependencies between subsequences. In this paper, we propose a phrase-based image captioning model using a hierarchical Long Short-Term Memory (phi-LSTM) architecture to generate image descriptions. In contrast to conventional solutions that generate a caption in a purely sequential manner, phi-LSTM decodes the image caption from phrase to sentence. It consists of a phrase decoder, which decodes noun phrases of variable length, and an abbreviated sentence decoder, which decodes an abbreviated form of the image description. A complete image caption is formed by combining the generated phrases with the abbreviated sentence during the inference stage. Empirically, our proposed model shows better or competitive results on the Flickr8k, Flickr30k and MS-COCO datasets in comparison to state-of-the-art models. We also show that our proposed model is able to generate more novel captions (not seen in the training data), which are richer in word content, on all three datasets. © 2018 Elsevier B.V.
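
The final combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how phrase positions are marked in the abbreviated sentence, so the `<NP>` slot token, the `combine` function, and the example words below are all hypothetical, and the two decoders are replaced by hard-coded outputs.

```python
from typing import List

# Hypothetical placeholder marking where a noun phrase belongs;
# this token is an assumption and is not taken from the paper.
PHRASE_SLOT = "<NP>"


def combine(abbreviated: List[str], phrases: List[List[str]]) -> str:
    """Fill each slot in the abbreviated sentence with the next phrase, in order."""
    remaining = iter(phrases)
    words: List[str] = []
    for token in abbreviated:
        if token == PHRASE_SLOT:
            # Replace the slot with the words of the next generated noun phrase.
            words.extend(next(remaining))
        else:
            words.append(token)
    return " ".join(words)


# Stand-ins for the outputs of the abbreviated sentence decoder and the phrase decoder.
abbreviated_sentence = [PHRASE_SLOT, "is", "sitting", "on", PHRASE_SLOT]
noun_phrases = [["a", "brown", "dog"], ["the", "green", "grass"]]

caption = combine(abbreviated_sentence, noun_phrases)
print(caption)  # a brown dog is sitting on the green grass
```

In this sketch the phrases are consumed strictly in order, one per slot; how the actual model aligns phrases with positions in the sentence is not stated in the abstract.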

Keywords

Image captioning, Natural language processing, Long short-term memory, Deep learning

Divisions

fsktm

Funders

Postgraduate Research Grant (PPP) PG003-2016A, Frontier Research Grant FG002-17AFR, from University of Malaya

Publication Title

Neurocomputing

Volume

333

Publisher

Elsevier
