Image caption generation is a multimodal task in artificial intelligence that combines research in computer vision and natural language processing to convert images into text. It plays an important role in visual assistance and image understanding, and in recent years it has attracted extensive attention from researchers. Firstly, this paper describes the task of image caption generation and introduces three families of methods: template-based, retrieval-based, and encoder-decoder methods.
For each, it presents the core idea, representative research, and advantages and disadvantages. Secondly, the paper examines the encoder-decoder approach in detail, in terms of both model structure and research progress in the image understanding and caption generation phases, and groups the research of recent years into work on image understanding and work on caption generation. Research on image understanding covers attention mechanisms and semantic aspects.
Research on caption generation is divided into traditional caption generation, dense caption generation, and stylized caption generation. The performance, advantages, and disadvantages of the models are summarized, and the datasets and evaluation metrics used to assess image captioning models are introduced. Finally, the challenges and difficulties in the field of image captioning are pointed out.