Abstract: Transformer-based models have reshaped image captioning but still struggle with issues such as limited caption accuracy, particularly for complex visuals. Addressing these shortcomings is essential. Motivated ...