In this project we construct the CLEVR-Ref+ dataset, by repurposing and augmenting the CLEVR dataset from VQA towards referring expressions. This synthetic diagnostic dataset will complement existing real-world ones:
There are two main reasons we did this:
One is to test and diagnose state-of-the-art referring expression models at a more detailed level than the detection accuracy or segmentation IoU, including our proposed IEP-Ref model that explicitly captures compositionality.
The other is to perform step-by-step inspection of the model’s reasoning process. We are especially interested in whether the neural modules in Neural Module Networks indeed perform the intended functionalities when trained end-to-end. According to our experiments, the answer is quite positive: