ArtMentor: AI-Assisted Evaluation of Artworks to Explore Multimodal Large Language Models Capabilities

Chanjin Zheng^*^†
Shanghai Institute of Artificial Intelligence for Education
East China Normal University, Shanghai, China
Faculty of Education
East China Normal University, Shanghai, China
chjzheng@dep.ecnu.edu.cn

Zengyi Yu^*
Faculty of Education
East China Normal University, Shanghai, China
College of Education
Zhejiang University of Technology, Hangzhou, China
202105720431@zjut.edu.cn

Yilin Jiang^*
College of Education
Zhejiang University of Technology, Hangzhou, China
zjut_jiangyilin@163.com

Mingzi Zhang
Faculty of Education
East China Normal University, Shanghai, China
College of Education
Zhejiang Normal University, Jinhua, China
windyday@zjnu.edu.cn

Xunuo Lu
School of Economy
Zhejiang University of Technology, Hangzhou, China
13968860822@163.com

Jing Jin
School of Education
Zhejiang Normal University, Jinhua, China
Tianchang Guanchao Primary School, Hangzhou, China
383230730@qq.com

Liteng Gao
School of Artificial Intelligence Science and Technology
University of Shanghai for Science and Technology, Shanghai, China
2335060610@st.usst.edu.cn

Conference: CHI 2025
^* Indicates Equal Contribution ^† Corresponding Author


ArtMentor System Interface and Operation Process.

Abstract

Multimodal Large Language Models (MLLMs) face challenges in artwork evaluation, including subjective human assessments, limitations of result-oriented methods, and lack of modularity. In this paper, we propose that the design and analysis of HCI spaces, using process-oriented data, can more effectively evaluate MLLM capabilities and drive improvements. Applying this methodology, we introduce ArtMentor, a space that combines a dataset and three systems to enhance MLLM evaluations. ArtMentor documents 380 sessions with five art teachers, assessing artworks across nine critical dimensions. The modular system features entity recognition, review generation, and suggestion generation agents, enabling iterative upgrades. Process-based results analysis integrates machine learning and natural language processing to ensure reliable evaluations. Finally, we highlight that MLLMs tend to focus on details at the expense of the bigger picture, and that review generation outperforms suggestion generation. We encourage further collaboration to cost-effectively enhance MLLM capabilities. Our contributions are available at https://artmentor.github.io.

A multi-agent data collection system from ArtMentor evaluates GPT-4o's art assessment capabilities across 380 sessions with five art teachers and three GPT-4o agents. Each session begins with artwork upload, followed by automatic entity recognition. Teachers refine these entities, select from nine evaluation dimensions, and revise the generated scores, reviews, and suggestions until accurate. The process ends with teachers submitting the finalized entities, scores, reviews, and suggestions.

ArtMentor Space consists of four key components: a. Multi-Agent Data Collection System, b. HCI Dataset, c. Data Analysis System, d. Iterative Upgrades System. The Multi-Agent System includes three agents: E-Agent for entity recognition, R-Agent for review generation, and S-Agent for suggestions. R-Agent and S-Agent perform nine roles, such as Realism and Deformation. Nine HCI processes (P1-P9) are marked by origin: green for computer, orange for human. After data collection, we generate an HCI dataset with five products, evaluated by four metrics. Iterative upgrades focus on improving underperforming roles.

This image displays an interface for analyzing artwork. It features a central panel showing an abstract painting of a horse with various tags. Icons of a robot (E-Agent) and a person in a safety helmet (Art Teacher) are visible. The interface includes buttons for uploading artworks and managing entities, alongside a JSON-style data display showing recognized elements in the artwork.

This image displays an interface for art evaluation and suggestion. The top of the interface shows action buttons for "Evaluate & Generate", "Modify", and "Submit". Below, there's a section labeled "Realism" with a score of 5 out of 10. The interface includes two main text areas: a "Review" section critiquing an artwork of a horse, and a "Suggestion" section offering improvements. Each text area has "Modify" and "Submit" buttons. On the right side, two robot icons are labeled "Review Agent" and "Suggestion Agent". The interface uses a light color scheme with rounded elements, suggesting a user-friendly design for collaborative art analysis and improvement.

This image displays GPT-4o's accuracy in recognizing art styles across 20 artworks. A 5x4 grid represents each artwork, with lighter circles indicating correct recognition and darker ones showing errors. Artworks 4-7, identified as ink wash paintings, are notably misrecognized, highlighting GPT-4o's specific weakness with this style. Despite this, the overall accuracy is 80%, as shown by the large circle below. This visualization effectively demonstrates GPT-4o's general competence in art style recognition while pinpointing its struggle with ink wash paintings.

This bar chart illustrates GPT-4o's entity recognition capabilities. Notably, the precision score stands out at 0.935, significantly higher than other metrics. Accuracy (0.833) and recall (0.836) are nearly equal, while the F1 score (0.881) falls between precision and the other two metrics. The high precision indicates GPT-4o's strong ability to avoid false positives in entity recognition, despite slightly lower overall accuracy and recall.

Entity classification metrics for GPT-4o across 20 artworks (Artwork Numbers 1-20).

Score Acceptance Metrics.

Text Acceptance Metrics for R-Agent.

Text Acceptance Metrics for S-Agent.

Assessment Criteria for Realistic Artwork

Prompt Generation Logic for Suggestion Agent

HCI File Structure Analysis

Entities Folder

This folder contains 20 JSON files, one per artwork, each recording the entities and style recognized for that artwork together with the teacher's additions and removals.


{
  "original": ["Face", "Black hair", "Open mouth", "Green shirt", "Blue shorts", "Black shoes", "Monkey", "Cat", "Dog", "Bird", "Insect", "Exclamation mark", "Yellow platform", "Books"],
  "added": ["Yellow balances", "schoolbag"],
  "removed": ["Yellow platform"],
  "style": {
    "original": ["Style: Cartoon"],
    "added": [],
    "removed": []
  }
}

Field Explanations:

original: Elements recognized in the original image
added: New elements added by the user
removed: Elements removed by the user
style: Style-related information, including original style, added styles, and removed styles
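
The metric definitions behind the entity recognition chart are not given in this section, so the following is only a minimal sketch of one plausible way to derive precision, recall, and F1 from a single Entities file. It assumes that entities the teacher kept count as true positives, entities the teacher removed count as false positives, and entities the teacher added count as false negatives. The function name `entity_metrics` is hypothetical and not part of the ArtMentor code.

```python
import json


def entity_metrics(record: dict) -> dict:
    """Derive counts and metrics from one Entities record.

    Assumption (not confirmed by the source): kept = TP, removed = FP, added = FN.
    """
    original = set(record["original"])
    removed = set(record["removed"])
    added = set(record["added"])

    tp = len(original - removed)   # recognized and kept by the teacher
    fp = len(original & removed)   # recognized but removed by the teacher
    fn = len(added)                # missed by the model, added by the teacher

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# The sample record shown above, reproduced as a JSON string.
sample = json.loads("""
{
  "original": ["Face", "Black hair", "Open mouth", "Green shirt", "Blue shorts",
               "Black shoes", "Monkey", "Cat", "Dog", "Bird", "Insect",
               "Exclamation mark", "Yellow platform", "Books"],
  "added": ["Yellow balances", "schoolbag"],
  "removed": ["Yellow platform"],
  "style": {"original": ["Style: Cartoon"], "added": [], "removed": []}
}
""")

if __name__ == "__main__":
    print(entity_metrics(sample))
```

Under these assumptions the sample record yields 13 kept, 1 removed, and 2 added entities, so precision is 13/14 (about 0.929), recall is 13/15 (about 0.867), and F1 is 26/29 (about 0.897). The `style` field could be scored the same way by passing `record["style"]` to the function.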

score_Review Folder

This folder contains 180 files (20 artworks × 9 dimensions), each recording the scores and reviews for one artwork on one dimension.


[
  {
    "round": 1,
    "data": {
      "scores": {
        "original": 0,
        "current": 0,
        "initGPTscore": null
      },
      "Reviews": {
        "original": "",
        "current": "",
        "added": "",
        "removed": ""
      }
    }
  },
  {
    "round": 2,
    "data": {
      "scores": {
        "original": 4,
        "current": 4,
        "initGPTscore": 4
      },
      "Reviews": {
        "original": "The artwork effectively uses contrasting colors to enhance visual interest...",
        "current": "The artwork effectively uses contrasting colors to enhance visual interest...",
        "added": "",
        "removed": ""
      }
    }
  }
]

Field Explanations:

round: Scoring round
scores: Contains original score, current score, and initial GPT score
Reviews: Contains original review, current review, added review, and removed review
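
As an illustration of how the round structure can be read, here is a minimal sketch that checks whether the teacher kept the model's score. It assumes that a round with a null `initGPTscore` is a placeholder written before any score was generated, and that the entry with the highest `round` value holds the final state. The helper names `final_round` and `score_acceptance` are hypothetical, and this is not necessarily how the Score Acceptance Metrics in the paper are computed.

```python
import json
from typing import Optional


def final_round(rounds: list) -> dict:
    """Return the entry with the highest round number."""
    return max(rounds, key=lambda entry: entry["round"])


def score_acceptance(rounds: list) -> Optional[dict]:
    """Compare the final teacher score with the initial GPT score.

    Returns None when no GPT score was recorded (assumed placeholder round).
    """
    scores = final_round(rounds)["data"]["scores"]
    init = scores["initGPTscore"]
    if init is None:
        return None
    current = scores["current"]
    return {"accepted": current == init, "delta": current - init}


# Abbreviated copy of the sample file shown above.
sample_rounds = json.loads("""
[
  {"round": 1, "data": {
    "scores": {"original": 0, "current": 0, "initGPTscore": null},
    "Reviews": {"original": "", "current": "", "added": "", "removed": ""}}},
  {"round": 2, "data": {
    "scores": {"original": 4, "current": 4, "initGPTscore": 4},
    "Reviews": {"original": "The artwork effectively uses contrasting colors...",
                "current": "The artwork effectively uses contrasting colors...",
                "added": "", "removed": ""}}}
]
""")

if __name__ == "__main__":
    print(score_acceptance(sample_rounds))
```

For the sample file, the final round has `current` equal to `initGPTscore`, so the score counts as accepted with a delta of 0. A file containing only the placeholder round returns `None` rather than a misleading acceptance.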

suggestion Folder

This folder contains 180 files (20 artworks × 9 dimensions), each recording the suggestions for one artwork on one dimension.


[
  {
    "round": 1,
    "data": {
      "suggestions": {
        "original": "",
        "current": "",
        "added": "",
        "removed": ""
      }
    }
  },
  {
    "round": 2,
    "data": {
      "suggestions": {
        "original": "To improve the color contrast in the artwork, consider using more vibrant and varied background colors...",
        "current": "To improve the color contrast in the artwork, consider using more vibrant and varied background colors...",
        "added": "",
        "removed": ""
      }
    }
  }
]

Field Explanations:

round: Suggestion round
suggestions: Contains original suggestion, current suggestion, added suggestion, and removed suggestion
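
One way to quantify how much of a generated suggestion survives teacher editing is to compare the `original` and `current` strings in each round. The sketch below uses a word-level similarity ratio from Python's standard `difflib` module. This is a swapped-in illustrative measure, not the Text Acceptance Metrics reported for S-Agent, and the function name `retention_by_round` is hypothetical. Rounds with an empty `original` are assumed to be placeholders and are skipped.

```python
import json
from difflib import SequenceMatcher


def retention_by_round(rounds: list, field: str = "suggestions") -> list:
    """Return (round, similarity) pairs for rounds that contain generated text.

    Similarity is difflib's ratio over whitespace-separated words:
    1.0 means the teacher left the text unchanged.
    """
    results = []
    for entry in rounds:
        text = entry["data"][field]
        if not text["original"]:
            continue  # assumed placeholder round with no generated text
        before = text["original"].split()
        after = text["current"].split()
        ratio = SequenceMatcher(None, before, after).ratio()
        results.append((entry["round"], ratio))
    return results


# Abbreviated copy of the sample file shown above.
sample_rounds = json.loads("""
[
  {"round": 1, "data": {"suggestions":
    {"original": "", "current": "", "added": "", "removed": ""}}},
  {"round": 2, "data": {"suggestions":
    {"original": "To improve the color contrast in the artwork, consider using more vibrant and varied background colors...",
     "current": "To improve the color contrast in the artwork, consider using more vibrant and varied background colors...",
     "added": "", "removed": ""}}}
]
""")

if __name__ == "__main__":
    print(retention_by_round(sample_rounds))
```

In the sample file the suggestion is unchanged, so round 2 has a similarity of 1.0. Passing `field="Reviews"` applies the same comparison to files in the score_Review folder, since both use the same `original` and `current` keys.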

BibTeX


@article{zheng2025artmentor,
  title={ArtMentor: AI-Assisted Evaluation of Artworks to Explore Multimodal Large Language Models Capabilities},
  author={Zheng, Chanjin and Yu, Zengyi and Jiang, Yilin and Zhang, Mingzi and Lu, Xunuo and Jin, Jing and Gao, Liteng},
  journal={arXiv preprint arXiv:2502.13832},
  year={2025}
}