Overview

This article demonstrates formal evaluation workflows for comparing AI providers using artcurator’s unified interface. These patterns form the foundation for future automated agent-based evaluation systems.

What you’ll learn:

  • Systematic provider comparison protocols
  • Ensemble voting for confidence scoring
  • Quality metrics and benchmarking
  • Framework for AI-agent evaluators (future work)

Forward-looking: These workflows set up the infrastructure for formal model evaluation frameworks that will be implemented in future package iterations.


Workflow 1: Side-by-Side Provider Comparison

Run identical prompts across both providers to compare outputs:

img <- "artwork.jpeg"

# Gemini analysis
gem_styles <- art_style_ai(img, provider = "gemini", temp = 0)
gem_desc <- art_about_ai(img, provider = "gemini", temp = 0.7)

# OpenAI analysis
oai_styles <- art_style_ai(img, provider = "openai", temp = 0)
oai_desc <- art_about_ai(img, provider = "openai", temp = 0.7)

# Compare style tags
cat("=== GEMINI ===\n")
=== GEMINI ===
print(gem_styles[, .(tag, tag_norm)])
                                tag                         tag_norm
                             <char>                           <char>
1:       Cubist-Inspired Figuration       cubist-inspired figuration
2: Vibrant, Saturated Color Palette vibrant, saturated color palette
3:              Bold Black Outlines              bold black outlines
4:   Dense, Overlapping Composition   dense, overlapping composition
5:         Stylized, Prominent Eyes         stylized, prominent eyes
6:                 Textured Surface                 textured surface
7:   Scientific and Textual Collage   scientific and textual collage
cat("\n=== OPENAI ===\n")

=== OPENAI ===
print(oai_styles[, .(tag, tag_norm)])
                               tag                       tag_norm
                            <char>                         <char>
 1:        Fragmented cubist faces        fragmented cubist faces
 2:    Vivid high-contrast palette    vivid high-contrast palette
 3:       Bold black contour lines       bold black contour lines
 4:      Asymmetrical eye emphasis      asymmetrical eye emphasis
 5:     Text and chemical notation     text and chemical notation
 6:   Layered textures and impasto   layered textures and impasto
 7: Dynamic compositional crowding dynamic compositional crowding
 8:    Geometric-organic interplay    geometric-organic interplay
 9:          Symbolic color zoning          symbolic color zoning
10:  Surreal anatomical distortion  surreal anatomical distortion
# Compare descriptions
cat("\n=== GEMINI DESCRIPTION ===\n", gem_desc, "\n")

=== GEMINI DESCRIPTION ===
 This artwork is a vibrant, abstract composition featuring a collage of fragmented faces. Executed in a style influenced by Cubism, the piece uses bold black outlines to separate geometric planes and present multiple viewpoints at once. The palette is electric, dominated by saturated blues, reds, pinks, and oranges, creating a high-contrast and energetic visual field. Interspersed throughout the painting are scientific references to the element Zinc, including its chemical symbol (Zn), atomic number (30), atomic mass (65.38), and molecular diagrams. The overall mood is one of dynamic complexity, merging expressive human features with cool, factual scientific data. 
cat("\n=== OPENAI DESCRIPTION ===\n", oai_desc, "\n")

=== OPENAI DESCRIPTION ===
 This artwork presents a dense cluster of overlapping faces, each fragmented into bold geometric shapes and outlined in black. The style is strongly influenced by Cubism and abstract expressionism, with distorted proportions, multiple viewpoints, and graphic linear details. A vivid palette of electric blues, reds, yellows, oranges, and greens dominates the composition, contrasted by smaller areas of white and black. Scientific notations and chemical structures, including references to zinc, are integrated into the imagery, adding a conceptual and experimental layer. The overall mood is energetic and slightly chaotic, suggesting heightened emotion, complexity of identity, and an analytical gaze turned inward. 
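
Beyond eyeballing the two outputs, a quick numeric check of tag agreement can be computed from the normalized tags. A minimal base R sketch (the exact overlap is usually small because each provider words its tags differently):

# Agreement between the two providers' normalized style tags
shared   <- intersect(gem_styles$tag_norm, oai_styles$tag_norm)
only_gem <- setdiff(gem_styles$tag_norm, oai_styles$tag_norm)
only_oai <- setdiff(oai_styles$tag_norm, gem_styles$tag_norm)

# Jaccard similarity: shared tags / all distinct tags
jaccard <- length(shared) /
  length(union(gem_styles$tag_norm, oai_styles$tag_norm))
cat("Shared:", length(shared), " Jaccard:", round(jaccard, 2), "\n")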

Workflow 2: Ensemble Voting

Combine results from multiple providers to increase confidence:

# Merge unique tags from both providers
all_tags <- unique(c(gem_styles$tag, oai_styles$tag))
length(all_tags)
[1] 17

Track which provider found each tag:

tag_srcs <- data.table(
  tag = all_tags,
  gemini = all_tags %in% gem_styles$tag,
  openai = all_tags %in% oai_styles$tag
)[, votes := as.integer(gemini) + as.integer(openai)][order(-votes)]

tag_srcs[]
                                 tag gemini openai votes
                              <char> <lgcl> <lgcl> <int>
 1:       Cubist-Inspired Figuration   TRUE  FALSE     1
 2: Vibrant, Saturated Color Palette   TRUE  FALSE     1
 3:              Bold Black Outlines   TRUE  FALSE     1
 4:   Dense, Overlapping Composition   TRUE  FALSE     1
 5:         Stylized, Prominent Eyes   TRUE  FALSE     1
 6:                 Textured Surface   TRUE  FALSE     1
 7:   Scientific and Textual Collage   TRUE  FALSE     1
 8:          Fragmented cubist faces  FALSE   TRUE     1
 9:      Vivid high-contrast palette  FALSE   TRUE     1
10:         Bold black contour lines  FALSE   TRUE     1
11:        Asymmetrical eye emphasis  FALSE   TRUE     1
12:       Text and chemical notation  FALSE   TRUE     1
13:     Layered textures and impasto  FALSE   TRUE     1
14:   Dynamic compositional crowding  FALSE   TRUE     1
15:      Geometric-organic interplay  FALSE   TRUE     1
16:            Symbolic color zoning  FALSE   TRUE     1
17:    Surreal anatomical distortion  FALSE   TRUE     1

Classify each provider’s tags separately for comparison:

gem_class <- classify_styles_ai(gem_styles$tag, provider = "gemini")
oai_class <- classify_styles_ai(oai_styles$tag, provider = "openai")

# Compare category assignments across providers
cat("=== CATEGORY COMPARISON ===\n")
=== CATEGORY COMPARISON ===
cat("Gemini unique categories:", length(unique(gem_class$category)), "\n")
Gemini unique categories: 3 
cat("OpenAI unique categories:", length(unique(oai_class$category)), "\n")
OpenAI unique categories: 6 
# Show first few from each
cat("\nGemini sample:\n")

Gemini sample:
print(head(gem_class[, .(tag, category)], 3))
                                tag      category
                             <char>        <char>
1:       Cubist-Inspired Figuration        Cubism
2: Vibrant, Saturated Color Palette       Fauvism
3:              Bold Black Outlines Expressionism
cat("\nOpenAI sample:\n")

OpenAI sample:
print(head(oai_class[, .(tag, category)], 3))
                           tag      category
                        <char>        <char>
1:     Fragmented cubist faces        Cubism
2: Vivid high-contrast palette Expressionism
3:    Bold black contour lines Expressionism

Insight: Tags identified by both providers carry higher confidence and can be used for quality filtering. In this example every tag receives a single vote because the two providers phrase the same ideas differently, so exact string matching rarely agrees; comparing normalized tags or using approximate matching makes the vote count more meaningful, as sketched below.
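
A minimal sketch of this filtering, using base R's adist() edit distance as a rough stand-in for semantic matching (the distance threshold is arbitrary and would need tuning):

# Keep tags found by both providers (exact matches in the voting table)
high_conf <- tag_srcs[votes == 2, tag]

# With no exact matches, pair each Gemini tag with its closest OpenAI tag
# by edit distance on the normalized form and keep only close pairs.
d <- adist(gem_styles$tag_norm, oai_styles$tag_norm)
closest <- apply(d, 1, which.min)
approx_pairs <- data.table(
  gemini = gem_styles$tag_norm,
  openai = oai_styles$tag_norm[closest],
  dist   = d[cbind(seq_len(nrow(d)), closest)]
)
approx_pairs[dist <= 10]  # arbitrary threshold; tune per use case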


Workflow 3: Benchmark Protocol

Systematic evaluation across providers with structured metrics:

run_benchmark <- function(img_path, providers = c("gemini", "openai")) {
  res <- list()

  for (prov in providers) {
    res[[prov]] <- list(
      styles = art_style_ai(img_path, provider = prov, temp = 0),
      description = art_about_ai(img_path, provider = prov, temp = 0.7),
      profile = art_profile_ai(
        "Test Art", "Test Artist", 10, 5000,
        img_path,
        provider = prov
      )
    )
  }

  res
}

benchmark_res <- run_benchmark("artwork.jpeg")

# Compare tag counts
cat("Gemini tags:", nrow(benchmark_res$gemini$styles), "\n")
Gemini tags: 7 
cat("OpenAI tags:", nrow(benchmark_res$openai$styles), "\n")
OpenAI tags: 10 
# Compare description lengths
cat("Gemini desc length:", nchar(benchmark_res$gemini$description), "\n")
Gemini desc length: 650 
cat("OpenAI desc length:", nchar(benchmark_res$openai$description), "\n")
OpenAI desc length: 739 
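
The same protocol scales to a whole set of images; a minimal sketch, assuming a hypothetical images/ directory of JPEGs:

# Benchmark every image in a (hypothetical) images/ directory
img_paths <- list.files("images", pattern = "\\.jpe?g$", full.names = TRUE)
all_benchmarks <- lapply(img_paths, run_benchmark)
names(all_benchmarks) <- basename(img_paths)

# Tag counts per provider (rows) and image (columns)
sapply(all_benchmarks, function(b) sapply(b, function(p) nrow(p$styles)))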

Workflow 4: Quality Metrics

Define and measure quality criteria:

evaluate_quality <- function(styles_dt, desc_text) {
  list(
    num_features = nrow(styles_dt),
    unique_tags = length(unique(styles_dt$tag)),
    has_descriptions = all(nchar(styles_dt$desc) > 10),
    desc_length = nchar(desc_text),
    has_tag_norm = "tag_norm" %in% names(styles_dt),
    no_duplicates = !any(duplicated(styles_dt$tag_norm))
  )
}

# Evaluate both providers
gem_quality <- evaluate_quality(
  benchmark_res$gemini$styles,
  benchmark_res$gemini$description
)

oai_quality <- evaluate_quality(
  benchmark_res$openai$styles,
  benchmark_res$openai$description
)

cat("=== QUALITY COMPARISON ===\n")
=== QUALITY COMPARISON ===
cat("Gemini features:", gem_quality$num_features, "\n")
Gemini features: 7 
cat("OpenAI features:", oai_quality$num_features, "\n")
OpenAI features: 10 
cat("Gemini desc length:", gem_quality$desc_length, "\n")
Gemini desc length: 650 
cat("OpenAI desc length:", oai_quality$desc_length, "\n")
OpenAI desc length: 739 
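
The two metric lists can be collapsed into a single comparison table; a minimal sketch using data.table::rbindlist():

# One row per provider, one column per metric
quality_dt <- rbindlist(
  list(gemini = gem_quality, openai = oai_quality),
  idcol = "provider"
)
quality_dt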

Framework: Agent-Based Evaluation (Future)

Forward-looking setup for automated evaluation systems.

Evaluation Protocol Structure

# Template for future agent-based evaluators
evaluation_protocol <- list(
  test_cases = list(
    list(
      id = "test-001",
      img_path = "artwork.jpeg",
      expected_tags = c("Realism", "Portrait"),
      providers = c("gemini", "openai")
    )
  ),
  metrics = c("accuracy", "consistency", "coverage", "quality"),
  evaluator = "future-ai-agent"
)

# Run protocol (demonstration)
run_eval_protocol <- function(protocol) {
  lapply(protocol$test_cases, function(test) {
    res <- list()
    for (prov in test$providers) {
      res[[prov]] <- art_style_ai(test$img_path, provider = prov, temp = 0)
    }

    list(
      test_id = test$id,
      results = res,
      evaluated = FALSE,
      scores = NA
    )
  })
}

protocol_res <- run_eval_protocol(evaluation_protocol)
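
Until an agent evaluator exists, a deterministic score against expected_tags can stand in for the accuracy metric. A minimal sketch using case-insensitive substring matching, a crude proxy for the semantic judgement a future agent would make:

# Fraction of expected tags that appear (case-insensitively, as substrings)
# among a provider's returned tags -- a crude recall-style score.
score_coverage <- function(returned_tags, expected_tags) {
  hits <- vapply(expected_tags, function(e) {
    any(grepl(e, returned_tags, ignore.case = TRUE))
  }, logical(1))
  mean(hits)
}

# Score every test case and provider in the protocol results
lapply(seq_along(protocol_res), function(i) {
  expected <- evaluation_protocol$test_cases[[i]]$expected_tags
  sapply(protocol_res[[i]]$results, function(dt) score_coverage(dt$tag, expected))
})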

Agent Evaluator Interface (Placeholder)

# Future implementation:
# evaluate_with_agent <- function(provider_results, ground_truth) {
#   # AI agent evaluates provider outputs against criteria
#   # Returns: accuracy_score, consistency_score, quality_score
# }

cat("Agent-based evaluation: To be implemented\n")
Agent-based evaluation: To be implemented
cat("Current: Manual comparison and voting\n")
Current: Manual comparison and voting
cat("Future: Automated agent scoring across all metrics\n")
Future: Automated agent scoring across all metrics

Evaluation Templates

Reusable functions for systematic evaluation:

# Compare providers on same prompt
compare_providers <- function(img_path, func, ...) {
  gem <- func(img_path, provider = "gemini", ...)
  oai <- func(img_path, provider = "openai", ...)
  list(gemini = gem, openai = oai)
}

# Example usage
comp <- compare_providers("artwork.jpeg", art_style_ai, temp = 0)
cat("Gemini:", nrow(comp$gemini), "tags\n")
Gemini: 7 tags
cat("OpenAI:", nrow(comp$openai), "tags\n")
OpenAI: 10 tags

Best Practices for Evaluation

Use Case           Provider   Model
Fast analysis      OpenAI     gpt-5.1-mini
Batch processing   Gemini     gemini-2.5-flash
Complex analysis   Gemini     gemini-2.5-pro

Task                 Temperature
Style extraction     0
Tag classification   0
Descriptions         0.7
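
One way to keep the defaults above in one place is a small lookup consulted at call time; a minimal sketch (the task names are illustrative, not part of the package API):

# Recommended temperature per task (illustrative names, not a package API)
rec_temp <- c(style = 0, classify = 0, describe = 0.7)

styles <- art_style_ai("artwork.jpeg", provider = "gemini", temp = rec_temp[["style"]])
desc   <- art_about_ai("artwork.jpeg", provider = "gemini", temp = rec_temp[["describe"]])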

Key Takeaways

  1. Systematic comparison - Use identical prompts and parameters across providers
  2. Ensemble voting - Combine results for higher confidence
  3. Quality metrics - Define objective measures (coverage, consistency, structure)
  4. Evaluation framework - Build foundation for future agent-based evaluators

Next: A formal agent-based evaluation system will automate these workflows and provide objective scoring.