Results for Paid LLM APIs

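This page captures the performance of models from several commercial LLM APIs. The original comparison covered OpenAI (GPT-3.5 Turbo, GPT-4, ...) and MistralAI (tiny, small, medium); the configuration below also includes models from Anthropic (Claude), Google (Gemini), and DeepSeek.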

There are many other providers, but OpenAI is the most commonly used. MistralAI's commercial API launched recently and has a very good relationship with the open-source community, so we've added it as a challenger to compare against OpenAI's cost effectiveness ("cost per point", i.e., how many cents you would pay for 1 point in this benchmark).

Reminder: The scores below are on a scale of 0-100, where 100 is the best possible score and 0 means the generated code was not even parseable.

# Imports
using JuliaLLMLeaderboard
using CairoMakie, AlgebraOfGraphics
using MarkdownTables, DataFramesMeta
using Statistics: mean, median, quantile, std;

# ! Configuration
SAVE_PLOTS = false
DIR_RESULTS = joinpath(pkgdir(JuliaLLMLeaderboard), "code_generation")
PAID_MODELS_DEFAULT = [
    "gpt-3.5-turbo",
    "gpt-3.5-turbo-1106",
    "gpt-3.5-turbo-0125",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
    "gpt-4-turbo-2024-04-09",
    "gpt-4o-2024-05-13",
    "mistral-tiny",
    "mistral-small",
    "mistral-medium",
    "mistral-large",
    "mistral-small-2402",
    "mistral-medium-2312",
    "mistral-large-2402",
    "claude-3-opus-20240229",
    "claude-3-sonnet-20240229",
    "claude-3-haiku-20240307",
    "claude-2.1",
    "gemini-1.0-pro-latest",
    "deepseek-chat",
    "deepseek-coder"
];
PROMPTS = [
    "JuliaExpertCoTTask",
    "JuliaExpertAsk",
    "InJulia",
    "JuliaRecapTask",
    "JuliaRecapCoTTask"
];

Load Latest Results

Use only the 10 most recent evaluations available for each definition/model/prompt

df = @chain begin
    load_evals(DIR_RESULTS; max_history = 10)
    @rsubset :model in PAID_MODELS_DEFAULT && :prompt_label in PROMPTS
end;
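
As a rough illustration of what keeping "the 10 most recent evaluations" per definition/model/prompt means, here is a minimal sketch in plain Julia. The helper `keep_latest`, the `timestamp` field, and the sample records are hypothetical; this is not the actual `load_evals` implementation, and it assumes the test case `name` plays the role of the "definition".

```julia
# Hypothetical sketch: keep the `max_history` most recent records per group.
# `keep_latest` and the `timestamp` field are illustrative assumptions,
# not part of JuliaLLMLeaderboard.
function keep_latest(records; max_history = 10)
    T = eltype(records)
    groups = Dict{Tuple{String, String, String}, Vector{T}}()
    for r in records
        key = (r.name, r.model, r.prompt_label)
        push!(get!(groups, key, T[]), r)
    end
    kept = T[]
    for rs in values(groups)
        # Newest first, then take at most `max_history` of them
        newest_first = sort(rs; by = r -> r.timestamp, rev = true)
        append!(kept, first(newest_first, max_history))
    end
    return kept
end

# 15 made-up runs of one test case / model / prompt combination
records = [(name = "wrap_string", model = "gpt-4o-2024-05-13",
               prompt_label = "InJulia", timestamp = t, score = 50.0)
           for t in 1:15]
latest = keep_latest(records; max_history = 10)
println(length(latest))
```

With 15 runs in a single group, only the 10 runs with the largest timestamps survive.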

Model Comparison

Highest average score by model:

fig = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    @orderby -:score
    @aside local order_ = _.model
    data(_) *
    mapping(:model => sorter(order_) => "Model",
        :score => "Avg. Score (Max 100 pts)") *
    visual(BarPlot; bar_labels = :y,
        label_offset = 0, label_rotation = 1)
    draw(;
        axis = (limits = (nothing, nothing, 0, 100),
            xticklabelrotation = 45,
            title = "Paid APIs Performance"))
end
SAVE_PLOTS && save("assets/model-comparison-paid.png", fig)
fig

Table:

output = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_std_deviation = std(:score)
        :count_zero_score = count(iszero, :score)
        :count_full_score = count(==(100), :score)
    end
    transform(_,
        [:elapsed, :score, :score_std_deviation] .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    @orderby -:score
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

Model	Elapsed	Score	Score Std Deviation	Count Zero Score	Count Full Score	Cost Cents
claude-3-opus-20240229	20.3	83.2	19.6	2	329	3.9
claude-3-sonnet-20240229	8.7	78.8	26.2	22	308	0.73
gpt-4-turbo-2024-04-09	10.8	75.3	29.6	38	290	1.38
claude-3-haiku-20240307	4.0	74.9	27.2	9	261	0.05
gpt-4-0125-preview	30.3	74.4	30.3	39	284	1.29
gpt-4-1106-preview	22.4	74.4	29.9	19	142	1.21
gpt-4o-2024-05-13	4.3	72.9	29.1	29	257	0.0
deepseek-coder	13.0	71.6	32.6	39	115	0.01
mistral-large-2402	8.5	71.6	27.2	13	223	0.0
deepseek-chat	17.9	71.3	32.9	30	140	0.01
claude-2.1	10.1	67.9	30.8	47	229	0.8
gpt-3.5-turbo-0125	1.2	61.7	36.6	125	192	0.03
mistral-medium	18.1	60.8	33.2	22	90	0.41
mistral-small	5.9	60.1	30.2	27	76	0.09
mistral-small-2402	5.3	59.9	29.4	31	169	0.0
gpt-3.5-turbo-1106	2.1	58.4	39.2	82	97	0.04
mistral-tiny	4.6	46.9	32.0	75	42	0.02
gpt-3.5-turbo	3.6	42.3	38.2	132	54	0.04
gemini-1.0-pro-latest	4.2	34.8	27.4	181	25	0.0

Claude 3 Opus has the highest average score in the table above, ahead of the GPT-4 variants. Note, however, that our sample size is small and the standard deviation is quite high.
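
To see why a high standard deviation is expected when many runs score either 0 or 100, the sketch below computes the same summary columns as the table on a small vector of scores. The scores are hypothetical values chosen for illustration, not benchmark data.

```julia
using Statistics: mean, median, std

# Hypothetical per-run scores for one model (illustrative values only)
scores = [0.0, 100.0, 100.0, 50.0, 100.0, 75.0, 0.0, 100.0]

stats = (
    score = mean(scores),
    score_median = median(scores),
    score_std_deviation = std(scores),
    count_zero_score = count(iszero, scores),
    count_full_score = count(==(100), scores)
)
println(stats)
```

Because the scores cluster at the two extremes, the standard deviation is large relative to the mean, and the median sits well above the mean.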

Overview by Prompt Template

Bar chart with all paid models and various prompt templates

fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @aside local average_ = @by(_, :model, :avg=mean(:score)) |>
                            x -> @orderby(x, -:avg).model
    data(_) *
    mapping(:model => sorter(average_) => "Model",
        :score => "Avg. Score (Max 100 pts)",
        color = :prompt_label => "Prompts",
        dodge = :prompt_label) * visual(BarPlot)
    draw(;
        figure = (; size = (900, 600)),
        axis = (xticklabelrotation = 45, title = "Comparison for Paid APIs"))
end
SAVE_PLOTS && save("assets/model-prompt-comparison-paid.png", fig)
fig

Table:

Surprised by the low performance of some models (e.g., GPT-3.5 Turbo) on the CoT prompts? It's because the model accidentally sends a "stop" token before it writes the code.

output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    @aside average_ = @by _ :model :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:model, :prompt_label, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :model)
    @orderby -:AverageScore
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

model	InJulia	JuliaExpertAsk	JuliaExpertCoTTask	JuliaRecapCoTTask	JuliaRecapTask	AverageScore
claude-3-opus-20240229	84.1	84.0	85.1	81.6	81.2	83.2
claude-3-sonnet-20240229	80.9	79.0	80.3	75.6	78.2	78.8
gpt-4-turbo-2024-04-09	76.5	78.5	75.6	73.4	72.3	75.3
claude-3-haiku-20240307	75.1	75.0	64.1	79.1	81.3	74.9
gpt-4-0125-preview	72.7	77.5	72.4	75.0	74.6	74.4
gpt-4-1106-preview	74.9	79.1	71.8	72.4	73.6	74.4
gpt-4o-2024-05-13	71.2	75.7	80.6	67.9	69.2	72.9
deepseek-coder	81.1	69.9	56.8	71.9	78.1	71.6
mistral-large-2402	67.9	71.1	71.0	74.2	73.6	71.6
deepseek-chat	76.4	56.4	75.3	72.5	75.5	71.2
claude-2.1	64.3	65.4	72.2	69.2	68.4	67.9
gpt-3.5-turbo-0125	73.0	74.7	64.7	29.3	66.9	61.7
mistral-medium	63.1	60.5	63.4	55.9	61.2	60.8
mistral-small	67.3	61.4	59.9	56.1	55.9	60.1
mistral-small-2402	61.7	63.0	62.1	56.6	55.9	59.9
gpt-3.5-turbo-1106	74.6	73.6	73.4	15.4	55.0	58.4
mistral-tiny	51.7	44.3	41.1	50.5	47.2	47.0
gpt-3.5-turbo	73.1	60.9	32.8	26.2	18.4	42.3
gemini-1.0-pro-latest	36.0	38.6	35.2	30.8	33.3	34.8
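
The AverageScore column here can differ slightly from the Score column in the model comparison table (for example, 47.0 vs 46.9 for mistral-tiny). One plausible explanation, based on the code above, is that AverageScore is a mean of the per-prompt means, while the earlier table averages all runs at once. The sketch below reproduces both numbers from the rounded per-prompt values and the Cnt column shown on this page, so it is an approximation rather than an exact recomputation.

```julia
using Statistics: mean

# Per-prompt averages and run counts for mistral-tiny, read from the tables on this page
# (order: InJulia, JuliaExpertAsk, JuliaExpertCoTTask, JuliaRecapCoTTask, JuliaRecapTask)
prompt_means = [51.7, 44.3, 41.1, 50.5, 47.2]
prompt_counts = [68, 70, 70, 70, 70]

# Unweighted mean of the per-prompt means (every prompt counts equally)
mean_of_means = mean(prompt_means)
# Pooled mean (every run counts equally), approximated from the rounded inputs
pooled_mean = sum(prompt_means .* prompt_counts) / sum(prompt_counts)

println(round(mean_of_means, digits = 1))  # 47.0
println(round(pooled_mean, digits = 1))    # 46.9
```

The two averages coincide whenever every prompt has the same number of runs.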

Other Considerations

Comparison of Cost vs Average Score

fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:cost => (x -> x * 100) => "Avg. Cost (US Cents/query)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(;
        axis = (xticklabelrotation = 45,
            title = "Cost vs Score for Paid APIs"))
end
SAVE_PLOTS && save("assets/cost-vs-score-scatter-paid.png", fig)
fig

Table:

Point per cent is the average score divided by the average cost in US cents. A value of Inf means the recorded average cost is zero.

output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_cent = :score_avg / :cost / 100
    @orderby -:point_per_cent
    #
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

Model	Prompt Label	Elapsed	Score Avg	Score Median	Cnt	Point Per Cent	Cost Cents
gemini-1.0-pro-latest	InJulia	4.1	36.0	25.0	140.0	Inf	0.0
gemini-1.0-pro-latest	JuliaExpertAsk	3.9	38.6	50.0	140.0	Inf	0.0
gemini-1.0-pro-latest	JuliaExpertCoTTask	4.0	35.2	25.0	140.0	Inf	0.0
gemini-1.0-pro-latest	JuliaRecapCoTTask	4.8	30.8	25.0	140.0	Inf	0.0
gemini-1.0-pro-latest	JuliaRecapTask	4.3	33.3	25.0	140.0	Inf	0.0
gpt-4o-2024-05-13	InJulia	4.2	71.2	85.0	140.0	Inf	0.0
gpt-4o-2024-05-13	JuliaExpertAsk	1.6	75.7	86.7	140.0	Inf	0.0
gpt-4o-2024-05-13	JuliaExpertCoTTask	4.3	80.6	90.0	140.0	Inf	0.0
gpt-4o-2024-05-13	JuliaRecapCoTTask	5.5	67.9	64.6	140.0	Inf	0.0
gpt-4o-2024-05-13	JuliaRecapTask	5.8	69.2	73.8	140.0	Inf	0.0
mistral-large-2402	InJulia	7.5	67.9	62.5	140.0	Inf	0.0
mistral-large-2402	JuliaExpertAsk	5.3	71.1	80.0	140.0	Inf	0.0
mistral-large-2402	JuliaExpertCoTTask	8.6	71.0	80.0	140.0	Inf	0.0
mistral-large-2402	JuliaRecapCoTTask	10.8	74.2	83.3	140.0	Inf	0.0
mistral-large-2402	JuliaRecapTask	10.5	73.6	90.0	140.0	Inf	0.0
mistral-small-2402	InJulia	4.4	61.7	50.0	140.0	Inf	0.0
mistral-small-2402	JuliaExpertAsk	3.6	63.0	61.2	140.0	Inf	0.0
mistral-small-2402	JuliaExpertCoTTask	4.6	62.1	61.9	140.0	Inf	0.0
mistral-small-2402	JuliaRecapCoTTask	8.2	56.6	50.0	140.0	Inf	0.0
mistral-small-2402	JuliaRecapTask	5.8	55.9	50.0	140.0	Inf	0.0
deepseek-chat	InJulia	18.3	76.4	87.1	70.0	8337.9	0.01
deepseek-coder	InJulia	14.5	81.1	86.7	70.0	8030.0	0.01
deepseek-coder	JuliaExpertAsk	13.1	69.9	83.3	70.0	7075.8	0.01
deepseek-chat	JuliaExpertCoTTask	18.3	75.3	90.0	75.0	6902.3	0.01
deepseek-chat	JuliaExpertAsk	17.1	56.4	75.0	70.0	6179.2	0.01
deepseek-chat	JuliaRecapTask	16.9	75.5	88.8	70.0	5938.2	0.01
deepseek-coder	JuliaRecapTask	12.6	78.1	83.3	70.0	5708.3	0.01
deepseek-coder	JuliaRecapCoTTask	12.7	71.9	90.0	70.0	5282.5	0.01
deepseek-chat	JuliaRecapCoTTask	18.9	72.5	75.0	70.0	5271.4	0.01
deepseek-coder	JuliaExpertCoTTask	12.0	56.8	67.5	70.0	5182.2	0.01
mistral-tiny	JuliaExpertAsk	2.4	44.3	50.0	70.0	4333.3	0.01
gpt-3.5-turbo-0125	JuliaExpertAsk	0.9	74.7	80.0	140.0	4119.0	0.02
mistral-tiny	InJulia	3.8	51.7	50.0	68.0	2869.4	0.02
gpt-3.5-turbo-1106	JuliaExpertAsk	1.6	73.6	80.0	70.0	2747.9	0.03
gpt-3.5-turbo-0125	InJulia	1.6	73.0	80.0	140.0	2276.8	0.03
gpt-3.5-turbo	JuliaExpertAsk	3.1	60.9	60.0	70.0	2177.5	0.03
gpt-3.5-turbo-0125	JuliaExpertCoTTask	1.2	64.7	82.3	140.0	2168.8	0.03
claude-3-haiku-20240307	JuliaExpertAsk	2.8	75.0	80.0	140.0	2084.8	0.04
mistral-tiny	JuliaExpertCoTTask	6.6	41.1	50.0	70.0	2040.5	0.02
mistral-tiny	JuliaRecapCoTTask	4.9	50.5	50.0	70.0	1957.1	0.03
gpt-3.5-turbo-0125	JuliaRecapTask	1.2	66.9	75.0	140.0	1916.1	0.03
gpt-3.5-turbo-1106	JuliaExpertCoTTask	1.9	73.4	95.0	69.0	1873.1	0.04
mistral-tiny	JuliaRecapTask	5.1	47.2	50.0	70.0	1783.8	0.03
gpt-3.5-turbo-1106	InJulia	2.9	74.6	83.3	70.0	1672.1	0.04
gpt-3.5-turbo	InJulia	5.0	73.1	67.5	70.0	1633.3	0.04
claude-3-haiku-20240307	JuliaRecapCoTTask	4.2	79.1	90.0	140.0	1349.7	0.06
claude-3-haiku-20240307	InJulia	4.2	75.1	85.8	140.0	1338.2	0.06
claude-3-haiku-20240307	JuliaRecapTask	4.4	81.3	95.0	140.0	1296.3	0.06
claude-3-haiku-20240307	JuliaExpertCoTTask	4.2	64.1	62.5	140.0	1110.5	0.06
mistral-small	JuliaExpertAsk	3.7	61.4	52.5	70.0	1078.7	0.06
gpt-3.5-turbo-1106	JuliaRecapTask	1.9	55.0	62.5	69.0	1028.1	0.05
gpt-3.5-turbo	JuliaExpertCoTTask	3.1	32.8	0.0	70.0	1010.4	0.03
mistral-small	InJulia	5.3	67.3	60.0	70.0	890.8	0.08
gpt-3.5-turbo-0125	JuliaRecapCoTTask	1.2	29.3	0.0	140.0	850.8	0.03
mistral-small	JuliaExpertCoTTask	5.3	59.9	55.0	70.0	706.0	0.08
gpt-3.5-turbo	JuliaRecapCoTTask	3.6	26.2	0.0	70.0	585.4	0.04
mistral-small	JuliaRecapCoTTask	7.6	56.1	57.5	70.0	460.0	0.12
mistral-small	JuliaRecapTask	7.7	55.9	55.0	70.0	436.2	0.13
gpt-3.5-turbo	JuliaRecapTask	3.4	18.4	0.0	70.0	423.5	0.04
gpt-3.5-turbo-1106	JuliaRecapCoTTask	2.0	15.4	0.0	70.0	274.4	0.06
mistral-medium	JuliaExpertAsk	12.3	60.5	55.0	70.0	230.3	0.26
mistral-medium	InJulia	14.8	63.1	60.0	70.0	187.6	0.34
gpt-4-0125-preview	JuliaExpertAsk	10.8	77.5	86.7	140.0	157.7	0.49
claude-3-sonnet-20240229	JuliaExpertAsk	6.3	79.0	90.0	140.0	149.7	0.53
mistral-medium	JuliaExpertCoTTask	20.0	63.4	62.5	70.0	146.8	0.43
claude-3-sonnet-20240229	JuliaExpertCoTTask	7.2	80.3	95.0	140.0	129.2	0.62
gpt-4-1106-preview	JuliaExpertAsk	10.9	79.1	90.8	70.0	125.2	0.63
mistral-medium	JuliaRecapTask	20.2	61.2	65.0	70.0	116.0	0.53
mistral-medium	JuliaRecapCoTTask	23.3	55.9	50.0	70.0	110.9	0.5
claude-2.1	InJulia	9.3	64.3	60.0	140.0	98.6	0.65
claude-3-sonnet-20240229	JuliaRecapCoTTask	9.4	75.6	87.5	140.0	98.5	0.77
claude-3-sonnet-20240229	InJulia	10.0	80.9	95.0	140.0	95.8	0.84
claude-2.1	JuliaExpertAsk	9.6	65.4	71.2	140.0	93.8	0.7
gpt-4-turbo-2024-04-09	JuliaExpertAsk	7.0	78.5	86.7	140.0	93.5	0.84
claude-3-sonnet-20240229	JuliaRecapTask	10.6	78.2	90.0	140.0	86.1	0.91
claude-2.1	JuliaExpertCoTTask	10.6	72.2	75.0	140.0	82.8	0.87
claude-2.1	JuliaRecapCoTTask	10.6	69.2	75.0	140.0	78.3	0.88
claude-2.1	JuliaRecapTask	10.6	68.4	75.0	140.0	76.3	0.9
gpt-4-1106-preview	JuliaExpertCoTTask	21.7	71.8	92.5	70.0	63.9	1.12
gpt-4-0125-preview	JuliaExpertCoTTask	28.5	72.4	95.0	140.0	60.2	1.2
gpt-4-1106-preview	InJulia	27.4	74.9	86.7	70.0	57.9	1.29
gpt-4-turbo-2024-04-09	JuliaExpertCoTTask	10.5	75.6	95.0	140.0	56.5	1.34
gpt-4-0125-preview	InJulia	34.4	72.7	86.7	140.0	52.2	1.39
gpt-4-turbo-2024-04-09	InJulia	13.0	76.5	86.7	140.0	51.4	1.49
gpt-4-1106-preview	JuliaRecapCoTTask	25.0	72.4	85.6	70.0	48.9	1.48
gpt-4-1106-preview	JuliaRecapTask	26.9	73.6	77.5	70.0	47.9	1.54
gpt-4-turbo-2024-04-09	JuliaRecapCoTTask	11.8	73.4	88.8	140.0	45.6	1.61
gpt-4-0125-preview	JuliaRecapCoTTask	37.2	75.0	90.0	140.0	44.6	1.68
gpt-4-turbo-2024-04-09	JuliaRecapTask	11.5	72.3	90.0	140.0	44.5	1.63
gpt-4-0125-preview	JuliaRecapTask	40.8	74.6	90.0	140.0	43.7	1.71
claude-3-opus-20240229	JuliaExpertAsk	17.4	84.0	90.0	140.0	24.6	3.41
claude-3-opus-20240229	JuliaExpertCoTTask	17.6	85.1	100.0	140.0	24.3	3.5
claude-3-opus-20240229	JuliaRecapCoTTask	21.7	81.6	88.8	140.0	20.9	3.9
claude-3-opus-20240229	JuliaRecapTask	22.8	81.2	90.0	140.0	19.3	4.21
claude-3-opus-20240229	InJulia	22.1	84.1	100.0	140.0	18.9	4.46
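
The Inf entries at the top of the table come from dividing by a recorded cost of exactly zero; the source does not say whether those calls were free or whether the cost was simply not recorded. A minimal sketch of one way to guard against that is shown below. The helper `points_per_cent` is hypothetical and not part of JuliaLLMLeaderboard.

```julia
# Points per US cent; returns `missing` instead of Inf when the recorded cost is zero.
# `points_per_cent` is a hypothetical helper used only for illustration.
function points_per_cent(score_avg, cost_usd)
    cost_cents = cost_usd * 100
    return iszero(cost_cents) ? missing : score_avg / cost_cents
end

# claude-3-opus-20240229 with JuliaExpertAsk: 84.0 points at 3.41 cents
opus = points_per_cent(84.0, 0.0341)
# A row with zero recorded cost
zero_cost = points_per_cent(75.7, 0.0)

println(round(opus, digits = 1))
println(zero_cost)
```

The "cost per point" mentioned in the introduction is simply the reciprocal of this value, i.e. cents divided by points.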

Comparison of Time-to-generate vs Average Score

fig = @chain df begin
    @aside local xlims = quantile(df.elapsed_seconds, [0.01, 0.99])
    @by [:model, :prompt_label] begin
        :elapsed = mean(:elapsed_seconds)
        :elapsed_median = median(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:elapsed => "Avg. Elapsed Time (s)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(; figure = (size = (600, 600),),
        axis = (xticklabelrotation = 45,
            title = "Elapsed Time vs Score for Paid APIs",
            limits = (xlims..., nothing, nothing)),
        palettes = (; color = Makie.ColorSchemes.tab20.colors))
end
SAVE_PLOTS && save("assets/elapsed-vs-score-scatter-paid.png", fig)
fig

Table:

Point per second is the average score divided by the average elapsed time in seconds.

output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_second = :score_avg / :elapsed
    @orderby -:point_per_second
    #
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

Model	Prompt Label	Elapsed	Score Avg	Score Median	Cnt	Point Per Second	Cost Cents
gpt-3.5-turbo-0125	JuliaExpertAsk	0.9	74.7	80.0	140.0	80.0	0.02
gpt-3.5-turbo-0125	JuliaRecapTask	1.2	66.9	75.0	140.0	57.1	0.03
gpt-3.5-turbo-0125	JuliaExpertCoTTask	1.2	64.7	82.3	140.0	52.3	0.03
gpt-4o-2024-05-13	JuliaExpertAsk	1.6	75.7	86.7	140.0	47.8	0.0
gpt-3.5-turbo-0125	InJulia	1.6	73.0	80.0	140.0	46.9	0.03
gpt-3.5-turbo-1106	JuliaExpertAsk	1.6	73.6	80.0	70.0	45.5	0.03
gpt-3.5-turbo-1106	JuliaExpertCoTTask	1.9	73.4	95.0	69.0	38.9	0.04
gpt-3.5-turbo-1106	JuliaRecapTask	1.9	55.0	62.5	69.0	29.2	0.05
claude-3-haiku-20240307	JuliaExpertAsk	2.8	75.0	80.0	140.0	26.4	0.04
gpt-3.5-turbo-1106	InJulia	2.9	74.6	83.3	70.0	25.8	0.04
gpt-3.5-turbo-0125	JuliaRecapCoTTask	1.2	29.3	0.0	140.0	25.4	0.03
gpt-3.5-turbo	JuliaExpertAsk	3.1	60.9	60.0	70.0	19.6	0.03
claude-3-haiku-20240307	JuliaRecapCoTTask	4.2	79.1	90.0	140.0	19.0	0.06
gpt-4o-2024-05-13	JuliaExpertCoTTask	4.3	80.6	90.0	140.0	18.7	0.0
mistral-tiny	JuliaExpertAsk	2.4	44.3	50.0	70.0	18.7	0.01
claude-3-haiku-20240307	JuliaRecapTask	4.4	81.3	95.0	140.0	18.4	0.06
claude-3-haiku-20240307	InJulia	4.2	75.1	85.8	140.0	17.8	0.06
mistral-small-2402	JuliaExpertAsk	3.6	63.0	61.2	140.0	17.6	0.0
gpt-4o-2024-05-13	InJulia	4.2	71.2	85.0	140.0	16.9	0.0
mistral-small	JuliaExpertAsk	3.7	61.4	52.5	70.0	16.5	0.06
claude-3-haiku-20240307	JuliaExpertCoTTask	4.2	64.1	62.5	140.0	15.4	0.06
gpt-3.5-turbo	InJulia	5.0	73.1	67.5	70.0	14.5	0.04
mistral-small-2402	InJulia	4.4	61.7	50.0	140.0	14.0	0.0
mistral-tiny	InJulia	3.8	51.7	50.0	68.0	13.6	0.02
mistral-large-2402	JuliaExpertAsk	5.3	71.1	80.0	140.0	13.5	0.0
mistral-small-2402	JuliaExpertCoTTask	4.6	62.1	61.9	140.0	13.4	0.0
mistral-small	InJulia	5.3	67.3	60.0	70.0	12.7	0.08
claude-3-sonnet-20240229	JuliaExpertAsk	6.3	79.0	90.0	140.0	12.5	0.53
gpt-4o-2024-05-13	JuliaRecapCoTTask	5.5	67.9	64.6	140.0	12.4	0.0
gpt-4o-2024-05-13	JuliaRecapTask	5.8	69.2	73.8	140.0	11.9	0.0
mistral-small	JuliaExpertCoTTask	5.3	59.9	55.0	70.0	11.4	0.08
gpt-4-turbo-2024-04-09	JuliaExpertAsk	7.0	78.5	86.7	140.0	11.2	0.84
claude-3-sonnet-20240229	JuliaExpertCoTTask	7.2	80.3	95.0	140.0	11.2	0.62
gpt-3.5-turbo	JuliaExpertCoTTask	3.1	32.8	0.0	70.0	10.5	0.03
mistral-tiny	JuliaRecapCoTTask	4.9	50.5	50.0	70.0	10.3	0.03
gemini-1.0-pro-latest	JuliaExpertAsk	3.9	38.6	50.0	140.0	10.0	0.0
mistral-small-2402	JuliaRecapTask	5.8	55.9	50.0	140.0	9.6	0.0
mistral-tiny	JuliaRecapTask	5.1	47.2	50.0	70.0	9.3	0.03
mistral-large-2402	InJulia	7.5	67.9	62.5	140.0	9.1	0.0
gemini-1.0-pro-latest	JuliaExpertCoTTask	4.0	35.2	25.0	140.0	8.8	0.0
gemini-1.0-pro-latest	InJulia	4.1	36.0	25.0	140.0	8.7	0.0
mistral-large-2402	JuliaExpertCoTTask	8.6	71.0	80.0	140.0	8.2	0.0
claude-3-sonnet-20240229	InJulia	10.0	80.9	95.0	140.0	8.1	0.84
claude-3-sonnet-20240229	JuliaRecapCoTTask	9.4	75.6	87.5	140.0	8.0	0.77
gemini-1.0-pro-latest	JuliaRecapTask	4.3	33.3	25.0	140.0	7.6	0.0
gpt-3.5-turbo-1106	JuliaRecapCoTTask	2.0	15.4	0.0	70.0	7.6	0.06
claude-3-sonnet-20240229	JuliaRecapTask	10.6	78.2	90.0	140.0	7.4	0.91
mistral-small	JuliaRecapCoTTask	7.6	56.1	57.5	70.0	7.4	0.12
gpt-3.5-turbo	JuliaRecapCoTTask	3.6	26.2	0.0	70.0	7.4	0.04
gpt-4-1106-preview	JuliaExpertAsk	10.9	79.1	90.8	70.0	7.2	0.63
mistral-small	JuliaRecapTask	7.7	55.9	55.0	70.0	7.2	0.13
gpt-4-0125-preview	JuliaExpertAsk	10.8	77.5	86.7	140.0	7.2	0.49
gpt-4-turbo-2024-04-09	JuliaExpertCoTTask	10.5	75.6	95.0	140.0	7.2	1.34
mistral-large-2402	JuliaRecapTask	10.5	73.6	90.0	140.0	7.0	0.0
claude-2.1	InJulia	9.3	64.3	60.0	140.0	6.9	0.65
mistral-small-2402	JuliaRecapCoTTask	8.2	56.6	50.0	140.0	6.9	0.0
mistral-large-2402	JuliaRecapCoTTask	10.8	74.2	83.3	140.0	6.9	0.0
claude-2.1	JuliaExpertAsk	9.6	65.4	71.2	140.0	6.8	0.7
claude-2.1	JuliaExpertCoTTask	10.6	72.2	75.0	140.0	6.8	0.87
claude-2.1	JuliaRecapCoTTask	10.6	69.2	75.0	140.0	6.6	0.88
claude-2.1	JuliaRecapTask	10.6	68.4	75.0	140.0	6.4	0.9
gemini-1.0-pro-latest	JuliaRecapCoTTask	4.8	30.8	25.0	140.0	6.4	0.0
gpt-4-turbo-2024-04-09	JuliaRecapTask	11.5	72.3	90.0	140.0	6.3	1.63
gpt-4-turbo-2024-04-09	JuliaRecapCoTTask	11.8	73.4	88.8	140.0	6.2	1.61
mistral-tiny	JuliaExpertCoTTask	6.6	41.1	50.0	70.0	6.2	0.02
deepseek-coder	JuliaRecapTask	12.6	78.1	83.3	70.0	6.2	0.01
gpt-4-turbo-2024-04-09	InJulia	13.0	76.5	86.7	140.0	5.9	1.49
deepseek-coder	JuliaRecapCoTTask	12.7	71.9	90.0	70.0	5.6	0.01
deepseek-coder	InJulia	14.5	81.1	86.7	70.0	5.6	0.01
gpt-3.5-turbo	JuliaRecapTask	3.4	18.4	0.0	70.0	5.4	0.04
deepseek-coder	JuliaExpertAsk	13.1	69.9	83.3	70.0	5.3	0.01
mistral-medium	JuliaExpertAsk	12.3	60.5	55.0	70.0	4.9	0.26
claude-3-opus-20240229	JuliaExpertCoTTask	17.6	85.1	100.0	140.0	4.8	3.5
claude-3-opus-20240229	JuliaExpertAsk	17.4	84.0	90.0	140.0	4.8	3.41
deepseek-coder	JuliaExpertCoTTask	12.0	56.8	67.5	70.0	4.7	0.01
deepseek-chat	JuliaRecapTask	16.9	75.5	88.8	70.0	4.5	0.01
mistral-medium	InJulia	14.8	63.1	60.0	70.0	4.3	0.34
deepseek-chat	InJulia	18.3	76.4	87.1	70.0	4.2	0.01
deepseek-chat	JuliaExpertCoTTask	18.3	75.3	90.0	75.0	4.1	0.01
deepseek-chat	JuliaRecapCoTTask	18.9	72.5	75.0	70.0	3.8	0.01
claude-3-opus-20240229	InJulia	22.1	84.1	100.0	140.0	3.8	4.46
claude-3-opus-20240229	JuliaRecapCoTTask	21.7	81.6	88.8	140.0	3.8	3.9
claude-3-opus-20240229	JuliaRecapTask	22.8	81.2	90.0	140.0	3.6	4.21
gpt-4-1106-preview	JuliaExpertCoTTask	21.7	71.8	92.5	70.0	3.3	1.12
deepseek-chat	JuliaExpertAsk	17.1	56.4	75.0	70.0	3.3	0.01
mistral-medium	JuliaExpertCoTTask	20.0	63.4	62.5	70.0	3.2	0.43
mistral-medium	JuliaRecapTask	20.2	61.2	65.0	70.0	3.0	0.53
gpt-4-1106-preview	JuliaRecapCoTTask	25.0	72.4	85.6	70.0	2.9	1.48
gpt-4-1106-preview	JuliaRecapTask	26.9	73.6	77.5	70.0	2.7	1.54
gpt-4-1106-preview	InJulia	27.4	74.9	86.7	70.0	2.7	1.29
gpt-4-0125-preview	JuliaExpertCoTTask	28.5	72.4	95.0	140.0	2.5	1.2
mistral-medium	JuliaRecapCoTTask	23.3	55.9	50.0	70.0	2.4	0.5
gpt-4-0125-preview	InJulia	34.4	72.7	86.7	140.0	2.1	1.39
gpt-4-0125-preview	JuliaRecapCoTTask	37.2	75.0	90.0	140.0	2.0	1.68
gpt-4-0125-preview	JuliaRecapTask	40.8	74.6	90.0	140.0	1.8	1.71
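
Note that the ratios in this table are computed from the unrounded averages and only rounded afterwards, so dividing the rounded Score Avg by the rounded Elapsed column will not always reproduce them (the first row shows 80.0, while 74.7 / 0.9 is 83.0). The sketch below illustrates the effect; the unrounded inputs are hypothetical values chosen to be consistent with that row.

```julia
# Hypothetical unrounded averages, consistent with the first table row
elapsed_raw = 0.934   # mean elapsed time in seconds (assumed value)
score_raw = 74.7      # mean score (assumed value)

# What the pipeline does: divide first, round the result
ratio_then_round = round(score_raw / elapsed_raw, digits = 1)

# What a reader gets by dividing the already-rounded columns
round_then_ratio = round(
    round(score_raw, digits = 1) / round(elapsed_raw, digits = 1), digits = 1)

println(ratio_then_round)
println(round_then_ratio)
```

The gap is largest for very fast models, where rounding the elapsed time to one decimal place changes it by several percent.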

Test Case Performance

Performance of different models across each test case

output = @chain df begin
    @by [:model, :name] begin
        :score = mean(:score)
    end
    #
    @aside average_ = @by _ :name :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:name, :model, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :name)
    @orderby -:AverageScore
end
markdown_table(output)

name	claude-2.1	claude-3-haiku-20240307	claude-3-opus-20240229	claude-3-sonnet-20240229	deepseek-chat	deepseek-coder	gemini-1.0-pro-latest	gpt-3.5-turbo	gpt-3.5-turbo-0125	gpt-3.5-turbo-1106	gpt-4-0125-preview	gpt-4-1106-preview	gpt-4-turbo-2024-04-09	gpt-4o-2024-05-13	mistral-large-2402	mistral-medium	mistral-small	mistral-small-2402	mistral-tiny	AverageScore
FloatWithUnits	62.0	98.0	100.0	100.0	100.0	100.0	57.0	76.0	91.5	80.0	60.5	72.0	78.5	93.5	99.5	98.0	70.0	100.0	80.2	85.1
timezone_bumper	82.1	98.1	99.7	95.5	100.0	100.0	39.9	48.0	77.4	79.2	90.0	90.0	94.8	95.0	96.4	97.0	76.6	78.1	62.0	84.2
clean_column	100.0	89.8	100.0	96.4	78.4	71.2	41.5	35.5	66.7	69.8	88.8	90.5	90.0	89.3	91.6	81.0	84.6	99.7	80.8	81.3
keeponlynames	90.1	65.0	85.3	94.9	88.4	74.4	54.0	50.8	80.6	74.2	90.9	91.0	86.2	77.5	98.7	66.2	76.6	67.9	51.0	77.0
wrap_string	93.8	77.2	64.5	70.2	81.7	82.5	32.6	64.0	50.1	55.3	94.9	97.8	94.6	97.0	71.9	84.7	68.0	68.6	48.3	73.6
countmodelrows	58.0	82.6	98.8	94.8	67.2	60.7	36.6	52.8	75.7	56.2	97.4	98.4	89.3	89.0	78.6	79.0	67.2	61.7	53.2	73.5
weatherdataanalyzer	74.1	93.3	86.8	86.8	93.0	83.8	26.5	35.2	64.2	59.0	85.4	85.0	81.0	67.4	86.0	85.4	55.4	52.6	56.8	71.5
add_yearmonth	53.8	86.2	92.0	81.0	71.2	62.5	35.8	33.0	67.6	65.2	78.6	72.8	75.9	68.0	72.2	48.0	62.2	40.2	33.2	63.1
event_scheduler	86.5	76.6	90.2	77.2	76.0	82.4	37.8	29.0	44.4	42.8	87.9	66.6	82.5	73.8	57.3	36.0	59.0	38.7	37.2	62.2
ispersonal	52.0	69.0	54.0	72.0	61.0	84.0	16.0	43.0	72.0	68.6	54.3	56.0	66.5	62.0	67.2	35.0	48.0	48.0	29.5	55.7
audi_filter	38.0	56.0	93.0	63.8	47.0	57.8	28.1	27.0	55.0	58.0	47.5	58.0	49.0	56.2	58.0	43.0	48.5	44.8	27.0	50.3
extractjuliacode	56.4	60.4	65.4	48.2	41.3	48.6	36.4	41.0	43.6	48.4	54.5	48.7	56.1	52.5	44.1	31.8	52.2	50.4	30.1	47.9
qanda_extractor	73.5	62.3	68.0	65.5	43.3	26.7	26.2	31.7	35.5	36.7	56.7	53.3	49.3	45.3	46.8	38.7	44.7	55.8	36.0	47.2
pig_latinify	30.6	34.6	67.1	57.0	49.0	67.1	18.7	24.7	39.8	23.1	54.7	61.4	60.1	54.2	33.6	27.8	28.8	31.6	33.1	42.0
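
The table above is a long-to-wide pivot with `fill = 0.0`, which means a model with no recorded runs for a test case would show 0.0, the same as a model that failed it. The plain-Julia sketch below mimics that behavior on made-up data; the model names and scores are hypothetical.

```julia
# Long-format records (hypothetical data): one row per test case / model pair
long = [(name = "wrap_string", model = "model_a", score = 90.0),
        (name = "wrap_string", model = "model_b", score = 70.0),
        (name = "pig_latinify", model = "model_a", score = 40.0)]

models = sort(unique(r.model for r in long))
names_ = unique(r.name for r in long)

# Start every cell at the fill value, then overwrite the cells that have data
wide = Dict(n => Dict(m => 0.0 for m in models) for n in names_)
for r in long
    wide[r.name][r.model] = r.score
end

# model_b has no pig_latinify run, so its cell keeps the fill value
println(wide["pig_latinify"]["model_b"])
```

When reading a 0.0 in a pivoted table like this, it is worth checking the underlying counts before concluding that the model failed the test case.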

This page was generated using Literate.jl.