William Arnold committed
Commit d056c4a · Parent(s): 34898cb

Add AUC, ROC, etc

Files changed:
- README.md +4 -3
- requirements.txt +2 -0
- src/rbeval/dash.py +18 -26
- src/rbeval/plot/data.py +2 -2
- src/rbeval/plot/model_comp.py +7 -7
- src/rbeval/plot/score_cdf.py +132 -42
- src/rbeval/plot/utils.py +19 -13
README.md
CHANGED
@@ -17,14 +17,14 @@ This dashboard is best viewed at [the huggingface space](https://huggingface.co/
 
 LLM MCQA (multiple choice question-answering) benchmarks are measured in the following way:
 1. Some number of few shot examples are pulled from the validation set of the MCQA benchmark and formatted as
-   > **
+   > **Question**: What is the capital of France? \
    > (A) Paris \
    > (B) London \
    > (C) Berlin \
    > (D) Madrid \
    > **Answer**: A
 2. The target question is then appended, without the answer, and fed into the model as
-   > **
+   > **Question**: What is the capital of France? \
    > (A) Paris \
    > (B) London \
    > (C) Berlin \
@@ -62,6 +62,7 @@ Here, $\Delta$ is a measure of how much more confident the model is in the correct answer
 
 An ideal model would have $\Phi = 1$ (and therefore $\Delta = 1$) always, while a model that performs random guessing would have $p_i = \Phi = 0.25$ (and therefore $\Delta = 0$) always.
 
+<!---
 ### Reading $\Phi$ plots
 Let's look at an example: MMLU on Llama-7b and Guanaco-7b, an early example of instruction tuning, in the 5-shot setting.
 
@@ -81,7 +82,7 @@ Again, the <span style="color:lightblue">**blue line is Llama-7b**</span> and the
 * The 'accuracy' as we defined it earlier is the percentage of samples with $\Delta > 0$. We can see this as the intersection of the curves with the vertical line at $\Delta = 0$. We can see that while instruction tuning doesn't seem to have changed the accuracy significantly, it has *vastly* altered the distribution of $\Delta$ values.
 * Guanaco-7b has a higher percentage of samples with large $\Delta$ values than Llama-7b. For example, in ~12-13% of the samples, Guanaco-7b predicts the correct answer with a probability at least 0.2 greater than the most confident incorrect answer.
 * Guanaco-7b also has a higher percentage of samples with very low $\Delta$ values. For example, we can read that ~75% of the samples have $\Delta > -0.2$, meaning that ~25% have $\Delta \leq -0.2$. That is, Guanaco-7b predicts the wrong answer with a probability at least 0.2 greater than the correct answer in ~25% of the samples, while Llama-7b only does so in ~6-7% of the samples.
-
+-->
 
 ## How to use this notebook
 
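For reference, the $\Phi$ and $\Delta$ definitions quoted in the README hunks above reduce to a few lines of NumPy. A minimal illustrative sketch, not part of the commit, with made-up per-choice probabilities (it mirrors the gap computation in `CorrIncorrDiffConfig.get_cdf` below):

```python
import numpy as np

# Made-up probabilities for one question's four choices; index 0 is the correct answer.
probs = np.array([0.50, 0.20, 0.15, 0.05])

# Phi: probability assigned to the correct answer, renormalized over the choices.
phi = probs[0] / probs.sum()

# Delta: how much more confident the model is in the correct answer than in the
# most confident incorrect answer; the sample counts toward accuracy when delta > 0.
delta = phi - (probs[1:] / probs.sum()).max()

print(f"phi={phi:.3f}, delta={delta:.3f}")  # phi=0.556, delta=0.333
```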
requirements.txt
CHANGED
@@ -5,3 +5,5 @@ tqdm>=4.66.4
 numpy>=1.26.4
 dacite>=1.8.1
 seaborn>=0.13.1
+polars>=1.5.0
+scikit-learn>=1.5.1
src/rbeval/dash.py
CHANGED
@@ -1,6 +1,7 @@
 from dataclasses import asdict
 from pathlib import Path
 from typing import List, Optional
+import pandas as pd
 import streamlit as st
 import argparse
 from dacite import from_dict
@@ -9,7 +10,6 @@ from rbeval.plot.dash_utils import markdown_insert_images
 from rbeval.plot.data import EvalGroup, get_samples
 from rbeval.plot.score_cdf import (
     CdfPlotConfig,
-    PlotData,
     plot_with_data,
     get_plot_data,
     plot_cfgs,
@@ -29,7 +29,7 @@ def cached_samples(dir: Path, name_filter: Optional[str]) -> List[EvalGroup]:
 @st.cache_data
 def cached_score_cdf(
     dir: Path, name_filter: Optional[str]
-) -> tuple[List[
+) -> tuple[List[pd.DataFrame], List[CdfPlotConfig]]:
     samples = cached_samples(dir, name_filter)
     cfgs = plot_cfgs()
     data = [get_plot_data(cfg, samples) for cfg in cfgs]
@@ -48,20 +48,6 @@ def cache_compare(
     return grouped_dict, base_name, comp_name
 
 
-def filter_for_group(data: List[PlotData], group: str) -> List[PlotData]:
-    return [
-        PlotData(
-            renorm=[df for df in d.renorm if df["group"].iloc[0] == group],
-            norenorm=[df for df in d.norenorm if df["group"].iloc[0] == group],
-        )
-        for d in data
-    ]
-
-
-def get_group_names(data: List[PlotData]) -> List[str]:
-    return sorted(set([df["group"].iloc[0] for d in data for df in d.renorm]))
-
-
 def main():
     parser = argparse.ArgumentParser(description="rbeval dashboard")
     parser.add_argument("--evals", type=str, default="./lmo-fake", required=False)
@@ -77,26 +63,32 @@ def main():
     st.markdown(markdown_insert_images(markdown), unsafe_allow_html=True)
 
     score_cdf_data, cfgs = cached_score_cdf(eval_dir, None)
-
+    assert len(score_cdf_data) > 0, "No score cdfs found"
+    group_names: List[str] = sorted(
+        score_cdf_data[0]["group"].unique().tolist(), reverse=True
+    )
 
     st.markdown("""
-        Below is a toggle which renormalizes multiple choice answer probabilities to sum to 1.
+        Below is a toggle which renormalizes the multiple choice answer probabilities to sum to 1.
         For more performant models (anything after Llama 1) or in higher fewshot scenarios, this doesn't impact the results very much.
     """)
 
     renormed = st.toggle("Renormalize Probabilities", True)
+    fs_names = [str(i) + "-shot" for i in range(0, 5 + 1)]
+    fs_filt_sel = st.multiselect("Fewshot Filter", fs_names, default=fs_names)
+    fs_filt = [int(i.split("-")[0]) for i in fs_filt_sel]
 
     st.subheader("Model Performance Curves")
     for group in group_names:
-        group_data = filter_for_group(score_cdf_data, group)
         with st.expander(group):
-
-
-
-
-
-
-
+            for cfg, df in zip(cfgs, score_cdf_data):
+                group_data = df[
+                    (df["group"] == group)
+                    & (df["renorm"] == renormed)
+                    & (df["fewshot"].isin(fs_filt))
+                ]
+                for fig in plot_with_data(cfg, group_data):
+                    st.altair_chart(fig.chart, use_container_width=True)  # type: ignore
 
     model_names = set(
         [
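For context on the dashboard change above: plot data now arrives as plain long-format DataFrames, and each expander slices them with boolean masks instead of the removed `filter_for_group` helper. A self-contained sketch of that filtering pattern with toy values, not part of the commit (real frames come from `get_plot_data`):

```python
import pandas as pd

# Toy stand-in for one entry of score_cdf_data.
df = pd.DataFrame({
    "group": ["mmlu", "mmlu", "arc"],
    "renorm": [True, False, True],
    "fewshot": [5, 5, 0],
    "label": ["llama-7b"] * 3,
})

renormed = True
fs_filt = [0, 5]
# Same mask shape as in main(): group expander, renorm toggle, fewshot multiselect.
group_data = df[
    (df["group"] == "mmlu") & (df["renorm"] == renormed) & (df["fewshot"].isin(fs_filt))
]
print(group_data)  # keeps only the renormalized 5-shot mmlu row
```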
src/rbeval/plot/data.py
CHANGED
@@ -59,8 +59,8 @@ def get_samples(inp: Path, name_filter: Optional[str]) -> List["EvalGroup"]:
             inc_logprobs.append(probs)
         eval = Eval(
             name=samples_file.stem,
-            cor_logprobs=np.array(cor_logprobs),
-            inc_logprobs=np.array(inc_logprobs),
+            cor_logprobs=np.array(cor_logprobs, dtype=np.float64),
+            inc_logprobs=np.array(inc_logprobs, dtype=np.float64),
         )
         model_eval.evals.append(eval)
         np.save(str(model_eval_cache_file), asdict(model_eval))  # type: ignore
src/rbeval/plot/model_comp.py
CHANGED
@@ -11,7 +11,7 @@ import numpy as np
 
 from rbeval.eval_spec import EvalSpec
 from rbeval.plot.data import EvalGroup, Figure, ModelEval
-from rbeval.plot.utils import
+from rbeval.plot.utils import PlotData, renormed
 from typing import Any
 
 
@@ -135,23 +135,23 @@ def plot_diff_cdf(grouped: Dict[str, List[Scores]]) -> alt.HConcatChart:
     diff_cdf_data: List[pd.DataFrame] = []
     corr_cdf_data: List[pd.DataFrame] = []
     for score in score_list:
-        diff_cdf =
+        diff_cdf = PlotData.perf_curve_from_samples(score.cor_minus_inc_samples)
         diff_cdf_data.append(
             pd.DataFrame(
                 {
-                    "p": diff_cdf.
-                    "1-CDF(p)": diff_cdf.
+                    "p": diff_cdf.x,
+                    "1-CDF(p)": diff_cdf.y,
                     "fewshot": score.spec.fewshot,
                     "model": score.spec.model_name,
                 }
             )
         )
-        corr_cdf =
+        corr_cdf = PlotData.perf_curve_from_samples(score.cor_samples)
         corr_cdf_data.append(
             pd.DataFrame(
                 {
-                    "p": corr_cdf.
-                    "1-CDF(p)": corr_cdf.
+                    "p": corr_cdf.x,
+                    "1-CDF(p)": corr_cdf.y,
                     "fewshot": score.spec.fewshot,
                     "model": score.spec.model_name,
                 }
src/rbeval/plot/score_cdf.py
CHANGED
@@ -1,5 +1,4 @@
-from
-from typing import List, Optional
+from typing import List, Literal, Optional
 
 from numpy._typing import NDArray
 from rbeval.plot.data import Eval, EvalGroup, Figure
@@ -7,83 +6,81 @@ from abc import ABC, abstractmethod
 import numpy as np
 import altair as alt
 import pandas as pd
+from sklearn.metrics import roc_curve, roc_auc_score  # type: ignore
 
-from rbeval.plot.utils import
-
-
-@dataclass
-class PlotData:
-    renorm: List[pd.DataFrame] = field(default_factory=list)
-    norenorm: List[pd.DataFrame] = field(default_factory=list)
+from rbeval.plot.utils import PlotData, renormed
 
 
 def plot_cfgs():
-    return [
+    return [
+        CorrectProbCdfPlot(),
+        CorrIncorrDiffConfig(),
+        ROCCurve(),
+        MaxIncorProbCdfPlot(),
+        AccVsLoss(),
+        AccVsAUC(),
+    ]
 
 
 def score_cdf(samples: List[EvalGroup], args: List[str]) -> List[Figure]:
     return [
         a
         for cfg in plot_cfgs()
-        for
-        for a in plot_with_data(cfg, get_plot_data(cfg, samples), renorm)
+        for a in plot_with_data(cfg, get_plot_data(cfg, samples))
     ]
 
 
 def get_plot_data(
     cfg: "CdfPlotConfig",
     samples: List[EvalGroup],
-) ->
-
+) -> pd.DataFrame:
+    records = []
     for renorm in [True, False]:
-        gfs = data.renorm if renorm else data.norenorm
         for group in samples:
-            dfs: List[pd.DataFrame] = []
             for m in group.model_evals:
                 spec = m.eval_spec
                 cdf = cfg.get_cdf(m.evals, renorm)
-
+                records.append(
                     {
-                        "x": cdf.
-                        "y": cdf.
+                        "x": cdf.x,
+                        "y": cdf.y,
                         "label": m.model_name,
                         "group": group.name,
                         "renorm": renorm,
                         "fewshot": spec.fewshot,
                     }
                 )
-
-            gfs.append(pd.concat(dfs))
+    data = pd.DataFrame.from_records(records)
     return data
 
 
 def plot_with_data(
     cfg: "CdfPlotConfig",
-    data:
-    renorm: bool = True,
+    data: pd.DataFrame,
 ) -> List[Figure]:
     figures: List[Figure] = []
-
-
-
+    for (group_name, renorm), df in data.groupby(["group", "renorm"]):
+        assert isinstance(group_name, str)
+        assert isinstance(renorm, (bool, np.bool_))
         label_selection = alt.selection_point(fields=["label"], bind="legend")  # type: ignore
         fs_selection = alt.selection_point(fields=["fewshot"], bind="legend")  # type: ignore
+        chart = alt.Chart(df.explode(["x", "y"]))  # type: ignore
+        chart = chart.mark_line() if cfg.type == "line" else chart.mark_point()
        chart = (
-
-
-
-            x=alt.X("x:Q", title=cfg.xlabel),
-            y=alt.Y("y:Q", title=cfg.ylabel),
+            chart.encode(
+                x=alt.X("x:Q", title=cfg.xlabel, scale=alt.Scale(zero=False)),
+                y=alt.Y("y:Q", title=cfg.ylabel, scale=alt.Scale(zero=False)),
                 color=alt.Color(
                     "label:N", legend=alt.Legend(symbolOpacity=1.0, labelLimit=1000)
-                ).scale(scheme="
+                ).scale(scheme="dark2"),
+                shape="label:N" if cfg.type == "scatter" else alt.Undefined,
                 opacity=alt.condition(  # type: ignore
                     label_selection & fs_selection,
                     alt.Opacity("fewshot:O"),
                     alt.value(0.0),  # type: ignore
                 ),
            )
-            .properties(title=cfg.title(group_name, renorm))
+            .properties(title=cfg.title(group_name, renorm))  # type: ignore
            .add_params(fs_selection, label_selection)
            .interactive()
        )
@@ -102,9 +99,10 @@ class CdfPlotConfig(ABC):
     ylabel: str
     name: str = ""
     xline: Optional[float] = None
+    type: Literal["line", "scatter"] = "line"
 
     @abstractmethod
-    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "
+    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "PlotData":
         pass
 
     def title(self, group_name: str, prob_renorm: bool) -> str:
@@ -119,25 +117,74 @@ class CdfPlotConfig(ABC):
 
 
 class CorrectProbCdfPlot(CdfPlotConfig):
-    name = "𝚽
+    name = "CDF(𝚽)"
     xlabel = "𝚽"
-    ylabel = "% of correct answers with 𝚽
+    ylabel = "% of correct answers with 𝚽 < x"
     xline = 0.25
 
-    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "
+    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "PlotData":
         samples = [np.exp(e.cor_logprobs) for e in evals]
         if prob_renorm:
             samples = [renormed(e)[0] for e in evals]
-        return
+        return PlotData.perf_curve_from_samples(samples)
+
+
+class MaxIncorProbCdfPlot(CdfPlotConfig):
+    name = "CDF(Max(Incorrect))"
+    xlabel = "max(incorrect)"
+    ylabel = "% of correct answers with max(incorrect) < x"
+    xline = 0.25
+
+    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "PlotData":
+        if prob_renorm:
+            samples = [renormed(e)[1].max(axis=1) for e in evals]
+        else:
+            samples = [np.exp(np.max(e.inc_logprobs, axis=1)) for e in evals]
+        return PlotData.perf_curve_from_samples(samples)
+
+
+class AccVsLoss(CdfPlotConfig):
+    name = "Cross Entropy Loss vs Accuracy"
+    xlabel = "Accuracy"
+    ylabel = "CE Loss"
+    xline = None
+    type = "scatter"
+
+    def get_cdf(self, evals: List[Eval], _prob_renorm: bool) -> "PlotData":
+        cor, incor = zip(*[renormed(e) for e in evals])
+        cor = np.concatenate(cor)
+        incor = np.concatenate(incor).max(axis=1)
+        pct_corr = np.mean(cor > incor)
+
+        celoss = np.mean(-np.log(cor))
+        # PlotData's fields are (y, x): y is the CE loss, x the accuracy.
+        return PlotData(np.array([celoss]), np.array([pct_corr]))
+
+
+class AccVsAUC(CdfPlotConfig):
+    name = "Simulated AUROC vs Accuracy"
+    xlabel = "Accuracy"
+    ylabel = "Simulated AUROC"
+    xline = None
+    type = "scatter"
+
+    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "PlotData":
+        cor, incor = zip(*[renormed(e) for e in evals])
+        cor = np.concatenate(cor)
+        incor = np.concatenate(incor).max(axis=1)
+        pct_corr = np.mean(cor > incor)
+
+        scores, labels, weights = roc_data(evals, prob_renorm)
+        auc = roc_auc_score(labels, scores, sample_weight=weights)
+        return PlotData(np.array([auc]), np.array([pct_corr]))
 
 
 class CorrIncorrDiffConfig(CdfPlotConfig):
-    name = "𝚫
+    name = "CDF(𝚫)"
     xline = 0.0
     xlabel = "𝚫"
-    ylabel = "% of samples with 𝚫
+    ylabel = "% of samples with 𝚫 < x"
 
-    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "
+    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "PlotData":
         score_arrs: List[NDArray[np.float64]] = []
         for e in evals:
             if prob_renorm:
@@ -148,4 +195,47 @@ class CorrIncorrDiffConfig(CdfPlotConfig):
 
             score_arrs.append(cor_probs - inc_probs.max(axis=1))
 
-        return
+        return PlotData.perf_curve_from_samples(score_arrs, per_sample_weighting=True)
+
+
+class ROCCurve(CdfPlotConfig):
+    name = "Simulated ROC Curve"
+    xline = None
+    xlabel = "FPR"
+    ylabel = "TPR"
+
+    def get_cdf(self, evals: List[Eval], prob_renorm: bool) -> "PlotData":
+        scores, labels, weights = roc_data(evals, prob_renorm)
+        assert len(scores) == len(labels) == len(weights)
+        # roc_curve returns (fpr, tpr, thresholds); resample onto a uniform grid.
+        fpr, tpr, _ = roc_curve(labels, scores, sample_weight=weights)
+
+        x_interp = np.linspace(0, 1, 600)
+        y_interp = np.interp(x_interp, fpr, tpr)
+
+        # PlotData's fields are (y, x): y is TPR, x the FPR grid.
+        return PlotData(y_interp, x_interp)
+
+
+def roc_data(evals: List[Eval], prob_renorm):
+    weight_arrs = []
+    total = sum(len(e.cor_logprobs) for e in evals)
+    for samples in evals:
+        this = np.ones(2 * len(samples.cor_logprobs)) / (2 * total)
+        weight_arrs.append(this)
+
+    score_arrs = []
+    label_arrs = []
+    for e in evals:
+        if prob_renorm:
+            cor_probs, inc_probs = renormed(e)
+        else:
+            cor_probs = np.exp(e.cor_logprobs)
+            inc_probs = np.exp(e.inc_logprobs)
+        score_arrs.append(cor_probs)
+        label_arrs.append(np.ones(len(cor_probs)))
+        score_arrs.append(inc_probs.max(axis=1))
+        label_arrs.append(np.zeros(inc_probs.shape[0]))
+
+    scores = np.concatenate(score_arrs)
+    labels = np.concatenate(label_arrs)
+    weights = np.concatenate(weight_arrs)
+    return scores, labels, weights
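For context on `roc_data` above: the "simulated" ROC treats each question's correct-answer probability as a positive-class score and its most confident incorrect-answer probability as a negative-class score; the weights only keep each eval contributing equally. A toy sketch, not part of the commit, using the same scikit-learn calls with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up per-question probabilities for a single eval.
cor_probs = np.array([0.7, 0.4, 0.9])  # P(correct answer) per question -> positives
max_inc = np.array([0.2, 0.5, 0.05])   # max P(incorrect answer) per question -> negatives

scores = np.concatenate([cor_probs, max_inc])
labels = np.concatenate([np.ones_like(cor_probs), np.zeros_like(max_inc)])

fpr, tpr, _ = roc_curve(labels, scores)  # roc_curve returns (fpr, tpr, thresholds)
print(roc_auc_score(labels, scores))     # 8 of 9 positive/negative pairs ordered -> ~0.889
print(list(zip(fpr, tpr)))               # points tracing the simulated ROC curve
```

An AUC of 1.0 would mean every correct-answer probability exceeds every max-incorrect probability across the whole eval.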
src/rbeval/plot/utils.py
CHANGED
@@ -17,14 +17,17 @@ def renormed(eval: Eval) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
 
 
 @dataclass
-class CdfData:
-
-
+class PlotData:
+    y: np.ndarray
+    x: np.ndarray
 
     @classmethod
-    def
-        cls,
-
+    def perf_curve_from_samples(
+        cls,
+        samples: List[NDArray[np.float64]],
+        per_sample_weighting: bool = True,
+        one_minus: bool = False,
+    ) -> "PlotData":
         num_cats = len(samples)
         scores = np.concatenate(samples)
         if per_sample_weighting:
@@ -37,26 +40,29 @@ class CdfData:
             weights = np.concatenate(weight_arrs)
         else:
             weights = np.ones_like(scores) / len(scores)
-        return cls.
+        return cls.perf_curve_from_weights(weights, scores, one_minus=one_minus)
 
     @classmethod
-    def
+    def perf_curve_from_weights(
         cls,
         weights: NDArray[np.float64],
        base_scores: NDArray[np.float64],
        max_p: int = 600,
+        one_minus: bool = True,
+    ) -> "PlotData":
         sort_perm = base_scores.argsort()
         base_weights = weights[sort_perm]
         base_scores = base_scores[sort_perm]
-        base_cdf_p =
+        base_cdf_p = np.cumsum(base_weights)
+        if one_minus:
+            base_cdf_p = 1 - base_cdf_p
         minscore, maxscore = base_scores[0], base_scores[-1]
         if len(base_scores) > max_p:
             scores = np.linspace(minscore, maxscore, max_p)  # type: ignore
             cdf_p = np.interp(scores, base_scores, base_cdf_p)  # type: ignore
         else:
             scores, cdf_p = base_scores, base_cdf_p
-        return
-
-
+        return PlotData(
+            y=cdf_p,
+            x=scores,  # type: ignore
         )
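A usage sketch for the renamed `PlotData` helper above, not part of the commit: it assumes the post-commit `rbeval.plot.utils` is importable, and the exact per-category weighting sits in lines the diff does not show. The curve itself is an empirical CDF built by sorting the pooled scores and accumulating weights with `np.cumsum`, optionally flipped to 1 - CDF via `one_minus`.

```python
import numpy as np
from rbeval.plot.utils import PlotData  # assumes the post-commit module layout

a = np.array([0.1, 0.2, 0.9])
b = np.array([0.5, 0.6])  # a second category with fewer samples

curve = PlotData.perf_curve_from_samples([a, b], per_sample_weighting=True)
# curve.x: sorted score grid; curve.y: cumulative weight (the CDF) at each score
print(list(zip(curve.x.round(2), curve.y.round(2))))
```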