[
{
"path": "table_paper/2407.00085v1.json",
"table_id": "1",
"section": "4.2",
"all_context": [
"For automotive sales modeling, (Varian and Choi, 2009 ) was one of the first to show the value of using Google Search data in predicting current economic activity.",
"Our approach further leverages the information present in search queries to increase the accuracy of the nowcast prediction from accounting for 58% of variance when we use classification metrics to 75% using our search embeddings, a 30% improvement in model accuracy.",
"Much of the remaining unexplained variance is due to monthly and quarterly cycles in the data.",
"When the data is rolled up to monthly blocks as reported in (Varian and Choi, 2009 ) our model accounts for 91% of variation in the test set.",
"Our model doesn t use historical sales or other external variables in our model, and the fit metrics reported are and MAPE in order to be consistent with the literature.",
"Table 1 shows the results from modelling U.S. Auto Sales.",
"We used overall US Auto Sales and trained the model at the weekly level across 16 regions, rolling our predictions up to national.",
"The search data includes over ten million distinct queries that are vehicle-related.",
"The model uses both regional and week-of-the-month features.",
"The regional features are included in the probability model to account for regional differences in both search adoption and search behavior across regions.",
"The model is trained across nearly two years of data and the fit metric is reported over the test set, a further 6 months of data.",
"The model is trained with a two week lag between search and sales, an interesting area for future research would be the impact of varying lags, as (Moller et al., 2023 ) does for the housing market.",
"Figure 4 highlights the fit of the search embeddings CoSMo model using a four week rolling average.",
"The US auto sales data that we use in this paper is based on registration data, and has large spikes at the end of the month as well as end of quarter.",
"The large improvement in fit by using four week rolling average suggests that this monthly cycle is likely a supply-side effect as opposed to reflective of demand patterns.",
"At the monthly level the model has an R2 of 0.91, and 3.03 MAPE in the test period.",
"This fit is remarkable given that the model doesn t include any annual seasonality controls, or historical sales.",
"As a point of reference the linear model in (Varian and Choi, 2009 ) returns a monthly R2 of 0.79 over the training data using both lagged sales and Google Trends.",
"While automotive sales are used in this paper, we expect that our approach can be used to greatly improve nowcasts across economic indicators.",
"In the next section we show how the model can accurately predict flu rates, and show the sensitivity of the model to model specifications.",
""
],
"target_context_ids": [
5,
6,
7,
8,
9,
10,
11,
15,
16,
17
],
"selected_paragraphs": [
"[paragraph id = 5] Table 1 shows the results from modelling U.S. Auto Sales.",
"[paragraph id = 6] We used overall US Auto Sales and trained the model at the weekly level across 16 regions, rolling our predictions up to national.",
"[paragraph id = 7] The search data includes over ten million distinct queries that are vehicle-related.",
"[paragraph id = 8] The model uses both regional and week-of-the-month features.",
"[paragraph id = 9] The regional features are included in the probability model to account for regional differences in both search adoption and search behavior across regions.",
"[paragraph id = 10] The model is trained across nearly two years of data and the fit metric is reported over the test set, a further 6 months of data.",
"[paragraph id = 11] The model is trained with a two week lag between search and sales, an interesting area for future research would be the impact of varying lags, as (Moller et al., 2023 ) does for the housing market.",
"[paragraph id = 15] At the monthly level the model has an R2 of 0.91, and 3.03 MAPE in the test period.",
"[paragraph id = 16] This fit is remarkable given that the model doesn t include any annual seasonality controls, or historical sales.",
"[paragraph id = 17] As a point of reference the linear model in (Varian and Choi, 2009 ) returns a monthly R2 of 0.79 over the training data using both lagged sales and Google Trends."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T1\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T1.3\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S4.T1.3.3\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"S4.T1.3.3.4\">Frequency</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"S4.T1.3.3.5\">Embedding</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"S4.T1.3.3.6\">Model</th>\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_r\" id=\"S4.T1.2.2.2\">Test R<sup class=\"ltx_sup\" id=\"S4.T1.2.2.2.1\">2</sup> \n</th>\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_column\" id=\"S4.T1.3.3.3\">Test MAPE (%) \n</th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T1.3.4.1\">\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T1.3.4.1.1\">Weekly</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T1.3.4.1.2\">Categorical</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T1.3.4.1.3\">Lasso</td>\n<td class=\"ltx_td ltx_align_left ltx_border_r ltx_border_t\" id=\"S4.T1.3.4.1.4\">0.5869</td>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T1.3.4.1.5\">10.90</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.3.5.2\">\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.5.2.1\">Weekly</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.5.2.2\">Categorical</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.5.2.3\">CoSMo</td>\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T1.3.5.2.4\">0.5381</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.3.5.2.5\">10.85</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.3.6.3\">\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.6.3.1\">Weekly</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.6.3.2\">SLaM</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.6.3.3\">CoSMo</td>\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T1.3.6.3.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.3.6.3.4.1\">0.7486</span></td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.3.6.3.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.3.6.3.5.1\">7.12</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.3.7.4\">\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.7.4.1\">Monthly</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.7.4.2\">SLaM</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T1.3.7.4.3\">CoSMo</td>\n<td class=\"ltx_td ltx_align_left ltx_border_r\" id=\"S4.T1.3.7.4.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.3.7.4.4.1\">0.9065</span></td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.3.7.4.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.3.7.4.5.1\">3.03</span></td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 1. </span>Baseline Regional Auto Models with search and indicator multipliers - fit metrics reported at the national level.</figcaption>\n</figure>",
"perturb_sentence_id": 17,
"output": {
"perturbed_statement": "[paragraph id = 17] As a point of reference the linear model in (Varian and Choi, 2009 ) returns a monthly R2 of 0.82 over the training data using both lagged sales and Google Trends.While automotive sales are used in this paper, we expect that our approach can be used to greatly improve nowcasts across economic indicators.",
"perturbed_explanation": "The statement contains an altered value for the reported monthly R2. 1. The original statement specifies that the model by Varian and Choi achieved a monthly R2 of 0.79 over the training data. 2. The altered statement incorrects this to 0.82, which misrepresents the findings reported. Thus, the corrected information should be reinstated to accurately reflect the source material."
}
},
{
"path": "table_paper/2407.00085v1.json",
"table_id": "2",
"section": "4.3",
"all_context": [
"For benchmarking experiments, we model Influenza-Like-Illness (ILI) rates from the CDC (CDC, 2024 ) at the national level, like (Lampos et al., 2015 ).",
"Due to data availability, we are unable to compare our model on the same time frames as in previous work.",
"Instead, we use data from 2019 until 2022 for training and validation data, and we estimate the flu rates for the 2022-2023 flu season as the test period.",
"In (Lampos et al., 2015 ) the Pearson correlation coefficient and the Mean Absolute Percentage Error are provided for multiple flu seasons from 2008 until 2013; for the methods we implemented, we report the average values across 5 trials.",
"We provide the best and worst performances of previous methods in (Lampos et al., 2015 ) to benchmark our approach.",
"In previous works, it is unclear how the model s hyperparameters were selected.",
"We report the test metrics of our approach using the model whose average validation MAPE was lowest; for benchmarking purposes, we also report the model with the best test MAPE.",
"Additionally, we compare our modeling approach to more typical methods such as logistic regression and multi-layer perceptron (MLP) neural networks, which have a history of modeling success but do not have the regularizing structural components of our approach.",
"For logistic regression, we found the model to work better without search volume, and only use the normalized search embeddings.",
"All methods include L1 regularization.",
"We include about two million cold & flu related terms for our search embeddings.",
"Figure 3 shows our model s predicted values for a few years during both training and testing.",
"Our model, which only uses data from search to estimate of the flu rates of a given week, is able to closely estimate the actual flu rates for a new flu season despite not using lagged flu rate data in its estimates like autoregressive models.",
"Table 2 shows the results from modeling the U.S. ILI rates at the national level.",
"We can see that CoSMo outperforms other methods which only use search data.",
"The autoregressive (AR) entries in Table 2 represent methods that include either a 1-week or 2-week lag of the most recent ILI rate.",
"Our method is generally on par or better than the best AR approaches.",
""
],
"target_context_ids": [
13,
14,
15,
16
],
"selected_paragraphs": [
"[paragraph id = 13] Table 2 shows the results from modeling the U.S. ILI rates at the national level.",
"[paragraph id = 14] We can see that CoSMo outperforms other methods which only use search data.",
"[paragraph id = 15] The autoregressive (AR) entries in Table 2 represent methods that include either a 1-week or 2-week lag of the most recent ILI rate.",
"[paragraph id = 16] Our method is generally on par or better than the best AR approaches."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T2\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T2.10\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S4.T2.3.3\">\n<th class=\"ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_r\" id=\"S4.T2.3.3.4\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"S4.T2.1.1.1\">Test MAPE(%) \n</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S4.T2.3.3.3\">Test \n</th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T2.4.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T2.4.4.2\">Logistic Regression</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T2.4.4.1\">24.9 0.1</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.3\">.98</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.5.5\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.5.5.2\">MLP</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.5.5.1\">7.3 1.5</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.5.5.3\">.99</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.10.11.1\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T2.10.11.1.1\">Google Flu Trends <cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T2.10.11.1.2\">[9.5 - 33.1]</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.10.11.1.3\">[.66 - .97]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.10.12.2\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.10.12.2.1\">Elastic Net <cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.10.12.2.2\">[9.8 - 15.1]</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.10.12.2.3\">[.92 - .99]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.10.13.3\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.10.13.3.1\">Guassian Process <cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.10.13.3.2\">[9.4 - 14.6]</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.10.13.3.3\">[.94 - .99]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.10.14.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T2.10.14.4.1\">AR <cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T2.10.14.4.2\">[6.7 - 14.3]</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.10.14.4.3\">[.88 - .98]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.10.15.5\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.10.15.5.1\">AR+Google Flu Trends 
<cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.10.15.5.2\">[6.2 - 12.5]</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.10.15.5.3\">[.88 - .99]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.6.6\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.6.6.2\">AR+Elastic Net <cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.6.6.3\">[5.1 - 8.7]</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.6.6.1\">[.93 - ]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.7.7\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.7.7.2\">AR+Guassian Process <cite class=\"ltx_cite ltx_citemacro_citep\">(Lampos et al<span class=\"ltx_text\">.</span>, <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00085v1#bib.bib18\" title=\"\">2015</a>)</cite>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.7.7.3\">[5.0 - 8.6]</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.7.7.1\">[.93 - ]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.8.8\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T2.8.8.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.8.8.2.1\">CoSMo (Ours)</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"S4.T2.8.8.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.8.8.1.1\">5.5 0.4</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.8.8.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.8.8.3.1\">.99</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.10.10\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.10.10.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.10.10.3.1\">CoSMo (Ours, Test selection)</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"S4.T2.9.9.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.9.9.1.1\">3.9 0.1</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.10.10.2\">\n<span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.10.10.2.1\">1</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 2. </span>Benchmarking ILI flu rate prediction at the national level. We show the standard deviation of MAPE for our experiments; we omit this metric for the Pearson coefficient because it was close to zero for all experiments.</figcaption>\n</figure>",
"perturb_sentence_id": 15,
"output": {
"perturbed_statement": "[paragraph id = 15] The autoregressive (AR) entries in Table 2 represent methods that include a 3-week lag of the most recent ILI rate. Our method is generally inferior to the best AR approaches.",
"perturbed_explanation": "1. The original explanation emphasizes that Table 2 outlines autoregressive (AR) methods leveraging ILI data with a 1-week or 2-week lag, and our approach is typically equal to or better in performance compared to the leading AR methods. 2. The statement now incorrectly claims that AR methods utilize a 3-week lag, which conflicts with the description of the methodology in the context, and asserts that our approach is generally inferior to AR methods, which contradicts documented indications of performance superiority."
}
},
{
"path": "table_paper/2407.00085v1.json",
"table_id": "5",
"section": "4.5",
"all_context": [
"We analyze the capability of our model to go from child-geography to parent-geography predictions and vice versa.",
"Training a model on parent-level (e.g,.",
"country) data, then evaluating on child-level (e.g., State) is common when child-level data is either missing or never collect, while training a model at the child-level and making parent-level predictions is useful when it s believed that the increased number of child-geo datapoints will help the model fit.",
"We use two versions of the best flu models: a no-volume national-level model and a no-volume state-level model.",
"The national-level model was trained on national-level targets using national-level search embeddings, but inference was done using state-level search embeddings and evaluated on state-level targets; vice versa for the state-level model.",
"The results are shown in Table 5 .",
"The model has a surprising capability to infer with some success (.78 ) state-level flu rates, in the test period, without ever being trained on state-level targets.",
"The zero-shot inference performs better in the opposite direction, (.99 ), perhaps leveraging the greater number of training examples and taking advantage of the easier task of national modeling.",
""
],
"target_context_ids": [
0,
1,
2,
3,
4,
5,
6,
7
],
"selected_paragraphs": [
"[paragraph id = 0] We analyze the capability of our model to go from child-geography to parent-geography predictions and vice versa.",
"[paragraph id = 1] Training a model on parent-level (e.g,.",
"[paragraph id = 2] country) data, then evaluating on child-level (e.g., State) is common when child-level data is either missing or never collect, while training a model at the child-level and making parent-level predictions is useful when it s believed that the increased number of child-geo datapoints will help the model fit.",
"[paragraph id = 3] We use two versions of the best flu models: a no-volume national-level model and a no-volume state-level model.",
"[paragraph id = 4] The national-level model was trained on national-level targets using national-level search embeddings, but inference was done using state-level search embeddings and evaluated on state-level targets; vice versa for the state-level model.",
"[paragraph id = 5] The results are shown in Table 5 .",
"[paragraph id = 6] The model has a surprising capability to infer with some success (.78 ) state-level flu rates, in the test period, without ever being trained on state-level targets.",
"[paragraph id = 7] The zero-shot inference performs better in the opposite direction, (.99 ), perhaps leveraging the greater number of training examples and taking advantage of the easier task of national modeling."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T5\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T5.10\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S4.T5.2.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_r\" id=\"S4.T5.2.2.3\">Training Data</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_r\" id=\"S4.T5.2.2.4\">Eval Data</th>\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_column\" id=\"S4.T5.1.1.1\">Test MAPE(%)\n</th>\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_column\" id=\"S4.T5.2.2.2\">Test \n</th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T5.4.4\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T5.4.4.3\">State</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T5.4.4.4\">State</th>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T5.3.3.1\"></td>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T5.4.4.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.6.6\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r\" id=\"S4.T5.6.6.3\">National</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r\" id=\"S4.T5.6.6.4\">State</th>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T5.5.5.1\"></td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T5.6.6.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.8.8\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T5.8.8.3\">National</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T5.8.8.4\">National</th>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T5.7.7.1\"></td>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T5.8.8.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.10.10\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r\" id=\"S4.T5.10.10.3\">State</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r\" id=\"S4.T5.10.10.4\">National</th>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T5.9.9.1\"></td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T5.10.10.2\"></td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 5. </span>Zero-shot evaluation for Flu ILI rate prediction. The zero-shot examples are the rows where there is a mismatch between the Training Data column and the Eval Data column. The rows with alignment serve as comparison points. </figcaption>\n</figure>",
"perturb_sentence_id": 6,
"output": {
"perturbed_statement": "[paragraph id = 6] The model has a surprising capability to infer with some success (.78) national-level flu rates, in the test period, without ever being trained on national-level targets. The zero-shot inference performs better in the opposite direction, (.92), perhaps leveraging the greater number of training examples and taking advantage of the easier task of state-level modeling.",
"perturbed_explanation": "Explanation: 1. The original explanation states that the model was able to infer state-level flu rates without training on state-level targets (zero-shot inference), achieving a performance metric of 0.78 during the test period, with the opposite direction displaying an improved performance metric of 0.99 due to leveraging more training data and the simpler nature of modeling at the national level. 2. The statement claims incorrect data by stating the inference pertains to national-level flu rates without training on national-level targets and provides altered performance metrics (0.92 rather than the correct 0.99). These assertions differ from the given contextual details and thus render the statement factually inaccurate."
}
},
{
"path": "table_paper/2407.00085v1.json",
"table_id": "6",
"section": "4.6",
"all_context": [
"In addition to the MLSE embeddings (Yang et al., 2019 ), we look at variants of the T5 (Raffel et al., 2020 ) LLM, the sentence-T5 (sT5) (Ni et al., 2021 ), a version of T5 that outputs a fixed-length 768-dimensional vector for every input sequence 666Our method requires that the LM output a D-dimensional vector that is not dependent on the input shape.",
"Unfortunately, many LMs have outputs with shape where is the number of input tokens.",
"In order to study many other LMs using our method, such as mT5, we would need to first map the LM output to a fixed-length vector.",
"Potential options are using the output associated with the ¡BOS¿ token, or averaging across the sequence length dimension.",
"We leave these experiments to future work.. We study the effect of using these embeddings on the the national Flu ILI prediction tasks.",
"Table 6 shows the results from using different search embeddings created using the sT5 Base (110M parameters) and sT5 Large (335M parameters) models.",
"Surprisingly, larger capacity models like sT5 Base and sT5 Large do not outperform the smaller capacity MLSE model.",
"We believe this has to do with sT5 models being trained on only the English language.",
"The MLSE model being a multi-lingual model is able to make better use of the multiple languages present in the search data, where as the sT5 models are unable accurately map the meanings of these queries.",
"We validate this by generating search embeddings using only English queries and training models on these English-only search embeddings.",
"These results are shown in Table 6 .",
"We can see that the sT5 models perform similar to their all-language counter parts, where as performance for MLSE considerable lowers.",
"We leave further studies to future work.",
""
],
"target_context_ids": [
5,
6,
7,
8,
9,
10,
11
],
"selected_paragraphs": [
"[paragraph id = 5] Table 6 shows the results from using different search embeddings created using the sT5 Base (110M parameters) and sT5 Large (335M parameters) models.",
"[paragraph id = 6] Surprisingly, larger capacity models like sT5 Base and sT5 Large do not outperform the smaller capacity MLSE model.",
"[paragraph id = 7] We believe this has to do with sT5 models being trained on only the English language.",
"[paragraph id = 8] The MLSE model being a multi-lingual model is able to make better use of the multiple languages present in the search data, where as the sT5 models are unable accurately map the meanings of these queries.",
"[paragraph id = 9] We validate this by generating search embeddings using only English queries and training models on these English-only search embeddings.",
"[paragraph id = 10] These results are shown in Table 6 .",
"[paragraph id = 11] We can see that the sT5 models perform similar to their all-language counter parts, where as performance for MLSE considerable lowers."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T6\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T6.14\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T6.2.2\">\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_r\" id=\"S4.T6.2.2.3\"></th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.1.1.1\">Test MAPE(%)\n</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.2.2.2\">Test \n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T6.4.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T6.4.4.3\">MLSE (baseline)</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T6.3.3.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T6.4.4.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T6.6.6\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T6.6.6.3\">sT5 Base</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T6.5.5.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T6.6.6.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T6.8.8\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T6.8.8.3\">sT5 Large</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.7.7.1\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.8.8.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T6.10.10\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T6.10.10.3\">MLSE (English only)</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T6.9.9.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T6.10.10.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T6.12.12\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T6.12.12.3\">sT5 Base (English only)</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.11.11.1\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.12.12.2\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T6.14.14\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_r\" id=\"S4.T6.14.14.3\">sT5 Large (English only)</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.13.13.1\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T6.14.14.2\"></td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 6. </span>National ILI rate modeling results from using different embedding functions from a variety of LMs.</figcaption>\n</figure>",
"perturb_sentence_id": 5,
"output": {
"perturbed_statement": "[paragraph id = 5] Table 6 shows the results from using different search embeddings created using the MLSE model.Surprisingly, larger capacity models like sT5 Base (110M parameters) and sT5 Large (335M parameters) significantly outperform the oversized MLSE model.",
"perturbed_explanation": "Table 6 shows the results from using different search embeddings created using the sT5 Base (110M parameters) and sT5 Large (335M parameters) models. Surprisingly, larger capacity models like sT5 Base and sT5 Large do not outperform the smaller capacity MLSE model. 1. These results imply that larger model capacity does not necessarily correlate with superior performance when compared to smaller, diversified models like MLSE. 2. Contrary to the information provided, the statement suggests the sT5 models notably outperform the MLSE model, which conflicts with the observed outcomes where MLSE's performances are competitive or better in multilingual contexts. Therefore, the revised statement introduces a factual inconsistency."
}
}
]
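
A minimal usage sketch (not part of the JSON above): the snippet below shows one way these perturbation records could be loaded and compared against the original context sentences in Python. The file name is an assumption based on this upload; the field names (`path`, `table_id`, `section`, `all_context`, `perturb_sentence_id`, `output`) come directly from the records.

```python
import json

# Assumed local file name for this upload; adjust the path as needed.
with open("2407.00085v1_output.json", "r", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    # perturb_sentence_id indexes the sentence in all_context that was altered.
    sent_id = rec["perturb_sentence_id"]
    original = rec["all_context"][sent_id]
    perturbed = rec["output"]["perturbed_statement"]
    print(f"Table {rec['table_id']} (section {rec['section']})")
    print(f"  original : {original}")
    print(f"  perturbed: {perturbed}")
```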