[
{
"path": "table_paper/2407.00085v1.json",
"table_id": "1",
"section": "4.2",
"all_context": [
"For automotive sales modeling, (Varian and Choi, 2009 ) was one of the first to show the value of using Google Search data in predicting current economic activity.",
"Our approach further leverages the information present in search queries to increase the accuracy of the nowcast prediction from accounting for 58% of variance when we use classification metrics to 75% using our search embeddings, a 30% improvement in model accuracy.",
"Much of the remaining unexplained variance is due to monthly and quarterly cycles in the data.",
"When the data is rolled up to monthly blocks as reported in (Varian and Choi, 2009 ) our model accounts for 91% of variation in the test set.",
"Our model doesn t use historical sales or other external variables in our model, and the fit metrics reported are and MAPE in order to be consistent with the literature.",
"Table 1 shows the results from modelling U.S. Auto Sales.",
"We used overall US Auto Sales and trained the model at the weekly level across 16 regions, rolling our predictions up to national.",
"The search data includes over ten million distinct queries that are vehicle-related.",
"The model uses both regional and week-of-the-month features.",
"The regional features are included in the probability model to account for regional differences in both search adoption and search behavior across regions.",
"The model is trained across nearly two years of data and the fit metric is reported over the test set, a further 6 months of data.",
"The model is trained with a two week lag between search and sales, an interesting area for future research would be the impact of varying lags, as (Moller et al., 2023 ) does for the housing market.",
"Figure 4 highlights the fit of the search embeddings CoSMo model using a four week rolling average.",
"The US auto sales data that we use in this paper is based on registration data, and has large spikes at the end of the month as well as end of quarter.",
"The large improvement in fit by using four week rolling average suggests that this monthly cycle is likely a supply-side effect as opposed to reflective of demand patterns.",
"At the monthly level the model has an R2 of 0.91, and 3.03 MAPE in the test period.",
"This fit is remarkable given that the model doesn t include any annual seasonality controls, or historical sales.",
"As a point of reference the linear model in (Varian and Choi, 2009 ) returns a monthly R2 of 0.79 over the training data using both lagged sales and Google Trends.",
"While automotive sales are used in this paper, we expect that our approach can be used to greatly improve nowcasts across economic indicators.",
"In the next section we show how the model can accurately predict flu rates, and show the sensitivity of the model to model specifications.",
""
],
"target_context_ids": [
5,
6,
7,
8,
9,
10,
11,
15,
16,
17
],
"selected_paragraphs": [
"[paragraph id = 5] Table 1 shows the results from modelling U.S. Auto Sales.",
"[paragraph id = 6] We used overall US Auto Sales and trained the model at the weekly level across 16 regions, rolling our predictions up to national.",
"[paragraph id = 7] The search data includes over ten million distinct queries that are vehicle-related.",
"[paragraph id = 8] The model uses both regional and week-of-the-month features.",
"[paragraph id = 9] The regional features are included in the probability model to account for regional differences in both search adoption and search behavior across regions.",
"[paragraph id = 10] The model is trained across nearly two years of data and the fit metric is reported over the test set, a further 6 months of data.",
"[paragraph id = 11] The model is trained with a two week lag between search and sales, an interesting area for future research would be the impact of varying lags, as (Moller et al., 2023 ) does for the housing market.",
"[paragraph id = 15] At the monthly level the model has an R2 of 0.91, and 3.03 MAPE in the test period.",
"[paragraph id = 16] This fit is remarkable given that the model doesn t include any annual seasonality controls, or historical sales.",
"[paragraph id = 17] As a point of reference the linear model in (Varian and Choi, 2009 ) returns a monthly R2 of 0.79 over the training data using both lagged sales and Google Trends."
],
"table_html": "
<table>\n<caption>Table 1. Baseline Regional Auto Models with search and indicator multipliers - fit metrics reported at the national level.</caption>\n<thead>\n<tr><th>Frequency</th><th>Embedding</th><th>Model</th><th>Test R2</th><th>Test MAPE (%)</th></tr>\n</thead>\n<tbody>\n<tr><td>Weekly</td><td>Categorical</td><td>Lasso</td><td>0.5869</td><td>10.90</td></tr>\n<tr><td>Weekly</td><td>Categorical</td><td>CoSMo</td><td>0.5381</td><td>10.85</td></tr>\n<tr><td>Weekly</td><td>SLaM</td><td>CoSMo</td><td>0.7486</td><td>7.12</td></tr>\n<tr><td>Monthly</td><td>SLaM</td><td>CoSMo</td><td>0.9065</td><td>3.03</td></tr>\n</tbody>\n</table>
",
"perturb_sentence_id": [
17,
18
],
"output": {
"perturbed_statement": "[paragraph id = 17] As a point of reference the linear model in (Varian and Choi, 2009 ) returns a monthly R2 of 0.79 over the training data using both lagged sales and Google Trends. While automotive sales are used in this paper, we expect that our approach can be used to greatly improve forecasts across economic sectors.",
"perturbed_explanation": "1. The statement compares the model's capabilities by referencing a linear model from Varian and Choi (2009), highlighting an R2 of 0.79 and mentioning the use of lagged sales and Google Trends. 2. By changing 'nowcasts' to 'forecasts', the altered statement implies a traditional forecasting approach rather than the more immediate estimation that nowcasting provides. Nowcasting typically involves estimating current or very near-term conditions, while forecasting may refer to a longer-term prediction. This change misrepresents the original context of the statement, which emphasizes the capability to produce quick, current estimates in economic indicators."
}
},
{
"path": "table_paper/2407.00085v1.json",
"table_id": "2",
"section": "4.3",
"all_context": [
"For benchmarking experiments, we model Influenza-Like-Illness (ILI) rates from the CDC (CDC, 2024 ) at the national level, like (Lampos et al., 2015 ).",
"Due to data availability, we are unable to compare our model on the same time frames as in previous work.",
"Instead, we use data from 2019 until 2022 for training and validation data, and we estimate the flu rates for the 2022-2023 flu season as the test period.",
"In (Lampos et al., 2015 ) the Pearson correlation coefficient and the Mean Absolute Percentage Error are provided for multiple flu seasons from 2008 until 2013; for the methods we implemented, we report the average values across 5 trials.",
"We provide the best and worst performances of previous methods in (Lampos et al., 2015 ) to benchmark our approach.",
"In previous works, it is unclear how the model s hyperparameters were selected.",
"We report the test metrics of our approach using the model whose average validation MAPE was lowest; for benchmarking purposes, we also report the model with the best test MAPE.",
"Additionally, we compare our modeling approach to more typical methods such as logistic regression and multi-layer perceptron (MLP) neural networks, which have a history of modeling success but do not have the regularizing structural components of our approach.",
"For logistic regression, we found the model to work better without search volume, and only use the normalized search embeddings.",
"All methods include L1 regularization.",
"We include about two million cold & flu related terms for our search embeddings.",
"Figure 3 shows our model s predicted values for a few years during both training and testing.",
"Our model, which only uses data from search to estimate of the flu rates of a given week, is able to closely estimate the actual flu rates for a new flu season despite not using lagged flu rate data in its estimates like autoregressive models.",
"Table 2 shows the results from modeling the U.S. ILI rates at the national level.",
"We can see that CoSMo outperforms other methods which only use search data.",
"The autoregressive (AR) entries in Table 2 represent methods that include either a 1-week or 2-week lag of the most recent ILI rate.",
"Our method is generally on par or better than the best AR approaches.",
""
],
"target_context_ids": [
13,
14,
15,
16
],
"selected_paragraphs": [
"[paragraph id = 13] Table 2 shows the results from modeling the U.S. ILI rates at the national level.",
"[paragraph id = 14] We can see that CoSMo outperforms other methods which only use search data.",
"[paragraph id = 15] The autoregressive (AR) entries in Table 2 represent methods that include either a 1-week or 2-week lag of the most recent ILI rate.",
"[paragraph id = 16] Our method is generally on par or better than the best AR approaches."
],
"table_html": "
\nTable 2. Benchmarking ILI flu rate prediction at the national level. We show the standard deviation of MAPE for our experiments; we omit this metric for the Pearson coefficient because it was close to zero for all experiments.\n
",
"perturb_sentence_id": [
15,
16
],
"output": {
"perturbed_statement": "[paragraph id = 15] The autoregressive (AR) entries in Table 2 represent methods that include either a 3-week or 4-week lag of the most recent ILI rate. Our method is generally on par or better than the best AR approaches.",
"perturbed_explanation": "The original explanation states: 1) The autoregressive (AR) entries in Table 2 include a 1-week or 2-week lag of the most recent ILI rate. The statement is incorrect because: 2) The statement changes the lag period to 3-week or 4-week, which is factually inaccurate as the context refers to a 1-week or 2-week lag. This introduces a misunderstanding about the methodology and accuracy of the AR models compared to other methods."
}
},
{
"path": "table_paper/2407.00085v1.json",
"table_id": "5",
"section": "4.5",
"all_context": [
"We analyze the capability of our model to go from child-geography to parent-geography predictions and vice versa.",
"Training a model on parent-level (e.g,.",
"country) data, then evaluating on child-level (e.g., State) is common when child-level data is either missing or never collect, while training a model at the child-level and making parent-level predictions is useful when it s believed that the increased number of child-geo datapoints will help the model fit.",
"We use two versions of the best flu models: a no-volume national-level model and a no-volume state-level model.",
"The national-level model was trained on national-level targets using national-level search embeddings, but inference was done using state-level search embeddings and evaluated on state-level targets; vice versa for the state-level model.",
"The results are shown in Table 5 .",
"The model has a surprising capability to infer with some success (.78 ) state-level flu rates, in the test period, without ever being trained on state-level targets.",
"The zero-shot inference performs better in the opposite direction, (.99 ), perhaps leveraging the greater number of training examples and taking advantage of the easier task of national modeling.",
""
],
"target_context_ids": [
0,
1,
2,
3,
4,
5,
6,
7
],
"selected_paragraphs": [
"[paragraph id = 0] We analyze the capability of our model to go from child-geography to parent-geography predictions and vice versa.",
"[paragraph id = 1] Training a model on parent-level (e.g,.",
"[paragraph id = 2] country) data, then evaluating on child-level (e.g., State) is common when child-level data is either missing or never collect, while training a model at the child-level and making parent-level predictions is useful when it s believed that the increased number of child-geo datapoints will help the model fit.",
"[paragraph id = 3] We use two versions of the best flu models: a no-volume national-level model and a no-volume state-level model.",
"[paragraph id = 4] The national-level model was trained on national-level targets using national-level search embeddings, but inference was done using state-level search embeddings and evaluated on state-level targets; vice versa for the state-level model.",
"[paragraph id = 5] The results are shown in Table 5 .",
"[paragraph id = 6] The model has a surprising capability to infer with some success (.78 ) state-level flu rates, in the test period, without ever being trained on state-level targets.",
"[paragraph id = 7] The zero-shot inference performs better in the opposite direction, (.99 ), perhaps leveraging the greater number of training examples and taking advantage of the easier task of national modeling."
],
"table_html": "
<table>\n<caption>Table 5. Zero-shot evaluation for Flu ILI rate prediction. The zero-shot examples are the rows where there is a mismatch between the Training Data column and the Eval Data column. The rows with alignment serve as comparison points.</caption>\n<thead>\n<tr><th>Training Data</th><th>Eval Data</th><th>Test MAPE(%)</th><th>Test</th></tr>\n</thead>\n<tbody>\n<tr><td>State</td><td>State</td><td></td><td></td></tr>\n<tr><td>National</td><td>State</td><td></td><td></td></tr>\n<tr><td>National</td><td>National</td><td></td><td></td></tr>\n<tr><td>State</td><td>National</td><td></td><td></td></tr>\n</tbody>\n</table>
",
"perturb_sentence_id": [
6,
7
],
"output": {
"perturbed_statement": "[paragraph id = 6] The model has a surprising capability to infer with some success (.78) national flu rates, in the test period, without ever being trained on national-level targets. The zero-shot inference performs worse in the opposite direction, (.80), perhaps due to fewer training examples and the increased complexity of state-level modeling.",
"perturbed_explanation": "1. The statement originally explains the model's capability to infer state-level flu rates with some success without being trained on state-level targets, and that the zero-shot inference performs better when predicting parent-level data (countries) than child-level data (states) due to having more training examples and the simplicity of national modeling. 2. The statement incorrectly claims that the model infers national flu rates without ever being trained on national-level targets, which contradicts the context explaining that the national-level model was indeed trained on national-level targets. 3. Additionally, the statement erroneously states that the zero-shot inference performs worse when predicting parent-level data, which is the opposite of what is actually described, i.e., predicting parent-level data is reported to be more successful."
}
},
{
"path": "table_paper/2407.00085v1.json",
"table_id": "6",
"section": "4.6",
"all_context": [
"In addition to the MLSE embeddings (Yang et al., 2019 ), we look at variants of the T5 (Raffel et al., 2020 ) LLM, the sentence-T5 (sT5) (Ni et al., 2021 ), a version of T5 that outputs a fixed-length 768-dimensional vector for every input sequence 666Our method requires that the LM output a D-dimensional vector that is not dependent on the input shape.",
"Unfortunately, many LMs have outputs with shape where is the number of input tokens.",
"In order to study many other LMs using our method, such as mT5, we would need to first map the LM output to a fixed-length vector.",
"Potential options are using the output associated with the ¡BOS¿ token, or averaging across the sequence length dimension.",
"We leave these experiments to future work.. We study the effect of using these embeddings on the the national Flu ILI prediction tasks.",
"Table 6 shows the results from using different search embeddings created using the sT5 Base (110M parameters) and sT5 Large (335M parameters) models.",
"Surprisingly, larger capacity models like sT5 Base and sT5 Large do not outperform the smaller capacity MLSE model.",
"We believe this has to do with sT5 models being trained on only the English language.",
"The MLSE model being a multi-lingual model is able to make better use of the multiple languages present in the search data, where as the sT5 models are unable accurately map the meanings of these queries.",
"We validate this by generating search embeddings using only English queries and training models on these English-only search embeddings.",
"These results are shown in Table 6 .",
"We can see that the sT5 models perform similar to their all-language counter parts, where as performance for MLSE considerable lowers.",
"We leave further studies to future work.",
""
],
"target_context_ids": [
5,
6,
7,
8,
9,
10,
11
],
"selected_paragraphs": [
"[paragraph id = 5] Table 6 shows the results from using different search embeddings created using the sT5 Base (110M parameters) and sT5 Large (335M parameters) models.",
"[paragraph id = 6] Surprisingly, larger capacity models like sT5 Base and sT5 Large do not outperform the smaller capacity MLSE model.",
"[paragraph id = 7] We believe this has to do with sT5 models being trained on only the English language.",
"[paragraph id = 8] The MLSE model being a multi-lingual model is able to make better use of the multiple languages present in the search data, where as the sT5 models are unable accurately map the meanings of these queries.",
"[paragraph id = 9] We validate this by generating search embeddings using only English queries and training models on these English-only search embeddings.",
"[paragraph id = 10] These results are shown in Table 6 .",
"[paragraph id = 11] We can see that the sT5 models perform similar to their all-language counter parts, where as performance for MLSE considerable lowers."
],
"table_html": "
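<table>\n<caption>Table 6. National ILI rate modeling results from using different embedding functions from a variety of LMs.</caption>\n<thead>\n<tr><th></th><th>Test MAPE(%)</th><th>Test</th></tr>\n</thead>\n<tbody>\n<tr><td>MLSE (baseline)</td><td></td><td></td></tr>\n<tr><td>sT5 Base</td><td></td><td></td></tr>\n<tr><td>sT5 Large</td><td></td><td></td></tr>\n<tr><td>MLSE (English only)</td><td></td><td></td></tr>\n<tr><td>sT5 Base (English only)</td><td></td><td></td></tr>\n<tr><td>sT5 Large (English only)</td><td></td><td></td></tr>\n</tbody>\n</table>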
",
"perturb_sentence_id": [
5,
6
],
"output": {
"perturbed_statement": "[paragraph id = 5] Table 6 shows the results from using different search embeddings created using the sT5 Base (110M parameters) and sT5 Large (335M parameters) models. Surprisingly, smaller capacity models like sT5 Base and sT5 Large outperform the larger capacity MLSE model.",
"perturbed_explanation": "Original Explanation: Table 6 showcases the performance results when using different model configurations for search embeddings. Surprisingly, despite their larger model size, the sT5 Base and sT5 Large do not outperform the smaller MLSE model in the context provided. 1. The perturbed statement erroneously claims that the smaller capacity models like sT5 Base and sT5 Large outperform the MLSE model. However, the context explicitly states that the performance of sT5 models is similar and not superior to the MLSE model in the multi-language setting, with MLSE experiencing a considerable reduction in performance when using English-only data."
}
}
]