[ { "path": "table_paper/2407.00108v1.json", "table_id": "1", "section": "3.2", "all_context": [ "The human evaluation aspect of the study is interpreted as the effort required to post-edit the translations to a production standard, and captured in the number of errors, keystrokes and total edit time.", "The task was performed by professional HTs and PEs using ZOOSubs, an in-house software application belonging to ZOO Digital, built to facilitate manual translation of video material (Figure 1 ).", "The software s interface displays the video material along with timed subtitles in the original language.", "The target stream, i.e.", "the set of text boxes provided to the right of the source stream, is where the HTs input their translations to the desired language.", "It can optionally be pre-populated with “draft” translations – a setting we opted for in this study – allowing post-editors to edit, divide or combine the segments as they see fit.", "To make amendments to a segment, the PE needs to click on its box.", "From that point, the system tracks the time spent editing the box and the number of keystrokes made.", "These metrics are recorded for each box separately and taken into account only if the post-edited text differs from the original.", "After applying modifications, an Issues for event window appears for the user to specify the purpose of the changes by selecting errors from a predefined list, optionally providing text commentary.", "We leveraged this functionality of ZOOSubs to measure the total and average time and number of keystrokes made by HTs and PEs given some pre-existing translations.", "We also measured the number of selected errors.", "For this project, we created a bespoke taxonomy of errors (Table 1 ) based on translation errors reported in previous work [Freitag et al., 2021 , Sharou and Specia, 2022 ], the original list of issues already present in the ZOOSubs system and relevant errors from previous work (§2 ).", "Error categories from the aforementioned sources were compiled together and curated to fit the study requirements444We uploaded a draft taxonomy to ZOOSubs, and the first author performed a test evaluation against a stream with segments to validate the list.", "As a result, some errors were split into more granular categories, some were renamed and some generalised.", "The PEs operated on seven episodes from three TV series of varying genres: a fictional series about space exploration, a documentary exploring aspects of everyday life, and a family cooking competition show.", "They were unaware that some of the text they worked with was machine translated, but were told that it was for a research project and asked to relax some constraints such as adhering to the reading speed limits.", "In addition, we asked four HTs (two to German, two to French) to translate one episode of the cooking show from scratch in ZOOSubs so we could compare their effort to that of post-editors.", "For each of the seven episodes, the PEs were asked to post-edit one out of four versions of the text, corresponding to the list outlined in §3 .", "We included the human references (Ref) to account for the fact that PEs can sometimes post-edit a translation even when the original one is valid.", "Our setup ensured that the same PE evaluated the output for each episode exactly once (i.e.", "does not see two different versions of the same text) (Table 2 ).", "When referring to individual PEs, we use the notation PE.", "[L][i], where L {G (German), F (French)}, and i denotes the PE ID 
.", "The recruited PEs and HTs were professionals within the subtitle domain and freelance employees of ZOO Digital.", "They were informed that the undertaken work was carried out for a research project, but nevertheless, they were paid for their effort at competitive PE and HT rates, standard within the company for this type of work.", "Information about the PEs and HTs years of experience (YOE) was collected to shed more light on the findings (Table 3 ).", "They also answered a short survey about their views regarding machine translation, discussed in detail in §5.3 : Which one would you prefer: translating a stream from scratch or completing a quality check on (post-editing) a stream?", "What are your views on the use of machine translation in the industry?", "In your view, are there benefits to post-editing translations over translating from scratch?", "All French HTs had training in post-editing, and three out of four preferred it to translating from scratch, while no German HTs had received such training in the past, and all but one strictly preferred FST.", "All PEs had at least one YOE in post-editing and one and a half in the subtitle domain.", "Although the HTs within both pairs had a similar amount of experience in translation in general and in the subtitle domain ( for French vs for German), the French HTs had the advantage in terms of YOE in both subtitling (a mean difference of YOE) and post-editing (a mean difference of YOE).", "" ], "target_context_ids": [ 12, 13, 14 ], "selected_paragraphs": [ "[paragraph id = 12] For this project, we created a bespoke taxonomy of errors (Table 1 ) based on translation errors reported in previous work [Freitag et al., 2021 , Sharou and Specia, 2022 ], the original list of issues already present in the ZOOSubs system and relevant errors from previous work (§2 ).", "[paragraph id = 13] Error categories from the aforementioned sources were compiled together and curated to fit the study requirements444We uploaded a draft taxonomy to ZOOSubs, and the first author performed a test evaluation against a stream with segments to validate the list.", "[paragraph id = 14] As a result, some errors were split into more granular categories, some were renamed and some generalised." ], "table_html": "
<table>
<tr><th>Type</th><th>Description</th></tr>
<tr><th colspan="2">Translation quality</th></tr>
<tr><td>Catastrophic translation</td><td>Impossible to post-edit; must be translated from scratch.</td></tr>
<tr><td>Mistranslation</td><td>Incorrect. Does not preserve the meaning or function of the source.</td></tr>
<tr><td>Omission</td><td>Part of the source text was left untranslated.</td></tr>
<tr><td>Deviation in sentiment</td><td>Does not preserve the sentiment of the source (e.g. does not match the expressed excitement), or negates the sentiment (e.g. from positive to negative).</td></tr>
<tr><td>Locale convention</td><td>Violates locale convention, e.g. currency and date format.</td></tr>
<tr><td>Fluency</td><td>Contains punctuation, spelling and grammar errors.</td></tr>
<tr><th colspan="2">Context</th></tr>
<tr><td>Incorrect gender</td><td>Misgenders the speaker or the addressed person(s).</td></tr>
<tr><td>Incorrect plurality</td><td>Incorrectly refers to a single person when a group is addressed, or vice versa.</td></tr>
<tr><td>Wrong formality</td><td>Expressed in informal style or uses informal addressing when formal is required, or vice versa.</td></tr>
<tr><td>Other inconsistency with video</td><td>Contains inconsistencies with the video material not falling within any of the above.</td></tr>
<tr><th colspan="2">Style</th></tr>
<tr><td>Subtitle formatting violation</td><td>Violation of the subtitle blocking guidelines.</td></tr>
<tr><td>Other style sheet non-compliance</td><td>Does not conform to the provided style sheet.</td></tr>
<tr><td>Awkward style</td><td>The style of the translation does not reflect the style of the source sentence and/or the context.</td></tr>
<tr><td>Subjective style changes</td><td>The translation is acceptable, but the editor suggests improvements in style.</td></tr>
<tr><td>Other</td><td>Error of a type not listed above (use the text box provided).</td></tr>
</table>
Table 1: List of errors provided to the human evaluators during the campaign.
", "perturb_sentence_id": [ 12, 13 ], "output": { "perturbed_statement": "[paragraph id = 12] For this project, we created a bespoke taxonomy of errors (Table 1) based solely on translation errors reported in the original list of issues already present in the ZOOSubs system and relevant errors from previous work (§2). Error categories from the aforementioned sources were compiled together and curated to fit the study requirements. We uploaded a draft taxonomy to ZOOSubs, and the first author performed a test evaluation against a stream with segments to validate the list.", "perturbed_explanation": "The original explanation highlights the sources used to create the bespoke taxonomy of errors, which includes translation errors reported in previous work by Freitag et al., 2021, Sharou and Specia, 2022, the original list of issues from the ZOOSubs system, and relevant errors from previous work. 1. The statement is incorrect because it claims that the taxonomy was based solely on the original list of issues present in the ZOOSubs system and relevant errors from previous work, but it omits that the taxonomy was also based on translation errors reported in previous work by Freitag et al., 2021, and Sharou and Specia, 2022." } }, { "path": "table_paper/2407.00108v1.json", "table_id": "2", "section": "3.2", "all_context": [ "The human evaluation aspect of the study is interpreted as the effort required to post-edit the translations to a production standard, and captured in the number of errors, keystrokes and total edit time.", "The task was performed by professional HTs and PEs using ZOOSubs, an in-house software application belonging to ZOO Digital, built to facilitate manual translation of video material (Figure 1 ).", "The software s interface displays the video material along with timed subtitles in the original language.", "The target stream, i.e.", "the set of text boxes provided to the right of the source stream, is where the HTs input their translations to the desired language.", "It can optionally be pre-populated with “draft” translations – a setting we opted for in this study – allowing post-editors to edit, divide or combine the segments as they see fit.", "To make amendments to a segment, the PE needs to click on its box.", "From that point, the system tracks the time spent editing the box and the number of keystrokes made.", "These metrics are recorded for each box separately and taken into account only if the post-edited text differs from the original.", "After applying modifications, an Issues for event window appears for the user to specify the purpose of the changes by selecting errors from a predefined list, optionally providing text commentary.", "We leveraged this functionality of ZOOSubs to measure the total and average time and number of keystrokes made by HTs and PEs given some pre-existing translations.", "We also measured the number of selected errors.", "For this project, we created a bespoke taxonomy of errors (Table 1 ) based on translation errors reported in previous work [Freitag et al., 2021 , Sharou and Specia, 2022 ], the original list of issues already present in the ZOOSubs system and relevant errors from previous work (§2 ).", "Error categories from the aforementioned sources were compiled together and curated to fit the study requirements444We uploaded a draft taxonomy to ZOOSubs, and the first author performed a test evaluation against a stream with segments to validate the list.", "As a result, some errors were split into more granular categories, some were 
renamed and some generalised.", "The PEs operated on seven episodes from three TV series of varying genres: a fictional series about space exploration, a documentary exploring aspects of everyday life, and a family cooking competition show.", "They were unaware that some of the text they worked with was machine translated, but were told that it was for a research project and asked to relax some constraints such as adhering to the reading speed limits.", "In addition, we asked four HTs (two to German, two to French) to translate one episode of the cooking show from scratch in ZOOSubs so we could compare their effort to that of post-editors.", "For each of the seven episodes, the PEs were asked to post-edit one out of four versions of the text, corresponding to the list outlined in §3 .", "We included the human references (Ref) to account for the fact that PEs can sometimes post-edit a translation even when the original one is valid.", "Our setup ensured that the same PE evaluated the output for each episode exactly once (i.e.", "does not see two different versions of the same text) (Table 2 ).", "When referring to individual PEs, we use the notation PE.", "[L][i], where L {G (German), F (French)}, and i denotes the PE ID .", "The recruited PEs and HTs were professionals within the subtitle domain and freelance employees of ZOO Digital.", "They were informed that the undertaken work was carried out for a research project, but nevertheless, they were paid for their effort at competitive PE and HT rates, standard within the company for this type of work.", "Information about the PEs and HTs years of experience (YOE) was collected to shed more light on the findings (Table 3 ).", "They also answered a short survey about their views regarding machine translation, discussed in detail in §5.3 : Which one would you prefer: translating a stream from scratch or completing a quality check on (post-editing) a stream?", "What are your views on the use of machine translation in the industry?", "In your view, are there benefits to post-editing translations over translating from scratch?", "All French HTs had training in post-editing, and three out of four preferred it to translating from scratch, while no German HTs had received such training in the past, and all but one strictly preferred FST.", "All PEs had at least one YOE in post-editing and one and a half in the subtitle domain.", "Although the HTs within both pairs had a similar amount of experience in translation in general and in the subtitle domain ( for French vs for German), the French HTs had the advantage in terms of YOE in both subtitling (a mean difference of YOE) and post-editing (a mean difference of YOE).", "" ], "target_context_ids": [ 20, 21, 22 ], "selected_paragraphs": [ "[paragraph id = 20] Our setup ensured that the same PE evaluated the output for each episode exactly once (i.e.", "[paragraph id = 21] does not see two different versions of the same text) (Table 2 ).", "[paragraph id = 22] When referring to individual PEs, we use the notation PE." ], "table_html": "
<table>
<tr><th>Series</th><th colspan="2">A</th><th colspan="2">B</th><th colspan="3">C</th></tr>
<tr><th>Ep. ID</th><th>A1</th><th>A2</th><th>B1</th><th>B2</th><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>PE.1</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td></tr>
<tr><td>PE.2</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td></tr>
<tr><td>PE.3</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td></tr>
<tr><td>PE.4</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td></tr>
<tr><td>HT.1</td><td colspan="7">From Scratch</td></tr>
<tr><td>HT.2</td><td colspan="7">From Scratch</td></tr>
</table>
Table 2: Work assignment to PEs and HTs in the human evaluation campaign used for both language pairs.
", "perturb_sentence_id": [ 21, 22 ], "output": { "perturbed_statement": "[paragraph id = 21] sees two different versions of the same text) (Table 2). When referring to individual PEs, we use the notation PE.", "perturbed_explanation": "1. The original statement explains that the paragraph does not see two different versions of the same text. 2. The statement incorrectly claims that the paragraph does see two different versions of the same text, which contradicts the context provided in paragraph id 20, where it was ensured that the same PE evaluated the output for each episode exactly once, implying there should be no exposure to multiple versions of the same text for evaluation." } }, { "path": "table_paper/2407.00108v1.json", "table_id": "2", "section": "5.1", "all_context": [ "An initial inspection of the results indicated that each PE marked a significantly different total number of errors (e.g.", "PE.F1 marked errors total while PE.F4 marked ).", "This made direct comparison of the error counts across systems unreliable as each PE also post-edited a different number of segments for each system (cf.", "Table 2 ).", "With seven episodes and four different versions of the text, for each PE there is a version of text they would only have seen one episode from.", "For example, in Table 2 , PE.1 is assigned two episodes for Ref, MTCue and Google, but only one for Base-NMT.", "In this example, if PE.1 generally marked fewer errors than others, Base-NMT would be disproportionately rewarded.", "To make the measurements comparable, we normalised them by computing a normalisation coefficient for each PE and then multiplying their error counts for each category by their .", "Let denote the number of errors within the category for the -th PE.", "We compute the normalised count as described by Equation 1 .", "We report the total error counts as well as the normalisation multipliers in Table 4 .", "To facilitate post-editing in ZOOSubs, MT outputs had to be adapted to match the subtitle format.", "Quality checks of translations conducted in ZOOSubs normally require the users not just to ensure the correctness of translations but also that the subtitles comply with strict guidelines555This includes adhering to reading speed and length limits, balancing the length of the top and bottom subtitle, disambiguation of speaker turns with colours or dashes, and applying appropriate formatting, as specified by a style sheet..", "Typical MT systems, like the ones used in this project, are not designed to create translations conforming to these stringent guidelines, and the primary goal of this study was to identify the impact of the translation errors alone.", "To faithfully replicate the normal work environment of the PEs, we applied a greedy reformatting tool (built into ZOOSubs) to reformat our translations as subtitles.", "We made it clear that the project is centred on the correctness of translations, not the subtitle formatting.", "Still, to ensure that the translation and non-translation errors are kept separate, we included two environment-specific errors for the workers to select from: Subtitle formatting violation covering cases where the subtitle is not split to optimally adhere to segmentation guidelines; and Other style sheet non-compliance where a rule outlined in the style sheet from the client company was not followed, such as custom punctuation conventions.", "In some instances, a PE would encounter both translation and non-translation errors within the same segment, as exemplified in Example 
1 , where both translation errors (Mistranslation of by any chance and Formal/informal mismatch of you re doing) and non-translation errors (Subtitle formatting violation of the position of the subtitle break) are present.", "In such cases, we (i) disregard the non-translation error counts, and (ii) correct the effort rates (editing time and keystrokes) to account solely for translation-related errors.", "To precisely gauge the latter, we employed a correction method: let and be the total effort expended by a PE on a segment that had only non-translation and only translation errors marked, respectively.", "We calculated translation share (TS) as follows: We then used it to calculate the estimated share of the effort spent on translation in segments that had both errors marked by multiplying TS by the total effort spent on a segment with both error types.666For example, if a PE took three seconds for translation errors and two seconds for non-translation errors on average, where they marked both types we multiplied their total effort for that segment by .", "Finally, since the Other category was used substantially, we parsed the contents of the optional description text box.", "The most commonly reported Other errors were “Grammar”, “Punctuation”, “Timing”, “SGP” (spelling, grammar, punctuation) and “Literal translation”.", "Such errors () were removed from the Other category and pigeonholed as appropriate (e.g.", "“Grammar” as Fluency).", "More complex comments such as “wissen Sie should not be in the translation” were left categorised as Other ().", "The calculated normalised counts of errors within each category (Table 5 ) suggest that MTCue performs no worse than both non-contextual MT systems overall (row Total), while performing significantly better in the Context and Style categories in en-fr, pointing to gains related to the use of context information.", "The most frequently flagged errors in both language pairs were consistently Mistranslation and Fluency.", "Mistranslation was reported a similar number of times for all three machine translation systems in en-de and three times less frequently for post-editing Ref.", "This gap was similar in en-fr, though within the MT systems themselves, the Google system had a significantly higher error rate for Mistranslation errors ( mean) than the next best system, i.e.", "Base-NMT (); the contextual MTCue achieved an even lower rate of .", "Interestingly, MTCue also produced outputs of higher Fluency than other systems, even surpassing Ref for en-fr, though insignificantly at the selected confidence interval ().", "In both language pairs, the Omission error was consistently marked the fewest times in Google-generated text (see Translation quality Omission).", "In both cases, Ref scored significantly above the mean.", "This is unsurprising: translations authored by the general-purpose Google engine tend to be overly literal and faithful to the source, while in the domain of dialogue, the HT often needs to let go of individual features of the source text or opt for alternative expressions to maintain the brevity and dynamics of the source dialogue, leading to spontaneous omissions in the reference translations.", "To exemplify, Google consistently unnecessarily translated the English “(…), you know,” to “(…), wissen Sie,” in German, necessitating additional post-editing in our study.", "A similar error was typically avoided by the other systems, due to their data-learned preference for brevity and dynamically expressive language.", "As a 
result, both systems were marked with Omission more times than Google.", "In fact, MTCue scored even more Omissions than Base-NMT, suggesting that MTCue s omission behaviour more closely matches that of professional HTs.", "Other Translation quality errors were relatively infrequent and with insignificant differences between systems.", "To capture context-related issues, we provided categories for the most frequent contextual errors: Incorrect gender, Plural/singular form and Formal/informal mismatch.", "Since the perception of speaking style in dialogue is subjective and difficult to gauge, we did not provide explicit ways for the PEs to mark speaker style errors to avoid biasing them towards thinking in terms of what is a characteristic way of expression for the given speaker.", "Instead, we provided loose categories for Style, with the intention of collecting measurements of how often the PEs feel the need to alter the style of the translations.", "Since all of the post-edited content is dialogue, the style of the translation can be directly associated with the style of the speaker s expression.", "Our findings regarding some Context categories (Incorrect gender, Formal/informal mismatch) are consistent between the two language pairs, and MTCue was found to be superior in most categories in both cases, with the overall score for the Context category being significant at confidence for en-fr.", "The Plural/singular form error required few corrections in en-de (where Base-NMT was found superior to MTCue) and more in en-fr (where MTCue was found superior).", "The findings from the Style category also work in favour of contextual MT, where it was found comparable to non-contextual systems for the en-de pair and significantly better than them for the en-fr pair, requiring the fewest style-based adjustments, even fewer than Ref.", "Within the en-de pair, Subjective style changes were flagged only up to times per segments for any system, and a consistent number of times between systems, and Awkward style was flagged the fewest times for Ref ( on average), much less frequently than for the other systems, among which Google required the most edits and Base-NMT the fewest.", "Overall, our error count analysis suggests that within the en-fr pair, MTCue has significantly reduced the number of errors marked for contextual and stylistic reasons compared to non-contextual systems, while not degrading overall translation quality.", "The findings within the en-de pair are too variable to yield definitive conclusions but entail no degradation of quality leading from the inclusion of context, a significant improvement for contextual phenomena compared to Google, and highlight that MTCue makes the fewest contextual errors overall.", "" ], "target_context_ids": [ 3, 5, 6 ], "selected_paragraphs": [ "[paragraph id = 3] Table 2 ).", "[paragraph id = 5] For example, in Table 2 , PE.1 is assigned two episodes for Ref, MTCue and Google, but only one for Base-NMT.", "[paragraph id = 6] In this example, if PE.1 generally marked fewer errors than others, Base-NMT would be disproportionately rewarded." ], "table_html": "
<table>
<tr><th>Series</th><th colspan="2">A</th><th colspan="2">B</th><th colspan="3">C</th></tr>
<tr><th>Ep. ID</th><th>A1</th><th>A2</th><th>B1</th><th>B2</th><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>PE.1</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td></tr>
<tr><td>PE.2</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td></tr>
<tr><td>PE.3</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td></tr>
<tr><td>PE.4</td><td>MTCue</td><td>Google</td><td>Base-NMT</td><td>Ref</td><td>MTCue</td><td>Google</td><td>Base-NMT</td></tr>
<tr><td>HT.1</td><td colspan="7">From Scratch</td></tr>
<tr><td>HT.2</td><td colspan="7">From Scratch</td></tr>
</table>
Table 2: Work assignment to PEs and HTs in the human evaluation campaign used for both language pairs.
", "perturb_sentence_id": [ 5, 6 ], "output": { "perturbed_statement": "[paragraph id = 5] For example, in Table 2 , PE.1 is assigned three episodes for Ref, MTCue and Google, but only one for Base-NMT. In this example, if PE.1 generally marked fewer errors than others, Base-NMT would be disproportionately rewarded.", "perturbed_explanation": "Original Explanation: The statement highlights that PE.1 has fewer episodes assigned for Base-NMT than for Ref, MTCue, and Google. This implies that any error marking tendency of PE.1 could affect the perceived performance of Base-NMT, possibly rewarding it disproportionately. 2. The statement is incorrect because it mentions three episodes assigned to PE.1 for Ref, MTCue, and Google, whereas the original context specifies two episodes. This change misrepresents the actual distribution of episodes in Table 2, altering the comparative analysis originally presented in the statement." } }, { "path": "table_paper/2407.00108v1.json", "table_id": "3", "section": "3.2", "all_context": [ "The human evaluation aspect of the study is interpreted as the effort required to post-edit the translations to a production standard, and captured in the number of errors, keystrokes and total edit time.", "The task was performed by professional HTs and PEs using ZOOSubs, an in-house software application belonging to ZOO Digital, built to facilitate manual translation of video material (Figure 1 ).", "The software s interface displays the video material along with timed subtitles in the original language.", "The target stream, i.e.", "the set of text boxes provided to the right of the source stream, is where the HTs input their translations to the desired language.", "It can optionally be pre-populated with “draft” translations – a setting we opted for in this study – allowing post-editors to edit, divide or combine the segments as they see fit.", "To make amendments to a segment, the PE needs to click on its box.", "From that point, the system tracks the time spent editing the box and the number of keystrokes made.", "These metrics are recorded for each box separately and taken into account only if the post-edited text differs from the original.", "After applying modifications, an Issues for event window appears for the user to specify the purpose of the changes by selecting errors from a predefined list, optionally providing text commentary.", "We leveraged this functionality of ZOOSubs to measure the total and average time and number of keystrokes made by HTs and PEs given some pre-existing translations.", "We also measured the number of selected errors.", "For this project, we created a bespoke taxonomy of errors (Table 1 ) based on translation errors reported in previous work [Freitag et al., 2021 , Sharou and Specia, 2022 ], the original list of issues already present in the ZOOSubs system and relevant errors from previous work (§2 ).", "Error categories from the aforementioned sources were compiled together and curated to fit the study requirements444We uploaded a draft taxonomy to ZOOSubs, and the first author performed a test evaluation against a stream with segments to validate the list.", "As a result, some errors were split into more granular categories, some were renamed and some generalised.", "The PEs operated on seven episodes from three TV series of varying genres: a fictional series about space exploration, a documentary exploring aspects of everyday life, and a family cooking competition show.", "They were unaware that some of the text they worked with 
was machine translated, but were told that it was for a research project and asked to relax some constraints such as adhering to the reading speed limits.", "In addition, we asked four HTs (two to German, two to French) to translate one episode of the cooking show from scratch in ZOOSubs so we could compare their effort to that of post-editors.", "For each of the seven episodes, the PEs were asked to post-edit one out of four versions of the text, corresponding to the list outlined in §3 .", "We included the human references (Ref) to account for the fact that PEs can sometimes post-edit a translation even when the original one is valid.", "Our setup ensured that the same PE evaluated the output for each episode exactly once (i.e.", "does not see two different versions of the same text) (Table 2 ).", "When referring to individual PEs, we use the notation PE.", "[L][i], where L {G (German), F (French)}, and i denotes the PE ID .", "The recruited PEs and HTs were professionals within the subtitle domain and freelance employees of ZOO Digital.", "They were informed that the undertaken work was carried out for a research project, but nevertheless, they were paid for their effort at competitive PE and HT rates, standard within the company for this type of work.", "Information about the PEs and HTs years of experience (YOE) was collected to shed more light on the findings (Table 3 ).", "They also answered a short survey about their views regarding machine translation, discussed in detail in §5.3 : Which one would you prefer: translating a stream from scratch or completing a quality check on (post-editing) a stream?", "What are your views on the use of machine translation in the industry?", "In your view, are there benefits to post-editing translations over translating from scratch?", "All French HTs had training in post-editing, and three out of four preferred it to translating from scratch, while no German HTs had received such training in the past, and all but one strictly preferred FST.", "All PEs had at least one YOE in post-editing and one and a half in the subtitle domain.", "Although the HTs within both pairs had a similar amount of experience in translation in general and in the subtitle domain ( for French vs for German), the French HTs had the advantage in terms of YOE in both subtitling (a mean difference of YOE) and post-editing (a mean difference of YOE).", "" ], "target_context_ids": [ 25, 34, 35 ], "selected_paragraphs": [ "[paragraph id = 25] They were informed that the undertaken work was carried out for a research project, but nevertheless, they were paid for their effort at competitive PE and HT rates, standard within the company for this type of work." ], "table_html": "
<table>
<tr><th></th><th colspan="4">English-to-French</th><th colspan="4">English-to-German</th></tr>
<tr><th></th><th>PE.F1</th><th>PE.F2</th><th>PE.F3</th><th>PE.F4</th><th>PE.G1</th><th>PE.G2</th><th>PE.G3</th><th>PE.G4</th></tr>
<tr><td>Translation YOE</td><td>15</td><td>8</td><td>3</td><td>20</td><td>7</td><td>18</td><td>8</td><td>17</td></tr>
<tr><td>YOE in subtitles</td><td>8</td><td>6</td><td>1.5</td><td>20</td><td>7</td><td>5</td><td>8</td><td>7</td></tr>
<tr><td>YOE in post-editing</td><td>8</td><td>6</td><td>3</td><td>10</td><td>5</td><td>5</td><td>1</td><td>3</td></tr>
<tr><td>Post-editing training?</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>Prefer post-editing?</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
</table>
Table 3: Details regarding employed PEs.
", "perturb_sentence_id": [ 25, 26 ], "output": { "perturbed_statement": "[paragraph id = 25] They were informed that the undertaken work was carried out for a research project, but nevertheless, they were paid below standard PE and HT rates, standard within the company for this type of work. Information about the PEs and HTs years of experience (YOE) was collected to shed more light on the findings (Table 3 ).", "perturbed_explanation": "They were informed that the undertaken work was carried out for a research project, but nevertheless, they were paid for their effort at competitive PE and HT rates, standard within the company for this type of work. Information about the PEs and HTs years of experience (YOE) was collected to shed more light on the findings. 1. The work was completed for a research project, aligning with the context of research being the nature of the work. 2. Payment was provided at competitive rates, noting that the remuneration was aligned with company standards for the type of work involved. The statement is incorrect because it falsely claims they were paid below standard rates, while the original context specifies that they were compensated at competitive rates. Thus, the discrepancy lies in the level of payment being misrepresented." } }, { "path": "table_paper/2407.00108v1.json", "table_id": "4", "section": "5.1", "all_context": [ "An initial inspection of the results indicated that each PE marked a significantly different total number of errors (e.g.", "PE.F1 marked errors total while PE.F4 marked ).", "This made direct comparison of the error counts across systems unreliable as each PE also post-edited a different number of segments for each system (cf.", "Table 2 ).", "With seven episodes and four different versions of the text, for each PE there is a version of text they would only have seen one episode from.", "For example, in Table 2 , PE.1 is assigned two episodes for Ref, MTCue and Google, but only one for Base-NMT.", "In this example, if PE.1 generally marked fewer errors than others, Base-NMT would be disproportionately rewarded.", "To make the measurements comparable, we normalised them by computing a normalisation coefficient for each PE and then multiplying their error counts for each category by their .", "Let denote the number of errors within the category for the -th PE.", "We compute the normalised count as described by Equation 1 .", "We report the total error counts as well as the normalisation multipliers in Table 4 .", "To facilitate post-editing in ZOOSubs, MT outputs had to be adapted to match the subtitle format.", "Quality checks of translations conducted in ZOOSubs normally require the users not just to ensure the correctness of translations but also that the subtitles comply with strict guidelines555This includes adhering to reading speed and length limits, balancing the length of the top and bottom subtitle, disambiguation of speaker turns with colours or dashes, and applying appropriate formatting, as specified by a style sheet..", "Typical MT systems, like the ones used in this project, are not designed to create translations conforming to these stringent guidelines, and the primary goal of this study was to identify the impact of the translation errors alone.", "To faithfully replicate the normal work environment of the PEs, we applied a greedy reformatting tool (built into ZOOSubs) to reformat our translations as subtitles.", "We made it clear that the project is centred on the correctness of translations, not the subtitle 
formatting.", "Still, to ensure that the translation and non-translation errors are kept separate, we included two environment-specific errors for the workers to select from: Subtitle formatting violation covering cases where the subtitle is not split to optimally adhere to segmentation guidelines; and Other style sheet non-compliance where a rule outlined in the style sheet from the client company was not followed, such as custom punctuation conventions.", "In some instances, a PE would encounter both translation and non-translation errors within the same segment, as exemplified in Example 1 , where both translation errors (Mistranslation of by any chance and Formal/informal mismatch of you re doing) and non-translation errors (Subtitle formatting violation of the position of the subtitle break) are present.", "In such cases, we (i) disregard the non-translation error counts, and (ii) correct the effort rates (editing time and keystrokes) to account solely for translation-related errors.", "To precisely gauge the latter, we employed a correction method: let and be the total effort expended by a PE on a segment that had only non-translation and only translation errors marked, respectively.", "We calculated translation share (TS) as follows: We then used it to calculate the estimated share of the effort spent on translation in segments that had both errors marked by multiplying TS by the total effort spent on a segment with both error types.666For example, if a PE took three seconds for translation errors and two seconds for non-translation errors on average, where they marked both types we multiplied their total effort for that segment by .", "Finally, since the Other category was used substantially, we parsed the contents of the optional description text box.", "The most commonly reported Other errors were “Grammar”, “Punctuation”, “Timing”, “SGP” (spelling, grammar, punctuation) and “Literal translation”.", "Such errors () were removed from the Other category and pigeonholed as appropriate (e.g.", "“Grammar” as Fluency).", "More complex comments such as “wissen Sie should not be in the translation” were left categorised as Other ().", "The calculated normalised counts of errors within each category (Table 5 ) suggest that MTCue performs no worse than both non-contextual MT systems overall (row Total), while performing significantly better in the Context and Style categories in en-fr, pointing to gains related to the use of context information.", "The most frequently flagged errors in both language pairs were consistently Mistranslation and Fluency.", "Mistranslation was reported a similar number of times for all three machine translation systems in en-de and three times less frequently for post-editing Ref.", "This gap was similar in en-fr, though within the MT systems themselves, the Google system had a significantly higher error rate for Mistranslation errors ( mean) than the next best system, i.e.", "Base-NMT (); the contextual MTCue achieved an even lower rate of .", "Interestingly, MTCue also produced outputs of higher Fluency than other systems, even surpassing Ref for en-fr, though insignificantly at the selected confidence interval ().", "In both language pairs, the Omission error was consistently marked the fewest times in Google-generated text (see Translation quality Omission).", "In both cases, Ref scored significantly above the mean.", "This is unsurprising: translations authored by the general-purpose Google engine tend to be overly literal and faithful to the source, while 
in the domain of dialogue, the HT often needs to let go of individual features of the source text or opt for alternative expressions to maintain the brevity and dynamics of the source dialogue, leading to spontaneous omissions in the reference translations.", "To exemplify, Google consistently unnecessarily translated the English “(…), you know,” to “(…), wissen Sie,” in German, necessitating additional post-editing in our study.", "A similar error was typically avoided by the other systems, due to their data-learned preference for brevity and dynamically expressive language.", "As a result, both systems were marked with Omission more times than Google.", "In fact, MTCue scored even more Omissions than Base-NMT, suggesting that MTCue s omission behaviour more closely matches that of professional HTs.", "Other Translation quality errors were relatively infrequent and with insignificant differences between systems.", "To capture context-related issues, we provided categories for the most frequent contextual errors: Incorrect gender, Plural/singular form and Formal/informal mismatch.", "Since the perception of speaking style in dialogue is subjective and difficult to gauge, we did not provide explicit ways for the PEs to mark speaker style errors to avoid biasing them towards thinking in terms of what is a characteristic way of expression for the given speaker.", "Instead, we provided loose categories for Style, with the intention of collecting measurements of how often the PEs feel the need to alter the style of the translations.", "Since all of the post-edited content is dialogue, the style of the translation can be directly associated with the style of the speaker s expression.", "Our findings regarding some Context categories (Incorrect gender, Formal/informal mismatch) are consistent between the two language pairs, and MTCue was found to be superior in most categories in both cases, with the overall score for the Context category being significant at confidence for en-fr.", "The Plural/singular form error required few corrections in en-de (where Base-NMT was found superior to MTCue) and more in en-fr (where MTCue was found superior).", "The findings from the Style category also work in favour of contextual MT, where it was found comparable to non-contextual systems for the en-de pair and significantly better than them for the en-fr pair, requiring the fewest style-based adjustments, even fewer than Ref.", "Within the en-de pair, Subjective style changes were flagged only up to times per segments for any system, and a consistent number of times between systems, and Awkward style was flagged the fewest times for Ref ( on average), much less frequently than for the other systems, among which Google required the most edits and Base-NMT the fewest.", "Overall, our error count analysis suggests that within the en-fr pair, MTCue has significantly reduced the number of errors marked for contextual and stylistic reasons compared to non-contextual systems, while not degrading overall translation quality.", "The findings within the en-de pair are too variable to yield definitive conclusions but entail no degradation of quality leading from the inclusion of context, a significant improvement for contextual phenomena compared to Google, and highlight that MTCue makes the fewest contextual errors overall.", "" ], "target_context_ids": [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ], "selected_paragraphs": [ "[paragraph id = 0] An initial inspection of the results indicated that each PE marked a significantly 
different total number of errors (e.g.", "[paragraph id = 1] PE.F1 marked errors total while PE.F4 marked ).", "[paragraph id = 2] This made direct comparison of the error counts across systems unreliable as each PE also post-edited a different number of segments for each system (cf.", "[paragraph id = 3] Table 2 ).", "[paragraph id = 4] With seven episodes and four different versions of the text, for each PE there is a version of text they would only have seen one episode from.", "[paragraph id = 5] For example, in Table 2 , PE.1 is assigned two episodes for Ref, MTCue and Google, but only one for Base-NMT.", "[paragraph id = 6] In this example, if PE.1 generally marked fewer errors than others, Base-NMT would be disproportionately rewarded.", "[paragraph id = 7] To make the measurements comparable, we normalised them by computing a normalisation coefficient for each PE and then multiplying their error counts for each category by their .", "[paragraph id = 8] Let denote the number of errors within the category for the -th PE.", "[paragraph id = 9] We compute the normalised count as described by Equation 1 .", "[paragraph id = 10] We report the total error counts as well as the normalisation multipliers in Table 4 ." ], "table_html": "
<table>
<tr><th colspan="2">English-to-German</th><th colspan="2">English-to-French</th></tr>
<tr><th>PE ID</th><th>Error count</th><th>PE ID</th><th>Error count</th></tr>
<tr><td>PE.G1</td><td></td><td>PE.F1</td><td></td></tr>
<tr><td>PE.G2</td><td></td><td>PE.F2</td><td></td></tr>
<tr><td>PE.G3</td><td></td><td>PE.F3</td><td></td></tr>
<tr><td>PE.G4</td><td></td><td>PE.F4</td><td></td></tr>
</table>
Table 4: Error counts and values of the normalisation multiplier for each PE.
", "perturb_sentence_id": [ 1, 2 ], "output": { "perturbed_statement": "[paragraph id = 1] PE.F1 marked errors total while PE.F2 marked ).This made direct comparison of the error counts across systems unreliable as each PE also post-edited a different number of segments for each system (cf.", "perturbed_explanation": "1. Each post-editor (PE) marked a significantly different total number of errors, leading to unreliable comparisons of error counts across systems, as each PE edited a different number of segments for each system. 2. The statement is incorrect because it falsely claims that PE.F2 marked errors, instead of PE.F4. This introduces a factual inconsistency with the context which states that PE.F4 was involved in marking, while PE.F2 is not mentioned." } } ]