Reveal to Present Research on Survey Translation at Premier Forum

In the article below, the Reveal Team set out to answer the question “how do individuals perceive and interpret AI-translated texts compared to professionally translated texts?” The research was selected to be presented at the American Association for Public Opinion Research (AAPOR) 80th Annual Conference in May 2025.

AI-Assisted Survey Translation: A Mixed Methods Approach

Taylor J. Wilson, Yezzi Angi Lee, Nicole Cabrera, and Aruna Peri

The Problem

Collecting data from non-English-speaking populations presents a significant challenge for the federal statistical system. It requires careful translation of survey materials and instructions, and sometimes second-language proficiency among interviewers for interview-based data collection. Failing to address these needs can leave populations hard to count (HTC), leading to complications in survey processing and statistical weighting and making it difficult to generate adequately representative samples. Recognizing this issue, statistical agencies have taken proactive steps, such as producing Spanish-language survey materials and securing additional funding for targeted outreach. However, creating accurate and effective translations remains a difficult task. Professional translations are costly, and in the context of federal surveys, finding the necessary funds to achieve high-quality translations can be challenging.

In the United States, Spanish is the most widely spoken language other than English. According to the American Community Survey, 22% of people speak a language other than English at home, 13.4% speak Spanish at home, and 8.4% speak English less than "very well."[1] Ensuring that survey materials targeting nationally representative samples are available in Spanish is therefore critical, and other non-Spanish-language communities in the United States also need to be accurately captured. These statistics highlight the importance of language translation for collecting the most robust and accurate information possible from respondents.

Large Language Models (LLMs) for Language Translation

Our team identified mass translation as an early use case for LLMs, focusing on their application across different teams that handle Spanish-language materials. To support the data collection phase of the survey lifecycle, we tested the models on survey questionnaires to assess their translation accuracy. The goal was to determine whether these models could be a cost-effective way of producing supplementary survey materials to help address the problem described above. The LLM space evolves rapidly, with new multilingual models released frequently. We set out to assess the best-in-class models at the time of this research and compare them to professionally translated versions of the survey materials. Because we expect new models to keep appearing, we also wanted to provide a reusable evaluation framework that could be applied as models improve and that captures the nuances we determined to be important in the survey space.

Mixed Methods Approach

Quantitative measures of translation accuracy center on a well-maintained source of truth against which an AI translation is compared. The simplest approach is to measure character and word error rates between the two translations. In that framing, "error" assumes the professionally translated item is correct, so any deviation from it implies lower model quality. Rather than adopt this framing, we treat the AI translation agent as simply another translator and think in terms of translation disagreement rather than error. From a quantitative perspective, the closer the two translations are in words and characters, the less disagreement is observed, suggesting the AI agent is at least on par with a professional translator.
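
To make this concrete, a minimal sketch of such a disagreement metric is shown below in Python; the edit-distance formulation and the example sentences are illustrative assumptions rather than the exact metric or items used in the study.

    def edit_distance(a, b):
        """Levenshtein distance between two sequences (characters or word lists)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def disagreement(professional, ai):
        """Character- and word-level disagreement between two translations.

        Neither translation is treated as ground truth; the rates simply
        quantify how far apart the two renderings are.
        """
        char_rate = edit_distance(professional, ai) / max(len(professional), 1)
        prof_words, ai_words = professional.split(), ai.split()
        word_rate = edit_distance(prof_words, ai_words) / max(len(prof_words), 1)
        return {"char_disagreement": char_rate, "word_disagreement": word_rate}

    # Example with two hypothetical Spanish renderings of the same item
    print(disagreement(
        "¿Con qué frecuencia realiza actividad física cada semana?",
        "¿Con qué frecuencia hace ejercicio cada semana?",
    ))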

This presents an interesting research question: how do individuals perceive and interpret AI-translated texts compared to professionally translated texts? To test this, we designed an experiment that leverages cognitive interviewing, a qualitative research method that asks participants to "think aloud" as they interpret survey items. The approach uncovers how respondents mentally process a question, what words or phrases create confusion, and whether their interpretation matches the construct the survey is meant to capture. By comparing responses between professionally translated items and LLM-generated items, we can identify not only whether the translations differ but why those differences matter. We chose the Behavioral Risk Factor Surveillance System (BRFSS) from the CDC for this evaluation because its questionnaires had already been professionally translated into Spanish. We assessed several candidate LLMs for this work and ultimately used GPT-4o for its ease of use and availability.
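
As an illustration of the translation step, the sketch below shows how a single survey item might be sent to GPT-4o through the OpenAI Python client; the prompt wording, temperature setting, and example item are assumptions for this sketch, not the study's actual configuration.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is available in the environment

    def translate_item(english_item):
        """Request a Spanish rendering of one survey item (illustrative prompt)."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are translating survey questions into Spanish for "
                            "Puerto Rican respondents. Preserve the meaning and "
                            "measurement intent of each item."},
                {"role": "user", "content": english_item},
            ],
            temperature=0,  # deterministic output for reproducible comparisons
        )
        return response.choices[0].message.content.strip()

    # Example item; the wording here is illustrative, not an official BRFSS item
    print(translate_item(
        "During the past 30 days, how many days did you feel that your "
        "mental health was not good?"
    ))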

To ensure an unbiased evaluation, we presented both the AI-translated and professionally translated versions of each question to the participant and masked which was which. We asked them to determine whether they had a strict preference (i.e., was one clearly better than the other, or were the two at least equivalent?). We varied this across several question categories that we assigned, including complexity level (e.g., amount of nuance, multi-part structure), ambiguity, and cultural sensitivity. For the initial exercise, we targeted Puerto Rican respondents because of existing response rate issues on the island. However, we have since extended the research to rarer languages in the United States, such as Telugu, with plans to evaluate LLM translation performance across a wider spectrum of languages. Participation in the evaluation was incentivized with a $20 Amazon gift card.
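
A rough sketch of how this masked, order-randomized presentation could be prepared is shown below; the labels, field names, and example translations are hypothetical and only illustrate the blinding idea.

    import random

    def build_masked_pair(item_id, professional_text, ai_text, rng=random):
        """Present the two translations under neutral labels in random order,
        keeping a hidden key that records which label came from which source."""
        versions = [("professional", professional_text), ("ai", ai_text)]
        rng.shuffle(versions)
        labels = ["Version A", "Version B"]
        presented = {label: text for label, (_, text) in zip(labels, versions)}
        key = {label: source for label, (source, _) in zip(labels, versions)}
        return {"item_id": item_id, "presented": presented, "answer_key": key}

    # Example: the interviewer and respondent only ever see "Version A" / "Version B"
    pair = build_masked_pair(
        item_id="physical_activity",
        professional_text="¿Con qué frecuencia realiza actividad física?",
        ai_text="¿Con qué frecuencia hace ejercicio?",
    )
    print(pair["presented"])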

Results

A total of 25 Puerto Rican Spanish-speaking adults were recruited for the study. Qualitative analysis of the cognitive interviews revealed four major themes:

  1. Terminology Confusion and Mismatches: Regardless of translation type, certain survey terms were difficult to understand. Respondents often identified errors or unnatural phrasing in the professional translations, suggesting a need for more accessible and culturally appropriate terminology.

  2. Language Formality and Cultural Alignment Affect Perception of Translated Texts: The Puerto Rican respondents noted differences in formality and tone between translations, with professional translations often perceived as more formal and AI translations as more casual. Reactions to these formality differences were mixed, with some respondents highlighting that formal language was appropriate in official or unfamiliar settings, while others preferred more casual phrasing that felt more natural or closer to everyday Puerto Rican Spanish. They noted that the setting, context, and purpose of the survey would be important in determining preference; in professional situations where formality is expected, that expectation would shape which translation they liked best.

  3. Cognitive Barriers and Aids to Comprehension: The respondents noted differences in length, sentence structure, and phrasing across the two translation types, which influenced how easily the translated questions were understood. Longer or more complex sentences, more common in the professional translations, were sometimes described as harder to follow, while shorter or more concise phrasing, often found in the AI translations, was perceived by some respondents as easier to comprehend.

  4. AI vs. Professional Translation Preferences and Perceptions: Respondents overall did not gravitate toward one translation method when asked blindly. Preference was often shaped by contextual understanding, cultural appropriateness and tone, and familiarity with the terminology used in the survey items.

Results from the surveys indicated no major difference in participants’ comprehension of AI- vs. professionally translated survey items overall. However, further analysis surfaced factors, such as educational attainment, that may influence how participants perceive translation clarity and quality and shape their overall preferences. These findings indicate that AI-translated survey questions perform on par with professionally translated texts, offering comparable levels of comprehension, and can be leveraged as a cost-effective alternative method of translation for survey data collection.


[1] U.S. Census Bureau, “Language Spoken at Home,” American Community Survey.
