Reveal to Showcase Large Language Model Autocoder at FedCASIC 2025

Reveal senior data scientist Jackson Chen will present how his team has improved the U.S. Census Bureau’s American Community Survey autocoder at FedCASIC on April 22, 2025. The project leverages large language models to enhance the autocoder’s semantic understanding and improve its performance. Read on to learn about the methods Reveal used to develop the autocoder and to continuously monitor its performance.

Improving the American Community Survey (ACS) Autocoder with Large Language Models 

By Jackson Chen, Nicole Cabrera, Jiahui Xu, Yezzi Angi Lee 

Reveal Global Consulting developers are improving the U.S. Census Bureau’s American Community Survey (ACS) autocoder using large language models (LLMs) and natural language processing techniques. The ACS plays a vital role in helping local community leaders understand demographic changes and make informed decisions to better serve their communities. The current ACS autocoder was developed in 2012 and uses logistic regression to autocode write-in responses into occupation and industry codes. Currently, 30% of write-in responses are autocoded, while the other 70% are sent to the National Processing Center (NPC) to be clerically coded.

The new autocoder will be developed in Python to modernize the Census Bureau’s technology; the current autocoder is written in SAS (Statistical Analysis System). By switching to Python, we are moving toward open-source, well-supported technology that promotes longevity and saves costs. We will leverage the capabilities of LLMs to improve the technology and minimize the cost associated with clerical coding by sending fewer write-in responses to the NPC.

Additionally, Reveal Global Consulting is improving the quality control of the autocoder. The current quality control monitors autocoding performance with only a few checks on the process. We want to provide more in-depth analysis and statistics on how the autocoder assigns occupation and industry codes. With more performance statistics, errors and trends can be recognized quickly, and fixes can be deployed to keep the autocoder performing consistently.

Leveraging Large Language Models  

LLMs have been on the rise in the last few years and represent a great leap forward in natural language processing. Part of the ACS asks respondents about their occupation and the industry that occupation is in. The write-in responses include the occupation title, occupation description, company name, and industry description. The survey also collects basic demographic information, but we focus mainly on what respondents provide about their occupation and industry. The current autocoder uses logistic regression, which assumes the write-in responses can be classified by a simple model; in practice, human language is far more complex and cannot be classified so easily.

Our team uses the embeddings generated by LLMs to perform semantic search. There are many ways to describe someone’s job and the industry it is in; semantic search helps us find the occupation and industry that most closely match the response.

How semantic search works

We use an occupation and industry index containing the many occupation and industry titles shared on the Census Bureau’s public website. The index is converted into embeddings and stored in a vector database. Each write-in response is then embedded and used to query the vector database, which returns the top N closest matches.
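The retrieval step can be sketched as follows. This is a minimal, self-contained illustration, not the production system: a toy hashed bag-of-words function stands in for real LLM embeddings, an in-memory array stands in for the vector database, and the index titles and codes are made up.

```python
import zlib

import numpy as np

def embed(text, dim=64):
    """Toy stand-in for an LLM embedding model: a hashed bag-of-words
    vector, L2-normalized so cosine similarity is a plain dot product."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical occupation index entries (titles and codes are illustrative).
INDEX = [
    ("registered nurse", "3255"),
    ("software developer", "1021"),
    ("truck driver", "9130"),
]
INDEX_VECS = np.stack([embed(title) for title, _ in INDEX])

def top_n(response, n=2):
    """Embed the write-in response, score it against every index entry,
    and return the n closest (title, code, similarity) matches."""
    scores = INDEX_VECS @ embed(response)
    order = np.argsort(scores)[::-1][:n]
    return [(INDEX[i][0], INDEX[i][1], float(scores[i])) for i in order]
```

In the real pipeline the index is far larger and a vector database handles the nearest-neighbor search, but the shape of the computation is the same: embed once at index time, embed each response at query time, rank by similarity.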

Afterwards, we apply hardcoding and industry restrictions. Some occupations are bound by education level and/or industry, so certain codes are hardcoded. For example, an accountant with a bachelor’s degree or higher has a different code than an accountant with an associate’s degree or lower. For industry restrictions, certain occupations can only appear in certain industries, or have multiple industries tied to them. For example, “agents” could be in the real estate or entertainment industry, and depending on which industry they are in, a different occupation code is assigned.
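The restriction step can be sketched as a small rule table applied after semantic search. All occupation titles, codes, and rule keys below are illustrative placeholders, not real Census codes.

```python
# Hypothetical education-based hardcoding rules:
# (occupation, meets bachelor's threshold?) -> code
EDUCATION_RULES = {
    ("accountant", True): "0800",   # bachelor's degree or higher
    ("accountant", False): "5120",  # associate's degree or lower
}

# Hypothetical industry restrictions for titles tied to multiple industries:
# occupation -> {industry: code}
INDUSTRY_RULES = {
    "agent": {"real estate": "4920", "entertainment": "2700"},
}

def apply_restrictions(occupation, industry, has_bachelors, default_code):
    """Override the semantic-search result when a hardcoding or
    industry-restriction rule applies; otherwise keep the default."""
    if (occupation, has_bachelors) in EDUCATION_RULES:
        return EDUCATION_RULES[(occupation, has_bachelors)]
    if occupation in INDUSTRY_RULES and industry in INDUSTRY_RULES[occupation]:
        return INDUSTRY_RULES[occupation][industry]
    return default_code
```

Keeping these rules in data tables rather than scattered conditionals makes them easy to audit and update as coding guidelines change.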

LLMs can be fine-tuned to adapt to the domain they are used in and provide better performance than the pre-trained base model. We fine-tuned our model to better understand the job titles and industries it is given. To further improve performance, we produced two fine-tuned models: one fine-tuned on occupation data and another fine-tuned on industry data. This process provided a significant boost in our autocoder’s performance.

Autocoding process with a large language model.

Adding Additional Metrics to Quality Check 

As occupations and industries come and go, our autocoder will have to adapt to those changes. Misspellings and complex responses are additional challenges the autocoder faces, and quality checks can identify these problems so that the necessary fixes can be made.

Quality checks play a key role in monitoring and improving the autocoder. Coders at the NPC are rigorously trained to code write-in responses correctly, so we want the autocoder to match how the clerical coders code. We compare the autocoder’s results with the clerical coders’ results to form a match rate that represents how well the autocoder agrees with the clerical coders.
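A minimal sketch of this comparison, assuming the autocoder’s codes and the clerical coders’ codes are available as parallel lists:

```python
def match_rate(autocoded, clerical):
    """Fraction of responses where the autocoder's code agrees with the
    clerical coder's code (a simplified version of the comparison)."""
    assert len(autocoded) == len(clerical)
    matches = sum(a == c for a, c in zip(autocoded, clerical))
    return matches / len(autocoded)
```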

A few additional metrics that we have introduced are: 

  1. Most common false pairs 

    This metric outputs pairs of codes: 1) the correct, clerically assigned code and 2) the incorrect autocoded one, along with a count of how many times each pair appears. It helps us identify codes that are repeatedly autocoded incorrectly.

  2. Matching rate by occupation and industry sectors 

    We separate the results and group them by their respective sectors to see how the autocoder performs in each sector. Some sectors perform better than others, and this allows us to decide where to focus improvements, whether in the autocoder itself or in the occupation/industry index.

  3. Single word matching 

    The monthly write-in data contains two columns holding the respondent’s occupation and the industry that occupation is in. Most of these responses are a single word. With this information, we can see how many different codes are assigned to these one-word responses and analyze how difficult they are to autocode.
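The first two metrics can be sketched in a few lines, assuming parallel lists of autocoded codes, clerical codes, and sector labels (all codes and labels below are illustrative):

```python
from collections import Counter

def false_pairs(autocoded, clerical):
    """Count (correct clerical code, incorrect autocoded code) pairs,
    most frequent first, to surface repeatedly missed codes."""
    return Counter(
        (c, a) for a, c in zip(autocoded, clerical) if a != c
    ).most_common()

def match_rate_by_sector(autocoded, clerical, sectors):
    """Match rate computed separately for each sector label."""
    totals = Counter(sectors)
    hits = Counter()
    for a, c, s in zip(autocoded, clerical, sectors):
        if a == c:
            hits[s] += 1
    return {s: hits[s] / totals[s] for s in totals}
```

Run monthly over the coded batch, these two summaries flag both the specific codes the autocoder keeps missing and the sectors where it lags.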

With these new metrics, the autocoder’s performance can be checked every month, and any trends or errors can be identified. NPC costs and burden will go down as we shift more write-in responses to the autocoder. Through continuous checking and improvement, quality checks help maintain and sustain the autocoder over the long term.

Looking Ahead 

The use of LLMs has proven to be a huge leap in technology’s ability to understand complex human language and decipher semantic meaning. While the ACS autocoder is just one place where the Census Bureau has begun testing LLMs, the approach can be applied to other surveys and adapted to any domain through fine-tuning. The Reveal team is grateful to be working alongside the Census Bureau to modernize its operations and looks forward to continued engagement.
