I spent three months at the Internal Revenue Service building a model development system to help prioritize incoming reports of possible tax violations. Through the Coding it Forward Civic Digital Fellowship program, I was partnered with the Tax-Exempt/Government Entities office to build a model that prioritizes incoming tips based on estimated tax-relevant merit. These categorizations help IRS agents decide which referrals to inspect first, reducing the time it takes for the most serious fraud concerns to be investigated.
I had access to several years' worth of previous reports and their eventual outcomes. Reports came in a variety of formats, from scanned documents to official form-fillable PDFs. After standardizing the data input, I created a modular system to train and test various types of machine learning models, from regressions to neural nets. The system iterates through combinations of the available feature sets, learners, and model hyperparameters to produce a short list of the best candidate models. I then created an interactive tool for evaluating this short list that guides the user in selecting the best fit.
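As a rough illustration of that search loop (not the code used at the IRS), the sketch below enumerates every combination of feature set, learner, and hyperparameter values, scores each one, and keeps the top few. All names here (`FEATURE_SETS`, `LEARNERS`, `placeholder_score`) are hypothetical, and the scoring function is a stand-in for what would really be training a model and measuring it on held-out reports.

```python
from itertools import product

# Hypothetical feature sets and learner grids, for illustration only.
FEATURE_SETS = {
    "text_only": ["report_text"],
    "text_plus_form": ["report_text", "form_fields"],
}
LEARNERS = {
    "logistic_regression": {"C": [0.1, 1.0]},
    "neural_net": {"hidden_units": [16, 32], "learning_rate": [0.01]},
}


def enumerate_candidates(feature_sets, learners):
    """Yield every (feature set name, learner name, hyperparameter dict) combination."""
    for fs_name, (learner_name, grid) in product(feature_sets, learners.items()):
        keys = sorted(grid)
        # Cartesian product of the value lists for this learner's hyperparameters.
        for values in product(*(grid[k] for k in keys)):
            yield fs_name, learner_name, dict(zip(keys, values))


def shortlist(candidates, score_fn, top_k=3):
    """Score each candidate and return the top_k (score, candidate) pairs, best first."""
    scored = [(score_fn(*candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def placeholder_score(fs_name, learner_name, params):
    """Stand-in scorer (an assumption): a real one would train and validate a model."""
    return len(FEATURE_SETS[fs_name]) + 0.1 * len(params)


if __name__ == "__main__":
    for score, candidate in shortlist(
        enumerate_candidates(FEATURE_SETS, LEARNERS), placeholder_score
    ):
        print(round(score, 2), candidate)
```

Keeping the enumeration separate from the scoring is what makes a design like this modular: a new learner or feature set is one more dictionary entry, and the evaluation step only ever sees the short list.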
The pilot project showed early promise. With the preliminary front-runner model I developed, high-priority referrals would reach IRS agents an average of 51 days sooner than under a first-in-first-out process. Future iterations aim to incorporate historical data beyond the reports themselves, as well as other open-source information.
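One simple way to estimate a "days faster" figure like this is to compare where each high-priority referral sits in a first-in-first-out queue with where it sits once the queue is sorted by model score. The sketch below does that under loud assumptions: the whole backlog is present at once, agents review a fixed number of referrals per day, and the toy backlog is made-up data, not IRS data. The function name and record fields are hypothetical, and this is not necessarily how the 51-day figure was computed.

```python
def mean_days_saved(backlog, reviews_per_day=1):
    """Average days saved for high-priority referrals versus first-in-first-out.

    backlog: list of dicts in arrival order, each with a model "score" and a
    "high_priority" flag (the eventual outcome). Assumes a fixed review rate.
    """
    # Indices of referrals ordered by model score, highest first.
    ranked = sorted(range(len(backlog)), key=lambda i: backlog[i]["score"], reverse=True)
    priority_day = {idx: pos // reviews_per_day for pos, idx in enumerate(ranked)}

    savings = [
        i // reviews_per_day - priority_day[i]  # FIFO review day minus prioritized review day
        for i, referral in enumerate(backlog)
        if referral["high_priority"]
    ]
    return sum(savings) / len(savings) if savings else 0.0


# Toy backlog (fabricated for illustration only).
TOY_BACKLOG = [
    {"score": 0.2, "high_priority": False},
    {"score": 0.9, "high_priority": True},
    {"score": 0.1, "high_priority": False},
    {"score": 0.8, "high_priority": True},
    {"score": 0.3, "high_priority": False},
    {"score": 0.7, "high_priority": True},
]

if __name__ == "__main__":
    print(mean_days_saved(TOY_BACKLOG))  # 2.0 for this toy backlog
```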