The 2020 census revealed that Cleveland’s population contracted another 6% since 2010. It’s down 22% since 2000. The surrounding population in Cuyahoga County dropped from 2000 to 2010, but recovered somewhat over the next decade. Despite Cleveland’s struggle, its downtown core has been growing. In fact, it doubled from about 6,300 in 2000 to over 13,000 in 2020. You can see maps and other details on my GitHub page.
img:hover { transform: scale(1.05); } Today I fine tuned an image recognition deep learning model, created a web interface with Gradio, and deployed it on Hugging Face Spaces. This project was the subject of lesson 2 of fast.ai’s Deep Learning for Coders course.
1. Fine-tuning a Model in Kaggle Deep learning models need to train on a GPU (Graphics Processing Unit) because CPUs are too slow. I followed fast.
I created a chatbot in Shiny by coding along with James Wade’s three-part YouTube tutorial series (#1, #2, and #3), and added ideas I learned from Alejandro AO’s embedding tutorial in python.
You can try it out on shinyapps.io. Unfortunately, the OpenAI API service is not free. I pay a fraction of a penny for every call to it, but the pennies add up. I made the API key a setting the user must enter.
William Poundstone’s Rock Breaks Scissors (Amazon) shows how people are often predictable in their efforts to be random. For instance, from the title, a good strategy for playing Rocks, Paper, Scissors is to start with paper because your opponent is likely to lead with the aggressive rock. The insights are neat throughout, but I’m especially delighted with the simple applications of probability. Let’s play with a few from the chapter on Chapanis’s random number experiment
Retrosheet is one of several sources of detailed baseball statistics.1 It is unique in that it is the only one (that I know of) that curates play-by-play event files of games. Their event files reach back to 1914. Unlike other sources that summarize teams or games, these files summarize individual plays all the way down to pitch sequences. For example, I used my database to find out whether changing pitchers mid-inning is related to game duration.
There are three techniques usually used to deal with missing values:
Ignore them and just use complete cases. This is acceptable if <5% of cases are incomplete and missingness is random (Azur, 2004)
Impute values with their mean, median, or mode. A mean imputation leaves the overall mean unchanged, but artificially reduces the variable’s variance (Alice, 2018).
Impute values with multivariate imputation by chained equations (MICE). This method creates multiple predictions of each missing value, allowing the researcher to account for uncertainty in the imputations.
For my new survival analysis project on baseball career longevity, I am using the TR Plaza font adopted by the Cleveland Guardians for the 2022 season. Style choices shouldn’t be gratuitous, but I feel like using the Guardians font for my graphics connects statistical analysis to the underlying topic of baseball. Anyway, it’s cool, so I’m doing it.
Before After I ran into difficulties adding the font, so I’m capturing the process here to smooth the way for you and for future me.