Practicing Data Analysis On Restaurant Inspection Data
In my quest to improve my data wrangling and data analysis skills, I’ve been learning Pandas, the Python data analysis toolkit. To put into practice what I’ve learned, I looked for a fairly basic dataset to dig into. I found just that in the Restaurants and Markets Violations dataset from L.A. County’s Open Data Website, launched in 2015.
I figured this dataset would yield some interesting tidbits. Perhaps I could find the most common violations issued? Or maybe the data would show how often restaurants get inspected?
Local media outlet NBC4 created an easy-to-use web interface for the dataset, and on that page they assert that “LA County policy requires three inspections per year of full-service restaurants.” Validating that assertion seemed like a good place to start my analysis. With Pandas, just a few lines of Python code produced the result:
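A minimal sketch of how such a count could be done, for anyone following along. It runs on a handful of made-up rows and is an illustration, not the notebook's actual code; the column names business_id and date are assumptions.

```python
import pandas as pd

# Toy stand-in for the inspections table (hypothetical rows and column names).
inspections = pd.DataFrame({
    "business_id": ["A", "A", "A", "B", "B", "C"],
    "date": pd.to_datetime([
        "2016-01-10", "2016-05-02", "2016-09-20",
        "2016-03-15", "2016-11-01",
        "2016-07-04",
    ]),
})

# Number of inspections each business received in each calendar year.
per_year = (
    inspections
    .assign(year=inspections["date"].dt.year)
    .groupby(["business_id", "year"])
    .size()
)

# How many business-years saw 1, 2, 3, ... inspections.
distribution = per_year.value_counts().sort_index()
print(distribution)
```

Plotting distribution as a bar chart gives the kind of "inspections per year" picture discussed here.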
Whoa, that’s not three per year. By far, most of the businesses that inspectors (er, Environmental Health Specialists) visit get just one or two inspections per year. Does that mean the requirement in this alleged County policy is not being met? Well, L.A. County’s Environmental Health department inspects not just restaurants but also convenience stores, food trucks, food markets and a number of other types of businesses. The dataset does have a Program Element column, but it holds various categories, including “low risk”, “moderate risk” and “high risk”, even just for the “restaurant” entries, so I’ll need to learn more about the details of this supposed County policy before drawing any conclusions.
But overall, most of these businesses see an inspector every 180 days, or even more frequently, as shown:
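One way the interval between inspections could be computed is sketched below. The rows are made up and the column names are assumptions; the 180-day threshold is the one mentioned above.

```python
import pandas as pd

# Hypothetical inspection rows for two businesses.
inspections = pd.DataFrame({
    "business_id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2016-01-01", "2016-04-10", "2016-07-19",
        "2016-01-01", "2016-12-26",
    ]),
})

inspections = inspections.sort_values(["business_id", "date"])

# Days since the same business's previous inspection (NaN for its first one).
inspections["gap_days"] = (
    inspections.groupby("business_id")["date"].diff().dt.days
)

# Average gap per business, and the share of businesses averaging <= 180 days.
avg_gap = inspections.groupby("business_id")["gap_days"].mean()
share_within_180 = (avg_gap <= 180).mean()
print(avg_gap)
print(share_within_180)
```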
And if you look at the duration that has passed since the last inspection, the majority of businesses were inspected in the past 90 days:
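The "time since last inspection" view could be sketched like this. The as_of date, the bucket edges and the column names are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical inspection rows.
inspections = pd.DataFrame({
    "business_id": ["A", "A", "B", "C", "D"],
    "date": pd.to_datetime([
        "2016-06-01", "2016-12-01", "2016-11-15", "2016-08-01", "2015-06-01",
    ]),
})

as_of = pd.Timestamp("2016-12-31")  # assumed "today" for the analysis

# Most recent inspection per business, then the days elapsed since it.
last_inspection = inspections.groupby("business_id")["date"].max()
days_since = (as_of - last_inspection).dt.days

# Bucket the elapsed days; the edges are illustrative, not from the dataset.
buckets = pd.cut(
    days_since,
    bins=[0, 90, 180, 365, float("inf")],
    labels=["0-90", "91-180", "181-365", "over 365"],
    include_lowest=True,
)
counts = buckets.value_counts().sort_index()
print(counts)
```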
I plan to check with the County to see if I can learn more about this policy and about what factors determine the different “risk” classifications in the Program Element column. Likely it means that I can exclude the categories for donut shops, juice bars and other businesses that probably need an inspection only about once a year. [Update: The County’s project manager responded to my e-mail inquiry and described the three risk classifications. The inspection frequency varies by risk category, with “high risk” being the only one requiring three inspections per year.]
Some businesses don’t average even one visit per year. Take Starbucks, for instance: 58 locations weren’t inspected at all in the past year. At the same time, the lowest grade from the last inspection at any of the 582 Starbucks locations was an A (a score of 90 or above). Thus less frequent inspections of Starbucks might seem a prudent use of resources. It was quite a joy to learn how little work it took to get Pandas to yield this type of information.
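A chain-level check like the Starbucks one might look roughly like the sketch below. The rows, the column names and the as_of date are assumptions, and matching on the business name is just one plausible way to pick out a chain.

```python
import pandas as pd

# Hypothetical business and inspection tables.
businesses = pd.DataFrame({
    "business_id": [1, 2, 3, 4],
    "name": ["STARBUCKS COFFEE #1", "STARBUCKS #2", "JOE'S DINER", "Starbucks Coffee"],
})
inspections = pd.DataFrame({
    "business_id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2016-02-01", "2016-10-01", "2015-09-01", "2016-11-01"]),
    "score": [93, 95, 92, 85],
})

as_of = pd.Timestamp("2016-12-31")       # assumed "today"
cutoff = as_of - pd.Timedelta(days=365)  # start of the past year

# Select the chain by name (case-insensitive substring match).
chain = businesses[businesses["name"].str.contains("starbucks", case=False, na=False)]

# Latest inspection row per business, attached to every chain location.
latest = inspections.sort_values("date").groupby("business_id").tail(1)
chain_latest = chain.merge(latest, on="business_id", how="left")

# Locations with no inspection on record, or none within the past year.
stale = chain_latest["date"].isna() | (chain_latest["date"] < cutoff)
not_inspected = chain_latest[stale]

# Lowest score among the chain's most recent inspections (90+ is an A).
lowest_last_score = chain_latest["score"].min()
print(len(not_inspected), lowest_last_score)
```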
All routine inspections occur unannounced. But would the County make a surprise inspection on a weekend, I wondered? Generally, nope:
Likely there are a small number of businesses open only (or primarily) on the weekend, which would explain why any inspections at all occur on Saturdays and Sundays. Assuming County health inspectors generally don’t work the observed holidays, which oftentimes fall on either a Monday or a Friday, it makes sense that over the course of a year those two days see fewer inspections than days in the middle of the week. But the holidays alone wouldn’t explain the size of the decrease. It’s probable that Mondays and Fridays are also the days workers most often take as personal time off.
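Counting inspections by day of the week takes very little code. The sketch below uses made-up dates and an assumed date column.

```python
import pandas as pd

# Hypothetical inspection dates (2016-10-03 was a Monday).
inspections = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-10-03",                # Monday
        "2016-10-04", "2016-10-04",  # Tuesday
        "2016-10-05", "2016-10-05",  # Wednesday
        "2016-10-06",                # Thursday
        "2016-10-08",                # Saturday
    ]),
})

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Count inspections per weekday name, keeping calendar order and zero-count days.
by_weekday = (
    inspections["date"].dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
print(by_weekday)
```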
Seeing that the days of the week differ on one measure, I became curious whether they differ on another: the average score given from one weekday to the next.
And certainly, there is a difference. A grade of B is more likely from an inspection on a Monday than on a Tuesday, and the chance of a B decreases progressively each day through to the weekend. I’m curious to know why this occurs but would need to take a closer look at the data. I would bet that the types of violations would give a clue as to what is different on a Monday versus a Thursday.
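A comparison of grades across weekdays could be sketched as a cross-tabulation. The rows are made up; the A cutoff (90 or above) comes from this post, while the B and C cutoffs used here (80 and 70) are assumptions.

```python
import pandas as pd

# Hypothetical inspections: two on Mondays, four on Tuesdays.
inspections = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-10-03", "2016-10-10",
        "2016-10-04", "2016-10-11", "2016-10-18", "2016-10-25",
    ]),
    "score": [95, 85, 92, 96, 88, 91],
})

inspections["weekday"] = inspections["date"].dt.day_name()

# Map scores to letter grades; only the A cutoff is taken from the post.
inspections["grade"] = pd.cut(
    inspections["score"],
    bins=[0, 69, 79, 89, 100],
    labels=["below C", "C", "B", "A"],
).astype(str)

# Share of each grade within each weekday (every row sums to 1).
grade_share = pd.crosstab(
    inspections["weekday"], inspections["grade"], normalize="index"
)
print(grade_share)
```

Normalizing by row matters here, since the weekdays have very different inspection counts and raw counts would mostly reflect volume.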
Without breaking it down by weekday, the violation data shows that no single violation stands out significantly in its frequency. There is a commonality, though. The top several violations all relate to not meeting the standards for cleanliness:
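Ranking violations by frequency is a one-liner with value_counts. In the sketch below the description strings are made-up placeholders, not values from the dataset, and the column name is an assumption.

```python
import pandas as pd

# Hypothetical violation rows; the descriptions are placeholders.
violations = pd.DataFrame({
    "description": [
        "Floors, walls, ceilings not clean",
        "Equipment not clean",
        "Floors, walls, ceilings not clean",
        "No hot water",
        "Equipment not clean",
        "Floors, walls, ceilings not clean",
    ],
})

# Most frequent violations, as raw counts and as a share of all violations.
top_counts = violations["description"].value_counts().head(2)
top_share = violations["description"].value_counts(normalize=True).head(2)
print(top_counts)
print(top_share)
```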
I then put on my entrepreneurial hat and wondered if there might be any business opportunities that present themselves after wrangling and analyzing this data.
Certainly, no restaurant operator wants a placard showing a low grade posted next to the front door. Some of these restaurant operators might even be willing to pay to help increase the likelihood of getting an A the next time the inspector comes around. There are food safety consultants who provide training or perform private inspection services, and there are providers of online training as well — all of whom might find this data useful in prospecting for clients.
The best prospect for such a food safety consultant, I figure, would be a restaurant that is due (or past due) for an inspection and still has a B grade (or worse) placard in the window. Again, with a minimal amount of effort Pandas delivered. Here’s a snippet from the resulting list of restaurants matching those criteria:
Bus. Id   Name           City            Last Insp. Grade
--------- -------------- --------------- ---------- -----
PR0151304 GALLER******** ROWLAND HEIGHTS 2016-08-03 B
PR0168975 TASTY ******** ALHAMBRA        2016-09-16 B
PR0168629 TARA'S******** LOS ANGELES     2016-09-20 B
PR0036788 5i IND******** CULVER CITY     2016-10-04 B
PR0009018 RANCHO******** LOS ANGELES     2016-10-11 B
PR0174628 PHO 87******** LOS ANGELES     2016-10-12 B
PR0170530 BIG SH******** LANCASTER       2016-10-12 B
PR0173321 CALI N******** WEST COVINA     2016-10-24 B
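A prospect list like this one could be built roughly as follows. Everything in the sketch is illustrative: the rows and business names are made up, the column names are assumptions, and due_after_days is a hypothetical threshold for "due for an inspection".

```python
import pandas as pd

# Hypothetical business and inspection tables.
businesses = pd.DataFrame({
    "business_id": ["PR1", "PR2", "PR3", "PR4"],
    "name": ["EXAMPLE CAFE", "SAMPLE GRILL", "DEMO DINER", "PLACEHOLDER PHO"],
    "city": ["LOS ANGELES", "ALHAMBRA", "LANCASTER", "WEST COVINA"],
})
inspections = pd.DataFrame({
    "business_id": ["PR1", "PR1", "PR2", "PR3", "PR4"],
    "date": pd.to_datetime([
        "2016-03-01", "2016-09-20", "2016-10-12", "2017-01-10", "2016-08-03",
    ]),
    "score": [95, 84, 96, 82, 88],
})

as_of = pd.Timestamp("2017-01-31")  # assumed "today"
due_after_days = 90                 # hypothetical threshold, not County policy

# Latest inspection per business, joined to the business details.
latest = inspections.sort_values("date").groupby("business_id").tail(1)
latest = businesses.merge(latest, on="business_id")

# Due (or past due) for an inspection AND last graded below an A (score < 90).
is_due = (as_of - latest["date"]).dt.days >= due_after_days
below_a = latest["score"] < 90
prospects = (
    latest[is_due & below_a]
    .sort_values("date")
    [["business_id", "name", "city", "date", "score"]]
)
print(prospects)
```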
That about does it for the analysis I’ve performed on this dataset. If there are any problems you see with my analysis or there is something else you would like me to take a look at, please let me know. My contact info is at the bottom of this post. [Update: I’ve since written another post on inspection grading.]
Yelp’s LIVES Restaurant Inspection Data Standard
While this was one of the first datasets I’ve analyzed using Pandas, it was a familiar experience. The dataset has missing data, duplicate data, data with erroneous values, inconsistencies, and more. That’s not atypical for most enterprise data though. Additionally, the dataset is denormalized into a single table, which makes the download file relatively large: 221MB!
I then learned that there exists a subset of this data, normalized (thus a smaller download, a ~5MB zip file) and available from a data feed created for Yelp called LIVES. L.A. County offers it openly to the public.
So that is the source of the data I decided to use when performing my analysis. Pandas has some capabilities such as “merge” that made quick work of some of the rudimentary data cleaning that needed to be done, which was a nice discovery. Additionally, using this data source now means the same analysis done by my Pandas code for L.A. County needs little change to work for any of the other nearly two dozen municipalities that have partnered with Yelp on LIVES.
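As an illustration of the kind of cleaning and merging involved, here is a small sketch on inline CSV text. It assumes a LIVES-style layout with separate business and inspection files keyed by business_id and dates written as YYYYMMDD; the rows and IDs are made up, and this is not the notebook's actual code.

```python
import io

import pandas as pd

# Inline stand-ins for the feed's business and inspection files (made-up rows).
businesses_csv = io.StringIO(
    "business_id,name,city\n"
    "X10,EXAMPLE CAFE,LOS ANGELES\n"
    "X10,EXAMPLE CAFE,LOS ANGELES\n"  # duplicate row
    "X20,SAMPLE GRILL,ALHAMBRA\n"
)
inspections_csv = io.StringIO(
    "business_id,score,date\n"
    "X10,95,20160301\n"
    "X10,84,20160920\n"
    "X20,,20161012\n"                 # missing score
    "X30,90,20161101\n"               # no matching business
)

# Drop duplicate businesses so each business_id appears once.
businesses = pd.read_csv(businesses_csv).drop_duplicates(subset="business_id")

# Read dates as text, then parse the YYYYMMDD format explicitly.
inspections = pd.read_csv(inspections_csv, dtype={"date": str})
inspections["date"] = pd.to_datetime(inspections["date"], format="%Y%m%d")
inspections = inspections.dropna(subset=["score"])

# Inner join keeps only inspections with a known business; validate guards
# against duplicate business_id values sneaking back in.
merged = inspections.merge(
    businesses, on="business_id", how="inner", validate="many_to_one"
)
print(merged)
```

Because the join only depends on the shared business_id key, pointing the two read_csv calls at another municipality's files should need little else to change.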
A read-only view of the Jupyter Notebook with my Pandas code shows the commands that produced the charts and other analysis for this post. If you would like to play around with this dataset and Pandas yourself, the project is on GitHub.
Hire me, I’m available — full time, part time, or contract/freelance.
Stephen Gornick
sgornick@gmail.com
+1.310.356.9912
About.me
https://www.linkedin.com/in/stephen-gornick-b8601311b
Github | Resumé