My Golf Data — Storage and Entry

Christopher Johnson
Geek Culture
Published in
7 min readJun 24, 2021

--

As I progress in my career in data, my eyes are always opened wider and wider to what is possible not only professionally but personally as well. For my next personal project, I wanted to build something that combined my career experience and a passion of mine, golf. I have been golfing off and on my whole life but as I get older I spend more and more time caring about how well I play, I thought it was time to put some analysis techniques on my game.

The first step in researching this project was to start looking at golf applications that will keep score and other data for me. There are a handful of applications that can keep score on your phone or a smartwatch but, I wanted to pull down my scores to analyze them personally. None of these applications I look at had an advertised public API, so I emailed or reached out in some way to ask if I could access the scores I put into their application. I received a could of responses that were “No”, in so many words, and never got a response from the rest of the companies. I even tried to reach out to the USGA (United States Golf Association) to see if they had an API I could use to gather information about courses and that led to another “No” as well. This led me down the direction of trying to build my own system to keep track of my scores.

After figuring out that I am going to need to build a system completely from scratch, I set out to design the database that would store the data. I landed on a two-schema system. The first one contains information about courses I played and the second for information about each round I played. The first schema, “Courses”, contains four tables; course_name, course_version, course_info, hole_info. This schema holds information that could be found on a scorecard and the most important tables are course_info and hole_info. Course_info holds all the information about a course at its tee level. This would be front nine and back nine distance, total distance, rating, and slope. Rating and slope are two different difficulty measures used to compare courses. Most golf courses have different tees that a person can play from. This tee difference will affect the distance and difficultly of the course being played. Even though I have the tees I enjoy the most I do play from different tees on the same course depending on who I am played with. Hole_info stores data about each hole at a course/tee level. Par, distance, and handicap being columns that should be highlighted in this table. Hole handicap is the ranking of how hard that particular hole is on the course. Side note — I think it would be interesting to try to debunk this ranking with my golf scores and statistical analysis. Couse_name holds identifiable information about a course and course_version is there in case a course ever goes through any renovations or changes. Below is a diagram of what the Courses schema looks like.

The second schema revolves around each round that is played and contains three tables; round_info, hole_scores, and note_definations. Round_info stores information about where each round was played, what tee was used, and even the weather during the round. Hole_scores holds data about each hole that was played during a round. It does not only contain the score of each hole but also, putts, fairways and greens in regulation hit, and sides missed on fairways and greens. There are also two note columns for each hole if necessary as well. I used notes to mark that something abnormal happened, usually bad. These notes are stored in abbreviations but the full name and its definition can be found in the note_definations table. Below is a diagram of what the Scores schema looks like.

These two schemas can connect by using course_id, course_version_id, and tee_played in the round_info table back to the course_info or hole_info tables.

Next up was data entry. I wanted my data entry process to be as simple as possible but, something I could easily repeat. Finding a data entry solution inside jupyter notebooks seemed like it would be ideal for my situation. I stumbled across a python package called ipywidgets and found what I was looking for. Ipywidgets is a python package that lets you build interactive HTML widgets inside of a jupyter notebook. These are very simplistic widgets that let you enter values, pick dates or select values from a list. This was my first time using the ipywidgets package but I would like to use this package again and make more complex widgets in the future. Most of the tables have their own jupyter notebook for data entry. Course_info and course_version share a notebook and round_info and hole_score share a notebook as well. Above and to the left is an example of what data entry looks like for a single hole. The full notebooks with all data entry can be found on my GitHub here: https://github.com/CBJohnson30/Golf-Scores

All of this database setup and data entry leads to analysis. As the data only comes in as I play golf, it will take time to create enough data to have proper analysis. This includes played individual courses enough to drive analysis from a single course and to compare courses. Luckily for my golf game, I live in Colorado where there are tons of different and great courses that I get to play but, that means it will most likely take until October for me to play the same course five times.
What I have done is some preliminary analysis around what type of hole I play the best when it comes to the par of that hole. I have data on 162 holes that I have played this year; 38 Par 3’s, 90 Par 4’s, and 34 Par 5’s. I started by looking at my average strokes above par for each type of hole.

As you can see above I play Par 5’s the best and Par 4’s the worst. None of this is surprising to me but I think the data may be a little skewed because of the amount of Par 4’s I have played compared to the amount of the other two types of holes. Even with playing few as the 90 holes, I am sure to have a few blow-up holes in there. What is a more interesting graph is where I compare each type of hole to the strokes I am compared to par. This can be seen below.

Two points jump off the graph to me. The first one is the variance of all three hole types. Par 4’s are spread across all six strokes to par while Par 3’s only appear on 4 of them. The other point is the closeness between the strokes to par of +2 and +3 for Par 5’s. These are only roughly .06% off. Meaning that I am almost as likely to triple a par 5 as I am to double the hole as well. This means I mostly need to keep my head up when I have a blow-up hole and improve my scrambling when I am struggling on a hole as well. I will be looking into this further and using the notes that I make on these holes to give me further insights.

I want to highlight one of the biggest things that I learned while creating and filling the Golf Scores database. This would be the difference in treating ID numbers as integers compared to varchars. In my previous experience, I have always been particular about treating identifiable or categorical values as strings even if they could be an integer. I use to work at a place where we would get data that could have an ID number with leading a zero or two. It is vital to the integrity of our data that those leading zeros were not dropped in the ETL pipeline. That habit carried over to this project as well. My ID numbers were created with the use of a smallserial data type when inserted into a table. That same smallserial ID would be a varchar datatype in a connecting table. If you look at my SQL queries you will find me casting a column at an int or varchar in my join statements. In this small database, this action takes no time at all but I would be worried about this datatype choice causing efficiency problems in a much larger database. In future projects, I will need to unify these data types to make my database easier to understand and query.

This is one of those projects that went in a different direction than when I started but, I glad it went this way. Building out a database myself is going to allow me to do more than I would be able to than If I was piggybacking off another application. Throughout this summer and fall, I continue to enter rounds of golf I play and once I have a little more data I will be driving more insights from this project this winter.

--

--

Christopher Johnson
Geek Culture

Data Scientist and Data Analyst trying out new techniques and always exploring new datasets. https://www.linkedin.com/in/christopher-johnson/