In my previous posts, I documented my findings from studying my blood pressure (BP) data collected from a consumer blood pressure cuff and correlating it to data from other activities, including travel and exercise. Part One covered the purpose and procedure. Part Two presented the results and conclusions. In this entry, I’ll describe the tools I used to perform the analysis and highlight some lessons learned.
Marshaling the Data
The data for this project came from a variety of sources:
- Blood pressure data from the Withings iPhone app
- Exercise data from an iPhone app called MapMyRun
Note: walk/run and BP data are also available through Apple’s Health app
- Miscellaneous data recorded in spreadsheets or on paper
Just aggregating the data into a single format proved to be a challenge. MapMyRun does not support data export in the free version, and Apple’s app exports data in an inconvenient XML format. I ended up entering exercise data by hand into a spreadsheet, which was manageable for the 100+ days of data I had available.
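For anyone who would rather not enter the data by hand, the XML export can be flattened with a few lines of Python. This is only a sketch, not what I actually did: it assumes the export consists of Record elements carrying type, startDate and value attributes, and the sample document, the record type strings and the daily_totals helper are illustrative assumptions rather than a verified description of Apple's format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample that mimics the assumed layout of a Health export.
SAMPLE_XML = """<HealthData>
  <Record type="HKQuantityTypeIdentifierDistanceWalkingRunning" unit="mi"
          startDate="2017-03-01 07:10:00 -0500" value="2.5"/>
  <Record type="HKQuantityTypeIdentifierDistanceWalkingRunning" unit="mi"
          startDate="2017-03-01 18:00:00 -0500" value="1.0"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2017-03-01 07:10:00 -0500" value="4000"/>
  <Record type="HKQuantityTypeIdentifierDistanceWalkingRunning" unit="mi"
          startDate="2017-03-02 07:05:00 -0500" value="3.0"/>
</HealthData>"""


def daily_totals(xml_text, record_type):
    """Sum the values of one record type for each calendar date."""
    totals = defaultdict(float)
    root = ET.fromstring(xml_text)
    for record in root.iter("Record"):
        if record.get("type") != record_type:
            continue
        date = record.get("startDate")[:10]  # keep only YYYY-MM-DD
        totals[date] += float(record.get("value"))
    return dict(totals)


distance_by_date = daily_totals(
    SAMPLE_XML, "HKQuantityTypeIdentifierDistanceWalkingRunning"
)
```

For a real export file you would read from disk with ET.parse instead of ET.fromstring. A large export may call for ET.iterparse to keep memory use down.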
The existing spreadsheet data was easy to use, but the data recorded on paper needed to be manually added to the spreadsheet. Only the Withings app made the export process easy.
Conveniently, all of the data was keyed by date, enabling me to create a spreadsheet that had one row for each date, followed by columns for each data source. The BP data was a minor exception because it had multiple readings per day. I condensed those into a single line by averaging the values.
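The averaging step is simple enough to do in a spreadsheet, but it can also be sketched in Python. The readings below are made-up values, and the average_by_date helper is a hypothetical name used for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical readings: (date, systolic, diastolic). Not real data.
readings = [
    ("2017-03-01", 120, 80),
    ("2017-03-01", 130, 84),
    ("2017-03-02", 118, 76),
]


def average_by_date(rows):
    """Condense multiple readings per day into one averaged pair per date."""
    grouped = defaultdict(list)
    for date, systolic, diastolic in rows:
        grouped[date].append((systolic, diastolic))
    return {
        date: (
            mean(s for s, _ in values),
            mean(d for _, d in values),
        )
        for date, values in sorted(grouped.items())
    }


daily_bp = average_by_date(readings)
```

Here daily_bp maps each date to a single (systolic, diastolic) pair, which is the shape needed for a one-row-per-date table.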
In the end, I had one spreadsheet containing all the data. A “rectangular” data set makes it easy to use tools like Excel and Orange, rather than having to write a program to manage a more complicated data format.
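To illustrate what "rectangular" means here, the sketch below merges two date-keyed sources into one row per date and writes the result as CSV, which Excel and Orange can both open. The column names, sample values and build_table helper are assumptions made for the example. Dates missing from a source are left blank rather than guessed.

```python
import csv
import io

# Hypothetical per-date data from two sources. Not real data.
bp_by_date = {"2017-03-01": (125, 82), "2017-03-02": (118, 76)}
miles_by_date = {"2017-03-01": 3.5, "2017-03-03": 2.0}

FIELDS = ["date", "systolic", "diastolic", "miles"]


def build_table(bp, miles):
    """Return one dict per date, with a column for each data source."""
    rows = []
    for date in sorted(set(bp) | set(miles)):
        systolic, diastolic = bp.get(date, ("", ""))
        rows.append({
            "date": date,
            "systolic": systolic,
            "diastolic": diastolic,
            "miles": miles.get(date, ""),  # blank when there is no entry
        })
    return rows


table = build_table(bp_by_date, miles_by_date)

# Write to an in-memory buffer; swap in open("combined.csv", "w", newline="")
# to produce a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(table)
csv_text = buffer.getvalue()
```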
Rectangular, largely numeric data is right at home in Excel, so that’s a natural starting point for any analysis.
Much of the data required minor clean-up, such as splitting a date/time field into separate date and time fields and averaging multiple readings taken on the same day. Pivot tables and graphs were useful as a way to quickly explore the data set to see if anything stood out. Some of Excel’s graphs are limited to 256 data points, and I had 333 for this analysis, so I couldn’t make full use of those features.
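The date/time split is the kind of clean-up that is also easy to script. A minimal sketch, assuming timestamps look like "2017-03-01 07:10:00" (the actual export format may differ, and split_timestamp is a hypothetical helper):

```python
from datetime import datetime


def split_timestamp(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a combined date/time string into separate date and time strings."""
    parsed = datetime.strptime(stamp, fmt)
    return parsed.date().isoformat(), parsed.time().isoformat()


date_part, time_part = split_timestamp("2017-03-01 07:10:00")
```

The fmt argument is there because exports rarely agree on a timestamp layout. Pass a different format string if yours differs.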
There are some tricks and best practices that dramatically increase Excel’s value. Spend an hour with Joel Spolsky (former program manager for Excel) to find out how to maximize Excel’s functionality.
R is the natural choice for any sort of data analysis, but for this exercise, I was interested in exploring Orange, a graphical data analysis tool backed by Python.
Orange makes it really easy to browse a data table, visualize the data, and even apply machine learning techniques to the data set. I used Orange to generate all of the graphs in my previous blog post.
Some of the features require an understanding of data science, but a lot of the statistical functions allow you to easily compare subsets of data to find patterns that would be difficult to discover in Excel.
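To give a sense of what comparing subsets means, here is the same idea in plain Python rather than in Orange. It is a sketch only: the rows are invented values, not my results, and compare_subsets is a hypothetical helper.

```python
from statistics import mean

# Hypothetical rows from a one-row-per-date table. Not real data.
rows = [
    {"date": "2017-03-01", "systolic": 118, "miles": 3.5},
    {"date": "2017-03-02", "systolic": 130, "miles": 0.0},
    {"date": "2017-03-03", "systolic": 122, "miles": 2.0},
    {"date": "2017-03-04", "systolic": 126, "miles": 0.0},
]


def compare_subsets(table, column, predicate):
    """Return the mean of `column` for rows matching `predicate` and for the rest.

    Raises statistics.StatisticsError if either subset is empty.
    """
    matching = [row[column] for row in table if predicate(row)]
    rest = [row[column] for row in table if not predicate(row)]
    return mean(matching), mean(rest)


exercise_mean, rest_mean = compare_subsets(
    rows, "systolic", lambda row: row["miles"] > 0
)
```

A difference between the two means is only a starting point. Orange’s statistical widgets are what help you judge whether a gap like this is meaningful.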
This exercise in real-world data analytics taught several lessons.
- Do a test run first! It’s worth planning out your analysis and running it on a small set of data before scaling to the full data set. Trial and error on 100 data points is far less painful than on 1,000.
- Tool knowledge is essential. Mastery of your tool set leads to better analysis and quicker results. But a real problem is also an opportunity to learn more about the tools, because you run into obstacles and have to figure out how to get past them.
- Take a lot of notes. Exporting, aggregating and analyzing data requires numerous manual steps. In a formal environment, you’d use technology to automate the steps and make them repeatable. It’s not worthwhile to do that for a small process you’re only going to repeat a few times. But without an automated process, it’s essential to write things down: the steps taken, solutions created and any “gotchas” you encountered. Especially the gotchas.
They say “all science is becoming data science.” Perhaps in the next decade, the fundamental techniques required to do data science will be incorporated into the tools that scientists use. In the meantime, there’s a disparate set of increasingly powerful and free (or cheap) tools to help you wrangle data and a growing universe of data sets available for learning.