Mining Strava's cycling data

keithg · on May 26, 2014

If you guys are interested in data-mining Strava, here is some interesting info on their metro project and how they are using it to effect change.

http://blog.strava.com/arent-we-all-people-for-bikes-7783/

http://metro.strava.com/

nl · on May 26, 2014

Interesting. You should link to the segments you are talking about. (Edit: I see you did now, sorry. Edit2: Wow, Ryan Sherlock has that KOM in Ireland and the KOM on Old La Honda in California. Nice work by a non-Protour rider!)

Amongst the questions that I thought it might be interesting to attack are the following..

I think you'll find many people are using Strava data to attack the problem of developing a model for maximal power output over time. Veloclinic[1] and DJConnel[2] are probably the best places to start for reading about this.

The "cleaning data" section surprised me. In my experience maybe 10% of (manually inspected) efforts have significant problems, but you seem to have thrown away a lot more than that on Col du Tourmalet and Alpe d’Huez. Stocking Lane looks roughly in the order I'd expect. Any idea what is going on there?

it is likely that a person will cycle a 15% gradient towards the end of a climb more slowly than the 15% near the beginning.

Not really. Judging your maximum speed on a climb is a significant problem and it goes both ways (especially for faster cyclists). Hence the "Chris Froome looking at his stem" meme[3] - he is notorious for climbing based on power output (which maximises your overall velocity).

Regarding altitude, you are best to look at the device the data came from. Phones are notorious for getting altitude data wrong, whereas Garmin devices are pretty good. Using Google data is mixed: the gradient of the road changes significantly depending on what line you take, and that isn't replicated in Google's data.

Regarding your power equation, you realize Strava already does this (as power estimates), right? [4] is a pretty decent JavaScript version. The problem is that it is very weight-dependent, and getting an exact weight for a bike+rider+clothes is pretty hard from random Strava data.
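To make the weight dependence concrete, here is a minimal sketch of the usual steady-state cycling power model (gravity, rolling resistance and aerodynamic drag, no wind). It is not Strava's estimator or the calculator at [4]; the function name and the default values for `crr`, `cda`, `rho` and `drivetrain_eff` are assumptions chosen only for illustration.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def climbing_power(speed_ms, grade, total_mass_kg,
                   crr=0.005, cda=0.35, rho=1.2, drivetrain_eff=0.97):
    """Estimate rider power (W) needed to hold speed_ms on a constant grade.

    grade is rise over run (0.08 means 8%). All default coefficients are
    illustrative assumptions, not measured values.
    """
    theta = math.atan(grade)
    gravity = total_mass_kg * G * math.sin(theta)          # force along the slope
    rolling = total_mass_kg * G * math.cos(theta) * crr    # rolling resistance
    aero = 0.5 * rho * cda * speed_ms ** 2                 # drag in still air
    return (gravity + rolling + aero) * speed_ms / drivetrain_eff


if __name__ == "__main__":
    speed = 15 / 3.6  # 15 km/h in m/s
    for mass in (70, 75, 80):  # bike + rider + clothes, kg
        print(mass, "kg ->", round(climbing_power(speed, 0.08, mass)), "W")
```

On a steep climb the gravity term dominates, so an error of a few kilograms in the assumed total mass moves the estimate by a similar percentage.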

[1] Hard to read, but eg: http://veloclinic.tumblr.com/post/85194606798/ward-smith-cp-...

[2] eg: http://djconnel.blogspot.com.au/2014/05/numerical-testing-of...

[3] http://chrisfroomelookingatstems.tumblr.com/

[4] http://www.kreuzotter.de/english/espeed.htm

thrownaway2424 · on May 26, 2014

Android vs. Garmin altitude data seems like an irrelevant distinction, as Strava keeps and maintains forever the altitude profile of the very first person who rode from A to B, no matter how ridiculous the data. On the route I ride most often, up Tunnel Rd in Oakland, California, it's a climb of about 1000 feet over about 4 miles, depending on where you reckon the start and end, but on the whole length it's just monotonically uphill. There are no down grades. But the Strava data is all over the place. It goes up and down on small scales constantly. It says there's a -47% grade at one point. The profile looks like the business end of a backsaw, when in reality it's just up.

To make any sense of this data requires substantial low-pass filtering over their altitude data. I don't know why Strava doesn't try to clean this up.
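As a rough illustration of the low-pass filtering mentioned above, the sketch below smooths an altitude profile with a centered moving average and compares the resulting grades. It assumes roughly evenly spaced samples, which real GPS data does not guarantee; the function names and the window size are hypothetical, and a moving average is only one of many possible filters.

```python
def moving_average(values, window=5):
    """Moving average over full windows only; the edges are dropped."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def smooth_profile(distance_m, altitude_m, window=5):
    """Return (distances, smoothed altitudes) aligned on window centers."""
    half = window // 2
    smoothed = moving_average(altitude_m, window)
    return distance_m[half:len(distance_m) - half], smoothed


def grades(distance_m, altitude_m):
    """Grade (rise over run) between consecutive samples."""
    pairs = zip(zip(distance_m, altitude_m),
                zip(distance_m[1:], altitude_m[1:]))
    return [(a1 - a0) / (d1 - d0) for (d0, a0), (d1, a1) in pairs if d1 > d0]


if __name__ == "__main__":
    # Synthetic steady 6% climb sampled every 10 m with +/-1 m sawtooth noise.
    distance = [10.0 * i for i in range(50)]
    altitude = [0.06 * d + (1.0 if i % 2 == 0 else -1.0)
                for i, d in enumerate(distance)]
    print("raw min grade:", min(grades(distance, altitude)))
    sd, sa = smooth_profile(distance, altitude)
    print("smoothed min grade:", min(grades(sd, sa)))
```

On the synthetic profile the raw data shows spurious descents while the smoothed data stays uphill throughout; a wider window removes more noise at the cost of flattening genuine short ramps.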

paulmach · on May 26, 2014

Using the API you can get both the Segment altitude data and the Segment Effort altitude data. The segment data is what's displayed on the website and can be bad. But the Segment Effort data would be a subset of the data from the activity. So you can get many many versions of the altitude for the segment and do any type of analysis you wish.
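One possible analysis along these lines: resample each effort's altitude onto a common distance grid and take the pointwise median, so a single bad recording cannot dominate. The sketch below assumes the effort streams have already been downloaded as `(distance_m, altitude_m)` list pairs with non-decreasing distances; the helper names and the 50 m step are hypothetical, and no API call is shown.

```python
from statistics import median


def interpolate(xs, ys, x):
    """Linearly interpolate ys at x; xs must be non-decreasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def median_profile(efforts, step_m=50.0):
    """Median altitude across efforts on a shared distance grid.

    efforts is a list of (distance_m, altitude_m) pairs. Distances are
    re-based to start at zero, since an effort is a slice of a longer ride.
    """
    rebased = [([d - dist[0] for d in dist], alt) for dist, alt in efforts]
    length = min(dist[-1] for dist, _ in rebased)  # shortest effort wins
    grid = [i * step_m for i in range(int(length // step_m) + 1)]
    profile = [median(interpolate(dist, alt, g) for dist, alt in rebased)
               for g in grid]
    return grid, profile


if __name__ == "__main__":
    efforts = [
        ([0, 100, 200], [10, 20, 30]),
        ([500, 600, 700], [10, 20, 30]),   # same climb, later in a ride
        ([0, 100, 200], [10, 80, 30]),     # one glitchy altitude sample
    ]
    print(median_profile(efforts, step_m=100.0))
```

The median is used instead of the mean so that occasional wild samples, like the glitch in the third effort, are ignored rather than averaged in.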

nl · on May 26, 2014

Can I ask why Strava doesn't do something about the Segment altitude data?

paulmach · on May 26, 2014

Priorities....

nl · on May 26, 2014

I think the Strava Streams API call[1] can include the altitude data. Haven't tried the v3 API though so I'm not sure, but the documentation indicates it is there.

[1] http://strava.github.io/api/v3/streams/

ocfnash · on May 26, 2014

Author here. I just had a look at those links and they're super. I know what I'll be reading when I get back from this morning's cycle!

Regarding the fact that I threw out so many segment efforts for the Col du Tourmalet and Alpe d’Huez (especially compared to Stocking Lane): a large part of this is explained by the fact that I only keep efforts for people who do not stop before the end. I just reran the data cleaning code to get the stats for Col Du Tourmalet. The first round of cleaning (which is the most severe) throws out 1,643 segment efforts and the breakdown is: isInconsistentDistances (difference of first and last 'distance' numbers in effort more than 2% away from official segment distance) : 442 isInadmissableAvgSpeed (journeys that are too slow) : 36 isInconsistentTimes (difference of first and last 'time' numbers in effort not equal to reported time for that effort) 1 isNotAlwaysMoving ('elapsed_time' != 'moving_time' in Strava's effort summary) 1164

Regarding the altitude data, thanks for the tip. If I return to do more, I may see if it's possible to use only data from Garmin devices. I would guess that for the curves for faster riders (the upper decile curves) a lot of the riders are using Garmins so I might even be able to see the effect if I tried to look for it.

Regarding the power equation, I did not know that! I'll have to have a look at what Strava are doing. They continue to impress me with the number of features they have.

ocfnash · on May 26, 2014

Oh dear my table of effort censoring frequencies got all messed up and is annoying to read. Here is a slightly less garbled version:

isInconsistentDistances (difference of first and last 'distance' numbers in effort more than 2% away from official segment distance) : 442

isInadmissableAvgSpeed (journeys that are too slow) : 36

isInconsistentTimes (difference of first and last 'time' numbers in effort not equal to reported time for that effort) : 1

isNotAlwaysMoving ('elapsed_time' != 'moving_time' in Strava's effort summary) : 1164
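The four counts sum to the 1,643 discarded efforts, which suggests each effort is attributed to a single reason. A minimal sketch of such a first cleaning pass follows. It is a guess at the shape of the checks, not the author's code: the snake_case names, the dict layout, the order of the checks, the 1 m/s speed floor, and the reading of "reported time" as `elapsed_time` are all assumptions.

```python
from collections import Counter


def is_inconsistent_distances(effort, segment_distance_m, tolerance=0.02):
    """Distance covered differs from the official length by more than 2%."""
    covered = effort["distance"][-1] - effort["distance"][0]
    return abs(covered - segment_distance_m) > tolerance * segment_distance_m


def is_inadmissible_avg_speed(effort, segment_distance_m, min_speed_ms=1.0):
    """Average speed is below an assumed floor (the threshold is hypothetical)."""
    if effort["elapsed_time"] <= 0:
        return True
    return segment_distance_m / effort["elapsed_time"] < min_speed_ms


def is_inconsistent_times(effort):
    """Time stream span does not match the reported elapsed time."""
    return effort["time"][-1] - effort["time"][0] != effort["elapsed_time"]


def is_not_always_moving(effort):
    """The rider stopped at some point during the effort."""
    return effort["elapsed_time"] != effort["moving_time"]


def first_rejection(effort, segment_distance_m):
    """Name of the first failed check, or None if the effort is kept."""
    if is_inconsistent_distances(effort, segment_distance_m):
        return "is_inconsistent_distances"
    if is_inadmissible_avg_speed(effort, segment_distance_m):
        return "is_inadmissible_avg_speed"
    if is_inconsistent_times(effort):
        return "is_inconsistent_times"
    if is_not_always_moving(effort):
        return "is_not_always_moving"
    return None


def rejection_counts(efforts, segment_distance_m):
    """Tally rejection reasons; kept efforts are counted under None."""
    return Counter(first_rejection(e, segment_distance_m) for e in efforts)
```

Because each effort is counted under the first check it fails, changing the order of the checks changes the breakdown but not the total discarded.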

nl · on May 26, 2014

On the power thing, Djconnel has some derivatives for the effects of random wind on speed, and also grade variation on VAM which may interest you.

nl · on May 26, 2014

Regarding the speed over time thing for climbs, here's a good example. This is a local 1.3 km climb at an average of 12%: http://www.strava.com/segments/872359?filter=overall

The equal KOM used a power meter to keep his power as flat as possible. You can see that his speed varied with the gradient more than with distance into the climb: http://www.strava.com/activities/114411950/analysis
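The effect can be sketched numerically: hold power fixed and solve the steady-state power model for speed at different gradients. This is a simplified illustration, not an analysis of the linked ride. The 350 W and 75 kg figures, the coefficients, and the function names are all assumptions, and the bisection is only valid for flat or uphill grades, where required power increases with speed.

```python
import math


def required_power(speed_ms, grade, mass_kg, crr=0.005, cda=0.35, rho=1.2):
    """Power at the wheel (W) to hold speed_ms on a grade, in still air."""
    theta = math.atan(grade)
    resistive_force = (mass_kg * 9.81 * (math.sin(theta) + crr * math.cos(theta))
                       + 0.5 * rho * cda * speed_ms ** 2)
    return resistive_force * speed_ms


def speed_at_power(power_w, grade, mass_kg, lo=0.0, hi=30.0, iterations=60):
    """Bisect for the speed (m/s) at which required_power equals power_w."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if required_power(mid, grade, mass_kg) < power_w:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for grade in (0.08, 0.12, 0.16):
        v = speed_at_power(350.0, grade, 75.0)  # assumed power and mass
        print(f"{grade:.0%}: {v * 3.6:.1f} km/h")
```

With power held constant, the model's speed depends only on the gradient under the wheels, not on how far into the climb the rider is, which matches the pattern described for the power-paced effort.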