Age and Sex Analysis Of Microsoft USA MVPs
All text and code copyright (c) 2016 by Jamie Dixon. Used with permission.
Original post dated 2016-12-25 available at https://jamessdixon.wordpress.com/2016/12/25/age-and-sex-analysis-of-microsoft-usa-mvps/
By Jamie Dixon
A couple of weeks ago, this came across my Twitter feed:
I participated in this hackathon (well, helped run the F# one). My response was:
I was surprised that I got into this exchange with a Microsoft PM:
That last comment by me was inspired by Mark Twain: “never wrestle with a pig. You just get dirty and the pig likes it.” But it did get me to thinking about the composition of the US MVPs. I did an analysis a couple of years ago of the photos of the Microsoft MVPs (found here and here) so it made sense to follow up on that code and see if I was wrong about my “middle age white guy” hypothesis. I could get the photos from the MVP site and pass them into the Microsoft Cognitive Services API for facial analysis for age/sex data. Using F# made the analysis a snap.
A nice thing about the Microsoft MVP website is that it is public and has photos of the MVPs. Here is one of the pages:
and when you look at the source of the page, each of those photos has a distinct uri:
I opened up Visual Studio and created a new F# project. I went into the script file and brought in the libraries to do some http requests. I then created a couple of functions to pull down the HTML of each of the 19 pages and put it into 1 big string:
open System
open System.IO
open System.Net

// Download the HTML for one page of the US MVP search results.
let getPageContents(pageNumber:int) =
    let uri = new Uri("http://mvp.microsoft.com/en-us/search-mvp.aspx?lo=United+States&sl=0&browse=False&sc=s&ps=36&pn=" + pageNumber.ToString())
    let request = WebRequest.Create(uri)
    request.Method <- "GET"
    use response = request.GetResponse()
    use stream = response.GetResponseStream()
    use reader = new StreamReader(stream)
    reader.ReadToEnd()

// Fetch all 19 pages and concatenate them into one string.
let contents =
    [|1..19|]
    |> Array.map(fun i -> getPageContents i)
    |> Seq.reduce(fun x y -> x + y)
(OT: Since I did a map followed by a reduce in the last two lines, does that mean I am working with “Big Data”?)
I then created a quick parser to find only the uris of the photos in all of the HTML.
open System.Text.RegularExpressions

// Pull every photo path out of the HTML and turn it into an absolute uri.
let getUrisFromPageContents(pageContents:string) =
    let pattern = @"/PublicProfile/Photo/\d+"
    let matchCollection = Regex.Matches(pageContents, pattern)
    matchCollection
    |> Seq.cast<Match>
    |> Seq.map(fun m -> m.Value)
    |> Seq.map(fun v -> "https://mvp.microsoft.com/en-us" + v + "?language=en-us")
    |> Seq.toArray

let uris = getUrisFromPageContents contents
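To see what the pattern picks out, here is a minimal sketch run against a made-up HTML fragment. The `sampleHtml` string, the photo ids in it, and the `extractPhotoUris` name are invented for illustration; the real page markup may well differ.

```fsharp
open System.Text.RegularExpressions

// Hypothetical fragment, not copied from the MVP site.
let sampleHtml =
    """<img src="/PublicProfile/Photo/12345" /><img src="/PublicProfile/Photo/678" />"""

// Same idea as getUrisFromPageContents: match the relative photo paths,
// then turn each one into an absolute uri.
let extractPhotoUris (html: string) =
    Regex.Matches(html, @"/PublicProfile/Photo/\d+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> "https://mvp.microsoft.com/en-us" + m.Value + "?language=en-us")
    |> Seq.toArray

let sampleUris = extractPhotoUris sampleHtml
```

The verbatim string (`@"..."`) keeps the `\d` in the pattern from being read as a string escape.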
Sure enough, I got 684 uris for MVP photos. I then wrote another Web Request to pull down each of the photos and save them to disk:
// Save one photo to disk under a random GUID file name.
let saveImage (uri:string) =
    use client = new WebClient()
    let id = Guid.NewGuid()
    let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos\" + id.ToString() + ".jpg"
    client.DownloadFile(Uri(uri),path)

uris
|> Seq.iter saveImage
And I now have all 684 photos on disk.
I did not bring down the names of the MVPs – instead using a GUID to randomize the photos, but a name analysis would also be interesting. With the photos now local, I could then upload them to Microsoft Cognitive Services API to do facial analysis. You can read about the details of the API here. I created a third web request to pass the photo up and get the results from the API:
open System.Net.Http
open System.Net.Http.Headers
open System.Threading
open System.Web

// Post one photo to the Face API and return the raw JSON, or None on a non-OK status.
let getOxfordResults path =
    let queryString = HttpUtility.ParseQueryString(String.Empty)
    queryString.Add("returnFaceId","true")
    queryString.Add("returnFaceLandmarks","false")
    queryString.Add("returnFaceAttributes","age,gender")
    let uri = "https://api.projectoxford.ai/face/v1.0/detect?" + queryString.ToString()
    let bytes = File.ReadAllBytes(path)
    use client = new HttpClient()
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key","xxxxxxxxxxx")
    use content = new ByteArrayContent(bytes)
    content.Headers.ContentType <- MediaTypeHeaderValue("application/octet-stream")
    let result = client.PostAsync(uri,content).Result
    // Pause between calls to stay under the API rate limit.
    Thread.Sleep(TimeSpan.FromSeconds(5.0))
    match result.StatusCode with
    | HttpStatusCode.OK -> Some (result.Content.ReadAsStringAsync().Result)
    | _ -> None
Notice that I put a 5 second sleep into the call. This is because Microsoft throttles the requests to 20 per minute. Also, since a request can fail and some of the photos do not have a face, I used the F# option type. The results come back from the Microsoft Cognitive Services API as JSON. To parse the results, I used the F# Data JSON type provider:
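The arithmetic behind the pause can be sketched as follows. The `minDelay` helper and the other names are hypothetical, added only for illustration; the 20 per minute limit, the 5 second pause, and the 684 photos come from the text.

```fsharp
open System

// Smallest pause between calls that stays within a per-minute quota.
let minDelay (requestsPerMinute: int) =
    TimeSpan.FromSeconds(60.0 / float requestsPerMinute)

let required = minDelay 20                      // pause needed for 20 calls per minute
let used = TimeSpan.FromSeconds(5.0)            // pause used in getOxfordResults
let callsPerMinute = 60.0 / used.TotalSeconds   // rate the 5 second pause allows

// Time spent sleeping across all 684 photos, ignoring request latency.
let sleepMinutes = 684.0 * used.TotalSeconds / 60.0
```

A 3 second pause would already satisfy the limit, so 5 seconds leaves some headroom at the cost of close to an hour of waiting overall.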
open FSharp.Data

type FaceInfo = JsonProvider<Sample="[{\"faceId\":\"83045097-daa1-4f1c-8669-ed012e9b5975\",\"faceRectangle\":{\"top\":187,\"left\":209,\"width\":214,\"height\":214},\"faceAttributes\":{\"gender\":\"male\",\"age\":42.8}}]">

// Return Some (age, gender) for the first face found, or None if there is nothing usable.
let parseOxfordResuls results =
    match results with
    | Some r ->
        let face = FaceInfo.Parse(r)
        match Seq.length face with
        | 0 -> None
        | _ ->
            let header = face |> Seq.head
            Some(header.FaceAttributes.Age,header.FaceAttributes.Gender)
    | None -> None
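For readers without the type provider installed, the same extraction can be sketched with `System.Text.Json` instead. This is a swapped-in technique, not what the post uses, and it assumes a runtime where `System.Text.Json` is available (for example `dotnet fsi`). The `tryFirstFace` name is hypothetical; the sample JSON is the one given to the type provider above.

```fsharp
open System.Text.Json

// Sample response in the shape used for the type provider.
let sampleJson =
    """[{"faceId":"83045097-daa1-4f1c-8669-ed012e9b5975","faceRectangle":{"top":187,"left":209,"width":214,"height":214},"faceAttributes":{"gender":"male","age":42.8}}]"""

// Returns Some (age, gender) for the first face, or None when the array is empty.
let tryFirstFace (json: string) =
    use doc = JsonDocument.Parse(json)
    doc.RootElement.EnumerateArray()
    |> Seq.tryHead
    |> Option.map (fun face ->
        let attributes = face.GetProperty("faceAttributes")
        attributes.GetProperty("age").GetDouble(), attributes.GetProperty("gender").GetString())

let firstFace = tryFirstFace sampleJson
let noFace = tryFirstFace "[]"
```

An empty array is what comes back when no face is detected, which is the case the `0 -> None` branch covers.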
So now I can get estimated age and gender from the Microsoft Cognitive Services API. I was disappointed that the API does not estimate race. I assume they have the technology but, from a social-acceptance point of view, they don’t make it publicly available. In any event, a look through the photos shows that a majority are white people. I went ahead and ran this and went out to work on my son’s stock car while the requests were spinning.
#time
let results =
    let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos"
    Directory.GetFiles(path)
    |> Array.map(fun f -> getOxfordResults f)
    |> Array.map(fun r -> parseOxfordResuls r)
#time
When I came back, I had a nice array of optional tuples, each holding an age and a gender.
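A minimal sketch of working with results in that shape, using invented values (the `sampleResults` array is hypothetical, not the MVP data). Ages are written as decimals on the assumption that the type provider infers `decimal` from the `42.8` in its sample.

```fsharp
// Hypothetical results: Some (age, gender) when a face was found, None otherwise.
let sampleResults : (decimal * string) option [] =
    [| Some (42.8m, "male"); None; Some (35.1m, "female") |]

// Seq.choose keeps only the Some cases and unwraps them in one step.
let sampleAges =
    sampleResults
    |> Seq.choose id
    |> Seq.map (fun (age, _) -> float age)
    |> Seq.toList
```

`Seq.choose id` is a compact alternative to filtering on `IsSome` and then reading `.Value`.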
To analyze the data, I pulled in Math.NET. First, I took a look at age:
open MathNet.Numerics.Statistics

Seq.length results //684

// Keep only the photos where a face was found and convert the ages to floats.
let ages =
    results
    |> Seq.filter(fun r -> r.IsSome)
    |> Seq.map(fun o -> fst o.Value)
    |> Seq.map(fun a -> float a)

let stats = new DescriptiveStatistics(ages)
let count = stats.Count
let largest = stats.Maximum
let smallest = stats.Minimum
let mean = stats.Mean
let median = Statistics.Median(ages)
let variance = stats.Variance
let standardDeviation = stats.StandardDeviation
let kurtosis = stats.Kurtosis
let skewness = stats.Skewness
let lowerQuartile = Statistics.LowerQuartile(ages)
let upperQuartile = Statistics.UpperQuartile(ages)
Here are the results.
I got 620 valid photos of the 684 MVPs, so a 91% hit rate, and I have enough observations for the summary statistics to be meaningful. It looks like Cognitive Services made at least 1 mistake with an age of 4.9 years; perhaps someone was using a meme for their photo? In any event, the mean is estimated at 41.95 and the median is 40.95, so a slight skew right, since the mean sits above the median. (Note I mislabeled it on the screen shot above)
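As a rule of thumb, the direction of skew can be read off by comparing the mean with the median. A small sketch on invented numbers (the `exampleAges` list is hypothetical, not the MVP data):

```fsharp
// Hypothetical ages, invented for illustration.
let exampleAges = [ 31.0; 35.0; 38.0; 40.0; 41.0; 44.0; 65.0 ]

let exampleMean = List.average exampleAges

// Median: middle value of the sorted data, or the average of the middle two.
let exampleMedian =
    let sorted = exampleAges |> List.sort |> List.toArray
    let n = sorted.Length
    if n % 2 = 1 then sorted.[n / 2]
    else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0

// A mean above the median points to a longer tail on the right.
let skewDirection =
    if exampleMean > exampleMedian then "right"
    elif exampleMean < exampleMedian then "left"
    else "none"
```

Here the single large value pulls the mean above the median, which is the same pattern as a mean of 41.95 against a median of 40.95.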
I then wanted to see the distribution of the ages so I brought in FSharp charting and ran a basic histogram:
open FSharp.Charting
let chart = Chart.Histogram(ages,Intervals=10.0)
Chart.Show(chart)
So the ages look very Gaussian.
I then decided to look at gender:
let gender =
    results
    |> Seq.filter(fun r -> r.IsSome)
    |> Seq.map(fun o -> snd o.Value)

// Count each gender and its share of the photos with a detected face.
gender
|> Seq.countBy(fun v -> v)
|> Seq.map(fun (g,c) -> g, c, float c/float count)
With the results being:
So there are 12% females and 88% males. With an average age of 42 years and 88% male, “middle age white guy” seems like an appropriate label and I stand by my original tweet. We certainly have work to do in 2017.
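The count-and-share step behind those percentages can be tried on a small made-up list (the `sampleGenders` values are hypothetical, not the MVP data):

```fsharp
// Hypothetical labels, invented for illustration.
let sampleGenders = [ "male"; "female"; "male"; "male" ]
let total = float (List.length sampleGenders)

// Group identical labels, then report each label with its count and share.
let shares =
    sampleGenders
    |> List.countBy id
    |> List.map (fun (g, c) -> g, c, float c / total)
```

In the analysis the divisor is the `count` taken from the age statistics, which works because both ages and genders come from the same set of photos with a detected face.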
You can find the gist here.