K-means clustering of text from keywords and titles of recent papers written by statistics and actuarial science faculty

An undergraduate research project by Yu Zhang (BSc SFU 2013)

Tuesday, March 4, 2014

Here we showcase the research strengths and diversity of faculty members in the Department of Statistics and Actuarial Science at SFU. Words extracted from the titles and keywords of recent publications were clustered into 20 groups, which we treat as research categories within the department. Each researcher was then assigned a score for each category, based on the proportion of their words that fall into it. The results are presented in two ways: the first shows how an individual faculty member's research is distributed across the categories, and the second gives a departmental view of where members sit relative to one another when their scores on two clusters are compared.
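To make the scoring step concrete, here is a minimal sketch in Python (the project itself used R). It assumes the score is the count-weighted share of a researcher's keywords that land in each category; the text does not say whether counts or unique words were used, so that choice is an assumption. The function name `category_scores`, the example counts, and the category names are all invented for illustration.

```python
from collections import Counter

def category_scores(word_counts, word_to_cluster):
    """Return the share of one researcher's keyword counts in each category.

    Assumption: the score is weighted by how often each word was used.
    """
    totals = Counter()
    for word, n in word_counts.items():
        totals[word_to_cluster[word]] += n
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cluster: n / grand_total for cluster, n in totals.items()}

# Invented example: one researcher's keyword counts and a made-up
# keyword-to-category assignment.
scores = category_scores(
    {"genetic": 3, "animal": 1, "survival": 4},
    {"genetic": "Genetics", "animal": "Ecology", "survival": "Biostatistics"},
)
print(scores)
```

Because the scores for one researcher sum to 1, they can be read directly as a distribution of that person's research across the categories.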

DATA

We collected all publications by research faculty members from 2002 to 2013, as listed on their webpages and publicly available CVs. Titles and keywords were extracted, and the number of times each word was used by each faculty member was tabulated. Originally, there were 2127 unique words. Non-statistical words such as ‘of’, ‘the’, and ‘regime’ were deleted. Known related words were combined; for example, ‘chromosome’ and ‘DNA’ were combined into ‘genetic’, while ‘salmon’ and ‘falcon’ were combined into ‘animal’. In the end, the list was pared down to 126 unique research keywords.
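The cleaning and counting steps can be sketched roughly as follows, in Python rather than the R used for the project. Only the example words (‘of’, ‘the’, ‘regime’, and the two merges) come from the description above; the full lists used in the project are not given here, and the function name `keyword_counts` and the two titles are invented for illustration.

```python
from collections import Counter

# Partial, illustrative lists: only these examples appear in the text.
NON_STATISTICAL = {"of", "the", "regime"}
MERGE = {
    "chromosome": "genetic",
    "dna": "genetic",
    "salmon": "animal",
    "falcon": "animal",
}

def keyword_counts(titles):
    """Count cleaned keywords across one faculty member's titles."""
    counts = Counter()
    for title in titles:
        for raw in title.lower().split():
            word = raw.strip(".,;:()'\"")
            if not word or word in NON_STATISTICAL:
                continue
            # Replace known related words by their shared keyword.
            counts[MERGE.get(word, word)] += 1
    return counts

# Invented titles, used only to exercise the function.
example = keyword_counts([
    "Mapping of the salmon chromosome",
    "Falcon DNA markers",
])
print(example)
```

Running this once per faculty member gives one column of counts per person, which is the table that the clustering step works from.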

The word counts were then standardized and grouped into 20 clusters by a k-means algorithm in R, which minimizes the sum of squared distances from each word's counts to the mean of its cluster. Clusters were named based on our interpretation of the themes that emerged.
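As an illustration of this step, the sketch below standardizes a small table of counts and runs a plain Lloyd-style k-means in Python. It is not the R code used for the project: R's `kmeans` uses a different algorithm (Hartigan-Wong) by default, and the layout assumed here (one row per keyword, one column per faculty member) and the toy counts are assumptions made for the example.

```python
import random
from statistics import mean, stdev

def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [mean(col) for col in zip(*members)]
    # Within-cluster sum of squares: the quantity k-means tries to minimize.
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss

# Toy data: 6 keywords (rows) by 2 faculty members (columns).
counts = [[5, 0], [4, 1], [6, 0], [0, 5], [1, 4], [0, 6]]
labels, centers, wss = kmeans(standardize(counts), k=2)
print(labels, round(wss, 3))
```

In the toy data the first three keywords are used mostly by one person and the last three by the other, so they separate into two clusters. The project applied the same idea with 20 clusters to the 126 keywords.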

RESULTS
