Friday, August 27, 2010
Keyword Strength from a More Rigorous Perspective
By P.J. Hinton, Director of Engineering
With this week's release came a new look and new behavior for Compendium's Keyword Strength Meter, a widget that appears in several places in our application for indicating the quality of keyword usage in a body of content.
For one thing, we migrated the algorithm from the browser to the server, so that the score is computed and updated on draft save operations. While this sacrifices the immediate feedback that many customers have come to love, it allows us to provide this feature for all of our clients, including those that have very large target keyword pools. It also allows us to expose the keyword strength algorithm as a web service API call.
The meter itself got a makeover that replaces the shifting gradient with a progress bar. The new look is more accessible to people who are color blind.
What hasn't changed is the math behind the meter.
Every once in a while, we get a question about how the keyword strength score is computed. I've written previously about the objectives of the algorithm and how it attempts to find a balance between diverse keyword usage and detrimental keyword stuffing, but that's a pretty high-level discussion.
When I talk with team members about the meter's algorithm, I've always downplayed the complexity, because it's always seemed like a pretty straightforward calculation. As an amusement, I decided to recast the algorithm using more precise terminology, resembling what a mathematician or computer scientist might use. Here is what I wound up with.
Keyword Strength
For the purposes of this discussion, a token is a contiguous sequence of characters within a string that contains no whitespace.
Let T be a string of characters consisting of whitespace-delimited tokens.
Let K be a vector of n strings, each containing whitespace-delimited tokens.
Let K_i denote the ith element of K.
Let L(x) be a function that computes the number of tokens present in the string x.
For the purposes of defining M and D below, appearance will be determined by a case-insensitive character comparison.
Let M(T, K_i) be a function that returns n_i, the number of times K_i appears in T.
Let D(T, K_i) be a function that returns 1 if K_i appears anywhere in T and 0 otherwise.
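As a rough illustration, the helper functions L, M, and D defined so far could be sketched in Python as follows. This is a minimal sketch rather than the code behind the meter. It assumes that "appears" means a token-by-token, case-insensitive match of the whole phrase, and that overlapping occurrences are each counted; the `tokens` helper is a hypothetical name introduced only for this example.

```python
def tokens(x):
    """Split a string into whitespace-delimited tokens (hypothetical helper)."""
    return x.split()


def L(x):
    """Number of tokens present in the string x."""
    return len(tokens(x))


def M(T, K_i):
    """Number of times the phrase K_i appears in T.

    Assumption: matching is done token by token, ignoring case, and every
    starting position that matches is counted, even if matches overlap.
    """
    text = [t.lower() for t in tokens(T)]
    phrase = [t.lower() for t in tokens(K_i)]
    if not phrase:
        return 0
    width = len(phrase)
    return sum(
        1
        for start in range(len(text) - width + 1)
        if text[start:start + width] == phrase
    )


def D(T, K_i):
    """1 if K_i appears anywhere in T, 0 otherwise."""
    return 1 if M(T, K_i) > 0 else 0
```

For example, with T set to "Red shoes and red SHOES and blue shoes", L(T) is 8, M(T, "red shoes") is 2, and D(T, "green shoes") is 0.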
Let C(T, K) be a function that computes the concentration, a measure of how much of string T is made up of the tokens in K, according to the following definition:
Let s be a three-element vector of scoring functions that have the following formulas:
Where M_ceil, D_ceil, and C_opt are adjustable parameters.
Let w be a three-element vector of weights between 0 and 1, such that:
Then the keyword strength of a string T relative to a vector of keyword phrase strings K is determined by the dot product of the scoring function and weight vectors.
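The final step can be sketched in a few lines of Python. The sketch assumes the three scores have already been produced by the scoring functions in s, and it only combines them with the weights; the function name `keyword_strength` and the numbers in the example are hypothetical and chosen purely for illustration.

```python
def keyword_strength(scores, weights):
    """Dot product of a score vector and a weight vector of equal length."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("each weight must lie between 0 and 1")
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical scores and weights, for illustration only.
example_scores = [0.5, 1.0, 0.25]
example_weights = [0.5, 0.25, 0.25]
print(keyword_strength(example_scores, example_weights))  # 0.5625
```

Keeping the combination step separate from the scoring functions means the weights can be tuned without touching how each individual score is computed.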