Businesses often want to better understand their customers by segmenting them along a common set of attributes. In a previous post, we explored how to build segments based on customers’ trajectories of interactions with a brand. In this post, we’ll show how to build segments based purely on the products that customers have purchased; this approach has the added bonus of not just segmenting your customers, but your products too! And this isn’t just a strategic intelligence tool—finding groups of similar customers can lead to advanced recommendations systems that are personalized for each one of your customers.

How one can find groups of similar customers that purchase similar products? How do we define what “similar” means in an assortment of thousands (or tens of thousands) of SKUs? Flip the question on its head: how do we find products that are purchased by similar customers? How do we define similarity between customers?

As you’ll see—asking either one of those questions individually is a bit like asking which blade of the scissors cuts the paper. In a product segmentation, customers are said to be similar when they purchase from the same set of products; and products are similar when they are purchased by the same set of customers.

Ok—let’s dive into how it’s done.

We start with what we’re trying to explain, which is our observed customer-by-product matrix:

SKU1 SKU2 SKU3 ...
CUST1 0 1 1
CUST2 0 0 1
CUST3 1 0 0


To keep things really simple, we’re going to put a 1 in the cell if the customer has ever purchased that product, and 0 if they never have.

Again, let’s remember that we’re trying to explain this data in terms of customers(rows) and products(columns)—sounds like we’re trying to split this matrix in two! In fact we are—the tool that we use to explain our observed data is called non-negative matrix factorization. It is a group of algorithms that simplify the original matrix of data (let’s call it V) by 2 other matrices (Wand H), which, when multiplied together, come to (approximately) the original matrix.

So, let’s say you have 10,000 customers and sell 1,000 products. Your customer-by-product matrix is going to be 10,000 rows and 1,000 columns. But, you could factor this matrix into a:

  • 10,000 row by 2[†] column matrix (W), and a
  • 2[†] row by 1,000 column matrix (H)

†2 is arbitrary—but is typically determined by trying a number of different options

This would mean that for each customer, you have two pieces of information that tell you what kinds of products they purchase (instead of 1,000); and for each product, you have two pieces of information that tell you which kinds of customers purchase them.


Link: License:


This approach can be thought of as a multidimensional scaling algorithm that has better features than Principal Component Analysis as it is well defined for non-negative values in the data (and counts of purchases are always non-negative).

Working backwards, if you work out the sums, a single cell of the matrix that you arrive at when you multiply the customer- and product-segment matrices together is:

CS1 * PS1 + CS2 * PS2 + … + CS5 * PS5

Where CS1 is that customer’s score for segment 1, and PS1 is that product’sscore for being in segment 1—and so on through to segment 5. So if a customer and product have high scores for the same segments, then our factorization is implying that this cell in the customer-by-product matrix has a high value.


Visually, the results can be shown in the form of a heatmap, which shows each customer’s score by each segment (the charts below use the word "basis" in lieu of segment).

The dark entries (in column 2, for example) mean that those customers preferentially buy products from segment 2. And which products are those? Well, we have a corresponding heatmap for our product segments, that looks like this:

Now we have the two pieces of information together—the customers that tend to purchase similar products, and the products that tend to be purchased by similar customers.

With this information in hand, you can start using these scores as the basis for strategic decisions and marketing enhancements, like:

  • Developing product-based customer segments to build need-based personas
  • Deciding which products should be offered together as a bundle
  • Building a product recommendation engine that uses a customer’s segment to determine which products should be merchandised

Code Through

If you’re interested—check out our code-through here.


Working through such an analysis, a common challenge is having a very sparse dataset. Typically there are many products, and customers tend to only purchase a few of them. This can usually be addressed by applying some expert knowledge to develop a hierarchy of information about the product—from brand, to style, to size (or SKU), and choosing the appropriate level of the hierarchy to use.

In addition, non-negative matrix factorization takes a number of options and parameters—there a number of different algorithms and choices to make for each. One needs to determine which loss function to use and how to determine the starting state of the matrix in the estimation process.

Not to mention, you have to choose the number of segments for your analysis—this is typically done by trying a range of possible segments and comparing how well they explain the data (by comparing their errors) and how well they perform for you, the analyst. Too many segments, and the information is hard to digest; too few, and you are not explaining the data well.

Finally, the computations are expensive and tricky. We use the NMF library in R, which is as performant and flexible as they come—but even still we often encounter tricky and hard-to-diagnose errors all the time.