How grocery stores are becoming more effective at marketing through data science
Association analysis is a hot topic in data science right now. By discovering relationships between items within large quantities or networks of data, we can glean insights in many areas. These include uncovering unconscious consumer buying patterns at a particular store or through an app (known as market basket transactions), finding interesting patterns in text mining, or potentially discovering patterns in healthcare, transportation, or survey-based data. These relationships may be represented by what we call association rules, which are typically written in the form {A} → {B}, where the two items A and B exhibit some sort of relationship.
This article is intended for the data-science-curious: readers who are passingly familiar with the field but might not have deep knowledge of the industry. In this article, you will learn:
- Vocabulary / Definitions
- Overview of the Apriori Algorithm
- Common Association Rules
- Implementation in R
- Statistical Pitfalls
The Instacart dataset is a set of three million real transactions that was released to the public by Instacart in 2017. In addition to the orders, the original dataset also includes data on the number of days since an individual’s previous order as well as information about item placement at the grocery store.
For the purposes of this analysis and simplicity of explanation, we will be using a reduced portion of the dataset that had already been partially cleaned. However, it is good general practice to extensively investigate and clean your datasets before modeling.
Most datasets you encounter in the real world will be imperfect. As a result, data cleaning is one of the most important steps in building a model that is representative of the information in front of you. Your first step should always be to take a look at the data before starting analysis. Although it sounds overly simple, it gets you comfortable with the subject and gives you a general idea of how the data is structured. This knowledge can give you clues about the best organizational tweaks to apply, as well as pitfalls you can address up front.
We quickly see that all of our columns are in the format we need for market basket analysis: they are already numeric, as indicated by the <dbl> type label.
In this specific dataset, we also need to ensure that no two items have identical product_names and product_ids attached. If there is overlap, we could potentially misrepresent the strength or weakness indicated by an association rule.
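If you want to follow along, a minimal sketch of this first look might resemble the code below. The file name is hypothetical, and we assume the reduced data contains (at least) product_id and product_name columns, so adjust the names to match your own copy of the data.

```r
library(readr)
library(dplyr)

# Hypothetical file name; point this at your own reduced Instacart extract
orders <- read_csv("instacart_orders_reduced.csv")

# A quick look at the structure and the column types (<dbl>, <chr>, ...)
glimpse(orders)

# Check that no product_id maps to more than one product_name;
# an empty result means the mapping is one-to-one
orders %>%
  distinct(product_id, product_name) %>%
  count(product_id) %>%
  filter(n > 1)
```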
As you work your way through this article, some terms will be familiar to you and others will not. We suggest carefully reading over the following definitions so you don’t find yourself confusing similar-sounding terms. We also give more detailed treatments of some of these concepts in subsequent sections (theoretical underpinnings, code implementation, etc.), so not to worry if some of these definitions leave you wanting more!
- Apriori, translated from Latin as “from the former”, is an algorithm that generates association rules.
- Association rules represent relationships between individual items or item sets within the data. These are often written in {A}→{B} format.
- A market basket is a group of one or more items that a customer purchases in one transaction.
- Confidence (denoted as c) is an estimate of the conditional probability that one item (for example, {A}) is in the basket given that another ({B}) is already present.
- High confidence rules meet or exceed a predefined confidence threshold.
- An item is an individual unit, especially when described as part of a set, list, basket, or other grouping.
- Support (denoted as s) of an individual item is “the fraction of transactions that contain [it],” or the frequentist probability with which it occurs within a set of transactions. It is, therefore, an estimate of the proportion of future baskets that will contain the item.
- An item set is any grouping of one or more items.
- Frequent item sets are individual item sets that meet or exceed a given minimum support threshold.
- Support between item sets (i.e., the support of an association rule) gives the frequency at which both item sets occur in the same basket. It is also an estimate of the proportion of future baskets that will contain both item sets.
And these are all the important definitions you need to know for association analysis!
You’ve cleaned the Instacart dataset, so now it is time to start analyzing! One of the best ways to perform association analysis in R is with the arules package. This package provides the functions needed to mine frequent item sets, maximal frequent item sets, association rules, and more!
A great use of the arules package is to implement the Apriori algorithm. The Apriori algorithm is an efficient way to discover frequent item sets (those that appear in ≥ s number of baskets) AND high confidence rules (those with confidence ≥ c).
Without the Apriori algorithm you have to use brute force to perform different types of association analysis. For example, if the item set I contains d items, the total number of possible association rules becomes:
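$$
R = 3^d - 2^{d+1} + 1
$$

(Each item can appear on the rule’s left-hand side, on its right-hand side, or in neither, giving 3^d possibilities; we then exclude the assignments that leave either side of the rule empty.)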
This brute-force method can become computationally intensive, so we suggest using the Apriori algorithm instead.
The Apriori principle states that if an item set is frequent, then all of its subsets must also be frequent. Equivalently, if item i does not appear in at least s baskets (i.e., it is not frequent), then no item set that includes item i will appear in at least s baskets either.
This entire process breaks down into two steps: 1) find all frequent item sets within the dataset, and 2) find rules, using the frequent item sets, that also have high confidence.
Ex. Consider a lattice containing all possible combinations of only 5 products: A = apples, B = bananas, C = cookies, D = donuts, and E = eggs (graphical depiction below):
The Apriori algorithm scans and determines the frequency of individual items and excludes those that do not meet the threshold. For example, if item set {A,B} is not frequent, then we can exclude, or “prune,” all item set combinations that include {A,B}. This leads to lots of computational savings in the long run, allowing us to focus more resources on the item sets that do appear together most frequently.
Now it is time to implement this in R. We will use the same Instacart data detailed above and the `apriori()` function from the arules package. The default parameter values are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per item set (`maxlen`). You can also adjust the parameters to control the number of rules you get back (e.g., if you want stronger rules, you can increase the confidence threshold).
Step 1 for the Apriori algorithm is to find the frequent item sets. For this example we’ll stick to the basics and only use a support parameter. To implement the Apriori algorithm, we need to run the following, where:
- trans is the transaction object
- support is set to 0.01
- target is “frequent itemsets” and indicates we only want frequent item sets, not rules
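A minimal sketch of that call is shown below. The conversion from a data frame to a transactions object assumes the data has one row per order_id/product_name pair, and the object names (`orders`, `trans`, `freq_itemsets`) are our own.

```r
library(arules)

# Convert the long-format data frame into a transactions object:
# one basket (transaction) per order
trans <- as(split(orders$product_name, orders$order_id), "transactions")

# Step 1: mine frequent item sets with a minimum support of 0.01
freq_itemsets <- apriori(
  trans,
  parameter = list(support = 0.01, target = "frequent itemsets")
)

# Arrange the results by highest support
inspect(head(sort(freq_itemsets, by = "support"), 10))
```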
Below are the results, arranged by highest support level. It looks like we can expect someone to have bananas in their basket, since Bananas and Organic Bananas have the top two support values! Note that if we wanted to include only item sets with more than one item, we could have added `minlen = 2` as an additional parameter argument. We’ll add `minlen` in Step 2 so you can see how this impacts things.
Step 2 for the Apriori algorithm is to find the association rules. We can find this by running the following, where:
- trans is our transaction object
- support is set to 0.01
- confidence is set to 0.50
- minlen is set to 2 so all item sets need to have at least two items
- target is “rules” and indicates we only want rules, not frequency
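A sketch of that call, reusing the `trans` object from Step 1:

```r
# Step 2: mine association rules with minimum support, confidence, and length
rules <- apriori(
  trans,
  parameter = list(
    support    = 0.01,
    confidence = 0.50,
    minlen     = 2,
    target     = "rules"
  )
)

# Arrange the results by highest confidence
inspect(head(sort(rules, by = "confidence"), 10))
```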
Below are the results, arranged by highest confidence level. Looks like if someone has Organic Avocados, Organic Raspberries, and Organic Strawberries there is an increased chance they also have a bag of Organic Bananas in their cart.
Now, if we want to visualize these association rules, we need to break out the `arulesViz` R package. This package provides plotting functionality that will help you further understand the results of the association analysis.
Specifically, with `arulesViz` you can:
- Create both static and interactive visualizations (this includes scatterplots, matrices, graphs, mosaic plots, parallel coordinate plots, and more!)
- Integrate interactive rule exploration using `ruleExplorer()`
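A sketch of how these plots might be produced, assuming the `rules` object mined in Step 2:

```r
library(arulesViz)

# Static scatter plot of the rules (support vs. confidence, shaded by lift)
plot(rules, method = "scatterplot")

# Interactive graph-based visualization rendered as an htmlwidget
plot(rules, method = "graph", engine = "htmlwidget")

# Launch the interactive rule explorer (a Shiny app)
# ruleExplorer(rules)
```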
Here’s an example of a static scatter and an interactive graph! Don’t worry if your plots take a while to load; it is worth the wait!
Scatter Plot:
Interactive Plot:
Congratulations, you’ve successfully worked through the implementation of the Apriori algorithm and identified frequent item sets at the grocery store!
In the business world, time is money, so only the most interesting data patterns will be targeted. In association analysis (considered a subset of data mining), the criteria used to rank such patterns are called Measures of Interestingness. Based on the client’s aims, we can optimize different measures to sell more products, increase profit margins, decrease costs, or identify product substitutes and complements, all with association analysis!
Below we dive into the three most commonly used measures of interest: lift, added value, and leverage. There are many more, though!
For two products A and B, lift is the confidence of A to B divided by the support for B:
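$$
\text{lift}(A \rightarrow B) = \frac{c(A \rightarrow B)}{s(B)}
$$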
And since confidence is a function of support, which is most granularly a probability estimate, we can rewrite the equation entirely in probabilistic terms:
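$$
\text{lift}(A \rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)}
$$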
In short, we can calculate lift either with association analysis tools or through traditional statistics!
In the second equation, notice that lift is simply the ratio of the probability that A and B occur together to the product of the probabilities that each occurs on its own. For this reason, for values greater than 1 we say A and B have a positive association, and for values less than 1 we say they have a negative association.
Because lift is essentially a ratio, it grows as the confidence of A → B grows or as the support of B shrinks, and it shrinks in the opposite cases.
Confidence asks, “given you bought item A, what’s the probability you also bought item B?” Similarly, added value asks “given you bought item A, how does that change the probability you bought item B?” Statistically speaking, it is the difference between the conditional and the unconditional:
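$$
AV(A \rightarrow B) = c(A \rightarrow B) - s(B) = P(B \mid A) - P(B)
$$

(writing AV for added value)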
Added value is an important metric that can be used to increase profit for store owners. Let’s say you are a maple syrup salesman and decide to introduce a pancake bar to your store. You want to know whether offering pancakes will increase the amount of maple syrup customers also buy while at the store. The added value equation would look like so:
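$$
AV(\{\text{pancakes}\} \rightarrow \{\text{maple syrup}\}) = P(\text{maple syrup} \mid \text{pancakes}) - P(\text{maple syrup})
$$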
If the value is positive, then added value suggests that pancakes do increase the amount of maple syrup being bought. The pancake bar was a great idea!
Leverage (also called the Piatetsky-Shapiro measure) is the difference between the observed and expected frequency with which A and B occur together, and it is commonly used to check for independence. If A and B are truly independent, leverage will be zero; otherwise, it takes a value between -1 and 1 (similar to correlation, but not normalized). Positive leverage suggests a positive association between A and B, negative leverage suggests a negative association, and a leverage of 0 suggests no association at all.
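In probabilistic terms, mirroring the lift formula above:

$$
\text{leverage}(A \rightarrow B) = P(A \cap B) - P(A)\,P(B)
$$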
If leverage is positive, a store owner could boost sales by placing the items next to each other on the shelf — if a customer is picking up peanut butter, it’s most convenient if jam is right there too!
Using the Instacart data (above), we can now choose to sort by any of the three measures of interest. We also included two additional measures, Conditional Entropy and Mutual Information, to show how easy it is to evaluate rules against a range of different measures.
Not all interest measures are created equal; in fact, you may want to run `?interestMeasure` yourself to figure out which one works best for your problem!
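As a sketch, additional measures can be attached to the mined rules with arules’ `interestMeasure()`. We stick to measure names we know it accepts; conditional entropy and mutual information may require custom helper functions, so they are omitted here.

```r
# Compute extra interest measures for the mined rules
extra <- interestMeasure(
  rules,
  measure      = c("addedValue", "leverage"),
  transactions = trans
)

# Attach them to the rules and sort by whichever measure we care about
quality(rules) <- cbind(quality(rules), extra)
inspect(head(sort(rules, by = "addedValue"), 5))
```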
In this section we will briefly discuss common problems that are encountered when performing association analysis.
When we find patterns and associations of interest, we want to find those that are repeatable, meaning they are likely to occur in future transactions. If future data comes from a different distribution than the observed training data, watch out!
For example, changes in a regular customer base or newly differing shopping habits based on price changes at other stores may impact observed spending habits. To this end, we can use confidence intervals and statistical significance to measure the integrity of repeatable patterns that appear within the data.
Let I and J be two item sets and let N be the total number of transactions. The support of the rule I → J is a sample proportion, so we can attach a margin of error to it: the margin of error is the product of the critical value for the chosen confidence level (in this case, 95% confidence) and the standard error of the point estimate.
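Under the additional assumption that transactions are independent draws, this gives the familiar large-sample interval for the rule’s support:

$$
\hat{s} \pm z_{\alpha/2}\sqrt{\frac{\hat{s}\,(1-\hat{s})}{N}}, \qquad \hat{s} = s(I \rightarrow J),
$$

where the critical value z is roughly 1.96 for a 95% confidence level.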
Additional Note: One should always make sure that observations within the data are independent before performing tests of significance. The confidence interval will not be fully accurate if independence is violated and any uncertainty estimates calculated may end up being off.
There can often be a conflict between statistics uncovered at the granular level vs. the aggregate level; we call this occurrence Simpson’s Paradox. As a result, aggregating data can be dangerous, since it can produce misleading patterns. Let us take a look at an example of a department store that sells apparel items such as watches and belts. First we’ll view the overall totals:
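The counts below are reconstructed from the confidence calculations that follow (the Belt = No column is implied by the row totals):

| | Belt = Yes | Belt = No | Total |
| --- | --- | --- | --- |
| Watch = Yes | 106 | 79 | 185 |
| Watch = No | 50 | 78 | 128 |
| Total | 156 | 157 | 313 |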
Translation:
- confidence({Watch=Yes} → {Belt=Yes}) = 106/185 = 57%, meaning that among customers who bought a watch, 57% also bought a belt.
- confidence({Watch=No} → {Belt=Yes}) = 50/128 = 39%, meaning that among customers who did not buy a watch, 39% bought a belt.
However, when we break things down by customer group, we get a different takeaway (Simpson’s Paradox):
For Women:
- confidence({Watch=Yes} → {Belt=Yes}) = 2/10 = 20%
- confidence({Watch=No} → {Belt=Yes}) = 3/38 = 7.9%
For Men:
- confidence({Watch=Yes} → {Belt=Yes}) = 104/175 = 59.4%
- confidence({Watch=No} → {Belt=Yes}) = 47/90 = 52.2%
A confounding (hidden) variable, here the customer group, can hide or even reverse an observed relationship, which is why the group-level numbers tell a different story than the aggregate. Why is that? The customers who buy the most watches and belts are men, and because men make up such a large proportion of watch and belt shoppers, the rule c({Watch=Yes} → {Belt=Yes}) looks much stronger in the aggregate data than it does at the granular level.
Once we factor out the hidden variable, we see that c({Watch=Yes} → {Belt=Yes}) largely reflects the customer mix rather than a direct relationship between watches and belts.
Skewed support distributions occur when most items appear at a very low frequency, with a select few items appearing at a very high frequency. Additionally, you can run into a variety of problems when the user-defined support threshold is set too low or too high.
If you set the support threshold too high, you miss out on valuable information about items that are seldom bought but participate in interesting patterns (for example, expensive items). On the other hand, if you set the support threshold too low, computation costs increase and cross-support patterns appear (detailed below).
Cross-support patterns occur when a pattern relates a high-frequency item to a low-frequency item and the apparent association is driven by the high-frequency item alone. For example, many people buy milk regularly on their trips to the grocery store (a high-frequency item). Sometimes, people also decide to splurge on caviar while doing their regular grocery shopping (a low-frequency item). Even though people are not in fact eating caviar and milk together (ew!), the two will appear to be strongly associated because milk is such a popular item overall.
To determine if cross support patterns are appearing within a transaction list, we will let r(X) be the support ratio of an item set X = {i_1, i_2,…,i_k} such that:
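$$
r(X) = \frac{\min\{\,s(i_1), s(i_2), \ldots, s(i_k)\,\}}{\max\{\,s(i_1), s(i_2), \ldots, s(i_k)\,\}}
$$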
Given a user-specified threshold hc, item set X is a cross-support pattern if r(X) < hc.
You can also eliminate cross-support patterns by using h-confidence:
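$$
\text{h-confidence}(X) = \frac{s(X)}{\max\{\,s(i_1), s(i_2), \ldots, s(i_k)\,\}}
$$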
H-confidence finds the lowest confidence available for a frequent item set. If the h-confidence of an item set is at least hc, it cannot be a cross-support pattern, so we can be confident the items are truly associated with one another (and we don’t mistake caviar with milk for a tasty treat).
Throughout this article we showed you how to use association analysis to discover all types of relationships between items. With applications ranging from retail to text mining, there are plenty of use cases for this type of analysis within the data science profession.
By reading through this tutorial you now have an introductory understanding of association analysis and can get started performing analysis yourself. Mission accomplished!
- arulesViz Documentation, https://www.rdocumentation.org/packages/arulesViz/versions/1.3-3
- Market Basket Analysis using R 2018, https://www.datacamp.com/community/tutorials/market-basket-analysis-r#code
- The Instacart Online Grocery Shopping Dataset 2017 Data Descriptions, https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b
- Lift, https://en.wikipedia.org/wiki/Lift_(data_mining)
- Complete guide to Association Rules (1/2), Sep 2018, https://towardsdatascience.com/association-rules-2-aa9a77241654
- Mutual Information, https://en.wikipedia.org/wiki/Mutual_information
- Conditional Entropy, https://en.wikipedia.org/wiki/Conditional_entropy
- Professor Porter Association Analysis PDF, https://mdporter.github.io/SYS6018/lectures/09-association.pdf
- Entropy Information Theory, https://en.wikipedia.org/wiki/Entropy_(information_theory)
- Compsci 650 Applied Information Theory, Lecture 2, Jan 2016, https://people.cs.umass.edu/~arya/courses/650/lecture2.pdf
- Conditional entropy, Oct 2015, https://www.quantiki.org/wiki/conditional-entropy
- Research on Association Rule Mining, https://michael.hahsler.net/research/association_rules/
- Association Rule Learning, https://en.wikipedia.org/wiki/Association_rule_learning
- A priori Definition, https://www.merriam-webster.com/dictionary/a%20priori
- Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009. (Section 9.3 “Bump Hunting” and Section 14.2 “Association Rules”)
- Association Rule Learning by Bump Hunting, 2018, https://ahmedlahloum.github.io/posts/bump-hunting/
- Introduction to Data Mining (Second Edition) by Tan, Steinbach, Karpatne, and Kumar, Chapter 6, http://infolab.stanford.edu/~ullman/mmds/ch6.pdf
This piece was originally created by Amanda West, Clair McLafferty, Oretha Domfeh, and Bev Dobrenz, who are all M.S. in Data Science students at the University of Virginia. We’re grateful to you for taking the time to read our article today, and thankful to Professor Michael Porter at the University of Virginia for allowing us to use some of his base functions in this tutorial.