I had the opportunity this summer in 2017 to work as a Software Engineer Intern at Optimizely. I worked on the Web Squad, who owns the Optimizely web frontend dashboard located on app.optimizely.com. In 12 weeks, I met new friends, worked on a project that removed a major pain point for Optimizely customers, and hacked on an awesome internal tool for onboarding. It was a really fulfilling summer and the best part was I was right in the heart of San Francisco, a city I've never visited before!
I wanted an internship that gave me insight on the work a full-time software engineer did and allowed me to work on meaningful projects. I think Optimizely was able to provide me with that.
Well, the snippet can exist on pages the customers don't want to track:
These are two cases where customers definitely don't want to track visitor data. For the first case, we don't want the origin entries to be overly permissive or else there could be misrepresented data. In the second case, that means the customer would want a way to manually override any currently determined origins, in the case that they're still working on anything under a specific domain or subdomain. Considering these two things, there's no easy way to solve this problem without some user guidance, so we're going to need a way to let the customer tell us which sites to track events.
The old cross-origin tracking settings took a restrictive-first approach in handling entries--customers needed to add entries manually to the list. Optimizely does provided some nice ways to allowlist a group of domains with match types, but this was still somewhat confusing to users. Organizations are now deploying their sites to multiple domains and protocols (especially with TLS/HTTPS), so it is necessary to be able to track visitor data between all of those sites correctly. This became a pain point for Optimizely customers, and they ultimately need help from Customer Support Engineers to sort out their problem with tracking data.
To minimize interactions with cross-origin tracking and reduce the instances where customers are inputting the same piece of information twice, we need a way to automatically determine the origin entries for customers with the information (URLs in the pages they added for experimentation) they already provided. Additionally, for power users where they have staging environments or any pages they don't want to track on, they'll need the flexibility to override the automatically determined origins.
As a result, there's a visual disconnect between the customer dashboard and the snippet they use on their site. In the case where there are cross-origin tracking entries buried in one project that is affecting another project, it quickly gets out of hand.
This means before I can make cross-origin tracking smart, I have to make cross-origin tracking right. And, these are the steps:
Starting off, the old cross-origin settings data is associated with the projects of an account, even though there is only one snippet per account. That means the data model needed to be modified for both projects and accounts, moving the field from projects to accounts. Since the old cross-origin tracking was an existing feature that many, if not all, customers use rely on for their experiments, the current data within the account will need to be preserved and eventually moved to the project settings. This mean that the old settings cannot be simply removed and replaced with the new version.
To accomplish this, I needed to write a migration script that took the data from the old fields, remove the duplicates (in theory, an account can have multiple projects with the same origin entries), and insert them into the new account-based fields. Once the data migrated successfully, I can flip the feature flag on the account to switch them over to the new settings.
Once cross-origin tracking data exist on the account-level, I can continue to implement the functionality to automatically determine the cross-origins for an account. When an experiment is created on Optimizely, the customer is assumed to want to take advantage of the features of Optimizely and enable cross-origin tracking for all subdomains and protocols on the same domain. Sounds simple enough, so let's move onto the next step.
Like I mentioned earlier, the entries for cross-origin tracking have a property called match types, which helps customers identify and allowlist a set of URLs given a pattern. I can leverage this property match types have to help customers automatically set their origin settings. There is a match type called Substring match that, as the name implies, will match and allowlist any website with a given substring. For example:
Substring match "map"
If I were to enable cross-origin tracking for all subdomains and protocols on the same domain, substring match works well. I can take an experiment's URL and add it as an entry. However, there is a case where Substring match will match with unintended URLs. As an example:
Experiment URL: http://optimizely.com/some/path/here
Origin extracted: optimizely.com
Substring match "optimizely.com"
Now, clearly notoptimizely.com and companyevil.com are unintended matches as they exists outside of the optimizely.com domain. We really only want to match the suffix of a root domain and not just the substring in this case. The solution is to introduce a new match type, Suffix match, that prevents matching any domains outside of a given URL's root domain.
The way that Substring match is implemented is basically a preset Regex match, so Suffix match should be implemented the same way. The changes are to insert an explicit start or a period _(^|\\.) at the beginning of the origin and insert an explicit end $ at the end of the origin. This will make sure the root domain of any URL will always be the origin and nothing else.
By implementing Suffix match, it can significantly reduce the amount of data noise customers need to see on their dashboard (as opposed to seeing every entry), while maintaining useful information about their cross-origin tracking settings.
Now that the Suffix match type is set up for use, the last part is to backtrack a little and actually determine the origin to extract from a given experiment URL. A naive solution is to take the root domain from all the experiments the customer is running, dedupe the list, and add them as cross-origin tracking entries with Suffix match. However, there are cases where the user might want to run an experiment on a URL where they don't own the entire root domain, but a subdomain (e.g. GitHub pages, Heroku instances, AWS instances etc). For example:
Experiment URL: http://optimizely.github.io/some/more/paths
Naive solution yields: github.io
In this case we definitely don't want to set github.io as an origin entry. This can even extend into further subdomain "chunks" (turns out there's not really a word to describe the subdomain parts separated by periods), so we need to be a generalized solution. We want the find the set of least permissive origins for the customer's experiments that accomplishes the goal of removing the configuration step.
As to how an automatically discovered origin entry is determined, I decided that if an account runs two or more experiments with URLs that contain the same subdomains chunks, I can assume that they control the subdomain chunk one level lower (can be the root domain too), and the algorithm will provide a cross-origin tracking entry that encompasses those URLs.
That's a lot of words and might not make sense, but the behavior is actually quite simple. Here's a few examples:
Experiment URL: http://three.two.one.happynewyear.com
Auto Origin: three.two.one.happynewyear.com
Auto Origin: two.one.happynewyear.com
Auto Origin: happynewyear.com
Thankfully this step can be completed in linear complexity, with a dictionary of root domains to n-ary trees, where the n-ary tree stored the subdomain chunks into each node. Once all of the URLs are broken down into chunks and inserted into the right nodes, it traversed the tree to find the deepest node with only one child for each domain. The result was then added as a Suffix match entry into an account's cross-origin tracking setting. This allows us to help the customer set the least permissive origins that encompasses all of their domains. Yay! We solved cross-origin tracking!
Instead, a more sensible approach is to introduce Blocked Origins, a set of origin entries that trump and block the automatically discovered origins and manually added origins. In the off chance that a good majority of the automatically discovered origins are not suitable, the customer can always disable the feature entirely and revert back to manual input.
With all of these parts combined, this makes up the new version of Cross-Origin Tracking for Optimizely X Web! The feature is now live on Optimizely's dashboard, and hopefully you'll never need to touch the settings page!
I had a lot of fun working on my project! I was given a lot of responsibility and at the end it was very rewarding. Although I was the primary engineer for this feature, it wouldn't have been possible without the help and advice from my mentor, other engineers, product managers, and my manager. This post is getting a little long, so I'll end it here. I am really grateful that Optimizely brought me out to San Francisco this summer and I will never forget my experience here.
February 18, 2021
Are the articles up-to-date or still relevant? Explainer here →