If you are responsible for analysing a list of URL’s in Excel, then you may need to extract the domain from each URL. After extracting the domain you can perform useful reporting tasks such as grouping the domains, or counting the number of unique domains in the list.
This can be quite a difficult task especially if the URL’s are not consistent. In this article we explore a few different approaches to extract the domain from a URL. The method you choose will depend on the appearance of the URL’s in the list and the format you would like to extract the domain in.
This article covers;
- Extract the domain including the URL protocol.
- Extract the domain ignoring URL protocol.
- Extract the domain without the www.
- Extract root URL’s that don’t end with a slash (/).
- Using helper columns to break up complex formulae.
Extract the Domain including the URL Protocol
Let’s start with a simple way that extracts the domain with the URL protocol (http, https, etc).
The formula below will extract all the text up to the slash (/) after the domain.
=LEFT(A2,SEARCH(“/”,A2,9)-1)
The LEFT function is used to extract the domain from cell A2.
The SEARCH function has been used to find the position of the slash (the FIND function works equally well for this). Minus 1 removes the slash from the extracted text. The search starts from character 9 to avoid the slashes that come with the URL protocol.
Extract the Domain Excluding the URL Protocol
To ignore the URL protocol when extracting domains you can use the MID function. This function is used to extract text from the middle of a string of characters.
The formula below extracts all text between the URL protocol and the slash after the domain.
=MID(A2,SEARCH(":",A2)+3,SEARCH("/",A2,9)-SEARCH(":",A2)-3)
The first SEARCH function detailed below is used to find the first character of the domain. It does this by returning the position of the colon (:) in the URL. 3 is then added to skip the two slashes (//) and find the position of the first domain character.
SEARCH(":",A2)+3
You then need to determine how many characters the domain is. This is done by finding the position of the slash after the domain like in the first example, and subtracting the position of the first character of the domain (this is done by finding the colon and subtracting 3 to account for the two slashes before the domain and the one after it).
Return the Domain Without the www
You may require the domain without the www part. The www is not present in all of the URL’s in the list, but if it is then it needs to be excluded from the result.
The formula below uses the SUBSTITUTE function twice to remove the www.. This is the only change from the previous formula that extracts it with the “www.”. The two SUBSTITUTE functions have been emboldened so they are easier to identify within the formula.
=MID(SUBSTITUTE(A2,"www.",""),SEARCH(":",A2)+3,SEARCH("/",SUBSTITUTE(A2,"www.",""),9)-SEARCH(":",A2)-3)
The first instance of SUBSTITUTE is used to provide the string without the “www.” for the MID function to extract from. The second instance is used when calculating how many characters are in the domain name. The “www.” is removed again here to ignore it in the calculation.
Extract the Domain including Root Domains without the Ending Slash
In the example below, one of the URL’s does not end with a slash. All of the previous formulas have used this character to identify the end of the domain.
To work with these URL instances in a big list, we will need include another way of returning the domain.
The formula below includes the IF, ISERROR and a LEN function to get this job done. The differences between this formula and the previous one have been highlighted in bold.
=MID(SUBSTITUTE(A2,"www.",""),SEARCH(":",A2)+3,IF(ISERROR(SEARCH("/",A2,9)),LEN(SUBSTITUTE(A2,"www.",""))-SEARCH(":",A2)-2,SEARCH("/",SUBSTITUTE(A2,"www.",""),9)-SEARCH(":",A2)-3))
Now this is a big formula, so let’s break down how it works.
Firstly, just like in the previous example, the MID function is supplied with a URL without the “www.” and told to extract characters starting with the first domain character (3 characters after the colon).
=MID(SUBSTITUTE(A2,"www.",""),SEARCH(":",A2)+3
Then to calculate how many characters in the domain name, this is when the IF function is inserted to check if it has a slash at the end of the domain. The SEARCH function looks for this slash, and the ISERROR function reports back to IF to say if there is not one.
IF(ISERROR(SEARCH("/",A2,9))
If there is not a slash, then the LEN function is used to find the last character of the string (last character of the domain name). The rest of the formula is similar to the previous example to find domain length, except this one subtracts 2 and not 3. This is because it does not have to account for the ending slash like the others.
LEN(SUBSTITUTE(A2,"www.",""))-SEARCH(":",A2)-2
Using Helper Columns to Break up Complex Formulae
When formulas get large and complex, you may find it easier to break it up into smaller chunks and use helper columns.
These helper columns can be used to store the position of the important characters in these tasks. These columns can then be referred to in our formula, but hidden on the spreadsheet.
The formula below has been entered into cell F2 and uses the helper columns of B, C, D and E.
=MID(B2,C2,E2)
Column B has been used to remove the “www.” from the URL string, if it is present, using the SUBSTITUTE function.
=SUBSTITUTE(A2,"www.","")
Column C has been used to identify the position of the first character of the domain. This is done using the SEARCH function to find the colon and add 3.
=SEARCH(":",B2)+3
Column D is used to determine the end of the domain name part of the URL. Here the IF function is used with ISERROR and SEARCH to check if a slash exists at the end of the domain name. If it does then return the position of that slash.
But if not, then use the LEN function to return how many characters in the URL string. 1 is then added to this total to compensate for the missing slash.
=IF(ISERROR(SEARCH("/",B2,9)),LEN(B2)+1,SEARCH("/",B2,9))
Finally column E simply subtracts the results of columns D and C to return how many characters are in the domain name.
=D2-C2
Due to breaking the formula up into multiple different tasks, the finishing MID function is quite simple. It uses the URL’s in column B, and returns the number of characters in column E, from the character in column C.
=MID(B2,C2,E2)
Leave a Reply