
How To Craft A Regex Value For www.google.com?


Dashan


Hi Guys,

This is my first post on this forum; I hope I am following the rules and not violating any of them.

I am seeking your help here.

Can any one of you please help me craft a RegEx value for www.google.com?

I have gone through some examples; however, I was not able to extract the google.com string using them.

The following is the example I referred to.

^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$

Could you please help me understand how I can use this pattern to extract www.google.com?

Waiting for expert advice here.

Regards,

Darshan


If you are looking specifically for www.google.com in a Perl script then I would use

m/(www\.google\.com)/i

If looking for it at the start of a URL I would use

m/^https?:\/\/(www\.google\.com)/i

If you are trying to pull out the hostname from a URL then something like this

m/https?:\/\/(\w+\.[\w.\-]+)(?:\/|$)/

Note I am writing these off the top of my head, so they may need some tweaking for your specific requirements, or even to get them working.
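For anyone not working in Perl, here is a rough Python sketch of the same three ideas. The helper name first_group and the sample strings are made up for illustration, and re.IGNORECASE stands in for Perl's /i. For the hostname case the sketch ends the match with (?:/|$), since inside a character class a $ is just a literal dollar sign rather than "end of string".

```python
import re

# Python translations of the three Perl patterns; re.IGNORECASE plays the role of /i.
LITERAL = re.compile(r"(www\.google\.com)", re.IGNORECASE)
AT_URL_START = re.compile(r"^https?://(www\.google\.com)", re.IGNORECASE)
# Hostname: a first label, a dot, then the rest, ending at "/" or end of string.
HOSTNAME = re.compile(r"https?://(\w+\.[\w.\-]+)(?:/|$)")


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


sample = "See http://WWW.Google.com/search?q=regex for details"
print(first_group(LITERAL, sample))        # WWW.Google.com
print(first_group(AT_URL_START, sample))   # None: the text does not start with http
print(first_group(HOSTNAME, "https://mail.example.org/inbox"))  # mail.example.org
```

As with the Perl versions, treat this as a starting point: it does not handle ports, user info, or hostnames whose first label contains a hyphen.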


If you're looking for this text within a larger string, place your regexp between two catch-alls like this:

.*PATTERN.*

.* is a catch-all as it says it matches any character (the dot) any number of times (the asterisk).

Next comes the pattern for www.google.com, which I assume you want to match regardless of case.

Start with www and decide if it _must_ be that, or really any word. If it must be www then the pattern is [wW]3 (the character 'w' or 'W' exactly 3 times). If it can be any letter or digit, any number of times, the pattern would be [a-zA-Z0-9]* (any combination of characters from the ranges a-z, A-Z or 0-9, regardless of how many times).

Then you need to match the dot. Since a dot is a special character that matches any character (including a dot) you must escape it by preceding it with a backslash: \.

Next is the word google, which you want to match in full, but in a case-insensitive way. The case-insensitive bit hurts you here and some languages have special constructs for it, but what is certain to work is this: [gG][oO]2[gG][lL][eE]

Then again an escaped dot followed by the extension, again in a case-insensitive way: [cC][oO][mM]

So as a whole, your pattern should be something like this:

.*[wW]3\.[gG][oO]2[gG][lL][eE]\.[cC][oO][mM].*


"So as a whole, your pattern should be something like this:

.*[wW]3\.[gG][oO]2[gG][lL][eE]\.[cC][oO][mM].*"

I think you may have overcomplicated the regular expression there, as well as missing a few key bits. There is no need to put .* at the start or end of the regular expression just for matching, as it will match anywhere in the string unless you explicitly tell it to match the start and/or end of the line. You should only put .* at the start or end if you explicitly want to capture what is there (e.g. if using a regular expression to replace everything up to or after a match).

Having said that, it is useful for anybody dealing with regular expressions to understand how .* and .+ will match (a + is similar to the * for repeating matches, except that it means one or more repeats instead of the *'s zero or more repeats).
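To make the points about .* and anchoring concrete, here is a small Python sketch (Python rather than Perl is my choice here, and the sample text is invented). It assumes re.search, which scans the whole string much like Perl's m//, and it reuses the anchored hostname pattern from the first post to show what ^ and $ do.

```python
import re

text = "visit www.google.com today"

# search() scans the whole string, so no leading or trailing .* is needed.
found = re.search(r"www\.google\.com", text)

# The anchored pattern from the first post must cover the entire string.
HOST = re.compile(r"^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$")
whole = HOST.search("www.google.com")
partial = HOST.search(text)  # None: the surrounding words break the ^...$ anchors

# .* earns its place when the surrounding text itself matters, e.g. in a replacement.
stripped = re.sub(r"^.*?(www\.google\.com).*$", r"\1", text)

# * allows zero repeats, + requires at least one.
star = re.fullmatch(r"w*\.google\.com", ".google.com")
plus = re.fullmatch(r"w+\.google\.com", ".google.com")

print(found.group(0), whole.group(0), partial, stripped)
print(star is not None, plus is not None)
```

Note that Python also has re.match, which only matches at the start of the string, so the "matches anywhere" behaviour depends on which function you call.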

As you say, you can also drop specifying both the upper and lower case of each letter and instead simply set the regular expression to be case insensitive. I have yet to find a regular expression engine that doesn't have that option (though I am sure that they do exist they are rare these days).

Finally, when specifying repeats you need to put the count in braces, { and }, or the pattern will try to match the value 'w3.go2gle.com' and not 'www.google.com'. Braces can hold either a fixed number of repeats, e.g. m/w{3}/ will match www, or a range of repeats, e.g. m/w{1,3}/ will match w, ww, and www.
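A quick way to see the difference is to try both forms. This is only a sketch in Python (the variable names are mine), with re.IGNORECASE taking the place of the hand-written [xX] pairs:

```python
import re

# Without braces the digit is a literal: [wW]3 means one "w" or "W" followed by "3".
no_braces = re.compile(r"[wW]3\.[gG][oO]2[gG][lL][eE]\.[cC][oO][mM]")

# With braces the count applies to the preceding item, and the
# IGNORECASE flag replaces the hand-written [xX] pairs.
with_braces = re.compile(r"w{3}\.go{2}gle\.com", re.IGNORECASE)

print(bool(no_braces.fullmatch("w3.go2gle.com")))     # True
print(bool(no_braces.fullmatch("www.google.com")))    # False
print(bool(with_braces.fullmatch("WWW.Google.COM")))  # True

# A range such as {1,3} accepts w, ww and www, but not wwww.
accepted = [s for s in ("w", "ww", "www", "wwww") if re.fullmatch(r"w{1,3}", s)]
print(accepted)  # ['w', 'ww', 'www']
```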

Well done though on pointing out the need to escape some characters if you want to match them, as that can catch out a lot of beginners. Common characters that need escaping in regular expressions include, but are not limited to, the following:

[ ] { } \ . + * | ? ^ $ ( )
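If you would rather not escape by hand, some languages can do it for you. As one example (Python here, which is my assumption and not something the original poster mentioned), re.escape adds the backslashes to a literal string:

```python
import re

host = "www.google.com"

# re.escape() backslash-escapes the metacharacters in a literal string.
escaped = re.escape(host)
print(escaped)  # www\.google\.com

# Unescaped, each dot matches any character, so look-alikes slip through.
print(bool(re.fullmatch(host, "wwwXgoogleYcom")))     # True
print(bool(re.fullmatch(escaped, "wwwXgoogleYcom")))  # False
print(bool(re.fullmatch(escaped, host)))              # True
```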


"I have yet to find a regular expression engine that doesn't have that option (though I am sure that they do exist they are rare these days)."

I was about to finger the Java Pattern class as a leading culprit, but in the process found that it indeed does have a flag for matching case-insensitively.

So thank you. Learned something new today.


I've recently been using regular expressions in VB.NET. I am not sure if I am missing something in the original post, but if you are just looking for "www.google.com" and not trying to pull domains then you should be able to use the following:


"\b(www\.google\.com)"

I saved a copy of this cheat sheet for future reference; it may be helpful for you: http://www.addedbyte...oad/regular-expressions-cheat-sheet-v1/png/

Edit: Just to note that you would need to convert your search string to lowercase first; otherwise my pattern won't match text that contains capital letters.
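Here is roughly the same idea sketched in Python rather than VB.NET (so the API is different, and the test strings are invented). It shows what the leading \b buys you, and uses a case-insensitive flag instead of lowercasing the input; .NET has a comparable RegexOptions.IgnoreCase option.

```python
import re

# \b requires a word boundary just before the first "w".
pattern = re.compile(r"\b(www\.google\.com)", re.IGNORECASE)

print(bool(pattern.search("go to WWW.GOOGLE.COM now")))  # True: the flag handles case
print(bool(pattern.search("xwww.google.com")))           # False: no boundary between x and w
print(bool(pattern.search("http://www.google.com/")))    # True: "/" then "w" is a boundary
```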

Edited by MRGRIM

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
  • . (a period) -- matches any single character except newline '\n'
  • \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b -- boundary between word and non-word
  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r -- tab, newline, return
  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end -- match the start or end of the string
  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
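The basic patterns in the list above can be tried out with a few lines of Python. This is only a sketch: the sample string and variable names are invented for illustration, and other regex engines may differ in small details.

```python
import re

sample = "id_42 costs $9.50\tat shop@example.com\n"

words = re.findall(r"\w+", sample)         # runs of word characters (letters, digits, _)
digits = re.findall(r"\d+", sample)        # runs of decimal digits
spaces = re.findall(r"\s", sample)         # single whitespace characters
whole_at = re.findall(r"\bat\b", sample)   # "at" only where it stands as its own word
price = re.findall(r"\$\d+\.\d+", sample)  # \$ and \. match a literal $ and .
at_sign = re.findall(r"\@", sample)        # escaping a non-special character is harmless
starts = bool(re.search(r"^id", sample))   # ^ anchors at the start of the string

for name, value in [("words", words), ("digits", digits), ("spaces", spaces),
                    ("whole_at", whole_at), ("price", price), ("at_sign", at_sign),
                    ("starts", starts)]:
    print(name, value)
```

Note how \w+ splits 9.50 into two pieces, because the period is not a word character.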

