Python Regex solution?

Question

I have a string:

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

I want to get a result like this:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]

I tried this:

match = re.findall("([fh]t*ps?|file):[\\/]*(.*?)(:\d+|(?=[\\\/]))", line)

And then I got:

[["https", "dbwebb.se", ""], ["ftp", "bth.com", ":32"], ["file", "localhost", ":8585"], ["http", "v2-dbwebb.se", ""]]

There is one difference: you can see ":32" and ":8585". How can I get just "32" and "8585", without the leading ":"? Thanks!

Wiktor Stribiżew · Accepted Answer · 2017-09-25 15:16:33Z

I suggest

import re
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
match = re.findall(r"([fh]t*ps?|file)://([^/]*?)(?::(\d+))?(?:/|$)", line)
print(match)

The main point is the (?::(\d+))?(?:/|$) part: the colon followed by 1+ digits is optional ((?:...)? matches 1 or 0 times) and only the digits are captured, while (?:/|$) matches a / or the end of the string.

Details

([fh]t*ps?|file) - Group 1 (the first item in the tuple): either of two alternatives:
- [fh]t*ps? - f or h, zero or more t, a p, and an optional s
- | - or
- file - the literal substring file
:// - a literal substring
([^/]*?) - Group 2 (the second item in the tuple): any 0 or more chars other than /, as few as possible
(?::(\d+))? - an optional sequence of:
- : - a colon
- (\d+) - Group 3 (the third item in the tuple): one or more digits
(?:/|$) - a / or end of string.
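
The breakdown above can also be written straight into the pattern using re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch: the name URL_PARTS is made up here, and the pattern itself is unchanged from the one suggested above.

```python
import re

line = ("https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, "
        "file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack")

# URL_PARTS is a hypothetical name for this sketch; same pattern, spread out.
URL_PARTS = re.compile(r"""
    ([fh]t*ps?|file)   # group 1: the scheme
    ://                # literal separator
    ([^/]*?)           # group 2: host, as few non-slash chars as possible
    (?: : (\d+) )?     # optional colon, then group 3: the port digits only
    (?: / | $ )        # a slash or the end of the string
""", re.VERBOSE)

matches = URL_PARTS.findall(line)
print(matches)
```

Because the colon sits outside group 3, findall returns '32' and '8585' without it, and an empty string where the optional group did not take part in the match.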

edited Sep 25, 2017 at 15:16

answered Sep 25, 2017 at 13:56

Wiktor Stribiżew

Casimir et Hippolyte · Accepted Answer · 2017-09-25 14:01:28Z

Regex isn't the right tool for parsing URLs; the standard library has a dedicated module for this complicated task, urllib.parse:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

result = []
for i in line.split(', '):
    o = urlparse(i)
    result.append([o.scheme, o.hostname, o.port])
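
Note that o.port is an int, or None when the URL carries no port, so the loop above yields neither ':32' nor an empty string. A quick check, with the same loop wrapped in a function (the name parse_parts is made up for this sketch, and the input is shortened to two of the URLs):

```python
from urllib.parse import urlparse

def parse_parts(line):
    """Return [scheme, hostname, port] for each ', '-separated URL."""
    result = []
    for item in line.split(', '):
        o = urlparse(item)
        # o.port is an int, or None if the URL has no explicit port
        result.append([o.scheme, o.hostname, o.port])
    return result

parts = parse_parts("ftp://bth.com:32/files/im.jpeg, http://v2-dbwebb.se/do%hack")
print(parts)  # [['ftp', 'bth.com', 32], ['http', 'v2-dbwebb.se', None]]
```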

answered Sep 25, 2017 at 14:01

Casimir et Hippolyte


Jon Clements · Accepted Answer · 2017-09-25 14:03:45Z

Instead of a regex, why not split on ', ' and then use Python's urllib.parse.urlparse, e.g.:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
output = [urlparse(url) for url in line.split(', ')]

Gives you:

[ParseResult(scheme='https', netloc='dbwebb.se', path='/kunskap/uml', params='', query='', fragment='sequence'),
 ParseResult(scheme='ftp', netloc='bth.com:32', path='/files/im.jpeg', params='', query='', fragment=''),
 ParseResult(scheme='file', netloc='localhost:8585', path='/zipit', params='', query='', fragment=''),
 ParseResult(scheme='http', netloc='v2-dbwebb.se', path='/do%hack', params='', query='', fragment='')]

Then pick out the attributes you want:

wanted = [(url.scheme, url.hostname, url.port or '') for url in output]

Which gives you:

[('https', 'dbwebb.se', ''),
 ('ftp', 'bth.com', 32),
 ('file', 'localhost', 8585),
 ('http', 'v2-dbwebb.se', '')]
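
The ports above are ints rather than the strings shown in the question. If the exact format from the question is needed, one possible tweak is to convert the port explicitly. This is a sketch only, and the function name url_parts_as_strings is made up for illustration:

```python
from urllib.parse import urlparse

def url_parts_as_strings(line):
    """Return (scheme, hostname, port) tuples with the port as a string."""
    parts = []
    for url in line.split(', '):
        parsed = urlparse(url)
        # parsed.port is an int or None; map None to '' and ints to str
        port = '' if parsed.port is None else str(parsed.port)
        parts.append((parsed.scheme, parsed.hostname, port))
    return parts

line = ("https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, "
        "file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack")
print(url_parts_as_strings(line))
```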
