I have a string:

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

I want a result like this:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]

I tried this:

match = re.findall("([fh]t*ps?|file):[\\/]*(.*?)(:\d+|(?=[\\\/]))", line)

And then I got:

[["https", "dbwebb.se", ""], ["ftp", "bth.com", ":32"], ["file", "localhost", ":8585"], ["http", "v2-dbwebb.se", ""]]

There is one difference: you can see ":32" and ":8585". How can I get just "32" and "8585", without the ":"? Thanks!

3 Answers

I suggest

import re
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
match = re.findall(r"([fh]t*ps?|file)://([^/]*?)(?::(\d+))?(?:/|$)", line)
print(match)

See the Python demo

The main point is the (?::(\d+))?(?:/|$) part: the colon plus one or more digits is optional ((?:...)? matches 1 or 0 times) and only the digits are captured, while (?:/|$) matches a / or the end of the string.

Details

  • ([fh]t*ps?|file) - Group 1 (the first item in the tuple): either of
    • [fh]t*ps? - f or h, zero or more t, p and an optional s
    • | - or
    • file - the file substring
  • :// - a literal substring
  • ([^/]*?) - Group 2 (the second item in the tuple): any 0 or more chars other than /, as few as possible
  • (?::(\d+))? - an optional sequence of:
    • : - a colon
    • (\d+) - Group 3 (the third item in the tuple): one or more digits
  • (?:/|$) - a / or end of string.
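
The breakdown above can be sketched as a commented pattern. This is only an illustration: it uses re.VERBOSE so each part can carry its own comment, and the name URL_PARTS is made up for this example. The pattern itself is the one from the answer.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that
# whitespace is ignored and each part can be commented.
# URL_PARTS is a hypothetical name chosen for this sketch.
URL_PARTS = re.compile(
    r"""
    ([fh]t*ps?|file)   # group 1: scheme, e.g. http, https, ftp or file
    ://                # literal separator
    ([^/]*?)           # group 2: host, as few non-slash chars as possible
    (?: : (\d+) )?     # optional colon followed by group 3: the port digits
    (?: / | $ )        # a slash or the end of the string
    """,
    re.VERBOSE,
)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

matches = URL_PARTS.findall(line)
print(matches)
```

Because the colon sits outside the capturing group, only the digits end up in the third tuple item; when the optional part does not match, findall puts an empty string there.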

Regex isn't the right tool for parsing URLs. Python has a dedicated module for this complicated task, urllib.parse:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

result = []
for i in line.split(', '):
    o = urlparse(i)
    # o.port is an int, or None when the URL has no port
    result.append([o.scheme, o.hostname, o.port])

print(result)
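
The loop above relies on the separator being exactly ", ". As a rough sketch, assuming the input might have uneven spacing around the commas (the messy string below is a made-up variation of the question's data), the pieces can be stripped before parsing:

```python
from urllib.parse import urlparse

# Hypothetical input with inconsistent spacing around the commas.
messy = "https://dbwebb.se/kunskap/uml#sequence,ftp://bth.com:32/files/im.jpeg ,  file://localhost:8585/zipit"

result = []
for part in messy.split(","):
    url = part.strip()   # drop whitespace left over from the split
    if not url:          # skip empty pieces, e.g. from a trailing comma
        continue
    o = urlparse(url)
    result.append([o.scheme, o.hostname, o.port])

print(result)
```

This still assumes that no URL contains a comma itself; if that can happen, splitting on commas is not safe.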

Instead of a regex, why not split on the comma and then use Python's urllib.parse.urlparse, e.g.:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
output = [urlparse(url) for url in line.split(', ')]

Gives you:

[ParseResult(scheme='https', netloc='dbwebb.se', path='/kunskap/uml', params='', query='', fragment='sequence'),
 ParseResult(scheme='ftp', netloc='bth.com:32', path='/files/im.jpeg', params='', query='', fragment=''),
 ParseResult(scheme='file', netloc='localhost:8585', path='/zipit', params='', query='', fragment=''),
 ParseResult(scheme='http', netloc='v2-dbwebb.se', path='/do%hack', params='', query='', fragment='')]

Then pick out the elements you want:

wanted = [(url.scheme, url.hostname, url.port or '') for url in output]

Which gives you:

[('https', 'dbwebb.se', ''),
 ('ftp', 'bth.com', 32),
 ('file', 'localhost', 8585),
 ('http', 'v2-dbwebb.se', '')]
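
Note that the ports come back as integers here, while the question asks for strings. A minimal sketch of one way to get strings, assuming an empty string should stand for "no port": test for None explicitly rather than using `or`, since `url.port or ''` would also turn a port of 0 into ''.

```python
from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# urlparse(...).port is an int, or None when the URL has no port,
# so convert it to str and map None to an empty string.
wanted = [
    (u.scheme, u.hostname, "" if u.port is None else str(u.port))
    for u in map(urlparse, line.split(", "))
]
print(wanted)
```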