
I have a test file that looks like this:

5002 2014-11-24 12:59:37.112 2014-11-24 12:59:37.112 0.000 UDP ...... 23.234.22.106 48104 101 0 0 8.8.8.8 53 68.0 1.0 1 0.0 0 68 0 48

Each line contains a source IP and a destination IP. Here, the source IP is 23.234.22.106 and the destination IP is 8.8.8.8. I look up each IP address and then scrape the resulting web page with xidel. Here is the script:

egrep -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}" test-data.csv | sort | uniq | while read i #to get network id from arin.net
do
xidel http://whois.arin.net/rest/ip/$i -e "//table/tbody/tr[3]/td[2] " | sed 's/\/[0-9]\{1,2\}/\n/g'
done | sort | uniq | egrep -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}" | 
while read j ############## to get other information from ip-tracker.org
do
xidel http://www.ip-tracker.org/locator/ip-lookup.php?ip=$j -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[2]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[3]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[4]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[5]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[6]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[7]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[8]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[9]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[10]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[11]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[12]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[13]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[14]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[15]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[16]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[17]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[18]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[19]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[20]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[21]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[22]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[23]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[24]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[25]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[26]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[27]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[28]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[29]"
done > abcd
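
The two text-processing steps in the script can be checked without touching the network. The sketch below runs the IP regex against the sample record from above, then applies the same sed expression to a CIDR string. The value 8.8.8.0/24 is only an assumed example of what the ARIN table cell contains; the variable names are made up for illustration.

```shell
# Sample record copied from the question.
line='5002 2014-11-24 12:59:37.112 2014-11-24 12:59:37.112 0.000 UDP ...... 23.234.22.106 48104 101 0 0 8.8.8.8 53 68.0 1.0 1 0.0 0 68 0 48'

# Step 1: pull out dotted quads and de-duplicate them.
# "sort -u" does the same job as "sort | uniq"; LC_ALL=C fixes the sort order.
unique_ips=$(printf '%s\n' "$line" |
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' |
    LC_ALL=C sort -u)
printf '%s\n' "$unique_ips"

# Step 2: cut the prefix length off a CIDR block to get the network ID.
# ASSUMPTION: "8.8.8.0/24" stands in for the text scraped from ARIN.
network_id=$(printf '8.8.8.0/24\n' |
    sed 's/\/[0-9]\{1,2\}/\n/g' |
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}')
printf '%s\n' "$network_id"
```

The timestamps and counters in the record do not match the regex because none of them has three dots between groups of digits, so only the two addresses come out.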

The first xidel call scrapes ARIN and the second scrapes ip-tracker.org.

The output of the first xidel call is the network ID, and the IP lookup is then done on that network ID. The output of the second xidel call looks like this:

IP Address: 8.8.8.0
[IP Blacklist Check]
Reverse DNS:** server can't find 0.8.8.8.in-addr.arpa: SERVFAIL
Hostname: 8.8.8.0
IP Lookup Location For IP Address: 8.8.8.0
Continent:North America (NA)
Country: United States    (US)
Capital:Washington
State:California
City Location:Mountain View
Postal:94040
Area:650
Metro:807
ISP:Level 3 Communications
Organization:Level 3 Communications
AS Number:AS15169 Google Inc.
Time Zone: America/Los_Angeles
Local Time:10:51:40
Timezone GMT offset:-25200
Sunrise / Sunset:06:26 / 19:48
Extra IP Lookup Finder Info for IP Address: 8.8.8.0
Continent Lat/Lon: 46.07305 / -100.546
Country Lat/Lon: 38 / -98
City Lat/Lon: (37.3845) / (-122.0881)
IP Language:    English
IP Address Speed:Dialup Internet Speed
[
Check Internet Speed]
IP Currency:United States dollar($) (USD)
IDD Code:+1

As of now, it takes 6 hours to complete this task when there are 1.5 million lines in my test file, because the script runs serially.
Is there any way to divide this task so that the script runs in parallel and the time is reduced significantly? Any help would be appreciated.

P.S.: I am using a VM with 1 processor and 10 GB of RAM.

  • Have you tried GNU parallel? Your 1-processor VM gives me pause; however, since these downloads are likely IO-bound, there's hope yet. Commented Apr 19, 2016 at 19:39
  • What do you mean by IO-bound? @1_CR Commented Apr 19, 2016 at 19:41
  • See IO Bound. The majority of the 6 hours your script takes is probably spent waiting on network input Commented Apr 19, 2016 at 19:47
  • Isn't splitting the file into n parts and launching the script n times a good solution for you? Commented Apr 19, 2016 at 20:09
  • @mazs I can try that! Commented Apr 19, 2016 at 21:04
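
To illustrate the IO-bound point from the comments: a process that is waiting on the network uses almost no CPU, so several waits can overlap even on one core. In this rough sketch, fake_lookup is a hypothetical stand-in for one xidel call and sleep 1 plays the part of the network delay; the addresses are placeholders.

```shell
# fake_lookup is a hypothetical stand-in for a single xidel request:
# it only waits, the way a process blocked on a network reply does.
fake_lookup() {
    sleep 1
    echo "done $1"
}

start=$(date +%s)
for ip in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4; do
    fake_lookup "$ip" &     # start each lookup in the background
done
wait                        # block until all four have finished
elapsed=$(( $(date +%s) - start ))
echo "4 lookups took about ${elapsed}s"
```

Run one after another, the four calls would need about 4 seconds; started together, they should finish in roughly the time of one, which is why parallelism can help here despite the single processor.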

1 Answer

Adjust -jXXX% as needed:

PARALLEL=-j200%
export PARALLEL

arin() {
    #to get network id from arin.net
    i="$@"
    xidel http://whois.arin.net/rest/ip/$i -e "//table/tbody/tr[3]/td[2] " |
    sed 's/\/[0-9]\{1,2\}/\n/g'
}
export -f arin

iptrac() {
    # to get other information from ip-tracker.org
    j="$@"
    xidel http://www.ip-tracker.org/locator/ip-lookup.php?ip=$j -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[2]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[3]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[4]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[5]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[6]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[7]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[8]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[9]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[10]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[11]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[12]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[13]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[14]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[15]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[16]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[17]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[18]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[19]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[20]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[21]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[22]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[23]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[24]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[25]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[26]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[27]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[28]" -e "//table/tbody/tr[3]/td[2]/table/tbody/tr[29]"
}
export -f iptrac

egrep -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}" test-data.csv | sort | uniq | 
parallel arin |
sort | uniq | egrep -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}" | 
parallel iptrac > abcd
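
If GNU parallel is not installed, xargs -P gives a rougher version of the same fan-out. This is a different tool from the one used above, shown only as a sketch: the worker string is a hypothetical stand-in for arin/iptrac that sleeps instead of calling xidel, and the addresses are placeholders.

```shell
# Hypothetical stand-in for arin()/iptrac(): waits briefly, then prints its argument.
worker='sleep 1; echo "looked up $1"'

# Run up to 4 workers at once, one input line per worker.
# Completion order is not fixed, so the output is sorted afterwards.
results=$(printf '%s\n' 8.8.8.8 23.234.22.106 192.0.2.1 198.51.100.7 |
    xargs -P 4 -I{} sh -c "$worker" _ {} |
    LC_ALL=C sort)
printf '%s\n' "$results"
```

One difference worth knowing: GNU parallel groups each job's output by default, whereas xargs -P can interleave output from concurrent jobs, which matters for multi-line results such as the ip-tracker.org records.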
  • Can you explain the first 2 lines of the code? Where is PARALLEL (first line) being used? Commented Apr 20, 2016 at 17:16
  • gnu.org/software/parallel/man.html#ENVIRONMENT-VARIABLES If you prefer, you can just write -j200% after parallel. It will do the same. Commented Apr 20, 2016 at 20:28
  • It worked! With one core it was fast, but there was packet loss. With 4 cores, the 6 hours were reduced to a mere 7 minutes. Thank you. Commented Apr 21, 2016 at 3:36
