Just my blog
A blog about everything, mostly about the tech stuff I have made. Here is the list of stuff I'm using on my blog. Feel free to ask me about implementations.
- Mobaxterm SSH RDP FTP...
- Thunderbird Email client
- Filezilla FTP client/server
- Nirsoft Win utils
- Sysinternals Win utils
- Pi-Hole AD block by DNS
- NUT UPS manager
- Rpi MON Raspberry monitoring
- Free CAD 3D modelling
- Free Commander Far-like filemanager
- Bitwarden Password manager
- Django web framework
- celery multi-tasking
- celery-beat Celery + Django
- celery-results Celery + Django
- Pillow Python image lib
- wsgi mod Apache + Python
- requests best in WEB requests
- openpyxl make Excel docs
- p4python Perforce + Python
- paramiko SSH + Python
- pyvmomi ESXi Vcenter + Python
I'm using these libraries so you can ask me about them.
Python HTMLParser
How to spend two days when you know nothing about Python:
- I needed to parse the HTML code of a page that stores the VK id and username of every person who shared a post.
    with open('test.html', 'r', encoding='utf-8') as content_file:
        read_data = content_file.read()

    from html.parser import HTMLParser
    import re


    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # Turn the whole attribute list into one string and search it.
            vk_id = str(attrs)
            for line in vk_id:
                vk = re.findall(r'/\S+$', vk_id)
                vk_fnd = str(vk)
                if re.search(r"/\w+'\)\]", vk_fnd):
                    global vk_read
                    vk_read = vk_fnd
                    # Strip the leftover punctuation from the stringified list.
                    for ch in ['/', ')', '[', ']', '"', "'"]:
                        if ch in vk_read:
                            vk_read = vk_read.replace(ch, "")

        def handle_data(self, data):
            global vk_name
            vk_name = str(data)
            assert isinstance(data, object)
            for line in vk_name:
                # Two words separated by whitespace are treated as a name.
                if re.match(r'\S+\s+\S+$', vk_name):
                    print("@{0} - {1}".format(vk_read, vk_name))
                    break


    parser = MyHTMLParser()
    parser.feed(read_data)
Now I know more. First bug and first fix:
    UnicodeEncodeError: 'charmap' codec can't encode character '\u0406' in position 15: character maps to <undefined>
    with open('test.html', 'r', encoding='utf-8') as content_file:
        read_data = content_file.read()

    '''
    1. Worked around the charset error by replacing the offending characters
    '''
    from html.parser import HTMLParser
    import re


    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            vk_id = str(attrs)
            for line in vk_id:
                vk = re.findall(r'/\S+$', vk_id)
                vk_fnd = str(vk)
                if re.search(r"/\w+'\)\]", vk_fnd):
                    global vk_read
                    vk_read = vk_fnd
                    for ch in ['/', ')', '[', ']', '"', "'"]:
                        if ch in vk_read:
                            vk_read = vk_read.replace(ch, "")

        def handle_data(self, data):
            global vk_name
            vk_name = str(data)
            for line in vk_name:
                if re.match(r'\S+\s+\S+$', vk_name):
                    # The console codec cannot print these two letters, so swap them for "?".
                    for ch in ['\u0456', '\u0406']:
                        if ch in vk_name:
                            vk_name = vk_name.replace(ch, "?")
                    print("@{0} - {1}".format(vk_read, vk_name))
                    break


    parser = MyHTMLParser()
    parser.feed(read_data)
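The fix above replaces two specific letters by hand. A more general variant is to let the codec do the replacing, so any character the console cannot show becomes "?". This is only a sketch: the helper name console_safe is made up for this example, and it assumes the error comes from the console encoding rather than from reading the file.

```python
import sys


def console_safe(text, encoding=None):
    """Replace every character the target encoding cannot represent with '?'."""
    # Fall back to UTF-8 when stdout has no known encoding (e.g. when redirected).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# '\u0406' is the letter from the traceback above; the name itself is invented.
print(console_safe("\u0406van Petrenko"))
```

With this, the print call in handle_data could wrap vk_name in console_safe(...) instead of listing characters one by one.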
UPD2: I have found another solution for this code, just by rethinking its logic. The approach I use in vk = re.findall('/\S+$', vk_id) can be simplified. I use vk_id = str(attrs) to convert the list to a string and then search that string for something that matches the regex. BUT I should simply read the value straight from the list! Example:
    attrs = [('href', '/id168265578'), ('class', 'like_img_cont')]
    # attrs is a list of (name, value) tuples, so the href value is attrs[0][1].
    print("attrs[0][1]:", attrs[0][1])
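Following that idea, the whole parser could read the href from attrs directly, without str(attrs), regex cleanup or globals. This is a rough sketch, not the final version: the class name ShareParser and the sample markup are invented for the example, and it assumes each person appears as a link whose text is the name.

```python
from html.parser import HTMLParser


class ShareParser(HTMLParser):
    """Collect (vk_id, name) pairs from links like <a href="/id1">First Last</a>."""

    def __init__(self):
        super().__init__()
        self.current_id = None  # id taken from the <a> tag we are inside, if any
        self.people = []        # collected (vk_id, name) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples, so dict() gives direct lookup.
        href = dict(attrs).get("href")
        if href and href.startswith("/"):
            self.current_id = href.lstrip("/")

    def handle_data(self, data):
        name = data.strip()
        # Similar heuristic to the original regex: two words look like a name.
        if self.current_id and len(name.split()) == 2:
            self.people.append((self.current_id, name))
            self.current_id = None

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_id = None


# Hypothetical sample markup; the real page layout may differ.
sample = (
    '<a href="/id168265578" class="like_img_cont"></a>'
    '<a href="/id168265578" class="name">Ivan Petrov</a>'
)
parser = ShareParser()
parser.feed(sample)
for vk_id, name in parser.people:
    print("@{0} - {1}".format(vk_id, name))
```

Keeping the state on the parser object instead of in global variables also means an id from one link cannot leak into the name of a later one.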
I will update this section later.