The task was to parse people's ids from a web page that stores the ids of reposters.
Main problems were:
- The page code loads dynamically, so there is no simple way to get ids from it. The best solution was to save the section that holds the ids to an .html file.
- I wanted to catch id + nickname, but a list of pairs turned out to be a bad choice once random had to pick from it.
- I couldn't build a list that stores all found ids: it was wiped on every iteration.
- Some nicknames contained unsupported characters, and they broke the iteration.
- I got a lot of junk while scanning the .html, so I used regex to filter it out.
- I couldn't add different ids to the list without adding the same id over and over. And guess what? Yes, it broke the iteration.
What I’ve learned:
Below is a long list of different topics, indexed for further search.
Here is my 'most last last try' where all topics are present:
So you can see what I'm trying to do and how. At the end of this post you'll see the last version that worked.
Let's dive into the topics one by one:
How to open a file in Python:
This construction opens a file for reading. Without an explicit encoding it can produce encoding errors, so I've added encoding='utf-8' to protect against them. Normally you should close the file, but I didn't do it in my task because the script is short and finishes as soon as it finds all ids.
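A minimal sketch of opening a file with an explicit encoding. The file name `reposts.html` and its content are hypothetical, and the sample file is written first only so the snippet runs on its own. The `with` form is one way to get the file closed automatically.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is runnable.
path = os.path.join(tempfile.mkdtemp(), "reposts.html")
with open(path, "w", encoding="utf-8") as f:
    # The nickname contains non-ASCII (Cyrillic) letters on purpose.
    f.write('<a href="/dimka_keystin">\u0414\u0438\u043c\u043a\u0430</a>\n')

# Open for reading; the explicit encoding avoids platform-dependent decode errors.
with open(path, "r", encoding="utf-8") as f:
    html_text = f.read()

# The file is already closed here, no explicit close() needed.
print(len(html_text))
```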
How to make global variables in Python
Just add 'global' followed by the variable name inside the function body, before you assign any value to it.
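A small sketch of how `global` behaves, assuming the variable is assigned inside a function. The function name `reset_ids` and the value `"id1"` are made up for illustration.

```python
vk_ids = []  # module-level (global) name


def reset_ids():
    # Without this declaration, the assignment below would create
    # a new local variable and leave the module-level list untouched.
    global vk_ids
    vk_ids = ["id1"]


reset_ids()
print(vk_ids)
```

Note that `global` is only needed when the function assigns to the name. Calling `vk_ids.append(...)` inside a function works without it.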
How to parse html in Python
For Python 3.4 you can use the HTMLParser class from the html.parser module. It can parse almost all tags from the raw html, so all that is left is to sort through them. I sorted them using lists and regex.
Don't forget to read all the docs. For example, I struggled a lot because I missed this line from the docs: 'The attrs argument is a list of (name, value) pairs'.
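To make the '(name, value) pairs' point concrete, here is a minimal sketch of an `HTMLParser` subclass. The class name `LinkCollector` and the sample markup are assumptions, not the code from my script.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):  # hypothetical class name
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs,
        # e.g. [("href", "/dimka_keystin")]
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


parser = LinkCollector()
parser.feed('<div><a href="/dimka_keystin">Dimka</a>'
            '<a href="/id12345">Anna</a></div>')
print(parser.hrefs)
```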
How to sort out (filter) a variable with regex in Python
- re.findall helps me find all values that match a given pattern in numbers or raw strings with a tag. It does not sort them, it just finds them, for example with ('/\S+$', href), and keeps them for further processing.
- re.search scans a string for the first place where the pattern matches, so I ran it on each item from the previous result and then acted on each hit.
- 'if re.match' also helps me keep only the results that match a given pattern; it checks only the start of the string. 'def handle_data(self, data):' receives a lot of empty strings, so I filtered them out and also removed all non-unicode symbols, as in the example above.
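The three calls above can be sketched side by side. The sample strings are assumptions (only '/dimka_keystin' appears elsewhere in this post), and the patterns other than '/\S+$' are chosen just for illustration.

```python
import re

href = "/dimka_keystin"  # sample href value

# findall returns a list of every non-overlapping match of the pattern.
found = re.findall(r"/\S+$", href)

# search scans the whole string and returns the first match (or None).
digits = re.search(r"\d+", "/id12345")

# match succeeds only if the pattern matches at the start of the string,
# which makes it handy for skipping empty or whitespace-only data.
chunks = ["\n  ", "Dimka", "", "Anna"]
names = [c for c in chunks if re.match(r"\w+", c)]

print(found)
print(digits.group())
print(names)
```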
Construction 'for' in Python
It lets you loop until 'something' is found in 'something2', or perform an 'action' for each 'line, string, list' item of a given variable until it ends.
Construction 'if' in Python
It lets you perform an action if something is true, if something is found by a pattern, or if something is not in a list.
'elif' is just another variant of 'if': it is checked only when the preceding 'if' was not true, and it can use a different pattern.
Construction 'else' in Python
It does the same job as above, but for the case when something is not true, was not found, or is not present in a list.
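A short sketch that puts 'for', 'if', 'elif' and 'else' together. The sample `links` list and the `duplicates` name are made up; `vk_ids` is the list name used later in this post.

```python
vk_ids = []
duplicates = []
links = ["/dimka_keystin", "", "/id12345", "/dimka_keystin"]  # sample data

for link in links:            # runs the body once per item until the list ends
    if not link:              # true for the empty string
        continue              # skip junk and go to the next item
    elif link in vk_ids:      # checked only when the 'if' above was false
        duplicates.append(link)
    else:                     # runs when no branch above matched
        vk_ids.append(link)

print(vk_ids)
print(duplicates)
```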
How to replace a character in Python
How to remove unprintable symbols in Python
Replace each character in '["/dimka_keystin')]"]' that matches any of these: ['/', ')', '[', ']', '"', "'"]:
Remove some non-unicode characters from a list of names:
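Both clean-up steps can be sketched as follows. The first part uses the string and character list given above; the second part uses `str.isprintable()`, which is one possible way to drop invisible characters and not necessarily what my script did. The sample `names` are hypothetical.

```python
raw = "[\"/dimka_keystin')]\"]"  # the string from the text above

# Strip every unwanted character, one str.replace call per character.
for ch in ['/', ')', '[', ']', '"', "'"]:
    raw = raw.replace(ch, "")

# Hypothetical nicknames containing an invisible and a control character.
names = ["Dimka\u200b", "An\x00na"]

# Keep only the characters Python considers printable.
clean_names = ["".join(c for c in name if c.isprintable()) for name in names]

print(raw)
print(clean_names)
```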
How to add something to a list in Python
I found different ways while working on this; the best one for my example is the first:
list.append(x): adds a value to the end of the list, so it can collect all found values, which is what I need in this task.
list.insert(i, x): adds a value at any given position in the list. It does not erase what was stored there (the following items shift to the right), but it works slower than append.
list.extend(L): adds all items of another list L to the list. It merges many lists into one, which is not useful for my example, because Python's random could then pick something I do not need.
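The difference between the three methods is easiest to see on a tiny list. The id values here are placeholders.

```python
ids = ["id1", "id3"]

ids.append("id4")            # add one value at the end
ids.insert(1, "id2")         # put a value at index 1, later items shift right
ids.extend(["id5", "id6"])   # add every item of another list

# Appending a list (instead of extending) nests it as a single item,
# so random.choice could later return the inner list instead of an id.
nested = ["id1"]
nested.append(["id2", "id3"])

print(ids)
print(nested)
```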
How to export data to csv in Python
In my example I just declared a variable with the needed result. This variable stores the list of people's ids, and it can then be written to a file.
Here I take 'random_id' from the id list 'vk_ids', but I can also export any data from any variable: just change 'random_id' to 'vk_ids' in write.writerow([random_id]) and I will get the list of all found ids.
I added the brackets to pass it as a list; without them writerow would split the string into separate characters.
I did not close the file here either, because it is closed when the script finishes.
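A minimal sketch of the export step with the `csv` module. The names `write`, `random_id` and `vk_ids` follow the text above; the file name `winner.csv`, the sample ids and the fixed `random_id` value are assumptions made so the snippet runs on its own.

```python
import csv
import os
import tempfile

vk_ids = ["dimka_keystin", "id12345", "id67890"]  # hypothetical ids
random_id = "id12345"  # stand-in for a randomly picked id

path = os.path.join(tempfile.mkdtemp(), "winner.csv")  # hypothetical file

# newline="" is what the csv docs recommend when opening files for csv.writer.
with open(path, "w", newline="", encoding="utf-8") as f:
    write = csv.writer(f)
    # The brackets make a one-column row; a bare string would be
    # written as one column per character.
    write.writerow([random_id])

# Read the file back to check what was written.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows)
```

To export every id, one option is `write.writerow(vk_ids)`, which writes all ids as the columns of a single row.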
How to get a random item in Python
As described above, just pass the variable that holds the list to 'random.choice()'.
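A short sketch of picking one id at random. The ids are placeholders.

```python
import random

vk_ids = ["dimka_keystin", "id12345", "id67890"]  # hypothetical ids

# random.choice returns one item of a non-empty sequence;
# on an empty list it raises IndexError.
random_id = random.choice(vk_ids)

print(random_id)
```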
Convert strings to a list in Python
In my situation I just needed to declare a variable as an empty list above the 'for' construction and then add the strings from each iteration to it.
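The placement of the empty list matters, which a small sketch can show. The sample `lines` and the name `wiped` are made up; the first loop reproduces the 'wiped every iteration' problem from the top of this post.

```python
lines = ["/dimka_keystin", "/id12345"]  # sample strings

# Problem: the list is created inside the loop, so it is reset every time
# and only the last string survives.
for line in lines:
    wiped = []
    wiped.append(line)

# Fix: create the empty list once, above the loop, then add to it.
vk_ids = []
for line in lines:
    vk_ids.append(line)

print(wiped)
print(vk_ids)
```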
That's all for now, folks, I need to go. Thanks for watching!
This is how I finished it: