Python HTMLParser and Vkontakte randomizer

Finally I’ve finish my first “program” on Python.py_vk

The task is to parse people’s id from web page where reposter’s id stores.

Main problems were:

  • web-page code is loading dynamically so there is no simple way to get ids from it, the best solution was – save section where id stores in .html file
  • I wanted to catch id + nickname but list of pairs was not a good decision when random works
  • I can’t create a list which stores all found ids, it wiped every iteration
  • I have some unsupported chars in nicknames and they’d broke iteration
  • I’ve get a lot of junk while scan .html so I used regex to avoid them
  • I can’t add various ids in list without adding one id to list recursively – and guess what? Yes, it’s broke the iteration

What I’ve learned:
Here will be a huge list of different things for indexing for further search.

List of topics

(let google parse it, so you can find this in future)

  • How to open file in Python
  • How to make global variables in Python
  • How to parse html in Python
  • How to sort variable with regex in Python
    • re.findall in Python
    • re.match in Python
  • Construction ‘for’ in python
    • How to make a replace for character in Python
  • Construction ‘if’ in Python
  • Construction ‘else’ in Python
  • What is string in Python
  • What is list in Python
  • How to add something to list in Python
    • list.append in Python
    • list.extend in Python
    • list.insert in Python
  • How to export data to csv in Python
  • How to get random in Python
  • How to print in Python
  • How to remove unprintable symbols in Python
  • Convert string to list in Python

I will show you my drafts, some of them, usually, can looks not clear and readable, but please do not blame me, I just start it from nothing, I didn’t read any guide like ‘Python for gentlemen’ so my code can looks rude.

Here is my ‘most last last try’ where all topics are present:

So you can see what I`m trying to do and how. At the end of this post you’ll see the last version worked.

Lets dive in topics one by one:

How to open file in Python:

This construction will open file for read, but usually it can produce encoding errors, so I’ve add ‘encoding=utf-8’ to protect from them. Usually you should close the file, but I didn’t use it in my task because it short and will finish job as soon as find all ids

How to make global variables in Python

Just add ‘global’ in body of script, before you give any necessary value to it.

How to parse html in Python

For Python 3.4 you’ll use HTMLParser library. It can parse almost all tags from the raw html and you can do nothing else but just sort them. I have sort it using lists and regex.

Do not forgot to read all the docs, for example, I’ve struggle a lot, because lost this ‘The attrs argument is a list of (name, value) pairs’ from doc.

How to sort variable with regex in Python

In my situation I have different way to sort it ‘re.findall(‘/\S+$’, href)’ + if‘/\w+\’\)\]’, id_raw): and ‘if re.match(‘\S+\s+\S+$’, vk_name):’

  • re.findall helps me to find all values from numbers or raw strings with tag, not sort them, just find it by given pattern ‘(‘/\S+$’, href)’ and keep it for further processing
  • helps to find each symbol from previous result and then make action on each of them
  • if re.match help me also to match only given pattern results. ‘def handle_data(self, data):’ has a lot of null strings, so I’ve sorted it and also remove all not unicode symbols like in above example
    • Before:
    • After:

Construction ‘for’ in python

Can help you to make loop till ‘something’ found in ‘something2’ or make ‘action’ for each ‘line, string, list’ from given variable until it ends.

Construction ‘if’ in Python

Can help you to make some action if something is true, if something is found by pattern, if something is not in list.

‘elif’ – is just another variant of ‘if’, IF this ‘if’ cannot be found and pattern can be different.

Construction ‘else’ in Python

Make the same job as above but if something is not true, was not found or not present in list.

How to make a replace for character in Python


How to remove unprintable symbols in Python

Simple example:

Replace each in ‘[“/dimka_keystin’)]”]’ where any of this [‘/’, ‘)’, ‘[‘, ‘]’, ‘”‘, “‘”] found:

Replace some not unicode chars from list of names:

How to add something to list in Python

Different way I’ve found when working on it, but the best solution for my example is:

list.append() this will add value to the end of list and it can collect all founded values as I need it in this task

list.insert(i, x) – can add value to the any needed location on list, but it can erase previous which stored there and also it work slowly.

list.extend(L) – helps me to add list in list in lists but it can produce a lot of lists in one, this is not useful for my example, because python random can show something that I do not need to.

How to export data to csv in Python

In my example I’ve just declare variable with needed result, this variable stores the list of people ids and then it can be write in file.

Here I get ‘random_id’ from id list ‘vk_ids’ but I can also export any data from any variable, just change ‘random_id’ to ‘vk_ids’ in write.writerow([random_id]) and I will get list of all founded ids.

I have add brackets [] to declare it as list.

I did not close the file again, because it will close after script finished work.

How to get random in Python

As described above, just use the variable with list and add ‘random.choice()

Convert string to list in Python

In my situation I just need to declare variable with(as) empty list above the for construction and then add to it all strings from each iteration.

That’s all for now, folks, I need go. Thanks for watching!

This is how I finished it:









About trianglesis

Александр Брюндтзвельт - гений, филантроп, 100 гривен в кармане. Этот блог - "сток" моих мыслей и заметок. Достаточно одного взгляда на него, чтобы понять, что такой же бардак творится у меня в голове. Если вам этот бардак интересен - милости прошу.
Bookmark the permalink.