This is a project spotlight with Konstantin Slisenko, a programmer and machine learning enthusiast.
Could you please introduce yourself?
My name is Konstantin Slisenko, and I’m from Belarus. I graduated from the Belarusian State University of Informatics and Radioelectronics and am currently studying for a master’s degree.
I’m a Java developer at JazzTeam. I like to learn new technologies, and I’m currently interested in big data and machine learning. I like to take part in conferences and meet interesting new people. I also like to travel and ride a bike.
What is your project called and what does it do?
My project clusters data from the stackoverflow.com website.
The goal is to group Stack Overflow questions and answers. Once they are grouped, you can see an overall picture of the Stack Overflow data, including the relationships between questions. This can help if you want to do market research or write an article (or even a book) about a specific problem.
I have ideas for improvements, such as marking “hot” topics and taking user ratings into account, to add more data to the overall picture. I’m also thinking about training a classifier. This could help when we get updated data and want to add the update to the system.
How did you get started?
First of all, I became interested in Apache Hadoop. After writing some Hadoop programs, I started to study its infrastructure and learned about Apache Mahout.
I started to dig into it and work through some examples: prepare data, run an algorithm, inspect the output. One day I found materials about Stack Overflow clustering by Frank Scholten. You can watch an interesting presentation of his. This topic was also mentioned in Mahout in Action.
I now use Frank’s code as a base and apply my own improvements and tuning. The data processing includes the following steps:
- The Stack Exchange source data is in XML format. Hadoop jobs are used to extract the text.
- Then I process the text with a custom Lucene analyzer: remove stop words, apply the Porter stemmer, etc.
- Then I vectorize the text using Mahout’s TF-IDF utilities.
- For clustering I currently use the K-Means algorithm from Mahout, but I want to try other algorithms in the future.
- After this I store the results in the Neo4j graph database and use HTML and JavaScript to visualize them.
All visualizations are available here: Stackexchange clustering using Mahout.
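The preparation and clustering steps listed above can be illustrated with a minimal sketch. This is not the project's code: it is plain Python standing in for Lucene and Mahout, the stop-word list and the four sample questions are made up for illustration, stemming is left out, the TF-IDF weighting is just one common smoothed variant, and the centroids are naively initialized from the first K documents.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real run needs a larger, carefully tuned one.
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "how", "do", "i", "with", "of"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (one common variant), so no weight is ever zero.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in a.keys() | b.keys())

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for k, x in v.items():
            total[k] += x
    return {k: x / len(vectors) for k, x in total.items()}

def kmeans(vectors, k, iterations=10):
    """Plain K-Means; returns the cluster index assigned to each vector."""
    centroids = [dict(v) for v in vectors[:k]]  # naive init: first k documents
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment

# Made-up sample questions, two about Java files and two about CSS centering.
docs = [
    "How do I read a file in Java?",
    "CSS center a div",
    "Java file reading with streams",
    "How to center text in CSS",
]
assignment = kmeans(tfidf_vectors(docs), k=2)
print(assignment)
```

On this toy input the two Java questions end up in one cluster and the two CSS questions in the other. With real data the result depends heavily on the stop-word list and on how the centroids are initialized.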
What are some interesting discoveries you made?
The clustering quality depends on how you perform data preparation. During this step you must pay close attention to which stop words you remove.
The K-Means clustering algorithm requires you to set the number of clusters K in advance. I want to determine K dynamically. For this reason I plan to find another algorithm.
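One possible way to estimate K from the data is canopy clustering, which Mahout also provides and which is often used to seed K-Means. It is shown here only as an illustration, not as the approach chosen for this project. The sketch below is plain Python on made-up 2D points; the thresholds `t1` and `t2` are hypothetical values that would need tuning for real TF-IDF vectors.

```python
import math

def canopies(points, t1, t2):
    """Single-pass canopy clustering.

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (points this close to a center cannot start a new canopy); t1 > t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point becomes a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed another canopy
        remaining = still_remaining
        result.append((center, members))
    return result

# Made-up points forming three visually separate groups.
points = [(0, 0), (0.5, 0), (0, 0.4), (10, 10), (10.3, 9.8), (20, 0)]
found = canopies(points, t1=3.0, t2=1.5)
estimated_k = len(found)
print(estimated_k)
```

The number of canopies can then serve as K, and the canopy centers as initial centroids for a K-Means run. The estimate is only as good as the thresholds, so this moves the tuning problem from K to `t1` and `t2` rather than removing it.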
What do you want to do next on the project?
- Use the publication date of posts to determine which topics are “hot” now.
- Try some other clustering algorithms and calculate the number of clusters dynamically.
- Build a classifier based on the clustered data.
- Add more kinds of visualizations.
- Apply cluster evaluation to tell which clusters are “good” and which are “bad”.
- Add indexed search over the clustered data.
- I’m thinking of contributing to Apache Mahout by providing a utility for visualizing clustered data.
Learn More
- Project: Stackexchange clustering using Mahout
- Project source code on GitHub
- Konstantin on Google+ where he shares interesting links to machine learning and big data resources
- Konstantin’s Blog
Thanks Konstantin.
Do you have a machine learning side project?
If you have an interesting machine learning side project and are interested in being profiled