
Wednesday, 2 August 2023

Project Spotlight: Stack Exchange Clustering using Mahout with Konstantin Slisenko

This is a project spotlight with Konstantin Slisenko, a programmer and machine learning enthusiast.

Could you please introduce yourself?

My name is Konstantin Slisenko and I’m from Belarus. I graduated from the Belarusian State University of Informatics and Radioelectronics and am currently studying for a master’s degree.

Konstantin Slisenko


I’m a Java developer at JazzTeam. I like learning new technologies, and I’m currently interested in big data and machine learning. I enjoy taking part in conferences and meeting interesting new people. I also like to travel and ride a bike.

What is your project called and what does it do?

My project clusters data from the stackoverflow.com website.

The goal is to group Stack Overflow questions and answers. Once they are grouped, you can see an overall picture of the Stack Overflow data, including the relationships between questions. This can help if you want to do market research or write an article (or even a book) about a specific problem.

Stackexchange clustering using Mahout Tags


I have ideas for improvements that would add more data to this overall picture, such as marking “hot” topics and taking user ratings into account. I’m also thinking about training a classifier. This could help when we get updated data and want to add it to the system.

How did you get started?

First of all, I became interested in Apache Hadoop. After I wrote some Hadoop programs, I started to study its infrastructure and learned about Apache Mahout.

I started to dig into it and work through some examples: prepare the data, run an algorithm, look at the output. One day I found materials about Stack Overflow clustering by Frank Scholten. You can watch an interesting presentation of his. This topic was also mentioned in Mahout in Action.

I now use Frank’s code as a base and apply my own improvements and tuning. The data processing includes the following steps:

  1. The Stack Exchange source data is in XML format. Hadoop jobs are used to extract the text.
  2. Then I process the text using a custom Lucene analyzer: remove stop words, apply the Porter stemmer, etc.
  3. Then I vectorize the text using Mahout’s TF-IDF utilities.
  4. For clustering I currently use the K-Means algorithm from Mahout, but I want to try other algorithms in the future.
  5. After this I store the results in the graph database Neo4j and use HTML and JavaScript to visualize them.
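To make steps 2 and 3 more concrete, here is a minimal sketch in plain Python. It is not Konstantin's code and it does not use Lucene or Mahout: the stop-word list and the sample questions are made up for illustration, stemming is left out, and the weighting is the textbook tf × log(N / df) formula, which may differ from the exact weighting Mahout applies.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "how", "to", "in", "is", "do", "i", "of", "with"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up sample questions, not taken from the real data set.
questions = [
    "How to parse XML in Java",
    "How to parse JSON in Java",
    "How do I sort a list in Python",
]
vectors = tf_idf(questions)
```

In this toy corpus "xml" appears in only one question, so it gets a higher weight in the first vector than "java" or "parse", which appear in two. Sparse vectors like these are what a clustering algorithm then compares.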

All visualizations are available here: Stackexchange clustering using Mahout.

What are some interesting discoveries you made?

The clustering quality depends on how you prepare the data. During this step you must pay close attention to which stop words you remove.
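A small illustration of why the stop-word choice matters, written as a plain Python sketch rather than anything from the project itself. The two questions and the stop-word list are invented for the example, and similarity is measured with cosine similarity over raw term counts.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"how", "do", "i", "a", "in", "to", "the"}


def term_counts(text, stop_words=frozenset()):
    """Count the words in text, skipping any that are in stop_words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


q1 = "How do I sort a list in Python"
q2 = "How do I open a socket in Java"

with_stop = cosine(term_counts(q1), term_counts(q2))
without_stop = cosine(term_counts(q1, STOP_WORDS), term_counts(q2, STOP_WORDS))

print(with_stop)     # 0.625
print(without_stop)  # 0.0
```

With the filler words left in, two unrelated questions share five of their eight words and look fairly similar. Once those words are removed, nothing is shared and the similarity drops to zero, so a clustering algorithm is far less likely to group them together.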

Stack Exchange Clustering using Mahout by Konstantin Slisenko


The K-Means clustering algorithm requires you to set the number of clusters, K, in advance. I want K to be calculated dynamically, so I plan to look for another algorithm.
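The following sketch shows the limitation on a toy example. It is a plain Python version of Lloyd's algorithm with Euclidean distance, not Mahout's implementation. The points are made up, and the centroids are seeded from evenly spaced input points purely to keep the example reproducible (real implementations usually seed randomly).

```python
import math


def k_means(points, k, iterations=10):
    """Plain Lloyd's algorithm; the caller has to choose k up front."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Deterministic seeding for the example: evenly spaced input points.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignments = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]

    return assignments, centroids


# Two visually obvious groups of made-up 2D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

labels_2, _ = k_means(points, k=2)
labels_3, _ = k_means(points, k=3)

print(labels_2)  # [0, 0, 0, 1, 1, 1]
print(labels_3)
```

With k=2 the algorithm recovers the two groups. With k=3 it still returns three clusters here, splitting one of the natural groups, because nothing in the algorithm questions the K it was given. That is the motivation for looking at methods that estimate the number of clusters from the data.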

What do you want to do next on the project?

  • Use the publication date of posts to determine which topics are “hot” right now.
  • Try other clustering algorithms and calculate the number of clusters dynamically.
  • Build a classifier based on the clustered data.
  • Add more kinds of visualization.
  • Apply cluster evaluation to say which clusters are “good” and which are “bad”.
  • Add indexed search over the clustered data.
  • I’m thinking of contributing to Apache Mahout by providing a utility for visualizing clustered data.


Thanks Konstantin.

Do you have a machine learning side project?

If you have an interesting machine learning side project and are interested in being profiled
