Artificial Intelligence, Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog. Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and revolutionizing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Wednesday, 2 August 2023

Project Spotlight: Stack Exchange Clustering using Mahout with Konstantin Slisenko

This is a project spotlight with Konstantin Slisenko, a programmer and machine learning enthusiast.

Could you please introduce yourself?

My name is Konstantin Slisenko, and I’m from Belarus. I graduated from the Belarusian State University of Informatics and Radioelectronics. I am currently taking a master’s course.

Konstantin Slisenko

I’m a Java developer and work at JazzTeam. I like to learn new technologies. I’m currently interested in big data and machine learning. I like to participate in conferences and meet new interesting people. I also like to travel and ride a bike.

What is your project called and what does it do?

My project clusters data from the stackoverflow.com website.

The goal is to group Stack Overflow questions and answers. Once they are grouped, you can see an overall picture of the Stack Overflow data, including the relationships between questions. This can help if you want to do market research or write an article (or even a book) about a specific problem.

Stackexchange clustering using Mahout Tags

I have ideas for improvements, such as marking “hot” topics and taking user ratings into consideration, to add more data to the overall picture. I’m also thinking about training a classifier. This could help when we get updated data and want to put the update into the system.

How did you get started?

First of all, I became interested in Apache Hadoop. After I wrote some Hadoop programs, I started to study its infrastructure and learned about Apache Mahout.

I started to dig into it and work through some examples: prepare data, run an algorithm, look at the output. One day I found materials about Stack Overflow clustering by Frank Scholten. You can watch an interesting presentation of his. This topic was also mentioned in Mahout in Action.

I now use Frank’s code as a base and apply my own improvements and tuning. The data processing includes the following steps:

The Stack Exchange source data is in XML format, so Hadoop jobs are used to extract the text.
Then I process the text with a custom Lucene analyzer: remove stop words, apply the Porter stemmer, etc.
Then I vectorize the text using Mahout’s TF-IDF utilities.
For clustering I currently use the K-Means algorithm from Mahout, but I want to try other algorithms in the future.
After this I store the results in the graph database Neo4j and use HTML and JavaScript to visualize them.
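The text-processing and vectorization steps above can be illustrated with a small, self-contained Python sketch. It is not the project's code (which is Java on Lucene and Mahout): the stop-word list is made up, `crude_stem` is a very rough stand-in for a real Porter stemmer, and the weighting is the textbook tf × log(N/df) form, which may differ from the exact formula Mahout applies.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the project's real list is not given in the interview.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "i", "do", "of", "with"}


def crude_stem(token):
    """Strip a few common suffixes. A rough stand-in for the Porter stemmer."""
    for suffix in ("ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one sparse vector ({term: weight}) per document."""
    tokenized = [analyze(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (count / len(tokens)) * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return vectors


questions = [
    "How to parse XML in Java",
    "Parsing XML files with Java",
    "Sorting a list in Python",
]
for question, vector in zip(questions, tfidf(questions)):
    print(question, "->", vector)
```

Terms that occur in every document get a weight of zero, which is why TF-IDF pushes the vectors toward the words that distinguish one question from another.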

All visualizations are available here: Stackexchange clustering using Mahout.

What are some interesting discoveries you made?

The clustering quality depends on how you perform data preparation. During this step you must pay a lot of attention to which stop words you remove.
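To see why stop words matter, here is a minimal Python sketch that compares two unrelated questions by cosine similarity over raw word counts, with and without stop-word removal. The two questions and the stop-word list are invented for illustration and are not taken from the project.

```python
import math
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"how", "do", "i", "a", "in", "the", "to"}


def bag(text, remove_stop_words):
    """Turn a question into a bag of words (term -> count)."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


q1 = "how do i sort a list in python"
q2 = "how do i parse a file in java"

with_stops = cosine(bag(q1, False), bag(q2, False))
without_stops = cosine(bag(q1, True), bag(q2, True))
print(round(with_stops, 3), round(without_stops, 3))  # 0.625 0.0
```

With the stop words left in, the two questions look fairly similar only because they share filler words, so a distance-based algorithm could place them in the same cluster. Once the filler is removed, they share nothing.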

Stack Exchange Clustering using Mahout by Konstantin Slisenko

The K-Means clustering algorithm requires you to set the number of clusters K in advance. I want to calculate K dynamically. For this reason I plan to find another algorithm.
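The role of K can be shown with a toy, in-memory version of K-Means (Lloyd's algorithm) in Python. This is only a sketch under simplifying assumptions: it works on small 2D points rather than sparse TF-IDF vectors, it seeds the centroids with evenly spaced input points so the run is reproducible, and it has nothing to do with how Mahout distributes the work over Hadoop.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    """Toy Lloyd's algorithm. K must be chosen by the caller."""
    # Naive deterministic seeding (an assumption made for reproducibility).
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assign(points, centroids)


def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))


# Two obvious groups of points, so K=2 is the natural choice here.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    centroids, labels = kmeans(points, k)
    print(k, labels, round(inertia(points, centroids, labels), 3))
```

On this data the within-cluster error keeps shrinking as K grows, so minimizing it cannot pick K by itself. That is one reason people look at heuristics such as the "elbow" in this curve, or at algorithms that do not need K up front.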

What do you want to do next on the project?

Use the publication date of posts to determine which topics are “hot” now.
Try some other clustering algorithms and also calculate the number of clusters dynamically.
Build a classifier based on the clustered data.
Apply more varied visualizations.
Apply cluster evaluation to say which clusters are “good” and which are “bad”.
Apply indexed search to the clustered data.
I’m thinking of contributing to Apache Mahout: providing a utility for visualizing clustered data.

Learn More

Project: Stackexchange clustering using Mahout
Project source code on GitHub
Konstantin on Google+ where he shares interesting links to machine learning and big data resources
Konstantin’s Blog

Thanks, Konstantin.

Do you have a machine learning side project?

If you have an interesting machine learning side project and are interested in being profiled
