සිංLingua is an advanced Python library designed to empower Sinhala text analysis by addressing the unique challenges of the language. With a focus on accurate and efficient processing, සිංLingua integrates essential techniques like stemming, stop word handling, and tokenization. Its specialized features enable effective text data cleaning and transformation, crucial for various natural language processing tasks.
The SinhalaStemmer class, a core component of the සිංLingua library, improves the accuracy and efficiency of Sinhala text preprocessing through a four-step stemming process. Each step addresses a different aspect of stemming in the Sinhala language.
from sinlingua.preprocessor.stemmer import SinhalaStemmer
# Create a SinhalaStemmer object
stemmer_obj = SinhalaStemmer()
text = '...'  # your sentence
# Apply stemming to the text
output = stemmer_obj.stemmer(text)
print(output)
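As a rough illustration of what a single stemming step can look like, the sketch below strips at most one suffix from the end of a word in plain Python. It is not the four-step process implemented by SinhalaStemmer. The suffix list, the `strip_suffix` helper, and the minimum stem length are all hypothetical choices made for this example.

```python
import unicodedata

# Hypothetical, deliberately tiny suffix list (not the library's rule set).
# Sorted longest first so that longer suffixes are tried before shorter ones.
SUFFIXES = sorted(
    (unicodedata.normalize("NFC", s) for s in ("ගෙන්", "ගේ", "ට")),
    key=len,
    reverse=True,
)

MIN_STEM_LENGTH = 2  # assumed guard so very short words are left untouched


def strip_suffix(word: str) -> str:
    """Remove at most one known suffix from the end of a word."""
    # Normalize so that composed and decomposed vowel signs compare equal.
    word = unicodedata.normalize("NFC", word)
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


print(strip_suffix("අම්මාගේ"))
print(strip_suffix("මට"))
```

Both the suffixes and the input word are normalized to NFC before comparison, because some Sinhala vowel signs can be encoded either as one code point or as a sequence of two.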
Words such as “සහ” (and), “සමග” (with), and “ලෙස” (like) are part of a predefined list of stop words. Stop word removal eliminates these words, which add minimal semantic value, so that later processing can focus on the meaningful content.
from sinlingua.preprocessor.stopword_remover import StopWordRemover
text = '...'  # your sentence
# Create a StopWordRemover object
stopword_remover = StopWordRemover()
# Remove stop words from the text
remaining_words = stopword_remover.remove_stop_words(text)
print(remaining_words)
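The idea behind stop word removal can be sketched without the library as a simple set lookup. The example below is only an illustration: `STOP_WORDS` contains just the three words named above rather than the library's predefined list, and `filter_stop_words` is a hypothetical helper, not part of the StopWordRemover API.

```python
# Illustrative stop word set with only the three words named in the text;
# the library's predefined list is not reproduced here.
STOP_WORDS = {"සහ", "සමග", "ලෙස"}


def filter_stop_words(tokens: list) -> list:
    """Return the tokens that are not in STOP_WORDS, preserving their order."""
    return [token for token in tokens if token not in STOP_WORDS]


# Split on whitespace, then drop the stop words.
tokens = "අම්මා සහ තාත්තා".split()
print(filter_stop_words(tokens))
```

A set is used for the stop words so that each membership check takes constant time on average, which matters when the list is large.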
Tokenization is a foundational process in Sinhala text preprocessing that involves breaking down a continuous stream of text into individual units known as tokens. These tokens typically correspond to words or subwords, and they serve as the building blocks for subsequent linguistic analysis and processing.
from sinlingua.preprocessor.tokenizer import SinhalaTokenizer
text = '...'  # your sentence
# Create a SinhalaTokenizer object
tokenizer = SinhalaTokenizer()
# Split the text into tokens
tokens = tokenizer.tokenize(text)
print(tokens)
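For intuition, word-level tokenization can be approximated with a regular expression over the Sinhala Unicode block. This is a minimal sketch under stated assumptions, not the algorithm used by SinhalaTokenizer: the `simple_tokenize` helper and the choice to discard punctuation and non-Sinhala characters are hypothetical.

```python
import re

# Assumption: a token is a maximal run of characters from the Sinhala Unicode
# block (U+0D80 to U+0DFF), plus ZERO WIDTH JOINER (U+200D), which can appear
# inside Sinhala conjunct letters.
SINHALA_TOKEN = re.compile(r"[\u0D80-\u0DFF\u200D]+")


def simple_tokenize(text: str) -> list:
    """Split text into Sinhala word tokens, dropping spaces and punctuation."""
    return SINHALA_TOKEN.findall(text)


print(simple_tokenize("මම පාසල් යනවා."))
```

An explicit character range is used so that vowel signs and other combining marks stay attached to the letters they modify.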
To use the Sinhala text preprocessing component of සිංLingua, follow these steps:
pip install sinlingua
from sinlingua.preprocessor.tokenizer import SinhalaTokenizer
from sinlingua.preprocessor.stopword_remover import StopWordRemover
from sinlingua.preprocessor.stemmer import SinhalaStemmer
Initialize the preprocessor class based on your chosen approach:
stemmer_obj = SinhalaStemmer() # For Stemming
stopword_remover = StopWordRemover() # For removing stopwords
tokenizer = SinhalaTokenizer() # For tokenizing a given paragraph into tokens