
Sinhala Text Preprocessing

සිංLingua is a Python library for Sinhala text analysis. Its preprocessing component provides stemming, stop word removal, and tokenization, which clean and normalize Sinhala text before it is passed to other natural language processing tasks.

1. Sinhala text Stemming

The SinhalaStemmer class, part of the සිංLingua library, stems Sinhala text in four steps. Each step handles a different kind of word ending found in the Sinhala language:

Stem Dictionary Lookup (Step One)
Suffix Removal (Step Two)
Inner Suffix Handling (Step Three)
Dependent Vowel Suffix Removal (Step Four)

from sinlingua.preprocessor.stemmer import SinhalaStemmer

# Create an object of the SinhalaStemmer class
stemmer_obj = SinhalaStemmer()

text = '...'  # your sentence

# Apply text stemming
output = stemmer_obj.stemmer(text)

print(output)
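
To show how the four steps listed above might fit together, here is a minimal, self-contained sketch. It does not use the library and is not its implementation: the dictionary entry, the suffix lists, the minimum stem length, and the reading of "inner suffix" as a suffix exposed after the outer one is stripped are all assumptions made for illustration.

```python
# Toy illustration of a four-step stemming pipeline. All data below is
# hypothetical and far smaller than anything a real stemmer would use.

# Step one data: irregular forms mapped directly to a stem (assumed entry).
STEM_DICTIONARY = {"ගෙවල්": "ගෙය"}
# Step two data: suffixes stripped from the end of a word (assumed list).
SUFFIXES = ("ගේ", "ට")
# Step three data: suffixes exposed once the outer suffix is gone (assumed list).
INNER_SUFFIXES = ("වල",)
# Step four data: Sinhala dependent vowel signs, U+0DCF-U+0DDF and U+0DF2-U+0DF3.
DEPENDENT_VOWELS = {chr(c) for c in range(0x0DCF, 0x0DE0)} | {"\u0DF2", "\u0DF3"}
# Never shorten a word below this many code points (assumed safeguard).
MIN_STEM_LENGTH = 2


def strip_first_match(word, suffixes):
    """Remove the first listed suffix that matches, if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem_word(word):
    # Step one: dictionary lookup short-circuits the remaining steps.
    if word in STEM_DICTIONARY:
        return STEM_DICTIONARY[word]
    # Step two: strip an outer suffix.
    word = strip_first_match(word, SUFFIXES)
    # Step three: strip an inner suffix left behind by step two.
    word = strip_first_match(word, INNER_SUFFIXES)
    # Step four: drop one trailing dependent vowel sign.
    if len(word) > MIN_STEM_LENGTH and word[-1] in DEPENDENT_VOWELS:
        word = word[:-1]
    return word


def stem_text(text):
    """Stem every whitespace-separated word in a sentence."""
    return " ".join(stem_word(word) for word in text.split())


print(stem_word("ගස්වලට"))  # ගස්
```

The order matters in this sketch: the dictionary lookup runs first so that irregular forms are never passed through the rule-based steps.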

2. Sinhala stopword handling

Words such as “සහ” (and), “සමග” (with), and “ලෙස” (like) belong to a predefined list of stop words. Stop word removal drops these words, which add little semantic value, so that later processing can focus on the meaningful content.

from sinlingua.preprocessor.stopword_remover import StopWordRemover

text = '...'  # your sentence

# Create an object of the StopWordRemover class
stopword_remover = StopWordRemover()

# Apply the stop word remover
remaining_words = stopword_remover.remove_stop_words(text)

print(remaining_words)
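
The idea behind stop word removal can be sketched without the library. The example below is only an illustration: the stop word set contains just the three words mentioned above, whereas the library's predefined list is larger, and splitting on whitespace and returning a list are assumptions rather than the documented behaviour of StopWordRemover.

```python
# Minimal stop word filter. STOP_WORDS holds only the three examples from the
# text above; it is not the library's actual list.
STOP_WORDS = {"සහ", "සමග", "ලෙස"}


def remove_stop_words(text):
    """Return the whitespace-separated words of `text` that are not stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]


print(remove_stop_words("අම්මා සහ තාත්තා"))  # ['අම්මා', 'තාත්තා']
```

A set is used for the stop words so that each membership check takes constant time regardless of how long the list grows.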

3. Sinhala text Tokenization

Tokenization is a foundational step in Sinhala text preprocessing. It breaks a continuous stream of text into individual units known as tokens. These tokens typically correspond to words or subwords, and they serve as the building blocks for subsequent linguistic analysis and processing.

from sinlingua.preprocessor.tokenizer import SinhalaTokenizer

text = '...'  # your sentence

# Create an object of the SinhalaTokenizer class
tokenizer = SinhalaTokenizer()

# Split the text into tokens
tokens = tokenizer.tokenize(text)

print(tokens)
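
As a rough picture of what word-level tokenization involves for Sinhala, the sketch below uses a regular expression over the Sinhala Unicode block. It is an assumption-laden illustration, not the tokenizer the library ships: the pattern, the handling of the zero-width joiner, and the decision to emit punctuation as separate tokens are choices made here for the example.

```python
import re

# Sinhala letters and signs occupy U+0D80-U+0DF3. They are matched explicitly
# because vowel signs are combining marks, which \w does not match, so \w+ alone
# would split a Sinhala word at every vowel sign. U+200D (zero-width joiner) is
# included because it can appear inside conjunct letters.
TOKEN_PATTERN = re.compile(r"[\u0D80-\u0DF3\u200D]+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into Sinhala words, other word-like runs, and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("අම්මා සහ තාත්තා."))  # ['අම්මා', 'සහ', 'තාත්තා', '.']
```

The alternatives are tried in order, so a run of Sinhala characters is kept whole before the generic word and punctuation patterns are considered.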

Getting Started

To use the Sinhala text preprocessing component of සිංLingua, follow these steps:

Install the සිංLingua library:
```
pip install sinlingua
```

Import the required classes for the chosen preprocessing approach:

   
from sinlingua.preprocessor.tokenizer import SinhalaTokenizer
from sinlingua.preprocessor.stopword_remover import StopWordRemover
from sinlingua.preprocessor.stemmer import SinhalaStemmer
   

Initialize the preprocessor class based on your chosen approach:

   
stemmer_obj = SinhalaStemmer()  # For stemming
stopword_remover = StopWordRemover()  # For removing stop words
tokenizer = SinhalaTokenizer()  # For splitting a paragraph into tokens
   

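Call the methods of these classes as needed: stemmer_obj.stemmer(text) for stemming, stopword_remover.remove_stop_words(text) for stop word removal, and tokenizer.tokenize(text) for tokenization.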
