Real Word Spelling Error Detection and Correction for Urdu Language

Artículo Materias > Ingeniería Universidad Europea del Atlántico > Investigación > Producción Científica
Fundación Universitaria Internacional de Colombia > Investigación > Artículos y libros
Universidad Internacional Iberoamericana México > Investigación > Producción Científica
Universidad Internacional Iberoamericana Puerto Rico > Investigación > Producción Científica
Universidad Internacional do Cuanza > Investigación > Producción Científica Abierto Inglés Non-word and real-word errors are generally two types of spelling errors. Non-word errors are misspelled words that are nonexistent in the lexicon while real-word errors are misspelled words that exist in the lexicon but are used out of context in a sentence. Lexicon-based lookup approach is widely used for non-word errors but it is incapable of handling real-word errors as they require contextual information. Contrary to the English language, real-word error detection and correction for low-resourced languages like Urdu is an unexplored area. This paper presents a real-word spelling error detection and correction approach for the Urdu language. We develop an extensive lexicon of 593,738 words and use this lexicon to develop a dataset for real-word errors comprising 125562 sentences and 2,552,735 words. Based on the developed lexicon and dataset, we then develop a contextual spell checker that detects and corrects real-word errors. For the real-word error detection phase, word-gram features are used along with five machine learning classifiers, achieving a precision, recall, and F1-score of 0.84,0.79, and 0.81 respectively. We also test the proposed approach with a 40% error density. For real-word error correction, the Damerau-Levenshtein distance is used along with the n-gram model for further ranking of the suggested candidate words, achieving an accuracy of up to 83.67%. metadata Aziz, Romila; Anwar, Muhammad Waqas; Jamal, Muhammad Hasan; Bajwa, Usama Ijaz; Kuc Castilla, Ángel Gabriel; Uc-Rios, Carlos; Bautista Thompson, Ernesto y Ashraf, Imran mail SIN ESPECIFICAR, SIN ESPECIFICAR, SIN ESPECIFICAR, SIN ESPECIFICAR, SIN ESPECIFICAR, carlos.uc@unini.edu.mx, ernesto.bautista@unini.edu.mx, SIN ESPECIFICAR (2023) Real Word Spelling Error Detection and Correction for Urdu Language. IEEE Access. p. 1. ISSN 2169-3536

Texto
Real_Word_Spelling_Error_Detection_and_Correction_for_Urdu_Language.pdf
Available under License Creative Commons Attribution Non-commercial No Derivatives.
Descargar (3MB)

URL Oficial: http://doi.org/10.1109/ACCESS.2023.3312730

Resumen

Non-word and real-word errors are generally two types of spelling errors. Non-word errors are misspelled words that are nonexistent in the lexicon while real-word errors are misspelled words that exist in the lexicon but are used out of context in a sentence. Lexicon-based lookup approach is widely used for non-word errors but it is incapable of handling real-word errors as they require contextual information. Contrary to the English language, real-word error detection and correction for low-resourced languages like Urdu is an unexplored area. This paper presents a real-word spelling error detection and correction approach for the Urdu language. We develop an extensive lexicon of 593,738 words and use this lexicon to develop a dataset for real-word errors comprising 125562 sentences and 2,552,735 words. Based on the developed lexicon and dataset, we then develop a contextual spell checker that detects and corrects real-word errors. For the real-word error detection phase, word-gram features are used along with five machine learning classifiers, achieving a precision, recall, and F1-score of 0.84,0.79, and 0.81 respectively. We also test the proposed approach with a 40% error density. For real-word error correction, the Damerau-Levenshtein distance is used along with the n-gram model for further ranking of the suggested candidate words, achieving an accuracy of up to 83.67%.

Tipo de Documento:	Artículo
Palabras Clave:	Real-word errors, spelling correction, spelling detection, spell checker
Clasificación temática:	Materias > Ingeniería
Divisiones:	Universidad Europea del Atlántico > Investigación > Producción Científica Fundación Universitaria Internacional de Colombia > Investigación > Artículos y libros Universidad Internacional Iberoamericana México > Investigación > Producción Científica Universidad Internacional Iberoamericana Puerto Rico > Investigación > Producción Científica Universidad Internacional do Cuanza > Investigación > Producción Científica
Depositado:	14 Sep 2023 23:30
Ultima Modificación:	14 Sep 2023 23:30
URI:	https://repositorio.unincol.edu.co/id/eprint/8800

Acciones (logins necesarios)

Ver Objeto

open

Detecting hate in diversity: a survey of multilingual code-mixed image and video analysis

The proliferation of damaging content on social media in today’s digital environment has increased the need for efficient hate speech identification systems. A thorough examination of hate speech detection methods in a variety of settings, such as code-mixed, multilingual, visual, audio, and textual scenarios, is presented in this paper. Unlike previous research focusing on single modalities, our study thoroughly examines hate speech identification across multiple forms. We classify the numerous types of hate speech, showing how it appears on different platforms and emphasizing the unique difficulties in multi-modal and multilingual settings. We fill research gaps by assessing a variety of methods, including deep learning, machine learning, and natural language processing, especially for complicated data like code-mixed and cross-lingual text. Additionally, we offer key technique comparisons, suggesting future research avenues that prioritize multi-modal analysis and ethical data handling, while acknowledging its benefits and drawbacks. This study attempts to promote scholarly research and real-world applications on social media platforms by acting as an essential resource for improving hate speech identification across various data sources.

Producción Científica

Hafiz Muhammad Raza Ur Rehman mail , Mahpara Saleem mail , Muhammad Zeeshan Jhandir mail , Eduardo René Silva Alvarado mail eduardo.silva@funiber.org, Helena Garay mail helena.garay@uneatlantico.es, Imran Ashraf mail ,

Raza Ur Rehman

open

Ensemble stacked model for enhanced identification of sentiments from IMDB reviews

The emergence of social media platforms led to the sharing of ideas, thoughts, events, and reviews. The shared views and comments contain people’s sentiments and analysis of these sentiments has emerged as one of the most popular fields of study. Sentiment analysis in the Urdu language is an important research problem similar to other languages, however, it is not investigated very well. On social media platforms like X (Twitter), billions of native Urdu speakers use the Urdu script which makes sentiment analysis in the Urdu language important. In this regard, an ensemble model RRLS is proposed that stacks random forest, recurrent neural network, logistic regression (LR), and support vector machine (SVM). The Internet Movie Database (IMDB) movie reviews and Urdu tweets are examined in this study using Urdu sentiment analysis. The Urdu hack library was used to preprocess the Urdu data, which includes preprocessing operations including normalizing individual letters, merging them, including spaces, etc. concerning punctuation. The problem of accurately encoding Urdu characters and replacing Arabic letters with their Urdu equivalents is fixed by the normalization module. Several models are adopted in this study for extensive evaluation of their accuracy for Urdu sentiment analysis. While the results promising, among machine learning models, the SVM and LR attained an accuracy of 87%, according to performance criteria such as F-measure, accuracy, recall, and precision. The accuracy of the long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) was 84%. The suggested ensemble RRLS model performs better than other learning algorithms and achieves a 90% accuracy rate, outperforming current methods. The use of the synthetic minority oversampling technique (SMOTE) is observed to improve the performance and lead to 92.77% accuracy.

Producción Científica

Komal Azim mail , Alishba Tahir mail , Mobeen Shahroz mail , Hanen Karamti mail , Annia A. Vázquez mail annia.almeyda@uneatlantico.es, Angel Olider Rojas Vistorte mail angel.rojas@uneatlantico.es, Imran Ashraf mail ,

Azim

open

Efficient CNN architecture with image sensing and algorithmic channeling for dataset harmonization

The process of image formulation uses semantic analysis to extract influential vectors from image components. The proposed approach integrates DenseNet with ResNet-50, VGG-19, and GoogLeNet using an innovative bonding process that establishes algorithmic channeling between these models. The goal targets compact efficient image feature vectors that process data in parallel regardless of input color or grayscale consistency and work across different datasets and semantic categories. Image patching techniques with corner straddling and isolated responses help detect peaks and junctions while addressing anisotropic noise through curvature-based computations and auto-correlation calculations. An integrated channeled algorithm processes the refined features by uniting local-global features with primitive-parameterized features and regioned feature vectors. Using K-nearest neighbor indexing methods analyze and retrieve images from the harmonized signature collection effectively. Extensive experimentation is performed on the state-of-the-art datasets including Caltech-101, Cifar-10, Caltech-256, Cifar-100, Corel-10000, 17-Flowers, COIL-100, FTVL Tropical Fruits, Corel-1000, and Zubud. This contribution finally endorses its standing at the peak of deep and complex image sensing analysis. A state-of-the-art deep image sensing analysis method delivers optimal channeling accuracy together with robust dataset harmonization performance.

Producción Científica

Khadija Kanwal mail , Khawaja Tehseen Ahmad mail , Aiza Shabir mail , Li Jing mail , Helena Garay mail helena.garay@uneatlantico.es, Luis Eduardo Prado González mail uis.prado@uneatlantico.es, Hanen Karamti mail , Imran Ashraf mail ,

Kanwal

open

Deep image features sensing with multilevel fusion for complex convolution neural networks & cross domain benchmarks

Efficient image retrieval from a variety of datasets is crucial in today's digital world. Visual properties are represented using primitive image signatures in Content Based Image Retrieval (CBIR). Feature vectors are employed to classify images into predefined categories. This research presents a unique feature identification technique based on suppression to locate interest points by computing productive sum of pixel derivatives by computing the differentials for corner scores. Scale space interpolation is applied to define interest points by combining color features from spatially ordered L2 normalized coefficients with shape and object information. Object based feature vectors are formed using high variance coefficients to reduce the complexity and are converted into bag-of-visual-words (BoVW) for effective retrieval and ranking. The presented method encompass feature vectors for information synthesis and improves the discriminating strength of the retrieval system by extracting deep image features including primitive, spatial, and overlayed using multilayer fusion of Convolutional Neural Networks(CNNs). Extensive experimentation is performed on standard image datasets benchmarks, including ALOT, Cifar-10, Corel-10k, Tropical Fruits, and Zubud. These datasets cover wide range of categories including shape, color, texture, spatial, and complicated objects. Experimental results demonstrate considerable improvements in precision and recall rates, average retrieval precision and recall, and mean average precision and recall rates across various image semantic groups within versatile datasets. The integration of traditional feature extraction methods fusion with multilevel CNN advances image sensing and retrieval systems, promising more accurate and efficient image retrieval solutions.

Producción Científica

Jyotismita Chaki mail , Aiza Shabir mail , Khawaja Tehseen Ahmed mail , Arif Mahmood mail , Helena Garay mail helena.garay@uneatlantico.es, Luis Eduardo Prado González mail uis.prado@uneatlantico.es, Imran Ashraf mail ,

Chaki

open

Fundus image classification using feature concatenation for early diagnosis of retinal disease

Background Deep learning models assist ophthalmologists in early detection of diseases from retinal images and timely treatment. Aim Owing to robust and accurate results from deep learning models, we aim to use convolutional neural network (CNN) to provide a non-invasive method for early detection of eye diseases. Methodology We used a hybridized CNN with deep learning (DL) based on two separate CNN blocks, to identify multiple Optic Disc Cupping, Diabetic Retinopathy, Media Haze, and Healthy images. We used the RFMiD dataset, which contains various categories of fundus images representing different eye diseases. Data augmenting, resizing, coping, and one-hot encoding are used among other preprocessing techniques to improve the performance of the proposed model. Color fundus images have been analyzed by CNNs to extract relevant features. Two CCN models that extract deep features are trained in parallel. To obtain more noticeable features, the gathered features are further fused utilizing the Canonical Correlation Analysis fusion approach. To assess the effectiveness, we employed eight classification algorithms: Gradient boosting, support vector machines, voting ensemble, medium KNN, Naive Bayes, COARSE- KNN, random forest, and fine KNN. Results With the greatest accuracy of 93.39%, the ensemble learning performed better than the other algorithms. Conclusion The accuracy rates suggest that the deep learning model has learned to distinguish between different eye disease categories and healthy images effectively. It contributes to the field of eye disease detection through the analysis of color fundus images by providing a reliable and efficient diagnostic system.

Producción Científica

Sara Ejaz mail , Hafiz U Zia mail , Fiaz Majeed mail , Umair Shafique mail , Stefanía Carvajal-Altamiranda mail stefania.carvajal@uneatlantico.es, Vivian Lipari mail vivian.lipari@uneatlantico.es, Imran Ashraf mail ,

Ejaz

Enlaces de interés

Enlaces de interés

Real Word Spelling Error Detection and Correction for Urdu Language

Resumen

Acciones (logins necesarios)

TEMÁTICA

ACCESO

IDIOMA

Filtros

Información