Research on specific applications of AI went hand in hand with the development of inference and learning theories for uncertain and incomplete information (including Bayesian networks and Dempster-Shafer theory); nature-inspired optimization methods (including immune networks and swarm, genetic and extremal optimization algorithms); methods for extracting knowledge from numerical data, text and hypertext (new algorithms for cluster analysis and classification, including graph spectral analysis, and new methods for extracting hierarchical relationships between concepts and simple relationships from natural language texts); new semantic methods of plagiarism detection; and others.
Currently, the Team is taking on one of the most pressing challenges in the field: developing Explainable Artificial Intelligence (XAI) methods. XAI responds to industry objections that artificial intelligence methods such as deep neural networks, evolutionary algorithms and others operate as a "black box", while business trusts only transparent methods. Our Team took on a particularly difficult problem: achieving explainability in cluster analysis of text documents, especially those clustered using spectral methods. The basic difficulty lies in the lack of a coherent axiomatic system for cluster analysis. More seriously still, spectral methods detach the representation of clusters from the textual content of documents. Our achievements in this area include:
stopped due to financial problems
The research group developed NEKST, a massively parallel search engine, to work with Polish Internet resources in a novel way. Our specialty is systematizing online resources and making that systematization visible to the user. Systematization is understood as automatically distributing online resources into thematic groups, highlighting thematic channels within websites, and labeling and categorizing documents and their groups. From the user's point of view, this means more than more precise document identification: systematization also enables contextual search of both individual documents and their groups, such as channels or services, as well as diversification of the search engine's response.
By diversification we mean variation in the response, so that the user sees not only the best documents but also the variety and thematic ambiguity of the query. A classic example is the query "game", which may refer either to playing or to a term used by hunters.
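One common way to obtain this kind of variation is greedy re-ranking in the style of maximal marginal relevance (MMR). The sketch below is only an illustration of that general idea and is not taken from NEKST; the document identifiers, relevance scores, topic labels and the `lam` weight are all hypothetical, and similarity is simplified to "same topic or not".

```python
def diversify(candidates, k, lam=0.5):
    """Greedy MMR-style re-ranking (illustrative sketch, not NEKST's method).

    candidates: list of (doc_id, relevance, topic) tuples (hypothetical data).
    Similarity between two documents is simplified to 1.0 when they share
    a topic label and 0.0 otherwise.
    """
    remaining = list(candidates)
    selected = []

    def score(candidate):
        _, relevance, topic = candidate
        # Penalty: highest similarity to any document already selected.
        similarity = max(
            (1.0 if topic == chosen[2] else 0.0 for chosen in selected),
            default=0.0,
        )
        return lam * relevance - (1.0 - lam) * similarity

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _, _ in selected]


# Hypothetical results for the ambiguous query "game".
results = [
    ("board-games", 0.9, "playing"),
    ("video-games", 0.8, "playing"),
    ("game-meat-recipes", 0.6, "hunting"),
    ("card-games", 0.5, "playing"),
]
top = diversify(results, k=3)
print(top)
```

With these made-up scores, the hunting-related page moves up to second place even though two "playing" pages have higher raw relevance, which is the effect described above.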
Taking context into account is important when looking for a document that is comprehensible only in the context of other documents in a particular thematic channel. For example, when asking a search engine about tires, which is fairly common in autumn and spring, we would expect links to the websites of tire manufacturers or tire shops rather than, e.g., to sites about hard work that makes us tired. Making use of context allows the search engine to return relevant documents that contain the word "tire" even when the word "car" does not occur in them.
Systematization understood in this way will be a useful tool for many groups of users. Scientists and entrepreneurs will be able to look for potential partners or competitors in the market. In addition, systematization will help them identify interesting research areas or gaps in the market that can be exploited.
Our NEKST system, developed as part of the POIG.01.01.02-14-013/09 project, is an advanced technological solution enabling large-scale retrieval and semantic processing of data from the Polish Internet. There are approximately 2.5 million websites in Poland, storing over two billion documents.
This volume of documents presents a challenge for data collection, indexing, and retrieval alike, especially since NEKST goes beyond traditional text processing. It adds semantic indexing, categorization, classification, fact retrieval, automatic knowledge graph creation from online documents, detection of duplicate documents and websites, search for documents that are similar semantically as well as lexically, and document analysis methods that identify the origin of individual document fragments through rapid comparison with all online resources. It can also respond to queries in natural language (Polish), which required the development of globally innovative algorithms and methodologies.
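To give a feel for what lexical near-duplicate detection involves, here is a minimal sketch based on word shingles and Jaccard similarity, a technique well known from the literature. It is an assumption-laden illustration, not the NEKST implementation: the function names, the shingle size and the sample sentences are all hypothetical, and a system operating on billions of documents would need hashing-based approximations rather than exact set comparison.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts.
original = "the quick brown fox jumps over the lazy dog"
copy = "the quick brown fox jumps over the sleepy dog"
other = "search engines index billions of web documents"

sim_copy = jaccard(shingles(original), shingles(copy))
sim_other = jaccard(shingles(original), shingles(other))
print(sim_copy, sim_other)
```

A single changed word leaves most shingles intact, so the near-copy scores far higher than the unrelated text; a threshold on this score is one simple way to flag candidate duplicates.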
The system not only utilizes human-generated semantic resources but also discovers IS-A semantic relationships from its own document database using advanced text analysis methods and proprietary solutions complementary to those known from the literature.
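As a point of reference for the literature-based side of this, the sketch below shows a single lexico-syntactic pattern of the kind often called a Hearst pattern ("X such as A, B and C"). It is a simplified illustration on English text, not the proprietary NEKST method, which works on Polish and is not described here; the pattern, the function name `extract_is_a` and the sample sentence are assumptions made for the example.

```python
import re

# "X such as A, B and C"  ->  (A IS-A X), (B IS-A X), (C IS-A X)
# Only single-word terms are handled in this simplified sketch.
PATTERN = re.compile(r"(\w+) such as (\w+(?:(?:, | and | or )\w+)*)")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(r", | and | or ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


pairs = extract_is_a("The shop sells fruits such as apples, pears and plums.")
print(pairs)
```

Patterns like this are precise but cover only a small share of the relationships expressed in text, which is one reason to complement them with other methods.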
The search engine's design was closely linked to the development of new high-performance syntactic algorithms for Polish language analysis, cluster analysis methods for documents and websites, a proprietary high-performance document database, new fast document ranking methods that eliminate the practical shortcomings of classic PageRank, as well as the creation of engineering solutions for spider systems, indexes, and more.
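For context, classic PageRank, the baseline mentioned above, can be sketched as a power iteration over the link graph. This shows only the textbook algorithm, not the NEKST ranking methods; the damping factor of 0.85, the iteration count and the tiny example graph are conventional or hypothetical choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Every link target must also appear as a key of the dict.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each iteration touches every link in the graph, so on a collection of billions of documents the cost of repeated passes is considerable.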
The multi-scale nature, coverage of the entire Polish Internet, and semantic classification make the relevant components of the NEKST system a valuable tool for providing reference data to the Uniform Anti-Plagiarism System (JSA), which has been a mandatory tool for verifying the originality of all diploma theses (bachelor's, engineer's, master's) and doctoral dissertations in Poland since 2019. Creating a reference set requires, on the one hand, crawling the entire Polish Internet and, on the other hand, filtering out irrelevant documents (e.g. online stores and many others), which has a significant impact on the speed of JSA.
NEKST data is crucial for detecting plagiarism from Polish online sources, especially when such sources make up a significant portion of a thesis. The system enables effective searches of Polish online resources, a goal previously difficult to achieve due to the limited performance of standard search engines. Between September 1, 2023, and August 31, 2024, the JSA system examined 319,656 works, of which 62,748 (19.6%) contained results from NEKST sources, and 1.7% of the works had a degree of borrowing exceeding 70%.
Thanks to JSA with the NEKST component, the number of serious plagiarism cases has decreased by one-third in just three years. This will undoubtedly result in a general improvement in the level of education nationwide and, in the future, accelerate technological development and economic growth.
In summary, usage of the NEKST system has: