
Duplicated Students

Introduction

The Duplicate Students job identifies potential duplicates by comparing the latest student records that have not yet been assessed against the last N students, where N is specified in the OTAS system settings. It runs as a cron job at configured intervals and compares records on name, email, phone number, and passport number. Each criterion carries a configurable weight that can be adjusted in the system settings.

The Duplicate Students job is defined in the file plugins/otas/api/jobs/CheckDuplicateQueue.php.

The Duplicate Students job is enqueued with one of two priorities: "high" or "low". The "high" priority is used when an administrator starts the job directly, while the "low" priority is used when the cron job scheduler triggers it.

Queue::push(CheckDuplicateQueue::class, [
    "last_std" => $lastStudent,
    "offset" => $offset,
    "limit" => $limit,
    "min_percent" => $minPercent,
    "keys_weight" => $weights,
    "duplicate_values" => $duplicateValues,
], $priority);

When a new student is registered, a job is queued with low priority.

$duplicateController = new DuplicateController();
$duplicateController->pushToQueue($student->id);

The DuplicateController includes a constructor responsible for loading the field weights, which are utilized in the calculation of similarity. These weights serve as parameters for both the PushLastStudentToQueue action and the CheckDuplicateQueue job.

Below are the weights and parameters employed to oversee the duplication-checking process.

$this->keysWeight = [
    "passport_weight" => 70,
    "phone_weight" => 15,
    "email_weight" => 7.5,
    "name_weight" => 7.5,
    "limit_process" => 10,
    "min_percent" => 70,
    "enabled" => true,
    "max_duplicate" => 3,
    "start_from" => "01/01/2022",
    "check_until" => "01/01/2022",
    "diff_nationalities" => false,
    "diff_genders" => false,
];

public function pushToQueue($std_id = false, $priority = 'low')
{
    return (new PushLastStudentToQueue())->handle($this->keysWeight, $std_id, $priority);
}

Parameters of keysWeight

enabled

The enabled parameter is a boolean value utilized to determine whether the job should be executed or not.

passport_weight, phone_weight, email_weight, and name_weight

The passport_weight, phone_weight, email_weight, and name_weight values are the weighting factors for the four compared fields. Each is a floating-point number with a maximum value of 100, and the four weights sum to 100.
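As a quick sanity check on a configuration, the four weights can be summed and compared against 100. The sketch below is illustrative only: weightsSumTo100 is a hypothetical helper, not part of the OTAS codebase, and the tolerance is an assumption added to absorb floating-point rounding.

```php
<?php
// Hypothetical helper (not part of OTAS): checks that the four field
// weights add up to 100, within a small tolerance for float rounding.
function weightsSumTo100(array $keysWeight, float $tolerance = 0.0001): bool
{
    $fields = ['passport_weight', 'phone_weight', 'email_weight', 'name_weight'];
    $sum = 0.0;
    foreach ($fields as $field) {
        // A missing weight counts as 0 in this sketch.
        $sum += (float) ($keysWeight[$field] ?? 0);
    }
    return abs($sum - 100.0) <= $tolerance;
}

// The default weights shown above: 70 + 15 + 7.5 + 7.5 = 100.
$defaults = [
    'passport_weight' => 70,
    'phone_weight'    => 15,
    'email_weight'    => 7.5,
    'name_weight'     => 7.5,
];
var_dump(weightsSumTo100($defaults)); // bool(true)
```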

limit_process

The limit_process parameter denotes the number of students within the batch that are to be cross-referenced with the latest student record.

min_percent

The min_percent parameter represents the minimum percentage used to determine whether a student is a duplicate of the last student.

max_duplicate

The max_duplicate value serves as a stop condition, halting the continuous checking process when the potential duplication score of the last student equals or exceeds the specified max_duplicate value.

start_from and check_until

The start_from and check_until parameters are dates that bound the check in time: start_from sets the starting point for the operation, and check_until sets the limit up to which the last student is checked.

diff_nationalities

The diff_nationalities parameter is a boolean value that, when set to false, allows for potential duplications to occur among students from the same nationality. Conversely, when set to true, even if the data matches 100%, students from different nationalities will be marked as not duplicated.

diff_genders

The diff_genders parameter operates similarly to diff_nationalities, but it considers the gender of the students. When set to false, it allows for potential duplications among students of the same gender, while setting it to true will mark students as not duplicated if they have identical data but different genders.
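The two flags can be pictured as a filter applied before a pair is accepted as a potential duplicate. This is a minimal sketch, not the OTAS implementation: the function name and the nationality and gender field names are hypothetical, and a flag set to false is read here as "no filter applied", which is one interpretation of the description above.

```php
<?php
// Sketch only: 'nationality' and 'gender' are assumed field names, and
// demographicsAllowMatch is a hypothetical helper, not an OTAS function.
function demographicsAllowMatch(array $last, array $candidate, array $settings): bool
{
    // diff_nationalities = true: different nationalities are never duplicates.
    if (!empty($settings['diff_nationalities'])
        && $last['nationality'] !== $candidate['nationality']) {
        return false;
    }
    // diff_genders = true: different genders are never duplicates.
    if (!empty($settings['diff_genders'])
        && $last['gender'] !== $candidate['gender']) {
        return false;
    }
    // Assumption: with a flag set to false, that attribute is not checked.
    return true;
}
```

With diff_nationalities set to true, two records with identical data but different nationalities would be rejected by this filter regardless of their similarity score.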

Workflow

Once the job is enqueued and initiated, the initial step involves checking whether the last student's ID exists or not. If the student record does exist, the process proceeds to examine the duplicate_trace table. This table serves as a trace mechanism for the duplication process, employing limit and offset parameters to iterate through batches of students.

If there is no record for the student in the duplicate_trace table, the job retrieves the first batch starting from the current last student ID, continuing until the limit_process value is reached. Conversely, if there is an existing record in the duplicate_trace table, the batch selection is determined by the last offset value stored in the table row, extending until the limit specified in the table's "limit" column.

After the batch of students has been retrieved, a comparison is conducted with the last student's record, focusing on the matching of four specific attributes: email, name, phone, and passport number. Subsequently, the similarity score is computed, taking into account the predefined weights specified in the settings mentioned earlier.
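The weighted scoring can be illustrated with a small sketch. It assumes each field either matches exactly or not at all; the actual job may use a fuzzy string comparison per field, which this document does not specify. The function name and the record field names (passport, phone, email, name) are hypothetical.

```php
<?php
// Sketch under assumptions: every field is a full match or no match.
// similarityPercent and the record field names are hypothetical.
function similarityPercent(array $last, array $candidate, array $weights): float
{
    $fieldToWeight = [
        'passport' => 'passport_weight',
        'phone'    => 'phone_weight',
        'email'    => 'email_weight',
        'name'     => 'name_weight',
    ];
    $score = 0.0;
    foreach ($fieldToWeight as $field => $weightKey) {
        $a = $last[$field] ?? '';
        $b = $candidate[$field] ?? '';
        // Empty values never count as a match in this sketch.
        if ($a !== '' && $a === $b) {
            $score += (float) $weights[$weightKey];
        }
    }
    return $score;
}

$weights = ['passport_weight' => 70, 'phone_weight' => 15,
            'email_weight' => 7.5, 'name_weight' => 7.5];
$last      = ['passport' => 'A123', 'phone' => '5551234567',
              'email' => 'a@example.com', 'name' => 'Ali Veli'];
$candidate = ['passport' => 'A123', 'phone' => '5551234567',
              'email' => 'b@example.com', 'name' => 'Ali V.'];
$score = similarityPercent($last, $candidate, $weights); // 85.0 (passport 70 + phone 15)
```

With the default min_percent of 70, a score of 85 would flag this pair as a potential duplication.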

Note: Phone numbers undergo a filtering process in which the country code is removed, allowing the comparison to be carried out solely on the actual phone number portion.
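A minimal sketch of this filtering, assuming the country calling code is already known and passed in (how OTAS determines it is not described here). The helper name is hypothetical, and reducing the number to digits first is an added assumption.

```php
<?php
// Hypothetical helper: strips a known country calling code from a phone
// number. Reducing the input to digits first is an assumption.
function stripCountryCode(string $phone, string $dialCode): string
{
    $digits = preg_replace('/\D+/', '', $phone);    // keep digits only
    $code   = preg_replace('/\D+/', '', $dialCode); // e.g. "+90" becomes "90"
    if ($code !== '' && strncmp($digits, $code, strlen($code)) === 0) {
        // Drop the leading calling code.
        $digits = (string) substr($digits, strlen($code));
    }
    return $digits;
}
```

A local number that happens to begin with the same digits as the calling code would be stripped too, so a real implementation needs a more careful rule than this prefix test.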

Note: In the case of emails, a filtering process is applied where all dots are removed, and if the email contains a plus symbol (+), everything after the plus mark is trimmed.
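The email rule can be sketched as follows. The note does not say whether the rule applies to the whole address or only to the part before the @ sign; this sketch assumes the latter, so the domain is kept unchanged. The function name is hypothetical.

```php
<?php
// Hypothetical helper. Assumption: dots and "+tag" suffixes are removed
// from the local part only (before '@'); the domain is left untouched.
function normalizeEmail(string $email): string
{
    $at = strrpos($email, '@');
    if ($at === false) {
        $local  = $email;
        $domain = '';
    } else {
        $local  = substr($email, 0, $at);
        $domain = substr($email, $at); // includes the '@'
    }
    // Trim the plus sign and everything after it.
    $plus = strpos($local, '+');
    if ($plus !== false) {
        $local = substr($local, 0, $plus);
    }
    // Remove all dots from the local part.
    $local = str_replace('.', '', $local);
    return $local . $domain;
}
```

Under this assumption, john.doe+uni@example.com and johndoe@example.com both normalize to johndoe@example.com and would therefore compare as equal.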

Following the computation of similarity scores for all values, the system proceeds to assess whether the total similarity percentage exceeds the specified min_percent. If it surpasses this threshold, the calculated result is then stored in the duplicate_trace table as a potential duplication between the last student and the current student in the batch.

Once all the students in the current batch have been compared with the last student, a new offset is computed by adding the old offset to the limit value. This updated offset is then stored or updated in the duplicate_trace table, enabling the process to start from the new offset in the next iteration. This continues until the check_until time is reached within the students table.
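The offset bookkeeping can be sketched with an in-memory list standing in for the students table. The $trace array is a simplified stand-in for a duplicate_trace row; its shape and the processRound name are assumptions for illustration.

```php
<?php
// Sketch: one round of batch processing over an in-memory list.
// $trace stands in for a duplicate_trace row; its shape is assumed.
function processRound(array $students, array $trace): array
{
    $batch = array_slice($students, $trace['offset'], $trace['limit']);
    // ... compare each student in $batch with the last student here ...
    $trace['offset'] += $trace['limit'];  // new offset = old offset + limit
    $trace['processed'] = count($batch);  // size of the batch just handled
    return $trace;
}

$students = range(1, 25);                  // stand-in student IDs
$trace = ['offset' => 0, 'limit' => 10];
$trace = processRound($students, $trace);  // offset is now 10
$trace = processRound($students, $trace);  // offset is now 20
```

Persisting the updated offset after each round is what lets a later run resume where the previous one stopped.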

Note: When the priority is set to "high", the next round begins immediately and the same workflow repeats until the check_until time is reached within the students table. When the priority is set to "low", the job ends at this stage and stays inactive until the cron job triggers it again.
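The difference between the two priorities can be sketched as a loop condition. This is an illustration only: $runBatch is a hypothetical callable that processes one batch and returns true while more work remains, and the loop stands in for whatever re-queueing mechanism the real job uses.

```php
<?php
// Sketch: how priority could drive repetition. runDuplicateCheck and
// $runBatch are hypothetical; the real job may re-enqueue itself instead.
function runDuplicateCheck(callable $runBatch, string $priority): int
{
    $rounds = 0;
    do {
        $hasMore = $runBatch(); // process one batch
        $rounds++;
        // "high" keeps going while work remains; "low" stops after one round.
    } while ($priority === 'high' && $hasMore);
    return $rounds;
}

// Usage: three batches of work remain.
$remaining = 3;
$batch = function () use (&$remaining): bool {
    $remaining--;
    return $remaining > 0;
};
$lowRounds  = runDuplicateCheck($batch, 'low');  // 1 round, then waits for cron
$highRounds = runDuplicateCheck($batch, 'high'); // 2 rounds, finishes the rest
```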

Views

The Duplicate Students job significantly impacts various parts of the interface. Here's a breakdown of how it works:

  • Initial Creation and Processing:

    • When a new student is created, they are initially queued for the duplicate check with low priority.
    • A Processing... badge appears next to the student's name throughout the system, indicating the ongoing check for duplicates. This badge is clickable, refreshing the page to update the student's duplication status.
  • Duplication Check Outcome:

    • If no duplicates are found, the badge changes to Not Duplicated, and it becomes non-clickable.
    • If potential duplicates are detected, the badge displays Possible Duplicate: {n}, where {n} represents the number of potential duplicate entries. This badge is clickable and leads to a detailed comparison page.
  • Reviewing Potential Duplicates:

    • The comparison page lists the student and potential duplicates, showcasing names, emails, phones, passport numbers, and similarity percentages for manual verification.
    • Users with appropriate permissions can mark a student as Duplicated or Not Duplicated. This decision updates the badge accordingly, and both outcomes are clickable, leading back to the comparison page for potential undo actions.
  • Marking as Duplicated:

    • Marking a student as Duplicated changes all their applications to a Student Duplicated status, locking any modifications to the applications or student profile.

This system ensures a rigorous check for duplicates, allowing for manual review and actions while maintaining data integrity and preventing duplication within the system.