Duplicated Students
Introduction
The Duplicate Students job checks the latest student records that have not yet been assessed for potential duplicates, comparing each one against the last N students, where N is specified in the OTAS system settings. It runs as a cron job at specified intervals and looks for potential duplicates based on name, email, phone number, and passport number. Each of these criteria carries a configurable weight that can be customized in the system settings.
The Duplicate Students job is defined in plugins/otas/api/jobs/CheckDuplicateQueue.php.
The Duplicate Students job is enqueued and can run with one of two priorities, "high" or "low". The "high" priority is used when an administrator starts the job directly, and the "low" priority is used when the cron scheduler triggers it.
    Queue::push(CheckDuplicateQueue::class, [
        "last_std" => $lastStudent,
        "offset" => $offset,
        "limit" => $limit,
        "min_percent" => $minPercent,
        "keys_weight" => $weights,
        "duplicate_values" => $duplicateValues,
    ], $priority);
When a new student is registered, a job is queued with low priority.
    $duplicateController = new DuplicateController();
    $duplicateController->pushToQueue($student->id);
The DuplicateController constructor loads the field weights used to calculate similarity. These weights are passed as parameters to both the PushLastStudentToQueue action and the CheckDuplicateQueue job.
The following weights and parameters control the duplication-checking process.
    $this->keysWeight = [
        "passport_weight" => 70,
        "phone_weight" => 15,
        "email_weight" => 7.5,
        "name_weight" => 7.5,
        "limit_process" => 10,
        "min_percent" => 70,
        "enabled" => true,
        "max_duplicate" => 3,
        "start_from" => "01/01/2022",
        "check_until" => "01/01/2022",
        "diff_nationalities" => false,
        "diff_genders" => false,
    ];
    public function pushToQueue($std_id = false, $priority = 'low')
    {
        return (new PushLastStudentToQueue())->handle($this->keysWeight, $std_id, $priority);
    }
Parameters of keysWeight
enabled
The enabled parameter is a boolean that determines whether the job is executed.
passport_weight, phone_weight, email_weight, and name_weight
The passport_weight, phone_weight, email_weight, and name_weight values are weighting factors. Each is a floating-point number with a maximum value of 100, and the four weights sum to 100.
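The constraint on the weights can be checked before they are used. The following is a minimal sketch, not code from the OTAS codebase: the function name weightsAreValid is hypothetical, the lower bound of 0 is an assumption, and the keys mirror the keysWeight array shown earlier.

```php
<?php
// Hypothetical helper (not part of OTAS): validates the four similarity
// weights taken from a keysWeight-style array.
function weightsAreValid(array $keysWeight): bool
{
    $keys = ['passport_weight', 'phone_weight', 'email_weight', 'name_weight'];
    $sum = 0.0;

    foreach ($keys as $key) {
        if (!isset($keysWeight[$key]) || !is_numeric($keysWeight[$key])) {
            return false; // missing or non-numeric weight
        }
        $weight = (float) $keysWeight[$key];
        if ($weight < 0 || $weight > 100) {
            return false; // assumed range: 0 to 100 inclusive
        }
        $sum += $weight;
    }

    // Compare with a small tolerance because the weights are floats.
    return abs($sum - 100.0) < 0.0001;
}

$defaults = [
    'passport_weight' => 70,
    'phone_weight'    => 15,
    'email_weight'    => 7.5,
    'name_weight'     => 7.5,
];
$valid = weightsAreValid($defaults); // true for the defaults shown earlier
```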
limit_process
The limit_process parameter is the number of students in each batch that are compared with the latest student record.
min_percent
The min_percent parameter is the minimum similarity percentage used to determine whether a student is a potential duplicate of the last student.
max_duplicate
The max_duplicate value is a stop condition: checking halts once the potential duplication score of the last student equals or exceeds max_duplicate.
start_from and check_until
The start_from and check_until parameters are time-based bounds. start_from sets the starting point for the operation, and the last student is checked only until the check_until time is reached.
diff_nationalities
The diff_nationalities parameter is a boolean. When set to false, students of the same nationality can be flagged as potential duplicates. When set to true, students of different nationalities are marked as not duplicated even if their data matches 100%.
diff_genders
The diff_genders parameter works like diff_nationalities but considers the students' gender. When set to false, students of the same gender can be flagged as potential duplicates. When set to true, students with identical data but different genders are marked as not duplicated.
Workflow
Once the job is enqueued and started, the first step checks whether the last student's ID exists. If the student record exists, the job looks up the duplicate_trace table. This table traces the duplication process, using limit and offset values to iterate through batches of students.
If the duplicate_trace table has no record for the student, the job retrieves the first batch starting from the current last student ID and continuing until limit_process students have been selected. If a record already exists in the duplicate_trace table, the batch starts at the last offset stored in that row and extends to the limit given in the row's "limit" column.
After the batch of students has been retrieved, a comparison is conducted with the last student's record, focusing on the matching of four specific attributes: email, name, phone, and passport number. Subsequently, the similarity score is computed, taking into account the predefined weights specified in the settings mentioned earlier.
Note: Phone numbers undergo a filtering process in which the country code is removed, allowing the comparison to be carried out solely on the actual phone number portion.
Note: In the case of emails, a filtering process is applied where all dots are removed, and if the email contains a plus symbol (+), everything after the plus mark is trimmed.
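The two notes above can be sketched as follows. This is an illustration under stated assumptions, not the actual OTAS code: the function names are hypothetical, the email rules are assumed to apply only to the part before the "@" (the source does not say whether the domain is affected), and the phone helper assumes the caller already knows the dialing code.

```php
<?php
// Hypothetical sketch; names and exact rules are assumptions.

// Assumption: dots and "+suffix" are stripped from the local part only,
// and the domain is left unchanged.
function normalizeEmail(string $email): string
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return $email; // no "@": leave the value untouched
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    $plus = strpos($local, '+');
    if ($plus !== false) {
        $local = substr($local, 0, $plus); // trim "+" and everything after it
    }
    $local = str_replace('.', '', $local); // remove all dots

    return $local . '@' . $domain;
}

// Assumption: the dialing code (for example "90") is supplied by the caller,
// and it is only stripped when the number has a "+" or "00" prefix.
function stripCountryCode(string $phone, string $dialingCode): string
{
    $digits = preg_replace('/\D+/', '', $phone); // keep digits only

    if (preg_match('/^\s*\+/', $phone) === 1) {
        $international = $digits;
    } elseif (strpos($digits, '00') === 0) {
        $international = substr($digits, 2);
    } else {
        return $digits; // no international prefix found
    }

    if ($dialingCode !== '' && strpos($international, $dialingCode) === 0) {
        return substr($international, strlen($dialingCode));
    }
    return $international;
}
```

With these helpers, "john.doe+uni@example.com" and "johndoe@example.com" normalize to the same value, as do "+90 555 123 4567" and "0090 555 123 4567" when the dialing code is "90".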
After the similarity scores for all values have been computed, the system checks whether the total similarity percentage exceeds min_percent. If it does, the result is stored in the duplicate_trace table as a potential duplication between the last student and the current student in the batch.
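The weighted score and threshold check described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that an exact match on an already normalized field contributes its full weight; the source does not say whether partial or fuzzy matching is used.

```php
<?php
// Hypothetical sketch of the weighted similarity score.
// Assumption: each field either matches exactly (full weight) or not (zero).
function similarityPercent(array $last, array $candidate, array $weights): float
{
    $fields = [
        'passport' => 'passport_weight',
        'phone'    => 'phone_weight',
        'email'    => 'email_weight',
        'name'     => 'name_weight',
    ];

    $score = 0.0;
    foreach ($fields as $field => $weightKey) {
        $a = $last[$field] ?? '';
        $b = $candidate[$field] ?? '';
        // Empty values never count as a match.
        if ($a !== '' && $a === $b) {
            $score += (float) $weights[$weightKey];
        }
    }
    return $score;
}

// The workflow text says the total must exceed min_percent, so a strict
// comparison is used here.
function isPotentialDuplicate(float $score, float $minPercent): bool
{
    return $score > $minPercent;
}

$weights = [
    'passport_weight' => 70,
    'phone_weight'    => 15,
    'email_weight'    => 7.5,
    'name_weight'     => 7.5,
];
$last      = ['passport' => 'P123', 'phone' => '5551234567', 'email' => 'a@x.com', 'name' => 'Ali Veli'];
$candidate = ['passport' => 'P123', 'phone' => '5551234567', 'email' => 'b@y.com', 'name' => 'A. Veli'];

$score = similarityPercent($last, $candidate, $weights); // 70 + 15 = 85.0
$flag  = isPotentialDuplicate($score, 70);               // true
```

Note that with the default weights, a passport-only match scores exactly 70. Under the strict comparison assumed here it would not be flagged; if the real job uses "greater than or equal", it would.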
Once all the students in the current batch have been compared with the last student, a new offset is computed by adding the limit to the old offset. The new offset is stored in the duplicate_trace table (creating or updating the row), so the next iteration starts from it. This continues until the check_until time is reached within the students table.
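The offset bookkeeping can be illustrated with a small in-memory sketch. Here a plain array stands in for the students table and another for a single duplicate_trace row; the function name processBatch and the array layout are assumptions made for illustration.

```php
<?php
// Hypothetical sketch: $students stands in for the students table and
// $trace for one duplicate_trace row with "offset" and "limit" columns.
function processBatch(array $students, array $trace): array
{
    // Select the batch that starts at the stored offset.
    $batch = array_slice($students, $trace['offset'], $trace['limit']);

    // ... each student in $batch would be compared with the last student here ...

    // New offset = old offset + limit, saved for the next iteration.
    $trace['offset'] += $trace['limit'];

    return [$batch, $trace];
}

$students = range(1, 25);                    // 25 placeholder student IDs
$trace    = ['offset' => 0, 'limit' => 10];  // no batch processed yet

[$firstBatch, $trace]  = processBatch($students, $trace); // IDs 1..10, offset becomes 10
[$secondBatch, $trace] = processBatch($students, $trace); // IDs 11..20, offset becomes 20
```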
Note: When the priority is set to high, the next round begins immediately, and the same workflow repeats until the check_until time is reached within the students table. When the priority is set to low, the job ends at this stage and stays inactive until the cron job triggers it again.
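The difference between the two priorities can be sketched as a loop. This is a simplified model under assumptions: the function name runJob is hypothetical, a total record count stands in for the check_until boundary, and the real job may start the next round by re-queuing itself rather than looping.

```php
<?php
// Hypothetical sketch. Returns the starting offsets of the batches that a
// single job run would process.
function runJob(string $priority, int $offset, int $limit, int $total): array
{
    $processed = [];
    do {
        $processed[] = $offset;         // process the batch at this offset
        $offset += $limit;              // advance to the next batch
        $finished = $offset >= $total;  // stands in for "check_until reached"
    } while ($priority === 'high' && !$finished);

    return $processed;
}

$high = runJob('high', 0, 10, 25); // [0, 10, 20]: rounds repeat until done
$low  = runJob('low', 0, 10, 25);  // [0]: one round, then wait for cron
```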
Views
The Duplicate Students job significantly impacts various parts of the interface. Here's a breakdown of how it works:
- Initial Creation and Processing:
  - Upon creating a new student, they are initially marked with a low priority.
  - A Processing... badge appears next to the student's name throughout the system, indicating the ongoing check for duplicates. This badge is clickable, refreshing the page to update the student's duplication status.
- Duplication Check Outcome:
  - If no duplicates are found, the badge changes to Not Duplicated, and it becomes non-clickable.
  - If potential duplicates are detected, the badge displays Possible Duplicate: {n}, where {n} represents the number of potential duplicate entries. This badge is clickable and leads to a detailed comparison page.
- Reviewing Potential Duplicates:
  - The comparison page lists the student and potential duplicates, showcasing names, emails, phones, passport numbers, and similarity percentages for manual verification.
  - Users with appropriate permissions can mark a student as Duplicated or Not Duplicated. This decision updates the badge accordingly, and both outcomes are clickable, leading back to the comparison page for potential undo actions.
- Marking as Duplicated:
  - Marking a student as Duplicated changes all their applications to a Student Duplicated status, locking any modifications to the applications or student profile.
This system ensures a rigorous check for duplicates, allowing for manual review and actions while maintaining data integrity and preventing duplication within the system.