What it does

Our project is a solution to OpenAlly's first Challenge: Patient Matching

How I built it

  • There is an is_same_person() function that takes two rows and a logging file pointer and returns a score
  • The lower the score, the more likely it is that the two rows describe the same person
  • The exact threshold at which two rows are declared different people was tuned through testing
  • most mistakes in data fields (typos) can be measured using the Damerau-Levenshtein distance algorithm
  • Damerau-Levenshtein distance measures string similarity as the minimum number of character substitutions, insertions, deletions, and adjacent-character swaps needed to transform one string into another
  • I use this algorithm on many of the columns to determine whether their values are equivalent
  • if the distance is below a per-column threshold, the values for that column are considered equivalent (a sketch of this comparison follows this list)
  • for last and first names, the score modifier scales with how different the two strings are
  • for example, "Austin" vs. "Marco" would return a high number, as the names are very different
  • the more different the names are, the more likely it is that they belong to different people
  • for the more unpredictable, non-name fields, the score adjustments are static
  • because the data had so many inconsistencies, we had to write a lot of per-column logic to normalize each column's contents before measuring similarity
  • for example, some of the State entries were postal codes and others were expanded names
  • to solve this, I wrote different cases based on the combination of formats in the two rows (see the state-comparison sketch after this list)
  • if only one row is abbreviated, abbreviate the other
  • then take the distance between the two codes; if they match, subtract 5 from the score
  • if they differ by one character, add 5 to the score
  • if they differ by two (entirely different codes), add 15 to the score
  • if both were already abbreviated, use the same logic, except add 7.5 when they differ by one
  • no conversion happened in that case, so a one-character difference more likely means genuinely different states
  • if neither is abbreviated, take the distance as is; if the distance is greater than 2, add 15 to the score
  • otherwise, subtract 5
  • another clever solution is in handle_names(), the function that takes care of first and last names (sketched after this list)
  • in the data, there are instances where a patient's name appears in its short form in one row and in its long form in another
  • while a short form can never be confirmed with certainty, it has to be accounted for so that matching rows are not mistakenly scored as different people
  • the solution was to check whether the two names differ in length by more than 2
  • if so, check whether one of the names starts with the other
  • if it does, subtract from the score
  • otherwise, if the two names appear to be distinct, return a positive score proportional to how different they are
  • there are more things in the code that I found interesting, but I will leave them out so that this does not become an essay
  • I made a conscious effort to comment all of my code, so I hope that it is readable
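
A minimal sketch of the distance-based comparison described above, using nltk's edit_distance (which computes a Damerau-Levenshtein-style distance when transpositions=True). The function names, threshold, and score deltas here are illustrative, not the exact tuned values from the submission:

```python
from nltk import edit_distance  # Damerau-Levenshtein-style when transpositions=True

FIELD_THRESHOLD = 2  # illustrative per-column cutoff; the real values came from testing

def compare_field(a: str, b: str) -> int:
    """Static score delta for an unpredictable, non-name field."""
    dist = edit_distance(a.strip().lower(), b.strip().lower(), transpositions=True)
    if dist <= FIELD_THRESHOLD:
        return -5   # close enough to be a typo: treat the values as equivalent
    return 10       # far apart: likely genuinely different values

def compare_name(a: str, b: str) -> int:
    """Name fields scale with how different the strings are."""
    dist = edit_distance(a.strip().lower(), b.strip().lower(), transpositions=True)
    if dist <= 1:
        return -10      # near-identical names strongly suggest a match
    return 5 * dist     # "Austin" vs. "Marco" yields a large penalty
```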
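
The state-comparison cases could look like the sketch below. STATE_ABBREV stands in for a full expanded-name-to-postal-code table, and compare_states is an illustrative name rather than the exact function in the code:

```python
from nltk import edit_distance

STATE_ABBREV = {"california": "CA", "texas": "TX"}  # truncated stand-in table

def compare_states(a: str, b: str) -> float:
    a, b = a.strip(), b.strip()
    a_abbr, b_abbr = len(a) == 2, len(b) == 2
    if a_abbr != b_abbr:
        # only one is a postal code: abbreviate the expanded one first
        full = a if b_abbr else b
        code = STATE_ABBREV.get(full.lower(), "??")  # unknown names fall through
        other = (a if a_abbr else b).upper()
        dist = edit_distance(code, other)
        if dist == 0:
            return -5               # same state after conversion
        return 5 if dist == 1 else 15
    if a_abbr and b_abbr:
        # both already postal codes: no conversion happened, so a
        # one-character difference is more suspicious
        dist = edit_distance(a.upper(), b.upper())
        if dist == 0:
            return -5
        return 7.5 if dist == 1 else 15
    # neither is abbreviated: compare the expanded names as is
    dist = edit_distance(a.lower(), b.lower())
    return 15 if dist > 2 else -5
```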
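
And a sketch of the short-form handling in handle_names(); the score values are illustrative:

```python
from nltk import edit_distance

def handle_names(a: str, b: str) -> int:
    a, b = a.strip().lower(), b.strip().lower()
    if abs(len(a) - len(b)) > 2:
        shorter, longer = sorted((a, b), key=len)
        if longer.startswith(shorter):
            return -5   # e.g. "sam" vs. "samantha": likely a short form
    dist = edit_distance(a, b, transpositions=True)
    if dist <= 1:
        return -10      # near-identical: very likely the same name
    return 5 * dist     # positive score proportional to how different they are
```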

Grouping

  • I thought the way I got grouping to work was pretty interesting (a sketch follows this list)
  • a set of indices is maintained to keep track of one representative row per distinct group
  • each row is given a pointer to a list of sibling rows
  • when comparing 2 rows, if the rows are deemed equal, they are put in the same group
  • if neither has a group at that point, a new group containing both rows is created, and the index of one of the rows is added to the set
  • the other row's group pointer then points to this same list
  • if one of them has a group but the other does not, the groupless row is added to the existing group and its pointer is set to that group
  • if both of them have groups, the smaller group is appended onto the larger group
  • the smaller group's representative index is then removed from the set
  • the end result is that all equivalent rows end up in their own grouping
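
A minimal sketch of this scheme; the names are illustrative, and same_person(i, j) stands in for thresholding is_same_person():

```python
def group_rows(n_rows, same_person):
    groups = [None] * n_rows  # row index -> shared list of sibling rows
    reps = set()              # one representative row index per live group
    for i in range(n_rows):
        for j in range(i + 1, n_rows):
            if not same_person(i, j):
                continue
            gi, gj = groups[i], groups[j]
            if gi is None and gj is None:
                group = [i, j]              # brand-new group for both rows
                groups[i] = groups[j] = group
                reps.add(i)
            elif gi is None:
                gj.append(i)                # j already grouped: absorb i
                groups[i] = gj
            elif gj is None:
                gi.append(j)                # i already grouped: absorb j
                groups[j] = gi
            elif gi is not gj:
                small, large = sorted((gi, gj), key=len)
                large.extend(small)         # append smaller onto larger
                for row in small:
                    groups[row] = large     # repoint the absorbed rows
                reps.discard(small[0])      # drop the absorbed representative
    return [groups[r] for r in reps]
```

Appending the smaller list onto the larger one keeps the repointing work small, which is the same intuition as union-by-size in a disjoint-set structure.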

Challenges I ran into

  • figuring out how best to group the patients
  • I tested the algorithm only by comparing adjacent rows
  • I did not focus on the end goal at first; I just wanted to get the accuracy up
  • I then realized that the runtime complexity was quadratic and that the code was going to be tested against hundreds of thousands of records
  • if I had realized this sooner, I might have changed my logic to incorporate some sort of hashing
  • that would have cut down on the main bottleneck: the is_same_person() function
  • hashing would mean less computation in the long run (one possible version is sketched after this list)
  • realizing that using numpy arrays was not going to instantly speed things up
  • np arrays are useful because they offer direct access to elements with very little overhead
  • however, when the dtype of the array is object, there is a lot of overhead from the pointer dereferencing needed to reach each object
  • arrays are much more beneficial when they hold primitives (a small illustration also follows this list)
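
For reference, the hashing idea could look like a blocking pass, which I did not implement: bucket rows by a cheap key and only run is_same_person() within each bucket. The row fields in the key below are assumptions for illustration:

```python
from collections import defaultdict

def block_key(row):
    # hypothetical key: first two letters of the last name plus birth year
    return (row["last_name"][:2].lower(), row["dob"][:4])

def candidate_pairs(rows):
    """Yield only the row-index pairs that share a blocking key."""
    buckets = defaultdict(list)
    for idx, row in enumerate(rows):
        buckets[block_key(row)].append(idx)
    for bucket in buckets.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                yield bucket[i], bucket[j]
```

The trade-off is that typos in the key fields can scatter true matches across buckets, so a real blocking key would need the same normalization care as the columns themselves.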
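
A small illustration of the dtype point:

```python
import numpy as np

# dtype=object stores a pointer per element, so every access dereferences
# a full Python object; a primitive dtype stores raw values contiguously
obj_arr = np.array(["Austin", "Marco"], dtype=object)
num_arr = np.array([1.0, 2.0], dtype=np.float64)
print(obj_arr.dtype, num_arr.dtype)  # -> object float64
```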

What I learned

  • I learned a lot about NLP in doing this project
  • Learning about the different string similarity algorithms was really interesting for me
  • I learned about the importance of randomness in algorithmic design
  • I already knew that many quicksort implementations randomize the order of the data before processing, but this project made me realize how beneficial that is
  • With the default row order, running the test() function, I was maxing out at 95% accuracy
  • However, when I randomized the order, this number approached 100%, even hitting it at one point
  • this was happening because many of the changes between different patients were "gradual" in the dataset
  • it was as if a gradient had been applied, smoothing out the sharp changes between adjacent rows
  • the algorithm managed 95% accuracy even with that gradient, but randomizing the order of the rows to put distinct rows next to each other bumped up the performance even more (a minimal version of the shuffle follows this list)
  • While this may be a case of overfitting the weights to the data, there were very few sacrifices for accuracy that had to be made to account for edge cases
  • Therefore, I believe that if the algorithm did NOT run in quadratic time, it would be very useful in industrial applications
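
The randomization itself is just one permutation of the row order before comparison; a minimal, seedable version:

```python
import numpy as np

rows = list(range(10))          # stand-in for the loaded patient records
rng = np.random.default_rng(0)  # fixed seed keeps test runs reproducible
order = rng.permutation(len(rows))
shuffled = [rows[i] for i in order]
```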

Technologies

natural-language-processing, nltk, numpy, python

Devpost Software Identifier

255802