If you are unable to create a new account, please email support@bspsoftware.com

 

News:

MetaManager - Administrative Tools for IBM Cognos
Pricing starting at $2,100
Download Now    Learn More

Main Menu

String comparison by Levenshtein distance

Started by hespora, 23 Feb 2017 04:31:25 AM

Previous topic - Next topic

hespora

Hi there,


has anyone done any Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance) calculations in Cognos? The only thing I found straight away was that Netezza offers it as a function out of the box. As I'm using Oracle, I guess I'd have to built it manually, but I'm stumped as to how to do that.

(disclaimer: yes, I know I'm looking into using the wrong tool. It's either cognos or access as my org does not provide any programming environments)

Lynn

Quote from: hespora on 23 Feb 2017 04:31:25 AM
Hi there,


has anyone done any Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance) calculations in Cognos? The only thing I found straight away was that Netezza offers it as a function out of the box. As I'm using Oracle, I guess I'd have to built it manually, but I'm stumped as to how to do that.

(disclaimer: yes, I know I'm looking into using the wrong tool. It's either cognos or access as my org does not provide any programming environments)

Wow, what a fantastic thing that is! It almost sounds made up! I'd love to know how it is applied but I understand you many not be at liberty to share details. Years (and years and years) ago I did some processing with TV ratings data provided by Nielsen and would have benefited from a function like that.

The computation section is obviously procedural which will be difficult to replicate in a tool like Cognos which is inherently set-based. I know you mention only cognos or access as available tools, but you do have Oracle which supports user defined functions. Is that perhaps an option?

Maybe other clever people will have better ideas.

hespora

Yea, unfortunately, I do not have direct access to the oracle instance; it's just the dwh infra we access through Cognos.

As for what I use it for, that's simple: Address matching. I've got our customer base, and I've got "some" external data (really doesn't matter what the context is ;)), and I need to find out which of my customer records correspond to which of the external records. Zip code comparison results in a score (0..1) by absolute difference; street address comparison results in a score (1 - LevDst * factor), and both multiplied together result in a score for match. The downside is, obviously, that this is a crossjoin and I'm looking to compare two sets of thousands of records each... meh.

I'm building it in access right now, but the performance is horrid, to say the least. >.<

BigChris

You're probably way ahead of me, but I've found a few links for doing the calculation in Access:

https://bytes.com/topic/access/insights/909002-levenshtein-approximate-string-matching

https://access-programmers.co.uk/forums/showthread.php?t=190907

http://www.codeguru.com/vb/gen/vb_database/microsoftaccess/article.php/c13137/Fuzzy-Matching-Demo-in-Access.htm

I presume you'd need to be working with arrays, and outputting the records that meet your matching criteria to a results table...

hespora

Yea, the VBA in the first link is exactly what I'm using. Already works less than decent if you're only comparing (~1h runtime for 1m rows), but if I put the calculation into my WHERE clause in order to get only those rows that are decent matches, runtimes go through the roof.

BigChris

Wouldn't you just put those 1M records into a table then select from there?Probably a lot quicker than the WHERE clause route...

hespora

Interesting idea... my approach was to keep it all within queries and subqueries in order to be a repeatable excercise, but that might just be an option. Thanks for the brainstorm! :)

BigChris

Nothing to stop you putting the code from the queries into a bit of VB code, or even into a macro. That way you could initialise the table / delete records as you need when the process runs. Just my 2p worth...

hespora

...ended up doing the whole thing in a Lua script on my private hardware. Now all I need is to pressure my IT to install a Lua IDE on my work machine. ;)