EmEditor (text editor) Forum Index
   Regular Expressions
     how to delete duplicate lines
Yutaka
Posted on: 8/10/2011 4:37 pm
Webmaster
Joined: 9/28/2006
From: Redmond
Posts: 2399
Re: how to delete duplicate lines
We are working on adding features like these in future versions.

Thanks!


----------------
Yutaka Emura
Developer of EmEditor
http://www.emeditor.com/

Deipotent
Posted on: 8/10/2011 3:50 pm
Just can't stay away
Joined: 2/15/2008
From:
Posts: 118
Re: how to delete duplicate lines
Thanks Yutaka, that did the trick!

A new command would be useful, particularly for people who don't use scripting or RegEx, as a lot of other text editors include a simple menu option for removing duplicates. A built-in command could also be highly optimised, and could include an option to keep the original order (i.e. so you don't have to sort the lines first).
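
The keep-the-original-order option mentioned above does not need a sort at all: remembering which lines have already been seen is enough. Here is a minimal sketch in plain JavaScript, assuming an engine that provides `Set`. `dedupeKeepOrder` is a hypothetical name, not an EmEditor command or API, and the input is an ordinary array of strings rather than an EmEditor document.

```javascript
// Hypothetical helper (not part of EmEditor): removes repeated lines while
// keeping the first occurrence of each line in its original position.
function dedupeKeepOrder(lines) {
  var seen = new Set();  // lines encountered so far
  var kept = [];         // unique lines, in original order
  for (var i = 0; i < lines.length; i++) {
    if (!seen.has(lines[i])) {
      seen.add(lines[i]);
      kept.push(lines[i]);
    }
  }
  // Also report how many lines were dropped.
  return { lines: kept, removed: lines.length - kept.length };
}

var result = dedupeKeepOrder(["Badger", "Eagle", "Badger", "Donkey", "Eagle"]);
// result.lines   -> ["Badger", "Eagle", "Donkey"]
// result.removed -> 2
```

Each line is looked up once, so no sorting pass is needed, at the cost of holding the set of unique lines in memory.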

One of my other suggestions was a simple RegEx library feature, so you can create or import regexes with a name, an optional description, and possibly find settings. You could then select the relevant regex from the library list and run it.
Yutaka
Posted on: 8/10/2011 1:11 pm
Webmaster
Joined: 9/28/2006
From: Redmond
Posts: 2399
Re: how to delete duplicate lines
In EmEditor, \r is ignored, so could you try this instead?

^(.*)(\n\1)+$
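
For anyone who wants to try the pattern outside the editor, here is a small sketch in plain JavaScript. It assumes the lines are already sorted and use \n line endings. The `g` and `m` flags and the `$1` replacement are JavaScript's way of writing a multi-line replace-all with \1, and `collapseAdjacentDuplicates` is a hypothetical name. EmEditor's own regex engine may differ in details.

```javascript
// Hypothetical helper: collapses each run of identical adjacent lines
// into a single copy. Only adjacent repeats are matched, so the input
// is assumed to be sorted first.
function collapseAdjacentDuplicates(text) {
  // ^(.*)     one whole line, captured as group 1
  // (\n\1)+   one or more following lines identical to group 1
  // $         the run must end at a line end, so "ab" never matches "abc"
  return text.replace(/^(.*)(\n\1)+$/gm, "$1");
}

var sorted = "Badger\nBadger\nDonkey\nEagle\nEagle\nEagle";
var result = collapseAdjacentDuplicates(sorted);
// result -> "Badger\nDonkey\nEagle"
```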


In a future version, I might add a new command to remove duplicate lines, so you won't need a regular expression replace for this.

Thanks!



Deipotent
Posted on: 8/10/2011 12:21 pm
Just can't stay away
Joined: 2/15/2008
From:
Posts: 118
Re: how to delete duplicate lines
I needed this functionality recently and thought it could be done easily with regex. Google led me to http://www.regular-expressions.info/duplicatelines.html which said to sort the lines, and then search for the following:

^(.*)(\r?\n\1)+$


and replace with:

\1


Unfortunately, I couldn't get this to work in EmEditor, even after enabling the option to search past line boundaries.

Can you add support for this type of regex to EmEditor?

PS. I haven't tried the macro yet, and am sure it works fine, but it would be nice if it could be done with regex.
raikrivera
Posted on: 7/28/2011 1:36 am
Just popping in
Joined: 7/28/2011
From: USA
Posts: 1
Re: how to delete duplicate lines
Thank you so much, guys. This is exactly what I was searching for.
Monkeyman
Posted on: 6/5/2011 12:47 pm
Just popping in
Joined: 9/3/2009
From:
Posts: 3
Re: how to delete duplicate lines
Thank you for the good macro. Removing duplicate lines is a very useful feature that EmEditor badly lacks. I hope you'll add it in a future release.

As for the JS macro provided, it has one small "glitch": when the duplicate line is the last line of the document, the macro doesn't recognize it. For example:

Badger
Eagle
Simpsons
Donkey
Badger

There's no newline after the second "Badger", so it doesn't compare equal to the first one and the macro won't delete it.
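
One way to handle this case is to compare lines with the trailing line break stripped, so a final line without a newline still matches an earlier copy. The sketch below is plain JavaScript that mirrors the macro's sort-and-mark approach on an ordinary array instead of an EmEditor document. `lineKey` and `removeDuplicates` are hypothetical names, and this has not been tried inside EmEditor itself.

```javascript
// Hypothetical helper: strip one trailing line break so that "Badger\n"
// and a final "Badger" (no newline) produce the same comparison key.
function lineKey(line) {
  return line.replace(/\r?\n$/, "");
}

// Hypothetical stand-in for the macro: takes lines that still carry their
// line breaks and returns the document text with duplicates removed.
function removeDuplicates(lines) {
  var pairs = lines.map(function (s, i) {
    return { index: i + 1, key: lineKey(s), str: s };
  });
  // Sort by key; ties keep document order so the first copy comes first.
  pairs.sort(function (x, y) {
    if (x.key > y.key) return 1;
    if (x.key < y.key) return -1;
    return x.index - y.index;
  });
  // Equal keys are now adjacent: disable every copy after the first.
  for (var i = 1; i < pairs.length; i++) {
    if (pairs[i].key === pairs[i - 1].key) pairs[i].index = 0;
  }
  return pairs
    .filter(function (p) { return p.index !== 0; })
    .sort(function (x, y) { return x.index - y.index; })
    .map(function (p) { return p.str; })
    .join("");
}

var text = removeDuplicates(["Badger\n", "Eagle\n", "Simpsons\n", "Donkey\n", "Badger"]);
// text -> "Badger\nEagle\nSimpsons\nDonkey\n"
```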
Salabim
Posted on: 1/5/2010 11:07 pm
Quite a regular
Joined: 9/5/2009
From: Ghent (Belgium)
Posts: 58
Re: how to delete duplicate lines
Thanks a lot Yutaka ! :)
Yutaka
Posted on: 1/4/2010 9:51 pm
Webmaster
Joined: 9/28/2006
From: Redmond
Posts: 2399
Re: how to delete duplicate lines
Then how about this?


// Pairs a line's text with its original 1-based line number.
function Pair( i, s )
{
	this.index = i;
	this.str = s;
}

var nLines = document.GetLines();

// Start with an empty array; push() below adds one Pair per line.
var a = [];

status = "Reading lines...";

// Fill the array a with all lines (with returns) in the document.
for( var i = 1; i <= nLines; i++ ) {
	if( (i % 1000) == 0 ){
		status = "Reading lines: " + String(i) + "/" + String(nLines);
	}
	a.push( new Pair( i, document.GetLine( i, eeGetLineWithNewLines ) ) );
}

status = "Sorting lines...";

// Sort by text; equal lines keep document order, so the first occurrence comes first.
a.sort( function(x,y){
	if( x.str > y.str ){
		return 1;
	}
	if( x.str < y.str ){
		return -1;
	}
	return x.index - y.index;
});

// Mark duplicate elements: after sorting, identical lines are adjacent.
for( i = 1; i < nLines; i++ ){
	if( (i % 1000) == 0 ){
		status = "Deleting duplicate lines: " + String(i + 1) + "/" + String(nLines);
	}
	if( a[i].str == a[i-1].str ){
		a[i].index = 0;  // disable
	}
}

status = "Sorting lines again...";

// Restore the original document order (disabled entries sort to the front).
a.sort( function(x,y){
	return x.index - y.index;
});

var parts = [];
var n = 0;  // number of duplicate lines removed
for( i = 0; i < nLines; i++ ){
	if( a[i].index != 0 ){
		if( (i % 1000) == 0 ){
			status = "Joining lines: " + String(i + 1) + "/" + String(nLines);
		}
		parts.push( a[i].str );
	}
	else {
		n++;
	}
}

// Replace the entire document with new elements
document.selection.SelectAll();
document.selection.Text = parts.join( "" );
status = n + " duplicate lines deleted.";



Salabim
Posted on: 1/3/2010 10:33 am
Quite a regular
Joined: 9/5/2009
From: Ghent (Belgium)
Posts: 58
Re: how to delete duplicate lines
Hi Yutaka,

Regarding the last (faster) duplicate line macro you posted, is it possible to change the code so that the last line...
status = "Duplicate lines deleteded."


... actually shows how many duplicate lines were deleted?

Something like:
"117 duplicate lines deleted."
Yutaka
Posted on: 4/18/2009 1:26 pm
Webmaster
Joined: 9/28/2006
From: Redmond
Posts: 2399
Re: how to delete duplicate lines
Quote:

Hellados wrote:
This is a good macro for small files, but it is very slow for me.
I have 50-100 MB txt files, and I need to remove duplicate lines (words) from more than 406,000 words, and this macro works very slowly :(
My PC's performance is very good: Intel Core 2 Duo E8400, 2 GB Corsair RAM, 1 TB HDD.
What can I do?


I did some optimization. Please try this. This also shows the current status on the status bar.

// Pairs a line's text with its original 1-based line number.
function Pair( i, s )
{
	this.index = i;
	this.str = s;
}

var nLines = document.GetLines();

// Start with an empty array; push() below adds one Pair per line.
var a = [];

status = "Reading lines...";

// Fill the array a with all lines (with returns) in the document.
for( var i = 1; i <= nLines; i++ ) {
	if( (i % 1000) == 0 ){
		status = "Reading lines: " + String(i) + "/" + String(nLines);
	}
	a.push( new Pair( i, document.GetLine( i, eeGetLineWithNewLines ) ) );
}

status = "Sorting lines...";

// Sort by text; equal lines keep document order, so the first occurrence comes first.
a.sort( function(x,y){
	if( x.str > y.str ){
		return 1;
	}
	if( x.str < y.str ){
		return -1;
	}
	return x.index - y.index;
});

// Mark duplicate elements: after sorting, identical lines are adjacent.
for( i = 1; i < nLines; i++ ){
	if( (i % 1000) == 0 ){
		status = "Deleting duplicate lines: " + String(i + 1) + "/" + String(nLines);
	}
	if( a[i].str == a[i-1].str ){
		a[i].index = 0;  // disable
	}
}

status = "Sorting lines again...";

// Restore the original document order (disabled entries sort to the front).
a.sort( function(x,y){
	return x.index - y.index;
});

var parts = [];
for( i = 0; i < nLines; i++ ){
	if( a[i].index != 0 ){
		if( (i % 1000) == 0 ){
			status = "Joining lines: " + String(i + 1) + "/" + String(nLines);
		}
		parts.push( a[i].str );
	}
}

// Replace the entire document with new elements
document.selection.SelectAll();
document.selection.Text = parts.join( "" );
status = "Duplicate lines deleted.";


