Yahoo Groups -> mbox -> maillist.php -> missing posts [message #15672] |
Tue, 30 December 2003 20:34 |
srchild
Messages: 88 Registered: December 2003 Location: UK
Karma: 1
|
Member |
|
|
I'm looking at converting a Yahoo Groups group to FUDforum, so I'm trying to transfer the existing archive.
I've collected the archive from Yahoo using this script:
http://www.lpthe.jussieu.fr/~zeitlin/yahoo2mbox.html
It is now in mbox format, and it appears to be valid: if I view it with Elm, it shows the correct number of messages and they are readable.
I load it into FUD 2.5.2 using:
cat archive | formail -s /path/to/php /path/to/maillist.php 1
Using 'Slow Reply Match' to recreate the threads, and subject mangling to remove the [listname] and body mangling to remove some of the advertising dross, it all looks good.
But it only loads about half of the messages, and the rest go missing for no obvious reason. It's not just dying early; messages are missing from early on in the archive. I've examined the archive file and can see no clues as to why some messages are imported and others are not. Some are email postings and some were posted from the website. A user might have some messages imported whilst others by the same user are not.
I've experimented with tidying up the archive file manually (removing adverts and wrapped Received lines from the first few messages, that sort of thing). I tried reordering the first few messages: one which loaded fine when it was first no longer loaded when moved to second in the archive.
I've tried feeding the archive through formail:
cat archive | formail > archive2
and it appears to quote lots of (all?) the From_ lines, except the first line, whereas I thought it was supposed to quote only bogus From_ lines. So perhaps there is a problem with my archive, and formail is not breaking it up properly? (But note that Elm can read it properly.)
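One way to check the separators by hand, as a minimal sketch rather than a definitive test: count the lines that start with `From ` and, separately, those that are not preceded by a blank line (mbox readers generally expect a blank line before each separator). The function names `count_from_lines` and `count_unseparated` are made up for illustration, and any body line that happens to start with `From ` will be counted as well.

```shell
#!/bin/sh
# Count lines that look like mbox separators ("From " at the start of a line).
count_from_lines() {
    grep -c '^From ' "$1"
}

# Count separator-like lines that are NOT preceded by a blank line.
# The first line of the file is exempt, since nothing precedes it.
count_unseparated() {
    awk '/^From / && NR > 1 && prev != "" { n++ }
         { prev = $0 }
         END { print n + 0 }' "$1"
}
```

Running `count_from_lines archive` and comparing the result with the message count Elm reports would show whether there are more `From ` lines than real messages; a non-zero result from `count_unseparated archive` would point at lines worth inspecting.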
I've found some fragments of text in messages/msg_1 but can't see how to interpret that - maybe there are clues in there?
Anyone got any clues for me?
Thanks
Simon Child
|
|
|
Re: Yahoo Groups -> mbox -> maillist.php -> missing posts [message #15673 is a reply to message #15672] |
Tue, 30 December 2003 21:52 |
Ilia
Messages: 13241 Registered: January 2002
Karma: 0
|
Senior Member Administrator Core Developer |
|
|
FUDforum's import script can only handle 1 message at a time. So unless you modify the script to handle mbox format you need something else to do it and pipe the messages one at a time to the script.
FUDforum Core Developer
|
|
|
Re: Yahoo Groups -> mbox -> maillist.php -> missing posts [message #15678 is a reply to message #15673] |
Tue, 30 December 2003 22:27 |
srchild
|
Member |
|
|
Ilia wrote on Wed, 31 December 2003 03:52 | FUDforum's import script can only handle 1 message at a time. So unless you modify the script to handle mbox format you need something else to do it and pipe the messages one at a time to the script.
|
That's what 'formail -s' does - it breaks up an mbox into single messages and sends them to the script one at a time. From the man page:
"The input will be split up into separate mail messages, and piped into a program one by one (a new program is started for every part)".
So that part is working (I got 308 messages in by running that command once, but another 296 didn't make it).
I got that idea from this message:
http://fud.prohost.org/forum/index.php?t=msg&goto=9774#msg_9774
My problem is that not all the messages are getting converted into FUD. I'm not clear whether this is due to my archive format (seems alright) or my use of formail (seems alright, and has worked for others) or something else.
One thought is whether some messages are being dropped since the script is sending them too fast - will FUD cope if there are several instances of maillist.php running at the same time, as there may well be since formail will start up a new instance of it for each message that it extracts from the mbox?
Another possibility is that the archive appears alright (e.g. to my eye, and to Elm) but in fact the message boundaries are unclear and formail is struggling with them.
Thanks for your interest - one more question - is there some documentation to tell me about the file appearing in messages/msg_1 - will that give me any clues?
Simon Child
|
|
|
Re: Yahoo Groups -> mbox -> maillist.php -> missing posts [message #15680 is a reply to message #15678] |
Tue, 30 December 2003 22:38 |
Ilia
|
Senior Member Administrator Core Developer |
|
|
There should not be a problem with more than one instance of the script running at one time. However, I would not recommend running more instances than you have CPUs, since doing so would hurt performance.
If you are importing messages through multiple processes, make sure that they are imported sequentially (from oldest to newest), otherwise the message association may be broken.
If you can isolate a few messages that cannot be imported, feel free to send those to me and I'll try to determine why they are not being imported.
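For isolating a single message from the mbox, something along these lines might do, as a rough sketch. It assumes every line starting with `From ` begins a new message, which will not hold if unquoted `From ` lines appear in message bodies. The function name `extract_message` is hypothetical.

```shell
#!/bin/sh
# Print the n-th message (1-based) of an mbox file to stdout.
# Assumption: every line starting with "From " begins a new message.
# Usage: extract_message MBOX N
extract_message() {
    awk -v n="$2" '/^From / { c++ } c == n' "$1"
}
```

For example, `extract_message archive 17 > msg17.txt` would save the 17th message to a file of its own, which could then be sent on or piped to maillist.php by itself to see whether it imports.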
FUDforum Core Developer
|
|
|
Re: Yahoo Groups -> mbox -> maillist.php -> missing posts [message #21573 is a reply to message #15681] |
Sat, 04 December 2004 02:04 |
srchild
|
Member |
|
|
srchild wrote on Wed, 31 December 2003 06:28 | I've thought of a couple more things I can try first:
- feed in the suspect messages singly and see if they are accepted
- Using this: http://batleth.sapienti-sat.org/projects/mb2md/ I have split the mbox into single messages, and so if I can work out the shell scripting I can feed genuine single messages to maillist.php
|
Almost 12 months later I have returned to this project, and am now succeeding.
So I thought I'd post the success details in case others are trying this (Migrate Yahoogroup -> FUD).
I got the archive from Yahoo using Yahoo2mbox http://www.tt-solutions.com/en/products/yahoo2mbox/
I tidied it up a bit to remove some of the adverts etc. (some done by hand, some done using regex search and replace in vim).
I converted it to maildir format using mb2md.pl http://batleth.sapienti-sat.org/projects/mb2md/
I then fed it to maillist.php, using a sleep so that it could keep up. This seemed to be the key point: with a sleep of 0.5 seconds (FreeBSD supports sleep for fractions of a second) I still lost about 50% of messages, but with a sleep of one second between messages I didn't lose a single one.
for i in /path/to/maildir/cur/*
do
    # feed one message on stdin, then pause so the import can keep up
    /usr/local/bin/php /path/to/FUDforum/scripts/maillist.php 4 < "$i"
    sleep 1
done
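A variant of the same loop that also records which files did not go through might look like the sketch below. It rests on an assumption not confirmed anywhere in this thread: that the command exits non-zero when a message fails. If maillist.php exits 0 even when it drops a message, the log will stay empty. The function name `feed_messages` and the log file are hypothetical.

```shell
#!/bin/sh
# Feed each regular file in DIR to COMMAND on stdin, one at a time,
# pausing DELAY seconds between messages. Files for which COMMAND
# exits non-zero are listed in FAILED_LOG (assumption: a failed
# import is signalled by a non-zero exit status).
# Usage: feed_messages DIR DELAY FAILED_LOG COMMAND [ARGS...]
feed_messages() {
    dir=$1; delay=$2; failed=$3
    shift 3
    : > "$failed"                      # start with an empty log
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip anything that is not a file
        if ! "$@" < "$f"; then
            printf '%s\n' "$f" >> "$failed"
        fi
        sleep "$delay"
    done
}
```

With the paths used above, a call would be `feed_messages /path/to/maildir/cur 1 failed.txt /usr/local/bin/php /path/to/FUDforum/scripts/maillist.php 4`.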
Simon Child
|
|
|