Translation Aggregator


  • Translation Aggregator

    I'm no longer working on Translation Aggregator, but Setx has released an updated version, here. The files attached directly to this post are now outdated

    Translation Aggregator basically works like ATLAS, with support for using a number of website translators and ATLAS simultaneously. It was designed to replace ATLAS's interface as well as add support for getting translations from a few additional sources. Currently, it has support for getting translations from Atlas V13 or V14 (Don't need to have Atlas running), Google, Honyaku, Babel Fish, FreeTranslations.com, Excite, OCN, a word-by-word breakdown from WWWJDIC, MeCab, which converts Kanji to Katakana, and its own built-in Japanese parser (JParser). I picked websites based primarily on what I use and how easy it was to figure out their translation request format. I'm open to adding more, but some of the other sites (Like Word Lingo) seem to go to some effort to make this difficult.

    JParser requires edict2 (Or edict) in the dictionaries directory, and supports multiple dictionaries in there at once. It does not support jmdict. You can also stick enamdict in the directory and it'll detect some names as well, though the name list will be heavily filtered to avoid swamping out other hits. If you have MeCab installed, JParser can use it to significantly improve its results. TA can also look up definitions for MeCab output as well, if a dictionary is installed. In general, MeCab makes fewer mistakes, but JParser handles compound words better, and groups verb conjugations with the verb rather than treating them as separate words.

    TA also includes the ability to launch Japanese apps with Japanese locale settings, automatically inject AGTH into them, and inject its own dll into Japanese apps. Its dll can also translate their menus and dialogs using the ATLAS module (Requires you have ATLAS installed, of course). Versions 0.4.0 and later also include a text hooking engine modeled after AGTH. The menu translation option attempts to translate Windows-managed in-game menus, and is AGTH compatible. The AGTH exe and dlls must be in the Translation Aggregator directory for it to be able to inject AGTH into a process. AGTH is included with the most recent versions of TA.

    The interface is pretty simple, much like ATLAS: Just paste text into the upper left window, and either press the double arrow button to run it through all translators, or press the arrow buttons for individual translation apps. Each algorithm is only run once at a time, so if a window is busy when you tell it to translate something, it'll queue it up if it's a remote request, or stop and rerun it for local algorithms. If you have clipboard monitoring enabled (The untranslated text clipboard button disables it altogether), it'll run any clipboard text with Japanese characters copied from any other app through all translators with clipboard monitoring enabled. I won't automatically submit text with over 500 characters to any of the translation websites, so you can skip forward in agth without flooding servers, in theory. I still don't recommend automatic clipboard translation for the website translators, however.
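The clipboard rule described above (only text containing Japanese characters is picked up, and nothing over 500 characters is sent to a website automatically) can be sketched roughly as follows. This is an illustration, not TA's actual code: the function names and the exact Unicode ranges checked are assumptions.

```python
MAX_AUTO_SUBMIT_CHARS = 500  # limit stated in the post; applies to website translators


def contains_japanese(text: str) -> bool:
    """Return True if any character is hiragana, katakana, or a CJK ideograph."""
    for ch in text:
        cp = ord(ch)
        if (0x3040 <= cp <= 0x309F          # hiragana block
                or 0x30A0 <= cp <= 0x30FF   # katakana block
                or 0x4E00 <= cp <= 0x9FFF): # CJK unified ideographs
            return True
    return False


def should_auto_submit(text: str, remote: bool) -> bool:
    """Decide whether clipboard text is translated automatically.

    `remote` is True for website translators, False for local ones
    (hypothetical flag, used here only for illustration).
    """
    if not contains_japanese(text):
        return False
    if remote and len(text) > MAX_AUTO_SUBMIT_CHARS:
        return False
    return True
```

Under these assumptions, long text skipped past in AGTH would still reach local translators such as ATLAS or JParser, but would not be sent to any website.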

    To assign a hotkey to the current window layout, press shift-alt-#. Press alt-# to restore the layout. Bound hotkeys will automatically include the current transparency, window frame, and toolbar states. If you don't want a bound hotkey to affect one or more of those states, then you can remove the first 1 to 3 entries in the associated line in the ini file. Only modify the ini yourself when the program isn't running. All other values in those lines are mandatory.

    Pre-translation substitutions modify input text before it's sent to any translator. This currently applies to websites, ATLAS, MeCab, and JParser. There's a list of universal replacements ("*") and replacements for every launch profile you've created. I pick which set(s) of substitutions to use based on currently running apps. Note that you do not need to be running AGTH or even have launched a game through TA's launch interface for the game to be detected, but you do need to create a launch profile. I may allow you to just drag and drop exes onto the dialog in the future.
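As a rough sketch of the substitution scheme described above: the universal ("*") list is combined with the lists of whichever launch profiles match running apps, and the result is applied to the text before translation. The table layout, the ordering (universal rules first), and the use of plain string replacement are all assumptions made for illustration.

```python
def active_substitutions(tables, running_profiles):
    """Collect the universal ("*") rules, then the rules of each running profile."""
    rules = list(tables.get("*", []))
    for profile in running_profiles:
        rules.extend(tables.get(profile, []))
    return rules


def apply_substitutions(text, rules):
    """Apply each (old, new) pair in order using plain string replacement."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


# Hypothetical tables: one universal rule and one rule for a made-up profile.
tables = {
    "*": [("…", "...")],
    "game.exe": [("ユキ", "Yuki")],
}
print(apply_substitutions("ユキ…", active_substitutions(tables, ["game.exe"])))
```

With no matching profile running, only the "*" rules would apply, so the name in this example would be left untouched.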

    MeCab is a free program that separates words and gives their pronunciation and part of speech. I use it to get the information needed to parse words and display furigana. If you have MeCab installed but I report I'm having trouble initializing it, you can try copying libmecab.dll to the same directory as this program. Do not install MeCab using a UTF16 dictionary, as I have no idea how to talk to it (UTF16 strings don't seem to work). Instead, configure MeCab to use UTF8, Shift-JIS, or EUC-JP. If you have both MeCab and edict/edict2 installed, you can view a word's translation in MeCab by hovering the mouse over it. Also, JParser can use MeCab to help in parsing sentences.

    JParser tends to be a better choice for those who know almost no Japanese - it tells you how verbs are conjugated, handles some expressions, etc. MeCab may well be the better choice for those who know some Japanese, however.

    Source, attached below, is available under the GPL v2.

    Thanks to (In alphabetical order, sorry if I'm leaving anyone out):
    Hongfire Members:
    Freaka for his innumerable feature suggestions and reported issues over the course of development.
    Setsumi for TA Helper and for all his suggested improvements and reported issues, particularly with JParser.
    Setx for AGTH.
    Stomp for fixing the open file dialog not working properly on some systems and adding the tooltip font dialog, and fixing a bug that required admin privileges when certain other software was installed.
    Might sound like minor contributions, but feedback really drives the development of TA.

    Non-members:
    KingMike of KingMike's Translations, who is apparently the creator of the EUC-JP table I used to generate my own conversion table.
    Nasser R. Rowhani for his function hooking code.
    Z0mbie for writing the opcode length detector/disassembler I use for hooking. Apparently was intended for virus-related use, but works fine for other things, too.
    And the creators and maintainers of edict, MeCab, and zlib.

    You might also be interested in:
    *Setsumi's TA Helper and AGTHGrab.
    *errotzol's replacements script.
    *Devocalypse's devOSD.
    *kaosu's ITH (Like AGTH. No direct TA support, due to lack of a command line interface, but definitely worth checking out).

    MeCab
    edict2

    Changelog:
    0.4.9
    * Fixed MeCab/JParser getting stuck when starting a new translation before the last is finished.
    * Fixed interface lockup while mousing over an item in MeCab while JParser is running.
    * Menu translation will now translate column headings in ListViews (Needed this for the AA launcher)
    * Fixed ATLAS config crash.
    * Global hotkey support. Toggle under "File" menu (Tools is kinda big already). Currently only really supports history navigation. May add more later.

    0.4.8
    * Added history. Logs both original text and translations (For online translators). It logs up to 20 MB of original text, and whatever translations are associated with it. Currently only way to force a retranslation is to toggle one of several options (Autoreplace half-width characters, src/dest language, modify substitutions).
    * Fixed deadlock bug on MeCab mouse over while JParser is running.
    * Fix corrupting built-in text hooker settings when launch failed. Suspect no one uses this, anyways.
    * Drag/dropping an exe onto TA to open up the injection dialog now activates TA.

    0.4.7
    * JParser and MeCab each use their own thread (Mostly).
    * Changed conjugation table format to JSON - plan to do this to a lot of other files (Being careful not to mess up game settings or substitution tables). Currently have way too much file loading code.

    0.4.6
    * Fix WWWJDIC
    * Fix closing injection dialog
    * Updating process list 10+x faster
    * Process list autoupdates
    * Fixed bug that would result in injecting into wrong process when one program is running multiple times.
    * Updated included AGTH version

    0.4.5
    * Added bing support.
    * Updated Honyaku code (They didn't try and block TA, they just modified their HTML)
    * Fixed AGTH command line code.
    * Replaced "/GL" with "/SM" compile option, resulting in faster builds when one has a lot of cores.

    0.4.4
    * Regular expressions are now compiled
    * Injection validation when using addresses relative to dlls (Or function addresses in dlls) should be fixed.
    * Added option to create shortcuts. They'll launch TA (If it's not running) and try to launch the game using the current injection settings (Injection settings that you'd get at the launch screen - the current settings are not saved - it always uses the most recently used ones).
    * Appropriated some of Setsumi's code to make tooltips larger.

    0.4.3
    * Multiple subcontexts now supported. Separate them with semi-colons. AGTH code converter will add two subcontexts, when appropriate.
    * Using aliases for hooks added. Prefix a hook with "[Alias Text]" and that's what will be displayed on the context manager screen as the hook's name. Makes it easier to see context strings.
    * Locale selection added to injection dialog.
    * "Hook delay" added to injection dialog. Actually doesn't delay hooking, delays how long before hooks that use filtering based on calling function's dll are enabled. Generally only the default hooks do this. Increasing this delay may circumvent issues with games that crash when launched with AGTH, but work fine when injected after launching.
    * Added "!" and "~" operators.

    * Stomp's admin privilege fix when using some 3rd party software added.
    * Excite fixed
    * Fixed sanity testing for injection addresses, so when you specify a dll or exe name in a text hook, it shouldn't erroneously think it's an error when the module isn't loaded in the current address space.
    * Fixed some JParser dictionary common word parsing, when using versions of edict with entL entries. Also changed treatment of Kanji entries when only their corresponding Hiragana are marked as common.
    * Fixed substitution matching Hiragana with Katakana and vice versa.
    * Fixed a clipboard-related crash bug.
    * Fixed hooks causing crashes when relocating call/jumps (Hopefully...)
    * Fixed AGTH repeat filter length placement (oops).

    0.4.2b
    * Fixed substitution loading/deleting.
    * Fixed << and >>.

    0.4.2
    * AGTH code conversion tool.
    * Injection code checker added.
    * New child process injection handler (Really nifty injection code for that...). Should be a little more robust than before.
    * Option not to inject into child processes added.
    * Auto copy to clipboard added.
    * Both extension filters fixed.
    * Both eternal repeat filters fixed/upgraded.
    * Phrase repeat filter fixed/upgraded.
    * OpenMP/MSVC 2008 SP1 runtime requirement removed
    * char/charBE fixed
    * GetGlyphOutline fixed
    * Copy to clipboard crash when auto translate disabled fixed.
    * Slightly improved dll injection error handling.

    0.4.1
    * More context/filter options.
    * Repeated phrase filter now handles cases where phrase is being extended by a couple characters each time (xxyxyz, etc). Extension filters no longer really needed, unless the repeat starts out too short.
    * Option to handle eternally looping text.
    * Option to ignore text without any Japanese characters.
    * Text which substitution rules reduce to nothing no longer overwrites translated text.
    * Log length limit added.
    * Options to manage default internal text hooks added.
    * Clipboard treated as a context. Its default settings should mirror the old handling.

    0.4.0
    * Added its own text hooking engine. Probably still buggy.
    * Fixed excessive redrawing when a hidden furigana window had clipboard translation enabled.
    * Works with new, even more poorly formatted edict files.
    * Handles EUC_JP characters that Windows does not (Doesn't use them properly with WWWJDIC at the moment, however). Only really fixes loading edict files with those characters.
    * Fixed right clicking when full screen.
    * Fixed not checking auto Hiragana mode.
    * Less picky when reading MeCab output.
    Last edited by ScumSuckingPig; 07-11-2015, 11:20 AM. Reason: Change download link, re-upload attachments upon request from Setx

  • # Copying text within TA crashes it (v0.2.2)
    Quick run through several problematic sentences saved from yesterday, showed almost perfect result!
    # Strange priority(ちゃん) and furigana behaviour (戻って来て) displayed on attached picture.

    > Only possible solution I can think of would be to create a new conjugated verb "tense" consisting of an i-adjective and sou.
    Yes, it's working! Don't know if it's worth adding to the "tense"s without testing, as my Japanese is extremely limited.
    Btw, noticed in JWPCE this: 悲し【かなし】sad, sorrowful (shiku) (marked in red exactly like that), resembles my first idea of putting such words in dictionary without last "i" , but used here for "shiku" conjugation instead?
    読んでみた。 読んでみようとした。 読めたらいいな、と思った。


    • The problem with putting cut up words in the dictionary is that I often wouldn't find them - in your example from yesterday, I managed to cover every character in both the first and last cases even when not getting the words right, for example. They're actually already in the dictionary in their cut up forms, too, I just don't return those hits unless there's a corresponding conjugation table entry following them that matches.

      Anyways, to get the behavior you describe, all you have to do is enter a 0-length entry in the conjugation tables (You need a comma alone). Still playing heavily with the conjugation tables, so not worth making any long-term changes to them yet, however. Can't add entries with new names in 0.2.2, either. 0.2.3 will fix that.

      I think common additions to verb stems are worth adding to the conjugation table. I've added partial entries for "sou" in my working version, on a provisional basis. If they cause more problems than they solve, easy to remove.

      The crash issue on copy is the same as back in the days of 0.1.2. Forgot to update the workaround for the new framework, will be fixed in the next version.

      That strange order is because I was accidentally favoring hiragana/katakana mismatches instead of putting them last. Easily fixed. Current preference order is: exact match, common word, particle. That's the same ordering as my word finding function, too, actually. Not yet sure about the entry with no furigana.

      If anyone's interested, it currently minimizes 500 * kanji in no word + 100 * other characters in no word + 30 * words with hiragana/katakana mismatches + 10 * other words - 3 * common words - 2 * particles (particles and common words count as other words too, of course). Because of the search order, one non-common word is favored over 2 common particles that occupy the same spot, even though they have the same score.
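The weights quoted above can be written out as a small cost function. The sketch below only restates the arithmetic from the post; the `ParseStats` container and its field names are hypothetical, and the tie-breaking by search order is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ParseStats:
    uncovered_kanji: int      # kanji not covered by any word
    uncovered_other: int      # other characters not covered by any word
    kana_mismatch_words: int  # words matched despite a hiragana/katakana mismatch
    other_words: int          # all remaining words, including common words and particles
    common_words: int         # subset of other_words marked as common
    particles: int            # subset of other_words that are particles


def parse_cost(s: ParseStats) -> int:
    """Lower is better: the parser picks the split that minimizes this value."""
    return (500 * s.uncovered_kanji
            + 100 * s.uncovered_other
            + 30 * s.kana_mismatch_words
            + 10 * s.other_words
            - 3 * s.common_words
            - 2 * s.particles)


one_plain_word = ParseStats(0, 0, 0, other_words=1, common_words=0, particles=0)
two_common_particles = ParseStats(0, 0, 0, other_words=2, common_words=2, particles=2)
print(parse_cost(one_plain_word), parse_cost(two_common_particles))
```

Both example parses cost 10 (one word: 10; two common particles: 20 - 6 - 4), which is the tie the post mentions; the search order then decides in favor of the single word.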

      Edit: The reason why that entry has no furigana is the "来" character. As far as I know, no verb other than suru (which never seems to be written in kanji) has the pronunciation of its kanji change based on verb tense. Would either have to hard-code handling for that one verb, or make a dictionary entry for every tense of it. Not sure which would take less effort.

      Edit2: Maybe some sort of dictionary entry tagging system... If you could tag an entry as appearing after a particular stem form, easy enough to check for it.
      Last edited by ScumSuckingPig; 06-22-2009, 09:21 AM.


      • Just FYI, MeCab deinflects words. This is a surface form followed by a feature string. The 7th entry in the feature string is the basic form, and the 8th is the reading (furigana).

        Code:
        持っ     動詞,自立,*,*,五段・タ行,連用タ接続,持つ,モッ,モッ
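A minimal sketch of pulling those two fields out of a line like the one above. It assumes the comma-separated feature layout shown in the example (surface, whitespace, then the features); other MeCab dictionaries may order the fields differently, and unknown words can have fewer fields, so the function falls back gracefully. The function name is made up for illustration.

```python
def parse_mecab_line(line: str) -> dict:
    """Split one MeCab output line into surface form, base form, and reading."""
    surface, feature = line.split(None, 1)   # surface and features are whitespace-separated
    fields = feature.strip().split(",")
    return {
        "surface": surface,
        "pos": fields[0],
        # 7th field: dictionary (basic) form; fall back to the surface if absent
        "base_form": fields[6] if len(fields) > 6 else surface,
        # 8th field: reading in katakana; None if the dictionary gave none
        "reading": fields[7] if len(fields) > 7 else None,
    }


sample = "持っ\t動詞,自立,*,*,五段・タ行,連用タ接続,持つ,モッ,モッ"
result = parse_mecab_line(sample)
print(result["base_form"], result["reading"])
```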
        AGTH wiki


        • The problem with MeCab is I really don't trust it at all. It may give better results for verbs than I do (though I'm not convinced it does), but it has no documentation (at least in English) and no English interface. It also tends to label verb endings as particles, which isn't good, as I need to be able to tell them apart.

          The real problem is cases with multiple different parses. Take the sentence fragment:

          にさせる。

          It could be the causative form of nisuru, ni followed by the causative form of suru, or ni followed by saseru (which is yet another verb). In this case, I return the first, MeCab returns ni followed by a random sa, followed by seru, and WWWJDIC returns the third (saseru), which is presumably the correct parse. Amusingly, the first two hits could also theoretically be causative combined with the potential tense, if that made any sense, as there are two causative forms (Cause to be able to do? I dunno).

          The rest of the words in the sentence have no effect on any of the algorithms, in this particular case. I might need to penalize funkier verb tenses, though the causative and causative passive are actually not that uncommon, but it seems no one's great at handling the multiple parse issue (And presumably there's some case out there where someone uses the causative of nisuru...). Though WWWJDIC works best in that case, it has issues in a lot of others. For the record, this was from the 3rd short sentence I looked at when looking for something with sufficiently many parses to make my point.
          Last edited by ScumSuckingPig; 06-22-2009, 03:04 PM.


          • Another day, another release (0.2.3). Think this will be the last of the daily new versions for now, though I've been wrong before.

            This is primarily a bugfix release. Suppose could say the same of everything since 0.2.0... Fixes the crash on copy issue, fixes a sorting issue or two, fixes tooltip placement, fixes behavior on minimize. Even a hack to fix the kuru thing, which, amusingly, takes about 1.5 times as much code for kuru alone as it does to get hiragana for all other verbs. Arbitrary tenses can now be added to the verb tense file, though you have to restart for them to take effect.

            Also, verb tables were updated to use stem forms for a lot of tenses, rather than having a huge table for every single base verb ending (Though there are still a lot of big tables). Added a few experimental "tenses" based on suffixes that like to go after stem forms of verbs and adjectives.

            Not sure what I'm going to work on adding next...Maybe handling stem forms, maybe something that outputs the entries for all the parsed words in another frame, so won't have to rely on tool tips. If I switch to using jmdict instead of (Or in addition to) edict, I could get a lot more usage frequency information, which may (Or may not) be useful.

            Some method to easily search other possible ways to split words would be nice, too, though not sure how to do it intuitively. Right now, it's simplest just to add spaces where you think it may have messed up to force a word split. Generally that's just before and after particles which may have been incorrectly merged with neighboring words, forming a new "word".
            Last edited by ScumSuckingPig; 06-22-2009, 11:21 PM.


            • Here's just some thoughts (hope will be useful).
              I was very impressed by the new JParser, so I decided to forget about MeCab and WWWJDIC, but realized I can't do that.
              First, switched from edict2 to edict to get rid of multiple kanji/kana horrors like this:
              折から(P);折りから;折柄;折り柄 [おりから(P);おりがら(折柄;折り柄)] /(exp,n-t) (1) just then/at that time/right then/at that moment/(2) appropriate moment/(P)/
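To make the structure of such an edict2 line explicit, here is a small parsing sketch: written forms before the brackets, readings inside them (each optionally followed by parenthesised tags or restrictions), then slash-separated glosses. It is based only on the layout visible in the entry above; the function names are made up, and real files contain further details (for example entL markers) that this does not handle.

```python
import re

# One key, optionally followed by parenthesised tags such as (P) or (折柄;折り柄).
_KEY = re.compile(r"([^;()\s]+)((?:\([^)]*\))*)")


def split_keys(field: str) -> list:
    """Return the keys in a ';'-separated field, dropping parenthesised tags."""
    return [m.group(1) for m in _KEY.finditer(field)]


def parse_edict2_line(line: str) -> dict:
    head, _, body = line.strip().partition("/")
    written, _, readings = head.partition("[")   # kana-only entries have no brackets
    glosses = [g for g in body.split("/") if g]
    return {
        "written": split_keys(written),
        "readings": split_keys(readings.rstrip("] ")),
        "glosses": [g for g in glosses if g != "(P)"],
        "common": "(P)" in glosses,              # entry-level "common word" marker
    }


entry = parse_edict2_line(
    "折から(P);折りから;折柄;折り柄 [おりから(P);おりがら(折柄;折り柄)] "
    "/(exp,n-t) (1) just then/at that time/right then/at that moment/"
    "(2) appropriate moment/(P)/"
)
print(entry["written"], entry["readings"])
```

Plain edict has a single written form and a single reading per line, which is why switching to it avoids the combined entries shown above.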

              About popup.
              Since I'm playing a game (not studying) and reading lots of unfamiliar Japanese text, I need to read a word swiftly, grasp its meaning, and repeat for the next word, so my attention moves like this: (1) Furigana -> (2) Popup: matching [kana] -> (3) Popup: meaning. All other information basically just obstructs this process; besides, most of it is already obvious from the reading context. To shorten the distance 1->2, I removed the tense names and emphasized [kana] with more distinctive brackets; line breaks also helped in step 2->3 (the meaning is now closer to [kana]) (pic. 1). The remaining enemy in step 2->3 is massive tag spam (indicated in red). It would be perfect if the tags were displayed in a dimmed color so they don't get in the way, since they are sometimes useful. Also, the ability to display Popup: kanji in some color would help emphasize [kana] even more.

              About JParser display.
              In MeCab, reading flows naturally: I always follow the black-on-white letters (pic. 2), so my attention automatically passes over the "foreign" colored parts. In JParser this process is broken, so I have to concentrate on skipping the kanji while not skipping the colored kana. Another problem is the occasional parsing error involving particles, as described in the post above (indicated by red arrows in picture 2). Misplaced blocks like these draw attention, making otherwise very easy text hard to read. So I think it would be very good not to color kana-only words by default, and to color such words only under the mouse cursor.

              Above suggestions can be optional to suit different tastes.

              Cosmetic issue with "kuru". やってきた displays unneeded furigana
              読んでみた。 読んでみようとした。 読めたらいいな、と思った。


              • Good suggestions, though I don't think I'm going to change the default color scheme, your preferences make sense for people who actually know some Japanese, so worth making them an option. Option to hide all tenses makes sense, though probably also a good idea to modify the tables a bit to cut down on tense spam. Believe the two worst cases are the masu stem alone and causative-passive.

                Wish there were some guide to Japanese verb conjugations focused on the parsing issue. I want low level specs on the Japanese language, I don't want to learn Japanese, dangit! You'd think a lot of people would want that kind of information.

                As for varying color - makes sense, but probably be a while before I even consider doing it. Laying out text is painful, so want to have a more mature interface before spending time on it.
                Last edited by ScumSuckingPig; 06-25-2009, 10:43 AM.


                • While reading over the changelog I just noticed that you already added profile loading support back in version 0.2.0. Guess I missed it while you two were talking about verbs and the finer details of the Japanese language spec. If I'm not mistaken, this currently only works with "launch new process"; how about using it when selecting a running process to inject into as well?


                  • For some reason, was thinking I already did that, but looks like I don't. I'll add it to the next version.


                    • 0.2.4 released. Another fairly minor update.

                      Window drag and drop improved (Can now manage to drop into any position, though still not terribly intuitive when dropping onto a window with an even number of subwindows), memory leak fixed, Freaka's suggestion implemented, a couple of Setsumi's suggestions implemented (Can pick color for words with furigana only, can hide conjugation lists, can put Japanese words on their own line in the definition display). Also some dictionary entries merged and one or two fixed, i-adjective stems added (If I didn't put them in the last version, can't remember), and a slight improvement to the definition sort order for some words (Words with no Kanji entries are listed before those with them, all else being equal, on the theory that those are the most likely match). Also added a simple option to convert romaji to hiragana/katakana (Can also convert between the two. Currently cannot convert back to romaji).
                      Last edited by ScumSuckingPig; 06-29-2009, 09:04 PM.


                      • Originally posted by ScumSuckingPig View Post
                        Window drag and drop improved (Can now manage to drop into any position, though still not terribly intuitive when dropping onto a window with an even number of subwindows),
                        Not sure what that even number of subwindow situation is, but worked pretty smooth from what I tried.

                        Originally posted by ScumSuckingPig View Post
                        Freaka's suggestion implemented
                        I really like that auto selecting, kinda allows to create a private code database and using it even 2 years later without any lookup.

                        Bug: If I highlight a と (to) and use "to katakana" option it makes a ヨ (yo) out of it. A の (no) gets converted to a ヮ (lowercase wa).

                        Suggestions:

                        Hash function. With different game versions there is a bit of confusion about whether or not a code works with a certain executable. Hashing the entire file might be a bit too much (just a one-byte difference somewhere, e.g. from a no-DVD patch, would change it), so maybe take only a few bytes around the hooking address into consideration? Another possible application of this would be to search for that hash in newer/other game versions, hoping that the address just shifted a bit but the code remained the same. I guess the hashing would have to happen with the .exe running (for packed/encrypted files, and also to find/calculate the address correctly?), but I have no idea how easy or hard that would be. I think Setx was considering some sort of hashing in combination with offering a way to patch files (which might be worth a suggestion as well), but he seems to be busy otherwise these days. For easy redistribution maybe a fake AGTH command?

                        Client/Server communication between two pcs: Rather of minor importance and agth already offers something into that direction, but to be honest I never got that working (it requires some windows services running, something plain tcp/udp-ip based would be better imo). For games that run in fullscreen a way to send the text to a different pc (or just from the virtual machine to the host). Not sure about the best visual presentation, maybe additional windows for "send to different pc" and "receive text from other pc"?


                        • Even number of subwindows means when you're dragging onto a master window with an even number of windows already inside it. There are n+1 drop locations, but only n places to drop (Same is true when there are an odd number, but I just split the hotspot for the first window in two, which is pretty straightforward)

                          Problem was the tolower() in ntdll.dll doesn't like Japanese at all. As the debug version doesn't use ntdll.dll, never ran into the problem. Anyhow, will be fixed in next release. Until then, Hiragana/Katakana conversions don't work.
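For reference, the conversion itself is a fixed code point shift: in Unicode, each katakana sits 0x60 above its hiragana counterpart. The sketch below is a generic illustration of that shift, not TA's code, and the function names are made up. Incidentally, ヨ (U+30E8) and ヮ (U+30EE) are exactly 0x20 above ト (U+30C8) and ノ (U+30CE), which is consistent with a byte-oriented lowercase step being applied after the conversion.

```python
KANA_OFFSET = 0x60                           # distance between the hiragana and katakana blocks
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096   # ぁ .. ゖ
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6   # ァ .. ヶ


def to_katakana(text: str) -> str:
    """Shift hiragana up into the katakana block; leave everything else alone."""
    return "".join(
        chr(ord(c) + KANA_OFFSET) if HIRAGANA_FIRST <= ord(c) <= HIRAGANA_LAST else c
        for c in text
    )


def to_hiragana(text: str) -> str:
    """Shift katakana down into the hiragana block; leave everything else alone."""
    return "".join(
        chr(ord(c) - KANA_OFFSET) if KATAKANA_FIRST <= ord(c) <= KATAKANA_LAST else c
        for c in text
    )


print(to_katakana("との"), to_hiragana("トノ"))
```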

                          Mapping AGTH commands to addresses sounds painful. I've never come across any specification for Setx's format, either.

                          It wouldn't be too hard to share data across computers, but I believe there's already clipboard sharing software out there. Not sure there's anything useful that I could do that simple clipboard mirroring software could not do, in combination with TA.
                          Last edited by ScumSuckingPig; 07-02-2009, 02:09 PM.


                          • Originally posted by ScumSuckingPig View Post
                            Mapping AGTH commands to addresses sounds painful. I've never come across any specification for Setx's format, either.
                            If you mean the hashing spec, there was never any posted nor do I know whether setx went beyond the idea itself. I'm not sure how you mean the comment about mapping the agth commands. What I was thinking about is basically something like /TAA123458949 with 123458949 being the bytes or a hash starting at the hooking address. Or is finding a certain address in a running app the problem? I guess if you don't like hijacking the agth command you could just do it as an extra mini app inside taa as 'verify/calc hash' menu point, which shows the running process list and a field to enter the hash (to verify) or agth code (to calc)?

                            Originally posted by ScumSuckingPig View Post
                            It wouldn't be too hard to share data across computers, but I believe there's already clipboard sharing software out there. Not sure there's anything useful that I could do that simple clipboard mirroring software could not do, in combination with TA.
                            Yeah, there is probably some sort of software like this out there - would be convenient to have it in one place though. I do think it's not that exotic to have, but like I said I don't think it's that important either.


                            • Ahh...If people just enter the address manually, rather than parsing AGTH's junk, wouldn't be too hard. Wonder how many people would actually use something like that.


                              • Originally posted by ScumSuckingPig View Post
                                Ahh...If people just enter the address manually, rather than parsing AGTH's junk, wouldn't be too hard. Wonder how many people would actually use something like that.
                                Is parsing the address out of the agth code really that hard (the number following @)? The specs for that are visible in the agth help. Well ok, if a module is involved that would mean some extra work, especially with name and ordinal support. As for using it, at least I would be posting my codes with that hash (and if I post it as pseudo agth code, everybody would probably copy that around as well). I tried to use hashes in the past, using md5 for the entire .exe file, but that had the earlier mentioned problems and it was time-consuming to describe to a random person how to use it (plus the program I first used for it was bugged). If you would also offer a search option (e.g. if verify fails, ask to try to search for it in the entire file), that would help in some games where the position really only shifts to some different place (like it was just the case in the agth&tutorial thread and also for reallive games).

                                Edit2: I just noticed one problem, if agth is already hooked that would create a different hash than without. Not sure how to handle that situation best, guess easiest would be to say to use that function with agth running.
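The verify/search idea discussed in these posts could look roughly like the sketch below: hash a short window of bytes at the hook address, check it against a posted hash, and scan for the same window if it has moved. Everything here is hypothetical (the function names, the 16-byte window, SHA-1), and it works on a plain byte buffer; reading a running process's memory, packed executables, and bytes already changed by an installed hook are not handled.

```python
import hashlib


def window_hash(data: bytes, offset: int, length: int = 16) -> str:
    """Hash `length` bytes starting at `offset` (window size is an arbitrary choice)."""
    window = data[offset:offset + length]
    if offset < 0 or len(window) < length:
        raise ValueError("window runs past the end of the data")
    return hashlib.sha1(window).hexdigest()


def verify(data: bytes, offset: int, expected: str, length: int = 16) -> bool:
    """True if the bytes at `offset` still match the posted hash."""
    try:
        return window_hash(data, offset, length) == expected
    except ValueError:
        return False


def search(data: bytes, expected: str, length: int = 16) -> list:
    """Brute-force scan for every offset whose window matches the hash."""
    return [i for i in range(len(data) - length + 1)
            if window_hash(data, i, length) == expected]


old_build = bytes(range(64))              # stand-in for the original executable
posted = window_hash(old_build, 20)       # hash published alongside a hook code
new_build = b"\x90" * 5 + old_build       # same code, shifted by 5 bytes
print(verify(new_build, 20, posted), search(new_build, posted))
```

In the example, verification fails at the old offset and the scan finds the shifted location, which is the "address moved but code stayed the same" case described above. The scan is linear in the file size and hashes every window, so it is only meant to show the idea.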
                                Last edited by Freaka; 07-02-2009, 03:02 PM.
